Wednesday, September 7, 2016

Use PGStrom with PostgreSQL on Ubuntu 14.04 LTS

Software Installation

Install useful utilities
sudo apt-get install wget git rpm 

Install Cuda
  • Download Cuda local install package from https://developer.nvidia.com/cuda-downloads

  • Run the commands below to install Cuda and reboot
  • $sudo dpkg -i cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
    $sudo apt-get update
    $sudo apt-get install cuda
    

  • Create/Edit /etc/ld.so.conf.d/cuda-lib64.conf with Cuda library path
    $echo /usr/local/cuda/lib64 > /etc/ld.so.conf.d/cuda-lib64.conf
    
  • Build the deviceQuery sample from Cuda installation and verify that the Cuda driver works
  • $cp -r /usr/local/cuda/samples .
    $cd samples/1_Utilities/deviceQuery 
    $make
    $./deviceQuery 
    ./deviceQuery Starting...
    
     CUDA Device Query (Runtime API) version (CUDART static linking)
    
    Detected 1 CUDA Capable device(s)
    
    Device 0: "Quadro K1000M"
      CUDA Driver Version / Runtime Version          7.5 / 7.5
      CUDA Capability Major/Minor version number:    3.0
      Total amount of global memory:                 2047 MBytes (2146762752 bytes)
      ( 1) Multiprocessors, (192) CUDA Cores/MP:     192 CUDA Cores
      GPU Max Clock rate:                            851 MHz (0.85 GHz)
      Memory Clock rate:                             900 Mhz
      Memory Bus Width:                              128-bit
      L2 Cache Size:                                 262144 bytes
      Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
      Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 65536
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  2048
      Maximum number of threads per block:           1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
      Run time limit on kernels:                     Yes
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Disabled
      Device supports Unified Addressing (UVA):      Yes
      Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    
    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = Quadro K1000M
    Result = PASS
Install PostgreSQL
  • Import PostgreSQL repository key
    $wget -O - http://apt.postgresql.org/pub/repos/apt/ACCC4CF8.asc | sudo apt-key add -
    

  • Create or edit /etc/apt/sources.list.d/pgdg.list with the line below
    deb http://apt.postgresql.org/pub/repos/apt/ trusty-pgdg main
    

  • Configure apt's package pinning to prefer the PGDG packages over the Debian ones in /etc/apt/preferences.d/pgdg.pref. this will replace all your Debian/Ubuntu packages with available packages from the PGDG repository.
    Package: *
    Pin: release o=apt.postgresql.org
    Pin-Priority: 500
    
  • Update the package lists, and install the pgdg-keyring package to automatically get repository key updates
    $sudo apt-get update
    $sudo apt-get install pgdg-keyring
    
  • Install PostgreSQL 9.5
    $sudo apt-get install -y postgresql-common
    $sudo apt-get install -y postgresql-9.5 postgresql-contrib-9.5 postgresql-server-dev-9.5 libpq-dev
    
  • The PostgreSQL server will automatically be started after installation. Run the commands below to manually start or restart the server if needed.
    $sudo service postgresql restart
    $sudo service postgresql start
    
  • Run the command below to start psql terminal and connect to the server
    $sudo -u postgre psql
    

Install PGStrom [PGStrom Wiki]
  • Download and install PGStrom
    $git clone https://github.com/pg-strom/devel pg_strom
    $cd pg_strom
    $which pg_config
    /usr/local/pgsql/bin/pg_config
    $make
    $sudo make install
    
  • Edit /etc/postgresql/9.5/main/postgresql.conf to update configuration for PGStrom
    • Add /usr/lib/postgresql/9.5/lib/pg_strom.so to shared_preload_libraries
  • Restart PostgreSQL service
    $sudo service postgresql restart
    
  • PostgreSQL log file /var/log/postgresql/postgresql-9.5-main.log shows PGStrom is loaded with supported GPU device
    $ cat /var/log/postgresql/postgresql-9.5-main.log 
    LOG:  PG-Strom version 1.0devel built for PostgreSQL 9.5
    LOG:  CUDA Runtime version: 7.5.0
    LOG:  NVIDIA driver version: 352.63
    LOG:  GPU0 Quadro K1000M (192 CUDA cores, 850MHz), L2 256KB, RAM 2047MB (128bits, 900MHz), capability 3.0
    LOG:  NVRTC - CUDA Runtime Compilation vertion 7.5
    

Test PGStrom Integration with PostgreSQL

  • Start psql terminal
    sudo -u postgres psql
    
  • Create pg_strom extension
    postgres=#CREATE EXTENSION pg_strom;
    CREATE EXTENSION
    
  • Create a test DB and connect to it
    postgres=# create database testdb;
    CREATE DATABASE
    postgres=# \c testdb
    You are now connected to database "testdb" as user "postgres".
    
  • Create test tables from SQL script in PGStrom installation. It creates one 100M rows table and 9 of 100K rows table.
    postgres=# \i pgstrom/test/testdb.sql
    
  • Check pg_strom parameters used by postgresql:
    postgres=# show all;
    pg_strom.bulkexec                   | on                                       | Enables the bulk-execution mode of PG-Strom
    pg_strom.chunk_limit                | 79360kB                                  | limit size of pgstrom_data_store
    pg_strom.chunk_size                 | 15872kB                                  | default size of pgstrom_data_store
    pg_strom.chunk_size_margin          | 1.25                                     | margin of chunk size if not predictable exactly
    pg_strom.cpu_fallback               | on                                       | Enables CPU fallback if GPU is 
    pg_strom.cuda_visible_devices       |                                          | CUDA_VISIBLE_DEVICES of CUDA runtime
    pg_strom.debug_cuda_coredump        | off                                      | Turn on/off GPU coredump feature
    pg_strom.debug_force_gpupreagg      | off                                      | Force GpuPreAgg regardless of the cost (debug)
    pg_strom.debug_force_gpusort        | off                                      | Force GpuSort regardless of the cost (debug)
    pg_strom.debug_kernel_source        | off                                      | Turn on/off to display the kernel source path
    pg_strom.enable_gpuhashjoin         | on                                       | Enables the use of GpuHashJoin logic
    pg_strom.enable_gpunestloop         | on                                       | Enables the use of GpuNestLoop logic
    pg_strom.enable_gpupreagg           | off                                      | Enables the use of GPU preprocessed aggregate
    pg_strom.enable_gpuscan             | on                                       | Enables the use of GPU accelerated full-scan
    pg_strom.enable_gpusort             | off                                      | Enables the use of GPU accelerated sorting
    pg_strom.enabled                    | off                                      | Enables the planner's use of PG-Strom
    pg_strom.gpu_dma_cost               | 10                                       | Cost to send/recv data via DMA
    pg_strom.gpu_operator_cost          | 7.8125e-05                               | Cost of processing each operators by GPU
    pg_strom.gpu_setup_cost             | 4000                                     | Cost to setup GPU device to run
    pg_strom.gpu_tuple_cost             | 0.0003125                                | Cost of processing each tuple for GPU
    pg_strom.max_async_tasks            | 32                                       | max number of GPU tasks to be run asynchronously
    pg_strom.num_threads_margin         | 1.1                                      | margin of number of CUDA threads if not predictable exactly
    pg_strom.perfmon                    | off                                      | Enables the performance monitor of PG-Strom
    pg_strom.program_cache_size         | 48MB                                     | size of shared program cache
    pg_strom.pullup_outer_scan          | on                                       | Enables to pull up simple outer scan
    
  • Review the execution plan of a query and confirm GPU offload is enabled (GpuPreAgg, GpuJoin, etc)
    testdb=# EXPLAIN                 
    SELECT cat, avg(ax) FROM t0 NATURAL JOIN t1 NATURAL JOIN t2 GROUP BY cat;
                                             QUERY PLAN                                          
    ---------------------------------------------------------------------------------------------
     HashAggregate  (cost=3349749.08..3349749.40 rows=26 width=12)
       Group Key: t0.cat
       ->  Custom Scan (GpuPreAgg)  (cost=16325.86..2879097.75 rows=234 width=44)
             Reduction: Local + Global
             GPU Projection: cat, ax
             ->  Custom Scan (GpuJoin) on t0  (cost=12325.86..2851751.79 rows=99599458 width=12)
                   GPU Projection: t0.cat, t1.ax
                   Depth 1: GpuHashJoin, HashKeys: (t0.aid)
                            JoinQuals: (t0.aid = t1.aid)
                            Nrows (in/out: 99.60%), KDS-Hash (size: 13.47MB, nbatches: 1)
                   Depth 2: GpuHashJoin, HashKeys: (t0.bid)
                            JoinQuals: (t0.bid = t2.bid)
                            Nrows (in/out: 100.00%), KDS-Hash (size: 13.47MB, nbatches: 1)
                   ->  Seq Scan on t1  (cost=0.00..1935.00 rows=100000 width=12)
                   ->  Seq Scan on t2  (cost=0.00..1935.00 rows=100000 width=4)
    (15 rows)
    
  • Turn off pgstrom and review the same execution plan (default join, hash, etc).
    testdb=# set pg_strom.enabled=off; 
    SET
    testdb=# EXPLAIN                   
    SELECT cat, avg(ax) FROM t0 NATURAL JOIN t1 NATURAL JOIN t2 GROUP BY cat;
                                        QUERY PLAN                                    
    ----------------------------------------------------------------------------------
     HashAggregate  (cost=6843172.91..6843173.24 rows=26 width=12)
       Group Key: t0.cat
       ->  Hash Join  (cost=7250.00..6345175.62 rows=99599458 width=12)
             Hash Cond: (t0.aid = t1.aid)
             ->  Hash Join  (cost=3576.00..4188792.08 rows=99997856 width=8)
                   Hash Cond: (t0.bid = t2.bid)
                   ->  Seq Scan on t0  (cost=0.00..1833312.56 rows=99997856 width=12)
                   ->  Hash  (cost=1935.00..1935.00 rows=100000 width=4)
                         ->  Seq Scan on t2  (cost=0.00..1935.00 rows=100000 width=4)
             ->  Hash  (cost=1935.00..1935.00 rows=100000 width=12)
                   ->  Seq Scan on t1  (cost=0.00..1935.00 rows=100000 width=12)
    (11 rows)