Jim Wu's FPGA Blog: September 2016

Software Installation

Install useful utilities
sudo apt-get install wget git rpm

Install Cuda

Download Cuda local install package from https://developer.nvidia.com/cuda-downloads

Run the commands below to install Cuda and reboot

$sudo dpkg -i cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
$sudo apt-get update
$sudo apt-get install cuda

Create/Edit /etc/ld.so.conf.d/cuda-lib64.conf with Cuda library path
```
$echo /usr/local/cuda/lib64 > /etc/ld.so.conf.d/cuda-lib64.conf
```
Build the deviceQuery sample from Cuda installation and verify that the Cuda driver works

$cp -r /usr/local/cuda/samples .
$cd samples/1_Utilities/deviceQuery 
$make
$./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Quadro K1000M"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 2047 MBytes (2146762752 bytes)
  ( 1) Multiprocessors, (192) CUDA Cores/MP:     192 CUDA Cores
  GPU Max Clock rate:                            851 MHz (0.85 GHz)
  Memory Clock rate:                             900 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = Quadro K1000M
Result = PASS

Install PostgreSQL

Import PostgreSQL repository key

$wget -O - http://apt.postgresql.org/pub/repos/apt/ACCC4CF8.asc | sudo apt-key add -

Create or edit /etc/apt/sources.list.d/pgdg.list with the line below
```
deb http://apt.postgresql.org/pub/repos/apt/ trusty-pgdg main
```

Configure apt's package pinning to prefer the PGDG packages over the Debian ones in /etc/apt/preferences.d/pgdg.pref. this will replace all your Debian/Ubuntu packages with available packages from the PGDG repository.
```
Package: *
Pin: release o=apt.postgresql.org
Pin-Priority: 500
```
Update the package lists, and install the pgdg-keyring package to automatically get repository key updates
```
$sudo apt-get update
$sudo apt-get install pgdg-keyring
```

Install PostgreSQL 9.5

$sudo apt-get install -y postgresql-common
$sudo apt-get install -y postgresql-9.5 postgresql-contrib-9.5 postgresql-server-dev-9.5 libpq-dev

The PostgreSQL server will automatically be started after installation. Run the commands below to manually start or restart the server if needed.
```
$sudo service postgresql restart
$sudo service postgresql start
```
Run the command below to start psql terminal and connect to the server
```
$sudo -u postgre psql
```

Install PGStrom [PGStrom Wiki]

Download and install PGStrom

$git clone https://github.com/pg-strom/devel pg_strom
$cd pg_strom
$which pg_config
/usr/local/pgsql/bin/pg_config
$make
$sudo make install

Edit /etc/postgresql/9.5/main/postgresql.conf to update configuration for PGStrom
- Add /usr/lib/postgresql/9.5/lib/pg_strom.so to shared_preload_libraries
Restart PostgreSQL service
```
$sudo service postgresql restart
```

PostgreSQL log file /var/log/postgresql/postgresql-9.5-main.log shows PGStrom is loaded with supported GPU device

$ cat /var/log/postgresql/postgresql-9.5-main.log 
LOG:  PG-Strom version 1.0devel built for PostgreSQL 9.5
LOG:  CUDA Runtime version: 7.5.0
LOG:  NVIDIA driver version: 352.63
LOG:  GPU0 Quadro K1000M (192 CUDA cores, 850MHz), L2 256KB, RAM 2047MB (128bits, 900MHz), capability 3.0
LOG:  NVRTC - CUDA Runtime Compilation vertion 7.5

Test PGStrom Integration with PostgreSQL

Start psql terminal
```
sudo -u postgres psql
```

Create pg_strom extension

postgres=#CREATE EXTENSION pg_strom;
CREATE EXTENSION

Create a test DB and connect to it

postgres=# create database testdb;
CREATE DATABASE
postgres=# \c testdb
You are now connected to database "testdb" as user "postgres".

Create test tables from SQL script in PGStrom installation. It creates one 100M rows table and 9 of 100K rows table.
```
postgres=# \i pgstrom/test/testdb.sql
```

Check pg_strom parameters used by postgresql:

postgres=# show all;
pg_strom.bulkexec                   | on                                       | Enables the bulk-execution mode of PG-Strom
pg_strom.chunk_limit                | 79360kB                                  | limit size of pgstrom_data_store
pg_strom.chunk_size                 | 15872kB                                  | default size of pgstrom_data_store
pg_strom.chunk_size_margin          | 1.25                                     | margin of chunk size if not predictable exactly
pg_strom.cpu_fallback               | on                                       | Enables CPU fallback if GPU is 
pg_strom.cuda_visible_devices       |                                          | CUDA_VISIBLE_DEVICES of CUDA runtime
pg_strom.debug_cuda_coredump        | off                                      | Turn on/off GPU coredump feature
pg_strom.debug_force_gpupreagg      | off                                      | Force GpuPreAgg regardless of the cost (debug)
pg_strom.debug_force_gpusort        | off                                      | Force GpuSort regardless of the cost (debug)
pg_strom.debug_kernel_source        | off                                      | Turn on/off to display the kernel source path
pg_strom.enable_gpuhashjoin         | on                                       | Enables the use of GpuHashJoin logic
pg_strom.enable_gpunestloop         | on                                       | Enables the use of GpuNestLoop logic
pg_strom.enable_gpupreagg           | off                                      | Enables the use of GPU preprocessed aggregate
pg_strom.enable_gpuscan             | on                                       | Enables the use of GPU accelerated full-scan
pg_strom.enable_gpusort             | off                                      | Enables the use of GPU accelerated sorting
pg_strom.enabled                    | off                                      | Enables the planner's use of PG-Strom
pg_strom.gpu_dma_cost               | 10                                       | Cost to send/recv data via DMA
pg_strom.gpu_operator_cost          | 7.8125e-05                               | Cost of processing each operators by GPU
pg_strom.gpu_setup_cost             | 4000                                     | Cost to setup GPU device to run
pg_strom.gpu_tuple_cost             | 0.0003125                                | Cost of processing each tuple for GPU
pg_strom.max_async_tasks            | 32                                       | max number of GPU tasks to be run asynchronously
pg_strom.num_threads_margin         | 1.1                                      | margin of number of CUDA threads if not predictable exactly
pg_strom.perfmon                    | off                                      | Enables the performance monitor of PG-Strom
pg_strom.program_cache_size         | 48MB                                     | size of shared program cache
pg_strom.pullup_outer_scan          | on                                       | Enables to pull up simple outer scan

Review the execution plan of a query and confirm GPU offload is enabled (GpuPreAgg, GpuJoin, etc)

testdb=# EXPLAIN                 
SELECT cat, avg(ax) FROM t0 NATURAL JOIN t1 NATURAL JOIN t2 GROUP BY cat;
                                         QUERY PLAN                                          
---------------------------------------------------------------------------------------------
 HashAggregate  (cost=3349749.08..3349749.40 rows=26 width=12)
   Group Key: t0.cat
   ->  Custom Scan (GpuPreAgg)  (cost=16325.86..2879097.75 rows=234 width=44)
         Reduction: Local + Global
         GPU Projection: cat, ax
         ->  Custom Scan (GpuJoin) on t0  (cost=12325.86..2851751.79 rows=99599458 width=12)
               GPU Projection: t0.cat, t1.ax
               Depth 1: GpuHashJoin, HashKeys: (t0.aid)
                        JoinQuals: (t0.aid = t1.aid)
                        Nrows (in/out: 99.60%), KDS-Hash (size: 13.47MB, nbatches: 1)
               Depth 2: GpuHashJoin, HashKeys: (t0.bid)
                        JoinQuals: (t0.bid = t2.bid)
                        Nrows (in/out: 100.00%), KDS-Hash (size: 13.47MB, nbatches: 1)
               ->  Seq Scan on t1  (cost=0.00..1935.00 rows=100000 width=12)
               ->  Seq Scan on t2  (cost=0.00..1935.00 rows=100000 width=4)
(15 rows)

Turn off pgstrom and review the same execution plan (default join, hash, etc).

testdb=# set pg_strom.enabled=off; 
SET
testdb=# EXPLAIN                   
SELECT cat, avg(ax) FROM t0 NATURAL JOIN t1 NATURAL JOIN t2 GROUP BY cat;
                                    QUERY PLAN                                    
----------------------------------------------------------------------------------
 HashAggregate  (cost=6843172.91..6843173.24 rows=26 width=12)
   Group Key: t0.cat
   ->  Hash Join  (cost=7250.00..6345175.62 rows=99599458 width=12)
         Hash Cond: (t0.aid = t1.aid)
         ->  Hash Join  (cost=3576.00..4188792.08 rows=99997856 width=8)
               Hash Cond: (t0.bid = t2.bid)
               ->  Seq Scan on t0  (cost=0.00..1833312.56 rows=99997856 width=12)
               ->  Hash  (cost=1935.00..1935.00 rows=100000 width=4)
                     ->  Seq Scan on t2  (cost=0.00..1935.00 rows=100000 width=4)
         ->  Hash  (cost=1935.00..1935.00 rows=100000 width=12)
               ->  Seq Scan on t1  (cost=0.00..1935.00 rows=100000 width=12)
(11 rows)

Wednesday, September 7, 2016

Use PGStrom with PostgreSQL on Ubuntu 14.04 LTS

Software Installation

Test PGStrom Integration with PostgreSQL