Software Installation
Install useful utilities
sudo apt-get install wget git rpm
Install Cuda
$sudo dpkg -i cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
$sudo apt-get update
$sudo apt-get install cuda
Create/Edit /etc/ld.so.conf.d/cuda-lib64.conf with Cuda library path
$echo /usr/local/cuda/lib64 > /etc/ld.so.conf.d/cuda-lib64.conf
Build the deviceQuery sample from Cuda installation and verify that the Cuda driver works
$cp -r /usr/local/cuda/samples .
$cd samples/1_Utilities/deviceQuery
$make
$./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Quadro K1000M"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2047 MBytes (2146762752 bytes)
( 1) Multiprocessors, (192) CUDA Cores/MP: 192 CUDA Cores
GPU Max Clock rate: 851 MHz (0.85 GHz)
Memory Clock rate: 900 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = Quadro K1000M
Result = PASS
Install PostgreSQL
- Import PostgreSQL repository key
$wget -O - http://apt.postgresql.org/pub/repos/apt/ACCC4CF8.asc | sudo apt-key add -
- Create or edit /etc/apt/sources.list.d/pgdg.list with the line below
deb http://apt.postgresql.org/pub/repos/apt/ trusty-pgdg main
- Configure apt's package pinning to prefer the PGDG packages over the Debian ones in /etc/apt/preferences.d/pgdg.pref. this will replace all your Debian/Ubuntu packages with available packages from the PGDG repository.
Package: *
Pin: release o=apt.postgresql.org
Pin-Priority: 500
- Update the package lists, and install the pgdg-keyring package to automatically get repository key updates
$sudo apt-get update
$sudo apt-get install pgdg-keyring
- Install PostgreSQL 9.5
$sudo apt-get install -y postgresql-common
$sudo apt-get install -y postgresql-9.5 postgresql-contrib-9.5 postgresql-server-dev-9.5 libpq-dev
- The PostgreSQL server will automatically be started after installation. Run the commands below to manually start or restart the server if needed.
$sudo service postgresql restart
$sudo service postgresql start
- Run the command below to start psql terminal and connect to the server
$sudo -u postgre psql
Install PGStrom [PGStrom Wiki]
- Download and install PGStrom
$git clone https://github.com/pg-strom/devel pg_strom
$cd pg_strom
$which pg_config
/usr/local/pgsql/bin/pg_config
$make
$sudo make install
- Edit /etc/postgresql/9.5/main/postgresql.conf to update configuration for PGStrom
- Add /usr/lib/postgresql/9.5/lib/pg_strom.so to shared_preload_libraries
- Restart PostgreSQL service
$sudo service postgresql restart
- PostgreSQL log file /var/log/postgresql/postgresql-9.5-main.log shows PGStrom is loaded with supported GPU device
$ cat /var/log/postgresql/postgresql-9.5-main.log
LOG: PG-Strom version 1.0devel built for PostgreSQL 9.5
LOG: CUDA Runtime version: 7.5.0
LOG: NVIDIA driver version: 352.63
LOG: GPU0 Quadro K1000M (192 CUDA cores, 850MHz), L2 256KB, RAM 2047MB (128bits, 900MHz), capability 3.0
LOG: NVRTC - CUDA Runtime Compilation vertion 7.5
Test PGStrom Integration with PostgreSQL
- Start psql terminal
sudo -u postgres psql
- Create pg_strom extension
postgres=#CREATE EXTENSION pg_strom;
CREATE EXTENSION
- Create a test DB and connect to it
postgres=# create database testdb;
CREATE DATABASE
postgres=# \c testdb
You are now connected to database "testdb" as user "postgres".
- Create test tables from SQL script in PGStrom installation. It creates one 100M rows table and 9 of 100K rows table.
postgres=# \i pgstrom/test/testdb.sql
- Check pg_strom parameters used by postgresql:
postgres=# show all;
pg_strom.bulkexec | on | Enables the bulk-execution mode of PG-Strom
pg_strom.chunk_limit | 79360kB | limit size of pgstrom_data_store
pg_strom.chunk_size | 15872kB | default size of pgstrom_data_store
pg_strom.chunk_size_margin | 1.25 | margin of chunk size if not predictable exactly
pg_strom.cpu_fallback | on | Enables CPU fallback if GPU is
pg_strom.cuda_visible_devices | | CUDA_VISIBLE_DEVICES of CUDA runtime
pg_strom.debug_cuda_coredump | off | Turn on/off GPU coredump feature
pg_strom.debug_force_gpupreagg | off | Force GpuPreAgg regardless of the cost (debug)
pg_strom.debug_force_gpusort | off | Force GpuSort regardless of the cost (debug)
pg_strom.debug_kernel_source | off | Turn on/off to display the kernel source path
pg_strom.enable_gpuhashjoin | on | Enables the use of GpuHashJoin logic
pg_strom.enable_gpunestloop | on | Enables the use of GpuNestLoop logic
pg_strom.enable_gpupreagg | off | Enables the use of GPU preprocessed aggregate
pg_strom.enable_gpuscan | on | Enables the use of GPU accelerated full-scan
pg_strom.enable_gpusort | off | Enables the use of GPU accelerated sorting
pg_strom.enabled | off | Enables the planner's use of PG-Strom
pg_strom.gpu_dma_cost | 10 | Cost to send/recv data via DMA
pg_strom.gpu_operator_cost | 7.8125e-05 | Cost of processing each operators by GPU
pg_strom.gpu_setup_cost | 4000 | Cost to setup GPU device to run
pg_strom.gpu_tuple_cost | 0.0003125 | Cost of processing each tuple for GPU
pg_strom.max_async_tasks | 32 | max number of GPU tasks to be run asynchronously
pg_strom.num_threads_margin | 1.1 | margin of number of CUDA threads if not predictable exactly
pg_strom.perfmon | off | Enables the performance monitor of PG-Strom
pg_strom.program_cache_size | 48MB | size of shared program cache
pg_strom.pullup_outer_scan | on | Enables to pull up simple outer scan
-
Review the execution plan of a query and confirm GPU offload is enabled (GpuPreAgg, GpuJoin, etc)
testdb=# EXPLAIN
SELECT cat, avg(ax) FROM t0 NATURAL JOIN t1 NATURAL JOIN t2 GROUP BY cat;
QUERY PLAN
---------------------------------------------------------------------------------------------
HashAggregate (cost=3349749.08..3349749.40 rows=26 width=12)
Group Key: t0.cat
-> Custom Scan (GpuPreAgg) (cost=16325.86..2879097.75 rows=234 width=44)
Reduction: Local + Global
GPU Projection: cat, ax
-> Custom Scan (GpuJoin) on t0 (cost=12325.86..2851751.79 rows=99599458 width=12)
GPU Projection: t0.cat, t1.ax
Depth 1: GpuHashJoin, HashKeys: (t0.aid)
JoinQuals: (t0.aid = t1.aid)
Nrows (in/out: 99.60%), KDS-Hash (size: 13.47MB, nbatches: 1)
Depth 2: GpuHashJoin, HashKeys: (t0.bid)
JoinQuals: (t0.bid = t2.bid)
Nrows (in/out: 100.00%), KDS-Hash (size: 13.47MB, nbatches: 1)
-> Seq Scan on t1 (cost=0.00..1935.00 rows=100000 width=12)
-> Seq Scan on t2 (cost=0.00..1935.00 rows=100000 width=4)
(15 rows)
-
Turn off pgstrom and review the same execution plan (default join, hash, etc).
testdb=# set pg_strom.enabled=off;
SET
testdb=# EXPLAIN
SELECT cat, avg(ax) FROM t0 NATURAL JOIN t1 NATURAL JOIN t2 GROUP BY cat;
QUERY PLAN
----------------------------------------------------------------------------------
HashAggregate (cost=6843172.91..6843173.24 rows=26 width=12)
Group Key: t0.cat
-> Hash Join (cost=7250.00..6345175.62 rows=99599458 width=12)
Hash Cond: (t0.aid = t1.aid)
-> Hash Join (cost=3576.00..4188792.08 rows=99997856 width=8)
Hash Cond: (t0.bid = t2.bid)
-> Seq Scan on t0 (cost=0.00..1833312.56 rows=99997856 width=12)
-> Hash (cost=1935.00..1935.00 rows=100000 width=4)
-> Seq Scan on t2 (cost=0.00..1935.00 rows=100000 width=4)
-> Hash (cost=1935.00..1935.00 rows=100000 width=12)
-> Seq Scan on t1 (cost=0.00..1935.00 rows=100000 width=12)
(11 rows)