Sunday, August 28, 2016

Run PostgreSQL on Docker

Host and Containers

  • Host: Ubuntu 14.04
    • Add <username> to docker group so sudo is not required to run docker command
      sudo usermod -a -G docker username
    • Create two directories below. They will be mounted to containers to make DB and development files persist outside the containers
      /opt/db/ws
      /opt/db/postgresql/data


  • Container: postgresql : https://hub.docker.com/_/postgres/
    • This image includes EXPOSE 5432 (the postgres port), so standard container linking will make it automatically available to the linked containers. The default postgres user and database are created in the entry point with initdb.
    • The default data directory for PostgreSQL is /var/lib/postgresql/data

Run PostgreSQL Containers for Server and Client

  1. ssh <host>
  2. docker run --name pgs-server -v /opt/db/ws:/opt/db/ws -v /opt/db/postgresql/data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -d postgres 

    run
    start a new container using the requested image

    --name pgs-server
    Set container name to pgs-server.

    -v /opt/db/postgresql/data:/var/lib/postgresql/data
    -v /opt/db/ws:/opt/db/ws
    Mount host directory /opt/db/postgresql as /var/lib/postgresql on the container so DB files persist on the hard disk  
    Mount host directory /opt/db/ws as /opt/db/ws to be used for development.

    -e POSTGRES_PASSWORD=pgsdev
    Set up the POSTGRES_PASSWORD environment variable, which sets the master PostgreSQL password

    -d
    Run the container in the background (daemon mode). It will stay alive until it is removed

    postgres
    Runs postgres docker image

  3. Optional: Start a bash shell on the pgs-server container for miscellaneous tasks
    docker exec -it pgs-server bash

  4. Start a PostgreSQL client container
    docker run --name pgs-client -v /opt/db/ws:/opt/db/ws -it --rm --link pgs-server:postgres postgres psql -h postgres -U postgres

    run
    Start a new container using the requested image

    --name pgs-client
    Set container name to pgs-client. If this option is not provided, docker will assign a random name.

    -v /opt/db/ws:/opt/db/ws
    Mount host directory /opt/db/ws as /opt/db/ws to be used for development.

    -it
    Run the container in interactive mode

    --rm
    Automatically cleanup the container after exit to avoid zombie containers
    -link pgs-server:postgres
    connects to the pgs-server container from the pgs-cleint container

    postgres
    Use postgres Docker image

    psql -h postgres -U postgres
    at the end tells Docker what command to execute when the container starts. In this case, start the interactive postgres terminal

  5. Optional. Start a bash shell on the 2nd container for miscellaneous tasks
    $docker exec -it pgs-client bash

TPC-H Queries on PostgreSQL

Download TPC-H Tool

Download the source code for TPC-H tool from http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp
The instructions below are based on TPC-H tool 2.17.1 and Ubuntu LTS 14.04

Build TPC-H Tool

  • $mkdir /opt/db/tpch-tool
  • Save the TCPH Tool zip to /opt/db/tpch-tool directory as tpc-h-tool_2_17_0.zip
  • $unzip tpc-h-tool_2_17_0.zip
  • $cd tpch_2_17_0/dbgen
  • $cp makefile.suite Makefile
  • Update the Makefile with the lines below
    CC=gcc
    DATABASE=ORACLE
    MACHINE=LINUX
    WORKLOAD=TPCH
  • $make
    This will build dbgen (database generation) and qgen (query generation) tools
  • Add the build directory to PATH environment variable so they can be executed without full path

Generate TPC-H Postgres DB

    • Generate TPC-H tables with scale factor 1 (1GB)
      $dbgen -s 1
      Fri Aug 26 12:28:15 PDT 2016
      TPC-H Population Generator (Version 2.17.0)
      Copyright Transaction Processing Performance Council 1994 - 2010

    • Below are the tables generated and their sizes
      $ls -l *.tbl
      -rw-r--r-- 1 use grp  24346144 Aug 26 12:28 customer.tbl
      -rw-r--r-- 1 use grp 759863287 Aug 26 12:28 lineitem.tbl
      -rw-r--r-- 1 use grp      2224 Aug 26 12:28 nation.tbl
      -rw-r--r-- 1 use grp 171952161 Aug 26 12:28 orders.tbl
      -rw-r--r-- 1 use grp 118984616 Aug 26 12:28 partsupp.tbl
      -rw-r--r-- 1 use grp  24135125 Aug 26 12:28 part.tbl
      -rw-r--r-- 1 use grp       389 Aug 26 12:28 region.tbl
      -rw-r--r-- 1 use grp   1409184 Aug 26 12:28 supplier.tbl

    • Below is the TPC-H Schema (http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf)

    • The .tbl files have a "|" character at the end of each line, which is not compatible with postgresql and causes the error below during data import:
      ERROR:  extra data after last expected column
      CONTEXT:  COPY part, line 1: "1|goldenrod lavender spring chocolate lace|Manufacturer#1|Brand#13|PROMO BURNISHED COPPER|7|JUMBO PK..."
    • Run the command below to remove the last | on each line and save the result to .csv files
      for i in `ls *.tbl`; do sed 's/|$//' $i > ${i/tbl/csv}; echo $i; done;
    • Start postgresql server and client containers.  Check Run PostgreSQL with Docker
    • Create TPCH DB by running the commands below from psql terminal on postgresal client container:
      postgres=# create database tpch;
      CREATE DATABASE
      postgres=# \c tpch;
      You are now connected to database "tpch" as user "postgres".
      tpch=# \i /opt/db/ws/tpch/sql/tpch-build-db.sql
      BEGIN
      CREATE TABLE
      COPY 200000
      COMMIT
      BEGIN
      CREATE TABLE
      COPY 5
      COMMIT
      BEGIN
      CREATE TABLE
      COPY 25
      COMMIT
      BEGIN
      CREATE TABLE
      COPY 10000
      COMMIT
      BEGIN
      CREATE TABLE
      COPY 150000
      COMMIT
      BEGIN
      CREATE TABLE
      COPY 800000
      COMMIT
      BEGIN
      CREATE TABLE
      COPY 1500000
      COMMIT
      BEGIN
      CREATE TABLE
      COPY 6001215
      COMMIT
      ALTER TABLE
      ALTER TABLE
      ALTER TABLE
      ALTER TABLE
      ALTER TABLE
      ALTER TABLE
      ALTER TABLE
      ALTER TABLE
      tpch=# \d+
                                      List of relations
       Schema |          Name          |   Type   |  Owner   |    Size    | Description 
      --------+------------------------+----------+----------+------------+-------------
       public | customer               | table    | postgres | 29 MB      | 
       public | customer_c_custkey_seq | sequence | postgres | 8192 bytes | 
       public | lineitem               | table    | postgres | 950 MB     | 
       public | nation                 | table    | postgres | 8192 bytes | 
       public | nation_n_nationkey_seq | sequence | postgres | 8192 bytes | 
       public | orders                 | table    | postgres | 216 MB     | 
       public | orders_o_orderkey_seq  | sequence | postgres | 8192 bytes | 
       public | part                   | table    | postgres | 32 MB      | 
       public | part_p_partkey_seq     | sequence | postgres | 8192 bytes | 
       public | partsupp               | table    | postgres | 143 MB     | 
       public | region                 | table    | postgres | 8192 bytes | 
       public | region_r_regionkey_seq | sequence | postgres | 8192 bytes | 
       public | supplier               | table    | postgres | 1848 kB    | 
       public | supplier_s_suppkey_seq | sequence | postgres | 8192 bytes | 
      (14 rows)

    Generate TPC-H Queries 

    • cd /opt/db/ws/tpch/sql
    • cp /opt/db/tpch-tool/tpch_2_17_0/dbgen/dists.dss .
    • Run the command below to generate all 22 TPC-H queries in separate .sql file. Note that DSS_QUERY environment variable in the command line is required for qgen to find query templates
        $for q in `seq 1 22`;do DSS_QUERY=/opt/db/tpch-tool/tpch_2_17_0/dbgen/queries qgen $q > $q.sql;done;
    • The tpc-h query templates don't work with PostgreSQL out of the box. Below is the result when running 1.sql in psql terminal:

      psql:/opt/db/ws/tpch/sql/1.sql:49: ERROR:  syntax error at or near "("
      LINE 15:  l_shipdate <= date '1998-12-01' - interval '97' day (3)
                                                                    ^
      Time: 0.227 ms
      psql:/opt/db/ws/tpch/sql/1.sql:50: ERROR:  syntax error at or near "where"
      LINE 1: where rownum <= -1;

    • Below is the result after the errors are fixed 1.sql
    •  l_returnflag | l_linestatus | sum_qty  | sum_base_price |  sum_disc_price  |     sum_charge     |       avg_qty       |     avg_price      |        avg_disc
             | count_order
      --------------+--------------+----------+----------------+------------------+--------------------+---------------------+--------------------+-----------------
      -------+-------------
       A            | F            | 37734107 | 56586554400.73 | 53758257134.8700 | 55909065222.827692 | 25.5220058532573370 | 38273.129734621672 | 0.04998529583839
      761162 |     1478493
      (1 row)

      Time: 11534.090 ms


    • The tpch-postgres.7z has all 22 queries that have been modified to run in PostgreSQL. 
      • "where rownum <=" statement is replaced with "limit" statement
      • Correlated aggregate subqueries with huge outer tables is replaced with JOIN (query 17, 20)

    TPC-H Benchmark Result

    Below is the TPC-H Benchmark Result on a Dell Precision T3610 Workstation (CPU @3.5GHz with 64GB RAM) and PostgreSQL 9.5.4 from the latest postgres Docker image (as of August 2016).


    TCP-H Query SF=1 (ms) Table Header
    1 11534.090 Table Cell
    2 398.402 Table Cell
    3 1085.500 Table Cell
    4 551.310 Table Cell
    5 650.774 Table Cell
    6 1304.219 Table Cell
    7 965.973 Table Cell
    8 1836.722 Table Cell
    9 5330.943 Table Cell
    10 1801.567 Table Cell
    11 280.123 Table Cell
    12 1889.902 Table Cell
    13 1775.164 Table Cell
    14 1339.493 Table Cell
    15 0.192 Table Cell
    16 1198.179 Table Cell
    17 4607.648 Table Cell
    18 4161.339 Table Cell
    19 1673.113 Table Cell
    20 2308.152 Table Cell
    21 2130.221 Table Cell
    22 585.929 Table Cell