Sunday, August 28, 2016

TPC-H Queries on PostgreSQL

Download TPC-H Tool

Download the source code for TPC-H tool from http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp
The instructions below are based on TPC-H tool 2.17.1 and Ubuntu LTS 14.04

Build TPC-H Tool

  • $mkdir /opt/db/tpch-tool
  • Save the TCPH Tool zip to /opt/db/tpch-tool directory as tpc-h-tool_2_17_0.zip
  • $unzip tpc-h-tool_2_17_0.zip
  • $cd tpch_2_17_0/dbgen
  • $cp makefile.suite Makefile
  • Update the Makefile with the lines below
    CC=gcc
    DATABASE=ORACLE
    MACHINE=LINUX
    WORKLOAD=TPCH
  • $make
    This will build dbgen (database generation) and qgen (query generation) tools
  • Add the build directory to PATH environment variable so they can be executed without full path

Generate TPC-H Postgres DB

    • Generate TPC-H tables with scale factor 1 (1GB)
      $dbgen -s 1
      Fri Aug 26 12:28:15 PDT 2016
      TPC-H Population Generator (Version 2.17.0)
      Copyright Transaction Processing Performance Council 1994 - 2010

    • Below are the tables generated and their sizes
      $ls -l *.tbl
      -rw-r--r-- 1 use grp  24346144 Aug 26 12:28 customer.tbl
      -rw-r--r-- 1 use grp 759863287 Aug 26 12:28 lineitem.tbl
      -rw-r--r-- 1 use grp      2224 Aug 26 12:28 nation.tbl
      -rw-r--r-- 1 use grp 171952161 Aug 26 12:28 orders.tbl
      -rw-r--r-- 1 use grp 118984616 Aug 26 12:28 partsupp.tbl
      -rw-r--r-- 1 use grp  24135125 Aug 26 12:28 part.tbl
      -rw-r--r-- 1 use grp       389 Aug 26 12:28 region.tbl
      -rw-r--r-- 1 use grp   1409184 Aug 26 12:28 supplier.tbl

    • Below is the TPC-H Schema (http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf)

    • The .tbl files have a "|" character at the end of each line, which is not compatible with postgresql and causes the error below during data import:
      ERROR:  extra data after last expected column
      CONTEXT:  COPY part, line 1: "1|goldenrod lavender spring chocolate lace|Manufacturer#1|Brand#13|PROMO BURNISHED COPPER|7|JUMBO PK..."
    • Run the command below to remove the last | on each line and save the result to .csv files
      for i in `ls *.tbl`; do sed 's/|$//' $i > ${i/tbl/csv}; echo $i; done;
    • Start postgresql server and client containers.  Check Run PostgreSQL with Docker
    • Create TPCH DB by running the commands below from psql terminal on postgresal client container:
      postgres=# create database tpch;
      CREATE DATABASE
      postgres=# \c tpch;
      You are now connected to database "tpch" as user "postgres".
      tpch=# \i /opt/db/ws/tpch/sql/tpch-build-db.sql
      BEGIN
      CREATE TABLE
      COPY 200000
      COMMIT
      BEGIN
      CREATE TABLE
      COPY 5
      COMMIT
      BEGIN
      CREATE TABLE
      COPY 25
      COMMIT
      BEGIN
      CREATE TABLE
      COPY 10000
      COMMIT
      BEGIN
      CREATE TABLE
      COPY 150000
      COMMIT
      BEGIN
      CREATE TABLE
      COPY 800000
      COMMIT
      BEGIN
      CREATE TABLE
      COPY 1500000
      COMMIT
      BEGIN
      CREATE TABLE
      COPY 6001215
      COMMIT
      ALTER TABLE
      ALTER TABLE
      ALTER TABLE
      ALTER TABLE
      ALTER TABLE
      ALTER TABLE
      ALTER TABLE
      ALTER TABLE
      tpch=# \d+
                                      List of relations
       Schema |          Name          |   Type   |  Owner   |    Size    | Description 
      --------+------------------------+----------+----------+------------+-------------
       public | customer               | table    | postgres | 29 MB      | 
       public | customer_c_custkey_seq | sequence | postgres | 8192 bytes | 
       public | lineitem               | table    | postgres | 950 MB     | 
       public | nation                 | table    | postgres | 8192 bytes | 
       public | nation_n_nationkey_seq | sequence | postgres | 8192 bytes | 
       public | orders                 | table    | postgres | 216 MB     | 
       public | orders_o_orderkey_seq  | sequence | postgres | 8192 bytes | 
       public | part                   | table    | postgres | 32 MB      | 
       public | part_p_partkey_seq     | sequence | postgres | 8192 bytes | 
       public | partsupp               | table    | postgres | 143 MB     | 
       public | region                 | table    | postgres | 8192 bytes | 
       public | region_r_regionkey_seq | sequence | postgres | 8192 bytes | 
       public | supplier               | table    | postgres | 1848 kB    | 
       public | supplier_s_suppkey_seq | sequence | postgres | 8192 bytes | 
      (14 rows)

    Generate TPC-H Queries 

    • cd /opt/db/ws/tpch/sql
    • cp /opt/db/tpch-tool/tpch_2_17_0/dbgen/dists.dss .
    • Run the command below to generate all 22 TPC-H queries in separate .sql file. Note that DSS_QUERY environment variable in the command line is required for qgen to find query templates
        $for q in `seq 1 22`;do DSS_QUERY=/opt/db/tpch-tool/tpch_2_17_0/dbgen/queries qgen $q > $q.sql;done;
    • The tpc-h query templates don't work with PostgreSQL out of the box. Below is the result when running 1.sql in psql terminal:

      psql:/opt/db/ws/tpch/sql/1.sql:49: ERROR:  syntax error at or near "("
      LINE 15:  l_shipdate <= date '1998-12-01' - interval '97' day (3)
                                                                    ^
      Time: 0.227 ms
      psql:/opt/db/ws/tpch/sql/1.sql:50: ERROR:  syntax error at or near "where"
      LINE 1: where rownum <= -1;

    • Below is the result after the errors are fixed 1.sql
    •  l_returnflag | l_linestatus | sum_qty  | sum_base_price |  sum_disc_price  |     sum_charge     |       avg_qty       |     avg_price      |        avg_disc
             | count_order
      --------------+--------------+----------+----------------+------------------+--------------------+---------------------+--------------------+-----------------
      -------+-------------
       A            | F            | 37734107 | 56586554400.73 | 53758257134.8700 | 55909065222.827692 | 25.5220058532573370 | 38273.129734621672 | 0.04998529583839
      761162 |     1478493
      (1 row)

      Time: 11534.090 ms


    • The tpch-postgres.7z has all 22 queries that have been modified to run in PostgreSQL. 
      • "where rownum <=" statement is replaced with "limit" statement
      • Correlated aggregate subqueries with huge outer tables is replaced with JOIN (query 17, 20)

    TPC-H Benchmark Result

    Below is the TPC-H Benchmark Result on a Dell Precision T3610 Workstation (CPU @3.5GHz with 64GB RAM) and PostgreSQL 9.5.4 from the latest postgres Docker image (as of August 2016).


    TCP-H Query SF=1 (ms) Table Header
    1 11534.090 Table Cell
    2 398.402 Table Cell
    3 1085.500 Table Cell
    4 551.310 Table Cell
    5 650.774 Table Cell
    6 1304.219 Table Cell
    7 965.973 Table Cell
    8 1836.722 Table Cell
    9 5330.943 Table Cell
    10 1801.567 Table Cell
    11 280.123 Table Cell
    12 1889.902 Table Cell
    13 1775.164 Table Cell
    14 1339.493 Table Cell
    15 0.192 Table Cell
    16 1198.179 Table Cell
    17 4607.648 Table Cell
    18 4161.339 Table Cell
    19 1673.113 Table Cell
    20 2308.152 Table Cell
    21 2130.221 Table Cell
    22 585.929 Table Cell

    4 comments:

    1. Dear Jim, Very accurate procedures in this article. I have get success using this procedures. I tried with other tutorials, but I did not find the DDL object creation code, just how to generate the data. Please leave this DDL table-creation script ( /opt/db/ws/tpch/sql/tpch-build-db.sql) available for a community and modify postgresql queries too. Do you have any DBT-3 installation article in so linux using postgresql? best regards

      ReplyDelete
    2. Thanks a ton for the wonderful article.

      ReplyDelete
    3. THANK YOU. This article has been soooooo helpful :)

      ReplyDelete