Sunday, July 16, 2023

torch.fx Notes

make_fx

decomposition_table

The decomposition_table argument to the make_fx function in PyTorch is a dictionary that maps ATen operator overloads (for example torch.ops.aten.silu.default) to Python functions that re-express each operator in terms of simpler ATen operators. When make_fx meets an operator during tracing that is a key in the table, it calls the decomposition instead and records the operators that the decomposition executes. The main use is to shrink the set of operators that a backend or compiler pass has to support; it is not a performance optimization by itself. For example, the following code uses silu as an illustrative operator and decomposes it into sigmoid and mul:

import torch
from torch.fx.experimental.proxy_tensor import make_fx

aten = torch.ops.aten

def silu_decomposition(x):
  # silu(x) = x * sigmoid(x), written with simpler ATen operators
  return x * torch.sigmoid(x)

def make_decomposition_table():
  # Keys are ATen operator overloads, values are Python callables
  return {aten.silu.default: silu_decomposition}

decomposition_table = make_decomposition_table()
# make_fx returns a wrapper; calling it with example inputs runs the trace
# and returns a torch.fx.GraphModule
gm = make_fx(
    lambda x: torch.nn.functional.silu(x),
    decomposition_table=decomposition_table
)(torch.randn(10))
print(gm.graph)

The make_decomposition_table function creates a dictionary that maps the aten.silu.default overload to a hand-written decomposition. The keys must be ATen operators, not Python-level functions such as torch.nn.functional.silu, because make_fx traces at the ATen level.
The make_fx function returns a wrapper rather than a graph. Calling the wrapper with example inputs runs the function, applies the decompositions in the table, and returns a GraphModule whose graph contains sigmoid and mul nodes instead of a silu node.
A decomposition can be supplied for any ATen operator. However, a decomposed graph is not necessarily faster than the original, since one fused kernel may be replaced by several smaller ones. Therefore, it is important to benchmark the traced graph before using it in production.
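
To make the effect of a decomposition table visible, the sketch below traces the same function twice, once without a table and once with one, and lists the operators recorded in each graph. The choice of silu, the hand-written silu_decomposition, and the call_targets helper are assumptions made for illustration; they are not part of the PyTorch API.

```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

aten = torch.ops.aten


def f(x):
    return torch.nn.functional.silu(x)


def silu_decomposition(x):
    # Hypothetical hand-written decomposition: silu(x) = x * sigmoid(x).
    return x * torch.sigmoid(x)


def call_targets(gm):
    # Collect the operator recorded at each call_function node of the graph.
    return [node.target for node in gm.graph.nodes if node.op == "call_function"]


example = torch.randn(4)

# make_fx returns a wrapper; calling it with example inputs runs the trace.
plain = make_fx(f)(example)
decomposed = make_fx(
    f, decomposition_table={aten.silu.default: silu_decomposition}
)(example)

plain_targets = call_targets(plain)
decomposed_targets = call_targets(decomposed)
print(plain_targets)
print(decomposed_targets)
```

Both GraphModules remain callable, so the two versions can be checked against each other numerically and timed before one is chosen.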

get_decompositions

The get_decompositions function is defined in torch._decomp (a private module, so its location may change between releases). It takes a list of ATen operators and returns a dictionary that maps each operator overload with a registered decomposition to the Python function that implements it. Operators without a registered decomposition are left out of the result. The dictionary can be passed directly as the decomposition_table argument of make_fx. For example, the following code looks up the registered decompositions for two operators:

import torch
from torch._decomp import get_decompositions

# The list of ATen operators is required; there is no zero-argument form
decompositions = get_decompositions([torch.ops.aten.silu, torch.ops.aten.addmm])
print(decompositions)

This code prints a dictionary that maps ATen operator overloads to their decomposition functions. The exact entries depend on the PyTorch version; the output has roughly this shape, with the function reprs elided:
```
{
  <OpOverload(op='aten.silu', overload='default')>: <function ...>,
  <OpOverload(op='aten.addmm', overload='default')>: <function ...>,
  ...
}
```
Inspecting this dictionary shows which operators will be rewritten during tracing and which function does the rewriting, which is useful when debugging a traced graph that contains unexpected operators.

Friday, September 13, 2019

Xilinx SDx/Vitis Tips

This post shows the live tweet stream for the hashtags #sdxtips and #vitistips. Please help the Xilinx developer community by sharing your tips and experience using the same hashtags.

Xilinx SDx and Vitis Tips - Curated tweets by jimwu88

Thursday, May 30, 2019

Jupyter Notebook Table of Contents Extension

The Jupyter Notebook Table of Contents extension comes in really handy, especially when I work on large notebooks with many sections and want to jump to a specific section quickly. Since I had to google a few times to install it, I am posting the exact instructions here in case they help others.

Install nbextensions Python package

pip install jupyter_contrib_nbextensions

Install javascript and css files

jupyter contrib nbextension install --user

Enable toc2 extension

jupyter nbextension enable toc2/main

Wednesday, September 7, 2016

Use PGStrom with PostgreSQL on Ubuntu 14.04 LTS

Software Installation

Install useful utilities
sudo apt-get install wget git rpm

Install Cuda

Download Cuda local install package from https://developer.nvidia.com/cuda-downloads

Run the commands below to install Cuda and reboot

$sudo dpkg -i cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
$sudo apt-get update
$sudo apt-get install cuda

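Create /etc/ld.so.conf.d/cuda-lib64.conf with the Cuda library path. The file is owned by root, so a plain shell redirect fails for a regular user; pipe through sudo tee instead, then refresh the linker cache
```
$echo /usr/local/cuda/lib64 | sudo tee /etc/ld.so.conf.d/cuda-lib64.conf
$sudo ldconfig
```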
Build the deviceQuery sample from Cuda installation and verify that the Cuda driver works

$cp -r /usr/local/cuda/samples .
$cd samples/1_Utilities/deviceQuery 
$make
$./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Quadro K1000M"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 2047 MBytes (2146762752 bytes)
  ( 1) Multiprocessors, (192) CUDA Cores/MP:     192 CUDA Cores
  GPU Max Clock rate:                            851 MHz (0.85 GHz)
  Memory Clock rate:                             900 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = Quadro K1000M
Result = PASS

Install PostgreSQL

Import PostgreSQL repository key

$wget -O - http://apt.postgresql.org/pub/repos/apt/ACCC4CF8.asc | sudo apt-key add -

Create or edit /etc/apt/sources.list.d/pgdg.list with the line below
```
deb http://apt.postgresql.org/pub/repos/apt/ trusty-pgdg main
```

Configure apt's package pinning to prefer the PGDG packages over the Debian ones in /etc/apt/preferences.d/pgdg.pref. Note that this will replace all your Debian/Ubuntu packages that have a counterpart in the PGDG repository.
```
Package: *
Pin: release o=apt.postgresql.org
Pin-Priority: 500
```
Update the package lists, and install the pgdg-keyring package to automatically get repository key updates
```
$sudo apt-get update
$sudo apt-get install pgdg-keyring
```

Install PostgreSQL 9.5

$sudo apt-get install -y postgresql-common
$sudo apt-get install -y postgresql-9.5 postgresql-contrib-9.5 postgresql-server-dev-9.5 libpq-dev

The PostgreSQL server will automatically be started after installation. Run the commands below to manually start or restart the server if needed.
```
$sudo service postgresql restart
$sudo service postgresql start
```
Run the command below to start psql terminal and connect to the server
```
$sudo -u postgres psql
```

Install PGStrom [PGStrom Wiki]

Download and install PGStrom

$git clone https://github.com/pg-strom/devel pg_strom
$cd pg_strom
$which pg_config
/usr/local/pgsql/bin/pg_config
$make
$sudo make install

Edit /etc/postgresql/9.5/main/postgresql.conf to update the configuration for PGStrom
- Add /usr/lib/postgresql/9.5/lib/pg_strom.so to shared_preload_libraries (a comma-separated list inside single quotes; uncomment the line if it is commented out)
Restart PostgreSQL service
```
$sudo service postgresql restart
```
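
The shared_preload_libraries edit above can also be scripted. The sketch below is a hypothetical helper, not part of PostgreSQL or PGStrom: add_preload_library works on the text of postgresql.conf, assumes the file holds at most one shared_preload_libraries line with a single-quoted value, and ignores the value of a commented-out line.

```python
import re

# Matches an active or commented-out shared_preload_libraries line and captures
# the comment marker, the quoted value, and anything after the closing quote.
_SPL = re.compile(
    r"^[ \t]*(#?)[ \t]*shared_preload_libraries[ \t]*=[ \t]*'([^']*)'(.*)$",
    re.MULTILINE,
)


def add_preload_library(conf_text, library):
    """Return conf_text with `library` listed in shared_preload_libraries."""
    match = _SPL.search(conf_text)
    if match is None:
        # No existing setting: append one at the end of the file.
        return conf_text.rstrip("\n") + f"\nshared_preload_libraries = '{library}'\n"
    commented = match.group(1) == "#"
    # A commented-out line carries no active value, so start from an empty list.
    current = [] if commented else [
        item.strip() for item in match.group(2).split(",") if item.strip()
    ]
    if library not in current:
        current.append(library)
    new_line = f"shared_preload_libraries = '{','.join(current)}'{match.group(3)}"
    return conf_text[: match.start()] + new_line + conf_text[match.end():]


sample = "#shared_preload_libraries = ''\t# (change requires restart)\n"
updated = add_preload_library(sample, "/usr/lib/postgresql/9.5/lib/pg_strom.so")
print(updated)
```

Working on the file text rather than the file itself keeps the helper easy to test; writing the result back to /etc/postgresql/9.5/main/postgresql.conf needs root, and the server still has to be restarted for the change to take effect.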

The PostgreSQL log file /var/log/postgresql/postgresql-9.5-main.log shows that PGStrom is loaded and has detected a supported GPU device

$ cat /var/log/postgresql/postgresql-9.5-main.log 
LOG:  PG-Strom version 1.0devel built for PostgreSQL 9.5
LOG:  CUDA Runtime version: 7.5.0
LOG:  NVIDIA driver version: 352.63
LOG:  GPU0 Quadro K1000M (192 CUDA cores, 850MHz), L2 256KB, RAM 2047MB (128bits, 900MHz), capability 3.0
LOG:  NVRTC - CUDA Runtime Compilation vertion 7.5

Test PGStrom Integration with PostgreSQL

Start psql terminal
```
sudo -u postgres psql
```

Create pg_strom extension

postgres=# CREATE EXTENSION pg_strom;
CREATE EXTENSION

Create a test DB and connect to it

postgres=# create database testdb;
CREATE DATABASE
postgres=# \c testdb
You are now connected to database "testdb" as user "postgres".

Create the test tables from the SQL script in the PGStrom source tree. It creates one table with 100M rows and nine tables with 100K rows each. The path is relative to the directory psql was started from.
```
testdb=# \i pg_strom/test/testdb.sql
```

Check pg_strom parameters used by postgresql:

postgres=# show all;
pg_strom.bulkexec                   | on                                       | Enables the bulk-execution mode of PG-Strom
pg_strom.chunk_limit                | 79360kB                                  | limit size of pgstrom_data_store
pg_strom.chunk_size                 | 15872kB                                  | default size of pgstrom_data_store
pg_strom.chunk_size_margin          | 1.25                                     | margin of chunk size if not predictable exactly
pg_strom.cpu_fallback               | on                                       | Enables CPU fallback if GPU is 
pg_strom.cuda_visible_devices       |                                          | CUDA_VISIBLE_DEVICES of CUDA runtime
pg_strom.debug_cuda_coredump        | off                                      | Turn on/off GPU coredump feature
pg_strom.debug_force_gpupreagg      | off                                      | Force GpuPreAgg regardless of the cost (debug)
pg_strom.debug_force_gpusort        | off                                      | Force GpuSort regardless of the cost (debug)
pg_strom.debug_kernel_source        | off                                      | Turn on/off to display the kernel source path
pg_strom.enable_gpuhashjoin         | on                                       | Enables the use of GpuHashJoin logic
pg_strom.enable_gpunestloop         | on                                       | Enables the use of GpuNestLoop logic
pg_strom.enable_gpupreagg           | off                                      | Enables the use of GPU preprocessed aggregate
pg_strom.enable_gpuscan             | on                                       | Enables the use of GPU accelerated full-scan
pg_strom.enable_gpusort             | off                                      | Enables the use of GPU accelerated sorting
pg_strom.enabled                    | off                                      | Enables the planner's use of PG-Strom
pg_strom.gpu_dma_cost               | 10                                       | Cost to send/recv data via DMA
pg_strom.gpu_operator_cost          | 7.8125e-05                               | Cost of processing each operators by GPU
pg_strom.gpu_setup_cost             | 4000                                     | Cost to setup GPU device to run
pg_strom.gpu_tuple_cost             | 0.0003125                                | Cost of processing each tuple for GPU
pg_strom.max_async_tasks            | 32                                       | max number of GPU tasks to be run asynchronously
pg_strom.num_threads_margin         | 1.1                                      | margin of number of CUDA threads if not predictable exactly
pg_strom.perfmon                    | off                                      | Enables the performance monitor of PG-Strom
pg_strom.program_cache_size         | 48MB                                     | size of shared program cache
pg_strom.pullup_outer_scan          | on                                       | Enables to pull up simple outer scan

Review the execution plan of a query and confirm GPU offload is enabled (GpuPreAgg, GpuJoin, etc)

testdb=# EXPLAIN                 
SELECT cat, avg(ax) FROM t0 NATURAL JOIN t1 NATURAL JOIN t2 GROUP BY cat;
                                         QUERY PLAN                                          
---------------------------------------------------------------------------------------------
 HashAggregate  (cost=3349749.08..3349749.40 rows=26 width=12)
   Group Key: t0.cat
   ->  Custom Scan (GpuPreAgg)  (cost=16325.86..2879097.75 rows=234 width=44)
         Reduction: Local + Global
         GPU Projection: cat, ax
         ->  Custom Scan (GpuJoin) on t0  (cost=12325.86..2851751.79 rows=99599458 width=12)
               GPU Projection: t0.cat, t1.ax
               Depth 1: GpuHashJoin, HashKeys: (t0.aid)
                        JoinQuals: (t0.aid = t1.aid)
                        Nrows (in/out: 99.60%), KDS-Hash (size: 13.47MB, nbatches: 1)
               Depth 2: GpuHashJoin, HashKeys: (t0.bid)
                        JoinQuals: (t0.bid = t2.bid)
                        Nrows (in/out: 100.00%), KDS-Hash (size: 13.47MB, nbatches: 1)
               ->  Seq Scan on t1  (cost=0.00..1935.00 rows=100000 width=12)
               ->  Seq Scan on t2  (cost=0.00..1935.00 rows=100000 width=4)
(15 rows)

Turn off pgstrom and review the same execution plan (default join, hash, etc).

testdb=# set pg_strom.enabled=off; 
SET
testdb=# EXPLAIN                   
SELECT cat, avg(ax) FROM t0 NATURAL JOIN t1 NATURAL JOIN t2 GROUP BY cat;
                                    QUERY PLAN                                    
----------------------------------------------------------------------------------
 HashAggregate  (cost=6843172.91..6843173.24 rows=26 width=12)
   Group Key: t0.cat
   ->  Hash Join  (cost=7250.00..6345175.62 rows=99599458 width=12)
         Hash Cond: (t0.aid = t1.aid)
         ->  Hash Join  (cost=3576.00..4188792.08 rows=99997856 width=8)
               Hash Cond: (t0.bid = t2.bid)
               ->  Seq Scan on t0  (cost=0.00..1833312.56 rows=99997856 width=12)
               ->  Hash  (cost=1935.00..1935.00 rows=100000 width=4)
                     ->  Seq Scan on t2  (cost=0.00..1935.00 rows=100000 width=4)
         ->  Hash  (cost=1935.00..1935.00 rows=100000 width=12)
               ->  Seq Scan on t1  (cost=0.00..1935.00 rows=100000 width=12)
(11 rows)
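
The two plans can be compared by their top-level cost estimates. The sketch below pulls the total cost (the number after `..`) out of each HashAggregate line shown above and computes the ratio. The helper name total_cost is an assumption for illustration. Planner costs are estimates in arbitrary units, not measured run times, so the ratio only indicates how the planner ranks the two plans.

```python
import re

# Top-level plan lines copied from the two EXPLAIN outputs above.
GPU_PLAN_TOP = "HashAggregate  (cost=3349749.08..3349749.40 rows=26 width=12)"
CPU_PLAN_TOP = "HashAggregate  (cost=6843172.91..6843173.24 rows=26 width=12)"

# cost=<startup cost>..<total cost>
_COST = re.compile(r"cost=(\d+(?:\.\d+)?)\.\.(\d+(?:\.\d+)?)")


def total_cost(plan_line):
    """Return the estimated total cost of a plan node line."""
    match = _COST.search(plan_line)
    if match is None:
        raise ValueError(f"no cost found in: {plan_line!r}")
    return float(match.group(2))


gpu_cost = total_cost(GPU_PLAN_TOP)
cpu_cost = total_cost(CPU_PLAN_TOP)
ratio = cpu_cost / gpu_cost
print(f"CPU-only estimate is {ratio:.2f}x the PG-Strom estimate")
```

For measured times rather than estimates, run the same query with EXPLAIN ANALYZE under both settings.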

Sunday, August 28, 2016

Run PostgreSQL on Docker

Host and Containers

Host: Ubuntu 14.04
- Add <username> to the docker group so sudo is not required to run docker commands (log out and back in for the change to take effect)
  sudo usermod -a -G docker <username>
- Create two directories below. They will be mounted to containers to make DB and development files persist outside the containers
  /opt/db/ws
  /opt/db/postgresql/data
Container: postgresql : https://hub.docker.com/_/postgres/
- This image includes EXPOSE 5432 (the postgres port), so standard container linking will make it automatically available to the linked containers. The default postgres user and database are created in the entry point with initdb.
- The default data directory for PostgreSQL is /var/lib/postgresql/data

Run PostgreSQL Containers for Server and Client

ssh <host>
docker run --name pgs-server -v /opt/db/ws:/opt/db/ws -v /opt/db/postgresql/data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -d postgres

run
Start a new container using the requested image

--name pgs-server
Set container name to pgs-server.

-v /opt/db/postgresql/data:/var/lib/postgresql/data
-v /opt/db/ws:/opt/db/ws
Mount host directory /opt/db/postgresql/data as /var/lib/postgresql/data on the container so DB files persist on the hard disk
Mount host directory /opt/db/ws as /opt/db/ws to be used for development.

-e POSTGRES_PASSWORD=postgres
Set up the POSTGRES_PASSWORD environment variable, which sets the master PostgreSQL password

-d
Run the container in the background (detached mode). It keeps running until it is stopped

postgres
Runs postgres docker image
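
The option breakdown above can be turned into a small script that assembles the same command from named pieces. This is a minimal sketch; docker_run_args is a hypothetical helper, not part of Docker or any Docker SDK, and it only builds the argument list without running anything.

```python
import shlex


def docker_run_args(name, image, volumes, env, detach=True):
    """Build the argv list for `docker run` from named pieces."""
    args = ["docker", "run", "--name", name]
    for host_dir, container_dir in volumes:
        # -v host_dir:container_dir mounts a host directory into the container.
        args += ["-v", f"{host_dir}:{container_dir}"]
    for key, value in env.items():
        # -e KEY=VALUE sets an environment variable inside the container.
        args += ["-e", f"{key}={value}"]
    if detach:
        args.append("-d")
    args.append(image)
    return args


server_args = docker_run_args(
    name="pgs-server",
    image="postgres",
    volumes=[
        ("/opt/db/ws", "/opt/db/ws"),
        ("/opt/db/postgresql/data", "/var/lib/postgresql/data"),
    ],
    env={"POSTGRES_PASSWORD": "postgres"},
)
print(shlex.join(server_args))
```

Passing server_args to subprocess.run(server_args, check=True) would start the container; keeping the arguments as a list avoids shell quoting problems if a path or password contains spaces.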
Optional: Start a bash shell on the pgs-server container for miscellaneous tasks
docker exec -it pgs-server bash
Start a PostgreSQL client container
docker run --name pgs-client -v /opt/db/ws:/opt/db/ws -it --rm --link pgs-server:postgres postgres psql -h postgres -U postgres

run
Start a new container using the requested image

--name pgs-client
Set container name to pgs-client. If this option is not provided, docker will assign a random name.

-v /opt/db/ws:/opt/db/ws
Mount host directory /opt/db/ws as /opt/db/ws to be used for development.

-it
Run the container in interactive mode

--rm
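Automatically remove the container after it exits so stopped containers do not accumulate

--link pgs-server:postgres
Link the pgs-client container to the pgs-server container, which becomes reachable from the client under the alias postgres (the host name passed to psql -h)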

postgres
Use postgres Docker image

psql -h postgres -U postgres
The command at the end tells Docker what to execute when the container starts. In this case, it starts the interactive postgres terminal
Optional: Start a bash shell on the 2nd container for miscellaneous tasks
$docker exec -it pgs-client bash