ISAAC

This is the development repository for ISAAC, an input-aware auto-tuning framework and code-generator for HPC/DL. This version is only compatible with NVIDIA hardware (it generates PTX source code). For OpenCL/CUDA compatibility, visit the Intel fork (https://github.com/intel/isaac) or the v1.0 branch (deprecated) or the

License

ISAAC is distributed under the MIT/X11 license.

Getting started - Deep Learning Inference

Run the following commands in a Python environment that contains a recent version of PyTorch:

git clone https://github.com/ptillet/isaac.git
cd isaac/python;
python setup.py build;
python setup.py install;
cd examples/pytorch;
python imagenet.py --arch resnet152 /path/to/imagenet/;

This should give you 78.1% accuracy and roughly a 4x speed-up over PyTorch.

Getting started - C++ API

In order to compile and use the ISAAC C++ API, only a proprietary NVIDIA driver is necessary. No CUDA SDK is required (except for testing and benchmarking against cuBLAS/cuDNN):

git clone https://github.com/ptillet/isaac.git
cd isaac; 
mkdir build; 
cd build;
cmake ../ ; make -j8;
./examples/isaac-tools --gemm --bench --suite deepbench --dtype float32
./examples/isaac-tools --conv --bench --suite deepbench --dtype float32

If you want, you can also dump the PTX source code generated by ISAAC for some shapes:

./examples/isaac-tools --gemm --dump --format ptx --shape 2048,2048,2048 --layout NT --dtype float32
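For scripting these dump commands over several shapes, the argument list can be assembled programmatically. The following is only a minimal sketch: the helper name `isaac_dump_argv` is hypothetical, and it uses only the flags and values that appear in this README (`--gemm`, `--dump`, `--format`, `--shape`, `--layout`, `--dtype`). It builds the command line but does not run the tool.

```python
import shlex


def isaac_dump_argv(fmt, shape, layout="NT", dtype="float32",
                    tool="./examples/isaac-tools"):
    """Build the argv list for an `isaac-tools --gemm --dump` call.

    Hypothetical helper: only the flags shown in this README are used.
    """
    # "ptx" and "params" are the only two formats this README demonstrates.
    if fmt not in ("ptx", "params"):
        raise ValueError(f"format not covered by this sketch: {fmt!r}")
    if len(shape) != 3:
        raise ValueError("a GEMM shape needs three dimensions")
    return [
        tool, "--gemm", "--dump",
        "--format", fmt,
        "--shape", ",".join(str(d) for d in shape),
        "--layout", layout,
        "--dtype", dtype,
    ]


argv = isaac_dump_argv("ptx", (2048, 2048, 2048))
command = shlex.join(argv)  # shell-safe string form of the same command
print(command)
```

The resulting list can be handed to `subprocess.run(argv, capture_output=True, text=True)` from inside the build directory once the tool has been compiled.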

If you really know what you're doing, you can also capture the tiling parameters found by ISAAC:

./examples/isaac-tools --gemm --dump --format params --shape 2048,2048,2048 --layout NT --dtype float32

You will get the following output:

Tuning parameters: 4, 16, 8, 8, 8, 8, 16, 8, 16, 8, 1, 1, 1

The parameters mean, in order: (1) shared memory loads have a width of 4; (2) each block comprises 16x8 threads; (3) each thread computes a tile of 8x8 elements; (4) each loop iteration processes 8 elements along the K axis; (5) threads are rearranged as a 16x8 block for loading A and a 16x8 block for loading B; (6) the reduction is split across 1, 1 and 1 independent batches within each thread, thread-block and grid, respectively, and the results are accumulated after the inner loop.
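To make the mapping concrete, the sketch below parses the output line into named fields. It assumes the thirteen integers are printed in the same order as the description above. The field names are invented for illustration and are not ISAAC identifiers. The derived quantities (threads per block, output tile per block) are plain arithmetic on those fields, under the further assumption that the thread-grid and per-thread tile dimensions pair up in the order listed.

```python
from typing import NamedTuple


class GemmTuning(NamedTuple):
    """Illustrative names only; the order is assumed from the description."""
    load_width: int       # (1) width of shared memory loads
    block_threads_0: int  # (2) threads per block, first dimension
    block_threads_1: int  # (2) threads per block, second dimension
    thread_tile_0: int    # (3) per-thread tile, first dimension
    thread_tile_1: int    # (3) per-thread tile, second dimension
    k_step: int           # (4) elements along K per loop iteration
    load_a_0: int         # (5) thread arrangement for loading A
    load_a_1: int
    load_b_0: int         # (5) thread arrangement for loading B
    load_b_1: int
    split_thread: int     # (6) reduction split within each thread
    split_block: int      # (6) reduction split within each thread-block
    split_grid: int       # (6) reduction split within the grid


def parse_tuning(line: str) -> GemmTuning:
    """Parse a 'Tuning parameters: a, b, ...' line into named fields."""
    _, _, values = line.partition(":")
    numbers = [int(v) for v in values.split(",")]
    if len(numbers) != len(GemmTuning._fields):
        raise ValueError(f"expected 13 integers, got {len(numbers)}")
    return GemmTuning(*numbers)


t = parse_tuning("Tuning parameters: 4, 16, 8, 8, 8, 8, 16, 8, 16, 8, 1, 1, 1")

threads_per_block = t.block_threads_0 * t.block_threads_1
# Output tile covered by one block, if each thread owns one 8x8 tile.
block_tile = (t.block_threads_0 * t.thread_tile_0,
              t.block_threads_1 * t.thread_tile_1)
# Both load arrangements reshuffle the same set of threads.
assert t.load_a_0 * t.load_a_1 == threads_per_block
assert t.load_b_0 * t.load_b_1 == threads_per_block

print(threads_per_block, block_tile)
```

For the example line, this gives 128 threads per block and, under the stated assumption, a 128x64 output tile per block.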

Benchmarks - C++ API

ISAAC often provides

Tesla P100 - SGEMM: [figure: sgemm-gv100]

Tesla P100 - DGEMM: [figure: sgemm-gv100]

Tesla P100 - SCONV (vs cuDNN's IMPLICIT_PRECOMP_GEMM): [figure: sgemm-gv100]

Acknowledgments

This work was partially supported by the National Science Foundation (IIS 1409097) and by IARPA (contract D16PC00002).
