This PR clarifies which features are supported on P100 via its tests, though Pascal is not officially and fully supported by Triton.

## What this PR does

- Skip unsupported tests on P100.
  - Atomic RMW
  - `tl.dot()` (perhaps not all patterns, but basically most `tl.dot()` tests do not work on P100).
- Add an explicit error if shared memory size >= 64K on P100.
  - Otherwise it causes `Invalid CUDA argument` error at `cuLaunchKernel()`, but this error is not very straightforward to understand. Instead of this generic CUDA argument error, this PR makes Triton show an error during codegen when `sm < 70`. This check happens in C/C++ so won't add an overhead in Triton's Python runtime.
  - 3 tests (see below) are currently failing, but these are not marked as skipped because any codegen update in the future can change the kernel size of the other tests.
- This change won't affect Triton-MLIR. Hopefully Triton-MLIR's generic `tl.dot()` implementation would support P100.

Importantly, Triton passed all the other tests on P100. Though this support is not official, it is great for, for example, PyTorch's TorchDynamo/Inductor, which can use Triton (without `tl.dot()`) for its backend (https://github.com/pytorch/torchdynamo/issues/1591).

### Results on P100 (Google Cloud)

```sh
$ pytest test/unit
...
================================================================================== short test summary info ==================================================================================
FAILED test/unit/language/test_core.py::test_reduce2d[argmin-float32-shape99-1] - RuntimeError: Device does not support shared memory of 65536bytes
FAILED test/unit/language/test_core.py::test_reduce2d[argmax-float32-shape113-1] - RuntimeError: Device does not support shared memory of 65536bytes
FAILED test/unit/language/test_core.py::test_permute[float32-shape5-perm5] - RuntimeError: Device does not support shared memory of 67584bytes
================================================================== 3 failed, 3824 passed, 952 skipped in 470.90s (0:07:50) ==================================================================
```

<details><summary> <b>Environment Details (collapsed)</b></summary>
<p>

### VM details (Google Cloud)

https://cloud.google.com/

```
# You need a paid account (free trial does not cover GPUs)
Google Cloud -> New Project -> Compute-Engine -> VM Instance
  Machine:
    GPU: NVIDIA Tesla P100 x 1
    CPU: 2 vCPUs, 7.5GB memory
  Boot disk:
    OS: Ubuntu 18.04 LTS
    Disk: 40GB (cannot build Triton on the default 10GB disk)
- When I tried, about $1.2 per hour.
- US instances were full when I tried. I used Asia or Australia.
- Needed a paid account (GPU is not covered by free trial)
- Needed quota request for any GPU instance (by default, no GPU instance is allowed). Needed to wait an hour for approval
```

### Reproducer

```sh
## 1. Install CUDA and a driver
# Update the apt key (https://developer.nvidia.com/blog/updating-the-cuda-linux-gpg-repository-key/)
sudo apt-key del 7fa2af80
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-keyring_1.0-1_all.deb
# Download CUDA as instructed
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda
# Are you using P100?
nvidia-smi | grep "Tesla P100"

## 2. Setup the build environment
sudo apt update
sudo apt install -y build-essential wget git libz-dev
wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
bash Anaconda3-2022.05-Linux-x86_64.sh -b -p $(pwd)/anaconda3
eval "$($(pwd)/anaconda3/bin/conda shell.bash hook)"
conda create -y --name triton_base
conda activate triton_base
conda install -y cmake setuptools

## 3. Build Triton
git clone https://github.com/openai/triton.git
cd triton/python
pip3 install -e '.[tests]'

## 4. Test
pytest test/unit
```

### Environment

```sh
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    25W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
```

</p></details>
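
The shared-memory guard described under "What this PR does" can be illustrated with a small Python sketch. The real check lives in Triton's C/C++ codegen, so the function name `check_shared_memory`, its parameters, and the constant below are hypothetical. They only mirror the behavior stated in this description (an error when `sm < 70` and the kernel needs 64K or more of shared memory), with the message format copied from the test output above.

```python
# Illustrative only: the actual check is implemented in C/C++ during codegen.
# All names here are hypothetical and not part of Triton's API.
SHARED_MEMORY_LIMIT_BYTES = 64 * 1024  # the "64K" threshold from the PR description


def check_shared_memory(sm: int, shared_bytes: int) -> None:
    """Fail early with a readable error instead of failing at cuLaunchKernel().

    `sm` is the compute capability as an integer (60 for P100, 70 for V100).
    `shared_bytes` is the shared memory the generated kernel would request.
    """
    if sm < 70 and shared_bytes >= SHARED_MEMORY_LIMIT_BYTES:
        raise RuntimeError(
            f"Device does not support shared memory of {shared_bytes}bytes"
        )
```

Under these assumptions, `check_shared_memory(60, 65536)` raises the same `RuntimeError` text seen in the first failing test, while the same request with `sm=70` passes the check.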
Triton
This is the development repository of Triton, a language and compiler for writing highly efficient custom Deep-Learning primitives. The aim of Triton is to provide an open-source environment for expressing tensor math workloads that offers high flexibility, developer productivity and end to end performance.
The foundations of this project are described in the following MAPL2019 publication: Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. Please consider citing this work if you use Triton!
The official documentation contains installation instructions and tutorials.
Quick Installation
You can install the latest stable release of Triton from pip:
pip install triton
Binary wheels are available for CPython 3.6-3.9 and PyPy 3.6-3.7.
You can also install the latest nightly release:
pip install -U --pre triton
Install from source
git clone https://github.com/openai/triton.git;
cd triton/python;
pip install cmake; # build time dependency
pip install -e .
Changelog
Version 1.1 is out! New features include:
- Many, many bugfixes
- More documentation
- Automatic on-disk caching of compiled binary objects
- Random Number Generation
- Faster (up to 2x on A100), cleaner blocksparse ops
Contributing
Community contributions are more than welcome, whether it be to fix bugs or to add new features. Feel free to open GitHub issues about your contribution ideas, and we will review them. A contributor's guide containing general guidelines is coming soon!
If you’re interested in joining our team and working on Triton & GPU kernels, we’re hiring!
Compatibility
Supported Platforms:
- Linux
Supported Hardware:
- NVIDIA GPUs (Compute Capability 7.0+)
- Under development: AMD GPUs, CPUs
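
One way to read the hardware requirement above is to compare a device's compute capability against 7.0. The sketch below assumes you already have the capability as a `(major, minor)` tuple, for example from PyTorch's `torch.cuda.get_device_capability()`. The helper name `is_supported` is hypothetical and not part of Triton.

```python
from typing import Tuple

# Minimum taken from "Compute Capability 7.0+" in the list above.
MIN_CAPABILITY: Tuple[int, int] = (7, 0)


def is_supported(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability meets the minimum."""
    return tuple(capability) >= MIN_CAPABILITY


print(is_supported((6, 0)))  # Tesla P100 (Pascal)
print(is_supported((8, 0)))  # A100 (Ampere)
```

Tuple comparison is lexicographic, so a capability such as `(7, 5)` is handled without extra cases.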
Disclaimer
Triton is a fairly recent project, and it is under active development. We expect it to be pretty useful in a wide variety of cases, but don't be surprised if it's a bit rough around the edges :)