triton

Author	SHA1	Message	Date
Shintaro Iwasaki	77bc5187b5	Better NVIDIA Pascal GPU Support (#827)

This PR clarifies which features are supported on P100 via its tests, though Pascal is not officially and fully supported by Triton.

## What this PR does

- Skip unsupported tests on P100.
  - Atomic RMW
  - `tl.dot()` (perhaps not all patterns, but basically most `tl.dot()` tests do not work on P100).
- Add an explicit error if shared memory size >= 64K on P100.
  - Otherwise it causes `Invalid CUDA argument` error at `cuLaunchKernel()`, but this error is not very straightforward to understand. Instead of this generic CUDA argument error, this PR makes Triton show an error during codegen when `sm < 70`. This check happens in C/C++ so won't add an overhead in Triton's Python runtime.
  - 3 tests (see below) are currently failing, but these are not marked as skipped because any codegen update in the future can change the kernel size of the other tests.
- This change won't affect Triton-MLIR. Hopefully Triton-MLIR's generic `tl.dot()` implementation would support P100.

Importantly, Triton passed all the other tests on P100. Though this support is not official, it is great for, for example, PyTorch's TorchDynamo/Inductor, which can use Triton (without `tl.dot()`) for its backend (https://github.com/pytorch/torchdynamo/issues/1591).

### Results on P100 (Google Cloud)

```sh
$ pytest test/unit
...
================================================================================== short test summary info ==================================================================================
FAILED test/unit/language/test_core.py::test_reduce2d[argmin-float32-shape99-1] - RuntimeError: Device does not support shared memory of 65536bytes
FAILED test/unit/language/test_core.py::test_reduce2d[argmax-float32-shape113-1] - RuntimeError: Device does not support shared memory of 65536bytes
FAILED test/unit/language/test_core.py::test_permute[float32-shape5-perm5] - RuntimeError: Device does not support shared memory of 67584bytes
================================================================== 3 failed, 3824 passed, 952 skipped in 470.90s (0:07:50) ==================================================================
```

<details><summary> <b>Environment Details (collapsed)</b></summary> <p>

### VM details (Google Cloud)

https://cloud.google.com/

```
# You need a paid account (free trial does not cover GPUs)
Google Cloud -> New Project -> Compute-Engine -> VM Instance
  Machine:
    GPU: NVIDIA Tesla P100 x 1
    CPU: 2 vCPUs, 7.5GB memory
  Boot disk:
    OS: Ubuntu 18.04 LTS
    Disk: 40GB (cannot build Triton on the default 10GB disk)
- When I tried, about $1.2 per hour.
- US instances were full when I tried. I used Asia or Australia.
- Needed a paid account (GPU is not covered by free trial)
- Needed quota request for any GPU instance (by default, no GPU instance is allowed). Needed to wait an hour for approval
```

### Reproducer

```sh
## 1. Install CUDA and a driver
# Update the apt key (https://developer.nvidia.com/blog/updating-the-cuda-linux-gpg-repository-key/)
sudo apt-key del 7fa2af80
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-keyring_1.0-1_all.deb
# Download CUDA as instructed
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda
# Are you using P100?
nvidia-smi | grep "Tesla P100"

## 2. Setup the build environment
sudo apt update
sudo apt install -y build-essential wget git libz-dev
wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
bash Anaconda3-2022.05-Linux-x86_64.sh -b -p $(pwd)/anaconda3
eval "$($(pwd)/anaconda3/bin/conda shell.bash hook)"
conda create -y --name triton_base
conda activate triton_base
conda install -y cmake setuptools

## 3. Build Triton
git clone https://github.com/openai/triton.git
cd triton/python
pip3 install -e '.[tests]'

## 4. Test
pytest test/unit
```

### Environment

```sh
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    25W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
```

</p></details>	2022-11-03 00:11:52 -07:00
Philippe Tillet	4a77dfb042	[FRONTEND] Complete rewrite of the runtime (#644) This PR completely rewrites Triton's runtime to make it leaner and to clearly separate the compilation step from the just-in-time caching logic. This should substantially reduce launch overhead.	2022-09-18 08:51:48 -07:00
Keren Zhou	4912916c11	[FRONTEND] Added support for element-wise functions defined in external LLVM bitcode (e.g., libdevice) (#562)	2022-07-13 15:52:21 -07:00
Philippe Tillet	5b4c8f221e	[BACKEND] Compiler improvements (#557) This PR adds several optimization capabilities in the compiler backend: - Now using inline PTX for `tl.store`, making it possible to use things like evict_last - For A100, mma layout can be directly converted to shared memory - For A100, an additional "transpose" argument in `dot` allows tensors to be loaded once and used both row- and column-major. - Fixed liveness analysis; this was broken. - Can now load/store the mma layout directly without converting. Useful when the `tl.dot` accumulator is initialized with DRAM data inside of an inner loop. - `tl.dot` can now take LHS inputs in registers when they come from a previous `tl.dot` instruction. Useful for e.g. fused attention.	2022-06-27 11:49:19 -07:00
Philippe Tillet	8876e53206	[BACKEND] Restored reduction bugfixes	2022-06-03 11:38:52 -07:00
Philippe Tillet	a60374a597	Revert "[BACKEND] Various bug fixes; making reductions faster (#533)". This is a more stable commit that produces bitwise-identical code to earlier versions. Using commits after this one may lead to slightly different numerics.	2022-06-03 11:36:06 -07:00
Philippe Tillet	3e7500dfe6	[BACKEND] Various bug fixes; making reductions faster (#533)	2022-05-31 17:14:44 -07:00
Philippe Tillet	2bed6fc850	[LANG] Added support for device functions (#484)	2022-04-03 20:58:16 -07:00
daadaada	94a2e10fe5	[BACKEND] Add bf16 & tf32 mma support (on A100) (#426)	2022-01-11 10:20:31 -08:00
daadaada	39d4bfed83	[OPS] Add performance model for gemm/gemv (#397) Significantly improves the performance of `triton.ops.matmul` in memory-bound settings via the use of many more block configs coupled with a performance model to drive the auto-tuning process.	2021-12-21 09:56:10 -08:00
daadaada	858dec8372	[CODEGEN] Add cache modifier to tl.load (#351) * Add cache modifier to tl.load * Add comment to cache_modifier * Remove force_nc_cache * Update test	2021-10-17 22:14:04 -07:00
Philippe Tillet	94c83d30ce	[GENERAL] Removed deprecated driver files and added basic compatibility with ROCm (#268) - Removed driver module -- accelerator runtime is handled by PyTorch - Added basic support for ROCm based on @micmelesse's PR -- now can execute empty kernel on AMD devices without any compile-time changes - Now only using PREFER_SHARED for kernels when the size of shared memory is greater than 49k. Otherwise there can be poor L1 performance for broadcast tensors	2021-09-09 00:04:28 -07:00
daadaada	274d613488	[IR] Better printer (#256)	2021-09-01 09:55:12 -07:00
Philippe Tillet	4ff3714d61	[CODEGEN] Various bugfixes and stability improvements in compiler backend (#240)	2021-08-30 11:50:35 -07:00
Philippe Tillet	01276b5153	[FRONTEND] Added compilation flag to force use of `.nc` cache modifier (#134) in DRAM loads. /!\ USE CAREFULLY - THIS CAN BREAK CORRECTNESS IF MISUSED /!\	2021-07-27 12:38:49 -07:00
Philippe Tillet	8cea583109	[IR] Preliminary support for BF16 (#129) This PR adds a BF16 data-type, along with FP32 <-> BF16 conversion instructions in the LLVM codegen. Other kinds of ops on bfloat16 are not yet supported.	2021-07-27 12:38:49 -07:00
daadaada	d8d6b715c8	[CODEGEN] Performance improvement on A100 (#125) Improved codegen for the Ampere GPUs. * Make the layout pass recognize the multistage pipelined pattern. * Now the pipeline pass can automate the multistage pipelining transformation. * Remove extra barriers (from the prefetch pass & WAR) on Ampere. * Update the code generator (generator.cc) to make Triton generate n-buffered shared memory loads/stores.	2021-07-27 12:38:49 -07:00
Philippe Tillet	5a51f3e529	[CODEGEN] Bugfix in membar pass (#124) Membar pass on top of master is buggy with asynchronous copy. For example, it doesn't wait for asynchronous copies to complete before recoalescing accumulator in GEMM, which leads to undefined behavior when the program doesn't enter the loop. This PR proposes	2021-07-27 12:38:49 -07:00
Philippe Tillet	80c86ecf4a	[LANG] Minor semantic changes (#121) * Now using unordered instead of ordered float (fixes NaN issues) * Bool -> int32 now converts to 1 rather than -1 * Reduce extend arguments to 32-bits if possible	2021-07-27 12:38:49 -07:00
Philippe Tillet	0274429429	[IR] Added IR and Codegen support for atomic_rmw (#120)	2021-07-27 12:38:49 -07:00
daadaada	967e629c0c	[CODEGEN] Add a pass to prefetch operands of dot if applicable. (#105) * update membar pass when data is double buffered * Add instruction prefetch_s * prefetch tests pass (except the 1 warp case) * Fix the 1-warp bug * Add back prefetch files * Disable prefetch on a100 * Always add war barrier on sm>=80	2021-07-27 12:38:49 -07:00
Philippe Tillet	d10265f054	[CODEGEN] Bugfix for immediate offsets in inline PTX (#104)	2021-07-27 12:38:49 -07:00
Philippe Tillet	840140bf26	[CODEGEN] Removed dedicated reassociate pass to merge it into LLVM isel (#101) This massively simplifies the implementation of `reassociate` and also fixes a bunch of bugs. The pass could still be improved, but can already be used to generate constant pointer offsets in, e.g., the matmul epilogue.	2021-07-27 12:38:49 -07:00
Philippe Tillet	39f4730305	Deprecation of Triton-C and Replacement by decorated Python functions (#86) This PR implements a major overhaul of the frontend for Triton, and replaces Triton-C with a pure Python API in which kernels are defined as `@triton.jit`-decorated functions. The documentation and tutorials have also been updated to accommodate these changes. See the documentation for more information on the new API.	2021-07-27 12:38:49 -07:00
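
As an illustration of the explicit shared-memory error described in commit 77bc5187b5 above, here is a minimal Python sketch of that kind of guard. It is not Triton's actual code: the commit says the real check lives in C/C++ codegen, and the function name, constant name, and exact comparison below are assumptions made for this example. Only the `sm < 70` condition, the 64K threshold, and the error text are taken from the commit message and its test output.

```python
# Hypothetical sketch only: the names here are invented for illustration,
# and the real check is implemented in Triton's C/C++ codegen.
SHARED_MEMORY_LIMIT_PRE_VOLTA = 64 * 1024  # bytes; the "64K" from the commit message


def check_shared_memory(compute_capability: int, shared_bytes: int) -> None:
    """Fail early when an sm < 70 device would be asked for too much shared memory."""
    if compute_capability < 70 and shared_bytes >= SHARED_MEMORY_LIMIT_PRE_VOLTA:
        # Message format copied from the failing tests quoted in the commit.
        raise RuntimeError(
            f"Device does not support shared memory of {shared_bytes}bytes"
        )


check_shared_memory(60, 48 * 1024)  # P100 (sm_60) with a small kernel: no error
check_shared_memory(80, 65536)      # A100 (sm_80): this guard does not apply

try:
    check_shared_memory(60, 65536)  # P100 with a 64 KiB request
except RuntimeError as err:
    message = str(err)
```

Raising at this point gives the user a message naming the shared-memory size, instead of the generic `Invalid CUDA argument` error from `cuLaunchKernel()` that the commit mentions.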

24 Commits
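
The "Bool -> int32 now converts to 1 rather than -1" change in commit 80c86ecf4a can be illustrated with a small sketch. The commit message does not say how the old value arose; the example below assumes the difference is sign extension versus zero extension of a 1-bit value, which is one common way to get -1 versus 1. The helper `extend_i1` is hypothetical and is not part of Triton.

```python
def extend_i1(bit: int, signed: bool, width: int = 32) -> int:
    """Widen a 1-bit value to a `width`-bit two's-complement integer."""
    if bit not in (0, 1):
        raise ValueError("bit must be 0 or 1")
    if signed:
        # Sign extension copies the single bit into every position,
        # so a set bit becomes an all-ones pattern.
        pattern = ((1 << width) - 1) if bit else 0
    else:
        # Zero extension keeps the bit in the lowest position only.
        pattern = bit
    # Reinterpret the unsigned bit pattern as a signed integer.
    if pattern >= 1 << (width - 1):
        pattern -= 1 << width
    return pattern


old_true = extend_i1(1, signed=True)   # all ones, read as a signed int32
new_true = extend_i1(1, signed=False)  # only the lowest bit set
```

Under this assumption `False` maps to 0 either way, so only the value produced for `True` changes.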