triton

Author	SHA1	Message	Date
donproc	521ff9ad74	[TRITON-MLIR][FRONTEND]fix scf.if to run through layernorm tutorial (#938 ) Co-authored-by: dongdongl <dongdongl@nvidia.com>	2022-12-02 17:45:29 +08:00
donproc	9def1bcebf	[TRITON-MLIR][FRONTEND]minor fix to run through atomic_cas test (#925 ) Co-authored-by: dongdongl <dongdongl@nvidia.com>	2022-12-01 13:43:26 +00:00
Philippe Tillet	6461254fb5	[BACKEND] Make flash attention forward pass work (#928 ) This also simplifies BroadcastOp codegen	2022-11-30 10:13:24 +00:00
Philippe Tillet	9bb54402b3	[FRONTEND][BACKEND] Small fixes to multiple_of, num_programs, axisinfo; enable block-sparse tests (#927 )	2022-11-29 20:00:34 +01:00
Crutcher Dunnavant	f98aed1258	[Triton-MLIR][RUNTIME] Add /usr/bin/ptxas as a search path (#909 ) Make `ptxas` search a bit broader to include `/usr/bin/ptxas`, installed by the lambda stack repo versions: https://lambdalabs.com/lambda-stack-deep-learning-software	2022-11-24 18:49:16 +00:00
Crutcher Dunnavant	ace7d28736	[Triton-MLIR][RUNTIME] Fix ir metadata lookup bug (#910 )	2022-11-24 09:27:23 +01:00
ben-zhang-609	07786dc932	[Triton-MLIR] Add compute capability (#902 ) add compute capability from python frontend to backend. Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2022-11-22 11:08:23 -08:00
Philippe Tillet	23f71daa27	[OPTIMIZER] Fixed up order of shared layouts (#881 )	2022-11-21 06:25:02 +01:00
Philippe Tillet	4d64ffb5fe	[FRONTEND] Handle for loops with negative constant steps (#896 )	2022-11-20 11:37:38 +01:00
Keren Zhou	6c5f646f4e	[WIP][Triton-MLIR] Prefetch pass fixup (#873 ) A (potential) problem caused by directly adopting `tensor.extract_slice`. Long story short, `tensor.extract_slice` is not aware of swizzling. Consider the following shared memory tensor and its first three slices, where each slice includes two tiles (the loading unit of LDGSTS) of elements. Currently, the tiles haven't been swizzled yet, so slicing seems to work. [image: https://user-images.githubusercontent.com/2306281/201833023-a7950705-2d50-4c0a-8527-7505261c3a3c.png] However, now consider the following figure, which is the layout after applying swizzling on the first figure. [image: https://user-images.githubusercontent.com/2306281/201834824-7daae360-f5bc-4e6b-a921-20be3f294b78.png] Note that on phase 2, all tiles have been swizzled out of their original slices. This implies that if we use the tile index after slicing, we can no longer locate the correct tiles. For example, T3 was in slice 1 but got swapped to slice 0 after swizzling. Here's a more detailed explanation. In the current `triton-mlir` branch, we only compute the relative offset of each tile. So T3's index in Slice 1 is 1, and it will be swizzled using 1 and phase id. Whereas the correct index of T3 should be 3, which is the relative offset to the beginning of the shared memory tensor being swizzled, and T3 should be swizzled using 3 and phase id. This PR proposes a hacky solution for this problem. We restore the "correct" offset of each tile by assuming that slicing on a specific dim only happens at most once on the output of insert_slice_async. I admit it's risky and fragile. The other possible solution is adopting cutlass' swizzling logic that limits the indices being swizzled in a "bounding box" that matches what the mma instruction executes. For example, in the following tensor layout, each 4x4 submatrix is a minimum swizzling unit, and the entire tensor represents the tensor layout of operand A in `mma.16816`. [image: https://user-images.githubusercontent.com/2306281/201836879-4ca7824b-530c-4a06-a3d5-1e74a2de1b42.png] Co-authored-by: Phil Tillet <phil@openai.com>	2022-11-19 19:57:16 -08:00
Jun Yang	8a5647782d	[Triton-MLIR][Testing]Fix tests warning, with small code clean-up (#894 ) 1. Code clean-up to remove superfluous #includes. 2. Fix two python test warnings, one of which relates to ["#" formats](https://jira.mongodb.org/browse/PYTHON-2343), the other to regular expression string usage.	2022-11-19 14:33:59 +00:00
Philippe Tillet	dab4855bdf	[TESTING] Added infrastructure for executing TTGIR program and test for layout conversions (#885 )	2022-11-18 07:46:45 +01:00
Chenggang Zhao	516a241234	[Triton-MLIR] Fix some typos (#874 ) Fix some typos	2022-11-13 18:15:53 -08:00
Philippe Tillet	f40c63fb03	[Triton-MLIR][OPTIMIZER] Cleaned up swizzling (#869 ) Swizzling is no longer implemented as a separate pass. It is instead done in a specialized constructor of SharedEncodingAttr, and tested via google tests instead of triton-opt + filecheck. In the future we may want to implement it as a pass again once we have an additional dialect between TritonGPU and LLVM.	2022-11-10 12:05:46 -08:00
Da Yan	4946167241	[Triton-MLIR] `tt.dot` operands now must have DotOperand layout; also added prefetch pass prototype (#712 ) Co-authored-by: Jokeren <kerenzhou@openai.com> Co-authored-by: Phil Tillet <phil@openai.com> Co-authored-by: Superjomn <yanchunwei@outlook.com>	2022-11-10 05:57:27 +00:00
Philippe Tillet	976cf12af1	[OPTIMIZER] Fixed memory coalescing (#847 )	2022-11-07 06:22:18 -08:00
Keren Zhou	fdd59900f7	[Triton-MLIR] Replace triton.extract_slice with tensor.extract_slice and support more general tensor slicing (#837 ) ## Features - Allow taking a block of tensor slice, as long as each dimension is contiguous (unit stride). - Fix some problems in `insert_slice_async`'s semantics. - More general verification for ops that return shared layout encoding. ## Known Limitations - `insert_slice_async` still uses the old semantics. May submit another PR later to support semantics similar to `tensor.extract_slice`. - No encoding verification for `tensor.extract_slice`. - 3d tensor ops are broken. - Strided accesses are not allowed. - May cause a little performance slowdown since we are passing strides as values but not constants (e.g., int). It would be difficult to pass strides as attributes when we have control flows. A block argument may accept tensors with different strides.	2022-11-06 22:59:03 -08:00
Philippe Tillet	b6dbe959f0	[RUNTIME] Re-vamped cache so users can manually patch IR / ptx / cubin files (#845 ) Also deprecates a couple of tests	2022-11-04 10:57:29 -07:00
Philippe Tillet	cb1b87a688	[FRONTEND] Made test_if/test_default pass (#823 )	2022-10-30 15:32:55 -07:00
Philippe Tillet	e61dc75942	[FRONTEND] Fixed inliner and got more tests to pass (#822 ) This adds a `DialectInlinerInterface` to the Triton dialect. This, along with a few other minor semantic changes, fixes our tests on call instructions. Also added the option to use an "LLVM_SYSPATH" environment variable to link against a locally built LLVM; this was useful for debugging this issue.	2022-10-30 14:10:02 -07:00
Philippe Tillet	7dfab26a39	[FRONTEND][BACKEND] Fixed various bugs (#819 ) - Fixed bugs on layout conversions for int1 data (we should use int8 internally for int1 data to prevent llvm from using vec<i1> which has different semantics) - Fixed semantics of some casts to bool in the frontend	2022-10-29 06:34:14 +00:00
ben-zhang-609	3685194456	[Triton-MLIR][BACKEND] Add elementwise ops and tests (#804 ) Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2022-10-28 05:26:29 +00:00
Keren Zhou	3b80801dff	[Triton-MLIR][Backend] Fix many problems to get the pipeline working (#809 ) 1. Rewrite code generation of insert_slice_async. 2. Correct the wrong index passed to extract_slice in pipeline. 3. Add a prologue in pipeline to wait for dangling cp.asyncs. 4. Move scf to cf conversion inside TritonGPUToLLVM because we need to perform membar before scf to cf. It shouldn't be a technical limitation and could be improved by a more general membar analysis. 5. Use an attribute to memoize the shared memory size and support dynamic shared memory. 6. Prevent the combine pass from reordering insert_slice and extract_slice across async_wait. Co-authored-by: Superjomn <yanchunwei@outlook.com>	2022-10-27 22:09:06 -07:00
Philippe Tillet	3e6cc6d66c	[FRONTEND] Made more tests pass (#805 )	2022-10-26 17:47:33 -07:00
Yan Chunwei	4dc2396ca0	[Triton-MLIR][BACKEND] Support $c from mma layout in dot (#798 ) This PR does 1. Support the case where $c holds an mma layout; this should be useful in the for loop over the k-axis in GEMM. 2. Fix the `unrealized_conversion_cast` in ConvertLayout[shared->dot_op]. Known issues 1. There is some IO conflict in GEMM with a k for loop. It is temporarily solved by [adding a barrier](https://github.com/openai/triton/pull/798/files#diff-8a9a5a7f4a025fb1299af29d190d5626bd9000406d3ea47c49679272d3d6abe9R3028) in dot conversion; we are still working on it and will provide a more generic fix in a following PR. 2. The parallel pass will result in a buggy instruction result type
```mlir
%1049 = llvm.inline_asm has_side_effects asm_dialect = att operand_attrs = [] "cp.async.commit_group ;", "" : () -> !llvm.void
%1050 = builtin.unrealized_conversion_cast %1049 : !llvm.void to !llvm.ptr<f16, 3>
```
So we temporarily disable it.	2022-10-26 10:33:04 +08:00
Philippe Tillet	bb0f9235d1	[OPTIMIZER] Made layout simplification pass efficient for fused attention kernels (#790 )	2022-10-21 16:52:15 -07:00
goostavz	c4726333bf	[Triton-MLIR] Minor fixes related with scf/swizzling support (#791 ) 1, Disable static loop unrolling in the frontend by default; 2, A minor fix in axisAnalysis in order to support scf; 3, A minor fix in TritonGPUToLLVM to support swizzling.	2022-10-21 11:46:28 +08:00
Philippe Tillet	dc0588a898	[OPTIMIZER] Improved layout simplification pass so it handles swizzled layouts better (#789 ) Note: commented out `test_gemm`, since the backend has an issue with swizzling. It will be uncommented in a subsequent PR.	2022-10-20 19:03:37 -07:00
Philippe Tillet	623c99609f	[Triton-IR] Added type inference and verifier for Triton-IR operations (#767 )	2022-10-11 18:16:41 -07:00
Yan Chunwei	555f94f9b9	[triton-mlir][BACKEND] Support masked load/store (#657 ) This PR does - fix some bugs to support masked load/store, - refine frontend, and support the `and` and `or` syntax in mask (by extending the BoolOp in python ast.visitor), e.g. `tl.store(..., mask=offset<n and other_conditions)`, - add `arith.cmpI` and `arith.cmpF` op conversion in backend (required by mask), - add more test cases in vecadd.	2022-10-10 13:29:53 +08:00
goostavz	f9d7f2f126	[Triton-MLIR][Backend] Support ConvertLayout blocked->shared and a few fixes related with mma(#716 )	2022-10-03 19:33:25 +08:00
goostavz	61b61755e5	[Triton-MLIR][Backend] Support layout conversion between mmaLayout and blockedLayout (#693 )	2022-09-27 03:58:47 +00:00
Philippe Tillet	1e91ed30d0	[RUNTIME] Major code cleanup (#711 ) This PR does the following: - CUDA utilities (e.g., cuGetInfo) won't be compiled as part of libtriton.so anymore. - Refactoring driver/llvm.cc to split it between PTX codegen and python. - By extension this will also deprecate include/external so Triton won't have to live with a copy of some CUDA/Hip headers anymore. - `triton-translate` becomes a `triton.tools.aot` Python utility that re-uses functions from the triton.compile sub-module.	2022-09-26 16:38:06 -07:00
Philippe Tillet	22ec22c257	[FRONTEND] Backport new runtime from `master` (#706 ) This PR merges the new runtime back into the `triton-mlir` branch. This adds caching and just-in-time compilation functionality to the triton-mlir project, and paves the way for re-using tests from the master branch.	2022-09-23 16:09:43 -07:00
Shintaro Iwasaki	13669b46a6	[DOCS] Correct spelling (#665 ) This PR corrects spelling like #664 for Triton-MLIR. It should not break anything.	2022-09-16 15:07:34 -07:00
Shintaro Iwasaki	e9e1a4e682	[FRONTEND] Fix the implicit broadcasting rule (#663 ) This PR solves the cast issue that appears in some tutorial code.	2022-09-16 10:49:15 -07:00
Yan Chunwei	a9464f4993	[Backend] Vectorize Load/Store Ops (#86 ) This PR does the following things: - Code refactoring on Load and Store op codegen, rewrite with same logic and share much code - Support the vectorized load/store	2022-09-06 12:28:09 -07:00
Shintaro Iwasaki	3c635449e5	[Triton] Support math and libdevice ops (#91 ) This PR adds basic math ops by using `MathDialect` and `libdevice` ops by using `extern_elementwise`. This is needed to compile some tutorial code (e.g., `softmax`). This PR implements only interface till PTX (so from frontend to TritonGPU-MLIR) - Currently till TritonGPU. It cannot be lowered to PTX now. - No special optimizations (e.g., constant folding etc) are applied. - 14.x does not define folders for many operators for math ops, but 15.x seems to increase its coverage: https://github.com/llvm/llvm-project/blob/llvmorg-15.0.0-rc3/mlir/include/mlir/Dialect/Math/IR/MathOps.td - No constant folding etc for `libdevice` ops.
```py
import triton
import triton.language as tl
import sys


@triton.jit
def add_kernel(
    x_ptr,
    y_ptr,
    BLOCK_SIZE: tl.constexpr,
):
    offsets = tl.arange(0, BLOCK_SIZE)
    x = tl.load(x_ptr + offsets)
    x = tl.sin(x)
    output = tl.libdevice.sin(x)
    output = tl.libdevice.fdiv_rn(output, output)
    output = tl.libdevice.fmaf_rd(output, output, output)
    tl.store(y_ptr + offsets, output)


if __name__ == "__main__" and len(sys.argv) >= 2:
    signature = "fp32,fp32"
    constants = {'BLOCK_SIZE': 1024}
    output = triton.compile(add_kernel, signature, device=0, constants=constants, output="ttgir")
    print(output)
```
->
```llvm
#blocked = #triton_gpu.blocked<{sizePerThread = [1], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>
module attributes {"triton_gpu.num-warps" = 4 : i32} {
  func @add_kernel__Pfp32_Pfp32__2c1024(%arg0: !tt.ptr<f32>, %arg1: !tt.ptr<f32>) {
    %0 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #blocked>
    %1 = tt.splat %arg0 : (!tt.ptr<f32>) -> tensor<1024x!tt.ptr<f32>, #blocked>
    %2 = tt.getelementptr %1, %0 : tensor<1024x!tt.ptr<f32>, #blocked>
    %3 = tt.load %2 {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<1024xf32, #blocked>
    %4 = math.sin %3 : tensor<1024xf32, #blocked>
    %5 = tt.ext_elemwise %4 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_sinf"} : tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
    %6 = tt.ext_elemwise %5, %5 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_fdiv_rn"} : tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
    %7 = tt.ext_elemwise %6, %6, %6 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_fmaf_rd"} : tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
    %8 = tt.splat %arg1 : (!tt.ptr<f32>) -> tensor<1024x!tt.ptr<f32>, #blocked>
    %9 = tt.getelementptr %8, %0 : tensor<1024x!tt.ptr<f32>, #blocked>
    tt.store %9, %7 : tensor<1024xf32, #blocked>
    return
  }
}
```
	2022-09-01 16:34:27 -07:00
goostavz	bedbf221c0	[BACKEND] Support optional mask in TritonGPUToLLVM (#80 ) Co-authored-by: gzhu <gzhu@nvidia.com>	2022-08-24 17:51:37 -07:00
Yan Chunwei	10ba51c3bb	[FRONTEND] add python e2e launch empty kernel test (#68 )	2022-08-19 10:46:01 -07:00
Philippe Tillet	192be76b3c	[OPTIMIZER] Rewrite patterns for layout conversions (#64 )	2022-08-18 12:49:37 -07:00
Yan Chunwei	b1673caaf6	[FRONTEND] Expose end-to-end compile to python frontend (#58 )	2022-08-17 10:42:48 -07:00
Philippe Tillet	25357083e6	[CI] Added basic CI skeletons (#23 ) Includes minor fixes to make things compile and pass static checks properly	2022-07-26 14:16:30 -07:00
Philippe Tillet	3265e0df5a	[PYTHON] Cleaned up legacy code; added simple standalone compilation API (#22 )	2022-07-26 11:06:45 -07:00

44 Commits
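
The index problem described in commit 6c5f646f4e (#873) can be illustrated with a small sketch. It is not Triton code: the XOR-based `swizzle` function, the `TILES_PER_SLICE` constant, and the helper names are assumptions made for illustration, since the log does not give the actual swizzling function. Only the numbers (tile T3, slice 1, relative index 1, phase 2) come from the commit message.

```python
# Hypothetical model of the issue in #873; the XOR swizzle is an assumption,
# chosen only because it reproduces the example given in the commit message.
TILES_PER_SLICE = 2  # each slice holds two tiles, as in the commit message


def swizzle(tile_index: int, phase: int) -> int:
    """Assumed swizzle: permute a tile index by XOR-ing it with the phase id."""
    return tile_index ^ phase


def absolute_index(slice_id: int, relative_index: int) -> int:
    """Offset of a tile from the start of the whole shared memory tensor."""
    return slice_id * TILES_PER_SLICE + relative_index


# Tile T3 lives in slice 1 at relative index 1, and is swizzled on phase 2.
slice_id, relative_index, phase = 1, 1, 2

# Swizzling the slice-relative index (what the commit describes as the bug).
from_relative = swizzle(relative_index, phase)

# Swizzling the index relative to the start of the tensor (the intended input).
t3 = absolute_index(slice_id, relative_index)
from_absolute = swizzle(t3, phase)

print(f"T3 absolute index: {t3}")
print(f"swizzled from relative index: {from_relative}")
print(f"swizzled from absolute index: {from_absolute}")
print(f"slice after swizzling: {from_absolute // TILES_PER_SLICE}")
```

Under this assumed swizzle the two inputs give different positions, and the absolute index places T3 in slice 0, which matches the behaviour the commit message reports. The real layout function in Triton may differ.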