triton

Author	SHA1	Message	Date
Philippe Tillet	2aa538ec2e	[BACKEND] Added support for mma layouts in reductions (#863 ) Validated hackily by manually modifying the reduction .ttgir in my local cache. There will be a follow-up PR adding some better testing infrastructure to test out conversions and reductions on arbitrary layouts.	2022-11-10 09:58:07 -08:00
Chenggang Zhao	57fd1864a7	[Triton-MLIR] Support FP8 (#864 ) Co-authored-by: Superjomn <yanchunwei@outlook.com>	2022-11-10 15:53:06 +08:00
Da Yan	4946167241	[Triton-MLIR] `tt.dot` operands now must have DotOperand layout; also added prefetch pass prototype (#712 ) Co-authored-by: Jokeren <kerenzhou@openai.com> Co-authored-by: Phil Tillet <phil@openai.com> Co-authored-by: Superjomn <yanchunwei@outlook.com>	2022-11-10 05:57:27 +00:00
Yan Chunwei	0c87360657	[Triton-MLIR][Backend] Port FMADot conversion for DotOp (#844 ) Co-authored-by: ben-zhang-609 <benzh609@gmail.com>	2022-11-09 12:57:50 +08:00
Yan Chunwei	de5b84c476	[Triton-MLIR][Backend] Fix mma<v2> int8 precision error (#850 ) Fix mma.16816 s8 precision error Co-authored-by: ben-zhang-609 <benzh609@gmail.com>	2022-11-09 12:23:43 +08:00
Da Yan	137344946f	[OPTIMIZER] Fix the load-mask issue with the pipeline pass (#857 )	2022-11-08 09:29:53 -08:00
Keren Zhou	fdd59900f7	[Triton-MLIR] Replace triton.extract_slice with tensor.extract_slice and support more general tensor slicing (#837 ) ## Features - Allow taking a block of tensor slice, as long as each dimension is contiguous (unit stride). - Fix some problems in `insert_slice_async`'s semantic. - More general verification for ops that return shared layout encoding. ## Known Limitations - `insert_slice_async` still uses the old semantic. May submit another PR later to support similar semantic like `tensor.extract_slice`. - No encoding verification for `tensor.extract_slice`. - 3d tensor ops are broken. - Strided accesses are not allowed. - May cause a little performance slowdown since we are passing strides as values but not constants (e.g., int). It would be difficult to pass strides as attributes when we have control flows. A block argument is possible to accept tensors with different strides.	2022-11-06 22:59:03 -08:00
Philippe Tillet	a4ff0c362c	[FRONTEND] Fix issues with atomics (#849 )	2022-11-06 20:52:11 -08:00
Philippe Tillet	91a9773b38	[OPTIMIZER] Minor bugfixes that affected matmul codegen performance (#834 )	2022-11-02 22:58:09 -07:00
Philippe Tillet	12d60cb4a3	[BACKEND] Added support for 1D conversion blocked -> slice (#831 )	2022-11-01 13:19:58 -07:00
Yan Chunwei	031c2ae77b	[Triton-MLIR][BACKEND] Port the mma<v1> conversion (#815 ) This PR does - port the mma<v1> related code, and support dot conversion and convert_layout[shared->dot_op<mma<v1>>] - add a lit test for dot v1	2022-11-01 09:42:14 +08:00
Philippe Tillet	e61dc75942	[FRONTEND] Fixed inliner and got more tests to pass (#822 ) This adds a `DialectInlinerInterface` to the Triton dialect. This, along with a few other minor semantic changes, fixes our tests on call instructions. Also added the option to provide use an "LLVM_SYSPATH" environment variable to link against locally build of LLVM; this was useful for debugging this issue.	2022-10-30 14:10:02 -07:00
Ian Bearman	f2106d0aa2	[BUILD] Fix Warnings and Enable Warnings as Errors (#794 )	2022-10-28 12:36:09 -07:00
Philippe Tillet	ac0f6793cc	[BACKEND] Added support for scalars in LoadOp / StoreOp / ElementwiseOp (#814 ) Also fixed various errors that showed up in `test_core.py`, and added more TODOs for open (hopefully relatively minor) issues	2022-10-28 16:17:55 +08:00
Keren Zhou	3b80801dff	[Triton-MLIR][Backend] Fix many problems to get the pipeline working (#809 ) 1. Rewrite code generation of insert_slice_async. 2. Correct the wrong index passed to extract_slice in pipeline. 3. Add a prologue in pipeline to wait for dangling cp.asyncs. 4. Move scf to cf conversion inside TritonGPUToLLVM because we need to perform membar before scf to cf. It shouldn't be a technical limitation and could be improved by a more general membar analysis. 5. Use an attribute to memoize the shared memory size and support dynamic shared memory. 6. Prevent the combine pass to reorder insert_slice and extract_slice across async_wait Co-authored-by: Superjomn <yanchunwei@outlook.com>	2022-10-27 22:09:06 -07:00
Qingyi Liu	42db3538e4	[Triton-MLIR][Backend] Add ReduceOpConversion into TritonGPUToLLVM conversion (#774 ) What is done in this PR: - [x] Add `ConvertLayout`, `getSizePerThread` and `getShapePerCTA` implementation for `SliceEncodingAttr` - [x] Split `emitIndices` into two phases: `emitBaseIndexForBlockedLayout` and `emitOffsetForBlockedLayout` - [x] Add `ReduceOpConversion::matchAndRewriteBasic` implementation - [x] Add `ReduceOpConversion::matchAndRewriteFast` implementation with ptx instruction `shfl.sync` - [x] Add support for scalar value in `StoreOpConversion` - [x] Add Reduce1d and Reduce2d unit tests and pass all unit tests Co-authored-by: Qingyi Liu <liuqingyi1993@gmail.com>	2022-10-28 11:07:45 +08:00
Philippe Tillet	3e6cc6d66c	[FRONTEND] Made more tests pass (#805 )	2022-10-26 17:47:33 -07:00
Yan Chunwei	877844de4f	[Triton-MLIR][BACKEND] add convert_layout[shared->dot_op] converstion to adapt DotOperand layout (#786 ) This PR helps to 1. Adapt the existing DotOp conversion to the design of the new DotOperand layout, 2. Making the DotOp conversion work with both shared-layout inputs case and dotoperand-layout inputs case for further upstream switch.	2022-10-24 11:40:13 +08:00
Philippe Tillet	bb0f9235d1	[OPTIMIZER] Made layout simplification pass efficient for fused attention kernels (#790 )	2022-10-21 16:52:15 -07:00
Philippe Tillet	dc0588a898	[OPTIMIZER] Improved layout simplification pass so it handles swizzled layouts better (#789 ) Note: uncommented `test_gemm`, since backend has an issue with swizzling. This will get uncommented in a subsequent PR.	2022-10-20 19:03:37 -07:00
Shintaro Iwasaki	0d22d2bc03	[TritonMLIR] Disallow 0D tensor (#788 )	2022-10-19 10:34:32 -07:00
Yan Chunwei	4464646efb	[Triton-MLIR][BACKEND] Fix masked load store op vector size (#785 ) Correct the Load/Store Op's vector size with the mask's alignment correctly considered. Some cases: ```mlir // num_warp = 2 // block_size = 128 func @vecadd_mask_align_16(%a_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %b_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %out_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %n_elements: i32 {tt.divisibility = 16 : i32}) { // mask = make_range(128) < n_element } ``` This should get the vec=2 `ld`/`st` instructions. While the following example ```mlir // num_warp = 2 // block_size = 128 func @vecadd_mask_align_16(%a_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %b_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %out_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %n_elements: i32) { // mask = make_range(128) < n_element } ``` it should get the vec=1 `ld`/`st` instructions.	2022-10-18 11:43:50 +08:00
Philippe Tillet	38a80664b5	[OPTIMIZER] Updated TritonGPU-combine pass (#784 ) WIP but should work int t…he cases we need so far	2022-10-16 21:19:42 -07:00
goostavz	e948a618b3	[Triton-MLIR] fix a tiny bug in coalesce pass (#782 )	2022-10-16 20:29:55 -07:00
Shintaro Iwasaki	5898352f97	[Triton-IR] Fix LoadOp definition (#771 ) (#777 )	2022-10-13 18:53:00 -07:00
Philippe Tillet	623c99609f	[Triton-IR] Added type inference and verifier for Triton-IR operations (#767 )	2022-10-11 18:16:41 -07:00
Philippe Tillet	b6e5a231e5	[OPTIMIZER] Added swizzling pass (#758 )	2022-10-10 01:12:37 -07:00
Philippe Tillet	498c685b46	[OPTIMIZER] layout simplification: ignore non-tensor iter arguments in for loop rematerialization (#749 )	2022-10-07 21:52:29 -07:00
Keren Zhou	289ff293cc	[Triton-MLIR] Generate LLVM/PTX code for async ops (#735 )	2022-10-04 09:37:00 -07:00
goostavz	f9d7f2f126	[Triton-MLIR][Backend] Support ConvertLayout blocked->shared and a few fixes related with mma(#716 )	2022-10-03 19:33:25 +08:00
Philippe Tillet	9ddf0921fb	[OPTIMIZER] Added `DotOp` to the list of expensive ops we don't want to rematerialize. (#718 )	2022-09-27 09:05:49 -07:00
Yan Chunwei	3a84278530	[Triton-MLIR][BACKEND] Refine dot conversion (#710 ) This PR does 1. Refine the dot conversion 2. some other tiny code refinement	2022-09-27 14:38:34 +08:00
goostavz	61b61755e5	[Triton-MLIR][Backend] Support layout conversion between mmaLayout and blockedLayout (#693 )	2022-09-27 03:58:47 +00:00
Keren Zhou	ecd1bc33df	[Triton-MLIR] Keren/code gen for extract slice and alloc tensor (#692 ) Co-authored-by: gzhu <goostavz@outlook.com>	2022-09-23 19:38:14 +00:00
Yan Chunwei	922155f1d2	[BACKEND] add dot conversion (mma version=2) (#672 ) LLVM Conversion for Dot op. Due to the lack of `convert_layout`, currently, the dot only supports the following combination of operands - `$a` in shared layout - `$b` in shared layout - `$c` in MMA layout(but only Splat-like, leaving the generic cases to `convert_layout`) This PR focus on `mma.16816` related logic support, leaving the other cases to the following PR. Co-authored-by: Philippe Tillet <phil@openai.com>	2022-09-22 20:43:54 -07:00
Shintaro Iwasaki	940ef3f0ac	[BACKEND] llvm::dyn_cast -> llvm::dyn_cast_or_null (#689 )	2022-09-22 03:26:40 +00:00
goostavz	15bfd0cb79	[BACKEND] Support of ConvertLayoutOp from blocked to blocked and SliceLayout with blocked parent (#658 )	2022-09-17 14:58:42 -07:00
Shintaro Iwasaki	13669b46a6	[DOCS] Correct spelling (#665 ) This PR corrects spelling like #664 for Triton-MLIR. It should not break anything.	2022-09-16 15:07:34 -07:00
Shintaro Iwasaki	43be75ad42	[FRONTEND] Add scalar type support for some ops (#661 ) This PR adds basic support for scalar-type inputs to some ops (cast and pointer arithmetics) for Triton-MLIR. Also renames getelementptr -> addptr	2022-09-15 16:12:52 -07:00
Da Yan	2e08450c80	[OPTIMIZER] Better pipeline tests (#660 )	2022-09-14 23:26:40 -07:00
Keren Zhou	16aed94ff5	[Analysis/Allocation] Allocation passes now assumes that slices always alias (#108 ) This code in this branch assumes the `src` operand in `insert_slice_async` always aliases the result, which shouldn't hold for generally cases but is just a workaround to make the pipeline pass work. I'm also working on the complete analysis in another [branch](https://github.com/openai/triton-mlir/tree/keren/analyze-slice).	2022-09-09 12:03:41 -07:00
Philippe Tillet	9bd5a3dcd2	[OPTIMIZER] Pipeline async buffer (#110 )	2022-09-09 11:01:14 -07:00
Yan Chunwei	a9464f4993	[Backend] Vectorize Load/Store Ops (#86 ) This PR does the following things: - Code refactoring on Load and Store op codegen, rewrite with same logic and share much code - Support the vectorized load/store	2022-09-06 12:28:09 -07:00
Da Yan	35e346bcff	[OPTIMIZER] Better pipeline pass (#100 ) * Use `insert_slice_async` instead of `CopyAsync` * Move async.wait to loop header Co-authored-by: Jokeren <kerenzhou@openai.com>	2022-09-06 08:31:13 -07:00
Philippe Tillet	a0bab9748e	[OPTIMIZER] Coalesce pass no longer takes a `num-warps` argument (#99 ) Improved design to avoid inconsistent `num-warps` value between the pass and the parent module of the operation it processes.	2022-09-05 18:09:02 -07:00
Philippe Tillet	d0b4c67b05	[OPTIMIZER] Improved layout conversion simplification algorithm (#97 ) This PR both simplifies the layout conversion simplification algorithm, and also improves it to make it work with vectorized element-wise ops. The conversion optimizer still has a lot of room for improvements, and other PRs will address its limitations (ideally via some sort of explicit cost model)	2022-09-02 16:52:44 -07:00
Shintaro Iwasaki	3c635449e5	[Triton] Support math and libdevice ops (#91 ) This PR adds basic math ops by using `MathDialect` and `libdevice` ops by using `extern_elementwise`. This is needed to compile some tutorial code (e.g., `softmax`). This PR implements only interface till PTX (so from frontend to TritonGPU-MLIR) - Currently till TritonGPU. It cannot be lowered to PTX now. - No special optimizations (e.g., constant folding etc) are applied. - 14.x does not define folders for many operators for math ops, but 15.x seems to increase its coverage: https://github.com/llvm/llvm-project/blob/llvmorg-15.0.0-rc3/mlir/include/mlir/Dialect/Math/IR/MathOps.td - No constant folding etc for `libdevice` ops. ```py import triton import triton.language as tl import sys @triton.jit def add_kernel( x_ptr, y_ptr, BLOCK_SIZE: tl.constexpr, ): offsets = tl.arange(0, BLOCK_SIZE) x = tl.load(x_ptr + offsets) x = tl.sin(x) output = tl.libdevice.sin(x) output = tl.libdevice.fdiv_rn(output, output) output = tl.libdevice.fmaf_rd(output, output, output) tl.store(y_ptr + offsets, output) if __name__ == "__main__" and len(sys.argv) >= 2: signature = "fp32,fp32" constants = {'BLOCK_SIZE': 1024} output = triton.compile(add_kernel, signature, device=0, constants=constants, output="ttgir") print(output) ``` -> ```llvm #blocked = #triton_gpu.blocked<{sizePerThread = [1], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}> module attributes {"triton_gpu.num-warps" = 4 : i32} { func @add_kernel__Pfp32_Pfp32__2c1024(%arg0: !tt.ptr<f32>, %arg1: !tt.ptr<f32>) { %0 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #blocked> %1 = tt.splat %arg0 : (!tt.ptr<f32>) -> tensor<1024x!tt.ptr<f32>, #blocked> %2 = tt.getelementptr %1, %0 : tensor<1024x!tt.ptr<f32>, #blocked> %3 = tt.load %2 {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<1024xf32, #blocked> %4 = math.sin %3 : tensor<1024xf32, #blocked> %5 = tt.ext_elemwise %4 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_sinf"} : tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked> %6 = tt.ext_elemwise %5, %5 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_fdiv_rn"} : tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked> %7 = tt.ext_elemwise %6, %6, %6 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_fmaf_rd"} : tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked> %8 = tt.splat %arg1 : (!tt.ptr<f32>) -> tensor<1024x!tt.ptr<f32>, #blocked> %9 = tt.getelementptr %8, %0 : tensor<1024x!tt.ptr<f32>, #blocked> tt.store %9, %7 : tensor<1024xf32, #blocked> return } } ```	2022-09-01 16:34:27 -07:00
Keren Zhou	328b87aec6	Keren/tensor slice insert alloc (#94 ) This branch defines three new triton_gpu operations to partially solve #87. Below is an overview: ``` %tensor = triton_gpu.alloc_tensor : tensor<2x16x16xf16, #A> %b = triton_gpu.insert_slice_async %a_ptr, %tensor, %offset {axis = 0 : i32, cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<16x16x!tt.ptr<f16>, #AL> -> tensor<2x16x16xf16, #A> %c = triton_gpu.extract_slice %b, %offset {axis = 0 : i32} : tensor<2x16x16xf16, #A> -> tensor<16x16xf16, #A> ``` We plan to fully replace `copy_async` with `insert_slice_async`. This hasn't been done yet.	2022-09-01 12:37:17 -07:00
Shintaro Iwasaki	84aa7d025a	[TritonIR] simplify Load/StoreOps when mask is true/false (#79 ) * [TritonIR] fix Load/Store/CopyAsyncOp's parsers * [TritonIR] simplify Load/StoreOps when mask is true/false * [TEST] adds tests to check load/store simplification	2022-08-24 12:55:49 -07:00
Shintaro Iwasaki	0ebef11c77	[TritonIR] Make mask operand optional (#74 )	2022-08-22 22:00:17 -07:00

1 2 3

105 Commits