Commit Graph

758 Commits

Author SHA1 Message Date
Philippe Tillet
ac0f6793cc [BACKEND] Added support for scalars in LoadOp / StoreOp / ElementwiseOp (#814)
Also fixed various errors that showed up in `test_core.py`, and added more TODOs for open (hopefully relatively minor) issues
2022-10-28 16:17:55 +08:00
ben-zhang-609
3685194456 [Triton-MLIR][BACKEND] Add elementwise ops and tests (#804)
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2022-10-28 05:26:29 +00:00
Keren Zhou
3b80801dff [Triton-MLIR][Backend] Fix many problems to get the pipeline working (#809)
1. Rewrite code generation of insert_slice_async.
2. Correct the wrong index passed to extract_slice in pipeline.
3. Add a prologue in the pipeline to wait for dangling cp.asyncs (sketched after this entry).
4. Move the scf-to-cf conversion inside TritonGPUToLLVM because we need to
perform membar before scf-to-cf. This shouldn't be a fundamental limitation
and could be improved by a more general membar analysis.
5. Use an attribute to memoize the shared memory size and support
dynamic shared memory.
6. Prevent the combine pass from reordering insert_slice and extract_slice
across async_wait.

Co-authored-by: Superjomn <yanchunwei@outlook.com>
2022-10-27 22:09:06 -07:00
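For illustration of item 3 above, a minimal CUDA sketch of what such a prologue amounts to: drain every outstanding `cp.async` group, then barrier. The helper name is hypothetical, and this is not the pass's emitted IR.

```c++
// Sketch: wait until all previously committed cp.async groups complete,
// then synchronize the block so every thread observes the copied data.
// Requires sm_80+ for the cp.async PTX instructions.
__device__ void drainPendingAsyncCopies() {
  asm volatile("cp.async.wait_group 0;\n" ::);
  __syncthreads();
}
```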
Qingyi Liu
42db3538e4 [Triton-MLIR][Backend] Add ReduceOpConversion into TritonGPUToLLVM conversion (#774)
What is done in this PR:
- [x] Add `ConvertLayout`, `getSizePerThread` and `getShapePerCTA`
implementation for `SliceEncodingAttr`
- [x] Split `emitIndices` into two phases:
`emitBaseIndexForBlockedLayout` and `emitOffsetForBlockedLayout`
- [x] Add `ReduceOpConversion::matchAndRewriteBasic` implementation
- [x] Add `ReduceOpConversion::matchAndRewriteFast` implementation with the
PTX instruction `shfl.sync` (see the sketch after this entry)
- [x] Add support for scalar value in `StoreOpConversion`
- [x] Add Reduce1d and Reduce2d unit tests and pass all unit tests

Co-authored-by: Qingyi Liu <liuqingyi1993@gmail.com>
2022-10-28 11:07:45 +08:00
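For reference on the fast path above, a minimal CUDA sketch of a warp-level sum reduction built on shuffle intrinsics, which lower to `shfl.sync` PTX instructions. This illustrates the technique only; it is not the PR's generated code.

```c++
// Butterfly (XOR) reduction across the 32 lanes of a warp; every lane
// ends up holding the full warp sum after log2(32) = 5 exchange steps.
__device__ float warpReduceSum(float val) {
  for (int offset = 16; offset > 0; offset >>= 1)
    val += __shfl_xor_sync(0xffffffffu, val, offset);
  return val;
}

// Launch with a single 32-thread warp for a 1D reduction of 32 elements.
__global__ void reduce1d(const float *in, float *out) {
  float sum = warpReduceSum(in[threadIdx.x]);
  if (threadIdx.x == 0)
    *out = sum;
}
```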
Philippe Tillet
3e6cc6d66c [FRONTEND] Made more tests pass (#805) 2022-10-26 17:47:33 -07:00
goostavz
bb7008651a [Backend] Hacky fix of missing barrier in ConvertLayout blocked->shared (#803)
The barrier should be set by a separate pass, but it seems that pass may have some bugs.
2022-10-26 13:39:38 -07:00
Yan Chunwei
4dc2396ca0 [Triton-MLIR][BACKEND] Support $c from mma layout in dot (#798)
This PR does

1. Support the case where $c holds an mma layout; this should be useful
for the k-axis for-loop in GEMM.
2. Fix the `unrealized_conversion_cast` in ConvertLayout[shared->dot_op].

Known issue

1. There is some IO conflict in GEMM with a k-for-loop. It is temporarily
worked around by [adding a
barrier](https://github.com/openai/triton/pull/798/files#diff-8a9a5a7f4a025fb1299af29d190d5626bd9000406d3ea47c49679272d3d6abe9R3028)
in the dot conversion; we are still working on a more generic fix, which
will land in a following PR.
2. The parallel pass results in a buggy instruction result type:
```mlir
%1049 = llvm.inline_asm has_side_effects asm_dialect = att operand_attrs = [] "cp.async.commit_group ;", ""  : () -> !llvm.void
%1050 = builtin.unrealized_conversion_cast %1049 : !llvm.void to !llvm.ptr<f16, 3>
```
So we temporarily disable it.
2022-10-26 10:33:04 +08:00
Philippe Tillet
a2cbe7af91 [FRONTEND] Enhanced support for binary operators (#801)
Disabled the modulo test (due to a change in the behavior of `frem` in nvptx
between llvm-11 and llvm-14) and bfloat16 (which will require some work to
emulate in software, similar to how it's done in `master`).
2022-10-24 19:47:01 -07:00
Philippe Tillet
fcb228d1d4 Merge select commits from master branch into triton-mlir (#799)
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: vesuppi <zt9465@gmail.com>
Co-authored-by: Jason Ansel <jansel@jansel.net>
Co-authored-by: daadaada <dyanab@connect.ust.hk>
Co-authored-by: Anton Kostin <masguit42@users.noreply.github.com>
Co-authored-by: Yunxing Dai <nov503@gmail.com>
Co-authored-by: Shintaro Iwasaki <shintaro.iwasaki.work@gmail.com>
2022-10-24 14:52:37 -07:00
Yan Chunwei
877844de4f [Triton-MLIR][BACKEND] add convert_layout[shared->dot_op] conversion to adapt DotOperand layout (#786)
This PR helps to

1. Adapt the existing DotOp conversion to the design of the new
DotOperand layout,
2. Make the DotOp conversion work with both the shared-layout inputs case
and the dot-operand-layout inputs case, in preparation for the upstream switch.
2022-10-24 11:40:13 +08:00
Philippe Tillet
3aa8296b06 [BUILD] Download pybind11 in setup.py (#703) (#797)
Cherry-picks #703 and resolves conflicts

Co-authored-by: Shintaro Iwasaki <siwasaki@fb.com>
2022-10-23 18:52:48 -07:00
Yan Chunwei
1bf59d315c [Triton-MLIR][FRONTEND] Remove the dangling check-triton call in setup.py (#796) 2022-10-23 18:26:18 -07:00
Philippe Tillet
bb0f9235d1 [OPTIMIZER] Made layout simplification pass efficient for fused attention kernels (#790) 2022-10-21 16:52:15 -07:00
goostavz
c4726333bf [Triton-MLIR] Minor fixes related with scf/swizzling support (#791)
1. Disable static loop unrolling in the frontend by default;
2. A minor fix in axisAnalysis in order to support scf;
3. A minor fix in TritonGPUToLLVM to support swizzling.
2022-10-21 11:46:28 +08:00
Philippe Tillet
dc0588a898 [OPTIMIZER] Improved layout simplification pass so it handles swizzled layouts better (#789)
Note: commented out `test_gemm`, since the backend has an issue with swizzling. It will be uncommented in a subsequent PR.
2022-10-20 19:03:37 -07:00
Shintaro Iwasaki
0d22d2bc03 [TritonMLIR] Disallow 0D tensor (#788) 2022-10-19 10:34:32 -07:00
Yan Chunwei
4464646efb [Triton-MLIR][BACKEND] Fix masked load store op vector size (#785)
Correct the Load/Store op's vector size so that the mask's alignment is
properly taken into account.

Some cases:

```mlir
// num_warp = 2
// block_size = 128
func @vecadd_mask_align_16(%a_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %b_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, 
  %out_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %n_elements: i32 {tt.divisibility = 16 : i32}) {
    // mask = make_range(128) < n_element
}
```
This should get vec=2 `ld`/`st` instructions.

The following example, by contrast,

```mlir
// num_warp = 2
// block_size = 128
func @vecadd_mask_align_16(%a_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %b_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, 
  %out_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %n_elements: i32) {
    // mask = make_range(128) < n_element
}
```
should get vec=1 `ld`/`st` instructions, because `%n_elements` no longer carries a divisibility hint (see the sketch after this entry).
2022-10-18 11:43:50 +08:00
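Putting the two examples side by side, the only difference is the divisibility hint on `%n_elements`, which determines the mask's alignment. A small C++ sketch of the rule this suggests, with hypothetical helper and parameter names (the actual axis-info analysis is more involved):

```c++
#include <algorithm>

// Sketch: the vector width of a masked load/store is bounded by the
// pointer's contiguity, the mask's alignment, and the number of
// contiguous elements each thread owns.
unsigned maskedVecWidth(unsigned ptrContiguity,  // from tt.divisibility on the pointers
                        unsigned maskAlignment,  // from tt.divisibility on %n_elements
                        unsigned sizePerThread)  // 128 elements / 64 threads = 2 here
{
  return std::min({ptrContiguity, maskAlignment, sizePerThread});
}

// First example:  min(4, 16, 2) = 2  -> vec=2 ld/st
// Second example: min(4, 1, 2)  = 1  -> vec=1 ld/st
```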
Philippe Tillet
38a80664b5 [OPTIMIZER] Updated TritonGPU-combine pass (#784)
WIP but should work in the cases we need so far
2022-10-16 21:19:42 -07:00
goostavz
e948a618b3 [Triton-MLIR] fix a tiny bug in coalesce pass (#782) 2022-10-16 20:29:55 -07:00
Shintaro Iwasaki
5898352f97 [Triton-IR] Fix LoadOp definition (#771) (#777) 2022-10-13 18:53:00 -07:00
Da Yan
963d031247 [Triton-IR] Fix LoadOp Triton->TritonGPU conversion (#775) 2022-10-13 12:57:39 -07:00
Yan Chunwei
1baa4e125f [triton-mlir][BACKEND] decouple loading from mma codegen in dot conversion (#764)
This PR decouples the operand loading from the mma codegen to make it
ready for the ongoing `DotOperandEncodingAttr` migration.
The existing DotOp conversion is composed of the following two
procedures:
1. Loading the $a, $b and $c operands from smem to registers.
2. Conducting the MMA instruction codegen.

In the latest design, the 1st stage should be part of the
`convert_layout(shared_layout) -> dot_operand_layout` conversion, which is
why the decoupling is necessary.

In more detail, this PR introduces an `MMA16816ConversionHelper` class with
`loadA`, `loadB` and `loadC` methods that help load $a, $b and $c from
smem to registers. Both `loadA` and `loadB` return an
`LLVM::Struct` that should be compatible with the new
`DotOperandEncodingAttr` conversion.

The conversion layout for $a and $b is as follows:
```c++
// The layout is a list of Value with coordinate of (i,j), the order is as
// the follows:
// [
//  (0,0), (0,1), (1,0), (1,1), # i=0, j=0
//  (0,2), (0,3), (1,2), (1,3), # i=0, j=1
//  (0,4), (0,5), (1,4), (1,5), # i=0, j=2
//  ...
//  (2,0), (2,1), (3,0), (3,1), # i=1, j=0
//  (2,2), (2,3), (3,2), (3,3), # i=1, j=1
//  (2,4), (2,5), (3,4), (3,5), # i=1, j=2
//  ...
// ]
// i \in [0, n0) and j \in [0, n1)
```
The `convertDot` method takes the loaded $a, $b, $c ($a and $b are of
type `LLVM::Struct` while $c is a scalar Value), extracts the elements
from the `LLVM::Struct` following the layout above, and passes the elements
to the MMA inline asm (see the sketch after this entry).
2022-10-12 10:45:17 +08:00
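One way to read the coordinate ordering documented above, as a C++ sketch (an illustration, not the PR's code): each (i, j) step covers a 2x2 block of rows {2i, 2i+1} and columns {2j, 2j+1}, with j varying faster than i.

```c++
#include <utility>
#include <vector>

// Enumerate (row, col) coordinates in the order the layout comment lists,
// for i in [0, n0) and j in [0, n1).
std::vector<std::pair<int, int>> dotOperandCoords(int n0, int n1) {
  std::vector<std::pair<int, int>> coords;
  for (int i = 0; i < n0; ++i)
    for (int j = 0; j < n1; ++j)
      for (int ii = 0; ii < 2; ++ii)   // rows 2i, 2i+1
        for (int jj = 0; jj < 2; ++jj) // cols 2j, 2j+1
          coords.emplace_back(2 * i + ii, 2 * j + jj);
  return coords; // (0,0), (0,1), (1,0), (1,1), (0,2), (0,3), (1,2), (1,3), ...
}
```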
Philippe Tillet
623c99609f [Triton-IR] Added type inference and verifier for Triton-IR operations (#767) 2022-10-11 18:16:41 -07:00
Philippe Tillet
b6e5a231e5 [OPTIMIZER] Added swizzling pass (#758) 2022-10-10 01:12:37 -07:00
Yan Chunwei
555f94f9b9 [triton-mlir][BACKEND] Support masked load/store (#657)
This PR does

- fix some bugs to support masked load/store,
- refine the frontend to support the `and` and `or` syntax in masks (by
extending BoolOp handling in the Python AST visitor), e.g. `tl.store(...,
mask=offset<n and other_conditions)`,
- add `arith.cmpI` and `arith.cmpF` op conversions in the backend (required
by masks),
- add more test cases in vecadd.
2022-10-10 13:29:53 +08:00
Ian Bearman
ccc5ab6ac9 [BUILD] When set, use MLIR_DIR for finding both MLIR and LLVM (#755) 2022-10-09 13:11:20 -07:00
Ian Bearman
89f6e1db5e [BUILD] use cmake to set include path when build isn't triggered by setup.py (#754) 2022-10-09 12:30:44 -07:00
Ian Bearman
863578a7fa [BUILD] Enable current-dir inclusion (#753)
This change enables `CMAKE_INCLUDE_CURRENT_DIR` when building Triton.
2022-10-09 18:09:49 +00:00
Ian Bearman
448d14a598 [BUILD] Add TRITON Prefix to build variables (#752) 2022-10-09 10:55:17 -07:00
goostavz
1d772cd843 [Triton-MLIR][Backend] Add SCF lowering in the backend (#750) 2022-10-08 18:36:37 +08:00
Philippe Tillet
498c685b46 [OPTIMIZER] layout simplification: ignore non-tensor iter arguments in for loop rematerialization (#749) 2022-10-07 21:52:29 -07:00
goostavz
e843257295 [Backend] Fix a bug in emitIndicesForBlocked (#740) 2022-10-04 21:29:59 -07:00
Keren Zhou
289ff293cc [Triton-MLIR] Generate LLVM/PTX code for async ops (#735) 2022-10-04 09:37:00 -07:00
goostavz
f9d7f2f126 [Triton-MLIR][Backend] Support ConvertLayout blocked->shared and a few fixes related to mma (#716) 2022-10-03 19:33:25 +08:00
Keren Zhou
baba98ad69 [Triton-MLIR] Fix threadsPerWarp derivation in BlockedEncodingAttr (#722)
Example:

```
    auto encoding = triton::gpu::BlockedEncodingAttr::get(
        &getContext(), {8, 32}, {2, 2}, {1, 0}, 2);
     //shape = [32 x 8], order = [1, 0], sizePerThread=[2, 2], numWarps=2
```

Expected output:

```
      //#triton_gpu.blocked_layout<{
      //  sizePerThread = {2, 2}
      //  threadsPerWarp = {8, 4}
      //  warpsPerCTA = {2, 1}
      //}>
```

Incorrect output from the current branch:

```
      //#triton_gpu.blocked_layout<{
      //  sizePerThread = {2, 2}
      //  threadsPerWarp = {16, 2}
      //  warpsPerCTA = {2, 1}
      //}>
```
2022-09-27 16:41:30 -07:00
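A C++ sketch of the derivation the expected output implies, assuming the 32 threads of a warp are distributed over the dimensions in `order` (fastest-varying first), each dimension taking at most `shape[d] / sizePerThread[d]` threads. The helper is hypothetical, not the attribute's actual code.

```c++
#include <algorithm>
#include <array>

// For shape = [32, 8], sizePerThread = [2, 2], order = [1, 0]:
// dim 1 needs 8 / 2 = 4 threads; dim 0 takes the remaining 32 / 4 = 8,
// giving threadsPerWarp = [8, 4], the expected output above.
std::array<int, 2> deriveThreadsPerWarp(std::array<int, 2> shape,
                                        std::array<int, 2> sizePerThread,
                                        std::array<int, 2> order,
                                        int warpSize = 32) {
  std::array<int, 2> threadsPerWarp = {1, 1};
  int remaining = warpSize;
  for (int d : order) { // fastest-varying dimension first
    int needed = shape[d] / sizePerThread[d];
    threadsPerWarp[d] = std::min(needed, remaining);
    remaining /= threadsPerWarp[d];
  }
  threadsPerWarp[order[1]] *= remaining; // leftovers go to the slowest dim
  return threadsPerWarp;
}
```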
Philippe Tillet
9ddf0921fb [OPTIMIZER] Added DotOp to the list of expensive ops we don't want to rematerialize. (#718) 2022-09-27 09:05:49 -07:00
Yan Chunwei
df8d276089 [Triton-MLIR][Backend] Fix smem base bug in dot codegen (#715)
Get the SMEM base address of an input operand from `adapter.arg()` instead
of `getSharedMemoryBase(arg, ...)`, since the latter does not work with
memory aliasing. For example:

```llvm
%a = extract_slice %b, %offset
%c = dot %a, %d
```

`%a` should have a different smem base address from `%b`.
2022-09-27 17:28:17 +08:00
Yan Chunwei
3a84278530 [Triton-MLIR][BACKEND] Refine dot conversion (#710)
This PR does

1. Refine the dot conversion
2. some other tiny code refinement
2022-09-27 14:38:34 +08:00
goostavz
61b61755e5 [Triton-MLIR][Backend] Support layout conversion between mmaLayout and blockedLayout (#693) 2022-09-27 03:58:47 +00:00
Philippe Tillet
1e91ed30d0 [RUNTIME] Major code cleanup (#711)
This PR does the following:
- CUDA utilities (e.g., cuGetInfo) won't be compiled as part of libtriton.so anymore.
- Refactors driver/llvm.cc to split it between PTX codegen and python.
- By extension this will also deprecate include/external so Triton won't have to live with a copy of some CUDA/Hip headers anymore.
- `triton-translate` becomes a `triton.tools.aot` Python utility that re-uses functions from the triton.compile sub-module.
2022-09-26 16:38:06 -07:00
Philippe Tillet
8bb09f83ee [CI] Added CODEOWNERS file (#709) 2022-09-24 16:32:44 -07:00
Philippe Tillet
22ec22c257 [FRONTEND] Backport new runtime from master (#706)
This PR merges the new runtime back into the `triton-mlir` branch. This
adds caching and just-in-time compilation functionality to the
triton-mlir project, and paves the way for re-using tests from the
master branch.
2022-09-23 16:09:43 -07:00
Keren Zhou
ecd1bc33df [Triton-MLIR] Keren/code gen for extract slice and alloc tensor (#692)
Co-authored-by: gzhu <goostavz@outlook.com>
2022-09-23 19:38:14 +00:00
Philippe Tillet
c56f0198dd Revert "[Triton-MLIR][pybind11] Update pybind11 to 2.10.0" (#702)
Reverts openai/triton#694
2022-09-23 12:31:33 -07:00
Yan Chunwei
922155f1d2 [BACKEND] add dot conversion (mma version=2) (#672)
LLVM Conversion for Dot op.

Due to the lack of `convert_layout`, the dot currently only supports
the following combinations of operands:

- `$a` in shared layout
- `$b` in shared layout
- `$c` in MMA layout (but only Splat-like, leaving the generic cases to
`convert_layout`)

This PR focuses on `mma.16816`-related logic, leaving the other
cases to a following PR.

Co-authored-by: Philippe Tillet <phil@openai.com>
2022-09-22 20:43:54 -07:00
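For orientation, a minimal CUDA sketch of the `mma.16816` instruction shape the PR targets: one m16n8k16 tile with f16 inputs and f32 accumulation, written as inline PTX. The wrapper and its register packing are an illustration, not the PR's generated code.

```c++
#include <cstdint>

// One mma.sync m16n8k16 tile (sm_80+): per lane, $a packs 8 f16 values
// into 4 x b32 registers, $b packs 4 f16 values into 2 x b32 registers,
// and $c / $d hold 4 f32 accumulators each.
__device__ void mma16816(float d[4], const uint32_t a[4],
                         const uint32_t b[2], const float c[4]) {
  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
      "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
      : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
      : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
        "r"(b[0]), "r"(b[1]),
        "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```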
Shintaro Iwasaki
23f424c660 [Triton-MLIR][pybind11] Update pybind11 to 2.10.0 (#694)
This PR applies #691 to the Triton-MLIR branch.
2022-09-22 17:53:42 -07:00
Shintaro Iwasaki
940ef3f0ac [BACKEND] llvm::dyn_cast -> llvm::dyn_cast_or_null (#689) 2022-09-22 03:26:40 +00:00
goostavz
15bfd0cb79 [BACKEND] Support of ConvertLayoutOp from blocked to blocked and SliceLayout with blocked parent (#658) 2022-09-17 14:58:42 -07:00
Shintaro Iwasaki
13669b46a6 [DOCS] Correct spelling (#665)
This PR corrects spelling, as #664 did, for Triton-MLIR. It should not break anything.
2022-09-16 15:07:34 -07:00
Shintaro Iwasaki
e9e1a4e682 [FRONTEND] Fix the implicit broadcasting rule (#663)
This PR solves the cast issue that appears in some tutorial code.
2022-09-16 10:49:15 -07:00