This PR decouples operand loading from the MMA codegen to make it ready
for the ongoing `DotOperandEncodingAttr` migration.
The existing DotOp conversion is composed of the following two procedures:
1. Loading the $a, $b, $c operands from smem to registers.
2. Performing the MMA instruction codegen.
In the latest design, the first stage should become part of the
`convert_layout(shared_layout) -> dot_operand_layout` lowering, which is why
the decoupling is necessary.
In more detail, this PR introduces an `MMA16816ConversionHelper` class with
`loadA`, `loadB`, and `loadC` methods that load $a, $b, and $c from smem to
registers. Both `loadA` and `loadB` return an `LLVM::Struct`, which should be
compatible with the new `DotOperandEncodingAttr` conversion.
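For reference, a rough sketch of the helper's interface (the signatures below are illustrative, not copied from the PR):

```c++
// Illustrative sketch only; the exact signatures in the PR may differ.
// loadA/loadB/loadC read an operand from shared memory into registers and
// pack the per-thread fragments into an LLVM struct value, and convertDot
// consumes those packed values to emit the mma.16816 inline asm.
class MMA16816ConversionHelper {
public:
  Value loadA(Value tensor, Value llTensor) const; // packed $a fragments (LLVM::Struct)
  Value loadB(Value tensor, Value llTensor) const; // packed $b fragments (LLVM::Struct)
  Value loadC(Value tensor, Value llTensor) const; // $c accumulator fragments
  LogicalResult convertDot(Value loadedA, Value loadedB, Value loadedC,
                           triton::DotOp op, OpAdaptor adaptor) const;
};
```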
The conversion layout for $a and $b is as follows:
```c++
// The layout is a list of Values with coordinates (i, j); the order is as
// follows:
// [
// (0,0), (0,1), (1,0), (1,1), # i=0, j=0
// (0,2), (0,3), (1,2), (1,3), # i=0, j=1
// (0,4), (0,5), (1,4), (1,5), # i=0, j=2
// ...
// (2,0), (2,1), (3,0), (3,1), # i=1, j=0
// (2,2), (2,3), (3,2), (3,3), # i=1, j=1
// (2,4), (2,5), (3,4), (3,5), # i=1, j=2
// ...
// ]
// i \in [0, n0) and j \in [0, n1)
```
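Put differently, the elements are ordered by iterating `i`, then `j`, and emitting a 2x2 block of coordinates for each `(i, j)`. A minimal standalone sketch that reproduces the ordering above (the `n0 = 2, n1 = 3` sizes are just illustrative):

```c++
#include <cstdio>

// Prints the (row, col) coordinate order described in the comment above:
// for each i, for each j, emit the 2x2 block
//   (2i, 2j), (2i, 2j+1), (2i+1, 2j), (2i+1, 2j+1).
int main() {
  const int n0 = 2, n1 = 3; // illustrative sizes
  for (int i = 0; i < n0; ++i)
    for (int j = 0; j < n1; ++j)
      std::printf("(%d,%d), (%d,%d), (%d,%d), (%d,%d)  # i=%d, j=%d\n",
                  2 * i, 2 * j, 2 * i, 2 * j + 1,
                  2 * i + 1, 2 * j, 2 * i + 1, 2 * j + 1, i, j);
}
```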
The `convertDot` method takes the loaded $a, $b, and $c ($a and $b are of
type `LLVM::Struct`, while $c is a scalar `Value`), extracts the elements from
the `LLVM::Struct`s following the layout above, and passes the elements to the
MMA inline asm.
This PR does the following:
- fixes some bugs to support masked load/store,
- refines the frontend and supports the `and`/`or` syntax in masks (by
extending `BoolOp` handling in the Python AST visitor), e.g. `tl.store(...,
mask=offset<n and other_conditions)`,
- adds `arith.cmpI` and `arith.cmpF` op conversions in the backend (required
by masks),
- adds more test cases in vecadd.
Get the SMEM base address of an input operand from `adapter.arg()` instead of
`getSharedMemoryBase(arg, ...)`, since the latter does not work with memory
aliasing. For example:
```mlir
%a = extract_slice %b, %offset
%c = dot %a, %d
```
`%a` should have a different smem base address from `%b`.
This PR does the following:
- CUDA utilities (e.g., cuGetInfo) won't be compiled as part of libtriton.so anymore.
- driver/llvm.cc is refactored and split between PTX codegen and Python.
- By extension, this also deprecates include/external, so Triton won't have to live with a copy of some CUDA/HIP headers anymore.
- `triton-translate` becomes a `triton.tools.aot` Python utility that reuses functions from the triton.compile sub-module.
This PR merges the new runtime back into the `triton-mlir` branch. This
adds caching and just-in-time compilation functionality to the
triton-mlir project, and paves the way for re-using tests from the
master branch.
LLVM Conversion for Dot op.
Due to the lack of `convert_layout`, the dot op currently only supports the
following combination of operands:
- `$a` in shared layout
- `$b` in shared layout
- `$c` in MMA layout (but only splat-like, leaving the generic cases to
`convert_layout`)
This PR focuses on the `mma.16816`-related logic, leaving the other cases to
follow-up PRs.
Co-authored-by: Philippe Tillet <phil@openai.com>
The code in this branch assumes the `src` operand of `insert_slice_async`
always aliases the result, which doesn't hold in general but is a workaround
to make the pipeline pass work.
I'm also working on the complete analysis in another
[branch](https://github.com/openai/triton-mlir/tree/keren/analyze-slice).
This PR does the following:
1. Adds some C++ tests for `PTXFormat`.
2. Enhances the functionality of `PTXFormat` so that a `PTXInstr` instance
can be called multiple times, similar to a C function (see the sketch below).
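To illustrate the intent of item 2, here is a standalone sketch of a reusable "instruction" object that is built once and invoked repeatedly with different operands; this is only an illustration, not the actual `PTXFormat`/`PTXInstr` API:

```c++
#include <cstdio>
#include <string>
#include <vector>

// A reusable "instruction" object: build it once, then call it any number of
// times with different operands, much like calling a C function.
// NOTE: illustration only, not the PTXFormat/PTXInstr API from this PR.
struct Instr {
  std::string opcode;

  // operator() renders one concrete instance of the instruction.
  std::string operator()(const std::vector<std::string> &operands) const {
    std::string out = opcode;
    for (size_t i = 0; i < operands.size(); ++i)
      out += (i == 0 ? " " : ", ") + operands[i];
    return out + ";";
  }
};

int main() {
  Instr mov{"mov.u32"};
  std::puts(mov({"%r0", "%r1"}).c_str()); // mov.u32 %r0, %r1;
  std::puts(mov({"%r2", "%r3"}).c_str()); // mov.u32 %r2, %r3;
  return 0;
}
```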
This PR does the following things:
- Refactors the Load and Store op codegen so both are rewritten with the same
logic and share most of the code.
- Supports vectorized load/store.
Based on the discussion in #53, I added the initial flow of CXX unit tests for this repo, providing two dummy UTs as placeholders to show the usage; feel free to add your own CXX unit tests.
@Superjomn @ptillet
@ptillet, in this PR I also configured integration-tests.yml to add the unit tests to the GitHub CI checks.
Thanks
This PR simplifies the layout-conversion simplification algorithm and also improves it to work with vectorized element-wise ops. The conversion optimizer still has a lot of room for improvement, and other PRs will address its limitations (ideally via some sort of explicit cost model).
This deprecates the use of the release-build LLVM binaries hosted by the LLVM project, which make debugging harder for developers (assertions are disabled).
This PR implements the following solution:
1. Create LLVM release tarballs with assertions enabled on our own (using Docker)
2. Host them in our own GitHub repositories
3. Use our LLVM for CI and/or development if `TRITON_USE_ASSERT_ENABLED_LLVM=1` is set.
The purpose of this PR is to analyze shared memory aliases so that we can
fix memory allocation bugs and save memory allocations in Triton code
involving complex control flow.
Changes to the memory barrier and allocation are on the way.
Co-authored-by: Philippe Tillet <phil@openai.com>