Phil Tillet
d767919bc1
[OPTIMIZER] Not using MMA on FP32 when allowTF32 is false
2022-11-04 23:16:28 -07:00
Superjomn
db64477153
Merge remote-tracking branch 'origin/triton-mlir' into port-fma
2022-11-04 17:43:54 +08:00
Superjomn
1ed6ee34ba
finish coding
2022-11-04 16:54:05 +08:00
Philippe Tillet
91a9773b38
[OPTIMIZER] Minor bugfixes that affected matmul codegen performance ( #834 )
2022-11-02 22:58:09 -07:00
Philippe Tillet
12d60cb4a3
[BACKEND] Added support for 1D conversion blocked -> slice ( #831 )
2022-11-01 13:19:58 -07:00
Yan Chunwei
031c2ae77b
[Triton-MLIR][BACKEND] Port the mma<v1> conversion ( #815 )
...
This PR does
- port the mma<v1> related code, and support dot conversion and
convert_layout[shared->dot_op<mma<v1>>]
- add a lit test for dot v1
2022-11-01 09:42:14 +08:00
Philippe Tillet
e61dc75942
[FRONTEND] Fixed inliner and got more tests to pass ( #822 )
...
This adds a `DialectInlinerInterface` to the Triton dialect. This, along
with a few other minor semantic changes, fixes our tests on call
instructions. Also added the option to provide use an "LLVM_SYSPATH"
environment variable to link against locally build of LLVM; this was
useful for debugging this issue.
2022-10-30 14:10:02 -07:00
Ian Bearman
f2106d0aa2
[BUILD] Fix Warnings and Enable Warnings as Errors ( #794 )
2022-10-28 12:36:09 -07:00
Philippe Tillet
ac0f6793cc
[BACKEND] Added support for scalars in LoadOp / StoreOp / ElementwiseOp ( #814 )
...
Also fixed various errors that showed up in `test_core.py`, and added more TODOs for open (hopefully relatively minor) issues
2022-10-28 16:17:55 +08:00
Keren Zhou
3b80801dff
[Triton-MLIR][Backend] Fix many problems to get the pipeline working ( #809 )
...
1. Rewrite code generation of insert_slice_async.
2. Correct the wrong index passed to extract_slice in pipeline.
3. Add a prologue in pipeline to wait for dangling cp.asyncs.
4. Move scf to cf conversion inside TritonGPUToLLVM because we need to
perform membar before scf to cf. It shouldn't be a technical limitation
and could be improved by a more general membar analysis.
5. Use an attribute to memoize the shared memory size and support
dynamic shared memory.
6. Prevent the combine pass to reorder insert_slice and extract_slice
across async_wait
Co-authored-by: Superjomn <yanchunwei@outlook.com >
2022-10-27 22:09:06 -07:00
Qingyi Liu
42db3538e4
[Triton-MLIR][Backend] Add ReduceOpConversion into TritonGPUToLLVM conversion ( #774 )
...
What is done in this PR:
- [x] Add `ConvertLayout`, `getSizePerThread` and `getShapePerCTA`
implementation for `SliceEncodingAttr`
- [x] Split `emitIndices` into two phases:
`emitBaseIndexForBlockedLayout` and `emitOffsetForBlockedLayout`
- [x] Add `ReduceOpConversion::matchAndRewriteBasic` implementation
- [x] Add `ReduceOpConversion::matchAndRewriteFast` implementation with
ptx instruction `shfl.sync`
- [x] Add support for scalar value in `StoreOpConversion`
- [x] Add Reduce1d and Reduce2d unit tests and pass all unit tests
Co-authored-by: Qingyi Liu <liuqingyi1993@gmail.com >
2022-10-28 11:07:45 +08:00
Philippe Tillet
3e6cc6d66c
[FRONTEND] Made more tests pass ( #805 )
2022-10-26 17:47:33 -07:00
Yan Chunwei
877844de4f
[Triton-MLIR][BACKEND] add convert_layout[shared->dot_op] converstion to adapt DotOperand layout ( #786 )
...
This PR helps to
1. Adapt the existing DotOp conversion to the design of the new
DotOperand layout,
2. Making the DotOp conversion work with both shared-layout inputs case
and dotoperand-layout inputs case for further upstream switch.
2022-10-24 11:40:13 +08:00
Philippe Tillet
bb0f9235d1
[OPTIMIZER] Made layout simplification pass efficient for fused attention kernels ( #790 )
2022-10-21 16:52:15 -07:00
Philippe Tillet
dc0588a898
[OPTIMIZER] Improved layout simplification pass so it handles swizzled layouts better ( #789 )
...
Note: uncommented `test_gemm`, since backend has an issue with swizzling. This will get uncommented in a subsequent PR.
2022-10-20 19:03:37 -07:00
Shintaro Iwasaki
0d22d2bc03
[TritonMLIR] Disallow 0D tensor ( #788 )
2022-10-19 10:34:32 -07:00
Yan Chunwei
4464646efb
[Triton-MLIR][BACKEND] Fix masked load store op vector size ( #785 )
...
Correct the Load/Store Op's vector size with the mask's alignment
correctly considered.
Some cases:
```mlir
// num_warp = 2
// block_size = 128
func @vecadd_mask_align_16(%a_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %b_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32},
%out_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %n_elements: i32 {tt.divisibility = 16 : i32}) {
// mask = make_range(128) < n_element
}
```
This should get the vec=2 `ld`/`st` instructions.
While the following example
```mlir
// num_warp = 2
// block_size = 128
func @vecadd_mask_align_16(%a_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %b_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32},
%out_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %n_elements: i32) {
// mask = make_range(128) < n_element
}
```
it should get the vec=1 `ld`/`st` instructions.
2022-10-18 11:43:50 +08:00
Philippe Tillet
38a80664b5
[OPTIMIZER] Updated TritonGPU-combine pass ( #784 )
...
WIP but should work int t…he cases we need so far
2022-10-16 21:19:42 -07:00
goostavz
e948a618b3
[Triton-MLIR] fix a tiny bug in coalesce pass ( #782 )
2022-10-16 20:29:55 -07:00
Shintaro Iwasaki
5898352f97
[Triton-IR] Fix LoadOp definition ( #771 ) ( #777 )
2022-10-13 18:53:00 -07:00
Philippe Tillet
623c99609f
[Triton-IR] Added type inference and verifier for Triton-IR operations ( #767 )
2022-10-11 18:16:41 -07:00
Philippe Tillet
b6e5a231e5
[OPTIMIZER] Added swizzling pass ( #758 )
2022-10-10 01:12:37 -07:00
Philippe Tillet
498c685b46
[OPTIMIZER] layout simplification: ignore non-tensor iter arguments in for loop rematerialization ( #749 )
2022-10-07 21:52:29 -07:00
Keren Zhou
289ff293cc
[Triton-MLIR] Generate LLVM/PTX code for async ops ( #735 )
2022-10-04 09:37:00 -07:00
goostavz
f9d7f2f126
[Triton-MLIR][Backend] Support ConvertLayout blocked->shared and a few fixes related with mma( #716 )
2022-10-03 19:33:25 +08:00
Philippe Tillet
9ddf0921fb
[OPTIMIZER] Added DotOp
to the list of expensive ops we don't want to rematerialize. ( #718 )
2022-09-27 09:05:49 -07:00
Yan Chunwei
3a84278530
[Triton-MLIR][BACKEND] Refine dot conversion ( #710 )
...
This PR does
1. Refine the dot conversion
2. some other tiny code refinement
2022-09-27 14:38:34 +08:00
goostavz
61b61755e5
[Triton-MLIR][Backend] Support layout conversion between mmaLayout and blockedLayout ( #693 )
2022-09-27 03:58:47 +00:00
Keren Zhou
ecd1bc33df
[Triton-MLIR] Keren/code gen for extract slice and alloc tensor ( #692 )
...
Co-authored-by: gzhu <goostavz@outlook.com >
2022-09-23 19:38:14 +00:00
Yan Chunwei
922155f1d2
[BACKEND] add dot conversion (mma version=2) ( #672 )
...
LLVM Conversion for Dot op.
Due to the lack of `convert_layout`, currently, the dot only supports
the following combination of operands
- `$a` in shared layout
- `$b` in shared layout
- `$c` in MMA layout(but only Splat-like, leaving the generic cases to
`convert_layout`)
This PR focus on `mma.16816` related logic support, leaving the other
cases to the following PR.
Co-authored-by: Philippe Tillet <phil@openai.com >
2022-09-22 20:43:54 -07:00
Shintaro Iwasaki
940ef3f0ac
[BACKEND] llvm::dyn_cast -> llvm::dyn_cast_or_null ( #689 )
2022-09-22 03:26:40 +00:00
goostavz
15bfd0cb79
[BACKEND] Support of ConvertLayoutOp from blocked to blocked and SliceLayout with blocked parent ( #658 )
2022-09-17 14:58:42 -07:00
Shintaro Iwasaki
13669b46a6
[DOCS] Correct spelling ( #665 )
...
This PR corrects spelling like #664 for Triton-MLIR. It should not break anything.
2022-09-16 15:07:34 -07:00
Shintaro Iwasaki
43be75ad42
[FRONTEND] Add scalar type support for some ops ( #661 )
...
This PR adds basic support for scalar-type inputs to some ops (cast and pointer arithmetics) for Triton-MLIR. Also renames getelementptr -> addptr
2022-09-15 16:12:52 -07:00
Da Yan
2e08450c80
[OPTIMIZER] Better pipeline tests ( #660 )
2022-09-14 23:26:40 -07:00
Keren Zhou
16aed94ff5
[Analysis/Allocation] Allocation passes now assumes that slices always alias ( #108 )
...
This code in this branch assumes the `src` operand in
`insert_slice_async` always aliases the result, which shouldn't hold for
generally cases but is just a workaround to make the pipeline pass work.
I'm also working on the complete analysis in another
[branch](https://github.com/openai/triton-mlir/tree/keren/analyze-slice ).
2022-09-09 12:03:41 -07:00
Philippe Tillet
9bd5a3dcd2
[OPTIMIZER] Pipeline async buffer ( #110 )
2022-09-09 11:01:14 -07:00
Yan Chunwei
a9464f4993
[Backend] Vectorize Load/Store Ops ( #86 )
...
This PR does the following things:
- Code refactoring on Load and Store op codegen, rewrite with same logic
and share much code
- Support the vectorized load/store
2022-09-06 12:28:09 -07:00
Da Yan
35e346bcff
[OPTIMIZER] Better pipeline pass ( #100 )
...
* Use `insert_slice_async` instead of `CopyAsync`
* Move async.wait to loop header
Co-authored-by: Jokeren <kerenzhou@openai.com >
2022-09-06 08:31:13 -07:00
Philippe Tillet
a0bab9748e
[OPTIMIZER] Coalesce pass no longer takes a num-warps
argument ( #99 )
...
Improved design to avoid inconsistent `num-warps` value between the pass and the parent module of the operation it processes.
2022-09-05 18:09:02 -07:00
Philippe Tillet
d0b4c67b05
[OPTIMIZER] Improved layout conversion simplification algorithm ( #97 )
...
This PR both simplifies the layout conversion simplification algorithm, and also improves it to make it work with vectorized element-wise ops. The conversion optimizer still has a lot of room for improvements, and other PRs will address its limitations (ideally via some sort of explicit cost model)
2022-09-02 16:52:44 -07:00
Shintaro Iwasaki
3c635449e5
[Triton] Support math and libdevice ops ( #91 )
...
This PR adds basic math ops by using `MathDialect` and `libdevice` ops by using `extern_elementwise`. This is needed to compile some tutorial code (e.g., `softmax`). This PR implements only interface till PTX (so from frontend to TritonGPU-MLIR)
- Currently till TritonGPU. It cannot be lowered to PTX now.
- No special optimizations (e.g., constant folding etc) are applied.
- 14.x does not define folders for many operators for math ops, but 15.x seems to increase its coverage: https://github.com/llvm/llvm-project/blob/llvmorg-15.0.0-rc3/mlir/include/mlir/Dialect/Math/IR/MathOps.td
- No constant folding etc for `libdevice` ops.
```py
import triton
import triton.language as tl
import sys
@triton.jit
def add_kernel(
x_ptr,
y_ptr,
BLOCK_SIZE: tl.constexpr,
):
offsets = tl.arange(0, BLOCK_SIZE)
x = tl.load(x_ptr + offsets)
x = tl.sin(x)
output = tl.libdevice.sin(x)
output = tl.libdevice.fdiv_rn(output, output)
output = tl.libdevice.fmaf_rd(output, output, output)
tl.store(y_ptr + offsets, output)
if __name__ == "__main__" and len(sys.argv) >= 2:
signature = "*fp32,*fp32"
constants = {'BLOCK_SIZE': 1024}
output = triton.compile(add_kernel, signature, device=0, constants=constants, output="ttgir")
print(output)
```
->
```llvm
#blocked = #triton_gpu.blocked<{sizePerThread = [1], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>
module attributes {"triton_gpu.num-warps" = 4 : i32} {
func @add_kernel__Pfp32_Pfp32__2c1024(%arg0: !tt.ptr<f32>, %arg1: !tt.ptr<f32>) {
%0 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #blocked>
%1 = tt.splat %arg0 : (!tt.ptr<f32>) -> tensor<1024x!tt.ptr<f32>, #blocked>
%2 = tt.getelementptr %1, %0 : tensor<1024x!tt.ptr<f32>, #blocked>
%3 = tt.load %2 {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<1024xf32, #blocked>
%4 = math.sin %3 : tensor<1024xf32, #blocked>
%5 = tt.ext_elemwise %4 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_sinf"} : tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
%6 = tt.ext_elemwise %5, %5 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_fdiv_rn"} : tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
%7 = tt.ext_elemwise %6, %6, %6 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_fmaf_rd"} : tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
%8 = tt.splat %arg1 : (!tt.ptr<f32>) -> tensor<1024x!tt.ptr<f32>, #blocked>
%9 = tt.getelementptr %8, %0 : tensor<1024x!tt.ptr<f32>, #blocked>
tt.store %9, %7 : tensor<1024xf32, #blocked>
return
}
}
```
2022-09-01 16:34:27 -07:00
Keren Zhou
328b87aec6
Keren/tensor slice insert alloc ( #94 )
...
This branch defines three new triton_gpu operations to partially solve #87 . Below is an overview:
```
%tensor = triton_gpu.alloc_tensor : tensor<2x16x16xf16, #A>
%b = triton_gpu.insert_slice_async %a_ptr, %tensor, %offset {axis = 0 : i32, cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<16x16x!tt.ptr<f16>, #AL> -> tensor<2x16x16xf16, #A>
%c = triton_gpu.extract_slice %b, %offset {axis = 0 : i32} : tensor<2x16x16xf16, #A> -> tensor<16x16xf16, #A>
```
We plan to fully replace `copy_async` with `insert_slice_async`. **This hasn't been done yet.**
2022-09-01 12:37:17 -07:00
Shintaro Iwasaki
84aa7d025a
[TritonIR] simplify Load/StoreOps when mask is true/false ( #79 )
...
* [TritonIR] fix Load/Store/CopyAsyncOp's parsers
* [TritonIR] simplify Load/StoreOps when mask is true/false
* [TEST] adds tests to check load/store simplification
2022-08-24 12:55:49 -07:00
Shintaro Iwasaki
0ebef11c77
[TritonIR] Make mask operand optional ( #74 )
2022-08-22 22:00:17 -07:00
Da Yan
92ef552a54
[OPTIMIZER] Fix Num in AsyncWaitOp generated by the pipeline pass ( #72 )
2022-08-22 15:58:10 -07:00
Shintaro Iwasaki
9aa00249a6
[TritonIR] make other optional and remove isOtherUnspecified ( #67 )
...
[Triton] make other optional and remove isOtherUnspecified
2022-08-18 18:19:55 -07:00
Philippe Tillet
192be76b3c
[OPTIMIZER] Rewrite patterns for layout conversions ( #64 )
2022-08-18 12:49:37 -07:00
Da Yan
8776ad1a0e
[OPTIMIZER] Let the pipeline pass insert async wait. ( #63 )
2022-08-18 10:31:57 -07:00
Shintaro Iwasaki
d69ce77b19
[FRONTEND] add an attr for masked load without explicit other ( #55 )
2022-08-18 09:51:37 -07:00