Shintaro Iwasaki
|
3c635449e5
|
[Triton] Support math and libdevice ops (#91)
This PR adds basic math ops by using `MathDialect` and `libdevice` ops by using `extern_elementwise`. This is needed to compile some tutorial code (e.g., `softmax`). This PR implements only the interface from the frontend down to TritonGPU-MLIR; lowering to PTX is not covered:
- Lowering currently stops at TritonGPU; these ops cannot be lowered to PTX yet.
- No special optimizations (e.g., constant folding) are applied.
  - LLVM 14.x does not define folders for many math ops, but 15.x seems to increase coverage: https://github.com/llvm/llvm-project/blob/llvmorg-15.0.0-rc3/mlir/include/mlir/Dialect/Math/IR/MathOps.td
  - There is no constant folding etc. for `libdevice` ops.
```py
import sys

import triton
import triton.language as tl


@triton.jit
def add_kernel(
    x_ptr,
    y_ptr,
    BLOCK_SIZE: tl.constexpr,
):
    offsets = tl.arange(0, BLOCK_SIZE)
    x = tl.load(x_ptr + offsets)
    x = tl.sin(x)  # math op: lowers to the MLIR math dialect
    output = tl.libdevice.sin(x)  # libdevice op: lowers to tt.ext_elemwise
    output = tl.libdevice.fdiv_rn(output, output)
    output = tl.libdevice.fmaf_rd(output, output, output)
    tl.store(y_ptr + offsets, output)


if __name__ == "__main__" and len(sys.argv) >= 2:
    signature = "*fp32,*fp32"
    constants = {'BLOCK_SIZE': 1024}
    output = triton.compile(add_kernel, signature, device=0, constants=constants, output="ttgir")
    print(output)
```
->
```mlir
#blocked = #triton_gpu.blocked<{sizePerThread = [1], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>
module attributes {"triton_gpu.num-warps" = 4 : i32} {
  func @add_kernel__Pfp32_Pfp32__2c1024(%arg0: !tt.ptr<f32>, %arg1: !tt.ptr<f32>) {
    %0 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #blocked>
    %1 = tt.splat %arg0 : (!tt.ptr<f32>) -> tensor<1024x!tt.ptr<f32>, #blocked>
    %2 = tt.getelementptr %1, %0 : tensor<1024x!tt.ptr<f32>, #blocked>
    %3 = tt.load %2 {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<1024xf32, #blocked>
    %4 = math.sin %3 : tensor<1024xf32, #blocked>
    %5 = tt.ext_elemwise %4 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_sinf"} : tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
    %6 = tt.ext_elemwise %5, %5 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_fdiv_rn"} : tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
    %7 = tt.ext_elemwise %6, %6, %6 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_fmaf_rd"} : tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
    %8 = tt.splat %arg1 : (!tt.ptr<f32>) -> tensor<1024x!tt.ptr<f32>, #blocked>
    %9 = tt.getelementptr %8, %0 : tensor<1024x!tt.ptr<f32>, #blocked>
    tt.store %9, %7 : tensor<1024xf32, #blocked>
    return
  }
}
```
|
2022-09-01 16:34:27 -07:00 |
|
Keren Zhou
|
328b87aec6
|
Keren/tensor slice insert alloc (#94)
This branch defines three new triton_gpu operations to partially solve #87. Below is an overview:
```mlir
// allocate a backing tensor holding 2 slices of 16x16 f16 elements
%tensor = triton_gpu.alloc_tensor : tensor<2x16x16xf16, #A>
// asynchronously copy a 16x16 tile from %a_ptr into slice %offset (along axis 0) of %tensor
%b = triton_gpu.insert_slice_async %a_ptr, %tensor, %offset {axis = 0 : i32, cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<16x16x!tt.ptr<f16>, #AL> -> tensor<2x16x16xf16, #A>
// extract slice %offset of %b as a 16x16 tensor
%c = triton_gpu.extract_slice %b, %offset {axis = 0 : i32} : tensor<2x16x16xf16, #A> -> tensor<16x16xf16, #A>
```
We plan to fully replace `copy_async` with `insert_slice_async`. **This hasn't been done yet.**
|
2022-09-01 12:37:17 -07:00 |
|
Shintaro Iwasaki
|
84aa7d025a
|
[TritonIR] simplify Load/StoreOps when mask is true/false (#79)
* [TritonIR] fix Load/Store/CopyAsyncOp's parsers
* [TritonIR] simplify Load/StoreOps when mask is true/false (see the sketch below)
* [TEST] adds tests to check load/store simplification
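
A minimal before/after sketch of the folding (the shapes are assumed and the syntax is illustrative, modeled on the `tt.load` form shown earlier in this log):

```mlir
// Mask is a constant splat of true: the masked load behaves like an
// unmasked one, so the mask (and the `other` operand) can be dropped.
%mask = arith.constant dense<true> : tensor<128xi1>
%x = tt.load %ptr, %mask, %other {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<128xf32>
// simplifies to:
%x = tt.load %ptr {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<128xf32>
```

Dually, when the mask is a constant false, a load should reduce to its `other` value (when one is given), and a store becomes dead and can be erased.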
|
2022-08-24 12:55:49 -07:00 |
|
Shintaro Iwasaki
|
0ebef11c77
|
[TritonIR] Make mask operand optional (#74)
|
2022-08-22 22:00:17 -07:00 |
|
Da Yan
|
92ef552a54
|
[OPTIMIZER] Fix Num in AsyncWaitOp generated by the pipeline pass (#72)
|
2022-08-22 15:58:10 -07:00 |
|
Shintaro Iwasaki
|
9aa00249a6
|
[TritonIR] make other optional and remove isOtherUnspecified (#67)
|
2022-08-18 18:19:55 -07:00 |
|
Philippe Tillet
|
192be76b3c
|
[OPTIMIZER] Rewrite patterns for layout conversions (#64)
|
2022-08-18 12:49:37 -07:00 |
|
Da Yan
|
8776ad1a0e
|
[OPTIMIZER] Let the pipeline pass insert async wait. (#63)
|
2022-08-18 10:31:57 -07:00 |
|
Shintaro Iwasaki
|
d69ce77b19
|
[FRONTEND] add an attr for masked load without explicit other (#55)
|
2022-08-18 09:51:37 -07:00 |
|
Yan Chunwei
|
95bbac41e7
|
[BACKEND] Add LLVM-translation for store and splat ops (#47)
|
2022-08-15 00:46:37 -07:00 |
|
Shintaro Iwasaki
|
2ba9a83465
|
[BUILD] fix minor issues with MLIR assert enabled (#46)
|
2022-08-11 21:20:47 -07:00 |
|
Philippe Tillet
|
3236642e8f
|
[OPTIMIZER] Added memory coalescing pass (#31)
|
2022-07-31 20:59:31 -07:00 |
|
Philippe Tillet
|
d1593e6ca8
|
[TritonGPU] Improved documentation and semantics of layout encodings (#30)
|
2022-07-31 13:59:44 -07:00 |
|
Yan Chunwei
|
e02c82c765
|
[TritonIR] Convert Triton dialect's Combine pass to MLIR DRR based (#16)
|
2022-07-27 12:50:08 -07:00 |
|
Philippe Tillet
|
432c3df265
|
[BUILD] MacOS can now build compiler and run MLIR tests (#25)
|
2022-07-27 01:32:10 -07:00 |
|
Philippe Tillet
|
6d62d88d4f
|
[CI] run clang-format (#24)
|
2022-07-26 17:25:03 -07:00 |
|
Keren Zhou
|
96cc6fb563
|
[TritonGPU] Pretty printer for layouts (#21)
|
2022-07-26 10:50:11 -07:00 |
|
Philippe Tillet
|
a633d2b403
|
[Analysis] Added Axis Info Analysis (#8)
|
2022-07-19 13:38:48 -07:00 |
|
Yan Da
|
63e6a85901
|
Fix blocked layout parser
|
2022-07-15 15:19:11 +08:00 |
|
Yan Da
|
9d1b5e3f79
|
special encoding for broadcast
|
2022-06-18 21:16:45 +08:00 |
|
Yan Da
|
53cf93ce6a
|
Revert "Remove TypeConverter from TritonToTritonGPU conversion"
This reverts commit 64d0b87ef0.
|
2022-06-18 14:57:41 +08:00 |
|
Yan Da
|
64d0b87ef0
|
Remove TypeConverter from TritonToTritonGPU conversion
|
2022-06-18 14:34:59 +08:00 |
|
Yan Da
|
9feb256b71
|
op combine in Triton Dialect: broadcast(cst) -> cst
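
A hypothetical sketch of this combine (the shapes are illustrative, and the `tt.broadcast` syntax is assumed to follow the `tt.splat` style shown above):

```mlir
// before: broadcast of a splat constant
%cst = arith.constant dense<0.0> : tensor<1x128xf32>
%y = tt.broadcast %cst : (tensor<1x128xf32>) -> tensor<64x128xf32>
// after the combine: a constant of the broadcast shape
%y = arith.constant dense<0.0> : tensor<64x128xf32>
```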
|
2022-06-17 16:19:47 +08:00 |
|
Yan Da
|
117a402c1b
|
more comments to TypeConverter & update warpTileSize
|
2022-06-08 16:20:07 +08:00 |
|
Yan Da
|
7b09b5f9e9
|
the pipeline pass now generates and accepts valid IR
|
2022-06-07 19:34:59 +08:00 |
|
Yan Da
|
366dddc3bc
|
update mma encoding & triton-opt
|
2022-06-06 21:03:58 +08:00 |
|
Yan Da
|
7807f64ef3
|
rename sharded_layout => blocked_layout
|
2022-06-05 16:14:59 +08:00 |
|
Yan Da
|
a4a2c72173
|
default address space of PointerType 0 => 1
|
2022-06-05 15:09:41 +08:00 |
|
Yan Da
|
d5eca56cf3
|
more TritonGPU unit tests
|
2022-06-05 14:25:09 +08:00 |
|
Da Yan
|
e36a54eb86
|
more progress on the definition of layouts
|
2022-05-31 11:43:21 +00:00 |
|
Yan Da
|
41d338d848
|
Fix op mapping in pipeline.cpp
|
2022-05-26 13:57:01 +08:00 |
|
Yan Da
|
c529b462f5
|
more fixes on pipeline.cpp
|
2022-05-26 13:14:41 +08:00 |
|
Yan Da
|
71d1c10e19
|
Remove weird includes
|
2022-05-25 21:54:06 +08:00 |
|
Yan Da
|
9308e9c90c
|
A more general pipeliner
|
2022-05-25 21:52:51 +08:00 |
|
Yan Da
|
441fd7c3cc
|
assembly format
|
2022-05-25 17:53:24 +08:00 |
|
Yan Da
|
9b670cfb9f
|
Add ReduceOp
|
2022-05-25 14:15:36 +08:00 |
|
Yan Da
|
a2c9f919a8
|
TritonGPU verifier
|
2022-05-24 19:48:56 +08:00 |
|
Yan Da
|
36c45ec687
|
make numStages an option in PipelinePass
|
2022-05-23 12:47:55 +08:00 |
|
Yan Da
|
79298d61bc
|
fix a pipeline issue
|
2022-05-16 19:38:40 +08:00 |
|
Yan Da
|
c3c4ac3733
|
TritonGPU combiner
|
2022-05-16 19:17:15 +08:00 |
|
Yan Da
|
e3916c3a46
|
TritonGPU combiner
|
2022-05-16 19:16:01 +08:00 |
|
Yan Da
|
0e68e6eb59
|
delete erroneous include
|
2022-05-15 22:30:26 +08:00 |
|
Yan Da
|
7027af9666
|
The pipeline pass is now functional
|
2022-05-15 22:29:27 +08:00 |
|
Yan Da
|
7e0e7ec365
|
more progress on the pipeline pass
|
2022-05-14 22:04:36 +08:00 |
|
Yan Da
|
978463ba39
|
more progress on the pipeline pass
|
2022-05-13 21:32:35 +08:00 |
|
Yan Da
|
d23d7b244c
|
More on the pipeline pass
|
2022-05-11 20:31:08 +08:00 |
|
Yan Da
|
1a4fbed25b
|
Skeleton for the pipeline pass
|
2022-05-11 16:13:53 +08:00 |
|
Yan Da
|
96876a46d1
|
More progress on Triton=>TritonGPU conversion (works for matmul)
|
2022-05-09 21:19:53 +08:00 |
|
Yan Da
|
0c5319eed9
|
More progress on SCF type conversion
|
2022-05-05 20:56:55 +08:00 |
|
Yan Da
|
26c59e4718
|
More on SCF conversion
|
2022-05-04 21:50:32 +08:00 |
|