Philippe Tillet
532e10cf87
[FRONTEND][BACKEND] Clean-up transpositions ( #953 )
2022-12-06 09:32:13 -08:00
Philippe Tillet
8edfe813a5
[FRONTEND][BACKEND] Added trans instruction; made flash attention bwd pass work ( #943 )
2022-12-03 09:58:24 -08:00
donproc
521ff9ad74
[TRITON-MLIR][FRONTEND] Fix scf.if to run through layernorm tutorial ( #938 )
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-12-02 17:45:29 +08:00
Keren Zhou
7d90a07d0b
[Triton-MLIR][BACKEND] Refactor decompose insert_slice_async ( #929 )
1. Improve the pipeline pass's comments
2. Decompose `insert_slice_async` when the load vector size is not supported
3. Add a test that can make our GEMM code fail
Copy my comments here:
There's a knob that may cause a performance regression when the decomposition
has been performed. We should remove this knob once we have a thorough
analysis of async waits. Currently, we decompose `insert_slice_async`
into `load` and `insert_slice` without knowing which `async_wait` is
responsible for the `insert_slice_async`. To guarantee correctness, we
blindly set the `async_wait` to wait for all async ops if any `insert_slice_async` has been decomposed.
There are two options to improve this (see the schematic sketch after this list):
1. We can perform a dataflow analysis in the backend to find the `async_wait` that is
responsible for each `insert_slice_async`.
2. We can modify the pipeline pass to perform the decomposition before the
`async_wait` is inserted. However, this is also risky because we don't yet
know the correct vectorized shape in the pipeline pass. Making the
pipeline pass aware of the vectorization could introduce additional
dependencies on the AxisInfoAnalysis and the Coalesce analysis.
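A schematic, pure-Python sketch of the conservative policy described above. The real pass rewrites MLIR ops; every name below is invented for illustration:
```py
from dataclasses import dataclass, field

@dataclass
class Op:
    # Toy stand-in for an MLIR operation; not the real IR.
    name: str
    vec_size: int = 0
    attrs: dict = field(default_factory=dict)

def decompose_insert_slice_async(ops, supported_vec_sizes=(1, 2, 4, 8)):
    out, decomposed_any = [], False
    for op in ops:
        if op.name == "insert_slice_async" and op.vec_size not in supported_vec_sizes:
            # Split the unsupported async copy into a synchronous load
            # followed by an insert_slice.
            out.append(Op("load", op.vec_size))
            out.append(Op("insert_slice", op.vec_size))
            decomposed_any = True
        else:
            out.append(op)
    if decomposed_any:
        # We no longer know which async_wait guarded each decomposed copy,
        # so conservatively make every async_wait wait for all async ops.
        for op in out:
            if op.name == "async_wait":
                op.attrs["num"] = 0
    return out

print(decompose_insert_slice_async(
    [Op("insert_slice_async", vec_size=3), Op("async_wait", attrs={"num": 2})]))
```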
2022-11-30 10:07:34 -08:00
donproc
f63be0e9b5
[TRITON-MLIR][BACKEND] Support atomic_cas ( #914 )
1. Support atomic_cas
2. Add xchg support in atomic_rmw
Co-authored-by: dongdongl <dongdongl@nvidia.com>
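A minimal frontend sketch exercising both primitives; the kernel and expected values are hypothetical (not taken from the PR's tests), assuming the ops are exposed as `tl.atomic_cas` and `tl.atomic_xchg`:
```py
import torch
import triton
import triton.language as tl

@triton.jit
def cas_xchg_kernel(lock_ptr, val_ptr):
    # Compare-and-swap: store 1 into *lock_ptr only if it currently holds 0;
    # returns the value that was there before.
    old = tl.atomic_cas(lock_ptr, 0, 1)
    # Exchange (an atomic_rmw flavor): unconditionally swap in a new value.
    tl.atomic_xchg(val_ptr, old + 42)

lock = torch.zeros(1, dtype=torch.int32, device="cuda")
val = torch.zeros(1, dtype=torch.int32, device="cuda")
cas_xchg_kernel[(1,)](lock, val)
print(lock.item(), val.item())  # expect: 1 42
```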
2022-11-25 12:02:08 +08:00
donproc
5eee738df7
[Triton-MLIR][FRONTEND][BACKEND] Fix atomics ( #879 )
Minor fixes to the backend and frontend of atomics; we can now pass one test without
a mask, with the shape aligned to the CTA size.
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-11-16 12:25:15 +08:00
Chenggang Zhao
57fd1864a7
[Triton-MLIR] Support FP8 ( #864 )
Co-authored-by: Superjomn <yanchunwei@outlook.com>
2022-11-10 15:53:06 +08:00
Da Yan
4946167241
[Triton-MLIR] tt.dot operands now must have DotOperand layout; also added prefetch pass prototype ( #712 )
Co-authored-by: Jokeren <kerenzhou@openai.com>
Co-authored-by: Phil Tillet <phil@openai.com>
Co-authored-by: Superjomn <yanchunwei@outlook.com>
2022-11-10 05:57:27 +00:00
Philippe Tillet
a4ff0c362c
[FRONTEND] Fix issues with atomics ( #849 )
2022-11-06 20:52:11 -08:00
ben-zhang-609
5feb6e24f9
[Triton-MLIR] Add ptx vprintf support ( #825 )
Not sure how to write a unit test for this feature.
Co-authored-by: Yan Chunwei <yanchunwei@outlook.com>
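A minimal sketch of how the feature can be exercised manually, assuming it is exposed on the frontend as `tl.printf` (the kernel is illustrative only):
```py
import torch
import triton
import triton.language as tl

@triton.jit
def print_kernel(x_ptr, BLOCK_SIZE: tl.constexpr):
    offsets = tl.arange(0, BLOCK_SIZE)
    x = tl.load(x_ptr + offsets)
    # Lowered to a PTX vprintf call at runtime.
    tl.printf("x = ", x)

x = torch.arange(4, dtype=torch.float32, device="cuda")
print_kernel[(1,)](x, BLOCK_SIZE=4)
```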
2022-11-02 16:39:09 +08:00
Ian Bearman
f2106d0aa2
[BUILD] Fix Warnings and Enable Warnings as Errors ( #794 )
2022-10-28 12:36:09 -07:00
Philippe Tillet
3e6cc6d66c
[FRONTEND] Made more tests pass ( #805 )
2022-10-26 17:47:33 -07:00
Philippe Tillet
a2cbe7af91
[FRONTEND] Enhanced support for binary operators ( #801 )
Disabled the modulo test (due to a change in the behavior of `frem` in nvptx
between llvm-11 and llvm-14) and bfloat16 (which will require some work to
emulate in software, similar to how it's done in `master`).
2022-10-24 19:47:01 -07:00
Philippe Tillet
bb0f9235d1
[OPTIMIZER] Made layout simplification pass efficient for fused attention kernels ( #790 )
2022-10-21 16:52:15 -07:00
Shintaro Iwasaki
5898352f97
[Triton-IR] Fix LoadOp definition ( #771 ) ( #777 )
2022-10-13 18:53:00 -07:00
Da Yan
963d031247
[Triton-IR] Fix LoadOp Triton->TritonGPU conversion ( #775 )
2022-10-13 12:57:39 -07:00
Philippe Tillet
623c99609f
[Triton-IR] Added type inference and verifier for Triton-IR operations ( #767 )
2022-10-11 18:16:41 -07:00
Shintaro Iwasaki
43be75ad42
[FRONTEND] Add scalar type support for some ops ( #661 )
This PR adds basic support for scalar-type inputs to some ops (cast and pointer arithmetic) for Triton-MLIR. It also renames getelementptr -> addptr.
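A small, hypothetical kernel exercising scalar pointer arithmetic (which lowers to addptr) and scalar casts; this is a sketch, not a test from the PR:
```py
import torch
import triton
import triton.language as tl

@triton.jit
def scalar_ops_kernel(x_ptr, y_ptr):
    # Scalar pointer arithmetic: lowers to tt.addptr on a scalar pointer.
    x = tl.load(x_ptr + 1)
    # Scalar cast: float32 -> int32 (truncating) and back.
    y = x.to(tl.int32).to(tl.float32)
    tl.store(y_ptr, y)

x = torch.tensor([1.5, 2.5], device="cuda")
y = torch.zeros(1, device="cuda")
scalar_ops_kernel[(1,)](x, y)
print(y.item())  # expect: 2.0
```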
2022-09-15 16:12:52 -07:00
Shintaro Iwasaki
3c635449e5
[Triton] Support math and libdevice ops ( #91 )
This PR adds basic math ops by using `MathDialect` and `libdevice` ops by using `extern_elementwise`. This is needed to compile some tutorial code (e.g., `softmax`). This PR implements only the interface down to TritonGPU-MLIR (i.e., from the frontend to TritonGPU):
- Currently only down to TritonGPU; it cannot be lowered to PTX yet.
- No special optimizations (e.g., constant folding) are applied.
- LLVM 14.x does not define folders for many math ops, but 15.x seems to increase its coverage: https://github.com/llvm/llvm-project/blob/llvmorg-15.0.0-rc3/mlir/include/mlir/Dialect/Math/IR/MathOps.td
- No constant folding etc. for `libdevice` ops.
```py
import triton
import triton.language as tl
import sys


@triton.jit
def add_kernel(
    x_ptr,
    y_ptr,
    BLOCK_SIZE: tl.constexpr,
):
    offsets = tl.arange(0, BLOCK_SIZE)
    x = tl.load(x_ptr + offsets)
    x = tl.sin(x)
    output = tl.libdevice.sin(x)
    output = tl.libdevice.fdiv_rn(output, output)
    output = tl.libdevice.fmaf_rd(output, output, output)
    tl.store(y_ptr + offsets, output)


if __name__ == "__main__" and len(sys.argv) >= 2:
    signature = "*fp32,*fp32"
    constants = {'BLOCK_SIZE': 1024}
    output = triton.compile(add_kernel, signature, device=0, constants=constants, output="ttgir")
    print(output)
```
->
```llvm
#blocked = #triton_gpu.blocked<{sizePerThread = [1], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>
module attributes {"triton_gpu.num-warps" = 4 : i32} {
  func @add_kernel__Pfp32_Pfp32__2c1024(%arg0: !tt.ptr<f32>, %arg1: !tt.ptr<f32>) {
    %0 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #blocked>
    %1 = tt.splat %arg0 : (!tt.ptr<f32>) -> tensor<1024x!tt.ptr<f32>, #blocked>
    %2 = tt.getelementptr %1, %0 : tensor<1024x!tt.ptr<f32>, #blocked>
    %3 = tt.load %2 {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<1024xf32, #blocked>
    %4 = math.sin %3 : tensor<1024xf32, #blocked>
    %5 = tt.ext_elemwise %4 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_sinf"} : tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
    %6 = tt.ext_elemwise %5, %5 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_fdiv_rn"} : tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
    %7 = tt.ext_elemwise %6, %6, %6 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_fmaf_rd"} : tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
    %8 = tt.splat %arg1 : (!tt.ptr<f32>) -> tensor<1024x!tt.ptr<f32>, #blocked>
    %9 = tt.getelementptr %8, %0 : tensor<1024x!tt.ptr<f32>, #blocked>
    tt.store %9, %7 : tensor<1024xf32, #blocked>
    return
  }
}
```
2022-09-01 16:34:27 -07:00
goostavz
bedbf221c0
[BACKEND] Support optional mask in TritonGPUToLLVM ( #80 )
Co-authored-by: gzhu <gzhu@nvidia.com>
2022-08-24 17:51:37 -07:00
Shintaro Iwasaki
0ebef11c77
[TritonIR] Make mask operand optional ( #74 )
2022-08-22 22:00:17 -07:00
Shintaro Iwasaki
9aa00249a6
[TritonIR] Make other optional and remove isOtherUnspecified ( #67 )
2022-08-18 18:19:55 -07:00
Philippe Tillet
192be76b3c
[OPTIMIZER] Rewrite patterns for layout conversions ( #64 )
2022-08-18 12:49:37 -07:00
Shintaro Iwasaki
d69ce77b19
[FRONTEND] add an attr for masked load without explicit other ( #55 )
2022-08-18 09:51:37 -07:00
Yan Chunwei
83ef74f248
[BACKEND] Extracting numWarps from tritonGPU module ( #39 )
2022-08-08 09:40:20 -07:00
Philippe Tillet
78ebbe24c7
[FRONTEND] Added ExpandDimsOp primitive ( #36 )
2022-08-04 18:41:06 -07:00
Philippe Tillet
d1593e6ca8
[TritonGPU] Improved documentation and semantics of layout encodings ( #30 )
2022-07-31 13:59:44 -07:00
Philippe Tillet
6d62d88d4f
[CI] Run clang-format ( #24 )
2022-07-26 17:25:03 -07:00
Philippe Tillet
a633d2b403
[Analysis] Added Axis Info Analysis ( #8 )
2022-07-19 13:38:48 -07:00
Yan Da
9d1b5e3f79
special encoding for broadcast
2022-06-18 21:16:45 +08:00
Yan Da
53cf93ce6a
Revert "Remove TypeConverter from TritonToTritonGPU conversion"
This reverts commit 64d0b87ef0.
2022-06-18 14:57:41 +08:00
Yan Da
64d0b87ef0
Remove TypeConverter from TritonToTritonGPU conversion
2022-06-18 14:34:59 +08:00
Yan Da
117a402c1b
more comments to TypeConverter & update warpTileSize
2022-06-08 16:20:07 +08:00
Yan Da
7b09b5f9e9
the pipeline pass now generates and accepts valid IR
2022-06-07 19:34:59 +08:00
Yan Da
9308e9c90c
A more general pipeliner
2022-05-25 21:52:51 +08:00
Yan Da
e6f89a5777
Fix ReduceOp conversion
2022-05-25 16:03:06 +08:00
Yan Da
9b670cfb9f
Add ReduceOp
2022-05-25 14:15:36 +08:00
Yan Da
1a4fbed25b
Skeleton for the pipeline pass
2022-05-11 16:13:53 +08:00
Yan Da
96876a46d1
More progress on Triton=>TritonGPU conversion (works for matmul)
2022-05-09 21:19:53 +08:00
Yan Da
0c5319eed9
More progress on SCF type conversion
2022-05-05 20:56:55 +08:00
Yan Da
26c59e4718
More on SCF conversion
2022-05-04 21:50:32 +08:00
Yan Da
a96fe07e1c
DotOp conversion
2022-05-04 15:56:24 +08:00
Yan Da
2d281cbc0a
ConstantOp conversion pattern
2022-05-04 15:35:43 +08:00
Yan Da
b9279d2e3b
More progress on TritonGPU conversion
2022-05-04 14:54:31 +08:00
Yan Da
3ad7bee35e
More conversion patterns
2022-05-04 12:50:02 +08:00
Yan Da
5f08e2fdae
More arith patterns
2022-05-02 22:31:29 +08:00
Yan Da
75d32e2442
More on TritonGPU conversion
2022-05-02 21:51:00 +08:00
Yan Da
1428185c9c
More progress on TritonGPUTypeConverter & TritonGPUConversionTarget
2022-05-01 22:06:54 +08:00
Yan Da
4ece9fd1f3
Move dependentDialects from .cpp to .td
2022-05-01 13:06:51 +08:00
Phil Tillet
d9017f8593
add basic template for legalizing arithmetic op
2022-04-30 20:42:25 -07:00