Commit Graph

51 Commits

Author SHA1 Message Date
Philippe Tillet
532e10cf87 [FRONTEND][BACKEND] Clean-up transpositions (#953) 2022-12-06 09:32:13 -08:00
Philippe Tillet
8edfe813a5 [FRONTEND][BACKEND] Added trans instruction; made flash attention bwd pass work (#943) 2022-12-03 09:58:24 -08:00
donproc
521ff9ad74 [TRITON-MLIR][FRONTEND]fix scf.if to run through layernorm tutorial (#938)
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-12-02 17:45:29 +08:00
Keren Zhou
7d90a07d0b [Triton-MLIR][BACKEND] Refactor decompose insert_slice_async (#929)
1. Improve pipline's comment
2. Decompose insert_slice_async when load vector size is not supported
3. Add a test that could fail our gemm code

Copy my comments here:

There's a knob that may cause performance regression when decomposition
has been performed. We should remove this knob once we have thorough
analysis on async wait. Currently, we decompose `insert_slice_async`
into `load` and `insert_slice` without knowing which `async_wait` is
responsible for the `insert_slice_async`. To guarantee correctness, we
blindly set the `async_wait` to wait for all async ops if any `insert_slice_async` has been decomposed.

There are two options to improve this:
1. We can perform a dataflow analysis to find the `async_wait` that is
responsible for the `insert_slice_async` in the backend.
4. We can modify the pipeline to perform the decomposition before the
`async_wait` is inserted. However, it is also risky because we don't
know the correct vectorized shape yet in the pipeline pass. Making the
pipeline pass aware of the vectorization could introduce additional
dependencies on the AxisInfoAnalysis and the Coalesce analysis.
2022-11-30 10:07:34 -08:00
donproc
f63be0e9b5 [TRITON-MLIR][BACKEND]support atomic_cas (#914)
1. support atomics-cas
2. add xchg support in atomic_rmw

Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-11-25 12:02:08 +08:00
donproc
5eee738df7 [Triton-MLIR][FRONTEND] [BACKEND] fix atomics (#879)
minor fix to backend and frontend of atomics, we can pass 1 test without
mask and the shape aligned with CTA size now

Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-11-16 12:25:15 +08:00
Chenggang Zhao
57fd1864a7 [Triton-MLIR] Support FP8 (#864)
Co-authored-by: Superjomn <yanchunwei@outlook.com>
2022-11-10 15:53:06 +08:00
Da Yan
4946167241 [Triton-MLIR] tt.dot operands now must have DotOperand layout; also added prefetch pass prototype (#712)
Co-authored-by: Jokeren <kerenzhou@openai.com>
Co-authored-by: Phil Tillet <phil@openai.com>
Co-authored-by: Superjomn <yanchunwei@outlook.com>
2022-11-10 05:57:27 +00:00
Philippe Tillet
a4ff0c362c [FRONTEND] Fix issues with atomics (#849) 2022-11-06 20:52:11 -08:00
ben-zhang-609
5feb6e24f9 [Triton-MLIR]Add ptx vprintf support (#825)
Not know how to write unit test for this feature.

Co-authored-by: Yan Chunwei <yanchunwei@outlook.com>
2022-11-02 16:39:09 +08:00
Ian Bearman
f2106d0aa2 [BUILD] Fix Warnings and Enable Warnings as Errors (#794) 2022-10-28 12:36:09 -07:00
Philippe Tillet
3e6cc6d66c [FRONTEND] Made more tests pass (#805) 2022-10-26 17:47:33 -07:00
Philippe Tillet
a2cbe7af91 [FRONTEND] Enhanced support for binary operators (#801)
Disabled modulo test (due to change in behavior for `frem` in nvptx
between llvm-11 and llvm-14) and bfloat16 (will require some work to
emulate in software similar to how it's done in `master`)
2022-10-24 19:47:01 -07:00
Philippe Tillet
bb0f9235d1 [OPTIMIZER] Made layout simplification pass efficient for fused attention kernels (#790) 2022-10-21 16:52:15 -07:00
Shintaro Iwasaki
5898352f97 [Triton-IR] Fix LoadOp definition (#771) (#777) 2022-10-13 18:53:00 -07:00
Da Yan
963d031247 [Triton-IR] Fix LoadOp Triton->TritonGPU conversion (#775) 2022-10-13 12:57:39 -07:00
Philippe Tillet
623c99609f [Triton-IR] Added type inference and verifier for Triton-IR operations (#767) 2022-10-11 18:16:41 -07:00
Shintaro Iwasaki
43be75ad42 [FRONTEND] Add scalar type support for some ops (#661)
This PR adds basic support for scalar-type inputs to some ops (cast and pointer arithmetics) for Triton-MLIR. Also renames getelementptr -> addptr
2022-09-15 16:12:52 -07:00
Shintaro Iwasaki
3c635449e5 [Triton] Support math and libdevice ops (#91)
This PR adds basic math ops by using `MathDialect` and `libdevice` ops by using `extern_elementwise`. This is needed to compile some tutorial code (e.g., `softmax`). This PR implements only interface till PTX (so from frontend to TritonGPU-MLIR) 
- Currently till TritonGPU. It cannot be lowered to PTX now.
- No special optimizations (e.g., constant folding etc) are applied.
  - 14.x does not define folders for many operators for math ops, but 15.x seems to increase its coverage: https://github.com/llvm/llvm-project/blob/llvmorg-15.0.0-rc3/mlir/include/mlir/Dialect/Math/IR/MathOps.td
  - No constant folding etc for `libdevice` ops.

```py
import triton
import triton.language as tl
import sys

@triton.jit
def add_kernel(
    x_ptr,
    y_ptr,
    BLOCK_SIZE: tl.constexpr,
):
    offsets = tl.arange(0, BLOCK_SIZE)
    x = tl.load(x_ptr + offsets)
    x = tl.sin(x)
    output = tl.libdevice.sin(x)
    output = tl.libdevice.fdiv_rn(output, output)
    output = tl.libdevice.fmaf_rd(output, output, output)
    tl.store(y_ptr + offsets, output)


if __name__ == "__main__" and len(sys.argv) >= 2:
    signature = "*fp32,*fp32"
    constants = {'BLOCK_SIZE': 1024}
    output = triton.compile(add_kernel, signature, device=0, constants=constants, output="ttgir")
    print(output)
```
->
```llvm
#blocked = #triton_gpu.blocked<{sizePerThread = [1], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>
module attributes {"triton_gpu.num-warps" = 4 : i32} {
  func @add_kernel__Pfp32_Pfp32__2c1024(%arg0: !tt.ptr<f32>, %arg1: !tt.ptr<f32>) {
    %0 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #blocked>
    %1 = tt.splat %arg0 : (!tt.ptr<f32>) -> tensor<1024x!tt.ptr<f32>, #blocked>
    %2 = tt.getelementptr %1, %0 : tensor<1024x!tt.ptr<f32>, #blocked>
    %3 = tt.load %2 {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<1024xf32, #blocked>
    %4 = math.sin %3 : tensor<1024xf32, #blocked>
    %5 = tt.ext_elemwise %4 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_sinf"} : tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
    %6 = tt.ext_elemwise %5, %5 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_fdiv_rn"} : tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
    %7 = tt.ext_elemwise %6, %6, %6 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_fmaf_rd"} : tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
    %8 = tt.splat %arg1 : (!tt.ptr<f32>) -> tensor<1024x!tt.ptr<f32>, #blocked>
    %9 = tt.getelementptr %8, %0 : tensor<1024x!tt.ptr<f32>, #blocked>
    tt.store %9, %7 : tensor<1024xf32, #blocked>
    return
  }
}
```
2022-09-01 16:34:27 -07:00
goostavz
bedbf221c0 [BACKEND] Support optional mask in TritonGPUToLLVM (#80)
Co-authored-by: gzhu <gzhu@nvidia.com>
2022-08-24 17:51:37 -07:00
Shintaro Iwasaki
0ebef11c77 [TritonIR] Make mask operand optional (#74) 2022-08-22 22:00:17 -07:00
Shintaro Iwasaki
9aa00249a6 [TritonIR] make other optional and remove isOtherUnspecified (#67)
[Triton] make other optional and remove isOtherUnspecified
2022-08-18 18:19:55 -07:00
Philippe Tillet
192be76b3c [OPTIMIZER] Rewrite patterns for layout conversions (#64) 2022-08-18 12:49:37 -07:00
Shintaro Iwasaki
d69ce77b19 [FRONTEND] add an attr for masked load without explicit other (#55) 2022-08-18 09:51:37 -07:00
Yan Chunwei
83ef74f248 [BACKEND] Extracting numWarps from tritonGPU module (#39) 2022-08-08 09:40:20 -07:00
Philippe Tillet
78ebbe24c7 [FRONTEND] Added ExpandDimsOp primitive (#36) 2022-08-04 18:41:06 -07:00
Philippe Tillet
d1593e6ca8 [TritonGPU] Improved documentation and semantics of layout encodings (#30) 2022-07-31 13:59:44 -07:00
Philippe Tillet
6d62d88d4f [CI] run clang-format (#24) 2022-07-26 17:25:03 -07:00
Philippe Tillet
a633d2b403 [Analysis] Added Axis Info Analysis (#8) 2022-07-19 13:38:48 -07:00
Yan Da
9d1b5e3f79 special encoding for broadcast 2022-06-18 21:16:45 +08:00
Yan Da
53cf93ce6a Revert "Remove TypeConverter from TritonToTritonGPU conversion"
This reverts commit 64d0b87ef0.
2022-06-18 14:57:41 +08:00
Yan Da
64d0b87ef0 Remove TypeConverter from TritonToTritonGPU conversion 2022-06-18 14:34:59 +08:00
Yan Da
117a402c1b more comments to TypeConverter & update warpTileSize 2022-06-08 16:20:07 +08:00
Yan Da
7b09b5f9e9 the pipeline pass now generates and accepts valid IR 2022-06-07 19:34:59 +08:00
Yan Da
9308e9c90c A more general pipeliner 2022-05-25 21:52:51 +08:00
Yan Da
e6f89a5777 Fix ReduceOp conversion 2022-05-25 16:03:06 +08:00
Yan Da
9b670cfb9f Add ReduceOp 2022-05-25 14:15:36 +08:00
Yan Da
1a4fbed25b Skeleton for the pipeline pass 2022-05-11 16:13:53 +08:00
Yan Da
96876a46d1 More progress on Triton=>TritonGPU conversion (works for matmul) 2022-05-09 21:19:53 +08:00
Yan Da
0c5319eed9 More progress on SCF type conversion 2022-05-05 20:56:55 +08:00
Yan Da
26c59e4718 More on SCF conversion 2022-05-04 21:50:32 +08:00
Yan Da
a96fe07e1c DotOp conversion 2022-05-04 15:56:24 +08:00
Yan Da
2d281cbc0a ConstantOp conversion pattern 2022-05-04 15:35:43 +08:00
Yan Da
b9279d2e3b More progress on TritonGPU conversion 2022-05-04 14:54:31 +08:00
Yan Da
3ad7bee35e More conversion patterns 2022-05-04 12:50:02 +08:00
Yan Da
5f08e2fdae More arith patterns 2022-05-02 22:31:29 +08:00
Yan Da
75d32e2442 More on TritonGPU conversion 2022-05-02 21:51:00 +08:00
Yan Da
1428185c9c More progress on TritonGPUTypeConverter & TritonGPUConversionTarget 2022-05-01 22:06:54 +08:00
Yan Da
4ece9fd1f3 Move dependentDialects from .cpp to .td 2022-05-01 13:06:51 +08:00
Phil Tillet
d9017f8593 add basic template for legalizing arithmetic op 2022-04-30 20:42:25 -07:00