Commit Graph

24 Commits

Author SHA1 Message Date
Chenggang Zhao
516a241234 [Triton-MLIR] Fix some typos (#874)
Fix some typos
2022-11-13 18:15:53 -08:00
Da Yan
4946167241 [Triton-MLIR] tt.dot operands now must have DotOperand layout; also added prefetch pass prototype (#712)
Co-authored-by: Jokeren <kerenzhou@openai.com>
Co-authored-by: Phil Tillet <phil@openai.com>
Co-authored-by: Superjomn <yanchunwei@outlook.com>
2022-11-10 05:57:27 +00:00
Ian Bearman
f2106d0aa2 [BUILD] Fix Warnings and Enable Warnings as Errors (#794) 2022-10-28 12:36:09 -07:00
Philippe Tillet
bb0f9235d1 [OPTIMIZER] Made layout simplification pass efficient for fused attention kernels (#790) 2022-10-21 16:52:15 -07:00
Shintaro Iwasaki
3c635449e5 [Triton] Support math and libdevice ops (#91)
This PR adds basic math ops by using `MathDialect` and `libdevice` ops by using `extern_elementwise`. This is needed to compile some tutorial code (e.g., `softmax`). This PR implements only the interface down to TritonGPU-MLIR (i.e., from the frontend to TritonGPU-MLIR):
- Lowering currently stops at TritonGPU; it cannot be lowered to PTX yet.
- No special optimizations (e.g., constant folding) are applied.
  - LLVM 14.x does not define folders for many math ops, but 15.x seems to increase coverage: https://github.com/llvm/llvm-project/blob/llvmorg-15.0.0-rc3/mlir/include/mlir/Dialect/Math/IR/MathOps.td
  - No constant folding, etc., for `libdevice` ops.

```py
import triton
import triton.language as tl
import sys

@triton.jit
def add_kernel(
    x_ptr,
    y_ptr,
    BLOCK_SIZE: tl.constexpr,
):
    offsets = tl.arange(0, BLOCK_SIZE)
    x = tl.load(x_ptr + offsets)
    x = tl.sin(x)  # math op, lowered via MathDialect (math.sin)
    output = tl.libdevice.sin(x)  # libdevice op, lowered via extern_elementwise (__nv_sinf)
    output = tl.libdevice.fdiv_rn(output, output)  # __nv_fdiv_rn
    output = tl.libdevice.fmaf_rd(output, output, output)  # __nv_fmaf_rd
    tl.store(y_ptr + offsets, output)


if __name__ == "__main__" and len(sys.argv) >= 2:
    signature = "*fp32,*fp32"
    constants = {'BLOCK_SIZE': 1024}
    output = triton.compile(add_kernel, signature, device=0, constants=constants, output="ttgir")
    print(output)
```
->
```mlir
#blocked = #triton_gpu.blocked<{sizePerThread = [1], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>
module attributes {"triton_gpu.num-warps" = 4 : i32} {
  func @add_kernel__Pfp32_Pfp32__2c1024(%arg0: !tt.ptr<f32>, %arg1: !tt.ptr<f32>) {
    %0 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #blocked>
    %1 = tt.splat %arg0 : (!tt.ptr<f32>) -> tensor<1024x!tt.ptr<f32>, #blocked>
    %2 = tt.getelementptr %1, %0 : tensor<1024x!tt.ptr<f32>, #blocked>
    %3 = tt.load %2 {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<1024xf32, #blocked>
    %4 = math.sin %3 : tensor<1024xf32, #blocked>
    %5 = tt.ext_elemwise %4 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_sinf"} : tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
    %6 = tt.ext_elemwise %5, %5 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_fdiv_rn"} : tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
    %7 = tt.ext_elemwise %6, %6, %6 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_fmaf_rd"} : tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
    %8 = tt.splat %arg1 : (!tt.ptr<f32>) -> tensor<1024x!tt.ptr<f32>, #blocked>
    %9 = tt.getelementptr %8, %0 : tensor<1024x!tt.ptr<f32>, #blocked>
    tt.store %9, %7 : tensor<1024xf32, #blocked>
    return
  }
}
```
2022-09-01 16:34:27 -07:00
Philippe Tillet
192be76b3c [OPTIMIZER] Rewrite patterns for layout conversions (#64) 2022-08-18 12:49:37 -07:00
Philippe Tillet
d1593e6ca8 [TritonGPU] Improved documentation and semantics of layout encodings (#30) 2022-07-31 13:59:44 -07:00
Philippe Tillet
6d62d88d4f [CI] run clang-format (#24) 2022-07-26 17:25:03 -07:00
Yan Da
9d1b5e3f79 special encoding for broadcast 2022-06-18 21:16:45 +08:00
Yan Da
53cf93ce6a Revert "Remove TypeConverter from TritonToTritonGPU conversion"
This reverts commit 64d0b87ef0.
2022-06-18 14:57:41 +08:00
Yan Da
64d0b87ef0 Remove TypeConverter from TritonToTritonGPU conversion 2022-06-18 14:34:59 +08:00
Yan Da
117a402c1b more comments to TypeConverter & update warpTileSize 2022-06-08 16:20:07 +08:00
Yan Da
7b09b5f9e9 the pipeline pass now generates and accepts valid IR 2022-06-07 19:34:59 +08:00
Yan Da
7807f64ef3 rename sharded_layout => blocked_layout 2022-06-05 16:14:59 +08:00
Da Yan
e36a54eb86 more progress on the definition of layouts 2022-05-31 11:43:21 +00:00
Yan Da
9308e9c90c A more general pipeliner 2022-05-25 21:52:51 +08:00
Yan Da
9b670cfb9f Add ReduceOp 2022-05-25 14:15:36 +08:00
Yan Da
96876a46d1 More progress on Triton=>TritonGPU conversion (works for matmul) 2022-05-09 21:19:53 +08:00
Yan Da
0c5319eed9 More progress on SCF type conversion 2022-05-05 20:56:55 +08:00
Yan Da
26c59e4718 More on SCF conversion 2022-05-04 21:50:32 +08:00
Yan Da
b9279d2e3b More progress on TritonGPU conversion 2022-05-04 14:54:31 +08:00
Yan Da
3ad7bee35e More conversion patterns 2022-05-04 12:50:02 +08:00
Yan Da
75d32e2442 More on TritonGPU conversion 2022-05-02 21:51:00 +08:00
Yan Da
1428185c9c More progress on TritonGPUTypeConverter & TritonGPUConversionTarget 2022-05-01 22:06:54 +08:00