Commit Graph

225 Commits

Author SHA1 Message Date
goostavz
15bfd0cb79 [BACKEND] Support of ConvertLayoutOp from blocked to blocked and SliceLayout with blocked parent (#658) 2022-09-17 14:58:42 -07:00
Shintaro Iwasaki
13669b46a6 [DOCS] Correct spelling (#665)
This PR corrects spelling, as #664 did, for Triton-MLIR. It should not break anything.
2022-09-16 15:07:34 -07:00
Philippe Tillet
80e3fb5270 [CI] Now using clang-format from pip (#662) 2022-09-15 16:24:37 -07:00
Shintaro Iwasaki
43be75ad42 [FRONTEND] Add scalar type support for some ops (#661)
This PR adds basic support for scalar-type inputs to some ops (cast and pointer arithmetic) for Triton-MLIR. It also renames `getelementptr` -> `addptr`.
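For illustration, a minimal sketch of the kind of scalar usage this enables (the kernel and pointer names below are made up for this example, not taken from the PR):

```py
import triton
import triton.language as tl

@triton.jit
def scalar_kernel(x_ptr, y_ptr):
    # scalar pointer arithmetic: addptr on a single (non-tensor) pointer
    x = tl.load(x_ptr + 1)
    # scalar cast from fp32 to fp16 (assumes y_ptr points to fp16)
    y = x.to(tl.float16)
    tl.store(y_ptr, y)
```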
2022-09-15 16:12:52 -07:00
Shintaro Iwasaki
297d27e1c8 [Triton-MLIR] add GitHub CI runners (#655)
This PR adds GitHub Actions runners to the CI for better coverage.
2022-09-14 23:09:56 -07:00
Keren Zhou
16aed94ff5 [Analysis/Allocation] Allocation passes now assume that slices always alias (#108)
The code in this branch assumes that the `src` operand of
`insert_slice_async` always aliases the result, which does not hold in
general but is a workaround to make the pipeline pass work.

I'm also working on the complete analysis in another
[branch](https://github.com/openai/triton-mlir/tree/keren/analyze-slice).
2022-09-09 12:03:41 -07:00
Yan Chunwei
2a852044d9 [BACKEND] Add C++ tests for PTXFormat and some tiny refinements (#109)
This PR does the following:

1. Adds some C++ tests for `PTXFormat`
2. Enhances the functionality of `PTXFormat` so that a `PTXInstr` instance
can be called multiple times, similar to a C function.
2022-09-09 09:15:07 -07:00
Philippe Tillet
a0bab9748e [OPTIMIZER] Coalesce pass no longer takes a num-warps argument (#99)
Improved the design to avoid an inconsistent `num-warps` value between the pass and the parent module of the operation it processes.
2022-09-05 18:09:02 -07:00
Shintaro Iwasaki
3c635449e5 [Triton] Support math and libdevice ops (#91)
This PR adds basic math ops using `MathDialect` and `libdevice` ops using `extern_elementwise`. This is needed to compile some tutorial code (e.g., `softmax`). This PR implements only the interface down to TritonGPU-MLIR, not the lowering to PTX:
- Currently only down to TritonGPU; it cannot be lowered to PTX yet.
- No special optimizations (e.g., constant folding) are applied.
  - LLVM 14.x does not define folders for many math ops, but 15.x seems to increase coverage: https://github.com/llvm/llvm-project/blob/llvmorg-15.0.0-rc3/mlir/include/mlir/Dialect/Math/IR/MathOps.td
  - No constant folding for `libdevice` ops.

```py
import triton
import triton.language as tl
import sys

@triton.jit
def add_kernel(
    x_ptr,
    y_ptr,
    BLOCK_SIZE: tl.constexpr,
):
    offsets = tl.arange(0, BLOCK_SIZE)
    x = tl.load(x_ptr + offsets)
    x = tl.sin(x)
    output = tl.libdevice.sin(x)
    output = tl.libdevice.fdiv_rn(output, output)
    output = tl.libdevice.fmaf_rd(output, output, output)
    tl.store(y_ptr + offsets, output)


if __name__ == "__main__" and len(sys.argv) >= 2:
    signature = "*fp32,*fp32"
    constants = {'BLOCK_SIZE': 1024}
    output = triton.compile(add_kernel, signature, device=0, constants=constants, output="ttgir")
    print(output)
```
->
```llvm
#blocked = #triton_gpu.blocked<{sizePerThread = [1], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>
module attributes {"triton_gpu.num-warps" = 4 : i32} {
  func @add_kernel__Pfp32_Pfp32__2c1024(%arg0: !tt.ptr<f32>, %arg1: !tt.ptr<f32>) {
    %0 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #blocked>
    %1 = tt.splat %arg0 : (!tt.ptr<f32>) -> tensor<1024x!tt.ptr<f32>, #blocked>
    %2 = tt.getelementptr %1, %0 : tensor<1024x!tt.ptr<f32>, #blocked>
    %3 = tt.load %2 {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<1024xf32, #blocked>
    %4 = math.sin %3 : tensor<1024xf32, #blocked>
    %5 = tt.ext_elemwise %4 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_sinf"} : tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
    %6 = tt.ext_elemwise %5, %5 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_fdiv_rn"} : tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
    %7 = tt.ext_elemwise %6, %6, %6 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_fmaf_rd"} : tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
    %8 = tt.splat %arg1 : (!tt.ptr<f32>) -> tensor<1024x!tt.ptr<f32>, #blocked>
    %9 = tt.getelementptr %8, %0 : tensor<1024x!tt.ptr<f32>, #blocked>
    tt.store %9, %7 : tensor<1024xf32, #blocked>
    return
  }
}
```
2022-09-01 16:34:27 -07:00
Keren Zhou
328b87aec6 Keren/tensor slice insert alloc (#94)
This branch defines three new triton_gpu operations to partially solve #87. Below is an overview:

```
%tensor = triton_gpu.alloc_tensor : tensor<2x16x16xf16, #A>
%b = triton_gpu.insert_slice_async %a_ptr, %tensor, %offset {axis = 0 : i32, cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<16x16x!tt.ptr<f16>, #AL> -> tensor<2x16x16xf16, #A>
%c = triton_gpu.extract_slice %b, %offset {axis = 0 : i32} : tensor<2x16x16xf16, #A> -> tensor<16x16xf16, #A>
```

We plan to fully replace `copy_async` with `insert_slice_async`. **This hasn't been done yet.**
2022-09-01 12:37:17 -07:00
Shintaro Iwasaki
d01353de07 [CI] add assert-enabled MLIR option (#78)
This deprecates the use of the release-build LLVM hosted by the LLVM project, whose disabled assertions make debugging harder for developers.

This PR implements the following solution:
1. Create LLVM release tarballs with assertions enabled ourselves (using Docker)
2. Host them in our own GitHub repositories
3. Use our LLVM for CI and/or development if `TRITON_USE_ASSERT_ENABLED_LLVM=1` is set.
2022-08-31 18:55:32 -07:00
Keren Zhou
02ebf24d35 Analyze shared memory alias (#81)
The purpose of this PR is to analyze shared memory aliases so that we can
fix memory allocation bugs and save memory allocations in Triton code
involving complex control flow.

Changes to memory barriers and allocation are on the way.

Co-authored-by: Philippe Tillet <phil@openai.com>
2022-08-29 10:43:20 -07:00
Shintaro Iwasaki
84aa7d025a [TritonIR] simplify Load/StoreOps when mask is true/false (#79)
* [TritonIR] fix Load/Store/CopyAsyncOp's parsers

* [TritonIR] simplify Load/StoreOps when mask is true/false

* [TEST] adds tests to check load/store simplification
2022-08-24 12:55:49 -07:00
Yan Chunwei
1b513c9866 [BACKEND] Refactoring codegen for LoadOp with PTXFormat (#77)
This PR does the following:

- Enhances `PTXFormat` by:
  - introducing `PTXBuilder` to enable multiple instructions in a single asm program
  - overriding `PTXInstr`'s `operator()` method to enable the `instr(opr0, opr1)` style of setting operands for an instruction
- Refactors the PTX code used in `LoadOpConversion` with `PTXFormat`

Authored-by: goostavz <gzhu@nvidia.com>
2022-08-23 15:51:13 -07:00
Shintaro Iwasaki
0ebef11c77 [TritonIR] Make mask operand optional (#74) 2022-08-22 22:00:17 -07:00
Yan Chunwei
10ba51c3bb [FRONTEND] add python e2e launch empty kernel test (#68) 2022-08-19 10:46:01 -07:00
Shintaro Iwasaki
9aa00249a6 [TritonIR] make other optional and remove isOtherUnspecified (#67)
[Triton] make other optional and remove isOtherUnspecified
2022-08-18 18:19:55 -07:00
Philippe Tillet
192be76b3c [OPTIMIZER] Rewrite patterns for layout conversions (#64) 2022-08-18 12:49:37 -07:00
Keren Zhou
e0bedeb44c [BACKEND] Keren/shared memory barrier (#59) 2022-08-18 12:32:57 -07:00
Da Yan
8776ad1a0e [OPTIMIZER] Let the pipeline pass insert async wait. (#63) 2022-08-18 10:31:57 -07:00
Shintaro Iwasaki
d69ce77b19 [FRONTEND] add an attr for masked load without explicit other (#55) 2022-08-18 09:51:37 -07:00
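For context on the masked-load change above, a minimal sketch of a masked load that omits `other` (the kernel below is illustrative, not from the PR):

```py
import triton
import triton.language as tl

@triton.jit
def masked_copy_kernel(x_ptr, y_ptr, n, BLOCK_SIZE: tl.constexpr):
    offsets = tl.arange(0, BLOCK_SIZE)
    mask = offsets < n
    # masked load with no explicit `other`: out-of-bounds lanes are left
    # unspecified instead of being filled with a default value
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(y_ptr + offsets, x, mask=mask)
```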
Yan Chunwei
b1673caaf6 [FRONTEND] Expose end-to-end compile to python frontend (#58) 2022-08-17 10:42:48 -07:00
Yan Chunwei
95bbac41e7 [BACKEND] Add LLVM-translation for store and splat ops (#47) 2022-08-15 00:46:37 -07:00
goostavz
993ba7035a [BACKEND] Codegen bringup, index calculation of blocked_layout & support of LoadOp, BroadcastOp, ViewOp & MakeRangeOp (#38)
Co-authored-by: gzhu <gzhu@nvidia.com>
2022-08-14 19:58:59 -07:00
Shintaro Iwasaki
2ba9a83465 [BUILD] fix minor issues with MLIR assert enabled (#46) 2022-08-11 21:20:47 -07:00
Philippe Tillet
3a48ca0d4d [BUILD] Fix includes (#49) 2022-08-11 11:49:29 -07:00
Yan Chunwei
83ef74f248 [BACKEND] Extracting numWarps from tritonGPU module (#39) 2022-08-08 09:40:20 -07:00
Yan Chunwei
920723cf3d [BACKEND] add triton-translate to translate mlir to llvmir or PTX code (#37) 2022-08-07 22:34:36 -07:00
Philippe Tillet
78ebbe24c7 [FRONTEND] Added ExpandDimsOp primitive (#36) 2022-08-04 18:41:06 -07:00
Keren Zhou
a7b49b3227 [BACKEND] Memory allocation (#33) 2022-08-04 11:22:49 -07:00
Yan Chunwei
b988bae813 Init TritonGPU to LLVM dialect conversion (#32)
* add toLLVM pass

* update num-warps setting in mlir
2022-08-04 10:15:45 +08:00
Philippe Tillet
3236642e8f [OPTIMIZER] Added memory coalescing pass (#31) 2022-07-31 20:59:31 -07:00
Philippe Tillet
d1593e6ca8 [TritonGPU] Improved documentation and semantics of layout encodings (#30) 2022-07-31 13:59:44 -07:00
Philippe Tillet
6d62d88d4f [CI] run clang-format (#24) 2022-07-26 17:25:03 -07:00
Philippe Tillet
3265e0df5a [PYTHON] Cleaned up legacy code; added simple standalone compilation API (#22) 2022-07-26 11:06:45 -07:00
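A minimal sketch of invoking a standalone compilation API, mirroring the `triton.compile` call shown in the math/libdevice commit above; the exact signature at the time of this commit may have differed:

```py
import triton
import triton.language as tl

@triton.jit
def copy_kernel(x_ptr, y_ptr, BLOCK_SIZE: tl.constexpr):
    offsets = tl.arange(0, BLOCK_SIZE)
    tl.store(y_ptr + offsets, tl.load(x_ptr + offsets))

# Compile ahead of time, without launching a kernel, and dump the TritonGPU IR.
ttgir = triton.compile(copy_kernel, "*fp32,*fp32", device=0,
                       constants={"BLOCK_SIZE": 1024}, output="ttgir")
print(ttgir)
```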
Philippe Tillet
27c9f3d8cb [FRONTEND] Added comment on TensorSizeTrait::maxElement (#20) 2022-07-25 01:18:45 -07:00
Philippe Tillet
a633d2b403 [Analysis] Added Axis Info Analysis (#8) 2022-07-19 13:38:48 -07:00
Yan Da
9d1b5e3f79 special encoding for broadcast 2022-06-18 21:16:45 +08:00
Yan Da
53cf93ce6a Revert "Remove TypeConverter from TritonToTritonGPU conversion"
This reverts commit 64d0b87ef0.
2022-06-18 14:57:41 +08:00
Yan Da
64d0b87ef0 Remove TypeConverter from TritonToTritonGPU conversion 2022-06-18 14:34:59 +08:00
Yan Da
560e29229b register conversion in triton-opt 2022-06-07 19:33:51 +08:00
Yan Da
366dddc3bc update mma encoding & triton-opt 2022-06-06 21:03:58 +08:00
Yan Da
7807f64ef3 rename sharded_layout => blocked_layout 2022-06-05 16:14:59 +08:00
Da Yan
e36a54eb86 more progress on the definition of layouts 2022-05-31 11:43:21 +00:00
Yan Da
441fd7c3cc assembly format 2022-05-25 17:53:24 +08:00
Yan Da
9b670cfb9f Add ReduceOp 2022-05-25 14:15:36 +08:00
Yan Da
a2c9f919a8 TritonGPU verifier 2022-05-24 19:48:56 +08:00
Yan Da
36c45ec687 make numStages an option in PipelinePass 2022-05-23 12:47:55 +08:00
Yan Da
c3c4ac3733 TritonGPU combiner 2022-05-16 19:17:15 +08:00
Yan Da
e3916c3a46 TritonGPU combiner 2022-05-16 19:16:01 +08:00