Commit Graph

109 Commits

Author SHA1 Message Date
Keren Zhou
6c5f646f4e [WIP][Triton-MLIR] Prefetch pass fixup (#873)
A (potential) problem caused by directly adopting `tensor.extract_slice`.

Long story short, `tensor.extract_slice` is not aware of swizzling.
Consider the following shared memory tensor and its first three slices,
where each slice includes two tiles (the loading unit of LDGSTS) of
elements. Currently, the tiles haven't been swizzled yet, so slicing
seems to work.

<img width="1219" alt="image"
src="https://user-images.githubusercontent.com/2306281/201833023-a7950705-2d50-4c0a-8527-7505261c3a3c.png">

However, now consider the following figure, which is the layout after
applying swizzling on the first figure.

<img width="1244" alt="image"
src="https://user-images.githubusercontent.com/2306281/201834824-7daae360-f5bc-4e6b-a921-20be3f294b78.png">

Note that in phase 2, all tiles have been swizzled out of their
original slices. This implies that if we use the tile index after
slicing, we can no longer locate the correct tiles. For example, T3 was
in slice 1 but got swapped to slice 0 after swizzling.

Here's a more detailed explanation. In the current `triton-mlir` branch,
we only compute the relative offset of each tile. So T3's index in Slice
1 is *1*, and it will be swizzled using *1* and *phase id*. Whereas the
correct index of T3 should be *3*, which is the relative offset to the
beginning of the shared memory tensor being swizzled, and T3 should be
swizzled using *3* and *phase id*.
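
The mismatch can be illustrated with a toy model. The sketch below assumes a simple XOR swizzle (`tile_index ^ phase`) and two tiles per slice; the helper names are hypothetical, and the actual swizzling function in the `triton-mlir` branch may differ.

```python
TILES_PER_SLICE = 2  # assumption: each slice holds two tiles, as in the figures


def swizzle(tile_index: int, phase: int) -> int:
    """Toy XOR swizzle; a stand-in for the real swizzling function."""
    return tile_index ^ phase


def slice_of(position: int) -> int:
    """Slice that owns a given tile position in the full tensor."""
    return position // TILES_PER_SLICE


phase = 2
absolute_index = 3                                  # T3 in the full tensor
relative_index = absolute_index % TILES_PER_SLICE   # T3 inside slice 1

correct = swizzle(absolute_index, phase)  # swizzled with 3 and the phase id
wrong = swizzle(relative_index, phase)    # swizzled with 1 and the phase id

print(correct, slice_of(correct))  # position and slice of T3 after swizzling
print(wrong)                       # position derived from the relative index
```

Under this toy model the two indices give different positions, and the position computed from the absolute index falls in slice 0, which matches the T3 example above.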

This PR proposes a hacky solution for this problem. We restore the
"correct" offset of each tile by **assuming that slicing on a specific
dim only happens at most once on the output of insert_slice_async**. I
admit it's risky and fragile.
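
One way to picture the workaround is to remember the base offset of the single slice taken along each dimension and add it back before swizzling. The class and method names below are hypothetical, and the XOR swizzle is only a placeholder for the real function.

```python
class SlicedView:
    """Hypothetical view that remembers where it starts in the parent tensor."""

    def __init__(self, rank: int):
        self.base = [0] * rank        # per-dimension offset into the parent
        self.sliced = [False] * rank  # whether a dimension was already sliced

    def slice_dim(self, dim: int, offset: int) -> "SlicedView":
        # Mirrors the stated assumption: at most one slice per dimension.
        if self.sliced[dim]:
            raise ValueError(f"dimension {dim} was already sliced")
        view = SlicedView(len(self.base))
        view.base = list(self.base)
        view.sliced = list(self.sliced)
        view.base[dim] = offset
        view.sliced[dim] = True
        return view

    def absolute_index(self, dim: int, relative: int) -> int:
        # Restore the index relative to the start of the parent tensor.
        return self.base[dim] + relative


def swizzle(tile_index: int, phase: int) -> int:
    return tile_index ^ phase  # placeholder swizzle, not the real function


tensor = SlicedView(rank=2)
slice1 = tensor.slice_dim(dim=1, offset=2)  # slice 1 starts at tile 2
t3 = slice1.absolute_index(dim=1, relative=1)
print(t3, swizzle(t3, phase=2))
```

Slicing the same dimension a second time raises an error in this sketch, which makes the fragility of the assumption explicit.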

Another possible solution is to adopt cutlass' swizzling logic, which
limits the indices being swizzled to a "bounding box" that matches what the
mma instruction executes. For example, in the following tensor layout,
each 4x4 submatrix is a minimum swizzling unit, and the entire tensor
represents the tensor layout of operand A in `mma.16816`.

<img width="565" alt="image"
src="https://user-images.githubusercontent.com/2306281/201836879-4ca7824b-530c-4a06-a3d5-1e74a2de1b42.png">

Co-authored-by: Phil Tillet <phil@openai.com>
2022-11-19 19:57:16 -08:00
Yan Chunwei
1eedaf7bec [Triton-MLIR][BACKEND] adapt DotOp layout for FMADot (#872) 2022-11-14 16:56:30 +08:00
Chenggang Zhao
516a241234 [Triton-MLIR] Fix some typos (#874)
Fix some typos
2022-11-13 18:15:53 -08:00
Philippe Tillet
f40c63fb03 [Triton-MLIR][OPTIMIZER] Cleaned up swizzling (#869)
Swizzling is no longer implemented as a separate pass. It is instead
done in a specialized constructor of SharedEncodingAttr, and tested via
google tests instead of triton-opt + filecheck.

In the future we may want to implement it as a pass again once we have
an additional dialect between TritonGPU and LLVM.
2022-11-10 12:05:46 -08:00
Philippe Tillet
2aa538ec2e [BACKEND] Added support for mma layouts in reductions (#863)
Validated hackily by manually modifying the reduction .ttgir in my local
cache. There will be a follow-up PR adding some better testing
infrastructure to test out conversions and reductions on arbitrary
layouts.
2022-11-10 09:58:07 -08:00
Chenggang Zhao
57fd1864a7 [Triton-MLIR] Support FP8 (#864)
Co-authored-by: Superjomn <yanchunwei@outlook.com>
2022-11-10 15:53:06 +08:00
Da Yan
4946167241 [Triton-MLIR] tt.dot operands now must have DotOperand layout; also added prefetch pass prototype (#712)
Co-authored-by: Jokeren <kerenzhou@openai.com>
Co-authored-by: Phil Tillet <phil@openai.com>
Co-authored-by: Superjomn <yanchunwei@outlook.com>
2022-11-10 05:57:27 +00:00
Yan Chunwei
0c87360657 [Triton-MLIR][Backend] Port FMADot conversion for DotOp (#844)
Co-authored-by: ben-zhang-609 <benzh609@gmail.com>
2022-11-09 12:57:50 +08:00
Yan Chunwei
de5b84c476 [Triton-MLIR][Backend] Fix mma<v2> int8 precision error (#850)
Fix mma.16816 s8 precision error

Co-authored-by: ben-zhang-609 <benzh609@gmail.com>
2022-11-09 12:23:43 +08:00
Da Yan
137344946f [OPTIMIZER] Fix the load-mask issue with the pipeline pass (#857) 2022-11-08 09:29:53 -08:00
Keren Zhou
fdd59900f7 [Triton-MLIR] Replace triton.extract_slice with tensor.extract_slice and support more general tensor slicing (#837)
## Features

- Allow taking a block slice of a tensor, as long as each dimension is
contiguous (unit stride).
- Fix some problems in `insert_slice_async`'s semantics.
- More general verification for ops that return shared layout encoding.

## Known Limitations

- `insert_slice_async` still uses the old semantics. May submit another
PR later to support semantics similar to `tensor.extract_slice`.
- No encoding verification for `tensor.extract_slice`.
- 3d tensor ops are broken.
- Strided accesses are not allowed.
- May cause a slight performance slowdown since strides are passed
as values rather than constants (e.g., int).
It would be difficult to pass strides as attributes in the presence of
control flow. A block argument may receive tensors with different
strides.
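
To make the unit-stride restriction concrete, the sketch below computes the linear offsets of a 2D block slice from its starting offsets and the parent's runtime strides. It is a plain-Python model with hypothetical names, not the lowering used in the PR.

```python
from typing import List, Sequence


def block_slice_offsets(
    offsets: Sequence[int],
    sizes: Sequence[int],
    parent_strides: Sequence[int],
) -> List[int]:
    """Linear offsets (in elements) of a 2D block slice within its parent.

    The slice step is 1 along every dimension (no strided access); the
    parent strides are ordinary runtime values rather than constants.
    """
    if not (len(offsets) == len(sizes) == len(parent_strides) == 2):
        raise ValueError("this sketch only models 2D tensors")
    base = offsets[0] * parent_strides[0] + offsets[1] * parent_strides[1]
    return [
        base + i * parent_strides[0] + j * parent_strides[1]
        for i in range(sizes[0])
        for j in range(sizes[1])
    ]


# A 2x2 block starting at (1, 2) in a row-major 4x8 tensor (strides 8, 1).
print(block_slice_offsets(offsets=(1, 2), sizes=(2, 2), parent_strides=(8, 1)))
```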
2022-11-06 22:59:03 -08:00
Philippe Tillet
a4ff0c362c [FRONTEND] Fix issues with atomics (#849) 2022-11-06 20:52:11 -08:00
Philippe Tillet
91a9773b38 [OPTIMIZER] Minor bugfixes that affected matmul codegen performance (#834) 2022-11-02 22:58:09 -07:00
Philippe Tillet
12d60cb4a3 [BACKEND] Added support for 1D conversion blocked -> slice (#831) 2022-11-01 13:19:58 -07:00
Yan Chunwei
031c2ae77b [Triton-MLIR][BACKEND] Port the mma<v1> conversion (#815)
This PR does

- port the mma<v1> related code, and support dot conversion and
convert_layout[shared->dot_op<mma<v1>>]
- add a lit test for dot v1
2022-11-01 09:42:14 +08:00
Philippe Tillet
e61dc75942 [FRONTEND] Fixed inliner and got more tests to pass (#822)
This adds a `DialectInlinerInterface` to the Triton dialect. This, along
with a few other minor semantic changes, fixes our tests on call
instructions. Also added the option to provide an "LLVM_SYSPATH"
environment variable to link against a local build of LLVM; this was
useful for debugging this issue.
2022-10-30 14:10:02 -07:00
Ian Bearman
f2106d0aa2 [BUILD] Fix Warnings and Enable Warnings as Errors (#794) 2022-10-28 12:36:09 -07:00
Philippe Tillet
ac0f6793cc [BACKEND] Added support for scalars in LoadOp / StoreOp / ElementwiseOp (#814)
Also fixed various errors that showed up in `test_core.py`, and added more TODOs for open (hopefully relatively minor) issues
2022-10-28 16:17:55 +08:00
Keren Zhou
3b80801dff [Triton-MLIR][Backend] Fix many problems to get the pipeline working (#809)
1. Rewrite code generation of insert_slice_async.
2. Correct the wrong index passed to extract_slice in pipeline.
3. Add a prologue in pipeline to wait for dangling cp.asyncs.  
4. Move scf to cf conversion inside TritonGPUToLLVM because we need to
perform membar before scf to cf. It shouldn't be a technical limitation
and could be improved by a more general membar analysis.
5. Use an attribute to memoize the shared memory size and support
dynamic shared memory.
6. Prevent the combine pass from reordering insert_slice and extract_slice
across async_wait.

Co-authored-by: Superjomn <yanchunwei@outlook.com>
2022-10-27 22:09:06 -07:00
Qingyi Liu
42db3538e4 [Triton-MLIR][Backend] Add ReduceOpConversion into TritonGPUToLLVM conversion (#774)
What is done in this PR:
- [x] Add `ConvertLayout`, `getSizePerThread` and `getShapePerCTA`
implementation for `SliceEncodingAttr`
- [x] Split `emitIndices` into two phases:
`emitBaseIndexForBlockedLayout` and `emitOffsetForBlockedLayout`
- [x] Add `ReduceOpConversion::matchAndRewriteBasic` implementation
- [x] Add `ReduceOpConversion::matchAndRewriteFast` implementation with
ptx instruction `shfl.sync`
- [x] Add support for scalar value in `StoreOpConversion`
- [x] Add Reduce1d and Reduce2d unit tests and pass all unit tests
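
As a rough picture of a shuffle-based reduction, the following simulates a butterfly (XOR) exchange across one 32-lane warp in plain Python. Which `shfl.sync` mode and reduction order the PR uses is not stated here, so the butterfly pattern is an assumption.

```python
import operator
from typing import Callable, List

WARP_SIZE = 32


def shfl_xor(values: List[float], lane_mask: int) -> List[float]:
    """Model of a butterfly shuffle: lane i reads the value of lane i ^ lane_mask."""
    return [values[lane ^ lane_mask] for lane in range(len(values))]


def warp_reduce(
    values: List[float], op: Callable[[float, float], float]
) -> List[float]:
    """Reduce across a warp in log2(WARP_SIZE) exchange steps."""
    if len(values) != WARP_SIZE:
        raise ValueError("expected one value per lane")
    offset = WARP_SIZE // 2
    while offset > 0:
        partner = shfl_xor(values, offset)
        values = [op(a, b) for a, b in zip(values, partner)]
        offset //= 2
    return values  # every lane now holds the full reduction


lanes = [float(i) for i in range(WARP_SIZE)]
print(warp_reduce(lanes, operator.add)[0])
```

With the butterfly pattern every lane ends up with the reduced value, so no separate broadcast step is needed in this model.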

Co-authored-by: Qingyi Liu <liuqingyi1993@gmail.com>
2022-10-28 11:07:45 +08:00
Philippe Tillet
3e6cc6d66c [FRONTEND] Made more tests pass (#805) 2022-10-26 17:47:33 -07:00
Yan Chunwei
877844de4f [Triton-MLIR][BACKEND] add convert_layout[shared->dot_op] conversion to adapt DotOperand layout (#786)
This PR helps to

1. Adapt the existing DotOp conversion to the design of the new
DotOperand layout.
2. Make the DotOp conversion work with both shared-layout inputs and
DotOperand-layout inputs, in preparation for the upstream switch.
2022-10-24 11:40:13 +08:00
Philippe Tillet
bb0f9235d1 [OPTIMIZER] Made layout simplification pass efficient for fused attention kernels (#790) 2022-10-21 16:52:15 -07:00
Philippe Tillet
dc0588a898 [OPTIMIZER] Improved layout simplification pass so it handles swizzled layouts better (#789)
Note: commented out `test_gemm`, since the backend has an issue with swizzling. It will be uncommented in a subsequent PR.
2022-10-20 19:03:37 -07:00
Shintaro Iwasaki
0d22d2bc03 [TritonMLIR] Disallow 0D tensor (#788) 2022-10-19 10:34:32 -07:00
Yan Chunwei
4464646efb [Triton-MLIR][BACKEND] Fix masked load store op vector size (#785)
Correct the Load/Store ops' vector size so that the mask's alignment
is taken into account.

Some cases:

```mlir
// num_warp = 2
// block_size = 128
func @vecadd_mask_align_16(%a_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %b_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, 
  %out_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %n_elements: i32 {tt.divisibility = 16 : i32}) {
    // mask = make_range(128) < n_element
}
```
This should get the vec=2 `ld`/`st` instructions.

While the following example

```mlir
// num_warp = 2
// block_size = 128
func @vecadd_mask_align_16(%a_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %b_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, 
  %out_ptr: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %n_elements: i32) {
    // mask = make_range(128) < n_element
}
```
it should get the vec=1 `ld`/`st` instructions.
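
One plausible reading of the two cases is that the vector width is capped by the elements each thread owns, the pointer alignment, and the mask alignment. The function below is a sketch under that assumption; its name, parameters, and formula are hypothetical and are not taken from the backend code.

```python
WARP_SIZE = 32


def masked_vector_size(
    block_size: int,
    num_warps: int,
    elem_bytes: int,
    ptr_divisibility_bytes: int,
    mask_divisibility: int,
) -> int:
    """Largest vector width that pointer and mask alignment both allow.

    Assumptions of this sketch: one contiguous run of elements per thread,
    pointer divisibility given in bytes, mask divisibility given in elements,
    and a missing divisibility hint treated as 1.
    """
    elems_per_thread = max(block_size // (num_warps * WARP_SIZE), 1)
    ptr_alignment = max(ptr_divisibility_bytes // elem_bytes, 1)
    return min(elems_per_thread, ptr_alignment, max(mask_divisibility, 1))


# First example: %n_elements has tt.divisibility = 16.
print(masked_vector_size(128, 2, 4, 16, 16))
# Second example: %n_elements carries no divisibility hint.
print(masked_vector_size(128, 2, 4, 16, 1))
```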
2022-10-18 11:43:50 +08:00
Philippe Tillet
38a80664b5 [OPTIMIZER] Updated TritonGPU-combine pass (#784)
WIP but should work in the cases we need so far
2022-10-16 21:19:42 -07:00
goostavz
e948a618b3 [Triton-MLIR] fix a tiny bug in coalesce pass (#782) 2022-10-16 20:29:55 -07:00
Shintaro Iwasaki
5898352f97 [Triton-IR] Fix LoadOp definition (#771) (#777) 2022-10-13 18:53:00 -07:00
Philippe Tillet
623c99609f [Triton-IR] Added type inference and verifier for Triton-IR operations (#767) 2022-10-11 18:16:41 -07:00
Philippe Tillet
b6e5a231e5 [OPTIMIZER] Added swizzling pass (#758) 2022-10-10 01:12:37 -07:00
Philippe Tillet
498c685b46 [OPTIMIZER] layout simplification: ignore non-tensor iter arguments in for loop rematerialization (#749) 2022-10-07 21:52:29 -07:00
Keren Zhou
289ff293cc [Triton-MLIR] Generate LLVM/PTX code for async ops (#735) 2022-10-04 09:37:00 -07:00
goostavz
f9d7f2f126 [Triton-MLIR][Backend] Support ConvertLayout blocked->shared and a few fixes related with mma(#716) 2022-10-03 19:33:25 +08:00
Philippe Tillet
9ddf0921fb [OPTIMIZER] Added DotOp to the list of expensive ops we don't want to rematerialize. (#718) 2022-09-27 09:05:49 -07:00
Yan Chunwei
3a84278530 [Triton-MLIR][BACKEND] Refine dot conversion (#710)
This PR does

1. Refine the dot conversion
2. some other tiny code refinement
2022-09-27 14:38:34 +08:00
goostavz
61b61755e5 [Triton-MLIR][Backend] Support layout conversion between mmaLayout and blockedLayout (#693) 2022-09-27 03:58:47 +00:00
Keren Zhou
ecd1bc33df [Triton-MLIR] Keren/code gen for extract slice and alloc tensor (#692)
Co-authored-by: gzhu <goostavz@outlook.com>
2022-09-23 19:38:14 +00:00
Yan Chunwei
922155f1d2 [BACKEND] add dot conversion (mma version=2) (#672)
LLVM Conversion for Dot op.

Due to the lack of `convert_layout`, currently, the dot only supports
the following combination of operands

- `$a` in shared layout
- `$b` in shared layout
- `$c` in MMA layout (but only Splat-like, leaving the generic cases to
`convert_layout`)

This PR focuses on supporting the `mma.16816` related logic, leaving the other
cases to a follow-up PR.

Co-authored-by: Philippe Tillet <phil@openai.com>
2022-09-22 20:43:54 -07:00
Shintaro Iwasaki
940ef3f0ac [BACKEND] llvm::dyn_cast -> llvm::dyn_cast_or_null (#689) 2022-09-22 03:26:40 +00:00
goostavz
15bfd0cb79 [BACKEND] Support of ConvertLayoutOp from blocked to blocked and SliceLayout with blocked parent (#658) 2022-09-17 14:58:42 -07:00
Shintaro Iwasaki
13669b46a6 [DOCS] Correct spelling (#665)
This PR corrects spelling like #664 for Triton-MLIR. It should not break anything.
2022-09-16 15:07:34 -07:00
Shintaro Iwasaki
43be75ad42 [FRONTEND] Add scalar type support for some ops (#661)
This PR adds basic support for scalar-type inputs to some ops (cast and pointer arithmetic) for Triton-MLIR. It also renames getelementptr -> addptr.
2022-09-15 16:12:52 -07:00
Da Yan
2e08450c80 [OPTIMIZER] Better pipeline tests (#660) 2022-09-14 23:26:40 -07:00
Keren Zhou
16aed94ff5 [Analysis/Allocation] Allocation passes now assume that slices always alias (#108)
The code in this branch assumes the `src` operand in
`insert_slice_async` always aliases the result, which doesn't hold in
general but is a workaround to make the pipeline pass work.

I'm also working on the complete analysis in another
[branch](https://github.com/openai/triton-mlir/tree/keren/analyze-slice).
2022-09-09 12:03:41 -07:00
Philippe Tillet
9bd5a3dcd2 [OPTIMIZER] Pipeline async buffer (#110) 2022-09-09 11:01:14 -07:00
Yan Chunwei
a9464f4993 [Backend] Vectorize Load/Store Ops (#86)
This PR does the following things:

- Refactor the Load and Store op codegen so that both follow the same logic
and share most of their code
- Support the vectorized load/store
2022-09-06 12:28:09 -07:00
Da Yan
35e346bcff [OPTIMIZER] Better pipeline pass (#100)
* Use `insert_slice_async` instead of `CopyAsync`
* Move async.wait to loop header
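
For orientation, the loop structure these bullets describe can be modeled as a multi-stage software pipeline. The sketch below is a sequential Python simulation with hypothetical names: the buffer assignment stands in for `insert_slice_async`, and the wait at the loop header is a no-op because nothing is actually asynchronous here.

```python
from typing import List, Optional, Sequence


def pipelined_sum(tiles: Sequence[Sequence[int]], num_stages: int = 2) -> int:
    """Sum all tiles while keeping up to num_stages - 1 loads in flight."""
    buffers: List[Optional[Sequence[int]]] = [None] * num_stages
    n = len(tiles)

    # Prologue: issue the first num_stages - 1 loads before entering the loop.
    for i in range(min(num_stages - 1, n)):
        buffers[i % num_stages] = tiles[i]  # stands in for insert_slice_async

    acc = 0
    for i in range(n):
        # Loop header: this is where the wait for tile i would sit.
        nxt = i + num_stages - 1
        if nxt < n:
            buffers[nxt % num_stages] = tiles[nxt]  # prefetch a later tile
        acc += sum(buffers[i % num_stages])  # consume the current tile
    return acc


data = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(pipelined_sum(data, num_stages=3))
```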

Co-authored-by: Jokeren <kerenzhou@openai.com>
2022-09-06 08:31:13 -07:00
Philippe Tillet
a0bab9748e [OPTIMIZER] Coalesce pass no longer takes a num-warps argument (#99)
Improved design to avoid inconsistent `num-warps` value between the pass and the parent module of the operation it processes.
2022-09-05 18:09:02 -07:00
Philippe Tillet
d0b4c67b05 [OPTIMIZER] Improved layout conversion simplification algorithm (#97)
This PR simplifies the layout conversion simplification algorithm and improves it so that it works with vectorized element-wise ops. The conversion optimizer still has a lot of room for improvement, and other PRs will address its limitations (ideally via some sort of explicit cost model).
2022-09-02 16:52:44 -07:00