Keren Zhou
3b80801dff
[Triton-MLIR][Backend] Fix many problems to get the pipeline working ( #809 )
1. Rewrite code generation of insert_slice_async.
2. Correct the wrong index passed to extract_slice in pipeline.
3. Add a prologue in pipeline to wait for dangling cp.asyncs.
4. Move the scf-to-cf conversion inside TritonGPUToLLVM because membar must
run before scf-to-cf. This is not a fundamental limitation and could be
lifted by a more general membar analysis.
5. Use an attribute to memoize the shared memory size and support
dynamic shared memory.
6. Prevent the combine pass from reordering insert_slice and extract_slice
across async_wait.
Co-authored-by: Superjomn <yanchunwei@outlook.com>
2022-10-27 22:09:06 -07:00
Philippe Tillet
bb0f9235d1
[OPTIMIZER] Made layout simplification pass efficient for fused attention kernels ( #790 )
2022-10-21 16:52:15 -07:00
Philippe Tillet
38a80664b5
[OPTIMIZER] Updated TritonGPU-combine pass ( #784 )
WIP but should work in the cases we need so far
2022-10-16 21:19:42 -07:00
Philippe Tillet
623c99609f
[Triton-IR] Added type inference and verifier for Triton-IR operations ( #767 )
2022-10-11 18:16:41 -07:00
Philippe Tillet
b6e5a231e5
[OPTIMIZER] Added swizzling pass ( #758 )
2022-10-10 01:12:37 -07:00
Keren Zhou
baba98ad69
[Triton-MLIR] Fix threadsPerWarp derivation in BlockedEncodingAttr ( #722 )
Example:
```
auto encoding = triton::gpu::BlockedEncodingAttr::get(
    &getContext(), {8, 32}, {2, 2}, {1, 0}, 2);
// shape = [32 x 8], order = [1, 0], sizePerThread = [2, 2], numWarps = 2
```
Expected output:
```
//#triton_gpu.blocked_layout<{
// sizePerThread = {2, 2}
// threadsPerWarp = {8, 4}
// warpsPerCTA = {2, 1}
//}>
```
Incorrect output produced by the current branch:
```
//#triton_gpu.blocked_layout<{
// sizePerThread = {2, 2}
// threadsPerWarp = {16, 2}
// warpsPerCTA = {2, 1}
//}>
```
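The expected values above can be reproduced with a small standalone sketch of the derivation. This is not Triton's implementation: `deriveBlockedLayout` and `Dims` are hypothetical names, a 32-lane warp is assumed, and the tensor shape is taken to be [32 x 8] as stated in the comment of the example. The idea is to walk the dimensions in `order` (fastest-varying first), give each one as many lanes and warps as its extent can use, and hand whatever remains to the last dimension.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Dims = std::vector<unsigned>;

// Hypothetical helper (not part of Triton): returns {threadsPerWarp, warpsPerCTA}
// for a blocked layout. Assumes power-of-two sizes and a warp of `warpSize` lanes.
std::pair<Dims, Dims> deriveBlockedLayout(const Dims &shape,
                                          const Dims &sizePerThread,
                                          const Dims &order,
                                          unsigned numWarps,
                                          unsigned warpSize = 32) {
  assert(!shape.empty());
  assert(shape.size() == sizePerThread.size() && shape.size() == order.size());

  const std::size_t rank = shape.size();
  Dims threadsPerWarp(rank, 1), warpsPerCTA(rank, 1);
  unsigned remainingLanes = warpSize;
  unsigned remainingWarps = numWarps;
  unsigned remainingThreads = numWarps * warpSize;

  // Visit every dimension except the slowest-varying one, fastest first.
  for (std::size_t d = 0; d + 1 < rank; ++d) {
    const unsigned i = order[d];
    // Threads this dimension can actually use, capped by what is still free.
    const unsigned threadsPerCTA =
        std::max(1u, std::min(remainingThreads, shape[i] / sizePerThread[i]));
    threadsPerWarp[i] = std::max(1u, std::min(threadsPerCTA, remainingLanes));
    warpsPerCTA[i] =
        std::max(1u, std::min(threadsPerCTA / threadsPerWarp[i], remainingWarps));
    remainingLanes /= threadsPerWarp[i];
    remainingWarps /= warpsPerCTA[i];
    remainingThreads /= threadsPerCTA;
  }

  // The last dimension absorbs all lanes and warps that are still unassigned.
  const unsigned last = order[rank - 1];
  threadsPerWarp[last] = remainingLanes;
  warpsPerCTA[last] = remainingWarps;
  return {threadsPerWarp, warpsPerCTA};
}
```

Under those assumptions, calling `deriveBlockedLayout({32, 8}, {2, 2}, {1, 0}, 2)` gives `threadsPerWarp = {8, 4}` and `warpsPerCTA = {2, 1}`: dimension 1 has only 8 / 2 = 4 thread positions, so it takes 4 lanes, and the remaining 8 lanes and both warps go to dimension 0.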
2022-09-27 16:41:30 -07:00
Shintaro Iwasaki
43be75ad42
[FRONTEND] Add scalar type support for some ops ( #661 )
This PR adds basic support for scalar-type inputs to some ops (cast and pointer arithmetic) for Triton-MLIR. It also renames getelementptr to addptr.
2022-09-15 16:12:52 -07:00
Da Yan
2e08450c80
[OPTIMIZER] Better pipeline tests ( #660 )
2022-09-14 23:26:40 -07:00
Keren Zhou
16aed94ff5
[Analysis/Allocation] Allocation passes now assume that slices always alias ( #108 )
The code in this branch assumes that the `src` operand of
`insert_slice_async` always aliases the result. This does not hold in
general; it is only a workaround to make the pipeline pass work.
I'm also working on a complete analysis in another
[branch](https://github.com/openai/triton-mlir/tree/keren/analyze-slice).
2022-09-09 12:03:41 -07:00
Philippe Tillet
9bd5a3dcd2
[OPTIMIZER] Pipeline async buffer ( #110 )
2022-09-09 11:01:14 -07:00
Da Yan
35e346bcff
[OPTIMIZER] Better pipeline pass ( #100 )
* Use `insert_slice_async` instead of `CopyAsync`
* Move async.wait to loop header
Co-authored-by: Jokeren <kerenzhou@openai.com>
2022-09-06 08:31:13 -07:00
Philippe Tillet
a0bab9748e
[OPTIMIZER] Coalesce pass no longer takes a num-warps argument ( #99 )
Improved the design to avoid inconsistent `num-warps` values between the pass and the parent module of the operation it processes.
2022-09-05 18:09:02 -07:00
Philippe Tillet
d0b4c67b05
[OPTIMIZER] Improved layout conversion simplification algorithm ( #97 )
This PR both simplifies the layout conversion simplification algorithm and extends it to work with vectorized element-wise ops. The conversion optimizer still has a lot of room for improvement, and later PRs will address its limitations (ideally via some sort of explicit cost model).
2022-09-02 16:52:44 -07:00
Shintaro Iwasaki
0ebef11c77
[TritonIR] Make mask operand optional ( #74 )
2022-08-22 22:00:17 -07:00
Da Yan
92ef552a54
[OPTIMIZER] Fix Num in AsyncWaitOp generated by the pipeline pass ( #72 )
2022-08-22 15:58:10 -07:00
Shintaro Iwasaki
9aa00249a6
[TritonIR] make other optional and remove isOtherUnspecified ( #67 )
[Triton] make other optional and remove isOtherUnspecified
2022-08-18 18:19:55 -07:00
Philippe Tillet
192be76b3c
[OPTIMIZER] Rewrite patterns for layout conversions ( #64 )
2022-08-18 12:49:37 -07:00
Da Yan
8776ad1a0e
[OPTIMIZER] Let the pipeline pass insert async wait. ( #63 )
2022-08-18 10:31:57 -07:00
Shintaro Iwasaki
d69ce77b19
[FRONTEND] add an attr for masked load without explicit other ( #55 )
2022-08-18 09:51:37 -07:00
Philippe Tillet
78ebbe24c7
[FRONTEND] Added ExpandDimsOp primitive ( #36 )
2022-08-04 18:41:06 -07:00
Philippe Tillet
3236642e8f
[OPTIMIZER] Added memory coalescing pass ( #31 )
2022-07-31 20:59:31 -07:00
Philippe Tillet
d1593e6ca8
[TritonGPU] Improved documentation and semantics of layout encodings ( #30 )
2022-07-31 13:59:44 -07:00
Phil Tillet
65237f6117
[PACKAGING] Added FileCheck
2022-07-07 16:53:19 -07:00
Yan Da
26fcc12afd
better unit tests
2022-06-07 19:35:38 +08:00
Yan Da
0e11435448
more tests
2022-06-06 21:10:28 +08:00
Yan Da
7807f64ef3
rename sharded_layout => blocked_layout
2022-06-05 16:14:59 +08:00
Yan Da
bbf75b492f
more tests
2022-06-05 15:10:09 +08:00
Yan Da
d5eca56cf3
more TritonGPU unit tests
2022-06-05 14:25:09 +08:00