triton

Files

Keren Zhou 153aecb339 [Triton-MLIR][BACKEND] insert_slice_async on GPUs < sm80 (#908)

`insert_slice_async` is decomposed into `load + insert_slice` in the
backend.

It is unclear whether V100 performance can match the master branch with
this approach. Performance might improve if the instructions are arranged
in the following form:

```
%0 = load
%1 = load
%2 = load
...
insert_slice %0
insert_slice %1
insert_slice %2
```
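As a rough illustration of the arrangement above, the following Python sketch models each decomposed copy as a `load` followed by an `insert_slice`, then hoists all loads ahead of the inserts. The op tuples and the functions `decompose` and `hoist_loads` are hypothetical names used only for this illustration; they are not part of the Triton backend, and the sketch says nothing about whether the reordering actually helps on V100.

```python
from typing import List, Tuple

# Hypothetical op model: (op name, index of the value it defines or consumes).
Op = Tuple[str, int]


def decompose(num_copies: int) -> List[Op]:
    """Naive decomposition: each copy becomes a load immediately followed by its insert_slice."""
    ops: List[Op] = []
    for i in range(num_copies):
        ops.append(("load", i))
        ops.append(("insert_slice", i))
    return ops


def hoist_loads(ops: List[Op]) -> List[Op]:
    """Place every load before every insert_slice, keeping relative order within each group."""
    loads = [op for op in ops if op[0] == "load"]
    inserts = [op for op in ops if op[0] == "insert_slice"]
    return loads + inserts


if __name__ == "__main__":
    for name, idx in hoist_loads(decompose(3)):
        if name == "load":
            print(f"%{idx} = load")
        else:
            print(f"insert_slice %{idx}")
```

Grouping the loads first keeps each `insert_slice` after the `load` that produces its operand, so the data dependencies of the naive order are preserved.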

Tested on A100 with this decomposition manually enabled.
Tests on V100 haven't been integrated yet; we can divide the tests into
two phases:
1. Test only `load`, `insert_slice`, and `insert_slice_async`, given TritonGPU
IRs in `test_backend.py`.
2. End-to-end GEMM tests on V100.

2022-11-24 14:05:54 -08:00

FileCheck

[CI] run clang-format (#24)

2022-07-26 17:25:03 -07:00

CMakeLists.txt

[RUNTIME] Major code cleanup (#711)

2022-09-26 16:38:06 -07:00

triton-opt.cpp

[Triton] Support math and libdevice ops (#91)

2022-09-01 16:34:27 -07:00

triton-translate.cpp

[Triton-MLIR][BACKEND] insert_slice_async on GPUs < sm80 (#908)

2022-11-24 14:05:54 -08:00