[Triton-MLIR][BACKEND] insert_slice_async on GPUs < sm80 (#908)

`insert_slice_async` is decomposed into `load + insert_slice` in the
backend.

Not sure whether V100 perf can match the master branch this way.
Performance might improve if all loads are issued before the
insert_slices, i.e. if instructions are arranged in the following form:

```
%0 = load
%1 = load
%2 = load
...
insert_slice %0
insert_slice %1
insert_slice %2
```
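
As a rough illustration of the arrangement above, here is a minimal C++ sketch that hoists loads ahead of insert_slices in a toy instruction list. The `Op` struct, `Kind` enum, and `hoistLoads` function are hypothetical names made up for this example; they are not Triton or MLIR types, and this is not how the backend actually schedules instructions.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a backend instruction (not a Triton/MLIR type).
enum class Kind { Load, InsertSlice };

struct Op {
  Kind kind;
  int id; // index of the slice this op works on
};

// Move every load ahead of every insert_slice. std::stable_partition keeps
// the relative order inside each group, so insert_slice i still comes after
// load i.
std::vector<Op> hoistLoads(std::vector<Op> ops) {
  auto isLoad = [](const Op &op) { return op.kind == Kind::Load; };
  std::stable_partition(ops.begin(), ops.end(), isLoad);
  // Sanity check: all loads now precede all insert_slices.
  assert(std::is_partitioned(ops.begin(), ops.end(), isLoad));
  return ops;
}

// Render the list as text, e.g. "load0 insert_slice0 ".
std::string toString(const std::vector<Op> &ops) {
  std::string out;
  for (const Op &op : ops) {
    out += (op.kind == Kind::Load ? "load" : "insert_slice");
    out += std::to_string(op.id);
    out += ' ';
  }
  return out;
}
```

Calling `hoistLoads` on the interleaved sequence `load0, insert_slice0, load1, insert_slice1, ...` returns the grouped form shown above. Whether this grouping actually helps on V100 is the open question raised in this message.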

Tested on A100 with this decomposition manually enabled.
Tests on V100 haven't been integrated yet. We can divide the tests into
two phases:
1. Test only load, insert_slice, and insert_slice_async, given TritonGPU
   IRs in `test_backend.py`.
2. End-to-end GEMM tests on V100.
This commit is contained in:
Keren Zhou
2022-11-24 14:05:54 -08:00
committed by GitHub
parent f98aed1258
commit 153aecb339
16 changed files with 351 additions and 137 deletions

@@ -34,7 +34,8 @@ TEST_P(SwizzleDotOperandTestFixture, DotOperands) {
 // create element type
 Type eltType = IntegerType::get(&ctx, params.typeWidth);
-auto layout = SharedEncodingAttr::get(&ctx, encoding, params.shape, {1, 0}, eltType);
+auto layout =
+    SharedEncodingAttr::get(&ctx, encoding, params.shape, {1, 0}, eltType);
 ASSERT_EQ(layout.getVec(), params.refSwizzle.vec);
 ASSERT_EQ(layout.getPerPhase(), params.refSwizzle.perPhase);