`insert_slice_async` is decomposed into `load + insert_slice` in the
backend.
Not sure if V100 perf can match the master branch though in this way.
Maybe the performance can be improved if instructions are arranged in
the following form:
```
%0 = load
%1 = load
%2 = load
...
insert_slice %0
insert_slice %1
insert_slice %2
```
Tested on A100 when manually enabling this decomposition.
Tests on V100 haven't been integrated yet, we can divide the tests into
two phases:
1. Test only load, insert_slice, and insert_slice_async, given TritonGPU
IRs in `test_backend.py`.
2. End to end gemm tests on V100.
Swizzling is no longer implemented as a separate pass. It is instead
done in a specialized constructor of SharedEncodingAttr, and tested via
google tests instead of triton-opt + filecheck.
In the future we may want to implement it as a pass again once we have
an additional dialect between TritonGPU and LLVM.
This PR does
1. Add `onlyBindMLIRArgs` argument to `PTXInstrCommon::call` method to
support passing in a whole PTX code snippet
2. Refine the APIs and simplify the code usage.
This PR does
1. Add some C++ tests for `PTXFormat`
2. Enhance the functionality of `PTXFormat`, make a `PTXInstr` instance
can be called multiple times similar as a C function.
Based on the discussion in #53 , I just added the initial flow of CXX unittests for this repo, with providing two dummy UTs as placeholder to show the usage, feel free to add your own CXX unittests.
@Superjomn @ptillet
@ptillet , in this PR, I also configure the integration-tests.yml to add the unittest into github CI check.
Thanks