[Triton-MLIR][Backend] Fix many problems to get the pipeline working (#809)

1. Rewrite code generation of insert_slice_async.
2. Correct the wrong index passed to extract_slice in the pipeline pass.
3. Add a prologue in the pipeline pass to wait for dangling cp.asyncs.
4. Move the scf-to-cf conversion inside TritonGPUToLLVM, because membar
must run before scf is lowered to cf. This is not a fundamental
limitation and could be lifted by a more general membar analysis.
5. Use an attribute to memoize the shared memory size and support
dynamic shared memory.
6. Prevent the combine pass from reordering insert_slice and extract_slice
across async_wait.
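
Why the ordering around async_wait matters can be illustrated with a toy model. The sketch below is not Triton code: `SharedBuffer` and its methods are hypothetical stand-ins that only borrow the op names from the list above. It assumes an asynchronous copy becomes visible only once a later wait retires it.

```python
from collections import deque


class SharedBuffer:
    """Toy model of a multi-slot buffer filled by asynchronous copies (hypothetical)."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots  # None marks a slot whose data has not landed
        self.pending = deque()           # copies issued but not yet completed

    def insert_slice_async(self, slot, value):
        # Issue the copy; the data is not visible until a later async_wait.
        self.pending.append((slot, value))

    def async_wait(self, num):
        # Retire the oldest copies until at most `num` are still in flight.
        while len(self.pending) > num:
            slot, value = self.pending.popleft()
            self.slots[slot] = value

    def extract_slice(self, slot):
        # Read whatever is currently visible in the slot.
        return self.slots[slot]


def read_after_wait():
    buf = SharedBuffer(num_slots=2)
    buf.insert_slice_async(0, "tile0")
    buf.async_wait(0)
    return buf.extract_slice(0)


def read_before_wait():
    buf = SharedBuffer(num_slots=2)
    buf.insert_slice_async(0, "tile0")
    value = buf.extract_slice(0)  # moved above the wait: the copy has not landed
    buf.async_wait(0)
    return value
```

In this model `read_after_wait()` sees the copied tile, while `read_before_wait()` reads an empty slot. That is the kind of hazard a rewrite pattern could introduce by moving a read or write across the wait.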

Co-authored-by: Superjomn <yanchunwei@outlook.com>
Keren Zhou
2022-10-27 22:09:06 -07:00
committed by GitHub
parent 42db3538e4
commit 3b80801dff
10 changed files with 122 additions and 70 deletions


@@ -1257,8 +1257,8 @@ void init_triton_translation(py::module &m) {
   using ret = py::return_value_policy;
   m.def("get_shared_memory_size", [](mlir::ModuleOp module) {
-    auto pass = std::make_unique<mlir::Allocation>(module);
-    return pass->getSharedMemorySize();
+    return module->getAttrOfType<mlir::IntegerAttr>("triton_gpu.shared")
+        .getInt();
   });
   m.def(
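
The idea behind item 5 of the commit message, recording an analysis result on the module once and reading it back later, can be sketched in plain Python. Only the attribute name `triton_gpu.shared` comes from the diff. The `Module` class, `run_allocation_analysis`, and `lower` are hypothetical, and where the real backend sets the attribute is not shown in this excerpt.

```python
SHARED_ATTR = "triton_gpu.shared"  # attribute name taken from the diff


class Module:
    """Hypothetical stand-in for an IR module that carries named attributes."""

    def __init__(self, buffer_sizes):
        self.buffer_sizes = list(buffer_sizes)
        self.attrs = {}


def run_allocation_analysis(module):
    # Stand-in for the real analysis; assumes all buffers are live at once.
    return sum(module.buffer_sizes)


def lower(module):
    # Assumed: a lowering step runs the analysis once and records the result.
    module.attrs[SHARED_ATTR] = run_allocation_analysis(module)
    return module


def get_shared_memory_size(module):
    # Later queries read the recorded value instead of re-running the analysis.
    return module.attrs[SHARED_ATTR]
```

One consequence of this design, in the sketch at least, is that the query only works after the step that sets the attribute has run; calling `get_shared_memory_size` on a fresh `Module` raises `KeyError`.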


@@ -875,7 +875,7 @@ def optimize_tritongpu_ir(mod, num_stages):
     pm.enable_debug()
     # Get error in backend due to wrong conversion in expanding async-related instruction.
     # TODO[Superjomn]: Open it when fixed.
-    # pm.add_tritongpu_pipeline_pass(num_stages)
+    pm.add_tritongpu_pipeline_pass(num_stages)
     pm.add_canonicalizer_pass()
     pm.add_cse_pass()
     pm.add_coalesce_pass()