[CODEGEN] Performance improvement on A100 (#125)

Improved codegen for Ampere GPUs.

    * Make the layout pass recognize the multistage pipelined pattern.
    * Extend the pipeline pass to automate the multistage pipelining transformation.
    * Remove extra barriers on Ampere (those inserted by the prefetch pass and for write-after-read (WAR) hazards).
    * Update the code generator (generator.cc) so that Triton emits n-buffered shared memory loads/stores (see the sketch below).
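As a concrete illustration of the pattern these passes target, the sketch below is a minimal Triton matmul kernel whose K-loop is the kind of loop that gets multistage-pipelined on Ampere. It is written against the later public Triton Python API (tl.constexpr block sizes, the num_stages launch option); the kernel name, block sizes, and launch configuration are illustrative assumptions, not part of this commit.

# Illustrative only (not part of this commit): a minimal Triton matmul kernel
# whose K-loop is the pattern the layout/pipeline passes target.
# Assumes M, N, K are multiples of the block sizes, so bounds masking is omitted.
import torch
import triton
import triton.language as tl


@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak,
                  stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + rm[:, None] * stride_am + rk[None, :] * stride_ak
    b_ptrs = b_ptr + rk[:, None] * stride_bk + rn[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # On Ampere, multistage pipelining turns these loads into n-buffered
        # shared-memory copies prefetched ahead of the tl.dot that consumes them.
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn
    tl.store(c_ptrs, acc)


# Hypothetical launch: num_stages controls how many buffers the pipeline keeps in flight.
a = torch.randn(1024, 1024, device='cuda', dtype=torch.float16)
b = torch.randn(1024, 1024, device='cuda', dtype=torch.float16)
c = torch.empty(1024, 1024, device='cuda', dtype=torch.float32)
grid = (triton.cdiv(1024, 64), triton.cdiv(1024, 64))
matmul_kernel[grid](a, b, c, 1024, 1024, 1024,
                    a.stride(0), a.stride(1),
                    b.stride(0), b.stride(1),
                    c.stride(0), c.stride(1),
                    BLOCK_M=64, BLOCK_N=64, BLOCK_K=32,
                    num_stages=4, num_warps=4)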
Author: daadaada
Date: 2021-06-21 14:25:13 +08:00
Committed by: Philippe Tillet
Parent: 5a51f3e529
Commit: d8d6b715c8
21 changed files with 855 additions and 174 deletions

@@ -1,5 +1,6 @@
import torch
import os
from .code_gen import OutOfResources

try:
    import triton._C.libtriton.cutlass as _cutlass
@@ -8,6 +9,15 @@ except ImportError:
    _cutlass = None
    has_cutlass = False


def catch_oor(kernel, pytest_handle=None):
    try:
        res = kernel()
    except OutOfResources as e:
        if pytest_handle:
            pytest_handle.skip(str(e))
        return None
    return res


def sparsify_tensor(x, mask, block):
    ret = torch.empty((x.size(0), mask.sum(), block, block), dtype=x.dtype, device=x.device)