[Triton-MLIR][BACKEND] add convert_layout[shared->dot_op] conversion to adapt DotOperand layout (#786)

This PR 1. adapts the existing DotOp conversion to the design of the new DotOperand layout, and 2. makes the DotOp conversion work with both shared-layout inputs and DotOperand-layout inputs, in preparation for a later upstream switch.
2022-10-24 11:40:13 +08:00
parent 3aa8296b06
commit 877844de4f
4 changed files with 241 additions and 103 deletions
--- a/include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
+++ b/include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
@@ -52,7 +52,7 @@ different cuda threads in the programs, via shared memory. In other words,
 for all indices i \in R^d, \mathcal{L}(i) = {0, 1, ..., 32*num_warps - 1}.

 In order to avoid shared memory bank conflicts, elements may be swizzled
-in memory. For example, a swizzled row-major layout could store its data 
+in memory. For example, a swizzled row-major layout could store its data
 as follows:

 A_{0, 0}  A_{0, 1}  A_{0, 2}  A_{0, 3} ...   [phase 0] \ per_phase = 2
@@ -215,9 +215,9 @@ def MmaEncodingAttr : DistributedEncoding<"MmaEncoding"> {
 An encoding for tensors that have been produced by tensor cores.
 It is characterized by two parameters:
 - A 'version' which specifies the generation the tensor cores
-whose output is being partitioned: 1 for first-gen tensor cores (Volta), 
+whose output is being partitioned: 1 for first-gen tensor cores (Volta),
 and 2 for second-gen tensor cores (Turing/Ampere).
- A `blockTileSize` to indicate how data should be 
+- A `blockTileSize` to indicate how data should be
 partitioned between warps.

 // -------------------------------- version = 1 --------------------------- //
@@ -229,7 +229,7 @@ https://docs.nvidia.com/cuda/parallel-thread-execution/index.html

 For example, the matrix L corresponding to blockTileSize=[32,16] is:

-                               warp 0              
+                               warp 0
 --------------------------------/\-------------------------------
 [ 0   0   2   2   0   0   2   2    4   4   6   6   4   4   6   6 ]
 [ 1   1   3   3   1   1   3   3    5   5   7   7   5   5   7   7 ]
@@ -246,7 +246,7 @@ For example, the matrix L corresponding to blockTileSize=[32,16] is:
 [ 24  24  26  26  24  24  26  26   28  28  30  30  28  28  30  30]
 [ 25  25  27  27  25  25  27  27   29  29  31  31  29  29  31  31]

-                         warp 1 = warp0 + 32             
+                         warp 1 = warp0 + 32
 --------------------------------/\-------------------------------
 [ 32  32  34  34  32  32  34  34   36  36  38  38  36  36  38  38]
 [ 33  33  35  35  33  33  35  35   37  37  39  39  37  37  39  39]
@@ -260,29 +260,29 @@ For example, the matrix L corresponding to blockTileSize=[32,16] is:
 For second-gen tensor cores, the implicit warpTileSize is [16, 8].
 Information about this layout can be found in the official PTX documentation
 https://docs.nvidia.com/cuda/parallel-thread-execution/index.html
-(mma.16816 section, FP32 accumulator).             
+(mma.16816 section, FP32 accumulator).

 For example, the matrix L corresponding to blockTileSize=[32,16] is:
                warp 0                          warp 1
 -----------------/\-------------  ----------------/\-------------
 [ 0   0   1   1   2   2   3   3   32  32  33  33  34  34  35  35
 [ 4   4   5   5   6   6   7   7   36  36  37  37  38  38  39  39
-[ ..............................  ..............................                      
+[ ..............................  ..............................
 [ 28  28  29  29  30  30  31  31  60  60  61  61  62  62  63  63
 [ 0   0   1   1   2   2   3   3   32  32  33  33  34  34  35  35
 [ 4   4   5   5   6   6   7   7   36  36  37  37  38  38  39  39
-[ ..............................  ..............................                
+[ ..............................  ..............................
 [ 28  28  29  29  30  30  31  31  60  60  61  61  62  62  63  63
-                
+
              warp 3                           warp 4
 ----------------/\-------------   ----------------/\-------------
 [ 64  64  65  65  66  66  67  67  96  96  97  97  98  98  99  99
 [ 68  68  69  69  70  70  71  71  100 100 101 101 102 102 103 103
-[ ..............................  ...............................                   
+[ ..............................  ...............................
 [ 92  92  93  93  94  94  95  95  124 124 125 125 126 126 127 127
 [ 64  64  65  65  66  66  67  67  96  96  97  97  98  98  99  99
 [ 68  68  69  69  70  70  71  71  100 100 101 101 102 102 103 103
-[ ..............................  ...............................                   
+[ ..............................  ...............................
 [ 92  92  93  93  94  94  95  95  124 124 125 125 126 126 127 127

 }];
@@ -316,7 +316,7 @@ def SliceEncodingAttr : DistributedEncoding<"SliceEncoding"> {
    This is useful for constructing the inverse layout of an expand_dims operation during some optimization passes.

  }];
-  
+
  let parameters = (
    ins
    "unsigned":$dim,