[BACKEND] Compiler improvements (#557)

This PR adds several optimization capabilities to the compiler backend:

- `tl.store` now uses inline PTX, making it possible to use things like evict_last.
- For A100, mma layout can be converted directly to shared memory.
- For A100, an additional "transpose" argument in `dot` allows tensors to be loaded once and used both row- and column-major.
- Fixed liveness analysis, which was broken.
- mma layout can now be loaded/stored directly, without conversion. This is useful when the `tl.dot` accumulator is initialized with DRAM data inside an inner loop.
- `tl.dot` can now take LHS inputs in registers when they come from a previous `tl.dot` instruction. This is useful for e.g. fused attention.
2022-06-27 11:49:19 -07:00
parent 87413bc925
commit 5b4c8f221e
25 changed files with 882 additions and 284 deletions
--- a/lib/codegen/analysis/allocation.cc
+++ b/lib/codegen/analysis/allocation.cc
@@ -92,8 +92,10 @@ void allocation::run(ir::module &mod) {
  }
  // Save maximum size of induced memory space
  allocated_size_ = 0;
-  for(shared_layout* x: V)
+  for(shared_layout* x: V){
    allocated_size_ = std::max<size_t>(allocated_size_, starts[x] + x->get_size());
+    // std::cout << "start: " << starts[x] << " | end: " << starts[x] + x->get_size() << std::endl;
+  }
 }

 }
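
For readers unfamiliar with this loop, the sketch below isolates what the hunk above computes: the peak shared-memory footprint is the largest end offset (`start + size`) over all buffers that have already been placed. It is a minimal illustration only. `Buffer` and `allocated_size` are hypothetical stand-ins introduced here, not the actual `shared_layout` or `allocation` types from the repository.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a shared-memory buffer that has already been
// assigned a start offset. This is not the real shared_layout type.
struct Buffer {
  std::size_t start;  // byte offset assigned by the allocator
  std::size_t size;   // buffer size in bytes
};

// Returns the largest end offset (start + size) over all buffers,
// or 0 when there are no buffers.
std::size_t allocated_size(const std::vector<Buffer>& buffers) {
  std::size_t result = 0;
  for (const Buffer& b : buffers) {
    result = std::max<std::size_t>(result, b.start + b.size);
  }
  return result;
}
```

Buffers whose live ranges do not overlap may share offsets, so the result can be smaller than the sum of all buffer sizes. This is presumably why a correct liveness analysis matters for the allocator.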