triton

Author	SHA1	Message	Date
Philippe Tillet	8edfe813a5	[FRONTEND][BACKEND] Added `trans` instruction; made flash attention bwd pass work (#943)	2022-12-03 09:58:24 -08:00
Keren Zhou	153aecb339	[Triton-MLIR][BACKEND] insert_slice_async on GPUs < sm80 (#908) `insert_slice_async` is decomposed into `load + insert_slice` in the backend. Not sure whether V100 perf can match the master branch this way, though. Performance might improve if the instructions are arranged in the following order: `%0 = load; %1 = load; %2 = load; ...; insert_slice %0; insert_slice %1; insert_slice %2`. Tested on A100 by manually enabling this decomposition. Tests on V100 haven't been integrated yet; we can divide the tests into two phases: 1. Test only load, insert_slice, and insert_slice_async, given TritonGPU IRs in `test_backend.py`. 2. End-to-end GEMM tests on V100.	2022-11-24 14:05:54 -08:00
Chenggang Zhao	516a241234	[Triton-MLIR] Fix some typos (#874)	2022-11-13 18:15:53 -08:00
Keren Zhou	fdd59900f7	[Triton-MLIR] Replace triton.extract_slice with tensor.extract_slice and support more general tensor slicing (#837) ## Features - Allow taking a block of a tensor as a slice, as long as each dimension is contiguous (unit stride). - Fix some problems in the semantics of `insert_slice_async`. - More general verification for ops that return a shared layout encoding. ## Known Limitations - `insert_slice_async` still uses the old semantics. May submit another PR later to support semantics similar to `tensor.extract_slice`. - No encoding verification for `tensor.extract_slice`. - 3D tensor ops are broken. - Strided accesses are not allowed. - May cause a slight performance slowdown, since strides are passed as values rather than constants (e.g., int). It would be difficult to pass strides as attributes when there is control flow, because a block argument may accept tensors with different strides.	2022-11-06 22:59:03 -08:00
Keren Zhou	16aed94ff5	[Analysis/Allocation] Allocation passes now assume that slices always alias (#108) The code in this branch assumes that the `src` operand of `insert_slice_async` always aliases the result. This doesn't hold in general; it is just a workaround to make the pipeline pass work. I'm also working on the complete analysis in another [branch](https://github.com/openai/triton-mlir/tree/keren/analyze-slice).	2022-09-09 12:03:41 -07:00
Yan Chunwei	a9464f4993	[Backend] Vectorize Load/Store Ops (#86) This PR does the following things: - Refactors the Load and Store op codegen so that both follow the same logic and share most of their code - Supports vectorized load/store	2022-09-06 12:28:09 -07:00
Keren Zhou	02ebf24d35	Analyze shared memory alias (#81) The purpose of this PR is to analyze shared memory aliases so that we can fix memory allocation bugs and reduce memory allocations in Triton code involving complex control flow. Changes to memory bar and allocation are on the way. Co-authored-by: Philippe Tillet <phil@openai.com>	2022-08-29 10:43:20 -07:00

7 Commits