triton

Author	SHA1	Message	Date
Philippe Tillet	f40c63fb03	[Triton-MLIR][OPTIMIZER] Cleaned up swizzling (#869) Swizzling is no longer implemented as a separate pass. It is instead done in a specialized constructor of SharedEncodingAttr, and tested via google tests instead of triton-opt + filecheck. In the future we may want to implement it as a pass again once we have an additional dialect between TritonGPU and LLVM.	2022-11-10 12:05:46 -08:00
Da Yan	4946167241	[Triton-MLIR] `tt.dot` operands now must have DotOperand layout; also added prefetch pass prototype (#712) Co-authored-by: Jokeren <kerenzhou@openai.com> Co-authored-by: Phil Tillet <phil@openai.com> Co-authored-by: Superjomn <yanchunwei@outlook.com>	2022-11-10 05:57:27 +00:00
Da Yan	137344946f	[OPTIMIZER] Fix the load-mask issue with the pipeline pass (#857)	2022-11-08 09:29:53 -08:00
Philippe Tillet	976cf12af1	[OPTIMIZER] Fixed memory coalescing (#847)	2022-11-07 06:22:18 -08:00
Keren Zhou	fdd59900f7	[Triton-MLIR] Replace triton.extract_slice with tensor.extract_slice and support more general tensor slicing (#837) Features: allow taking a block slice of a tensor, as long as each dimension is contiguous (unit stride); fix some problems in the semantics of `insert_slice_async`; more general verification for ops that return a shared layout encoding. Known limitations: `insert_slice_async` still uses the old semantics, and another PR may follow to give it semantics similar to `tensor.extract_slice`; no encoding verification for `tensor.extract_slice`; 3D tensor ops are broken; strided accesses are not allowed; there may be a small performance slowdown, since strides are passed as values rather than constants (e.g., int). It would be difficult to pass strides as attributes in the presence of control flow, because a block argument may accept tensors with different strides.	2022-11-06 22:59:03 -08:00
Keren Zhou	3b80801dff	[Triton-MLIR][Backend] Fix many problems to get the pipeline working (#809) 1. Rewrite code generation of insert_slice_async. 2. Correct the wrong index passed to extract_slice in the pipeline. 3. Add a prologue in the pipeline to wait for dangling cp.asyncs. 4. Move the scf to cf conversion inside TritonGPUToLLVM, because membar must be performed before scf to cf. This shouldn't be a technical limitation and could be improved by a more general membar analysis. 5. Use an attribute to memoize the shared memory size and support dynamic shared memory. 6. Prevent the combine pass from reordering insert_slice and extract_slice across async_wait. Co-authored-by: Superjomn <yanchunwei@outlook.com>	2022-10-27 22:09:06 -07:00
Philippe Tillet	bb0f9235d1	[OPTIMIZER] Made layout simplification pass efficient for fused attention kernels (#790)	2022-10-21 16:52:15 -07:00
Philippe Tillet	38a80664b5	[OPTIMIZER] Updated TritonGPU-combine pass (#784) WIP, but should work in the cases we need so far	2022-10-16 21:19:42 -07:00
Philippe Tillet	623c99609f	[Triton-IR] Added type inference and verifier for Triton-IR operations (#767)	2022-10-11 18:16:41 -07:00
Philippe Tillet	b6e5a231e5	[OPTIMIZER] Added swizzling pass (#758)	2022-10-10 01:12:37 -07:00
Keren Zhou	baba98ad69	[Triton-MLIR] Fix threadsPerWarp derivation in BlockedEncodingAttr (#722) Example: ``` auto encoding = triton::gpu::BlockedEncodingAttr::get( &getContext(), {8, 32}, {2, 2}, {1, 0}, 2); // shape = [32 x 8], order = [1, 0], sizePerThread = [2, 2], numWarps = 2 ``` Expected output: ``` //#triton_gpu.blocked_layout<{ // sizePerThread = {2, 2} // threadsPerWarp = {8, 4} // warpsPerCTA = {2, 1} //}> ``` Incorrect output by the current branch: ``` //#triton_gpu.blocked_layout<{ // sizePerThread = {2, 2} // threadsPerWarp = {16, 2} // warpsPerCTA = {2, 1} //}> ```	2022-09-27 16:41:30 -07:00
Shintaro Iwasaki	43be75ad42	[FRONTEND] Add scalar type support for some ops (#661) This PR adds basic support for scalar-type inputs to some ops (cast and pointer arithmetic) for Triton-MLIR. It also renames getelementptr -> addptr.	2022-09-15 16:12:52 -07:00
Da Yan	2e08450c80	[OPTIMIZER] Better pipeline tests (#660)	2022-09-14 23:26:40 -07:00
Keren Zhou	16aed94ff5	[Analysis/Allocation] Allocation passes now assume that slices always alias (#108) The code in this branch assumes that the `src` operand in `insert_slice_async` always aliases the result. This does not hold in general and is only a workaround to make the pipeline pass work. I'm also working on the complete analysis in another [branch](https://github.com/openai/triton-mlir/tree/keren/analyze-slice).	2022-09-09 12:03:41 -07:00
Philippe Tillet	9bd5a3dcd2	[OPTIMIZER] Pipeline async buffer (#110)	2022-09-09 11:01:14 -07:00
Da Yan	35e346bcff	[OPTIMIZER] Better pipeline pass (#100) * Use `insert_slice_async` instead of `CopyAsync` * Move async.wait to loop header Co-authored-by: Jokeren <kerenzhou@openai.com>	2022-09-06 08:31:13 -07:00
Philippe Tillet	a0bab9748e	[OPTIMIZER] Coalesce pass no longer takes a `num-warps` argument (#99) Improved design to avoid inconsistent `num-warps` value between the pass and the parent module of the operation it processes.	2022-09-05 18:09:02 -07:00
Philippe Tillet	d0b4c67b05	[OPTIMIZER] Improved layout conversion simplification algorithm (#97) This PR simplifies the layout conversion simplification algorithm and improves it so that it works with vectorized element-wise ops. The conversion optimizer still has a lot of room for improvement, and other PRs will address its limitations (ideally via some sort of explicit cost model).	2022-09-02 16:52:44 -07:00
Shintaro Iwasaki	0ebef11c77	[TritonIR] Make mask operand optional (#74)	2022-08-22 22:00:17 -07:00
Da Yan	92ef552a54	[OPTIMIZER] Fix Num in AsyncWaitOp generated by the pipeline pass (#72)	2022-08-22 15:58:10 -07:00
Shintaro Iwasaki	9aa00249a6	[TritonIR] make other optional and remove isOtherUnspecified (#67) [Triton] make other optional and remove isOtherUnspecified	2022-08-18 18:19:55 -07:00
Philippe Tillet	192be76b3c	[OPTIMIZER] Rewrite patterns for layout conversions (#64)	2022-08-18 12:49:37 -07:00
Da Yan	8776ad1a0e	[OPTIMIZER] Let the pipeline pass insert async wait. (#63)	2022-08-18 10:31:57 -07:00
Shintaro Iwasaki	d69ce77b19	[FRONTEND] add an attr for masked load without explicit other (#55)	2022-08-18 09:51:37 -07:00
Philippe Tillet	78ebbe24c7	[FRONTEND] Added `ExpandDimsOp` primitive (#36)	2022-08-04 18:41:06 -07:00
Philippe Tillet	3236642e8f	[OPTIMIZER] Added memory coalescing pass (#31)	2022-07-31 20:59:31 -07:00
Philippe Tillet	d1593e6ca8	[TritonGPU] Improved documentation and semantics of layout encodings (#30)	2022-07-31 13:59:44 -07:00
Phil Tillet	65237f6117	[PACKAGING] Added FileCheck	2022-07-07 16:53:19 -07:00
Yan Da	26fcc12afd	better unit tests	2022-06-07 19:35:38 +08:00
Yan Da	0e11435448	more tests	2022-06-06 21:10:28 +08:00
Yan Da	7807f64ef3	rename sharded_layout => blocked_layout	2022-06-05 16:14:59 +08:00
Yan Da	bbf75b492f	more tests	2022-06-05 15:10:09 +08:00
Yan Da	d5eca56cf3	more TritonGPU unit tests	2022-06-05 14:25:09 +08:00

33 Commits