triton

Author	SHA1	Message	Date
Keren Zhou	db3aa1d1fb	[FRONTEND] Fix libdevice (#776 ) Fix two problems in libdevice and external dispatch: 1. Use static triton types (e.g., tl.int32) instead of creating new types. Otherwise, `tl.int32` and `tl.dtype('int32')` are not the same thing. 2. The name of an extern inst should be empty but not the symbol name of the inst. TTIR generator will assign names automatically. Otherwise, we have the same variable name when there are multiple same extern insts. Before the PR: ```bash __nv_exp = extern_elementwise f64<1024> %11; __nv_exp = extern_elementwise f64<1024> %11; ``` After the PR: ```bash %12 = extern_elementwise f64<1024> %11; %13 = extern_elementwise f64<1024> %11; ```	2022-10-13 17:18:16 -07:00
Yu Guo	71b46acc42	[IR] Added special-purpose `dequantize` instruction (#759 ) It is currently necessary for optimal performance in quantized workloads to add a special-purpose instruction in the IR. Backward compatibility with this instruction is NOT guaranteed.	2022-10-12 14:14:45 -07:00
Keren Zhou	4912916c11	[FRONTEND] Added support for element-wise function defined in external LLVM bitcode (e.g., libdevice) (#562 )	2022-07-13 15:52:21 -07:00
Philippe Tillet	5b4c8f221e	[BACKEND] Compiler improvements (#557 ) This PR adds several optimization capabilities in the compiler backend: - Now using inline PTX for `tl.store`, making it possible to use things like evict_last - For A100, mma layout can be directly converted to shared memory - For A100, an additional "transpose" argument in `dot` allows tensors to be loaded once and used both row- and col- major. - Fixed liveness analysis; this was broken. - Now can load/store directly mma layout without converting. Useful for when tl.dot accumulator is initialized with DRAM data inside of an inner loop. - `tl.dot` can now take LHS inputs in registers when it comes from a previous `tl.dot` instruction. Useful for e.g. fused attention.	2022-06-27 11:49:19 -07:00
Philippe Tillet	9f08ecd684	[FRONTEND] Semantic analysis refactor (#491 ) Moved dispatch.cc to semantic.py (@ptillet) Integer signedness analysis was moved from C++ to python (@daadaada) Cleaner frontend types (@daadaada) Moved SSA construction to a separate object (@ptillet) Co-authored-by: Yan Da <dyanab@connect.ust.hk>	2022-04-06 16:13:53 -07:00
Philippe Tillet	2bed6fc850	[LANG] Added support for device functions (#484 )	2022-04-03 20:58:16 -07:00
Philippe Tillet	e0cc488055	[FRONTEND] Added `tl.clock` and `tl.globaltimer` (#485 )	2022-03-28 16:15:43 -07:00
Philippe Tillet	76a9ee50a8	Revert "[FRONTEND] Semantic analysis refactor (#473 )" (#483 ) This reverts commit `539961072c`.	2022-03-24 17:16:50 -07:00
daadaada	539961072c	[FRONTEND] Semantic analysis refactor (#473 ) Moved dispatch.cc to semantic.py Integer signedness now moved from C++ to python Cleaner frontend type Co-authored-by: Phil Tillet <phil@openai.com>	2022-03-16 21:25:30 -07:00
Philippe Tillet	98ed7db8c1	[CODEGEN] Improvements and bugfixes (#463 )	2022-02-24 14:56:24 -08:00
Philippe Tillet	bef76b142a	[BACKEND] float division is now approximate by default (#446 )	2022-01-29 18:29:29 -08:00
daadaada	94a2e10fe5	[BACKEND] Add bf16 & tf32 mma supports (on A100) (#426 )	2022-01-11 10:20:31 -08:00
Philippe Tillet	03f1256f60	[FRONTEND] Added `volatile` flag for load (#407 )	2021-12-30 22:33:24 -08:00
Madeleine Thompson	985798f101	add missing bfloat16 repr and improve assertions (#403 ) - `BF16TyID` was missing a repr implementation. - Throw a better exception on impossible casts. - Add a few assertions. Tested with a debug build. - Add `pointer_dtype.__str__` to aid kernel debugging.	2021-12-23 17:01:17 -08:00
Philippe Tillet	5ce1b726dc	[CODEGEN] Various bugfixes that make it possible to fuse RNG in a matmul epilogue (#356 )	2021-10-24 02:30:46 -07:00
daadaada	858dec8372	[CODEGEN] Add cache modifier to tl.load (#351 ) * Add cache modifier to tl.load * Add comment to cache_modifier * Remove force_nc_cache * Update test	2021-10-17 22:14:04 -07:00
Philippe Tillet	5123db0b7d	[LANG] Various (relatively minor) improvements (#320 )	2021-10-04 18:39:40 -07:00
Philippe Tillet	4ff3714d61	[CODEGEN] Various bugfixes and stability improvements in compiler backend (#240 )	2021-08-30 11:50:35 -07:00
Philippe Tillet	226fde6ea1	[CODEGEN] Now using atomic_rmw code path for atomic_xchg (#222 )	2021-08-17 16:33:23 -07:00
Philippe Tillet	2824345065	[LANGUAGE] Added cos/sin (#132 )	2021-07-27 12:38:49 -07:00
Philippe Tillet	0274429429	[IR] Added IR and Codegen support for atomic_rmw (#120 )	2021-07-27 12:38:49 -07:00
Philippe Tillet	f81012a8cf	[CODEGEN] Fixed atomic_add issue (#112 ) * [CODEGEN] Fixed atomic_add issue * [CODEGEN] Fixed liveness analysis bug for instructions that are not DCE'd but have no users (e.g., atomic_cas)	2021-07-27 12:38:49 -07:00
daadaada	967e629c0c	[CODEGEN] Add a pass to prefetch operands of dot if applicable. (#105 ) * update membar pass when data is double buffered * Add instruction prefetch_s * prefetch tests pass (except the 1 warp case) * Fix the 1-warp bug * Add back prefetch files * Disable prefetch on a100 * Always add war barrier on sm>=80	2021-07-27 12:38:49 -07:00
Philippe Tillet	840140bf26	[CODEGEN] Removed dedicated reassociate pass to merge it into LLVM isel (#101 ) This massively simplifies implementation of `reassociate` and also fixes a bunch of bug. The pass could still be improved, but can already be used to generate constant pointer offsets in eg the matmul epilogue	2021-07-27 12:38:49 -07:00
Philippe Tillet	39f4730305	Deprecation of Triton-C and Replacement by decorated Python functions (#86 ) This PR implements a major overhaul of the frontend for Triton, and replaces Triton-C by a pure Python API in which kernels are defined as @triton.jit decorated functions. The documentation and tutorials have also been updated to accommodate these changes. See documentations for more information on the new API	2021-07-27 12:38:49 -07:00
Philippe Tillet	5b83259592	[CODEGEN] Major performance improvements on A100 (#70 ) Improved handling of asynchronous copy, scheduling and synchronization for A100. Now achieving CUTLASS-like performance on large square dense matrix multiplication tasks	2021-07-27 12:38:49 -07:00
Philippe Tillet	083bbd1e8d	[GENERAL] Merged v1.0alpha into master. Added features are: - A100 support via mma.16816 - Thread swizzling for conflict-free shared memory accesses without padding - Complete overhaul of the LLVM code generation in codegen/selection/generator.cc to remove overengineering - Added debugging capabilities in the Python binding - Compilation error for kernels that spill	2021-07-27 12:38:48 -07:00
Philippe Tillet	baa858aa74	[CODEGEN] Fixed bug in atomic_add	2021-07-27 12:38:48 -07:00
Philippe Tillet	f152150e7d	[LANG] Added log intrinsic	2021-07-27 12:38:48 -07:00
Philippe Tillet	049ab989b5	[GENERAL] Various improvements: * Sparse einsum in triton.ops.einsum * Hacky support for fixed-tile-size atomic-add * Various bugfixes in parser	2021-07-27 12:38:48 -07:00
Philippe Tillet	3304629de9	[CORE] Fixed several issues that arose in the development of the torch-blocksparse package: * Now using warp shuffle in reductions when possible * Various bugfixes in layout inference * Added INFINITY, exponential and select * Better error messages for unimplemented constructs	2021-07-27 12:38:48 -07:00
Philippe Tillet	6d7cf35123	History prior to this date belonged to the now deprecated ISAAC project, and was deleted to save space	2021-07-27 12:38:38 -07:00

32 Commits