triton

Author	SHA1	Message	Date
Madeleine Thompson	985798f101	add missing bfloat16 repr and improve assertions (#403 ) - `BF16TyID` was missing a repr implementation. - Throw a better exception on impossible casts. - Add a few assertions. Tested with a debug build. - Add `pointer_dtype.__str__` to aid kernel debugging.	2021-12-23 17:01:17 -08:00
daadaada	39d4bfed83	[OPS] Add performance model for gemm/gemv (#397 ) Significantly improves the performance of `triton.ops.matmul` in memory-bound settings via the use of many more block configs coupled with a performance model to drive the auto-tuning process.	2021-12-21 09:56:10 -08:00
Madeleine Thompson	fa62b4a8f6	[FRONTEND] better stringification (#394 ) - Don't override `self.args` in `CompilationError`, and show the line number and column in error messages. This causes it to generate an easier-to-read backtrace. - Better `__str__` on `TensorWrapper`, `dtype`, and `block`.	2021-12-17 20:11:45 -08:00
Philippe Tillet	558555630f	[FRONTEND] Added xor_sum	2021-12-16 17:55:35 -08:00
Philippe Tillet	5ce1b726dc	[CODEGEN] Various bugfixes that make it possible to fuse RNG in a matmul epilogue (#356 )	2021-10-24 02:30:46 -07:00
daadaada	858dec8372	[CODEGEN] Add cache modifier to tl.load (#351 ) * Add cache modifier to tl.load * Add comment to cache_modifier * Remove force_nc_cache * Update test	2021-10-17 22:14:04 -07:00
Stephen McGroarty	c2e6b90ff1	[CODEGEN] Fixes masked load exception (#342 )	2021-10-13 13:31:52 -07:00
Philippe Tillet	6e5b0b4301	[FRONTEND] Added on-disk cache for compiled kernels (#287 )	2021-09-18 22:48:26 -07:00
Philippe Tillet	94c83d30ce	[GENERAL] Removed deprecated driver files and added basic compatibility with rocm (#268 ) - Removed driver module -- accelerator runtime is handled by pytorch - Added basic support for ROCM based on @micmelesse's PR -- now can execute empty kernel on AMD devices without any compile-time changes - Now only using PREFER_SHARED for kernels when the size of shared memory is greater than 49k. Otherwise there can be poor L1 performance for broadcast tensors	2021-09-09 00:04:28 -07:00
daadaada	274d613488	[IR] Better printer (#256 )	2021-09-01 09:55:12 -07:00
Philippe Tillet	4ff3714d61	[CODEGEN] Various bugfixes and stability improvements in compiler backend (#240 )	2021-08-30 11:50:35 -07:00
daadaada	85426dbaf7	[DOCS] Add comments in layout.h (#249 )	2021-08-28 18:07:32 -07:00
milesial	5b29da719d	[DRIVER] Add CUDA P2P support (#209 )	2021-08-20 21:00:54 -07:00
Philippe Tillet	226fde6ea1	[CODEGEN] Now using atomic_rmw code path for atomic_xchg (#222 )	2021-08-17 16:33:23 -07:00
Philippe Tillet	bb1eebb4b4	[CODEGEN] Fixed bug for visit_reduce1d with 64-bit data-types (#207 )	2021-08-14 21:07:01 -07:00
Philippe Tillet	83da7065da	[DRIVER] Portability fixup (#195 )	2021-08-07 18:53:11 -07:00
Philippe Tillet	298da78058	[CODEGEN/DRIVER] Tweaks for performance optimization (#193 )	2021-08-07 16:41:44 -07:00
Philippe Tillet	76c6f24fb6	[CI] Made build-wheels compatible with system LLVM setup (#138 ) This speeds up wheelhouse build time by ~10x	2021-07-27 12:38:49 -07:00
Philippe Tillet	01276b5153	[FRONTEND] Added compilation flag to force use of `.nc` cache modifier (#134 ) in DRAM loads. /!\ USE CAREFULLY - THIS CAN BREAK CORRECTNESS IF MISUSED /!\	2021-07-27 12:38:49 -07:00
Philippe Tillet	2824345065	[LANGUAGE] Added cos/sin (#132 )	2021-07-27 12:38:49 -07:00
Philippe Tillet	8cea583109	[IR] Preliminary support for BF16 (#129 ) This PR adds a BF16 data-type, along with FP32 <-> BF16 conversion instructions in the LLVM codegen. Other kinds of ops on bfloat16 are not yet supported.	2021-07-27 12:38:49 -07:00
daadaada	d8d6b715c8	[CODEGEN] Performance improvement on A100 (#125 ) Improved codegen for the Ampere GPUs. * Make the layout pass recognize the multistage pipelined pattern. * Now the pipeline pass can automate the multistage pipelining transformation. * Remove extra barriers (from the prefetch pass & WAR) on Ampere. * Update the code generator (generator.cc) to make Triton generate n-buffered shared memory loads/stores.	2021-07-27 12:38:49 -07:00
Philippe Tillet	5a51f3e529	[CODEGEN] Bugfix in membar pass (#124 ) Membar pass on top of master is buggy with asynchronous copy. For example, it doesn't wait for asynchronous copies to complete before recoalescing accumulator in GEMM, which leads to undefined behavior when the program doesn't enter the loop. This PR proposes	2021-07-27 12:38:49 -07:00
Philippe Tillet	b7b05a560e	[DRIVER] Now giving the option to use system ptxas through environment variable (#123 )	2021-07-27 12:38:49 -07:00
Philippe Tillet	80c86ecf4a	[LANG] Minor semantic changes (#121 ) * Now using unordered instead of ordered float (fixes NaN issues) * Bool -> int32 now converts to 1 rather than -1 * Reduce extend arguments to 32-bits if possible	2021-07-27 12:38:49 -07:00
Philippe Tillet	0274429429	[IR] Added IR and Codegen support for atomic_rmw (#120 )	2021-07-27 12:38:49 -07:00
Philippe Tillet	59b0ac672a	[LANGUAGE] Added support for bitcast (#119 )	2021-07-27 12:38:49 -07:00
Philippe Tillet	f81012a8cf	[CODEGEN] Fixed atomic_add issue (#112 ) * [CODEGEN] Fixed atomic_add issue * [CODEGEN] Fixed liveness analysis bug for instructions that are not DCE'd but have no users (e.g., atomic_cas)	2021-07-27 12:38:49 -07:00
Philippe Tillet	325ee38581	[PYTHON] Fixed bug in scoping mechanism (#111 ) Inline functions didn't restore scope of parents. Also some control flow structure still had the scoping semantics of C++	2021-07-27 12:38:49 -07:00
Philippe Tillet	288b4f7f58	[PYTHON] Added frontend to print sass using turingas disasm.py (#109 )	2021-07-27 12:38:49 -07:00
daadaada	967e629c0c	[CODEGEN] Add a pass to prefetch operands of dot if applicable. (#105 ) * update membar pass when data is double buffered * Add instruction prefetch_s * prefetch tests pass (except the 1 warp case) * Fix the 1-warp bug * Add back prefetch files * Disable prefetch on a100 * Always add war barrier on sm>=80	2021-07-27 12:38:49 -07:00
Philippe Tillet	840140bf26	[CODEGEN] Removed dedicated reassociate pass to merge it into LLVM isel (#101 ) This massively simplifies implementation of `reassociate` and also fixes a bunch of bugs. The pass could still be improved, but can already be used to generate constant pointer offsets in, e.g., the matmul epilogue	2021-07-27 12:38:49 -07:00
Philippe Tillet	7355efa745	[LANG] Preliminary FP8 support (#96 )	2021-07-27 12:38:49 -07:00
Philippe Tillet	2b75158426	[PYTHON] Added atomic_add (#94 )	2021-07-27 12:38:49 -07:00
Philippe Tillet	39f4730305	Deprecation of Triton-C and Replacement by decorated Python functions (#86 ) This PR implements a major overhaul of the frontend for Triton, and replaces Triton-C by a pure Python API in which kernels are defined as @triton.jit decorated functions. The documentation and tutorials have also been updated to accommodate these changes. See the documentation for more information on the new API	2021-07-27 12:38:49 -07:00
Philippe Tillet	5ba5a77561	[BUILD] Remove compilation warnings	2021-07-27 12:38:49 -07:00
Philippe Tillet	5b9afaa688	[CODEGEN] Fixed bug that caused conditional operator to not always properly mask load operations. Also includes minor improvement to benchmarking infrastructure	2021-07-27 12:38:49 -07:00
Philippe Tillet	62835a0979	[RUNTIME] Added auto-alignment mechanism (#71 ) This PR adds an automatic memory alignment mechanism in the Triton runtime. Specifically, the JIT compiler detects the alignment (in bytes) of each pointer argument as well as the largest power of two divisor (between 1 and 16) of each integer argument. Proper .aligned and .multipleof attributes are then added to the Triton-IR on-the-fly for all auto-tunable kernels. There is a cache that remembers all the kernels compiled for each possible configuration. This PR also includes substantial cleaning of the Python API. This adds 2-3us overhead, mostly due to accessing integer #defines from the auto-tuned compilation options. The previous solution was slightly faster but hacky and potentially unsafe, so this is preferred for now.	2021-07-27 12:38:49 -07:00
Philippe Tillet	5b83259592	[CODEGEN] Major performance improvements on A100 (#70 ) Improved handling of asynchronous copy, scheduling and synchronization for A100. Now achieving CUTLASS-like performance on large square dense matrix multiplication tasks	2021-07-27 12:38:49 -07:00
Philippe Tillet	3ca40b05cf	[DRIVER] Added options for developers to cache PTX file so that it can be manually modified	2021-07-27 12:38:49 -07:00
Philippe Tillet	b8a52c70c9	[LANG] Now requiring tiles have power of 2 number of elements	2021-07-27 12:38:48 -07:00
Philippe Tillet	6fb4800f57	Improvements w/ Auto-Tuning and standard benchmarks (#57 ) [PYTHON] Bug-fixes in the auto-tuning module and improvement of the existing API for it	2021-07-27 12:38:48 -07:00
Philippe Tillet	3fde4b8f5b	[RUNTIME] Auto-tuning now works as expected when the values of autotune_key change	2021-07-27 12:38:48 -07:00
Philippe Tillet	0b025db2ee	[RUNTIME] Added option to print LLVM-IR. Also includes appropriate driver code change for that	2021-07-27 12:38:48 -07:00
Philippe Tillet	f81da73b6a	[PYTHON] Added utility to read single Triton kernel from provided file in triton.read	2021-07-27 12:38:48 -07:00
Philippe Tillet	9f9d7b8840	[LANG] Fixed parsing error for built-in functions exp/log/sqrtf	2021-07-27 12:38:48 -07:00
Philippe Tillet	269ebc12e5	[PYTHON][TESTS][DOC] Various improvement of the API and code quality: * Simplified `triton.kernel` API to achieve lower latency: > .data_ptr() must now be passed as kernel argument. No more implicit conversion from torch.tensor > compilation options are now constant attributes, i.e., opt.d('VAR') becomes opt.VAR > torch.device must now be passed explicitly to triton.kernel (no longer inferred from torch.tensor arguments) * C++ tests moved to `python/tests/` * C++ tutorial created in `tutorials/` * Python tutorial created in python/tutorials/ * Version changed to 1.0alpha * No longer copying C++ headers into the Python package * added python/triton/ops/ package for pre-written Triton ops	2021-07-27 12:38:48 -07:00
Philippe Tillet	083bbd1e8d	[GENERAL] Merged v1.0alpha into master. Added features are: - A100 support via mma.16816 - Thread swizzling for conflict-free shared memory accesses without padding - Complete overhaul of the LLVM code generation in codegen/selection/generator.cc to remove overengineering - Added debugging capabilities in the Python binding - Compilation error for kernels that spill	2021-07-27 12:38:48 -07:00
Yan Da	05b95b7fa6	[LANG] Add support for PREFIX_INC and PREFIX_DEC.	2021-07-27 12:38:48 -07:00
Philippe Tillet	44ca2c0cb8	[DRIVER] Removed deprecated files and functions	2021-07-27 12:38:48 -07:00

225 Commits