- updates to support ROCm 5.2
- workarounds in tests where NV tools were used unconditionally
- implemented `get_num_blocks()` and `add_memfence()` for AMD GPU
- backported some atomics from earlier history
- added bf16 support
- minor warnings cleanup
- added dockerfile to run on a ROCm-enabled machine
Co-authored-by: B1tway <andrew.shukshov@gmail.com>
Co-authored-by: Andrey Shukshov <36711069+B1tway@users.noreply.github.com>
Bug: "ret" value is destroyed when a failing "ptxas --version" is run
overwriting the previous valid "ret" value.
Fix: keep rets only for those runs which are successful. Pick the first
one
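A minimal sketch of the fixed logic, assuming a hypothetical helper that probes a list of candidate ptxas binaries (the paths and the function name are illustrative, not the actual code):

```python
import subprocess

def first_working_ptxas(candidates=("ptxas", "/usr/local/cuda/bin/ptxas")):
    rets = []
    for ptxas in candidates:
        try:
            # a failing run raises here and is skipped, so it can no longer
            # overwrite a previously collected valid "ret"
            ret = subprocess.check_output([ptxas, "--version"],
                                          stderr=subprocess.DEVNULL)
        except (OSError, subprocess.CalledProcessError):
            continue
        rets.append((ptxas, ret.decode()))
    if not rets:
        raise RuntimeError("no working ptxas found")
    # keep only the successful runs and pick the first one
    return rets[0]
```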
The current approach is probably not thread safe: init is shared between threads, and some threads may not call the LLVMInitialize* functions.
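A minimal sketch of one way to make that shared initialization thread safe, with a placeholder standing in for the real LLVMInitialize* binding (both function names below are hypothetical, not the actual Triton code):

```python
import threading

def llvm_initialize():
    pass  # placeholder for the real LLVMInitialize* calls

_init_lock = threading.Lock()
_initialized = False

def ensure_llvm_initialized():
    # every thread goes through the lock, so initialization runs exactly once
    # and no thread can proceed before it has completed
    global _initialized
    with _init_lock:
        if not _initialized:
            llvm_initialize()
            _initialized = True
```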
* make C++ code compatible with Windows + MSVC
* added dlfcn-win32 for cross-platform dlopen
* fixed building and pip install on Windows
* fixed shared library file name under Windows
- Removed driver module -- accelerator runtime is handled by pytorch
- Added basic support for ROCm based on @micmelesse's PR -- an empty kernel can now be executed on AMD devices without any compile-time changes
- Now only using PREFER_SHARED for kernels whose shared-memory usage is greater than 49k; otherwise there can be poor L1 performance for broadcast tensors
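A hedged sketch of that heuristic in Python (the function name, constant name, and return strings are illustrative, not the actual runtime code):

```python
# the "49k" threshold above; 49152 bytes = 48 KiB, the default per-block
# static shared-memory limit
PREFER_SHARED_THRESHOLD = 49152

def pick_cache_config(shared_mem_bytes: int) -> str:
    # only opt into the larger shared-memory carveout when the kernel needs it;
    # requesting it unconditionally can hurt L1 performance for broadcast tensors
    if shared_mem_bytes > PREFER_SHARED_THRESHOLD:
        return "PREFER_SHARED"
    return "PREFER_NONE"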
This PR adds a BF16 data-type, along with FP32 <-> BF16 conversion instructions in the LLVM codegen. Other kinds of ops on bfloat16 are not yet supported.
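For illustration, a small kernel that exercises the FP32 -> BF16 conversion path, written against the current Python API (`tl.bfloat16` and `.to()`); the exact dtype spelling at the time of this change may have differed:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fp32_to_bf16_kernel(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)   # fp32 input
    y = x.to(tl.bfloat16)                  # emits the fp32 -> bf16 conversion
    tl.store(y_ptr + offs, y, mask=mask)

x = torch.randn(8192, device='cuda', dtype=torch.float32)
y = torch.empty(8192, device='cuda', dtype=torch.bfloat16)
fp32_to_bf16_kernel[(8192 // 1024,)](x, y, 8192, BLOCK=1024)
```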
Improved codegen for Ampere GPUs.
* Make the layout pass recognize the multistage pipelined pattern.
* Now the pipeline pass can automate the multistage pipelining transformation.
* Remove extra barriers (from the prefetch pass & WAR) on Ampere.
* Update the code generator (generator.cc) to make Triton generate n-buffered shared memory loads/stores.
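As a sketch of how this is exercised from the Python side (written against the current API; the kernel, sizes, and knobs below are illustrative and may have been spelled differently at the time of this change), `num_stages` controls how many shared-memory buffers the pipelined K-loop uses:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + rm[:, None] * stride_am + rk[None, :] * stride_ak
    b_ptrs = b_ptr + rk[:, None] * stride_bk + rn[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # this K-loop is the multistage-pipelined pattern: its loads are turned
    # into n-buffered shared-memory loads/stores on Ampere
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.float16))

a = torch.randn(1024, 1024, device='cuda', dtype=torch.float16)
b = torch.randn(1024, 1024, device='cuda', dtype=torch.float16)
c = torch.empty(1024, 1024, device='cuda', dtype=torch.float16)
grid = (1024 // 128, 1024 // 128)
# num_stages > 2 requests deeper (n-buffered) pipelining of the K-loop
matmul_kernel[grid](a, b, c, 1024, 1024, 1024,
                    a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                    c.stride(0), c.stride(1),
                    BLOCK_M=128, BLOCK_N=128, BLOCK_K=32,
                    num_stages=4, num_warps=8)
```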
* Load libcuda.so.1 if libcuda.so is not present; error out if neither is there.
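A minimal sketch of that fallback using `ctypes` (illustrative, not the actual loader code):

```python
import ctypes

def load_libcuda():
    # prefer libcuda.so, fall back to libcuda.so.1, and error out if neither loads
    for name in ("libcuda.so", "libcuda.so.1"):
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue
    raise RuntimeError("could not load libcuda.so or libcuda.so.1")
```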
* Support for multiple grad_to_none in triton.testing.do_bench
* Benchmark dataframe printed along with name
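A usage sketch of the `do_bench` change (tensor names are illustrative; the exact return format of `do_bench` has varied across versions):

```python
import torch
import triton

x = torch.randn(4096, device='cuda', requires_grad=True)
w = torch.randn(4096, device='cuda', requires_grad=True)
y = (x * w).sum()

# several tensors can now be passed via grad_to_none; their .grad is reset to
# None between reps so backward() does not accumulate gradients in place
ms = triton.testing.do_bench(lambda: y.backward(retain_graph=True),
                             grad_to_none=[x, w])
print(ms)
```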
* update membar pass when data is double buffered
* Add instruction prefetch_s
* prefetch tests pass (except the 1 warp case)
* Fix the 1-warp bug
* Add back prefetch files
* Disable prefetch on a100
* Always add WAR barrier on sm>=80
This massively simplifies the implementation of `reassociate` and also fixes a bunch of bugs. The pass could still be improved, but it can already be used
to generate constant pointer offsets in, e.g., the matmul epilogue.
This PR implements a major overhaul of the frontend for Triton and replaces Triton-C with a pure Python API in which kernels are defined as @triton.jit-decorated functions. The documentation and tutorials have also been updated to accommodate these changes.
See the documentation for more information on the new API.
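For reference, a minimal sketch of what a kernel looks like under the new API (a vector-add example written against the current `triton.language` spelling; see the updated tutorials for the full versions):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(98432, device='cuda')
y = torch.randn(98432, device='cuda')
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(98432, meta['BLOCK']),)
add_kernel[grid](x, y, out, 98432, BLOCK=1024)
```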
Improved handling of asynchronous copy, scheduling and synchronization for A100. Now achieving CUTLASS-like performance on large square dense matrix multiplication tasks