triton

Author	SHA1	Message	Date
Rohit Santhanam	8cc448d92e	Changes to eliminate the need for the MI_GPU_ARCH environment variable. The AMDGPU arch is now parsed out of the rocminfo dump.	2022-11-18 18:51:57 +00:00
rsanthanam-amd	fecc7ce248	Fix for test_bitwise subtests for ROCm. (#16 ) The issue was that the kernel names were colliding with each other in the cache. Since the kernel names were based on the date and time, the kernels were getting compiled so fast that a subsequent kernel would end up with the same name as the previous one and would therefore overwrite it in the cache. It seems to run the same test multiple times but the subsequent runs would end up using the wrong kernel because of the collisions. It is fixed by appending a randomly generated alphanumeric string to keep the kernel names unique.	2022-10-31 15:24:08 -04:00
Michael Melesse	39381d99f8	send amdgcn to cache	2022-10-26 17:18:33 +00:00
Michael Melesse	32dbc08c05	fix llvm build errors	2022-10-17 18:29:15 +00:00
Michael Melesse	5c548fb57e	Merge branch 'master' into rcom52_fixes	2022-10-17 17:53:48 +00:00
Daniil Fukalov	406d03bfaf	Improve ROCm support. (#780 ) - updates to support ROCm 5.2 - workarounds in tests where NV tools were used unconditionally - implemented `get_num_blocks()` and `add_memfence()` for AMD GPU - backported from history some atomics - added bf16 support - minor warnings cleanup - added dockerfile to run on a ROCm enabled machine Co-authored-by: B1tway <andrew.shukshov@gmail.com> Co-authored-by: Andrey Shukshov <36711069+B1tway@users.noreply.github.com>	2022-10-14 11:33:42 -07:00
Philippe Tillet	33e6f0df7f	[DRIVER] Bumped CUDA requirement to 11.4+. This is to avoid bad performance surprises as older `ptxas` are much slower. (#769 ) This also makes codegen simpler by avoiding special handling of eviction policies	2022-10-12 12:02:30 -07:00
Jason Ansel	998fd5f9af	[FRONTEND] Make triton.compile work without a cuda context (#708 ) This allows compiling in a subprocess. I'm not seeing a ton of speedup from this, but figure it is a good change anyway.	2022-09-24 13:41:47 -07:00
Shintaro Iwasaki	c668d6596e	[DOCS] Fix spelling (#664 ) This PR applies minor spelling fix in comments and string literals to `master`. It shouldn't hurt anything.	2022-09-16 12:26:40 -07:00
Keren Zhou	4912916c11	[FRONTEND] Added support for element-wise function defined in external LLVM bitcode (e.g., libdevice) (#562 )	2022-07-13 15:52:21 -07:00
Philippe Tillet	2bed6fc850	[LANG] Added support for device functions (#484 )	2022-04-03 20:58:16 -07:00
apd10	e85c7a7fc7	Bugfix in ptxas path. (#487 ) Bug: "ret" value is destroyed when a failing "ptxas --version" is run overwriting the previous valid "ret" value. Fix: keep rets only for those runs which are successful. Pick the first one	2022-03-30 20:45:41 -07:00
Philippe Tillet	e0cc488055	[FRONTEND] Added `tl.clock` and `tl.globaltimer` (#485 )	2022-03-28 16:15:43 -07:00
Philippe Tillet	ea6d1f1b85	[DRIVER] LLVM driver fixup (#482 ) Current way of doing things is probably not super thread safe. init is shared between threads and some threads my not call the LLVMInitialize* function.	2022-03-23 00:24:45 -07:00
Philippe Tillet	98ed7db8c1	[CODEGEN] Improvements and bugfixes (#463 )	2022-02-24 14:56:24 -08:00
Philippe Tillet	2922dc141c	Merge branch 'master' into v2.0	2022-01-30 20:25:01 -08:00
Philippe Tillet	807d8a1945	[ALL] Merge master (#447 )	2022-01-30 20:21:20 -08:00
daadaada	59d371c6eb	[BACKEND] Added Int8 mma (#440 )	2022-01-27 09:12:44 -08:00
daadaada	94a2e10fe5	[BACKEND] Add bf16 & tf32 mma supports (on A100) (#426 )	2022-01-11 10:20:31 -08:00
Philippe Tillet	2509124dd0	[DRIVER] Fixed some issue with how ptxas is used (#399 ) Now using tmpnam and properly deleting temporaries when an exception is raised	2021-12-21 14:31:51 -08:00
Philippe Tillet	4e93b41c52	[GENERAL] Some minor fixups (#393 ) * [RUNTIME] Now displaying error message when generated PTX is invalid * [CODEGEN] Now converting `if` condition to bool implicitly	2021-12-17 18:06:21 -08:00
Victor	eb077fc993	[RUNTIME] fixed NVidia DLL names on Windows (#392 )	2021-12-16 22:09:52 -08:00
Michael Melesse	94d5c2e8b5	[ROCM] enable matmul(dot) and others (#391 )	2021-12-13 12:28:15 -08:00
Victor	73b04d71b2	Fixes for building on Windows (#382 ) * make C++ code compatible with Windows + MSVC * added dlfcn-win32 for cross-platform dlopen * fixed building and pip install on Windows * fixed shared library file name under Windows	2021-12-07 14:10:58 -08:00
Philippe Tillet	2acaa4d0dd	[LANG] Added support for constexpr (#361 )	2021-10-30 00:32:58 -07:00
Philippe Tillet	b7f0e87dc2	[DRIVER] Removed std::cout log message	2021-10-29 10:42:10 -07:00
Philippe Tillet	d3e584d4ba	Revert "[DRIVER] Fixed CUDA 10.1 bug (#357 )" (#358 ) This reverts commit `d35014ba47`.	2021-10-26 15:04:49 -07:00
Philippe Tillet	d35014ba47	[DRIVER] Fixed CUDA 10.1 bug (#357 )	2021-10-26 11:17:06 -07:00
Philippe Tillet	5ce1b726dc	[CODEGEN] Various bugfixes that make it possible to fuse RNG in a matmul epilogue (#356 )	2021-10-24 02:30:46 -07:00
Philippe Tillet	e22d92c63c	[RUNTIME] removed obsolete putenv call (#305 )	2021-09-23 17:51:58 -07:00
Philippe Tillet	6e5b0b4301	[FRONTEND] Added on-disk cache for compiled kernels (#287 )	2021-09-18 22:48:26 -07:00
Philippe Tillet	94c83d30ce	[GENERAL] Removed deprecated driver files and added basic compatibility with rocm (#268 ) - Removed driver module -- accelerator runtime is handled by pytorch - Added basic support for ROCM based on @micmelesse 's PR -- now can execute empty kernel on AMD devices without any compile-time changes - Now only using PREFER_SHARED for kernels when the size of shared memory is greater than 49k. Otherwise there can be poor L1 performance for broadcast tensors	2021-09-09 00:04:28 -07:00
Philippe Tillet	4ff3714d61	[CODEGEN] Various bugfixes and stability improvements in compiler backend (#240 )	2021-08-30 11:50:35 -07:00
milesial	5b29da719d	[DRIVER] Add CUDA P2P support (#209 )	2021-08-20 21:00:54 -07:00
Philippe Tillet	298da78058	[CODEGEN/DRIVER] Tweaks for performance optimization (#193 )	2021-08-07 16:41:44 -07:00
Philippe Tillet	e8031fe61f	[DRIVER] More robust support of unsupported CUDA version (#179 )	2021-08-02 09:06:55 -07:00
Philippe Tillet	2f0f51be50	[DRIVER] No longer crashing when encountering CUDA version >11.4	2021-07-29 11:27:55 -07:00
Philippe Tillet	8eb63bcb01	[CI] Various improvements to CI (#137 ) Add clean-up before CI runs. Now using static LLVM-11 libraries from system rather than recompilation. Still no run-time LLVM dependencies	2021-07-27 12:38:49 -07:00
Philippe Tillet	94ce6aa80f	[DRIVER] Added support for CUDA 11.4 (#135 )	2021-07-27 12:38:49 -07:00
Philippe Tillet	8cea583109	[IR] Preliminary support for BF16 (#129 ) This PR adds a BF16 data-type, along with FP32 <-> BF16 conversion instructions in the LLVM codegen. Other kinds of ops on bfloat16 are not yet supported.	2021-07-27 12:38:49 -07:00
daadaada	0b05e06c0d	cu_device::max_shared_memory() now returns max dynamic shared memory size (#127 )	2021-07-27 12:38:49 -07:00
daadaada	d8d6b715c8	[CODEGEN] Performance improvement on A100 (#125 ) Improved codegen for the Ampere GPUs. * Make the layout pass recognize the multistage pipelined pattern. * Now the pipeline pass can automate the multistage pipelining transformation. * Remove extra barriers (from the prefetch pass & WAR) on Ampere. * Update the code generator (generator.cc) to make Triton generate n-buffered shared memory loads/stores.	2021-07-27 12:38:49 -07:00
Philippe Tillet	b7b05a560e	[DRIVER] Now giving the option to use system ptxas through environment variable (#123 )	2021-07-27 12:38:49 -07:00
Philippe Tillet	9f30af76fb	[GENERAL] Minor improvements: (#110 ) * Load libcuda.so.1 if libcuda.so is not there. Error if both aren't there. * Support for multiple grad_to_none in triton.testing.do_bench * Benchmark dataframe printed along with name	2021-07-27 12:38:49 -07:00
Philippe Tillet	288b4f7f58	[PYTHON] Added frontend to print sass using turingas disasm.py (#109 )	2021-07-27 12:38:49 -07:00
daadaada	967e629c0c	[CODEGEN] Add a pass to prefetch operands of dot if applicable. (#105 ) * update membar pass when data is double buffered * Add instruction prefetch_s * prefetch tests pass (except the 1 warp case) * Fix the 1-warp bug * Add back prefetch files * Disable prefetch on a100 * Always add war barrier on sm>=80	2021-07-27 12:38:49 -07:00
Philippe Tillet	1e844ba78d	[CODEGEN] Switching to predicated inline PTX for LDGs (#103 )	2021-07-27 12:38:49 -07:00
Philippe Tillet	840140bf26	[CODEGEN] Removed dedicated reassociate pass to merge it into LLVM isel (#101 ) This massively simplifies implementation of `reassociate` and also fixes a bunch of bug. The pass could still be improved, but can already be used to generate constant pointer offsets in eg the matmul epilogue	2021-07-27 12:38:49 -07:00
Philippe Tillet	39f4730305	Deprecation of Triton-C and Replacement by decorated Python functions (#86 ) This PR implements a major overhaul of the frontend for Triton, and replaces Triton-C by a pure Python API in which kernels are defined as @triton.jit decorated functions. The documentation and tutorials have also been updated to accommodate these changes. See documentations for more information on the new API	2021-07-27 12:38:49 -07:00
Philippe Tillet	5b83259592	[CODEGEN] Major performance improvements on A100 (#70 ) Improved handling of asynchronous copy, scheduling and synchronization for A100. Now achieving CUTLASS-like performance on large square dense matrix multiplication tasks	2021-07-27 12:38:49 -07:00

1 2

75 Commits