triton

Author	SHA1	Message	Date
Philippe Tillet	59b0ac672a	[LANGUAGE] Added support for bitcast (#119)	2021-07-27 12:38:49 -07:00
Philippe Tillet	3ab121dbdb	[PYTHON] Added support for tuples (#116)	2021-07-27 12:38:49 -07:00
Philippe Tillet	f81012a8cf	[CODEGEN] Fixed atomic_add issue (#112). Also fixed a liveness analysis bug for instructions that are not DCE'd but have no users (e.g., atomic_cas).	2021-07-27 12:38:49 -07:00
Philippe Tillet	bfc0a7587d	[PYTHON] Renamed triton.core -> triton.language (#92)	2021-07-27 12:38:49 -07:00
Philippe Tillet	39f4730305	Deprecation of Triton-C and Replacement by decorated Python functions (#86). This PR implements a major overhaul of the Triton frontend and replaces Triton-C with a pure Python API in which kernels are defined as @triton.jit-decorated functions. The documentation and tutorials have also been updated to reflect these changes. See the documentation for more information on the new API.	2021-07-27 12:38:49 -07:00
Philippe Tillet	2f80a98776	[BUILD] Added automatic nightly build releases to pip in CI; removed build-time dependence on LLVM and PyTorch (#77). There have recently been more and more reports of installation issues: installing Triton before upgrading PyTorch can cause problems because Triton uses some torch headers; llvm-10-dev is not available on some platforms, and llvm-11-dev is not available on e.g. Ubuntu; there are no nightly builds. This PR should fix all of these issues. Some CMake tricks are used to download and install LLVM at build time. The Triton Python bindings were modified to remove the dependence on PyTorch ops. A midnight CI job was added to generate binary wheels for all Triton versions and update them on PyPI's new triton-nightly project. This PR will also make it very easy to use LLVM forks in the future for whatever needs we have.	2021-07-27 12:38:49 -07:00
Philippe Tillet	62835a0979	[RUNTIME] Added auto-alignment mechanism (#71). This PR adds an automatic memory alignment mechanism to the Triton runtime. Specifically, the JIT compiler detects the alignment (in bytes) of each pointer argument as well as the largest power-of-two divisor (between 1 and 16) of each integer argument. The appropriate .aligned and .multipleof attributes are then added to the Triton-IR on the fly for all auto-tunable kernels. A cache remembers all the kernels compiled for each possible configuration. This PR also includes a substantial cleanup of the Python API. The mechanism adds 2-3us of overhead, mostly due to accessing integer #defines from the auto-tuned compilation options. The previous solution was slightly faster but hacky and potentially unsafe, so this one is preferred for now.	2021-07-27 12:38:49 -07:00
Philippe Tillet	567a1a3d17	[CODEGEN] Bugfixes with FP32 async copy	2021-07-27 12:38:49 -07:00
Philippe Tillet	5b83259592	[CODEGEN] Major performance improvements on A100 (#70). Improved handling of asynchronous copy, scheduling and synchronization for A100. Now achieving CUTLASS-like performance on large square dense matrix multiplication tasks.	2021-07-27 12:38:49 -07:00
Jared Kaplan	045ab5d62a	[PYTHON] Add Blocksparse Attention Fwd/Bwd Test (#69). Also includes a small bugfix for block-sparse softmax.	2021-07-27 12:38:49 -07:00
Philippe Tillet	ce8aa2a41a	[CI] Added benchmarking to CI script (#65)	2021-07-27 12:38:49 -07:00
Philippe Tillet	5e3c7f5a60	[PYTHON] Added automated benchmark script (#63). This adds a bench command to setup.py that can be used to run the benchmark suite and generate a set of CSV files (and optionally plots): python setup.py bench; python setup.py bench --with-plots; python setup.py bench --filter=cross_entropy	2021-07-27 12:38:48 -07:00
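
The auto-alignment commit (62835a0979) describes detecting, for each integer argument, the largest power-of-two divisor between 1 and 16, and caching one compiled kernel per resulting configuration. The following is a minimal Python sketch of that idea only, not Triton's actual implementation: the names `largest_pow2_divisor`, `specialization_key`, `KernelCache` and `compile_fn` are hypothetical, and pointer arguments are assumed to be passed as plain integer addresses.

```python
from typing import Callable, Dict, Sequence, Tuple


def largest_pow2_divisor(n: int, cap: int = 16) -> int:
    """Return the largest power of two in [1, cap] that divides n.

    `cap` is assumed to be a power of two. Zero is divisible by every
    power of two, so it maps to `cap`.
    """
    if n == 0:
        return cap
    # n & -n isolates the lowest set bit, i.e. the largest power of two dividing n.
    return min(n & -n, cap)


def specialization_key(args: Sequence[int]) -> Tuple[int, ...]:
    """Hypothetical cache key: one divisor per integer (or address) argument."""
    return tuple(largest_pow2_divisor(a) for a in args)


class KernelCache:
    """Hypothetical cache holding one compiled kernel per alignment configuration."""

    def __init__(self, compile_fn: Callable[[Tuple[int, ...]], object]) -> None:
        self._compile_fn = compile_fn  # stand-in for the real JIT compilation step
        self._cache: Dict[Tuple[int, ...], object] = {}

    def get(self, args: Sequence[int]) -> object:
        key = specialization_key(args)
        if key not in self._cache:
            # Compile only the first time a given configuration is seen.
            self._cache[key] = self._compile_fn(key)
        return self._cache[key]
```

In this sketch, calls whose arguments share the same divisors (for example sizes 1024 and 2048, both capped at 16) reuse a single cached kernel, which is what would keep the per-call overhead down to a key computation and a dictionary lookup.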

112 Commits