triton

Author	SHA1	Message	Date
Keren Zhou	a428cf0bb2	[FRONTEND] Fix pytorch warning. (#560 ) UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc').	2022-06-20 20:12:09 -07:00
Jiabao Lei	abea3dc2c6	[FRONTEND] provide device kwargs && fix fstring error for py<3.8 (#515 ) Co-authored-by: Philippe Tillet <phil@openai.com>	2022-05-14 16:21:46 -07:00
Madeleine Thompson	efdabe6073	[STYLE] check python with flake8 (#424 ) I've been using this locally to find errors without running tests, and now that we're using autopep8, it passes with minimal suppressions. This is also what turned up the issues with the tutorials, which were fixed in #422.	2022-01-07 15:28:36 -08:00
Madeleine Thompson	8bf551ae7a	[STYLE] run autopep8 and isort (#421 ) Run: ``` isort ./python autopep8 -i --ignore E501,E701,E731 $(find ./python/ -name '*.py') ``` with an `.isort.cfg` and then clean up a few warts. This PR should be a no-op; the idea is that this is all boring whitespace changes, and any config file changes will be in a different change to make it easier to review.	2022-01-06 14:34:17 -08:00
Philippe Tillet	2c287544cb	[OPS] Faster and cleaner block-sparse implementation (#311 )	2021-09-27 18:25:16 -07:00
Benjamin Lefaudeux	bd855ac13d	[DOCS] Adding some doc on the benchmarks + requirements file (#285 )	2021-09-18 16:37:30 -07:00
Philippe Tillet	9f30af76fb	[GENERAL] Minor improvements: (#110 ) * Load libcuda.so.1 if libcuda.so is not there. Error if both aren't there. * Support for multiple grad_to_none in triton.testing.do_bench * Benchmark dataframe printed along with name	2021-07-27 12:38:49 -07:00
Philippe Tillet	bfc0a7587d	[PYTHON] Renamed triton.core -> triton.language (#92 )	2021-07-27 12:38:49 -07:00
Philippe Tillet	1fdb465b71	[DOCS] Various improvements and typo fixes	2021-07-27 12:38:49 -07:00
Philippe Tillet	50e58d73db	[DOCS] Improved plots in tutorials	2021-07-27 12:38:49 -07:00
Philippe Tillet	eacbb73968	[PYTHON] CUTLASS wrapper for fair benchmarks (#75 ) Before this commit, the benchmarking infrastructure used heterogeneous protocols between library (e.g., CUTLASS uses a C++ binary that reports mean TFLOPS; torch and triton use python call and report 10th, 50th and 90th quantiles). For the sake of uniformity and fair benchmark practices, this PR adds a python wrapper for auto-tuned CUTLASS matrix multiplication. Benchmarks have been rewritten to use this wrapper with `triton.testing.do_bench` rather than system calls to CUTLASS profiler. Importantly, this also ensures that all the matmuls are done on the same input data which should stabilize clock across providers.	2021-07-27 12:38:49 -07:00
Philippe Tillet	58a5c87c53	[PYTHON] Made `bench_blocksparse` and `bench_cross_entropy` compatible with the new performance report API	2021-07-27 12:38:49 -07:00
Philippe Tillet	5b9afaa688	[CODEGEN] Fixed bug that caused conditional operator to not always properly mask load operations Also includes minor improvement to benchmarking infrastructure	2021-07-27 12:38:49 -07:00
Philippe Tillet	62835a0979	[RUNTIME] Added auto-alignment mechanism (#71 ) This PR adds an automatic memory alignment mechanism in the Triton runtime. Specifically, the JIT compiler detects the alignment (in bytes) of each pointer argument as well as the largest power of two divisor (between 1 and 16) of each integer argument. Proper .aligned and .multipleof attributes are then added to the Triton-IR on-the-fly for all auto-tunable kernels. There is a cache that remembers all the kernels compiled for each possible configuration. This PR also includes substantial cleaning of the Python API. This adds 2-3us overhead, mostly due to accessing integer #defines from the auto-tuned compilation options. The previous solution was slightly faster but hacky and potentially unsafe, so this is preferred for now.	2021-07-27 12:38:49 -07:00
Philippe Tillet	ff62f7fffc	[PYTHON] bugfix in bench_cross_entropy	2021-07-27 12:38:49 -07:00
Philippe Tillet	5b83259592	[CODEGEN] Major performance improvements on A100 (#70 ) Improved handling of asynchronous copy, scheduling and synchronization for A100. Now achieving CUTLASS-like performance on large square dense matrix multiplication tasks	2021-07-27 12:38:49 -07:00
Philippe Tillet	ce8aa2a41a	[CI] Added benchmarking to CI script (#65 )	2021-07-27 12:38:49 -07:00
Philippe Tillet	5e3c7f5a60	[PYTHON] Added automated benchmark script (#63 ) This adds a bench functionality to the setup.py that can be used to run the benchmark suite and generates a bunch of csv files (and optionally plots) python setup.py bench python setup.py bench --with-plots python setup.py bench --filter=cross_entropy	2021-07-27 12:38:48 -07:00

18 Commits