Philippe Tillet
92242ace2c
[DOCS] Re-structured documentation hierarchy
2021-07-27 12:38:49 -07:00
Philippe Tillet
ca04da3575
[DOCS] Switched tutorials to Python and use Sphinx Gallery
2021-07-27 12:38:49 -07:00
Philippe Tillet
5172792543
[DOCS] Added .ipynb tutorials in docs
2021-07-27 12:38:49 -07:00
Philippe Tillet
0c13b8ff0e
[DOCS] Updated and improved docs ( #73 )
2021-07-27 12:38:49 -07:00
Philippe Tillet
3ecf834a69
[PYTHON] Deleted 01-vector-add.py: it is an unnecessary duplicate of 01-vector-add.ipynb
2021-07-27 12:38:49 -07:00
Philippe Tillet
62835a0979
[RUNTIME] Added auto-alignment mechanism ( #71 )
...
This PR adds an automatic memory alignment mechanism in the Triton runtime. Specifically, the JIT compiler detects the alignment (in bytes) of each pointer argument as well as the largest power of two divisor (between 1 and 16) of each integer argument. Proper .aligned and .multipleof attributes are then added to the Triton-IR on-the-fly for all auto-tunable kernels. There is a cache that remembers all the kernels compiled for each possible configuration.
This PR also includes substantial cleaning of the Python API. This adds 2-3us overhead, mostly due to accessing integer #defines from the auto-tuned compilation options. The previous solution was slightly faster but hacky and potentially unsafe, so this is preferred for now.
2021-07-27 12:38:49 -07:00
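The divisibility detection described in the auto-alignment entry above can be sketched in a few lines of Python. This is an illustration only, not the Triton runtime code: the names `pow2_divisor`, `specialization_key` and `compiled` are hypothetical, and capping pointer alignment at 16 is an assumption (the message states the 1 to 16 range only for integer arguments).

```python
def pow2_divisor(value: int, cap: int = 16) -> int:
    """Return the largest power of two in [1, cap] that divides `value`.

    `cap` is assumed to be a power of two itself.
    """
    if value == 0:
        return cap  # zero is divisible by every power of two
    lowest_set_bit = value & -value  # largest power of two dividing value
    return min(lowest_set_bit, cap)


def specialization_key(pointer_addresses, int_args, cap=16):
    """Build a hashable key from pointer alignments and integer divisors."""
    aligned = tuple(pow2_divisor(addr, cap) for addr in pointer_addresses)
    multiple_of = tuple(pow2_divisor(n, cap) for n in int_args)
    return aligned, multiple_of


# Hypothetical cache: one compiled kernel per distinct key.
compiled = {}
key = specialization_key([0x7F0000000100, 0x7F0000000008], [1024, 12, 7])
compiled.setdefault(key, "kernel specialized for " + repr(key))
```

Two calls whose arguments share the same alignments and divisors map to the same key, so under this sketch the second call would reuse the cached entry instead of recompiling.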
Philippe Tillet
ff62f7fffc
[PYTHON] bugfix in bench_cross_entropy
2021-07-27 12:38:49 -07:00
Philippe Tillet
50ff1aea86
[DOCS] Added Python 02-fused-softmax.ipynb tutorial
2021-07-27 12:38:49 -07:00
Philippe Tillet
f64b779b0d
[PYTHON] Bugfix on FP32 blocksparse matmul
2021-07-27 12:38:49 -07:00
Philippe Tillet
567a1a3d17
[CODEGEN] Bugfixes with FP32 async copy
2021-07-27 12:38:49 -07:00
Philippe Tillet
11215f0f03
[CODEGEN] Now initializing cp.async to zero when predicate is false
...
WARNING: case for non-zero initialization is still not handled. Will require manual copy to shared
2021-07-27 12:38:49 -07:00
Philippe Tillet
5b83259592
[CODEGEN] Major performance improvements on A100 ( #70 )
...
Improved handling of asynchronous copy, scheduling and synchronization for A100. Now achieving CUTLASS-like performance on large square dense matrix multiplication tasks
2021-07-27 12:38:49 -07:00
Jared Kaplan
045ab5d62a
[PYTHON] Add Blocksparse Attention Fwd/Bwd Test ( #69 )
...
Also includes small bugfix for block-sparse softmax
2021-07-27 12:38:49 -07:00
Tom B Brown
7aa4d080b3
[PYTHON] Avoid dangerous global variables in kwarg default values ( #68 )
2021-07-27 12:38:49 -07:00
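The entry above does not say which defaults were affected. As a generic illustration of the hazard, Python evaluates a default value once, when the `def` statement runs, so a default taken from a global does not follow later changes to that global. All names below are hypothetical.

```python
DEFAULT_DEVICE = "cuda:0"  # hypothetical module-level setting


def launch_frozen(device=DEFAULT_DEVICE):
    # The default was captured when the function was defined.
    return device


def launch_late(device=None):
    # Resolve the global at call time instead.
    if device is None:
        device = DEFAULT_DEVICE
    return device


DEFAULT_DEVICE = "cuda:1"  # the setting changes after the definitions
frozen_result = launch_frozen()  # still "cuda:0"
late_result = launch_late()      # "cuda:1"
```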
Jason Ansel
dcd14c4e8d
[DOCS] Fix links in README.md ( #66 )
2021-07-27 12:38:49 -07:00
Philippe Tillet
8e0d198787
[CI] Continuous integration now publishes performance plots
2021-07-27 12:38:49 -07:00
Philippe Tillet
d190285d89
[PYTHON][OPS] Added compiler hints to improve performance of cross-entropy
2021-07-27 12:38:49 -07:00
Philippe Tillet
b301c2d199
[CI] Updated build status URL
2021-07-27 12:38:49 -07:00
Philippe Tillet
ce8aa2a41a
[CI] Added benchmarking to CI script ( #65 )
2021-07-27 12:38:49 -07:00
Philippe Tillet
3ca40b05cf
[DRIVER] Added options for developers to cache PTX file so that it can be manually modified
2021-07-27 12:38:49 -07:00
Philippe Tillet
5e3c7f5a60
[PYTHON] Added automated benchmark script ( #63 )
...
This adds a bench command to setup.py that runs the benchmark suite and generates a set of CSV files (and optionally plots):
python setup.py bench
python setup.py bench --with-plots
python setup.py bench --filter=cross_entropy
2021-07-27 12:38:48 -07:00
Philippe Tillet
66c94f21d7
[PYTHON] Removed .softmax from ops/__init__.py following previous commit
2021-07-27 12:38:48 -07:00
Philippe Tillet
b0647cfd52
[PYTHON] Removed support for dense softmax
...
Interest seems limited now that it is fused in cross_entropy. Will likely re-add once it's easier to share code between ops
2021-07-27 12:38:48 -07:00
Jared Kaplan
682ac4c60e
Added a Softmax Xent Op ( #53 )
...
Also includes a bugfix in kernel.py to set the device before registering the C++ function object
2021-07-27 12:38:48 -07:00
Philippe Tillet
dffd66bc83
[PYTHON] Made codebase pep8 compliant
2021-07-27 12:38:48 -07:00
Philippe Tillet
2a02fabdac
[PYTHON] Some cleaning of the PyBind11 wrappers ( #62 )
2021-07-27 12:38:48 -07:00
Philippe Tillet
08909b49c8
[CI] Fixed unmerged YAML file
2021-07-27 12:38:48 -07:00
Philippe Tillet
3a4638f14d
[CI] Updated trigger rules and status badge link
2021-07-27 12:38:48 -07:00
Philippe Tillet
358c7c239e
[CI] Improvement of CI pipeline ( #60 )
2021-07-27 12:38:48 -07:00
Philippe Tillet
04c7bb75ce
[GITHUB] Modified run-on property of run-pipeline.yaml
2021-07-27 12:38:48 -07:00
Philippe Tillet
e0568c35f0
[CI] Added Github Actions ( #59 )
2021-07-27 12:38:48 -07:00
Philippe Tillet
5522a8bdb4
[GENERAL] Added continuous integration
2021-07-27 12:38:48 -07:00
Philippe Tillet
80e8a2f1f2
[PYTHON][OPS][BLOCKSPARSE] Now rounding softmax tile sizes to next power of 2
2021-07-27 12:38:48 -07:00
Philippe Tillet
b8a52c70c9
[LANG] Now requiring tiles have power of 2 number of elements
2021-07-27 12:38:48 -07:00
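The two power-of-two entries above (rounding softmax tile sizes up, and requiring tiles to have a power-of-two number of elements) come down to two small integer checks. A minimal sketch, with hypothetical helper names that are not taken from the Triton sources:

```python
import math


def is_power_of_2(n: int) -> bool:
    """True when n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def next_power_of_2(n: int) -> int:
    """Smallest power of two that is greater than or equal to n (n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n - 1).bit_length()


def tile_is_valid(shape) -> bool:
    """Check that a tile's total element count is a power of two."""
    return is_power_of_2(math.prod(shape))
```

For example, a row length of 1000 would be rounded up to 1024 under this sketch, and a 24 x 16 tile would be rejected because 384 is not a power of two.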
Philippe Tillet
cc84a476a3
[TESTS] test_matmul.py now plots benchmarks
2021-07-27 12:38:48 -07:00
Philippe Tillet
fedbe6f439
[PYTHON] Added triton.__version__ string
2021-07-27 12:38:48 -07:00
Philippe Tillet
6fb4800f57
Improvements w/ Auto-Tuning and standard benchmarks ( #57 )
...
[PYTHON] Bug-fixes in the auto-tuning module and improvement of the existing API for it
2021-07-27 12:38:48 -07:00
Philippe Tillet
ad005d49ac
[PYTHON] Added benchmark code for CUTLASS
2021-07-27 12:38:48 -07:00
Philippe Tillet
ad5a30bae1
[LANG] Added __debug_barrier() call to force insertion of a CUDA __syncthreads
2021-07-27 12:38:48 -07:00
Philippe Tillet
3fde4b8f5b
[RUNTIME] Auto-tuning now works as expected when the values of autotune_key change
2021-07-27 12:38:48 -07:00
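One way to picture the behavior named in the autotune_key entry above is a cache of best configurations indexed by the current values of the key arguments, so that a new value triggers a fresh search while a repeated value reuses the stored result. The sketch below is an assumption about the general technique, not the Triton runtime; `AutotuneCache`, `select` and the `bench` callback are hypothetical.

```python
class AutotuneCache:
    """Remember the fastest configuration per distinct tuple of key values."""

    def __init__(self, configs, key_names, bench):
        self.configs = list(configs)        # candidate configurations
        self.key_names = tuple(key_names)   # argument names forming the key
        self.bench = bench                  # bench(config, kwargs) -> cost
        self.best = {}                      # key values -> best configuration

    def select(self, **kwargs):
        key = tuple(kwargs[name] for name in self.key_names)
        if key not in self.best:
            # New key values: benchmark every candidate once.
            self.best[key] = min(
                self.configs, key=lambda cfg: self.bench(cfg, kwargs)
            )
        return self.best[key]
```

A real tuner would time kernel launches inside `bench`; here the cost function is left to the caller so the caching logic can be exercised on its own.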
Philippe Tillet
52af8cda34
[PYTHON] Fixed issue with IS_TK_DIV_K
2021-07-27 12:38:48 -07:00
Philippe Tillet
7cf358a352
[TUTORIALS] Fixed TYPO in CMakeLists.txt
2021-07-27 12:38:48 -07:00
Philippe Tillet
9b31244897
[PYTHON] Added benchmarking code
2021-07-27 12:38:48 -07:00
Philippe Tillet
0b025db2ee
[RUNTIME] Added option to print LLVM-IR
...
Also includes appropriate driver code change for that
2021-07-27 12:38:48 -07:00
Philippe Tillet
7ba242fcce
[PYTHON][OPS] Added block-sparse softmax
2021-07-27 12:38:48 -07:00
Philippe Tillet
f81da73b6a
[PYTHON] Added utility to read single Triton kernel from provided file in triton.read
2021-07-27 12:38:48 -07:00
Philippe Tillet
9f9d7b8840
[LANG] Fixed parsing error for built-in functions exp/log/sqrtf
2021-07-27 12:38:48 -07:00
Philippe Tillet
269ebc12e5
[PYTHON][TESTS][DOC] Various improvements to the API and code quality:
...
* Simplified `triton.kernel` API to achieve lower latency:
  > .data_ptr() must now be passed as kernel argument. No more implicit conversion from torch.tensor
  > compilation options are now constant attributes, i.e., opt.d('VAR') becomes opt.VAR
  > torch.device must now be passed explicitly to triton.kernel (no longer inferred from torch.tensor arguments)
* C++ tests moved to `python/tests/`
* C++ tutorial created in `tutorials/`
* Python tutorial created in `python/tutorials/`
* Version changed to 1.0alpha
* No longer copying C++ headers into the Python package
* Added `python/triton/ops/` package for pre-written Triton ops
2021-07-27 12:38:48 -07:00
Philippe Tillet
a5a477c36b
[CODEGEN] Fixed bug in recoalesce_inst LLVM codegen
2021-07-27 12:38:48 -07:00
Philippe Tillet
376c876eb8
[RUNTIME] Disable error on spills
2021-07-27 12:38:48 -07:00