Philippe Tillet
f64b779b0d
[PYTHON] Bugfix on FP32 blocksparse matmul
2021-07-27 12:38:49 -07:00
Philippe Tillet
567a1a3d17
[CODEGEN] Bugfixes with FP32 async copy
2021-07-27 12:38:49 -07:00
Philippe Tillet
11215f0f03
[CODEGEN] Now initializing cp.async to zero when predicate is false
...
WARNING: case for non-zero initialization is still not handled. Will
require manual copy to shared
2021-07-27 12:38:49 -07:00
Philippe Tillet
5b83259592
[CODEGEN] Major performance improvements on A100 (#70)
...
Improved handling of asynchronous copy, scheduling and synchronization for A100. Now achieving CUTLASS-like performance on large square dense matrix multiplication tasks
2021-07-27 12:38:49 -07:00
Jared Kaplan
045ab5d62a
[PYTHON] Add Blocksparse Attention Fwd/Bwd Test (#69)
...
Also includes small bugfix for block-sparse softmax
2021-07-27 12:38:49 -07:00
Tom B Brown
7aa4d080b3
[PYTHON] Avoid dangerous global variables in kwarg default values (#68)
2021-07-27 12:38:49 -07:00
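The commit message for 7aa4d080b3 does not say which pattern was removed. As an illustration only, here is a minimal sketch of one common hazard of this kind: a default value is evaluated once, when the function is defined, so later changes to the global are silently ignored. The names DEFAULT_BLOCK, launch_risky and launch_safe are hypothetical and do not come from the Triton codebase.

```python
DEFAULT_BLOCK = 16  # hypothetical module-level setting


def launch_risky(block=DEFAULT_BLOCK):
    # The default was captured when the function was defined,
    # so it keeps the value the global had at that moment.
    return block


def launch_safe(block=None):
    # Use None as a sentinel and read the global at call time instead.
    if block is None:
        block = DEFAULT_BLOCK
    return block


# Simulate the global being reconfigured after the functions were defined.
DEFAULT_BLOCK = 32

risky_value = launch_risky()  # still sees the old value
safe_value = launch_safe()    # sees the updated value
```

The sentinel form costs one extra branch per call but keeps the function's behaviour tied to the current state of the module.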
Jason Ansel
dcd14c4e8d
[DOCS] Fix links in README.md (#66)
2021-07-27 12:38:49 -07:00
Philippe Tillet
8e0d198787
[CI] Continuous integration now publishes performance plots
2021-07-27 12:38:49 -07:00
Philippe Tillet
d190285d89
[PYTHON][OPS] Added compiler hints to improve performance of
...
cross-entropy
2021-07-27 12:38:49 -07:00
Philippe Tillet
b301c2d199
[CI] Updated build status URL
2021-07-27 12:38:49 -07:00
Philippe Tillet
ce8aa2a41a
[CI] Added benchmarking to CI script (#65)
2021-07-27 12:38:49 -07:00
Philippe Tillet
3ca40b05cf
[DRIVER] Added options for developers to cache PTX file so that it can
...
be manually modified
2021-07-27 12:38:49 -07:00
Philippe Tillet
5e3c7f5a60
[PYTHON] Added automated benchmark script (#63)
...
This adds a bench command to setup.py that runs the benchmark suite and generates CSV files (and optionally plots):
python setup.py bench
python setup.py bench --with-plots
python setup.py bench --filter=cross_entropy
2021-07-27 12:38:48 -07:00
Philippe Tillet
66c94f21d7
[PYTHON] Removed .softmax from ops/__init__.py following previous commit
2021-07-27 12:38:48 -07:00
Philippe Tillet
b0647cfd52
[PYTHON] Removed support for dense softmax
...
Interest seems limited now that it is fused in cross_entropy. Will
likely re-add once it's easier to share code between ops
2021-07-27 12:38:48 -07:00
Jared Kaplan
682ac4c60e
Added a Softmax Xent Op (#53)
...
Also includes a bugfix in kernel.py to set the device before registering the c++ function object
2021-07-27 12:38:48 -07:00
Philippe Tillet
dffd66bc83
[PYTHON] Made codebase pep8 compliant
2021-07-27 12:38:48 -07:00
Philippe Tillet
2a02fabdac
[PYTHON] Some cleaning of the PyBind11 wrappers (#62)
2021-07-27 12:38:48 -07:00
Philippe Tillet
08909b49c8
[CI] Fixed unmerged YAML file
2021-07-27 12:38:48 -07:00
Philippe Tillet
3a4638f14d
[CI] Updated trigger rules and status badge link
2021-07-27 12:38:48 -07:00
Philippe Tillet
358c7c239e
[CI] Improvement of CI pipeline (#60)
2021-07-27 12:38:48 -07:00
Philippe Tillet
04c7bb75ce
[GITHUB] Modified run-on property of run-pipeline.yaml
2021-07-27 12:38:48 -07:00
Philippe Tillet
e0568c35f0
[CI] Added Github Actions (#59)
2021-07-27 12:38:48 -07:00
Philippe Tillet
5522a8bdb4
[GENERAL] Added continuous integration
2021-07-27 12:38:48 -07:00
Philippe Tillet
80e8a2f1f2
[PYTHON][OPS][BLOCKSPARSE] Now rounding softmax tile sizes to next power
...
of 2
2021-07-27 12:38:48 -07:00
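The rounding described in 80e8a2f1f2 (and required by b8a52c70c9) can be illustrated with a short sketch. This is not the code from the commit; next_power_of_2 is a hypothetical helper showing one standard way to round a positive integer up to a power of two.

```python
def next_power_of_2(n: int) -> int:
    """Return the smallest power of two that is >= n, for n >= 1."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # (n - 1).bit_length() is the number of bits needed to represent n - 1,
    # so shifting 1 left by that amount gives the smallest power of two >= n.
    return 1 << (n - 1).bit_length()


# Example: a softmax row of 48 elements would be padded to a 64-element tile.
tile_size = next_power_of_2(48)
```

Values that are already powers of two are returned unchanged, so applying the helper is safe when the size is already valid.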
Philippe Tillet
b8a52c70c9
[LANG] Now requiring tiles have power of 2 number of elements
2021-07-27 12:38:48 -07:00
Philippe Tillet
cc84a476a3
[TESTS] test_matmul.py now plots benchmarks
2021-07-27 12:38:48 -07:00
Philippe Tillet
fedbe6f439
[PYTHON] Added triton.__version__ string
2021-07-27 12:38:48 -07:00
Philippe Tillet
6fb4800f57
Improvements w/ Auto-Tuning and standard benchmarks (#57)
...
[PYTHON] Bug-fixes in the auto-tuning module and improvement of the existing API for it
2021-07-27 12:38:48 -07:00
Philippe Tillet
ad005d49ac
[PYTHON] Added benchmark code for CUTLASS
2021-07-27 12:38:48 -07:00
Philippe Tillet
ad5a30bae1
[LANG] Added __debug_barrier() call to force insertion of a CUDA
...
__syncthreads
2021-07-27 12:38:48 -07:00
Philippe Tillet
3fde4b8f5b
[RUNTIME] Auto-tuning now works as expected when the values of
...
autotune_key change
2021-07-27 12:38:48 -07:00
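The behaviour named in 3fde4b8f5b, re-tuning when the values of the autotune key change, can be sketched as a cache indexed by those values. This is an assumption about the general technique, not Triton's implementation; AutotuneCache, best_config and fake_bench are hypothetical names.

```python
from typing import Callable, Dict, Sequence, Tuple


class AutotuneCache:
    """Hypothetical cache of the best config per distinct autotune-key value."""

    def __init__(self, configs: Sequence[dict], key_names: Sequence[str],
                 bench: Callable[[dict, dict], float]) -> None:
        self.configs = list(configs)
        self.key_names = tuple(key_names)
        self.bench = bench  # returns a runtime; lower is better
        self._best: Dict[Tuple, dict] = {}

    def best_config(self, args: dict) -> dict:
        # Build the cache key from the values of the tuning arguments.
        key = tuple(args[name] for name in self.key_names)
        if key not in self._best:
            # First time this combination is seen: benchmark every config.
            self._best[key] = min(self.configs,
                                  key=lambda c: self.bench(c, args))
        return self._best[key]


calls = []


def fake_bench(config: dict, args: dict) -> float:
    # Stand-in for a real timing run: record the call, return a made-up cost.
    calls.append((config["block"], args["N"]))
    return abs(config["block"] - args["N"])


tuner = AutotuneCache([{"block": 32}, {"block": 128}], ["N"], fake_bench)
small = tuner.best_config({"N": 16})   # benchmarks both configs
large = tuner.best_config({"N": 256})  # new key value, so it tunes again
again = tuner.best_config({"N": 16})   # cache hit, no new benchmark runs
```

Keying on the argument values rather than on the kernel alone means a config tuned for one problem size is never reused for another.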
Philippe Tillet
52af8cda34
[PYTHON] Fixed issue with IS_TK_DIV_K
2021-07-27 12:38:48 -07:00
Philippe Tillet
7cf358a352
[TUTORIALS] Fixed TYPO in CMakeLists.txt
2021-07-27 12:38:48 -07:00
Philippe Tillet
9b31244897
[PYTHON] Added benchmarking code
2021-07-27 12:38:48 -07:00
Philippe Tillet
0b025db2ee
[RUNTIME] Added option to print LLVM-IR
...
Also includes appropriate driver code change for that
2021-07-27 12:38:48 -07:00
Philippe Tillet
7ba242fcce
[PYTHON][OPS] Added block-sparse softmax
2021-07-27 12:38:48 -07:00
Philippe Tillet
f81da73b6a
[PYTHON] Added utility to read single Triton kernel from provided file
...
in triton.read
2021-07-27 12:38:48 -07:00
Philippe Tillet
9f9d7b8840
[LANG] Fixed parsing error for built-in functions exp/log/sqrtf
2021-07-27 12:38:48 -07:00
Philippe Tillet
269ebc12e5
[PYTHON][TESTS][DOC] Various improvements to the API and code quality:
...
* Simplified `triton.kernel` API to achieve lower latency:
> .data_ptr() must now be passed as kernel argument. No more implicit
conversion from torch.tensor
> compilation options are now constant attributes, i.e., opt.d('VAR')
becomes opt.VAR
> torch.device must now be passed explicitly to triton.kernel (no
longer inferred from torch.tensor arguments)
* C++ tests moved to `python/tests/`
* C++ tutorial created in `tutorials/`
* Python tutorial created in python/tutorials/
* Version changed to 1.0alpha
* No longer copying C++ headers into the Python package
* Added python/triton/ops/ package for pre-written Triton ops
2021-07-27 12:38:48 -07:00
Philippe Tillet
a5a477c36b
[CODEGEN] Fixed bug in recoalesce_inst LLVM codegen
2021-07-27 12:38:48 -07:00
Philippe Tillet
376c876eb8
[RUNTIME] Disable error on spills
2021-07-27 12:38:48 -07:00
Philippe Tillet
3b36a1e60c
[CODEGEN] Fixed issue in traversal order for atomic_add and store_inst
2021-07-27 12:38:48 -07:00
Philippe Tillet
fdc8e8ef61
[TESTS] Fixed bug in how test arguments are enqueued
2021-07-27 12:38:48 -07:00
Philippe Tillet
083bbd1e8d
[GENERAL] Merged v1.0alpha into master. Added features are:
...
- A100 support via mma.16816
- Thread swizzling for conflict-free shared memory accesses without
padding
- Complete overhaul of the LLVM code generation in
codegen/selection/generator.cc to remove overengineering
- Added debugging capabilities in the Python binding
- Compilation error for kernels that spill
2021-07-27 12:38:48 -07:00
Philippe Tillet
c0bc7ed8b0
[PYTHON] Added TRITON_DEBUG_MODE which reallocates input tensors outside of the pytorch memory pool to spot out-of-bounds accesses more easily
2021-07-27 12:38:48 -07:00
Philippe Tillet
c4fceeea49
[LANG] Added hacky min/max
2021-07-27 12:38:48 -07:00
Philippe Tillet
d70f54fd6a
Merge pull request #45 from daadaada/master
...
[LANG] Add support for PREFIX_INC, PREFIX_DEC, POSTFIX_INC and POSTFIX_DEC
2021-07-27 12:38:48 -07:00
Philippe Tillet
547a99a5d4
[VERSION] 0.2.3 -> 0.3.0
2021-07-27 12:38:48 -07:00
Yan Da
27dc780871
[IR] Check constant_int type
2021-07-27 12:38:48 -07:00