triton

Author	SHA1	Message	Date
Chenggang Zhao	f16138d447	[Frontend] Interface fixes for libdevice (#830 ) - Unifying several interfaces with different types into a single one, e.g. `fsub_ru` and `dsub_ru` -> `sub_ru`; - Minor bug fix: `fast_pow` was incorrectly classified into the `pow` interface; its arguments are the same as those of `powf`; - Explicit interfaces for casting functions, e.g. decoupling `ll2float_ru` into `ll2float_ru` and `ull2float_ru`; - Removing interfaces that are not in NVIDIA's official documents, e.g. `fmaf_ieee_rn`, which is easily confused with `fmaf_rn`. Note that this PR for the master branch is different from #829, which is for the MLIR branch.	2022-11-01 10:51:58 -07:00
Mark Saroufim	578ada7740	[DOCS] Add install from source instructions to README (#821 )	2022-10-31 11:08:18 -07:00
Phil Tillet	6311d70406	Revert "[BUILD] Now using cibuildwheel default" This reverts commit `584086f08c`.	2022-10-29 17:15:47 -07:00
Phil Tillet	584086f08c	[BUILD] Now using cibuildwheel default	2022-10-29 16:59:06 -07:00
Keren Zhou	3ca667dfa8	[Frontend] Return a scalar if all input args are scalar (#816 )	2022-10-28 23:27:06 -07:00
Yanbo Liang	5ca1ed0101	Add bf16/fp16/fp64 support for ty_to_cpp (#800 ) In `torch._inductor`, we [convert 0d CPU tensor to scalar during triton codegen](https://github.com/pytorch/pytorch/pull/87329), so we need to add the missing Triton support for bf16/fp16/fp64.	2022-10-24 19:41:25 -07:00
Keren Zhou	db3aa1d1fb	[FRONTEND] Fix libdevice (#776 ) Fix two problems in libdevice and external dispatch: 1. Use static triton types (e.g., tl.int32) instead of creating new types. Otherwise, `tl.int32` and `tl.dtype('int32')` are not the same thing. 2. The name of an extern inst should be empty but not the symbol name of the inst. TTIR generator will assign names automatically. Otherwise, we have the same variable name when there are multiple same extern insts. Before the PR: ```bash __nv_exp = extern_elementwise f64<1024> %11; __nv_exp = extern_elementwise f64<1024> %11; ``` After the PR: ```bash %12 = extern_elementwise f64<1024> %11; %13 = extern_elementwise f64<1024> %11; ```	2022-10-13 17:18:16 -07:00
Twizzes	ddae106c0e	[DOCS] Update installation.rst to fix windows build error (#747 )	2022-10-13 13:27:15 -07:00
Keren Zhou	bc98aead33	[Backend] Fix for mov.u8 (#766 ) Initial potential fix for `mov.u8`, which is not supported by PTX for now. Use `mov.u16` instead and cast the result to u8.	2022-10-12 14:32:27 -07:00
Yu Guo	71b46acc42	[IR] Added special-purpose `dequantize` instruction (#759 ) It is currently necessary for optimal performance in quantized workloads to add a special-purpose instruction in the IR. Backward compatibility with this instruction is NOT guaranteed.	2022-10-12 14:14:45 -07:00
Philippe Tillet	33e6f0df7f	[DRIVER] Bumped CUDA requirement to 11.4+. This is to avoid bad performance surprises as older `ptxas` are much slower. (#769 ) This also makes codegen simpler by avoiding special handling of eviction policies	2022-10-12 12:02:30 -07:00
Philippe Tillet	af76c989eb	[RUNTIME] Make entry point cache key depend on triton version hash (#765 )	2022-10-11 13:24:30 -07:00
Bin Bao	09cc2d454b	[FRONTEND] Fix a bool tensor storing problem (#746 )	2022-10-10 12:11:50 -07:00
Felipe Petroski Such	5d4b26d380	[RUNTIME] support multiple devices in the same process (#757 )	2022-10-09 20:30:04 -07:00
Chris	9a11a567ce	[DOCS] Fixed typos in 01-vector-add.py (#751 )	2022-10-09 18:12:46 -07:00
Keren Zhou	11345e9b74	[RUNTIME] Add callback functions for external tools (#738 )	2022-10-05 14:46:55 -07:00
Philippe Tillet	bdfdb9a1d2	[RUNTIME] Fixed JIT bug that led some constexpr values to be overridden by specialization parameters (#742 )	2022-10-05 11:00:32 -07:00
shenggan	77c752dc78	[RUNTIME] remove fixed cu_include_dir (#739 ) Use environment variable `CUDA_HOME` with default value `/usr/local/cuda` for `cu_include_dir` #731	2022-10-04 19:49:57 -07:00
Natalia Gimelshein	d3c925db8a	[FRONTEND] properly broadcast scalar where condition (#736 )	2022-10-04 12:44:03 -07:00
fdrocha	2b0f877fad	[RUNTIME] Support environments with multiple cudalibs (#733 )	2022-10-03 18:36:24 +00:00
Keren Zhou	4a2d3b7d79	[RUNTIME] Dump llvm, ttir, and sass to help debugging (#732 )	2022-10-03 00:39:52 +00:00
Natalia Gimelshein	f55960e773	[FRONTEND] fix broadcasting for where (#729 ) Fixes #532, all 3 inputs to where have to be broadcast together.	2022-10-01 13:18:47 -07:00
Phil Tillet	b244db06da	[TUTORIALS] Attention tutorial fixup	2022-09-30 19:31:43 -07:00
Shintaro Iwasaki	7b61303ea1	[CODEGEN] Fix extract_N_bufferable in layout analysis (#728 )	2022-09-30 12:21:22 -07:00
Shintaro Iwasaki	ae59f51c2d	[CODEGEN] Fix an inliner to call a function with a phi-node (#727 )	2022-09-29 21:36:40 -07:00
albanD	f45e31ba7c	[FRONTEND] Make sure to hold the gil when creating python objects (#726 ) Without this patch, a debug version of python complains that: ``` Fatal Python error: Python memory allocator called without holding the GIL Python runtime state: initialized ```	2022-09-29 18:06:22 -07:00
Philippe Tillet	dad97528b2	[TESTING] allclose fixup (#724 )	2022-09-28 22:49:05 +00:00
Jason Ansel	998fd5f9af	[FRONTEND] Make triton.compile work without a cuda context (#708 ) This allows compiling in a subprocess. I'm not seeing a ton of speedup from this, but figure it is a good change anyway.	2022-09-24 13:41:47 -07:00
Shintaro Iwasaki	3ac929b48b	[BUILD] Download pybind11 in setup.py (#703 ) Based on the discussion in #700, this PR enables downloading pybind11 in `setup.py` without `git submodule` instead of copy-pasting pybind11 code. The downloaded pybind11 will be in `~/.triton/pybind` (like `llvm`).	2022-09-23 15:54:07 -07:00
Jason Ansel	579c03615d	[FRONTEND] Reduce number of compiles in JITFunction (#704 ) I suspect this was the cause of the "new compiles even on a warm cache" behavior I was seeing, though haven't 100% confirmed it. Python `set()` iteration order is nondeterministic when you create a new process. So the same args could produce different `instance_descriptor`s and have false cache misses.	2022-09-23 21:44:52 +00:00
Philippe Tillet	25e1b36785	Revert "[pybind11] Use git-submodule for pybind11" (#701 ) Reverts openai/triton#699	2022-09-23 12:25:38 -07:00
Shintaro Iwasaki	61d104ab3a	[FRONTEND] Use git-submodule for pybind11 (#699 ) This PR changes the `pybind11` source code management from copy-paste to a package controlled by git-submodule. See the discussion in #694 for details.	2022-09-23 09:55:03 -07:00
Philippe Tillet	8c3d4d5749	[RUNTIME] now decoupling entry point from cubin (#696 )	2022-09-22 16:44:22 -07:00
Shintaro Iwasaki	df67068bb0	[pybind11] Update pybind11 to 2.10.0 (#691 ) This PR updates the version of pybind11 to 2.10.0 (the latest stable).	2022-09-21 20:18:02 -07:00
Philippe Tillet	677ddae618	[FRONTEND] Add warmup for triton.jit() (#684 ) This revives #671, removing the static functions that may unnecessarily hold a reference to the grid and the JITFunction object. Co-authored-by: Jason Ansel <jansel@jansel.net>	2022-09-21 19:13:20 +00:00
Jason Ansel	6abe813d1c	Fix issue breaking cudagraphs (#685 ) @ngimel figured this one out. The errors we were seeing from cudagraphs capture were coming from `cuStreamGetCtx` which is not allowed while a stream is capturing. It appears the result of `cuStreamGetCtx()` isn't even used, so I believe it can just be removed.	2022-09-21 10:20:48 -07:00
Philippe Tillet	e318185eb4	[DOCS] Improved README.md wording (#683 ) Initial wording dates from a time where nobody knew Triton, and comparing it to CUDA helped differentiate it from other existing DSLs. But nowadays this comparison doesn't make much sense; Triton is its own thing, and some people may even still be more productive in CUDA than Triton -- language preferences are subjective after all.	2022-09-20 18:09:43 -07:00
Philippe Tillet	7dc2a70edb	Revert "Add .warmup() for triton.jit()" (#682 ) Reverts openai/triton#671 It seems like for some reason this caused out-of-memory errors on some of our internal workloads. I'm reverting this so that HEAD can be used in production at OpenAI, and I will work on digging into this issue asynchronously.	2022-09-20 16:05:14 -07:00
Philippe Tillet	48f30550f1	[FRONTEND] Now using raw compiler syscalls when possible (#678 )	2022-09-19 21:01:36 -07:00
Jason Ansel	93b1adc53b	[FRONTEND] Add .warmup() for triton.jit() (#671 )	2022-09-18 23:09:34 -07:00
Phil Tillet	82956e5d6b	[PACKAGING] Added missing package	2022-09-18 17:34:05 -07:00
Philippe Tillet	2baf333d44	[DOCS] Fixed typos (#670 )	2022-09-18 17:13:12 -07:00
Jason Ansel	49f6bc3f2b	[FRONTEND] Fix filename too long error in new runtime (#669 )	2022-09-18 21:26:29 +00:00
Phil Tillet	00f4ef6958	[CI] wheel/docs workflows now only run on V100 machine	2022-09-18 13:28:35 -07:00
Jason Ansel	e647402fd3	Fix warning in generated C code (#667 )	2022-09-18 12:57:32 -07:00
Philippe Tillet	4a77dfb042	[FRONTEND] Complete rewrite of the runtime (#644 ) This PR completely rewrites the runtime of Triton to be more lean and clearly separate the compilation step from the just-in-time caching logic. This should substantially reduce launch overhead.	2022-09-18 08:51:48 -07:00
Ian Bearman	889d9e34a1	[REPO] update gitignore (#666 ) Update `.gitignore` to include `.vs` and `.vscode`	2022-09-17 14:25:28 -07:00
Shintaro Iwasaki	c668d6596e	[DOCS] Fix spelling (#664 ) This PR applies minor spelling fix in comments and string literals to `master`. It shouldn't hurt anything.	2022-09-16 12:26:40 -07:00
Sophia Wisdom	4580a04710	[FRONTEND] Improve error message for CPU tensors (#654 ) Redo of #651 against master. Fixes #525 by catching CUDA error when we check pytorch tensor size and rethrowing a more informative error that says why we failed.	2022-09-14 14:26:42 -07:00
Philippe Tillet	cfbbc7b43a	[CI] Added V100 tag to disambiguate self-hosted runners (#653 )	2022-09-14 13:47:50 -07:00
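
Note on 579c03615d: the false cache misses described in that commit message can be illustrated with a minimal sketch. This is not Triton's actual code. The helper `stable_cache_key` and its parameters are hypothetical, and the sketch assumes the key is built from sets of strings, whose iteration order in CPython can differ between processes because of hash randomization. Sorting the set before hashing makes the key independent of iteration order.

```python
import hashlib


def stable_cache_key(kernel_name, specializations):
    """Hypothetical helper: build a cache key that does not depend on
    the iteration order of the `specializations` set."""
    # sorted() yields the same sequence no matter how the set was built
    # or which hash seed the current process uses.
    ordered = sorted(specializations)
    payload = kernel_name + "|" + ",".join(ordered)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two sets with the same members, constructed in different orders.
key_a = stable_cache_key("add_kernel", {"arg0:div16", "arg2:eq1", "arg1:div16"})
key_b = stable_cache_key("add_kernel", {"arg2:eq1", "arg1:div16", "arg0:div16"})
print(key_a == key_b)
```

Joining the members in raw iteration order would have been the fragile variant: the same arguments could then hash to different keys in different processes, which shows up as a recompile on a warm cache.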

576 Commits