Fix two problems in libdevice and external dispatch:
1. Use the static Triton types (e.g., `tl.int32`) instead of constructing new
ones; otherwise `tl.int32` and `tl.dtype('int32')` are not the same object
(see the illustration after the IR dumps below).
2. The name of an extern instruction should be left empty rather than set to
the instruction's symbol name; the TTIR generator assigns names automatically.
Otherwise, multiple instances of the same extern instruction all end up with
the same variable name.
Before the PR:
```bash
__nv_exp = extern_elementwise f64<1024> %11;
__nv_exp = extern_elementwise f64<1024> %11;
```
After the PR:
```bash
%12 = extern_elementwise f64<1024> %11;
%13 = extern_elementwise f64<1024> %11;
```
Optimal performance on quantized workloads currently requires adding a special-purpose instruction to the IR. Backward compatibility with this instruction is *NOT* guaranteed.
Without this patch, a debug build of Python complains:
```
Fatal Python error: Python memory allocator called without holding the GIL
Python runtime state: initialized
```
Based on the discussion in #700, this PR makes `setup.py` download pybind11
directly (no `git submodule`) instead of copy-pasting the pybind11 code into
the repo. The downloaded pybind11 lives in `~/.triton/pybind` (like `llvm`).
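A rough sketch of that caching scheme (the URL, version, and helper name are illustrative, not the actual `setup.py` code):
```python
import os
import tarfile
import urllib.request

def get_pybind11_dir():
    # cache pybind11 under ~/.triton/pybind, mirroring how llvm is handled
    cache = os.path.expanduser("~/.triton/pybind")
    dst = os.path.join(cache, "pybind11")
    if not os.path.exists(dst):
        os.makedirs(cache, exist_ok=True)
        url = "https://github.com/pybind/pybind11/archive/refs/tags/v2.10.0.tar.gz"
        archive = os.path.join(cache, "pybind11.tar.gz")
        urllib.request.urlretrieve(url, archive)
        with tarfile.open(archive) as tar:
            tar.extractall(cache)  # extracts to pybind11-<version>
        extracted = next(p for p in os.listdir(cache) if p.startswith("pybind11-"))
        os.rename(os.path.join(cache, extracted), dst)
    return dst
```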
I suspect this was the cause of the "new compiles even on a warm cache"
behavior I was seeing, though I haven't 100% confirmed it.
Python `set()` iteration order is nondeterministic across processes, so the
same arguments could produce different `instance_descriptor`s and cause
spurious cache misses.
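A quick illustration of the underlying nondeterminism (run it in two separate interpreters to see the order change; the names are made up):
```python
# String hashing is randomized per process (PYTHONHASHSEED), so iterating a
# set of argument names can yield a different order in each new interpreter,
# and any cache key derived from that order will differ between runs.
names = {"x_ptr", "y_ptr", "output_ptr", "n_elements"}
print(tuple(names))          # order may change from one process to the next
print(tuple(sorted(names)))  # sorting makes the key stable across processes
```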
This PR changes how the `pybind11` sources are managed: from a copy-pasted
snapshot to a git submodule.
See the discussion in #694 for details.
This revives #671, removing the static functions that may unnecessarily hold references to the grid and the JITFunction object.
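For illustration only (not the removed code): a module-level registry of closures is the kind of long-lived static reference that keeps a grid and a `JITFunction` alive indefinitely.
```python
# A global container of launcher closures pins `fn` and `grid` for the life
# of the process, so neither can ever be garbage-collected.
_launchers = []

def make_launcher(fn, grid):
    def launcher(*args, **kwargs):
        return fn[grid](*args, **kwargs)
    _launchers.append(launcher)  # unintended strong reference to fn and grid
    return launcher
```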
Co-authored-by: Jason Ansel <jansel@jansel.net>
@ngimel figured this one out.
The errors we were seeing from cudagraphs capture were coming from
`cuStreamGetCtx`, which is not allowed while a stream is capturing.
It appears the result of `cuStreamGetCtx()` isn't even used, so I
believe it can just be removed.
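For context, a minimal sketch of the kind of usage that exercises this path: capturing a Triton kernel launch with PyTorch's CUDA graph API (the kernel itself is illustrative).
```python
import torch
import triton
import triton.language as tl

@triton.jit
def copy_kernel(src, dst, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(dst + offs, tl.load(src + offs, mask=mask), mask=mask)

x = torch.randn(1024, device="cuda")
y = torch.empty_like(x)
grid = (triton.cdiv(1024, 256),)
copy_kernel[grid](x, y, 1024, BLOCK=256)  # warm up / compile outside capture

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):  # launching during capture used to hit cuStreamGetCtx
    copy_kernel[grid](x, y, 1024, BLOCK=256)
g.replay()
```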
The initial wording dates from a time when nobody knew Triton, and
comparing it to CUDA helped differentiate it from other existing DSLs.
But nowadays this comparison doesn't make much sense; Triton is its own
thing, and some people may even still be more productive in CUDA than in
Triton -- language preferences are subjective, after all.
Reverts openai/triton#671
This seems to have caused out-of-memory errors on some of our internal
workloads. I'm reverting it so that HEAD can be used in production at
OpenAI, and I will dig into the issue asynchronously.
This PR completely rewrites the Triton runtime to be leaner and to clearly
separate the compilation step from the just-in-time caching logic.
This should substantially reduce launch overhead.
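A rough sketch of that separation, with illustrative names rather than the actual runtime API:
```python
# Compilation is a pure function of the source and signature; the launcher
# only does a dictionary lookup, so the hot path never recompiles.
_binaries = {}

def compile_kernel(src: str, signature: tuple):
    # stand-in for the expensive step: lowering to IR and emitting a binary
    return ("binary", src, signature)

def launch(src: str, signature: tuple, grid, *args):
    key = (src, signature)
    binary = _binaries.get(key)
    if binary is None:                     # cold path: compile once and cache
        binary = _binaries[key] = compile_kernel(src, signature)
    # hot path: enqueue `binary` over `grid` with `args` (omitted here)
    return binary
```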
Redo of #651 against master. Fixes #525 by catching the CUDA error raised
when we check PyTorch tensor sizes and rethrowing a more informative error
that explains why we failed.
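The pattern, sketched in Python for illustration (the helper and the failing call are hypothetical; the real change lives in the runtime):
```python
import torch

def describe_kernel_arg(t: torch.Tensor):
    # Stand-in for the runtime's argument inspection: if the underlying CUDA
    # query fails, rethrow with context on what we were doing and on which
    # tensor, instead of surfacing a bare CUDA error.
    try:
        return (t.data_ptr(), tuple(t.shape))
    except RuntimeError as e:
        raise RuntimeError(
            f"CUDA error while inspecting a kernel argument on {t.device}: {e}"
        ) from e
```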