Bug: a valid "ret" value is overwritten when a later "ptxas --version" run fails.
Fix: keep the "ret" values only from runs that succeed, and pick the first one.
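
A minimal sketch of the fix described above, not the actual patch. The helper names `first_successful_version` and `_run_version`, and the idea of passing a list of candidate ptxas paths, are assumptions made for illustration.

```python
import subprocess


def _run_version(path):
    # Hypothetical helper: run "<path> --version" and return its text output.
    # Raises OSError if the binary is missing and CalledProcessError on a
    # non-zero exit status.
    return subprocess.check_output(
        [path, "--version"], stderr=subprocess.STDOUT, text=True
    )


def first_successful_version(candidates, run=_run_version):
    # Collect results only from runs that succeed, so a failing run can
    # never overwrite an earlier valid result.
    rets = []
    for path in candidates:
        try:
            rets.append(run(path))
        except (OSError, subprocess.CalledProcessError):
            continue
    # Pick the first successful result, or None if every run failed.
    return rets[0] if rets else None
```

The `run` parameter exists only so the selection logic can be exercised without a real ptxas binary.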
The current approach is probably not thread-safe: init is shared between threads, and some threads may not call the LLVMInitialize* functions.
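
One common way to make shared one-time initialization safe is to guard it with a lock, sketched here in Python for brevity. This illustrates the general pattern only and is not the change made in this code base; in C++ the same role is usually played by `std::call_once`. The names `ensure_initialized` and `initialize` are hypothetical.

```python
import threading

_init_lock = threading.Lock()
_initialized = False


def ensure_initialized(initialize):
    # Every thread calls this before using the shared state. The lock
    # guarantees that `initialize` runs exactly once and that no thread
    # proceeds before initialization has finished.
    global _initialized
    with _init_lock:
        if not _initialized:
            initialize()
            _initialized = True
```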
* made C++ code compatible with Windows + MSVC
* added dlfcn-win32 for cross-platform dlopen
* fixed building and pip install on Windows
* fixed shared library file name under Windows
- Removed driver module -- accelerator runtime is handled by pytorch
- Added basic support for ROCm based on @micmelesse's PR -- an empty kernel can now be executed on AMD devices without any compile-time changes
- Now using PREFER_SHARED for kernels only when the size of shared memory is greater than 49k; otherwise, L1 performance can be poor for broadcast tensors
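
The shared-memory rule in the last item can be sketched as a small decision function. This is an illustration only: the function name is hypothetical, and reading "49k" as 49152 bytes (48 KiB) is an assumption.

```python
from typing import Optional

# Assumption: "49k" in the note above means 49152 bytes (48 KiB).
SHARED_THRESHOLD_BYTES = 49152


def cache_preference(shared_bytes: int) -> Optional[str]:
    # Request PREFER_SHARED only for kernels that need more shared memory
    # than the threshold. Returning None means "leave the default cache
    # configuration alone", which avoids shrinking L1 for kernels that
    # read broadcast tensors.
    if shared_bytes > SHARED_THRESHOLD_BYTES:
        return "PREFER_SHARED"
    return None
```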