* made C++ code compatible with Windows + MSVC
* added dlfcn-win32 for cross-platform dlopen
* fixed building and pip install on Windows
* fixed shared library file name under Windows
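
For illustration, the platform-dependent file name fix above could look roughly like the sketch below. The helper name `shared_library_name` and the naming scheme (`foo.dll` vs. `libfoo.so` / `libfoo.dylib`) are assumptions based on common platform conventions; the changelog does not say which scheme the project uses.

```python
import sys


def shared_library_name(stem: str, platform: str = sys.platform) -> str:
    """Return a conventional shared-library file name for `platform`.

    Hypothetical helper: the naming rules below are the usual platform
    conventions, not necessarily the ones this project applies.
    """
    if platform.startswith("win"):
        # Windows libraries normally carry no "lib" prefix and end in .dll
        return f"{stem}.dll"
    if platform == "darwin":
        return f"lib{stem}.dylib"
    # Linux and most other Unix-like systems
    return f"lib{stem}.so"
```

The `platform` argument defaults to `sys.platform`, so callers normally pass only the stem; passing it explicitly makes the helper easy to test on a single machine.
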
- Removed driver module -- accelerator runtime is handled by PyTorch
- Added basic support for ROCm based on @micmelesse's PR -- an empty kernel can now be executed on AMD devices without any compile-time changes
- PREFER_SHARED is now used only for kernels whose shared memory size exceeds 49k. Preferring shared memory for smaller kernels can cause poor L1 performance for broadcast tensors
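
A minimal sketch of the shared-memory threshold rule from the last item. The function name `should_prefer_shared` is hypothetical, and the default threshold of 49152 bytes (48 KiB) is an assumed reading of "49k"; the changelog does not give the exact value or say which cache configuration is used below the threshold.

```python
# Assumption: "49k" is read as 49152 bytes (48 KiB). The exact value used
# by the project is not stated in the changelog.
ASSUMED_THRESHOLD_BYTES = 49152


def should_prefer_shared(shared_mem_bytes: int,
                         threshold: int = ASSUMED_THRESHOLD_BYTES) -> bool:
    """Return True if a kernel should request the PREFER_SHARED cache config.

    Hypothetical helper: only kernels whose shared memory requirement is
    strictly greater than the threshold opt in, so smaller kernels keep
    more L1 cache available (e.g. for broadcast tensor loads).
    """
    return shared_mem_bytes > threshold
```

A caller would check this once per compiled kernel and leave the cache configuration untouched when it returns `False`.
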