- fix apple mps compile problem
- fix cuda 11.4 compile problem
- better cuda library finding for windows
- fix cuda 11.4 compile problem
- fix cuda 11.4 compile problem
- fix cuda 11 compile problem
- fix missing windows cpu prebuilt
- fix linux compile problem
- fix windows dll load problem and msvc compile problem
- fix missing linux cpu prebuilt
- fix some cuda compile problems
- fix metal compile problem
- fix ci
- fix ci
- drop python 3.8, add python 3.13.
- change cuda prebuilt to 11.4, 11.8, 12.1, 12.4, 12.6.
- change all linux prebuilt to `manylinux_2_28`, drop manylinux2014 support.
- add fast obb-grid overlap for 3dgs.
- add sympy helper functions
- fix bug in inline kernels on mac
- fix mac os bug
- debug macos ci
- add cuda-like kernels on apple silicon; only RTC compile is supported.
- BREAKING CHANGE: rename macro `TV_CUDA` to `TV_ENABLE_HARDWARE_ACC`
- BREAKING CHANGE: change managed tensor behavior; you must now specify the cpu device when you use the managed flag.
- fix compile problem in cuda 12.x
- add `run_in_process` support for inliner to debug some unrecoverable cuda errors, such as invalid memory access (700), without restarting the whole process. This option copies all tensor data to cpu, copies it to a child process (spawn mode), runs in the child process, and copies the results back to cpu in the main process. This slows down execution, but it is very useful for debugging.
- add macro `TV_ASSERT_WITH_PRINT` to perform print in assert.
- change inliner function name with user-provided name for debug.
- fix a small bug in `mp_helper.h`
- Add std flag to NVRTCInlinerBuilder
- add `get_nvrtc_kernel_attrs` to NVRTCInlinerBuilder
- add prompt for inliner, use `python -m cumm.inliner.cuda` or `python -m cumm.inliner.cpu` (clang must be installed)
- add rich message print support for nvrtc compile powered by awesome `rich` library. (don't support llvm)
- change nvrtc tuple impl to support std::tie
- change supported cuda version, remove cuda 10.2 and 11.6, add cuda 12.1 and 12.2
- remove python 3.7, add python 3.12.
- fix a small bug when using c++17 in nvrtc
- add simple perf tools
- fix a bug when compiling code with arch < sm_75
- add tv::TensorView capture support in nvrtc inliner
- add better error support for cumm nvrtc
- fix a bug in CummNVRTCModule: flag order must be preserved
- fix a small bug in tv::Tensor::empty.
- fix a small bug in nvrtc tuple.
- fix a small bug in nvrtc
- fix a compile problem in msvc
- fix unsupported arch in cuda 12.0
- fix compile problem
- fix some compile problems in cpu-only builds
- change version to rebuild due to pypi server problem
- Add cuda 12.0
- Add int8 inference for sparse conv
- Fix some problems in cuda 12.0
- Fix bug in ConvProblem introduced in 0.3.6
- Add int64 support for TensorGeneric
- Add flags for H100 and RTX 4090
- fix nvrtc launch problem when smem size is large
- fix nvrtc constant variable parse problem
- Change gemm/conv main function to split version
- Fix problem in CompileInfo
- Change nlohmann json to 3.11.2
- Fix build problem in cuda 10.2
- Fix some bugs related to nvrtc
- Fix cpu build problem
- Add Ampere support: faster fp16, faster tf32 and much faster int8 kernels on Ampere GPUs.
- Add nvrtc support for conv kernel.
- drop python 3.6 support.
- BREAKING CHANGE: change dtype enum value for some important reason.
- Fix missing sm37 in supported arch
- add sm37 for cu102.
- add compile info (cuda arch) for better error information.
- Fix a small bug that incorrectly limited arch of simt to sm52.
- add cpu support for CUDAKernelTimer.
- add non-contiguous support for tv::Tensor.
- add tsl hash map, refine cuda hash impl.
- raise an error instead of exiting the program when a cuda error occurs.
- gemm kernel now uses stride, which enables gemm with non-contiguous tensors
- Fix bugs in gemm kernel when using non-contiguous operands.
- Fix bugs for implicit gemm
- add support for python 3.6, but cudasim doesn't support python 3.6.
- add profile tool for all gemm and conv kernels.
- Fix some bugs in implicit gemm
- add implicit gemm algorithm for all kinds of convolution with kernel volume <= 32. This algorithm is very fast with float16.
- add cuda 11.3 build
- remove python 3.6 support