Releases: xiaoyeli/superlu_dist
v9.1.0
v9.1.0 Release Note
This includes the following updates:
- Improved batched interface to solve many independent systems at the same time.
  Internally it uses C++ templates to support multiple datatypes, e.g., complex.
  Please cite this IJHPCA paper when you use the batched functions.
- "SolveOnly" interface: you can input your own LU (or ILU) factored matrices,
  but use our parallel, multi-GPU capable sparse triangular solve routine.
  This is achieved by setting: options->SolveOnly = YES;
  The user still inputs matrix A. Internally, we will treat the lower triangle
  of A as the L factor, and the upper triangle (including the diagonal) of A as the U factor.
  See the example program EXAMPLE/pddrive3d.c
- Python interface; currently it supports only double precision.
  See PYTHON/README
- Fix memory leaks in the 3D multi-GPU routines in SRC/CplusplusFactor/
What's Changed
- Fix the sizeof and add casting to trf3d partition structs by @abagusetty in #162
- Fix memory error when using parallel symbolic factorization (ParMETIS) by @sebastiangrimberg in #164
- Avoid cuda device compiling step when linking against the library. by @eromero-vlc in #170
New Contributors
- @abagusetty made their first contribution in #162
- @sebastiangrimberg made their first contribution in #164
- @eromero-vlc made their first contribution in #170
Full Changelog: v9.0.0...v9.1.0
v9.0.0 release
v9.0.0 Release Note
An example program is EXAMPLE/pddrive3d.c, which calls the driver routine SRC/double/pdgssvx3d.c (or pdgssvx3d_csc_batch.c).
Please cite this ACM TOMS paper when you use these new features.
OpenMP performance hit:
On many systems, the default OMP_NUM_THREADS is set to the total number of CPU cores on a node. For example, it is set to 128 on Perlmutter at NERSC. This is too high, because most of the algorithms are not efficient in pure threading mode. We recommend that users experiment with mixed MPI and OpenMP mode, starting with a smaller thread count, by setting:
export OMP_NUM_THREADS=1 (or 2, or 3, ...)
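One way to act on this advice is to list the MPI-rank/OpenMP-thread splits of a node before launching anything. The sketch below is illustrative only: the helper name `thread_splits` and the candidate thread counts (1, 2, 4, 8) are assumptions, and the 128-core figure is the Perlmutter example from the note above. It prints candidate settings and does not launch a job.

```shell
#!/bin/sh
# Hypothetical helper (not part of SuperLU_DIST): list MPI-rank/OpenMP-thread
# splits that exactly fill a node with the given number of cores.
thread_splits() {
  cores=$1
  for threads in 1 2 4 8; do
    # Skip thread counts that do not divide the core count evenly.
    if [ $((cores % threads)) -eq 0 ]; then
      echo "OMP_NUM_THREADS=$threads ranks_per_node=$((cores / threads))"
    fi
  done
}

# Candidate splits for a 128-core node.
thread_splits 128
```

Each printed line is one configuration to time with your own job launcher; which split is fastest depends on the matrix and the machine.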
The new features include the following:
- LU factorization: diagonal factorization, panel factorization, and Schur-complement update
  can all be offloaded to GPU
  Environment variables:
  - export SUPERLU_ACC_OFFLOAD=1 (default setting: enable GPU)
  - export GPU3DVERSION=1 (default setting; use code in CplusplusFactor/ for all offload)
  - export GPU3DVERSION=0 (only Schur-complement updates are offloaded)
- Triangular solve: new 3D communication-avoiding code
  Environment variable:
  - export SUPERLU_ACC_SOLVE=0 (default setting; only on CPU)
  - export SUPERLU_ACC_SOLVE=1 (offload to GPU)
  NOTE: when using multiple NVIDIA GPUs per 2D grid for GPU triangular solve, we use NVSHMEM for fast
  inter-GPU communication. You need to configure NVSHMEM properly.
  For example, on Perlmutter at NERSC, we need the following setup:
  module load nvshmem/2.11.0
  export NVSHMEM_HOME=/global/common/software/nersc9/nvshmem/2.11.0
  export NVSHMEM_USE_GDRCOPY=1
  export NVSHMEM_MPI_SUPPORT=1
  export MPI_HOME=${MPICH_DIR}
  export NVSHMEM_LIBFABRIC_SUPPORT=1
  export LIBFABRIC_HOME=/opt/cray/libfabric/1.15.2.0
  export LD_LIBRARY_PATH=$NVSHMEM_HOME/lib:$LD_LIBRARY_PATH
  export NVSHMEM_DISABLE_CUDA_VMM=1
  export FI_CXI_OPTIMIZED_MRS=false
  export NVSHMEM_BOOTSTRAP_TWO_STAGE=1
  export NVSHMEM_BOOTSTRAP=MPI
  export NVSHMEM_REMOTE_TRANSPORT=libfabric
- Batched interface to solve many independent systems at the same time
  Driver routine: p[d,s,z]gssvx3d_csc_batch.c
  Example program: p[d,s,z]drive3d.c [ -b batchCount ]
- Julia interface
  https://github.com/JuliaSparse/SuperLUDIST.jl
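The GPU-related environment variables in this list can be collected into a small helper for a launch script. This is a sketch under assumptions: the function name `superlu_gpu_env` and its mode names (`cpu`, `gpu-factor`, `gpu-all`) are hypothetical; only the variables and their values come from these notes, with `SUPERLU_ACC_OFFLOAD=0` being the runtime switch described in the v8.0.0 notes.

```shell
#!/bin/sh
# Hypothetical helper: export the SuperLU_DIST GPU variables for one of three
# illustrative modes. The mode names are made up for this sketch.
superlu_gpu_env() {
  case "$1" in
    cpu)        # no GPU offload for factorization or solve
      export SUPERLU_ACC_OFFLOAD=0
      export SUPERLU_ACC_SOLVE=0 ;;
    gpu-factor) # factorization on GPU (the documented defaults), solve on CPU
      export SUPERLU_ACC_OFFLOAD=1
      export GPU3DVERSION=1
      export SUPERLU_ACC_SOLVE=0 ;;
    gpu-all)    # factorization and triangular solve both on GPU
      export SUPERLU_ACC_OFFLOAD=1
      export GPU3DVERSION=1
      export SUPERLU_ACC_SOLVE=1 ;;
    *)
      echo "usage: superlu_gpu_env cpu|gpu-factor|gpu-all" >&2
      return 1 ;;
  esac
}

superlu_gpu_env gpu-all
echo "offload=$SUPERLU_ACC_OFFLOAD solve=$SUPERLU_ACC_SOLVE"
```

With more than one NVIDIA GPU per 2D grid, the `gpu-all` mode also needs the NVSHMEM setup described in the triangular-solve note.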
Dependencies: the following shows what needs to be defined in the CMake build script
- Highly recommended:
  - BLAS:
    -DTPL_ENABLE_INTERNAL_BLASLIB=OFF
    -DTPL_BLAS_LIBRARIES="path to your BLAS library file"
  - ParMETIS:
    -DTPL_ENABLE_PARMETISLIB=ON
    -DTPL_PARMETIS_INCLUDE_DIRS="path to metis and parmetis header files"
    -DTPL_PARMETIS_LIBRARIES="path to metis and parmetis library files"
- If you use GPU triangular solve, you need the following:
  - LAPACK:
    -DTPL_ENABLE_LAPACKLIB=ON
    -DTPL_LAPACK_LIBRARIES="path to lapack library file"
  - NVSHMEM (needed when using multiple GPUs):
    -DTPL_ENABLE_NVSHMEM=ON
    -DTPL_NVSHMEM_LIBRARIES="path to nvshmem files"
- If you use the batched interface, you need MAGMA:
  -DTPL_ENABLE_MAGMALIB=ON
  -DTPL_MAGMA_INCLUDE_DIRS="path to magma header files"
  -DTPL_MAGMA_LIBRARIES="path to magma library file"
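Put together, a configure step with the recommended BLAS and ParMETIS definitions might look like the sketch below. Several things here are assumptions: every path and library file name is a placeholder, the function name `superlu_cmake_args` is hypothetical, and the ParMETIS on/off switch is written as `-DTPL_ENABLE_PARMETISLIB=ON` and the BLAS path as `-DTPL_BLAS_LIBRARIES`, following the naming pattern of the other options; check both against the README of your version. The script only prints the command line; it does not run CMake.

```shell
#!/bin/sh
# Placeholder install locations (assumptions; override via the environment).
BLAS_LIB="${BLAS_LIB:-/path/to/libblas.so}"
PARMETIS_ROOT="${PARMETIS_ROOT:-/path/to/parmetis}"

# Hypothetical helper: print one -D definition per line.
superlu_cmake_args() {
  echo "-DTPL_ENABLE_INTERNAL_BLASLIB=OFF"
  echo "-DTPL_BLAS_LIBRARIES=$BLAS_LIB"
  echo "-DTPL_ENABLE_PARMETISLIB=ON"
  echo "-DTPL_PARMETIS_INCLUDE_DIRS=$PARMETIS_ROOT/include"
  # CMake list: entries are separated by ';' (quote it on a real command line).
  echo "-DTPL_PARMETIS_LIBRARIES=$PARMETIS_ROOT/lib/libparmetis.so;$PARMETIS_ROOT/lib/libmetis.so"
}

# Show the command that would be run from a build directory.
echo "cmake .. $(superlu_cmake_args | tr '\n' ' ')"
```

The LAPACK, NVSHMEM, and MAGMA definitions listed above would be appended in the same way when GPU triangular solve or the batched interface is needed.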
Full Changelog: v8.2.1...v9.0.0
v8.2.1
A patch release to correct version string, now 8.2.1.
Full Changelog: v8.2.0...v8.2.1
v8.2.0
- more accurate memory counting for parallel symbolic and distribution routines
- improved NERSC Perlmutter build and run scripts
- added OLCF Frontier build and run scripts
- updated superlu_enum_consts.h, compatible with serial SuperLU
- added routine PStatClear()
- fixes for taskloop in triangular solve
- CMake: add target_compile_features() to specify C standard lower bound
- a number of bug fixes
Update: version strings in several files.
What's Changed
- Remove unused ptr and associated free by @jeanlucf22 in #131
- Fix -Wstrict-prototypes by @prj- in #139
- -Wundef by @prj- in #149
- Last necessary fix for -Wundef by @prj- in #151
New Contributors
- @jeanlucf22 made their first contribution in #131
Full Changelog: v8.1.2...v8.2.0
v8.1.2
- In SRC/
  ** add an env variable COMM_TREE_MPI_WAIT in comm_tree.c
  ** replace a taskloop by parallel for in pxgstrs_lsum.c
- In EXAMPLE/
  ** drivers: only initialise cublas if GPU offloading is
  enabled at runtime (James Trott)
  ** global interface drivers, P0 generates random Xtrue and RHS
- Support 64-bit indexing for input matrix A
version 8.1.1
- bug fix for CPU trisolve:
  ** fix omp taskloop bug for certain Intel compilers
  ** change mpi_test to mpi_wait in broadcast tree
- fixing error related to MPI communicator reordering
- correct memory allocation for GEMM buffer
- disable internal copy of COLAMD code, link with external library
- add single precision HWPM option
- fixes for Fortran parallel build and CMakeLists.txt
- add automatic CUDA architecture detection in cmake
Full Changelog: v8.1.0...v8.1.1
version 8.1.0
- Improved GPU U-solve performance
  ** A compile-time CPP flag "-DGPU_SOLVE" is needed to use this function
  ** Currently GPU trisolve works with 1 MPI rank
- Updated FORTRAN/CMakeLists.txt:
  ** parallel build
  ** use Fortran linker
  ** allow disabling the FORTRAN/ build when not needed
- Added single precision interface to HWPM pivoting code
- Temporary bug work-around for GPU trisolve
- Updated a number of scripts in example_scripts/
Full Changelog: v8.0.0...v8.1.0
v8.0.0
- Include support for AMD GPUs with HIP programming.
- Allow runtime SUPERLU_ACC_OFFLOAD = 0 to disable GPU offload for both
  2D and 3D codes.
- Include mixed-precision routines: 'psdrive' (single working precision)
  can take double-precision iterative refinement as an option.
- Add the fields in 'options' input structure, corresponding to the
  parameters that are changeable by environment variables.
- Add GPU stats variables in SuperLUStat_t{}, print the same way as CPU.