Releases: openucx/ucx
v1.11.0-rc4
1.11.0 RC4 (July 19, 2021)
Features:
Core
- Added support for UCX monitoring using virtual file system (VFS)/FUSE
- Added support for applications with static CUDA runtime linking
- Added support for a configuration file
- Updated clang format configuration
UCP
- Added rendezvous API for active messages
- Added user-defined name to context, worker, and endpoint objects
- Added flag to silence request leak check
- Added API for endpoint performance evaluation
- Added API - ucp_request_query
- Added API - ucp_lib_query
- Ported connection manager to a new UCT API
- Added bandwidth optimizations for new protocols multi-lane
- Added support for multi-rail over lanes with BW ratio >= 1/4
- Added support for tracking outstanding requests and aborting those in case of connection failure
- Refactored keep-alive protocol
- Added device id to wireup protocol
- Added support for up to 128 transport layer resources in UCP context
- Added support for CUDA memory allocations with ucp_mem_map
- Increased UCP_WORKER_MAX_EP_CONFIG to 64
- Adjusted memory type zcopy threshold when UCX_ZCOPY_THRESH set
- Refactored wireup protocols, rendezvous, get, zcopy protocols
- Added put zcopy multi-rail
- Improved logging for new protocols
- Added system topology information
- Added new protocols for eager offload protocols
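As an illustration of the multi-rail item above, one plausible reading of "BW ratio >= 1/4" is that a lane takes part in multi-rail only if its bandwidth is at least one quarter of the fastest lane's bandwidth. The Python sketch below shows that rule under this assumption. The function `select_lanes` and the sample bandwidth values are hypothetical and are not part of UCX; the actual lane selection logic in UCP may differ.

```python
from fractions import Fraction


def select_lanes(bandwidths, min_ratio=Fraction(1, 4)):
    """Return indices of lanes whose bandwidth is at least
    min_ratio of the fastest lane (hypothetical helper, not a UCX API)."""
    if not bandwidths:
        return []
    fastest = max(bandwidths)
    # Keep a lane only if it is not slower than min_ratio * fastest.
    return [i for i, bw in enumerate(bandwidths) if bw >= fastest * min_ratio]


# Assumed example: four lanes with bandwidths in Gb/s.
lanes_gbps = [100, 50, 25, 10]
print(select_lanes(lanes_gbps))
```

With these sample values the 10 Gb/s lane falls below one quarter of the fastest lane and is left out, while the 25 Gb/s lane sits exactly on the threshold and is kept.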
UCT
- Extended connection establishment API
- Added active message (AM) alignment in iface params
- Added active message short IOV API.
- Added support for interface query by operation and memory type
- Added API to get allocation base address and length
- Added md_dereg_v2 API
UCS
- Added log filter by source file name.
- Added checking for last element in fraglist queue
- Added a method to get IP address from sockaddr.
- Added memory usage limits to registration cache
UCM
- Improved x86 parser to recognize some mov flavors
CUDA
- Added registration for whole CUDA allocations
- Added CUDA-IPC keepalive
- Adjusted performance estimations
- Improved logging
- Added allocation methods for CUDA pinned/managed memory
- Added support for a global cuda_ipc cache
RDMA CORE (IB, ROCE, etc.)
- Added report of QP info in case of completion with error
- Refactored FC send operations
- Added support for DevX unique QPN allocation
- Optimized endpoint lookup for DCI
- Added support for RDMA sub-function (SF)
- Added support for DCI via DEVX
- Added DCI pool per LAG port
- Added support for RoCE IP reachability check using a subnet mask
- Added active message short IOV for UD/DC/RC mlx, UD/RC verbs
- Added endpoint keep alive check for UD
- Suppressed warning if device can't be opened
- Added support for multiple flush cancel without completion
- Added ignore for devices with invalid GID
- Added support for SRQ linked list reordering
- Added flush by flow control on old devices
- Added support for configurable rdma_resolve_addr/route timeout
Shared memory
- Added active message short IOV support for posix, sysv, and self transports
TCP
- Added support for peer failure in case of CONNECT_TO_EP
- Added support for active message short IOV
Java
- Added full support for UCP Java API
Tests
- Added length/mem_type for UCP client server example
- Added port sockaddr tests for a new API
- Added test send-recv between client/server with diff UCX_IB_NUM_PATHS
- Added support for CUDA and CUDA managed memory in io_demo
- Added support for a custom watchdog timeout from command line
- Extended memtype hook tests
Tools
- Added UCP active message support to perftest
- Added error handling option to perftest
- Added wakeup option
- Added performance tests for am short iov
CI
- Added RHEL 7.6 with MOFED 4.7
- Added Fedora 34, RHEL 7.2, 7.4
- Added PGI support from HPC-SDK module
- Added docker image with CUDA 11.2
- Added IODEMO test
- Added Ubuntu 20.04
- Added test for connection manager fallback in client-server testing
- Added loopback interface for tcp testing
Bugfixes:
Build
- Fixes in libnuma detection macro
- Fixes for cross compilation support
- Fixes for --without-dc compilation
Continuous Integration
- Fixes in Azure pipeline build system
- Fixes in Coverity CI
- Fixes in Azure release pipeline
Packaging
- Fixes in DEB package - added essential system dependencies
Documentation
- Fixes in UCP, UCT, Readme, FAQ, and Read-the-docs documentation
Tests
- Fixes in CMA peer failure test
- Fixes in SRQ tests
- Fixes in the usage of requests_wait
- Fixes in test_uct_query
- Fixes addressing race conditions on client user data in test_uct_sockaddr
- Fixes in IODEMO app
- Fixes in error handling flow for perftest
- Fixes in perftest batch tests
- Fixes addressing hang issues for rendezvous protocol in UCP client server example
UCP
- Fixes in endpoint error handling
- Fixes in error reporting for failed CM lanes
- Fixes in progress worker flush
- Fixes in rendezvous pipeline flow
- Fixes in recursive protocol selection
- Fixes in error handling for AM_ZCOPY
- Fixes in length check condition in RMA PUT short
- Fixes in failure handling for rendezvous offload send
- Fixes in offload completion with inlined data
- Fixes in statistics calculations for rendezvous protocol
- Fixes in ucp_worker_query() thread mode for SERIALIZED
- Fixes preventing leaks of UCP requests
ROCM
- Fixes in device memory registration and de-registration
- Fixes in missing mem_query definition for rocm_copy
- Fixes addressing build failure due to const violation
- Fixes in sockaddr_accessibility test for rocm_copy and rocm_ipc
- Fixes in bandwidth estimation for rocm_ipc
RDMA CORE (IB, ROCE, etc.)
- Fixes addressing deadlock between DCI resources and RDMA_READ credits
- Fixes in DSCP for RoCE DCT
- Fixes in flush(cancel) flow
- Fixes preventing segfault in uct_rdmacm_cm_ep_str
- Fixes in scatter-gather entries logging
- Fixes for compilation with experimental verbs
- Fixes in UD dgid filtering
- Fixes in domain resources destroying
- Fixes in PCIe bandwidth calculation
- Fixes addressing CQ creation failure using legacy ibv API
- Fixes in iov2sge converter
- Fixes in port width check on HDR100
- Fixes in SL selection
- Fixes in hardware tag matching compilation
- Fixes in uct_rdmacm_cm_cqs hash key
- Fixes for compilation with rdma-core 20
Java
- Fixes in tag sender mask
UCT
- Fixes in reachability of loopback ifaces
- Fixes addressing possible uninitialized memory accesses
- Fixes in error flow for endpoints created upon receiving connection request
- Fixes in TCP keepalive to avoid false-positive error detection
UCM
- Fixes addressing heap corruption caused by ucp_set_event_handler()
- Fixes in mmap events test
v1.11.0-rc3
Features:
Core
- Added support for UCX monitoring using virtual file system (VFS)/FUSE
- Added support for applications with static CUDA runtime linking
- Added support for a configuration file
- Updated clang format configuration
UCP
- Added rendezvous API for active messages
- Added user-defined name to context, worker, and endpoint objects
- Added flag to silence request leak check
- Added API for endpoint performance evaluation
- Added API - ucp_request_query
- Added API - ucp_lib_query
- Ported connection manager to a new UCT API
- Added bandwidth optimizations for new protocols multi-lane
- Added support for multi-rail over lanes with BW ratio >= 1/4
- Added support for tracking outstanding requests and aborting those in case of connection failure
- Refactored keep-alive protocol
- Added device id to wireup protocol
- Added support for up to 128 transport layer resources in UCP context
- Added support for CUDA memory allocations with ucp_mem_map
- Increased UCP_WORKER_MAX_EP_CONFIG to 64
- Adjusted memory type zcopy threshold when UCX_ZCOPY_THRESH set
- Refactored wireup protocols, rendezvous, get, zcopy protocols
- Added put zcopy multi-rail
- Improved logging for new protocols
- Added system topology information
- Added new protocols for eager offload protocols
UCT
- Extended connection establishment API
- Added active message (AM) alignment in iface params
- Added active message short IOV API.
- Added support for interface query by operation and memory type
- Added API to get allocation base address and length
- Added md_dereg_v2 API
UCS
- Added log filter by source file name.
- Added checking for last element in fraglist queue
- Added a method to get IP address from sockaddr.
- Added memory usage limits to registration cache
UCM
- Improved x86 parser to recognize some mov flavors
CUDA
- Added registration for whole CUDA allocations
- Added CUDA-IPC keepalive
- Adjusted performance estimations
- Improved logging
- Added allocation methods for CUDA pinned/managed memory
- Added support for a global cuda_ipc cache
RDMA CORE (IB, ROCE, etc.)
- Added report of QP info in case of completion with error
- Refactored FC send operations
- Added support for DevX unique QPN allocation
- Optimized endpoint lookup for DCI
- Added support for RDMA sub-function (SF)
- Added support for DCI via DEVX
- Added DCI pool per LAG port
- Added support for RoCE IP reachability check using a subnet mask
- Added active message short IOV for UD/DC/RC mlx, UD/RC verbs
- Added endpoint keep alive check for UD
- Suppressed warning if device can't be opened
- Added support for multiple flush cancel without completion
- Added ignore for devices with invalid GID
- Added support for SRQ linked list reordering
- Added flush by flow control on old devices
- Added support for configurable rdma_resolve_addr/route timeout
Shared memory
- Added active message short IOV support for posix, sysv, and self transports
TCP
- Added support for peer failure in case of CONNECT_TO_EP
- Added support for active message short IOV
Java
- Added full support for UCP Java API
Tests
- Added length/mem_type for UCP client server example
- Added port sockaddr tests for a new API
- Added test send-recv between client/server with diff UCX_IB_NUM_PATHS
- Added support for CUDA and CUDA managed memory in io_demo
- Added support for a custom watchdog timeout from command line
- Extended memtype hook tests
Tools
- Added UCP active message support to perftest
- Added error handling option to perftest
- Added wakeup option
- Added performance tests for am short iov
CI
- Added RHEL 7.6 with MOFED 4.7
- Added Fedora 34, RHEL 7.2, 7.4
- Added PGI support from HPC-SDK module
- Added docker image with CUDA 11.2
- Added IODEMO test
- Added Ubuntu 20.04
- Added test for connection manager fallback in client-server testing
- Added loopback interface for tcp testing
Bugfixes:
Build
- Fixes in libnuma detection macro
- Fixes for cross compilation support
- Fixes for --without-dc compilation
Continuous Integration
- Fixes in Azure pipeline build system
- Fixes in Coverity CI
- Fixes in Azure release pipeline
Packaging
- Fixes in DEB package - added essential system dependencies
Documentation
- Fixes in UCP, UCT, Readme, FAQ, and Read-the-docs documentation
Tests
- Fixes in CMA peer failure test
- Fixes in SRQ tests
- Fixes in the usage of requests_wait
- Fixes in test_uct_query
- Fixes addressing race conditions on client user data in test_uct_sockaddr
- Fixes in IODEMO app
- Fixes in error handling flow for perftest
- Fixes in perftest batch tests
- Fixes addressing hang issues for rendezvous protocol in UCP client server example
UCP
- Fixes in endpoint error handling
- Fixes in error reporting for failed CM lanes
- Fixes in progress worker flush
- Fixes in rendezvous pipeline flow
- Fixes in recursive protocol selection
- Fixes in error handling for AM_ZCOPY
- Fixes in length check condition in RMA PUT short
- Fixes in failure handling for rendezvous offload send
- Fixes in offload completion with inlined data
- Fixes in statistics calculations for rendezvous protocol
- Fixes in ucp_worker_query() thread mode for SERIALIZED
- Fixes preventing leaks of UCP requests
ROCM
- Fixes in device memory registration and de-registration
- Fixes in missing mem_query definition for rocm_copy
- Fixes addressing build failure due to const violation
- Fixes in sockaddr_accessibility test for rocm_copy and rocm_ipc
- Fixes in bandwidth estimation for rocm_ipc
RDMA CORE (IB, ROCE, etc.)
- Fixes addressing deadlock between DCI resources and RDMA_READ credits
- Fixes in DSCP for RoCE DCT
- Fixes in flush(cancel) flow
- Fixes preventing segfault in uct_rdmacm_cm_ep_str
- Fixes in scatter-gather entries logging
- Fixes for compilation with experimental verbs
- Fixes in UD dgid filtering
- Fixes in domain resources destroying
- Fixes in PCIe bandwidth calculation
- Fixes addressing CQ creation failure using legacy ibv API
- Fixes in iov2sge converter
- Fixes in port width check on HDR100
- Fixes in SL selection
- Fixes in hardware tag matching compilation
- Fixes in uct_rdmacm_cm_cqs hash key
Java
- Fixes in tag sender mask
UCT
- Fixes in reachability of loopback ifaces
- Fixes addressing possible uninitialized memory accesses
- Fixes in error flow for endpoints created upon receiving connection request
UCM
- Fixes addressing heap corruption caused by ucp_set_event_handler()
- Fixes in mmap events test
v1.11.0-rc1
TBD
v1.10.1
1.10.1 (May 12, 2021)
Bugfixes:
- Fixes in InfiniBand port speed detection for HDR100
- Fixes in building gtest-all.cc and sock.c with GCC11
- Fixes addressing performance degradation with CUDA memory on a self endpoint
- Fixes in JUCX listener connection handler
- Fixes in configuration of loopback TCP transport (disabled by default)
- Fixes in RPM dependency on libibverbs
- Fixes in ABI backward compatibility for active message protocol
- Fixes in the DC transport - adding support for full-handshake mode (off by default)
- Fixes in Active Messages short reply protocol
- Fixes for segmentation fault while listening for connections
v1.10.1-rc2
1.10.1 RC2 (May 10, 2021)
Bugfixes:
- Fixes in InfiniBand port speed detection for HDR100
- Fixes in building gtest-all.cc and sock.c with GCC11
- Fixes addressing performance degradation with CUDA memory on a self endpoint
- Fixes in JUCX listener connection handler
- Fixes in configuration of loopback TCP transport (disabled by default)
- Fixes in RPM dependency on libibverbs
- Fixes in ABI backward compatibility for active message protocol
- Added support for DC full-handshake mode (off by default)
- Fixes in Active Messages short reply protocol
- Fixes for segmentation fault while listening for connections
v1.10.1-rc1
1.10.1-rc1
Bugfixes:
- Fix InfiniBand port speed detection for HDR100
- Fix build issues in gtest-all.cc and sock.c with GCC11
- Fix performance degradation with CUDA memory on self endpoint
- Fix bug in JUCX listener connection handler
v1.10.0
Features:
Core
- Added support for Nvidia HPC SDK
- Added support for latest PGI and Clang
- Added support for ROCM-3.7+ (warning generated if older version detected)
- Added support for GCC11
Architecture
- Added Arm SVE memcpy()
- Redesigned Arm WFE support
- Improved clear_cache performance for Arm
- Added architecture detection for Zhaoxin CPU
CI
- Added release builds on CUDA 11
- Enabled performance validation in gtest
- Added new OS for release CI
UCP
- Added locality awareness to the transport selection logic for GPU devices
- Added put/offload/short and put/offload/zcopy protocols
- Added receive message nbx routine
- Reworked AM implementation and API, which adds support for RNDV semantics
- Added support for multi-lane connection manager over TCP
- Added support for printing AM tls with info log level
- Implemented flush and destroy for UCT EPs on UCP worker
- Reduced UCP request size
- Added support for keepalive protocol
- Added support for multi-fragment protocol
- Added implementation for protocol progress for eager, bcopy, and multicopy
- Improved selection logic for protocol selection
- Added new protocols for UCP get operation
- Added bcopy protocols with support for GPU memory
- Added RNDV protocol implementation for GPU devices (CUDA, ROCm)
- Set SOCKADDR_CM_ENABLE=y by default
- Added support for fast-path short with new tag protocols
- Added a new parameter to control the CM listener's backlog
- Added support for sending AM RTS over short message protocol
- Added support for shared memory multi-lane when CM is used
- Added missing async locks
UCT
- Added API for keepalive_timeout value
- Added uct_completion.status
- Allowed transports to access multiple mem_types
- Removed status arg from uct_completion_callback_t
- Restructured uct_mem_alloc/uct_md_mem_alloc to use mem_type
- Updated documentation for uct_listener_params
- Lowered the log level for certain network errors
- Added cuda_copy wakeup feature
- Added wakeup support for shared memory
UCS
- Added "inf" and "auto" values to time units
- Added on-stack constructors for array and string buffer
- Added ucs_ptr_map_t data structure
- Added bool CSWAP
- Improved logging
- Added optimization for namespace processing
- Fixes for connection matching functionality
CUDA
- Added support for global IPC cache
RDMA CORE (IB, ROCE, etc.)
- Added support for auto detection of adaptive routing settings
- Added an option to poll TX CQ every progress iteration
- Added local and remote addresses to the reject error message
- Added support for UAR allocation with non-cacheable memory type
- Added support for multiple flush cancel without completion
- Added async events callback support
- Added detection for ConnectX-6, ConnectX-7 and BlueField-1/2 devices
- Added support for connection matching for UD
- Added a check for AM ordering
- Added better support for non-4K MTU values
Java (preview)
- Added support for a different javadoc executable path for different java versions
- Added UCS memory type constants
- Added support for building on Java 10+
- Added support for io-vector datatype.
- Removed libjucx from packages.
Tests
- Added CI for CUDA 11
- Added test_ucp_sockaddr_protocols.stream_short
- Reimplemented tests using NBX API
- Added flush(cancel) test
- Added memory_wait mode to perftest
- Added support for clang 10
- Refactored RMA and atomic tests, added memtype support
- Added test for uct_md_mem_query()
- Added request interrupt support
- Added support for connection manager fallbacks
- Added new ucp request test checking for leaks from the ptr_map
Documentation
- Added glossaries
Bugfixes:
Portability
- Fixes in print functions to use format string like PRIx64, etc.
- Fixes for Arm v8 cross compilation support
Continuous Integration
- Fixes in Github release flow
- Fixes in docker image
Packaging
- Removed deb package dependencies
- Fixes in SPEC to make the RPM relocatable
Documentation
- Fixes in documentation for ucp_am_recv_data_nbx
- Fixes in quick start example
- Fixes in installation instruction
- Fixes and updates in author list
Tests
- Fixes for failures under valgrind runtime
- Fixes in mmap tests for 0-length RMA
- Fixes in definition of LAST_WQE wait timeout
- Fixes in ROCm for mem_buffer test
- Fixes in test name printing format
- Fixes in tcp_sockcm test
UCP
- Fixes in worker cleanup flow
- Fixes in RNDV RTS flow
- Fix in length check condition for RMA PUT short
- Fixes in handling failures from AM Bcopy
- Fix in a release flow of deferred data
- Fixes for invalid ID and handling of status in RNDV
- Fixes in short active message reply protocol
CUDA
- Fixes in managed memory support
- Fixes in topology detection
RDMA CORE (IB, ROCE, etc.)
- Fixes in assert definitions
- Fixes in printing an error about invalid AM Bcopy length for UD
- Fixes for thread safety support
- Fixes to get ROCE device name according to GID
- Fixes for SL selection
- Fixes in creating STRICT_ORDER key
- Fixes addressing performance degradation in UD transport due to excess async events
- Fixes in QP destroy
- Fixes for CQ creation failure using old Verbs API
UGNI
- Fixes in disable logic in config
- Fixes for clang 11 warnings
Java
- Fixes in build dependencies
- Fixes in constructing UcpRequest object on error
- Fixes in exception handling on endpoint closure request
- Fixes for segfault in UcpErrorHandler
UCP
- Fixes in datatype support for get_zcopy RNDV
- Fixes in connection manager disconnect
- Fixes in assert definitions
- Fixes in completion flow for failed EP
- Fixes in flush error handling flow
- Fixes in latency calculations for wireup protocol
- Fixes in offload completion with inlined data
- Fixes in unpacking flow
- Fixes in error handling for various protocols
UCT
- Fixes in flush TX
- Fixes in checks for enabling GPU Direct RDMA
UCS
- Fixes for crashes on incorrect value set in config
- Fixes in ptr_array
- Fixes in maximal size for ucs_snprintf_safe()
- Fixes in compilation warning
- Fixes in ucs_aarch64_dsb(_op) definition
TCP
- Fixes in default route interface confirmation flow
- Fixes in PUT protocol
- Fixes in max connection limit and improved error reporting
UCM
- Fixes for crash on prevent unload
- Fixes in libucm_rocm
- Fixes for a few race conditions
v1.10.0-rc5
1.10.0-rc5 (February 26, 2021)
Features:
Core
- Added support for Nvidia HPC SDK
- Added support for latest PGI and Clang
- Added support for ROCM-3.7+ (warning generated if older version detected)
- Added support for GCC11
Architecture
- Added Arm SVE memcpy()
- Redesigned Arm WFE support
- Improved clear_cache performance for Arm
- Added architecture detection for Zhaoxin CPU
CI
- Added release builds on CUDA 11
- Enabled performance validation in gtest
- Added new OS for release CI
UCP
- Added locality awareness to the transport selection logic for GPU devices
- Added put/offload/short and put/offload/zcopy protocols
- Added receive message nbx routine
- Reworked AM implementation and API, which adds support for RNDV semantics
- Added support for multi-lane connection manager over TCP
- Added support for printing AM tls with info log level
- Implemented flush and destroy for UCT EPs on UCP worker
- Reduced UCP request size
- Added support for keepalive protocol
- Added support for multi-fragment protocol
- Added implementation for protocol progress for eager, bcopy, and multicopy
- Improved selection logic for protocol selection
- Added new protocols for UCP get operation
- Added bcopy protocols with support for GPU memory
- Added RNDV protocol implementation for GPU devices (CUDA, ROCm)
- Set SOCKADDR_CM_ENABLE=y by default
- Added support for fast-path short with new tag protocols
- Added a new parameter to control the CM listener's backlog
- Added support for sending AM RTS over short message protocol
- Added support for shared memory multi-lane when CM is used
- Added missing async locks
UCT
- Added API for keepalive_timeout value
- Added uct_completion.status
- Allowed transports to access multiple mem_types
- Removed status arg from uct_completion_callback_t
- Restructured uct_mem_alloc/uct_md_mem_alloc to use mem_type
- Updated documentation for uct_listener_params
- Lowered the log level for certain network errors
- Added cuda_copy wakeup feature
- Added wakeup support for shared memory
UCS
- Added "inf" and "auto" values to time units
- Added on-stack constructors for array and string buffer
- Added ucs_ptr_map_t data structure
- Added bool CSWAP
- Improved logging
- Added optimization for namespace processing
- Fixes for connection matching functionality
CUDA
- Added support for global IPC cache
RDMA CORE (IB, ROCE, etc.)
- Added support for auto detection of adaptive routing settings
- Added an option to poll TX CQ every progress iteration
- Added local and remote addresses to the reject error message
- Added support for UAR allocation with non-cacheable memory type
- Added support for multiple flush cancel without completion
- Added async events callback support
- Added detection for ConnectX-6, ConnectX-7 and BlueField-1/2 devices
- Added support for connection matching for UD
- Added a check for AM ordering
- Added better support for non-4K MTU values
Java (preview)
- Added support for a different javadoc executable path for different java versions
- Added UCS memory type constants
- Added support for building on Java 10+
- Added support for io-vector datatype.
Tests
- Added CI for CUDA 11
- Added test_ucp_sockaddr_protocols.stream_short
- Reimplemented tests using NBX API
- Added flush(cancel) test
- Added memory_wait mode to perftest
- Added support for clang 10
- Refactored RMA and atomic tests, added memtype support
- Added test for uct_md_mem_query()
- Added request interrupt support
- Added support for connection manager fallbacks
- Added new ucp request test checking for leaks from the ptr_map
Documentation
- Added glossaries
Bugfixes:
Portability
- Fixes in print functions to use format string like PRIx64, etc.
- Fixes for Arm v8 cross compilation support
Continuous Integration
- Fixes in Github release flow
- Fixes in docker image
Packaging
- Removed deb package dependencies
- Fixes in SPEC to make the RPM relocatable
Documentation
- Fixes in documentation for ucp_am_recv_data_nbx
- Fixes in quick start example
- Fixes in installation instruction
- Fixes and updates in author list
Tests
- Fixes for failures under valgrind runtime
- Fixes in mmap tests for 0-length RMA
- Fixes in definition of LAST_WQE wait timeout
- Fixes in ROCm for mem_buffer test
- Fixes in test name printing format
- Fixes in tcp_sockcm test
UCP
- Fixes in worker cleanup flow
- Fixes in RNDV RTS flow
- Fix in length check condition for RMA PUT short
- Fixes in handling failures from AM Bcopy
- Fix in a release flow of deferred data
- Fixes for invalid ID and handling of status in RNDV
- Fixes in short active message reply protocol
CUDA
- Fixes in managed memory support
- Fixes in topology detection
RDMA CORE (IB, ROCE, etc.)
- Fixes in assert definitions
- Fixes in printing an error about invalid AM Bcopy length for UD
- Fixes for thread safety support
- Fixes to get ROCE device name according to GID
- Fixes for SL selection
- Fixes in creating STRICT_ORDER key
- Fixes addressing performance degradation in UD transport due to excess async events
- Fixes in QP destroy
- Fixes for CQ creation failure using old Verbs API
UGNI
- Fixes in disable logic in config
- Fixes for clang 11 warnings
Java
- Fixes in build dependencies
- Fixes in constructing UcpRequest object on error
- Fixes in exception handling on endpoint closure request
- Fixes for segfault in UcpErrorHandler
UCP
- Fixes in datatype support for get_zcopy RNDV
- Fixes in connection manager disconnect
- Fixes in assert definitions
- Fixes in completion flow for failed EP
- Fixes in flush error handling flow
- Fixes in latency calculations for wireup protocol
- Fixes in offload completion with inlined data
- Fixes in unpacking flow
- Fixes in error handling for various protocols
UCT
- Fixes in flush TX
- Fixes in checks for enabling GPU Direct RDMA
UCS
- Fixes for crashes on incorrect value set in config
- Fixes in ptr_array
- Fixes in maximal size for ucs_snprintf_safe()
- Fixes in compilation warning
- Fixes in ucs_aarch64_dsb(_op) definition
TCP
- Fixes in default route interface confirmation flow
- Fixes in PUT protocol
- Fixes in max connection limit and improved error reporting
UCM
- Fixes for crash on prevent unload
- Fixes in libucm_rocm
- Fixes for a few race conditions
v1.10.0-rc4
1.10.0-rc4 (February 20, 2021)
Features:
Core
- Added support for Nvidia HPC SDK
- Added support for latest PGI and Clang
- Added support for ROCM-3.7+ (warning generated if older version detected)
- Added support for GCC11
Architecture
- Added Arm SVE memcpy()
- Redesigned Arm WFE support
- Improved clear_cache performance for Arm
- Added architecture detection for Zhaoxin CPU
CI
- Added release builds on CUDA 11
- Enabled performance validation in gtest
- Added new OS for release CI
UCP
- Added locality awareness to the transport selection logic for GPU devices
- Added put/offload/short and put/offload/zcopy protocols
- Added receive message nbx routine
- Reworked AM implementation and API, which adds support for RNDV semantics
- Added support for multi-lane connection manager over TCP
- Added support for printing AM tls with info log level
- Implemented flush and destroy for UCT EPs on UCP worker
- Reduced UCP request size
- Added support for keepalive protocol
- Added support for multi-fragment protocol
- Added implementation for protocol progress for eager, bcopy, and multicopy
- Improved selection logic for protocol selection
- Added new protocols for UCP get operation
- Added bcopy protocols with support for GPU memory
- Added RNDV protocol implementation for GPU devices (CUDA, ROCm)
- Set SOCKADDR_CM_ENABLE=y by default
- Added support for fast-path short with new tag protocols
- Added a new parameter to control the CM listener's backlog
- Added support for sending AM RTS over short message protocol
- Added support for shared memory multi-lane when CM is used
- Added missing async locks
UCT
- Added API for keepalive_timeout value
- Added uct_completion.status
- Allowed transports to access multiple mem_types
- Removed status arg from uct_completion_callback_t
- Restructured uct_mem_alloc/uct_md_mem_alloc to use mem_type
- Updated documentation for uct_listener_params
- Lowered the log level for certain network errors
- Added cuda_copy wakeup feature
- Added wakeup support for shared memory
UCS
- Added "inf" and "auto" values to time units
- Added on-stack constructors for array and string buffer
- Added ucs_ptr_map_t data structure
- Added bool CSWAP
- Improved logging
- Added optimization for namespace processing
- Fixes for connection matching functionality
CUDA
- Added support for global IPC cache
RDMA CORE (IB, ROCE, etc.)
- Added support for auto detection of adaptive routing settings
- Added an option to poll TX CQ every progress iteration
- Added local and remote addresses to the reject error message
- Added support for UAR allocation with non-cacheable memory type
- Added support for multiple flush cancel without completion
- Added async events callback support
- Added detection for ConnectX-6, ConnectX-7 and BlueField-1/2 devices
- Added support for connection matching for UD
- Added a check for AM ordering
- Added better support for non-4K MTU values
Java (preview)
- Added support for a different javadoc executable path for different java versions
- Added UCS memory type constants
- Added support for building on Java 10+
- Added support for io-vector datatype.
Tests
- Added CI for CUDA 11
- Added test_ucp_sockaddr_protocols.stream_short
- Reimplemented tests using NBX API
- Added flush(cancel) test
- Added memory_wait mode to perftest
- Added support for clang 10
- Refactored RMA and atomic tests, added memtype support
- Added test for uct_md_mem_query()
- Added request interrupt support
- Added support for connection manager fallbacks
- Added new ucp request test checking for leaks from the ptr_map
Documentation
- Added glossaries
Bugfixes:
Portability
- Fixes in print functions to use format string like PRIx64, etc.
- Fixes for Arm v8 cross compilation support
Continuous Integration
- Fixes in Github release flow
- Fixes in docker image
Packaging
- Removed deb package dependencies
- Fixes in SPEC to make the RPM relocatable
Documentation
- Fixes in documentation for ucp_am_recv_data_nbx
- Fixes in quick start example
- Fixes in installation instruction
- Fixes and updates in author list
Tests
- Fixes for failures under valgrind runtime
- Fixes in mmap tests for 0-length RMA
- Fixes in definition of LAST_WQE wait timeout
- Fixes in ROCm for mem_buffer test
- Fixes in test name printing format
- Fixes in tcp_sockcm test
UCP
- Fixes in worker cleanup flow
- Fixes in RNDV RTS flow
- Fix in length check condition for RMA PUT short
- Fixes in handling failures from AM Bcopy
- Fix in a release flow of deferred data
- Fixes for invalid ID and handling of status in RNDV
CUDA
- Fixes in managed memory support
- Fixes in topology detection
RDMA CORE (IB, ROCE, etc.)
- Fixes in assert definitions
- Fixes in printing an error about invalid AM Bcopy length for UD
- Fixes for thread safety support
- Fixes to get ROCE device name according to GID
- Fixes for SL selection
- Fixes in creating STRICT_ORDER key
- Fixes addressing performance degradation in UD transport due to excess async events
- Fixes in QP destroy
- Fixes for CQ creation failure using old Verbs API
UGNI
- Fixes in disable logic in config
- Fixes for clang 11 warnings
Java
- Fixes in build dependencies
- Fixes in constructing UcpRequest object on error
- Fixes in exception handling on endpoint closure request
- Fixes for segfault in UcpErrorHandler
UCP
- Fixes in datatype support for get_zcopy RNDV
- Fixes in connection manager disconnect
- Fixes in assert definitions
- Fixes in completion flow for failed EP
- Fixes in flush error handling flow
- Fixes in latency calculations for wireup protocol
- Fixes in offload completion with inlined data
- Fixes in unpacking flow
- Fixes in error handling for various protocols
UCT
- Fixes in flush TX
- Fixes in checks for enabling GPU Direct RDMA
UCS
- Fixes for crashes on incorrect value set in config
- Fixes in ptr_array
- Fixes in maximal size for ucs_snprintf_safe()
- Fixes in compilation warning
- Fixes in ucs_aarch64_dsb(_op) definition
TCP
- Fixes in default route interface confirmation flow
- Fixes in PUT protocol
- Fixes in max connection limit and improved error reporting
UCM
- Fixes for crash on prevent unload
- Fixes in libucm_rocm
- Fixes for a few race conditions
v1.10.0-rc3
1.10.0-rc3 (February 15, 2021)
Features:
Core
- Added support for Nvidia HPC SDK
- Added support for latest PGI and Clang
- Added support for ROCM-3.7+ (warning generated if older version detected)
- Added support for GCC11
Architecture
- Added Arm SVE memcpy()
- Redesigned Arm WFE support
- Improved clear_cache performance for Arm
- Added architecture detection for Zhaoxin CPU
CI
- Added release builds on CUDA 11
- Enabled performance validation in gtest
UCP
- Added locality awareness to the transport selection logic for GPU devices
- Added put/offload/short and put/offload/zcopy protocols
- Added receive message nbx routine
- Reworked AM implementation and API, which adds support for RNDV semantics
- Added support for multi-lane connection manager over TCP
- Added support for printing AM tls with info log level
- Implemented flush and destroy for UCT EPs on UCP worker
- Reduced UCP request size
- Added support for keepalive protocol
- Added support for multi-fragment protocol
- Added implementation for protocol progress for eager, bcopy, and multicopy
- Improved selection logic for protocol selection
- Added new protocols for UCP get operation
- Added bcopy protocols with support for GPU memory
- Added RNDV protocol implementation for GPU devices (CUDA, ROCm)
- Set SOCKADDR_CM_ENABLE=y by default
- Added support for fast-path short with new tag protocols
- Added a new parameter to control the CM listener's backlog
- Added support for sending AM RTS over short message protocol
- Added support for shared memory multi-lane when CM is used
UCT
- Added API for keepalive_timeout value
- Added uct_completion.status
- Allowed transports to access multiple mem_types
- Removed status arg from uct_completion_callback_t
- Restructured uct_mem_alloc/uct_md_mem_alloc to use mem_type
- Updated documentation for uct_listener_params
- Lowered the log level for certain network errors
- Added cuda_copy wakeup feature
- Added wakeup support for shared memory
UCS
- Added "inf" and "auto" values to time units
- Added on-stack constructors for array and string buffer
- Added ucs_ptr_map_t data structure
- Added bool CSWAP
- Improved logging
- Added optimization for namespace processing
- Fixes for connection matching functionality
RDMA CORE (IB, ROCE, etc.)
- Added support for auto detection of adaptive routing settings
- Added an option to poll TX CQ every progress iteration
- Added local and remote addresses to the reject error message
- Added support for UAR allocation with non-cacheable memory type
- Added support for multiple flush cancel without completion
- Added async events callback support
- Added detection for ConnectX-6, ConnectX-7 and BlueField-1/2 devices
- Added support for connection matching for UD
- Added a check for AM ordering
- Added better support for non-4K MTU values
Java (preview)
- Added support for a different javadoc executable path for different java versions
- Added UCS memory type constants
- Added support for building on Java 10+
- Added support for io-vector datatype
Tests
- Added CI for CUDA 11
- Added test_ucp_sockaddr_protocols.stream_short
- Reimplemented tests using NBX API
- Added flush(cancel) test
- Added memory_wait mode to perftest
- Added support for clang 10
- Refactored RMA and atomic tests, added memtype support
- Added test for uct_md_mem_query()
- Added request interrupt support
- Added support for connection manager fallbacks
- Added new ucp request test checking for leaks from the ptr_map
Documentation
- Added glossaries
Bugfixes:
Portability
- Fixes in print functions to use format string like PRIx64, etc.
- Fixes for Arm v8 cross compilation support
Continuous Integration
- Fixes in GitHub release flow
- Fixes in docker image
Packaging
- Removed deb package dependencies
- Fixes in SPEC to make the RPM relocatable
Documentation
- Fixes in documentation for ucp_am_recv_data_nbx
- Fixes in quick start example
- Fixes in installation instructions
- Fixes and updates in author list
Tests
- Fixes for failures under valgrind runtime
- Fixes in mmap tests for 0-length RMA
- Fixes in definition of LAST_WQE wait timeout
- Fixes in ROCm for mem_buffer test
- Fixes in test name printing format
- Fixes in tcp_sockcm test
UCP
- Fixes in worker cleanup flow
- Fixes in RNDV RTS flow
- Fix in length check condition for RMA PUT short
- Fixes in handling failures from AM Bcopy
- Fix in a release flow of deferred data
- Fixes for invalid ID and handling of status in RNDV
CUDA
- Fixes in managed memory support
RDMA CORE (IB, ROCE, etc.)
- Fixes in assert definitions
- Fixes in printing an error about invalid AM Bcopy length for UD
- Fixes for thread safety support
- Fixes to get ROCE device name according to GID
- Fixes for SL selection
- Fixes in creating STRICT_ORDER key
- Fixes addressing performance degradation in UD transport due to excess async events
- Fixes in QP destroy
- Fixes for CQ creation failure using old Verbs API
UGNI
- Fixes in disable logic in config
- Fixes for clang 11 warnings
Java
- Fixes in build dependencies
- Fixes in constructing UcpRequest object on error
- Fixes in exception handling on endpoint closure request
- Fixes for segfault in UcpErrorHandler
UCP
- Fixes in datatype support for get_zcopy RNDV
- Fixes in connection manager disconnect
- Fixes in assert definitions
- Fixes in completion flow for failed EP
- Fixes in flush error handling flow
- Fixes in latency calculations for wireup protocol
- Fixes in offload completion with inlined data
- Fixes in unpacking flow
- Fixes in error handling for various protocols
UCT
- Fixes in flush TX
- Fixes in checks for enabling GPU Direct RDMA
UCS
- Fixes for crashes on incorrect value set in config
- Fixes in ptr_array
- Fixes in maximal size for ucs_snprintf_safe()
- Fixes in compilation warning
- Fixes in ucs_aarch64_dsb(_op) definition
TCP
- Fixes in default route interface confirmation flow
- Fixes in PUT protocol
- Fixes in max connection limit and improved error reporting
UCM
- Fixes for crash on prevent unload
- Fixes in libucm_rocm
- Fixes for a few race conditions