2 nodes, 32 processes per node worked fine.
2 nodes, 64 processes per node triggered this error. Setting `export LCI_IBV_ENABLE_TD=0` fixed it, so the error is related to a hardware resource limit on ibv thread domains (TDs), likely uUAR. The limit lies in the range [32x64 = 2048, 64x128 = 8192) TDs per node.
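The bounds above follow from simple counting. A minimal sketch of that arithmetic, assuming each process creates one QP (and therefore one TD) per process in the job, including a loopback QP to itself; the function name `tds_per_node` is hypothetical and not part of LCI:

```python
def tds_per_node(nodes: int, ppn: int) -> int:
    """Thread domains created on one node when every QP gets its own TD.

    Assumption: each process opens one QP per process in the job
    (loopback included). Without loopback, subtract one per process.
    """
    total_processes = nodes * ppn
    qps_per_process = total_processes
    return ppn * qps_per_process


working = tds_per_node(nodes=2, ppn=32)  # configuration that ran fine
failing = tds_per_node(nodes=2, ppn=64)  # configuration that hit the error
print(f"TD limit per node is in [{working}, {failing})")
```

The working run gives the inclusive lower bound and the failing run gives the exclusive upper bound, which is where the interval [2048, 8192) comes from.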
LCI does not need to create a thread domain per ibv QP anyway. Ideally it only needs one thread domain per thread (assuming a static mapping between queue pairs and threads).
Need to figure out a better way for thread domain allocation.
Each process needs a QP for each other process: with n processes, each has n or n - 1 QPs (I don't remember whether we create a loopback QP or just short-circuit self-sends), for a total of O(n^2) QPs across all nodes. Given N nodes and P processes per node, there are O(N * P^2) QPs per node. As you accurately point out, the hardware limit (for TDs) is somewhere between 2048 and 8192 QPs.
Even if we resolve the TD limit by reusing TDs, it won't take long to reach the hardware QP limit with a larger number of nodes (16 nodes at 64 processes per node is 64k QPs per node). This is precisely the reason that the XRC (and DC) QP types exist. In general, LCI is not designed for running with high PPN, and we'd need to implement some of the Nvidia/Mellanox extensions to support it properly.
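To make the scaling concrete, here is a rough sketch that evaluates the N * P^2 estimate for a few node counts at 64 processes per node. It assumes one QP per peer process with loopback included; `qps_per_node` is a hypothetical helper, not an LCI function:

```python
def qps_per_node(nodes: int, ppn: int) -> int:
    """Estimate QPs hosted on one node: ppn processes, each with nodes * ppn QPs."""
    return nodes * ppn ** 2


PPN = 64  # processes per node, matching the failing run in this issue
for nodes in (2, 4, 8, 16):
    print(f"{nodes:>2} nodes -> {qps_per_node(nodes, PPN):>6} QPs per node")
```

Per-node QP count grows linearly with the node count and quadratically with PPN, so lowering PPN (and using threads within a process instead) reduces the pressure much faster than reducing the number of nodes.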