forked from redis/redis
Main changes in this patch:
* introduce the *Redis Over RDMA* protocol, see the *Protocol* section in RDMA.md
* implement the server side of the connection module only; this means RDMA
  support can *NOT* be compiled as built-in.
* add necessary information in RDMA.md
* support 'CONFIG SET/GET', for example, 'CONFIG SET rdma.port 6380', then
  check this with 'rdma res show cm_id' and redis-cli (with RDMA support,
  but not implemented in this patch)
* the full list of listeners looks like:
    listener0:name=tcp,bind=*,bind=-::*,port=6379
    listener1:name=unix,bind=/var/run/redis.sock
    listener2:name=rdma,bind=xx.xx.xx.xx,bind=yy.yy.yy.yy,port=6379
    listener3:name=tls,bind=*,bind=-::*,port=16379

valgrind test works fine:
    valgrind --track-origins=yes --suppressions=./src/valgrind.sup \
        --show-reachable=no --show-possibly-lost=no --leak-check=full \
        --log-file=err.txt ./src/redis-server --port 6379 \
        --loadmodule src/redis-rdma.so port=6379 bind=xx.xx.xx.xx \
        --loglevel verbose --protected-mode no --server_cpulist 2 \
        --bio_cpulist 3 --aof_rewrite_cpulist 3 --bgsave_cpulist 3 \
        --appendonly no

performance test:
server side:
    # TCP port 6379 has no conflict with RDMA port 6379
    ./src/redis-server --port 6379 \
        --loadmodule src/redis-rdma.so port=6379 bind=xx.xx.xx.xx \
        bind=yy.yy.yy.yy --loglevel verbose --protected-mode no \
        --server_cpulist 2 --bio_cpulist 3 --aof_rewrite_cpulist 3 \
        --bgsave_cpulist 3 --appendonly no

build a redis-benchmark with RDMA support (not implemented in this patch), run
on x86 (Intel Platinum 8260) with a RoCEv2 interface (Mellanox ConnectX-5):
client side:
    ./src/redis-benchmark -h xx.xx.xx.xx -p 6379 -c 30 -n 10000000 --threads 4 \
        -d 1024 -t ping,get,set --rdma
    ====== PING_INLINE ======
    480561.28 requests per second, 0.060 msec avg latency.
    ====== PING_MBULK ======
    540482.06 requests per second, 0.053 msec avg latency.
    ====== SET ======
    399952.00 requests per second, 0.073 msec avg latency.
    ====== GET ======
    443498.31 requests per second, 0.065 msec avg latency.

Signed-off-by: zhenwei pi <[email protected]>
Showing 3 changed files with 2,142 additions and 1 deletion.
@@ -0,0 +1,174 @@
RDMA Support
============

Getting Started
---------------
Note that Redis Over RDMA is supported only on Linux.

## Building

To build with RDMA support you'll need the RDMA development libraries (e.g.
librdmacm-dev and libibverbs-dev on Debian/Ubuntu).

For now, Redis supports RDMA only as a connection module.
Run `make BUILD_RDMA=module`.

## Running manually

To manually run a Redis server in RDMA mode:

    ./src/redis-server --protected-mode no \
         --loadmodule src/redis-rdma.so bind=192.168.122.100 port=6379

The RDMA bind address and port can be changed at runtime:

    10.2.16.101:6379> CONFIG SET rdma-port 6380

RDMA and TCP can be enabled at the same time; TCP port 6379 does not
conflict with RDMA port 6379. For example:

    ./src/redis-server --protected-mode no \
         --loadmodule src/redis-rdma.so bind=192.168.122.100 port=6379 \
         --port 6379

Note that the network card (192.168.122.100 in this example) must support
RDMA. To check whether a server supports RDMA:

    ~# rdma res show (requires a recent version of the iproute2 package)

Or:

    ~# ibv_devices

Connections
-----------

RDMA operations also go through a connection abstraction layer that hides
I/O and read/write event handling from the caller.

Redis uses a stream-oriented protocol, while RDMA is message-oriented, so
additional work is required to support RDMA-based Redis.

## Protocol
The protocol separates the control plane (which exchanges control messages)
from the data plane (which transfers the actual Redis payload).

### Control message
Control messages are fixed-size, 32-byte messages defined by the following structures:
```
typedef struct RedisRdmaFeature {
    /* defined as following Opcodes */
    uint16_t opcode;
    /* select features */
    uint16_t select;
    uint8_t rsvd[20];
    /* feature bits */
    uint64_t features;
} RedisRdmaFeature;

typedef struct RedisRdmaKeepalive {
    /* defined as following Opcodes */
    uint16_t opcode;
    uint8_t rsvd[30];
} RedisRdmaKeepalive;

typedef struct RedisRdmaMemory {
    /* defined as following Opcodes */
    uint16_t opcode;
    uint8_t rsvd[14];
    /* address of a transfer buffer which is used to receive remote streaming data,
     * aka 'RX buffer address'. The remote side should use this as 'TX buffer address' */
    uint64_t addr;
    /* length of the 'RX buffer' */
    uint32_t length;
    /* the RDMA remote key of 'RX buffer' */
    uint32_t key;
} RedisRdmaMemory;

typedef union RedisRdmaCmd {
    RedisRdmaFeature feature;
    RedisRdmaKeepalive keepalive;
    RedisRdmaMemory memory;
} RedisRdmaCmd;
```

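The 32-byte size claimed for control messages can be checked at compile time. The sketch below repeats the structures and adds C11 static assertions; `redis_rdma_make_memory_cmd` is a hypothetical helper introduced only for illustration, not a function from the patch. Byte-order conversion of the fields is not described in this section, so the sketch leaves values in host order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Structures as defined in the "Control message" section. */
typedef struct RedisRdmaFeature {
    uint16_t opcode;
    uint16_t select;
    uint8_t rsvd[20];
    uint64_t features;
} RedisRdmaFeature;

typedef struct RedisRdmaKeepalive {
    uint16_t opcode;
    uint8_t rsvd[30];
} RedisRdmaKeepalive;

typedef struct RedisRdmaMemory {
    uint16_t opcode;
    uint8_t rsvd[14];
    uint64_t addr;
    uint32_t length;
    uint32_t key;
} RedisRdmaMemory;

typedef union RedisRdmaCmd {
    RedisRdmaFeature feature;
    RedisRdmaKeepalive keepalive;
    RedisRdmaMemory memory;
} RedisRdmaCmd;

/* Every variant, and therefore the union, must be exactly 32 bytes. */
_Static_assert(sizeof(RedisRdmaFeature) == 32, "feature message must be 32 bytes");
_Static_assert(sizeof(RedisRdmaKeepalive) == 32, "keepalive message must be 32 bytes");
_Static_assert(sizeof(RedisRdmaMemory) == 32, "memory message must be 32 bytes");
_Static_assert(sizeof(RedisRdmaCmd) == 32, "control message must be 32 bytes");

/* The opcode is the first field of every variant, so a receiver can read it
 * through any member before it knows which variant arrived. */
_Static_assert(offsetof(RedisRdmaFeature, opcode) == 0, "opcode must come first");
_Static_assert(offsetof(RedisRdmaMemory, opcode) == 0, "opcode must come first");
_Static_assert(offsetof(RedisRdmaMemory, addr) == 16, "addr follows 2 + 14 bytes");

/* Hypothetical helper: build a memory-registration message with the
 * reserved bytes zeroed. Values are stored in host byte order here. */
RedisRdmaCmd redis_rdma_make_memory_cmd(uint16_t opcode, uint64_t addr,
                                        uint32_t length, uint32_t key) {
    RedisRdmaCmd cmd;
    memset(&cmd, 0, sizeof(cmd));
    cmd.memory.opcode = opcode;
    cmd.memory.addr = addr;
    cmd.memory.length = length;
    cmd.memory.key = key;
    return cmd;
}
```

If a compiler ever inserted unexpected padding, the static assertions would fail the build rather than produce a message the peer cannot parse.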
### Opcodes
| Command | Value | Description |
| :----: | :----: | :----: |
| GetServerFeature | 0 | required, get the features offered by the Redis server |
| SetClientFeature | 1 | required, negotiate features and set them on the Redis server |
| Keepalive | 2 | required, detect unexpectedly orphaned connections |
| RegisterXferMemory | 3 | required, tell the remote side about the 'RX transfer buffer'; the remote side uses it as its 'TX transfer buffer' |

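The opcode table maps directly onto a C enumeration. The following is a minimal sketch: the enumerator names mirror the table, while `redis_rdma_opcode_name` is a hypothetical helper (for example, for log messages) and is not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opcode values exactly as listed in the table. */
typedef enum RedisRdmaOpcode {
    GetServerFeature = 0,
    SetClientFeature = 1,
    Keepalive = 2,
    RegisterXferMemory = 3,
} RedisRdmaOpcode;

/* Hypothetical helper: return a printable name for an opcode,
 * or NULL when the value is not defined by the protocol. */
const char *redis_rdma_opcode_name(uint16_t opcode) {
    switch (opcode) {
    case GetServerFeature:   return "GetServerFeature";
    case SetClientFeature:   return "SetClientFeature";
    case Keepalive:          return "Keepalive";
    case RegisterXferMemory: return "RegisterXferMemory";
    default:                 return NULL;
    }
}
```

Returning NULL for unknown values lets a receiver reject a malformed control message instead of guessing at its layout.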
### Operations of RDMA
- To send a control message, use RDMA '**ibv_post_send**' with opcode '**IBV_WR_SEND**' and a
  'RedisRdmaCmd' structure as the payload.
- To receive a control message, use RDMA '**ibv_post_recv**'; the receive buffer
  size should be the size of 'RedisRdmaCmd'.
- To transfer stream data, use RDMA '**ibv_post_send**' with opcode '**IBV_WR_RDMA_WRITE**' (optional) and
  '**IBV_WR_RDMA_WRITE_WITH_IMM**' (required). Data segments are written into a connection as
  RDMA [WRITE][WRITE][WRITE]...[WRITE WITH IMM], and the total length of the written data is carried in the
  immediate data (a 32-bit unsigned integer).

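The [WRITE]...[WRITE WITH IMM] pattern can be illustrated without any RDMA hardware. The sketch below only plans the segments; it does not call libibverbs. `XferOp`, `XferSegment` and `plan_stream_write` are hypothetical names standing in for the real work requests, and the fixed segment size is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for IBV_WR_RDMA_WRITE and
 * IBV_WR_RDMA_WRITE_WITH_IMM. */
typedef enum XferOp { XFER_WRITE, XFER_WRITE_WITH_IMM } XferOp;

typedef struct XferSegment {
    XferOp op;
    uint32_t offset; /* offset of this segment within the payload */
    uint32_t length; /* bytes carried by this segment */
    uint32_t imm;    /* total payload length; set only on the last segment */
} XferSegment;

/* Split `total` bytes into segments of at most `seg_size` bytes.
 * All segments except the last are plain writes; the last one is a
 * write-with-immediate whose immediate data is the total length.
 * Returns the number of segments, or 0 if the input is empty or the
 * plan does not fit into `max_segments` entries. */
size_t plan_stream_write(uint32_t total, uint32_t seg_size,
                         XferSegment *out, size_t max_segments) {
    assert(out != NULL);
    if (total == 0 || seg_size == 0) return 0;

    size_t count = 0;
    uint32_t offset = 0;
    while (offset < total) {
        if (count == max_segments) return 0;
        uint32_t remaining = total - offset;
        uint32_t len = remaining < seg_size ? remaining : seg_size;
        int last = (len == remaining);
        out[count].op = last ? XFER_WRITE_WITH_IMM : XFER_WRITE;
        out[count].offset = offset;
        out[count].length = len;
        out[count].imm = last ? total : 0;
        count++;
        offset += len;
    }
    return count;
}
```

A payload that fits in one segment yields a single write-with-immediate, which matches the text: the plain writes are optional, the final write-with-immediate is required.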
### Maximum WQE(s) of RDMA
This protocol currently defines no specific restriction. The recommended number of WQEs is 1024.
Flow control for WQEs MAY be defined/implemented in the future.

### The workflow of this protocol
```
                                        server
                                        listen RDMA port
  client
        -------------------RDMA connect------------------>
                                        accept connection
        <--------------- Establish RDMA ------------------
        --------Get server feature [@IBV_WR_SEND] ------->
        --------Set client feature [@IBV_WR_SEND] ------->
                                        setup RX buffer
        <---- Register transfer memory [@IBV_WR_SEND] ----
  [@ibv_post_recv]
  setup TX buffer
        ----- Register transfer memory [@IBV_WR_SEND] --->
                                        [@ibv_post_recv]
                                        setup TX buffer
        -- Redis commands [@IBV_WR_RDMA_WRITE_WITH_IMM] ->
        <- Redis response [@IBV_WR_RDMA_WRITE_WITH_IMM] --
          .......
        -- Redis commands [@IBV_WR_RDMA_WRITE_WITH_IMM] ->
        <- Redis response [@IBV_WR_RDMA_WRITE_WITH_IMM] --
          .......
  RX is full
        ------ Register Local buffer [@IBV_WR_SEND] ----->
                                        [@ibv_post_recv]
                                        setup TX buffer
        <- Redis response [@IBV_WR_RDMA_WRITE_WITH_IMM] --
          .......
                                        RX is full
        <----- Register Local buffer [@IBV_WR_SEND] ------
  [@ibv_post_recv]
  setup TX buffer
        -- Redis commands [@IBV_WR_RDMA_WRITE_WITH_IMM] ->
        <- Redis response [@IBV_WR_RDMA_WRITE_WITH_IMM] --
          .......
        ------------------RDMA disconnect---------------->
        <-----------------RDMA disconnect-----------------
```

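The control-message ordering in the workflow can be expressed as a small state machine. This is a sketch of the order drawn in the diagram only: the state names, `Direction` and `handshake_step` are hypothetical, and Keepalive is left out because the diagram does not show when it is sent.

```c
#include <assert.h>
#include <stdint.h>

/* Opcode values from the Opcodes table. */
enum { GetServerFeature = 0, SetClientFeature = 1,
       Keepalive = 2, RegisterXferMemory = 3 };

typedef enum Direction { CLIENT_TO_SERVER, SERVER_TO_CLIENT } Direction;

/* Hypothetical states, one per step of the diagram. */
typedef enum HandshakeState {
    HS_ESTABLISHED,      /* RDMA connection is up */
    HS_FEATURE_QUERIED,  /* client sent GetServerFeature */
    HS_FEATURE_SET,      /* client sent SetClientFeature */
    HS_SERVER_RX_KNOWN,  /* server registered its RX buffer */
    HS_READY,            /* both RX buffers registered; data may flow */
    HS_ERROR
} HandshakeState;

/* Advance the handshake by one control message. Any message that is
 * out of order for the current state moves to HS_ERROR. */
HandshakeState handshake_step(HandshakeState s, Direction dir, uint16_t opcode) {
    switch (s) {
    case HS_ESTABLISHED:
        if (dir == CLIENT_TO_SERVER && opcode == GetServerFeature)
            return HS_FEATURE_QUERIED;
        break;
    case HS_FEATURE_QUERIED:
        if (dir == CLIENT_TO_SERVER && opcode == SetClientFeature)
            return HS_FEATURE_SET;
        break;
    case HS_FEATURE_SET:
        if (dir == SERVER_TO_CLIENT && opcode == RegisterXferMemory)
            return HS_SERVER_RX_KNOWN;
        break;
    case HS_SERVER_RX_KNOWN:
        if (dir == CLIENT_TO_SERVER && opcode == RegisterXferMemory)
            return HS_READY;
        break;
    case HS_READY:
        /* Either side may register a fresh RX buffer once its own is full. */
        if (opcode == RegisterXferMemory)
            return HS_READY;
        break;
    default:
        break;
    }
    return HS_ERROR;
}
```

Redis commands and responses travel as RDMA writes rather than control messages, so they do not appear as transitions here; they are only valid once the state is `HS_READY`.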
## Event handling
The RDMA completion channel has no POLLOUT event:
1. If TX is not full, the connection is always writable.
2. If TX is full, the sender should wait for a 'RegisterXferMemory' message to refresh
   the 'TX buffer'.

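The two writability rules can be modelled with simple bookkeeping on the TX buffer. The sketch below is illustrative only: `TxBuffer`, `tx_writable`, `tx_consume` and `tx_refresh` are hypothetical names, and it assumes that a buffer that has never been registered (length 0) counts as full. The refresh corresponds to receiving the buffer-registration control message (opcode 3 in the Opcodes table).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the peer's RX buffer, used locally as the TX buffer. */
typedef struct TxBuffer {
    uint64_t addr;   /* remote buffer address from the registration message */
    uint32_t length; /* remote buffer length */
    uint32_t key;    /* RDMA remote key */
    uint32_t offset; /* bytes already written into the remote buffer */
} TxBuffer;

/* Rule 1: the connection is writable whenever TX is not full. */
int tx_writable(const TxBuffer *tx) {
    assert(tx != NULL);
    return tx->offset < tx->length;
}

/* Account for a write of up to `len` bytes; returns the bytes accepted,
 * which is less than `len` when the buffer fills up. */
uint32_t tx_consume(TxBuffer *tx, uint32_t len) {
    assert(tx != NULL);
    uint32_t room = tx->length - tx->offset;
    uint32_t n = len < room ? len : room;
    tx->offset += n;
    return n;
}

/* Rule 2: a new registration message from the peer replaces the TX
 * buffer and makes the connection writable again. */
void tx_refresh(TxBuffer *tx, uint64_t addr, uint32_t length, uint32_t key) {
    assert(tx != NULL);
    tx->addr = addr;
    tx->length = length;
    tx->key = key;
    tx->offset = 0;
}
```

Because writability depends only on this local state, a caller can emulate a POLLOUT-style notification by checking `tx_writable` after each refresh.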
To-Do List
----------
- [ ] hiredis
- [ ] rdma client & benchmark
- [ ] POLLOUT event emulation for hiredis
- [ ] automated test suite (not implemented yet)