Introduce RDMA transport
Main changes in this patch:
* introduce the *Redis Over RDMA* protocol, see the *Protocol* section in RDMA.md
* implement only the server side, and only as a connection module; this
  means RDMA support can *NOT* be compiled as built-in.
* add necessary information in RDMA.md
* support 'CONFIG SET/GET', for example, 'CONFIG SET rdma.port 6380', then
  check this with 'rdma res show cm_id' and redis-cli (with RDMA support,
  which is not implemented in this patch)
* the full list of listeners looks like:
    listener0:name=tcp,bind=*,bind=-::*,port=6379
    listener1:name=unix,bind=/var/run/redis.sock
    listener2:name=rdma,bind=xx.xx.xx.xx,bind=yy.yy.yy.yy,port=6379
    listener3:name=tls,bind=*,bind=-::*,port=16379

valgrind test works fine:
valgrind --track-origins=yes --suppressions=./src/valgrind.sup
         --show-reachable=no --show-possibly-lost=no --leak-check=full
         --log-file=err.txt ./src/redis-server --port 6379
         --loadmodule src/redis-rdma.so port=6379 bind=xx.xx.xx.xx
         --loglevel verbose --protected-mode no --server_cpulist 2
         --bio_cpulist 3 --aof_rewrite_cpulist 3 --bgsave_cpulist 3
         --appendonly no

performance test:
server side: ./src/redis-server --port 6379 # TCP port 6379 has no conflict with RDMA port 6379
             --loadmodule src/redis-rdma.so port=6379 bind=xx.xx.xx.xx bind=yy.yy.yy.yy
             --loglevel verbose --protected-mode no --server_cpulist 2 --bio_cpulist 3
             --aof_rewrite_cpulist 3 --bgsave_cpulist 3 --appendonly no

build a redis-benchmark with RDMA support (not implemented in this patch) and
run it on an x86 host (Intel Platinum 8260) with a RoCEv2 interface (Mellanox ConnectX-5):
client side: ./src/redis-benchmark -h xx.xx.xx.xx -p 6379 -c 30 -n 10000000 --threads 4
             -d 1024 -t ping,get,set --rdma

====== PING_INLINE ======
480561.28 requests per second, 0.060 msec avg latency.

====== PING_MBULK ======
540482.06 requests per second, 0.053 msec avg latency.

====== SET ======
399952.00 requests per second, 0.073 msec avg latency.

====== GET ======
443498.31 requests per second, 0.065 msec avg latency.

Signed-off-by: zhenwei pi <[email protected]>
pizhenwei committed Apr 3, 2024
1 parent 4df0379 commit 1528f29
Showing 3 changed files with 2,142 additions and 1 deletion.
174 changes: 174 additions & 0 deletions RDMA.md
@@ -0,0 +1,174 @@
RDMA Support
============

Getting Started
---------------
Note that Redis Over RDMA is supported only on Linux.

## Building

To build with RDMA support you'll need RDMA development libraries (e.g.
librdmacm-dev and libibverbs-dev on Debian/Ubuntu).

For now, Redis supports RDMA only as a connection module.
Run `make BUILD_RDMA=module`.

## Running manually

To manually run a Redis server in RDMA mode:

    ./src/redis-server --protected-mode no \
         --loadmodule src/redis-rdma.so bind=192.168.122.100 port=6379

It's possible to change the RDMA bind address/port at runtime with a command:

    10.2.16.101:6379> CONFIG SET rdma-port 6380

It's also possible to have both RDMA and TCP available; TCP port 6379 and
RDMA port 6379 do not conflict. For example:

    ./src/redis-server --protected-mode no \
         --loadmodule src/redis-rdma.so bind=192.168.122.100 port=6379 \
         --port 6379

Note that the network card (192.168.122.100 in this example) must support
RDMA. To check whether a server supports RDMA:

    ~# rdma res show    (requires a recent version of the iproute2 package)

Or:

    ~# ibv_devices

Connections
-----------

RDMA operations also go through a connection abstraction layer that hides
I/O and read/write event handling from the caller.

Redis works on top of a stream-oriented protocol while RDMA is message-oriented,
so additional work is required to support RDMA-based Redis.

## Protocol
Redis Over RDMA separates the control plane (which exchanges control messages)
from the data plane (which transfers the real Redis payload).

### Control message
Each control message is a fixed 32-byte message, defined by the following structures:
```
typedef struct RedisRdmaFeature {
    /* defined as following Opcodes */
    uint16_t opcode;
    /* select features */
    uint16_t select;
    uint8_t rsvd[20];
    /* feature bits */
    uint64_t features;
} RedisRdmaFeature;

typedef struct RedisRdmaKeepalive {
    /* defined as following Opcodes */
    uint16_t opcode;
    uint8_t rsvd[30];
} RedisRdmaKeepalive;

typedef struct RedisRdmaMemory {
    /* defined as following Opcodes */
    uint16_t opcode;
    uint8_t rsvd[14];
    /* address of a transfer buffer which is used to receive remote streaming data,
     * aka 'RX buffer address'. The remote side should use this as 'TX buffer address' */
    uint64_t addr;
    /* length of the 'RX buffer' */
    uint32_t length;
    /* the RDMA remote key of 'RX buffer' */
    uint32_t key;
} RedisRdmaMemory;

typedef union RedisRdmaCmd {
    RedisRdmaFeature feature;
    RedisRdmaKeepalive keepalive;
    RedisRdmaMemory memory;
} RedisRdmaCmd;
```
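
The fixed 32-byte size can be checked at compile time. The following is a minimal sketch, not code from this patch: it repeats the structures above, asserts their size and the field offsets on a typical ABI, and fills a RegisterXferMemory message. The helper name `example_fill_register_memory` is hypothetical, and the fields are written in host byte order because this section does not state the wire byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Structures repeated from the protocol definition above. */
typedef struct RedisRdmaFeature {
    uint16_t opcode;
    uint16_t select;
    uint8_t rsvd[20];
    uint64_t features;
} RedisRdmaFeature;

typedef struct RedisRdmaKeepalive {
    uint16_t opcode;
    uint8_t rsvd[30];
} RedisRdmaKeepalive;

typedef struct RedisRdmaMemory {
    uint16_t opcode;
    uint8_t rsvd[14];
    uint64_t addr;
    uint32_t length;
    uint32_t key;
} RedisRdmaMemory;

typedef union RedisRdmaCmd {
    RedisRdmaFeature feature;
    RedisRdmaKeepalive keepalive;
    RedisRdmaMemory memory;
} RedisRdmaCmd;

/* Every control message variant must occupy exactly 32 bytes. */
_Static_assert(sizeof(RedisRdmaFeature) == 32, "feature message must be 32 bytes");
_Static_assert(sizeof(RedisRdmaKeepalive) == 32, "keepalive message must be 32 bytes");
_Static_assert(sizeof(RedisRdmaMemory) == 32, "memory message must be 32 bytes");
_Static_assert(sizeof(RedisRdmaCmd) == 32, "control message must be 32 bytes");

/* The reserved bytes place the payload fields on naturally aligned offsets. */
_Static_assert(offsetof(RedisRdmaFeature, features) == 24, "unexpected padding");
_Static_assert(offsetof(RedisRdmaMemory, addr) == 16, "unexpected padding");

/* Hypothetical helper: describe a local RX buffer to the remote side.
 * Opcode 3 is RegisterXferMemory in the opcode table.
 * Assumption: host byte order; the wire byte order is not specified here. */
static inline void example_fill_register_memory(RedisRdmaCmd *cmd, uint64_t addr,
                                                uint32_t length, uint32_t key) {
    assert(cmd != NULL);
    memset(cmd, 0, sizeof(*cmd)); /* reserved bytes are zeroed */
    cmd->memory.opcode = 3;
    cmd->memory.addr = addr;
    cmd->memory.length = length;
    cmd->memory.key = key;
}
```

A caller would declare a `RedisRdmaCmd`, fill it with `example_fill_register_memory(&cmd, addr, length, rkey)`, and then post it as a control message.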

### Opcodes
|Command| Value | Description |
| :----: | :----: | :----: |
| GetServerFeature | 0 | required, get the features offered by Redis server |
| SetClientFeature | 1 | required, negotiate features and set it to Redis server |
| Keepalive | 2 | required, detect unexpected orphan connection |
| RegisterXferMemory | 3 | required, tell the 'RX transfer buffer' information to the remote side, and the remote side uses this as 'TX transfer buffer' |
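
As an illustration, the opcode table could be expressed in C as below. The values come from the table; the enumerator names and the `example_opcode_name` helper are hypothetical and are not taken from this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical enumerator names; the numeric values follow the opcode table. */
typedef enum ExampleRdmaOpcode {
    EXAMPLE_GET_SERVER_FEATURE = 0,
    EXAMPLE_SET_CLIENT_FEATURE = 1,
    EXAMPLE_KEEPALIVE = 2,
    EXAMPLE_REGISTER_XFER_MEMORY = 3
} ExampleRdmaOpcode;

/* The opcode field of a control message is a uint16_t. */
_Static_assert(EXAMPLE_REGISTER_XFER_MEMORY <= UINT16_MAX, "opcode must fit in 16 bits");

/* Map an opcode to the command name used in the table.
 * Returns NULL for values the table does not define. */
static inline const char *example_opcode_name(uint16_t opcode) {
    switch (opcode) {
    case EXAMPLE_GET_SERVER_FEATURE:   return "GetServerFeature";
    case EXAMPLE_SET_CLIENT_FEATURE:   return "SetClientFeature";
    case EXAMPLE_KEEPALIVE:            return "Keepalive";
    case EXAMPLE_REGISTER_XFER_MEMORY: return "RegisterXferMemory";
    default:                           return NULL;
    }
}
```

A receiver could use such a lookup for logging, and treat a `NULL` result as an unknown control message.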

### Operations of RDMA
- Send a control message with RDMA '**ibv_post_send**' using opcode '**IBV_WR_SEND**' and a
'RedisRdmaCmd' structure as the payload.
- Receive a control message with RDMA '**ibv_post_recv**'; the receive buffer must be
the size of 'RedisRdmaCmd'.
- Transfer stream data with RDMA '**ibv_post_send**' using opcode '**IBV_WR_RDMA_WRITE**' (optional) and
'**IBV_WR_RDMA_WRITE_WITH_IMM**' (required). Data segments are written into a connection as
RDMA [WRITE][WRITE][WRITE]...[WRITE WITH IMM], and the total length of the written data is carried in the
immediate data (an unsigned 32-bit integer).
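
The [WRITE]...[WRITE WITH IMM] sequence can be sketched as a small planning function. This is an illustration only and does not call libibverbs: the `Example*` types and `example_plan_writes` are hypothetical names, the per-request size limit `max_seg` is an assumed parameter, and the immediate value is kept in host byte order (real code would have to follow the byte order that libibverbs expects for immediate data).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two libibverbs opcodes named above. */
typedef enum ExampleWriteKind {
    EXAMPLE_RDMA_WRITE,         /* stands in for IBV_WR_RDMA_WRITE */
    EXAMPLE_RDMA_WRITE_WITH_IMM /* stands in for IBV_WR_RDMA_WRITE_WITH_IMM */
} ExampleWriteKind;

typedef struct ExampleSegment {
    ExampleWriteKind kind;
    uint32_t offset; /* offset into the remote RX buffer */
    uint32_t length; /* bytes carried by this work request */
    uint32_t imm;    /* total length; only meaningful on the final segment */
} ExampleSegment;

/* Split `total` bytes into work requests of at most `max_seg` bytes each.
 * All segments but the last are plain WRITEs; the last is WRITE WITH IMM and
 * carries the total length as immediate data.
 * Returns the number of segments stored in `out`, or 0 if the arguments are
 * invalid or `out` has fewer than the required `cap` entries. */
static inline size_t example_plan_writes(uint32_t total, uint32_t max_seg,
                                         ExampleSegment *out, size_t cap) {
    assert(out != NULL);
    if (total == 0 || max_seg == 0) return 0;

    /* Ceiling division written to avoid overflow near UINT32_MAX. */
    size_t needed = (size_t)((total - 1) / max_seg) + 1;
    if (needed > cap) return 0;

    uint32_t offset = 0;
    for (size_t i = 0; i < needed; i++) {
        uint32_t remaining = total - offset;
        uint32_t len = remaining < max_seg ? remaining : max_seg;
        int last = (i + 1 == needed);

        out[i].kind = last ? EXAMPLE_RDMA_WRITE_WITH_IMM : EXAMPLE_RDMA_WRITE;
        out[i].offset = offset;
        out[i].length = len;
        out[i].imm = last ? total : 0;
        offset += len;
    }
    return needed;
}
```

With this plan, a payload that fits in one work request is sent as a single WRITE WITH IMM, which matches the "optional WRITE, required WRITE WITH IMM" rule above.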


### Maximum WQE(s) of RDMA
Currently this protocol defines no specific restriction. The recommended number of WQEs is 1024.
Flow control for WQEs MAY be defined/implemented in the future.


### The workflow of this protocol
```
                                                                  server
                                                                listen RDMA port
        client
                -------------------RDMA connect------------------>
                                                                  accept connection
                <--------------- Establish RDMA ------------------
                --------Get server feature [@IBV_WR_SEND] ------->
                --------Set client feature [@IBV_WR_SEND] ------->
                                                                  setup RX buffer
                <---- Register transfer memory [@IBV_WR_SEND] ----
[@ibv_post_recv]
setup TX buffer
                ----- Register transfer memory [@IBV_WR_SEND] --->
                                                                  [@ibv_post_recv]
                                                                  setup TX buffer
                -- Redis commands [@IBV_WR_RDMA_WRITE_WITH_IMM] ->
                <- Redis response [@IBV_WR_RDMA_WRITE_WITH_IMM] --
                                  .......
                -- Redis commands [@IBV_WR_RDMA_WRITE_WITH_IMM] ->
                <- Redis response [@IBV_WR_RDMA_WRITE_WITH_IMM] --
                                  .......
RX is full
                ------ Register Local buffer [@IBV_WR_SEND] ----->
                                                                  [@ibv_post_recv]
                                                                  setup TX buffer
                <- Redis response [@IBV_WR_RDMA_WRITE_WITH_IMM] --
                                  .......
                                                                  RX is full
                <----- Register Local buffer [@IBV_WR_SEND] ------
[@ibv_post_recv]
setup TX buffer
                -- Redis commands [@IBV_WR_RDMA_WRITE_WITH_IMM] ->
                <- Redis response [@IBV_WR_RDMA_WRITE_WITH_IMM] --
                                  .......
                ------------------RDMA disconnect---------------->
                <-----------------RDMA disconnect-----------------
```


## Event handling
There is no POLLOUT event on an RDMA completion channel:
1. If TX is not full, the connection is always writable.
2. If TX is full, the sender should wait for a 'RegisterXferMemory' message that refreshes the
'TX buffer'.
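
The two rules above amount to a small piece of state on the sending side. The following is a rough sketch under stated assumptions, not the logic in rdma.c: `ExampleTxState` and the `example_tx_*` functions are hypothetical names, and "full" is taken to mean that the write offset has reached the length of the remote RX buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sender-side view of the remote RX buffer (the local 'TX buffer'). */
typedef struct ExampleTxState {
    uint64_t remote_addr; /* addr from the last memory registration message */
    uint32_t remote_key;  /* key from the last memory registration message */
    uint32_t length;      /* size of the remote RX buffer */
    uint32_t offset;      /* bytes already written into it */
} ExampleTxState;

/* Rule 1: while TX is not full, the connection is writable. */
static inline int example_tx_writable(const ExampleTxState *tx) {
    assert(tx != NULL);
    return tx->offset < tx->length;
}

/* Account for up to `n` bytes written; returns how many bytes fit. */
static inline uint32_t example_tx_consume(ExampleTxState *tx, uint32_t n) {
    assert(tx != NULL && tx->offset <= tx->length);
    uint32_t room = tx->length - tx->offset;
    uint32_t take = n < room ? n : room;
    tx->offset += take;
    return take;
}

/* Rule 2: a memory registration message from the peer refreshes the TX
 * buffer, which makes the connection writable again. */
static inline void example_tx_refresh(ExampleTxState *tx, uint64_t addr,
                                      uint32_t length, uint32_t key) {
    assert(tx != NULL);
    tx->remote_addr = addr;
    tx->remote_key = key;
    tx->length = length;
    tx->offset = 0;
}
```

An event loop built on this would emulate POLLOUT by checking `example_tx_writable` instead of waiting for a completion channel event.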

To-Do List
----------
- [ ] hiredis
- [ ] rdma client & benchmark
- [ ] POLLOUT event emulation for hiredis
- [ ] auto-test suite (not implemented yet)
21 changes: 20 additions & 1 deletion src/Makefile
@@ -315,6 +315,21 @@ ifeq ($(BUILD_TLS),module)
TLS_MODULE_CFLAGS+=-DUSE_OPENSSL=$(BUILD_MODULE) $(OPENSSL_CFLAGS) -DBUILD_TLS_MODULE=$(BUILD_MODULE)
endif

RDMA_MODULE=
RDMA_MODULE_NAME:=redis-rdma$(PROG_SUFFIX).so
RDMA_MODULE_CFLAGS:=$(FINAL_CFLAGS)
ifeq ($(BUILD_RDMA),module)
FINAL_CFLAGS+=-DUSE_RDMA=$(BUILD_MODULE)
RDMA_PKGCONFIG := $(shell $(PKG_CONFIG) --exists librdmacm libibverbs && echo $$?)
ifeq ($(RDMA_PKGCONFIG),0)
RDMA_LIBS=$(shell $(PKG_CONFIG) --libs librdmacm libibverbs)
else
RDMA_LIBS=-lrdmacm -libverbs
endif
RDMA_MODULE=$(RDMA_MODULE_NAME)
RDMA_MODULE_CFLAGS+=-DUSE_RDMA=$(BUILD_YES) -DBUILD_RDMA_MODULE $(RDMA_LIBS)
endif

ifndef V
define MAKE_INSTALL
@printf ' %b %b\n' $(LINKCOLOR)INSTALL$(ENDCOLOR) $(BINCOLOR)$(1)$(ENDCOLOR) 1>&2
@@ -363,7 +378,7 @@ REDIS_CHECK_RDB_NAME=redis-check-rdb$(PROG_SUFFIX)
REDIS_CHECK_AOF_NAME=redis-check-aof$(PROG_SUFFIX)
ALL_SOURCES=$(sort $(patsubst %.o,%.c,$(REDIS_SERVER_OBJ) $(REDIS_CLI_OBJ) $(REDIS_BENCHMARK_OBJ)))

all: $(REDIS_SERVER_NAME) $(REDIS_SENTINEL_NAME) $(REDIS_CLI_NAME) $(REDIS_BENCHMARK_NAME) $(REDIS_CHECK_RDB_NAME) $(REDIS_CHECK_AOF_NAME) $(TLS_MODULE)
all: $(REDIS_SERVER_NAME) $(REDIS_SENTINEL_NAME) $(REDIS_CLI_NAME) $(REDIS_BENCHMARK_NAME) $(REDIS_CHECK_RDB_NAME) $(REDIS_CHECK_AOF_NAME) $(TLS_MODULE) $(RDMA_MODULE)
@echo ""
@echo "Hint: It's a good idea to run 'make test' ;)"
@echo ""
@@ -427,6 +442,10 @@ $(REDIS_CHECK_AOF_NAME): $(REDIS_SERVER_NAME)
$(TLS_MODULE_NAME): $(REDIS_SERVER_NAME)
	$(QUIET_CC)$(CC) -o $@ tls.c -shared -fPIC $(TLS_MODULE_CFLAGS) $(TLS_CLIENT_LIBS)

# redis-rdma.so
$(RDMA_MODULE_NAME): $(REDIS_SERVER_NAME)
	$(QUIET_CC)$(CC) -o $@ rdma.c -shared -fPIC $(RDMA_MODULE_CFLAGS)

# redis-cli
$(REDIS_CLI_NAME): $(REDIS_CLI_OBJ)
$(REDIS_LD) -o $@ $^ ../deps/hiredis/libhiredis.a ../deps/linenoise/linenoise.o $(FINAL_LIBS) $(TLS_CLIENT_LIBS)