Write batching strategy #156
Hi @FZambia, Thank you very much for your benchmark. It is very impressive. Your benchmark clearly shows that rueidis' background write loop flushes very eagerly, producing small batches. Therefore, if we make it slow down a little bit, for example:

```diff
diff --git a/pipe.go b/pipe.go
index 8807779..e800a0d 100644
--- a/pipe.go
+++ b/pipe.go
@@ -292,6 +292,7 @@ func (p *pipe) _backgroundWrite() (err error) {
 				err = p.Error()
 			} else {
 				err = p.w.Flush()
+				time.Sleep(time.Microsecond * 100)
 			}
 			if err == nil {
 				if atomic.LoadInt32(&p.state) == 1 {
```

Then we can have a similar result: slowbatch.mov

This approach is probably simpler than customizing the bufio writer.
Indeed. The real network environment should also be taken into consideration. Although it is really hard or almost impossible for users to tweak this kind of delay, I think we can still have an option in `ClientOption` for it.
Actually I also tried Sleep initially while looking for a reason, but then decided it was too aggressive a change to the library. Also, I was unsure how blocking at this stage may affect rueidis. In the additional buffer case I showed, we do not block the writer loop. But you think blocking it with a sleep won't introduce any downsides besides the increased latency, right?
Yes, I think so. Maybe the sleep should be added only under the branch where `p.w.Flush()` is called.
Thinking more... I am still worried it's not a strict equivalent of the intermediate buffer. In the intermediate buffer approach we never block writing; the timer is asynchronous there. So the time interval used in the intermediate buffer does not play the same role as it does if we add Sleep to the `_backgroundWrite` loop directly. 🤔 Possibly we can somehow get the best of both worlds and avoid blocking.
Probably something like this? (optional `flushInterval`)

```diff
diff --git a/pipe.go b/pipe.go
index 2eb242d..c7a50bd 100644
--- a/pipe.go
+++ b/pipe.go
@@ -46,6 +46,7 @@ type pipe struct {
 	cache           cache
 	r               *bufio.Reader
 	w               *bufio.Writer
+	flushInterval   time.Duration
 	close           chan struct{}
 	onInvalidations func([]RedisMessage)
 	r2psFn          func() (p *pipe, err error)
@@ -82,6 +83,8 @@ func _newPipe(connFn func() (net.Conn, error), option *ClientOption, r2ps bool)
 		r: bufio.NewReaderSize(conn, option.ReadBufferEachConn),
 		w: bufio.NewWriterSize(conn, option.WriteBufferEachConn),
+		flushInterval: time.Millisecond,
+
 		nsubs: newSubs(),
 		psubs: newSubs(),
 		ssubs: newSubs(),
@@ -286,13 +289,37 @@ func (p *pipe) _backgroundWrite() (err error) {
 		ch    chan RedisResult
 	)
 
+	var mu sync.Mutex
+
+	if p.flushInterval > 0 {
+		go func() {
+			for {
+				select {
+				case <-time.After(p.flushInterval):
+					mu.Lock()
+					err = p.w.Flush()
+					mu.Unlock()
+					if err != nil {
+						// TODO.
+					}
+				case <-p.close:
+					return
+				}
+			}
+		}()
+	}
+
 	for atomic.LoadInt32(&p.state) < 3 {
 		if ones[0], multi, ch = p.queue.NextWriteCmd(); ch == nil {
+			mu.Lock()
 			if p.w.Buffered() == 0 {
 				err = p.Error()
 			} else {
-				err = p.w.Flush()
+				if p.flushInterval == 0 {
+					err = p.w.Flush()
+				}
 			}
+			mu.Unlock()
 			if err == nil {
 				if atomic.LoadInt32(&p.state) == 1 {
 					ones[0], multi, ch = p.queue.WaitForWrite()
@@ -306,7 +333,9 @@ func (p *pipe) _backgroundWrite() (err error) {
 			multi = ones
 		}
 		for _, cmd := range multi {
+			mu.Lock()
 			err = writeCmd(p.w, cmd.Commands())
+			mu.Unlock()
 		}
 		if err != nil {
 			if err != ErrClosing { // ignore ErrClosing to allow final QUIT command to be sent
```

(Locks may be avoided if ...)
Oh… An additional goroutine and lock look too heavy. Furthermore, the lock also has to be taken on the synchronous path.
The only difference I can think of is missing some chances to trigger the implicit flush of the bufio writer when commands exceed the size of its buffer while we are sleeping. Given that the sleep interval should be small, I think this difference is not a big deal.
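For reference, the implicit flush mentioned here is standard `bufio.Writer` behavior: a `Write` that no longer fits in the buffer forces the already-buffered bytes out to the underlying writer first. A minimal standard-library illustration (not rueidis code):

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

func main() {
	var sink bytes.Buffer
	w := bufio.NewWriterSize(&sink, 16) // tiny buffer for demonstration

	w.Write([]byte("12345678")) // fits: stays buffered, nothing reaches sink yet
	fmt.Println("after small write:", sink.Len(), "bytes in sink")

	w.Write([]byte("ABCDEFGHIJKL")) // exceeds remaining space: the buffered bytes are flushed implicitly
	fmt.Println("after large write:", sink.Len(), "bytes in sink")

	w.Flush() // explicit flush drains whatever is still buffered
	fmt.Println("after explicit flush:", sink.Len(), "bytes in sink")
}
```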
I tested both approaches: with the extra flush goroutine and with a pause after Flush (i.e. basically Sleep). For 128 parallelism (old is extra flush goroutine, new is pause in background write loop):

So the approach with the pause gives slightly worse latency than the flush in a goroutine. For 1024 parallelism:

Similar picture. Though for app and Redis CPU usage the picture is like this:

- For 128 parallelism – flush goroutine: app CPU 290%, Redis CPU 42%
- For 1024 parallelism: app CPU 388%, Redis CPU 40%

The app CPU reduction seems significant in the sleep/pause case, so I think the sleep/pause approach is OK. It's worth mentioning that when concurrency is small, both approaches result in significantly less throughput: something like 8k requests per second instead of 245k rps.
Also tested the sequential case: it's 1k rps vs 45k rps.

Actually it seems that for some scenarios this may be important. I wonder whether it's possible to avoid sleeping at all and still have good batches in all cases 🤔 Probably we could try to combine the current rueidis approach with a smart batching technique I used before.
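The smart batching technique is not spelled out in this thread; one common form of it drains a command channel while items are immediately available and flushes only when the channel runs dry, so batches grow under load without any fixed sleep. A rough sketch under that interpretation (all names and the `[]byte` command representation are illustrative):

```go
package batching

import "bufio"

// smartBatchWrite drains cmds while more commands are immediately available and
// flushes only when the channel runs dry, so batch size adapts to load without
// a fixed sleep.
func smartBatchWrite(w *bufio.Writer, cmds <-chan []byte) error {
	for cmd := range cmds {
		if _, err := w.Write(cmd); err != nil {
			return err
		}
	drain:
		for {
			select {
			case next, ok := <-cmds:
				if !ok {
					break drain
				}
				if _, err := w.Write(next); err != nil {
					return err
				}
			default:
				break drain // nothing pending right now: time to flush
			}
		}
		if err := w.Flush(); err != nil {
			return err
		}
	}
	return w.Flush()
}
```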
Hi @FZambia,
I also got similar results of +10% latency. But I quickly realized the pause can be much shorter: I think pausing for 20 microseconds is enough for a local redis server. Here are the results. For 128 parallelism (old is extra flush goroutine, new is pause in background write loop):
For 1024 parallelism:
And it still has comparable throughput when parallelism = 1:
However, for sequential usage, 20 microseconds is still too long. Thankfully it is possible to detect this case:

```diff
diff --git a/pipe.go b/pipe.go
index 8807779..28a4bb7 100644
--- a/pipe.go
+++ b/pipe.go
@@ -291,7 +291,13 @@ func (p *pipe) _backgroundWrite() (err error) {
 			if p.w.Buffered() == 0 {
 				err = p.Error()
 			} else {
-				err = p.w.Flush()
+				if atomic.LoadInt32(&p.waits) == 1 {
+					err = p.w.Flush()
+				} else {
+					ts := time.Now()
+					err = p.w.Flush()
+					time.Sleep(time.Microsecond*20 - time.Since(ts))
+				}
 			}
 			if err == nil {
 				if atomic.LoadInt32(&p.state) == 1 {
```

Also note that in the pausing case we probably should record the elapsed time of `p.w.Flush()` and subtract it from the pause, as the diff above does.
I think it is possible, and we can probably support multiple strategies by swapping the implementation.
20 microseconds seems to work fine in all scenarios, right! (And it actually produces better throughput for me in the non-limited scenario.) But I suppose it should be optional anyway? Or even the default, since it improves throughput a bit? And I think it would be nice to be able to tune it for a non-local setup where RTT is larger, so we can afford more time to collect batches. Should we also listen to the pipe ...? More advanced strategies are nice to have, though they are much more complex and I can't even imagine one now; they would probably need to be tested with all the parallelism levels, request rates, and network latencies taken into account.
Even though it produces better throughput in the non-limited scenario, I think it should be optional because latency is still critical to many users.
Actually, we shouldn't, and it won't have any effect.
Yes, I propose adding an optional `MaxFlushDelay` to `ClientOption`.
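For reference, a minimal usage sketch of such an option, assuming it is the `MaxFlushDelay` field of `ClientOption` that is mentioned later in this thread; the 20 microsecond value just mirrors the figure discussed above, not a recommendation:

```go
package main

import (
	"time"

	"github.com/rueian/rueidis"
)

func main() {
	c, err := rueidis.NewClient(rueidis.ClientOption{
		InitAddress: []string{"127.0.0.1:6379"},
		// Pause the background write loop for up to this long after each flush,
		// trading a little latency for larger batches and fewer syscalls.
		MaxFlushDelay: 20 * time.Microsecond,
	})
	if err != nil {
		panic(err)
	}
	defer c.Close()
}
```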
Sure. Thankfully, it is ...
Hi @FZambia, many thanks! The ...
@rueian Noticed the following being sent to Redis Enterprise when `DisableCache: true` with this change:
Hi @lgothard, Would you mind providing the code snippet and how you set up Redis Enterprise? I have tried to reproduce the error but failed. In my case, Redis Enterprise 6.2.6 returns ...
Hey @rueian, This was my bad. I had my windows confused. I got this error on Redis Stack v6.2.7. On our Redis Enterprise v6.0.16 test server, I don’t see it. Sorry for the confusion.
Hi @lgothard, The latest version of Redis Stack seems to be v6.2.6-v0. I have tested the following snippet with both v6.2.6-v0 and 7.0.6-RC2 and it worked fine:

```go
package main

import (
	"github.com/rueian/rueidis"
)

func main() {
	c, err := rueidis.NewClient(rueidis.ClientOption{
		InitAddress:  []string{"127.0.0.1:6379"},
		DisableCache: true,
	})
	if err != nil {
		panic(err)
	}
	defer c.Close()
}
```

Would you mind providing your code snippet as well?
The error message seemed to indicate that there was no username and password in the ...
Hey @rueian,
Hi @liuguili2020, did the CPU usage ever go down?
Hey @rueian, Here are the client options we have set: Do we need to set other parameters, such as `MaxFlushDelay`, `ConnWriteTimeout`, and `Dialer`? Or could it be caused by the DB tracking feature? Do you have any recommended values? From the test results, it appears that this function has entered a busy loop: `func (p *pipe) _backgroundWrite() (err error)`.
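For context, these parameters are all `rueidis.ClientOption` fields; a sketch of where they are set (values here are placeholders, not recommendations):

```go
package main

import (
	"net"
	"time"

	"github.com/rueian/rueidis"
)

func main() {
	c, err := rueidis.NewClient(rueidis.ClientOption{
		InitAddress:      []string{"127.0.0.1:6379"},
		ConnWriteTimeout: 10 * time.Second,      // per-connection write deadline
		MaxFlushDelay:    20 * time.Microsecond, // batching delay discussed earlier in the thread
		Dialer:           net.Dialer{Timeout: 5 * time.Second},
	})
	if err != nil {
		panic(err)
	}
	defer c.Close()
}
```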
Hi @liuguili2020, I guess the 15-minute busy loop is related to the current graceful shutdown of rueidis. I just released a patch at https://github.com/redis/rueidis/releases/tag/v1.0.31-cap which limits the duration of the graceful shutdown period. Would you like to give it a try? If the patch could shorten your busy loop, then we would have a better idea of how to fix the issue.
Hi @rueian, thanks a lot for your response. I'm from the team of @liuguili2020, and I want to understand why we always enter the `ErrClosing` logic at line 445 and line 448, and what the purpose of `ErrClosing` is. I think the client is blocked on sending data or waiting for a pending response on a dead connection. I haven't figured out the underlying connection management yet, so I'd like to know why these don't work if it's stuck in the graceful shutdown phase for a fixed 15 minutes.
And about the 15-minute busy loop, I found this: https://supportportal.juniper.net/s/article/Juniper-PAA-TCP-retransmission-time-interval-in-Monitor?language=en_US

TCP retransmits an unacknowledged packet up to tcp_retries2 times (the sysctl defaults to 15), using an exponential backoff in which each retransmission timeout is between TCP_RTO_MIN (200 ms) and TCP_RTO_MAX (120 seconds). Once the 15th retry expires (by default), the TCP stack notifies the layers above (i.e. the app) of a broken connection. The values of TCP_RTO_MIN and TCP_RTO_MAX are hardcoded in the Linux kernel and defined by constants such as:

#define TCP_RTO_MAX ((unsigned)(120*HZ))

Linux 2.6+ uses an HZ of 1000, so TCP_RTO_MIN is ~200 ms and TCP_RTO_MAX is ~120 seconds. Given the default tcp_retries2 value of 15, it takes 924.6 seconds before a broken network link is reported to the upper layer (i.e. the application), since the connection is detected as broken when the last (15th) retry expires.
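To make the 924.6-second figure concrete, here is a small sketch reproducing the article's arithmetic (an exponential backoff starting at TCP_RTO_MIN and capped at TCP_RTO_MAX, counted until the 15th retry itself times out). It models the arithmetic only, not the kernel's actual implementation:

```go
package main

import "fmt"

func main() {
	const (
		rtoMin  = 0.2   // TCP_RTO_MIN in seconds (~200 ms)
		rtoMax  = 120.0 // TCP_RTO_MAX in seconds
		retries = 15    // default tcp_retries2
	)
	rto, total := rtoMin, 0.0
	// The connection is declared broken only when the 15th retry itself
	// times out, so retries+1 timeout periods elapse in total.
	for i := 0; i < retries+1; i++ {
		total += rto
		rto *= 2
		if rto > rtoMax {
			rto = rtoMax
		}
	}
	fmt.Printf("total: %.1f seconds\n", total) // prints: total: 924.6 seconds
}
```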
Hi, again. I found this: #108. Is it a similar issue? Line 227 in defba73

There would be a QUIT injected after `_backgroundRead`. It is replaced with PING; I think it's the final PING command mentioned in pipe.go. If I understand correctly, since the `ErrClosing` error has been received, indicating that the link has been closed, should we only retry a few times instead of constantly retrying? What is the purpose of the final PING that must be sent?
Hi @rueian, Thank you for your reply and support. We got the test results of the above patch: the CPU usage only increased to 12% and then dropped after about 15 seconds (05:58:15-05:58:30). For comparison, before your modifications our test results were as follows: it increased to 100% and remained there for 15 minutes before dropping. (The horizontal axis represents time, and the vertical axis represents CPU usage percentage.) Could you release an official version to fix this issue? Thank you very much.
Hi @qihuaheng,
By receiving the final PONG response, we can make sure that all preceding commands are fulfilled. That is how we do a graceful shutdown.
Yes, I believe so. v1.0.31-cap is a quick workaround for that, and it indeed stops the busy loop according to your result.

Hi @liuguili2020, There will be an official fix that removes the busy loop completely, but it will take me some time. I think the fix will land in v1.0.40 if everything goes well.
@liuguili2020, it seems that you can reliably reproduce the busy loop. Could you also share how you reproduce it?
Hi @rueian, Our runtime environment is a K8s container (IPVS). The Redis cluster was restarted while our program had just started.
Hi @rueian, Thank you very much.
Hi @liuguili2020, rueidis will only retry read-only commands. PING will not be retried.

What happened after 10+ seconds?

The only supported timer for retrying read-only commands is the ...
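Assuming read-only retries are bounded by the command's context (an assumption here, not confirmed in this thread), a caller can cap the total retry time with a context deadline. A rough sketch:

```go
package rueidisusage

import (
	"context"
	"time"

	"github.com/rueian/rueidis"
)

// boundedGet issues a read-only GET whose internal retries stop once the
// context deadline passes, assuming retries honor the command context.
func boundedGet(c rueidis.Client, key string) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()
	return c.Do(ctx, c.B().Get().Key(key).Build()).ToString()
}
```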
Hi @rueian, Thank you very much.
It is necessary if you want to fail over to another address without the help of Redis Cluster or Redis Sentinel.

rueidis will mark broken connections as ...
Hi @rueian, thanks for your reply. I want to know what "the help of Redis Cluster" specifically refers to. Do you mean MOVED and ASK redirection, or CLUSTER SLOTS? I think these are already supported by rueidis: it internally discovers all the other master nodes via MOVED/ASK/CLUSTER SLOTS and automatically connects to them, right? Is there a need for the application layer to run a long-lived liveness PING probe for Redis Cluster failover?

We are rethinking the design of the application layer and tend to remove the liveness ping mechanism at the application layer and rely entirely on rueidis's underlying connection management. We currently only count the errors returned by rueidis client.Do() and consider them temporary errors, which in turn lead to temporary failures for our service callers. This solution is built on the assumption that rueidis can perfectly handle the restart and removal of any Redis pod in the Redis Cluster on Kubernetes in failover scenarios.

- The "Live reconfiguration" chapter in the Redis Cluster spec
- "Client connections and redirection handling"
- "The Challenge of Redis Setup on Kubernetes"
Yes, it does.
No, no need to do that. rueidis already does it for you.
No, it only means a random Redis instance is alive, because rueidis picks a random Redis node for the PING. If you want to PING every instance of a Redis Cluster, you should do this:

```go
for _, c := range client.Nodes() {
	_, err := c.Do(context.Background(), c.B().Ping().Build()).Error()
}
```
rueidis will do its best to follow Redis Cluster redirections and reconnections, but temporary failures may still happen due to various delays, including delays in the voting inside the Redis Cluster. If you find a better way to handle failover, please let us know.
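If an application-level probe across all nodes is still wanted, a sketch built on the same `Nodes()` loop (interval and logging are illustrative, not a recommendation):

```go
// Probe every node of the cluster periodically and log unreachable ones.
ticker := time.NewTicker(5 * time.Second)
defer ticker.Stop()
for range ticker.C {
	for addr, nc := range client.Nodes() {
		if err := nc.Do(context.Background(), nc.B().Ping().Build()).Error(); err != nil {
			log.Printf("node %s not reachable: %v", addr, err)
		}
	}
}
```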
Hi @liuguili2020, We have removed the busy loop and refined the capped graceful shutdown at ...
Hi @rueian, |
Hey @rueian, this is me again.
I was preparing the Rueidis-based code for release and suddenly discovered an interesting thing. I did quite a lot of Go benchmarks to make sure the new implementation based on Rueidis produces better operation latency and better throughput. And it does.

I also expected that migrating to Rueidis would give Centrifugo better CPU utilization, since Rueidis produces fewer memory allocations. And here be dragons.

Before making a release I decided to run macro-benchmarks and found that Centrifugo consumes more CPU than before under equal conditions. Moreover, the Rueidis-based implementation results in more CPU usage on the Redis instance than we had with the previous implementation. I did not expect that at all. To investigate it I made a repo: https://github.com/FZambia/pipelines.

In that repo I implemented three benchmarks: pipelined Redigo, pipelined Go-Redis, and Rueidis.
After running benchmarks I observed the following:
input_1.mp4
Here we can see that CPU usage is:
Nothing too special here – all numbers are more or less expected. Rueidis produced better throughput, so it loaded Redis more, and the price for the better throughput is application CPU utilization.
But in the Centrifugo case I compared the CPU usage of Redigo and Rueidis under equal conditions. So I added a rate limiter to the benchmarks in the https://github.com/FZambia/pipelines repo to generate the same load in all cases, limiting load to 100 commands per millisecond (100k per second).
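The rate limiter used in that repo is not shown here; a minimal way to impose the same 100k commands per second cap would be `golang.org/x/time/rate`, sketched below (not the repo's actual code):

```go
package batching

import (
	"context"

	"golang.org/x/time/rate"
)

// limiter caps load at 100k commands per second with a burst of 100,
// mirroring the figures mentioned above.
var limiter = rate.NewLimiter(rate.Limit(100_000), 100)

// issueCommand blocks until the limiter grants a token, then runs send.
func issueCommand(ctx context.Context, send func() error) error {
	if err := limiter.Wait(ctx); err != nil {
		return err
	}
	return send()
}
```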
input_2.mp4
This is more interesting. We are generating the same load in all benchmarks, but both app and Redis CPU usage are the worst in the Rueidis case.

It turned out the difference here is the result of the different batch sizes we are sending to Redis. In the Redigo/Go-Redis case we have larger batches than in the Rueidis case. In the Rueidis case we have smaller batches and thus more syscalls, both in the app and on the Redis side. As we can see, CPU usage is very sensitive to this.

There is a project called Twemproxy which acts as a proxy between applications and Redis and batches commands automatically, thus reducing the load on Redis. So in general, pipelining is known not only to increase throughput but also to reduce the CPU usage of Redis. As Redis is single-threaded, its capacity is actually quite limited.
I tried to find a simple way to improve the batching of Rueidis. The simplest solution I found at this point is this one: main...FZambia:rueidis:GetWriterEachConn

I.e. introducing an option to provide a custom bufio.Writer. I used it like this:

The code of the delayed writer is inspired by Caddy's code. It basically delays writes to the connection. We sacrifice latency for fewer syscalls.
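The delayed writer from the linked branch is not reproduced in this thread; a rough sketch of what such a delay-flushing writer could look like (hypothetical names, based on the description above rather than the actual branch or Caddy code):

```go
package batching

import (
	"bytes"
	"io"
	"sync"
	"time"
)

// delayWriter buffers writes and flushes them to the underlying connection at
// most once per delay interval, trading a bit of latency for fewer syscalls.
// Error handling is omitted for brevity.
type delayWriter struct {
	mu    sync.Mutex
	buf   bytes.Buffer
	dst   io.Writer
	delay time.Duration
	timer *time.Timer
}

func newDelayWriter(dst io.Writer, delay time.Duration) *delayWriter {
	return &delayWriter{dst: dst, delay: delay}
}

func (w *delayWriter) Write(p []byte) (int, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.timer == nil {
		w.timer = time.AfterFunc(w.delay, w.flush) // schedule one delayed flush
	}
	return w.buf.Write(p)
}

func (w *delayWriter) flush() {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.buf.Len() > 0 {
		w.dst.Write(w.buf.Bytes()) // one syscall for the whole accumulated batch
		w.buf.Reset()
	}
	w.timer = nil
}
```

Wrapping the connection with something like this and putting the usual `bufio.Writer` on top is one way to wire it in, though the linked branch may do it differently.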
input_3.mp4
From these results we can see that with better batching we can reduce both application and Redis CPU usage, as we make fewer read/write syscalls. For Rueidis, the CPU usage of the benchmark process dropped from 118% to 51%, and that of the Redis process from 45% to 6%. An extra millisecond of latency seems tolerable for such a huge resource reduction.

Unfortunately, it may be that I missed something – so it would be interesting to hear your opinion on whether you see potential issues with this approach. Under different levels of parallelism the results may differ, since batch sizes change; all libraries in the test may perform better or worse.

I think a resource reduction like this is great to have. In the Centrifugo case, users tend to add more Centrifugo nodes that work with a single Redis instance, so the possibility of keeping Redis CPU as low as possible seems nice. Perhaps you can suggest a better approach to achieve this.