kafkaesque
==========

"Marked by surreal distortion and often a sense of impending danger."

    - https://en.wiktionary.org/wiki/Kafkaesque


Example
-------

    python -m kafkaesque create words --page-size 4096
    python -m kafkaesque produce --batch-size 1024 words < /usr/share/dict/words
    python -m kafkaesque consume --fetch-size 512 words


Overview
--------

Benefits:

- Operationally straightforward.

Drawbacks:

- Reduced durability (dependent on `fsync` frequency; see `Durability` below.)
- Reduced throughput (*really* dependent on `fsync` frequency.)
- Reduced capacity (all data must be memory resident.)
- No direct support for topic partitioning or consumer group balancing.
- Reduced offset space per topic: the maximum offset is (2 ^ 53) - 1, the
  largest unambiguous representation of an integer using double-precision
  floating point numbers. Writes that are attempted after this value is
  reached will fail. (This shouldn't be a practical concern, however -- at 10
  million writes/second it would take around 10,424 days -- 28.5 years -- to
  hit this limit. At a more realistic write throughput of around 70,000
  writes/second, roughly as fast as I could publish batches of 100 4 KB
  messages to a single server using a default Redis configuration on an Early
  2015 MacBook Pro, this would take about 4,080 years. See the sketch after
  this list.)
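
A quick back-of-the-envelope check of the figures in that last drawback
(plain Python, just arithmetic):

    # Rough check of the offset exhaustion figures quoted above.
    MAX_OFFSET = 2 ** 53 - 1  # largest offset doubles represent unambiguously

    SECONDS_PER_DAY = 60 * 60 * 24
    DAYS_PER_YEAR = 365.25

    for writes_per_second in (10_000_000, 70_000):
        days = MAX_OFFSET / writes_per_second / SECONDS_PER_DAY
        years = days / DAYS_PER_YEAR
        print(f"{writes_per_second:>10,} writes/s: "
              f"{days:,.0f} days ({years:,.1f} years)")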


Performance
-----------

A single Redis server will often reach maximum CPU utilization before any
other resource, and once saturated, the server will maintain a relatively
consistent total throughput regardless of the number of clients. If N clients
are producing or consuming records, each client's individual maximum
throughput will be roughly 1/N of the maximum throughput of a single client
performing the same actions. For example, if one producer can sustain 70,000
records/second against an otherwise idle server, ten concurrent producers
should each expect roughly 7,000 records/second.


Durability
----------

Infrequent `fsync` calls can lead to data loss!

A server may return a successful response to a produce request without having
flushed the data to disk. If the primary server fails before `fsync`ing the
AOF, *that data will be lost.*
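
For reference, whether the AOF is enabled and how aggressively Redis flushes
it are controlled by the `appendonly` and `appendfsync` directives in
redis.conf:

    # Enable the append-only file:
    appendonly yes

    # `everysec` (the default) fsyncs at most once per second, so up to one
    # second of acknowledged writes can be lost in a crash; `always` fsyncs
    # after every write; `no` leaves flushing entirely to the OS.
    appendfsync everysec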

Compounding this problem, the now noncanonical data may have been received by
consumers downstream, who marked those offsets as committed. For example, let's
say a consumer received this data:

    offset  value  `fsync`ed by server
    ------  -----  -------------------
    0       alpha  yes
    1       beta   no
    2       gamma  no

If the server `fsync`s after receiving the first value (the zeroth offset), but
does not `fsync` after receiving the additional records, those subsequent
non-`fsync`ed records will be lost by the server if it crashes.

After the server is restarted, it will then recycle those offsets, leading to
the server having an understanding of history that looks like this:

    offset  value
    ------  -------
    0       alpha
    1       delta
    2       epsilon
    3       zeta

Any clients who were able to fetch all of the records before the server failed
may end up with a noncanonical understanding of history that looks like this,
since they only retrieved records after the latest offset they had already
received -- in this case, anything after the offset `2`:

    offset  value
    ------  -----
    0       alpha
    1       beta
    2       gamma
    3       zeta
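
The divergence can be reproduced with a toy model (plain Python, no Redis
involved; the lists stand in for the server's AOF and the consumer's local
copy):

    # Toy model of the crash scenario above.
    fsynced = ["alpha"]               # flushed to the AOF; survives the crash
    acknowledged = ["beta", "gamma"]  # acknowledged but never fsynced

    consumer_log = fsynced + acknowledged   # consumer read offsets 0-2

    server_log = list(fsynced)              # server restarts from the AOF...
    server_log += ["delta", "epsilon", "zeta"]  # ...and recycles offsets 1-3

    # The consumer resumes from the first offset it hasn't seen (3):
    consumer_log += server_log[len(consumer_log):]

    print(server_log)    # ['alpha', 'delta', 'epsilon', 'zeta']
    print(consumer_log)  # ['alpha', 'beta', 'gamma', 'zeta']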

With a large number of consumers, a wide window for `fsync`s to occur, and a
high frequency of records being published, it's possible that *no consumer
could share a consistent view of the log with any other* due to inconsistent
consumption rates. Note that this isn't even eventually consistent -- it's not
the case that all consumers would eventually see the same version of history
-- it's just plainly inconsistent.

If this is a scary concept to you -- if you're dealing with finances, or other
data that requires stronger (read: any) consistency semantics -- it's a good
sign that you should probably bite the bullet and use Kafka.


Data Structures
---------------

{topic}
    A MessagePack-encoded map storing configuration data for this topic,
    including:

    - size (integer): maximum size of a page in the topic
    - ttl (integer or nil): number of seconds to retain pages after they
      have been closed for writing

{topic}/pages
    A sorted set that acts as an index of the pages in the topic. This set
    can be used to identify which page contains a particular offset,
    allowing page sizes to be changed over time (see the sketch at the end
    of this section.)

    Items in the sorted set are page numbers, scored by the offset of the
    first item in the page.

{topic}/pages/{number}
    A list containing log records.
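
A minimal read-side sketch against this layout, assuming the third-party
`redis` and `msgpack` packages; `topic_config` and `find_page` are
hypothetical helpers for illustration, not part of kafkaesque's API:

    import msgpack  # pip install msgpack redis
    import redis

    client = redis.Redis()

    def topic_config(topic):
        """Decode the MessagePack-encoded configuration map for `topic`."""
        raw = client.get(topic)
        if raw is None:
            raise KeyError(f"no such topic: {topic!r}")
        return msgpack.unpackb(raw)

    def find_page(topic, offset):
        """Return the number of the page containing `offset`.

        Pages in `{topic}/pages` are scored by the offset of their first
        record, so the containing page is the highest-scored member whose
        score is <= the requested offset.
        """
        pages = client.zrevrangebyscore(f"{topic}/pages", offset, "-inf",
                                        start=0, num=1)
        if not pages:
            raise KeyError(f"no page contains offset {offset}")
        return int(pages[0])

    # Records within a page can then be read with LRANGE against
    # f"{topic}/pages/{number}".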