The Octez tezos-context package maintains an on-disk representation of Tezos blocks. It is implemented using Irmin and the purpose-built Irmin backend irmin-pack.
The performance of tezos-context is important for the overall performance of Octez, and previous improvements in irmin-pack have led to major performance improvements in Octez.
This report contains an audit of the irmin-pack storage backend used by Tezos Octez. We investigate the performance of individual sub-components as well as the access patterns used by Tezos.
Irmin is a content-addressable data store similar to Git. Data is stored in trees whose nodes and leaves are identified by their cryptographic hash. At its core, Irmin is a store of content-addressed objects (nodes and leaves).
irmin-pack is a space-optimized Irmin backend inspired by Git packfiles that is currently used in Octez.
Conceptually, irmin-pack stores content-addressed objects in an append-only file. A persistent data structure called the index is used to map hashes to positions in the append-only file.
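To make this concrete, the pack/index idea can be pictured with the following toy sketch (an in-memory illustration only, not irmin-pack code): objects are appended to a growing buffer, and the index records for each hash the offset and length at which the object was written.
(* Toy sketch of the pack/index idea: objects are appended to an append-only
   buffer and located again via a hash -> (offset, length) index.
   Illustration only; not the irmin-pack implementation. *)
module Toy_pack = struct
  type t = {
    pack : Buffer.t;                        (* stands in for the append-only file *)
    index : (string, int * int) Hashtbl.t;  (* hash -> (offset, length) *)
  }

  let create () = { pack = Buffer.create 4096; index = Hashtbl.create 1024 }

  (* Append an object and record its position under its hash. *)
  let add t data =
    let hash = Digest.string data in
    if not (Hashtbl.mem t.index hash) then begin
      let offset = Buffer.length t.pack in
      Buffer.add_string t.pack data;
      Hashtbl.add t.index hash (offset, String.length data)
    end;
    hash

  (* Resolve a hash via the index and read the object back from the pack. *)
  let find t hash =
    match Hashtbl.find_opt t.index hash with
    | None -> None
    | Some (offset, length) -> Some (Buffer.sub t.pack offset length)
end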
Unless otherwise noted, all tests were run on a StarLabs Starbook Mk VI with an AMD Ryzen 7 5800U and a PCIe-3 connected 960GiB SSD, running Debian Linux 11 with Linux kernel version 6.1.0.
For replays, a recording of 100000 blocks starting at block level 2981990 was used.
We use the fio tool to simulate disk access patterns and measure performance. A baseline, based on observed access patterns of irmin-pack, is provided in the section Baseline fio job.
The index is a persistent data structure that maps the hash of an object to its offset in the append-only pack file.
Since Irmin version 3.0.0, objects in the pack file hold direct references to other objects in the pack file [irmin #1659], which removes the need to query the index every time objects are traversed. It also allows a smaller index, as only top-level objects (i.e. commits) now need to be in the index [irmin #1664]. This is called minimal indexing and has allowed considerable performance improvements for the Octez storage layer.
The index data structure is split into two major parts: the log and the data. The log is a small, bounded structure that holds recently added bindings and is kept in memory after initialization. The data part holds older bindings. When a certain number of bindings have been added to the log, they are merged with the bindings in the data part. The data structure is similar to a two-level log-structured merge-tree.
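To illustrate the two-level scheme, the following is a minimal conceptual sketch (a toy stand-in, not the actual index implementation): recent bindings go into an in-memory log, and once the log reaches its size limit they are merged into the sorted data part; lookups check the log first and fall back to a binary search over the data part.
(* Toy sketch of a two-level index: an in-memory log that is merged into a
   sorted data array once it reaches log_size bindings. Duplicate keys are
   not handled; this only illustrates the scheme. *)
module Toy_index = struct
  type t = {
    log_size : int;
    log : (string, int64) Hashtbl.t;        (* recent bindings, kept in memory *)
    mutable data : (string * int64) array;  (* older bindings, sorted by key *)
  }

  let create ~log_size = { log_size; log = Hashtbl.create log_size; data = [||] }

  (* Merge the log into the data part, as happens when the log is full. *)
  let merge t =
    let recent = Hashtbl.fold (fun k v acc -> (k, v) :: acc) t.log [] in
    let merged = Array.append t.data (Array.of_list recent) in
    Array.sort (fun (k1, _) (k2, _) -> compare k1 k2) merged;
    t.data <- merged;
    Hashtbl.reset t.log

  let add t key offset =
    Hashtbl.replace t.log key offset;
    if Hashtbl.length t.log >= t.log_size then merge t

  (* Lookups hit the in-memory log first, then binary-search the data part. *)
  let find t key =
    match Hashtbl.find_opt t.log key with
    | Some off -> Some off
    | None ->
        let rec search lo hi =
          if lo >= hi then None
          else
            let mid = (lo + hi) / 2 in
            let k, off = t.data.(mid) in
            if k = key then Some off
            else if k < key then search (mid + 1) hi
            else search lo mid
        in
        search 0 (Array.length t.data)
end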
The default number of bindings to keep in the log before moving them to the data part is set to 2500000, both in Octez and in irmin-pack.
Tezos archive nodes hold the full history since genesis. When using minimal indexing, references to at most 2500000 commits/blocks are kept in the log of the index. In July 2022 the Tezos blockchain reached that block height, so since at least July 2022 the index of archive nodes also has a data part.
Archive nodes that were bootstrapped before the new minimal indexing strategy was adopted have many more entries in the index:
dune exec audits/index/count_bindings.exe -- inputs/archive-node-index/
Number of bindings in Index: 2036584177
Archive nodes bootstrapped before the new minimal indexing strategy was adopted use around 90GiB of space just for the index structure. With the new minimal indexing, the index of a freshly bootstrapped archive node is about 2.4MiB.
Tezos rolling and full nodes keep a default of 6 cycles, which corresponds to about 98304 blocks (16384 blocks per cycle) or, in Irmin terminology, to about 98304 commits.
If a single index entry is maintained for every commit, rolling and full nodes will by default never create a data part, and every index lookup is an in-memory operation.
However, entries in the index are currently never removed. So the size of the index is not a function of the history mode and settings of the node, but of the time since it was bootstrapped.
We measure the performance of index lookups by creating an empty index structure with the same parameters as used in Tezos Octez (key size of 30 bytes, value size of 13 bytes and log size of 2500000), adding c random bindings and performing c lookups for random bindings. We use the LexiFi/landmarks library to measure performance in CPU cycles (as well as system time).
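The measurement loop has roughly the following shape (a simplified sketch that uses wall-clock time and a plain Hashtbl as a stand-in for the real index; the actual benchmark uses the index library and landmarks for cycle counts):
(* Simplified shape of the benchmark: add c random bindings, then perform c
   lookups and time them. A Hashtbl stands in for the real index here. *)
let random_key () = String.init 30 (fun _ -> Char.chr (Random.int 256))

let bench c =
  let index = Hashtbl.create c in
  let keys = Array.init c (fun _ -> random_key ()) in
  Array.iter (fun k -> Hashtbl.replace index k (Random.int64 Int64.max_int)) keys;
  let t0 = Unix.gettimeofday () in
  Array.iter (fun k -> ignore (Hashtbl.find_opt index k)) keys;
  let t1 = Unix.gettimeofday () in
  Printf.printf "%d lookups in %.3fs (%.1f ns/lookup)\n"
    c (t1 -. t0) ((t1 -. t0) *. 1e9 /. float_of_int c)
The measured results for various values of c are shown in the following table.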
count | CPU time (cycles) | system time (s) | CPU cycles per entry |
---|---|---|---|
250000 | 2132773826 | 1.126439 | 8531.0953 |
500000 | 4009158441 | 2.119049 | 8018.3169 |
1000000 | 8235106179 | 4.351689 | 8235.1062 |
2000000 | 16838605030 | 8.893822 | 8419.3025 |
2499999 | 20421445765 | 10.791509 | 8168.5816 |
2500001 | 23511451504 | 12.446721 | 9404.5768 |
3000000 | 31077192851 | 16.442429 | 10359.064 |
4000000 | 39358430251 | 20.823590 | 9839.6076 |
5000000 | 48122118368 | 25.448949 | 9624.4237 |
6000000 | 60941841080 | 32.247097 | 10156.974 |
7000000 | 72898690458 | 38.564382 | 10414.099 |
We note a sharp increase in the number of CPU cycles needed to look up an entry when the number of bindings exceeds the log size (2500000). Lookup performance then stays relatively constant at this higher level for up to 7000000 entries.
With the currently implemented minimal indexing scheme, no performance issues are expected when using the default Tezos Octez configuration. No considerable performance degradation is expected up to at least block level 7000000.
For nodes running in the rolling and full history modes, a mechanism should be implemented to remove index entries for commits that have been garbage collected. This would allow the index to remain a purely in-memory structure for such nodes.
Archive nodes that were bootstrapped using non-minimal indexing have a very large index structure. For better disk usage it is recommended to re-bootstrap these nodes using minimal indexing.
Irmin stores values in a tree where paths are sequences of strings. In order to de-duplicate commonly occurring path segments, some path segments are stored in an auxiliary persistent structure called the Dict.
The Dict maintains a mapping from path segments (strings) to integer identifiers. In the Irmin tree the integer identifiers can be used instead of the string path segments. As the integer identifiers are in general smaller than the string path segments, this results in a more compact on-disk representation.
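Conceptually, the Dict behaves like the following sketch (an illustration of the idea only, not the irmin-pack implementation): segments receive increasing integer identifiers on first use, up to a fixed capacity, after which new segments are no longer admitted and have to be written as raw strings.
(* Conceptual sketch of a Dict: path segments get integer identifiers on a
   first-come-first-serve basis, up to a fixed capacity. Illustration only. *)
module Toy_dict = struct
  type t = {
    capacity : int;
    ids : (string, int) Hashtbl.t;       (* segment -> identifier *)
    segments : (int, string) Hashtbl.t;  (* identifier -> segment *)
  }

  let create capacity =
    { capacity; ids = Hashtbl.create capacity; segments = Hashtbl.create capacity }

  (* Return the identifier for segment, admitting it if there is room.
     Returns None when the Dict is full and the segment is unknown; the
     caller then stores the raw string instead. *)
  let index t segment =
    match Hashtbl.find_opt t.ids segment with
    | Some id -> Some id
    | None when Hashtbl.length t.ids < t.capacity ->
        let id = Hashtbl.length t.ids in
        Hashtbl.add t.ids segment id;
        Hashtbl.add t.segments id segment;
        Some id
    | None -> None

  let find t id = Hashtbl.find_opt t.segments id
end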
In irmin-pack the Dict is limited to 100000 entries, added on a first-come-first-serve basis. The Dict is currently full.
We count the number of occurrences of path segments that appear in the Dict in the store of Tezos block 3081990.
It seems that some common path segments are de-duplicated a considerable number of times:
Path segment | Occurrences in store |
---|---|
4a1cf11667fa0165eac9963333b883a80bcfdfebde09b79bfc740680e986bab6 | 108903 |
053f610929e2b6ea458c54dfd8b29716d379c13f5c8fd82d5c793a9e31271743 | 90893 |
00d0158265571a474bc6ed02547db51416ab2228327e66332117ea7b587aca94 | 13912 |
1ffdaf1cd7574b72f933a9d5e102143f3e4d761a6e51b4019ed821b7b99b097a | 2313 |
04034e4a6228fdb92e8978fb85d9c2d1f79501b0c509f24b0f1eede3ca7cb234 | 1611 |
10f21b2eacdf858cf9824d29e9c0d09bf666d3d900fbc54b6438f67e63831d4d | 1157 |
2966fdb0cb953d94e959dcee2b2c3238c42bf0d1e0991a5a51609059aaa04080 | 890 |
1f33bf814d191cc602888479ef371d13082f19718f63737e685cc76110e323d9 | 813 |
02e5f95f2c3a3ccfa5ce71d0f11ad70f4746b8e0f3fe7bcecc63dbc8cfba71d1 | 731 |
438c52065d4605460b12d1b9446876a1c922b416103a20d44e994a9fd2b8ed07 | 477 |
00642bcad8681caf0f45f195cc1483f8366f155d83e272c9ff93fe3840a61dcb | 396 |
4001852857ca1c5dad1c1275f766fc5208e63c48ba0289127591ffef3c440d53 | 388 |
27e1640238d07d569852b3e4fe784f5adce0e6649673ea587aa7389d72b855af | 381 |
However, we also note that most entries in the Dict do not occur in the store: we count 96394 Dict entries that do not occur in the store (96%).
Furthermore, we observe that most entries in the Dict are hashes. Commonly used short keywords like total_bytes, missed_endorsement or index are not Dict entries.
In order to evaluate potential performance improvements we run benchmarks with the following changes to the Dict:
- Increase the capacity to 500000 entries (500k-dict).
- Remove the limitation on the number of entries that the Dict can hold (see metanivek/irmin-92d910b); path segments are always added to the dictionary before writing (infinite-dict).
- Don’t use the Dict for any newly written content (ignore-dict).

 | 3.7.1 | 500k-dict | infinite-dict | ignore-dict |
---|---|---|---|---|
CPU time elapsed | 104m07s | 109m42s (105%) | 130m36s (125%) | 101m39s (98%) |
total bytes read | 66.466G | 64.976G (98%) | 60.747G (91%) | 67.489G (102%) |
total bytes written | 46.360G | 41.671G (90%) | 37.903G (82%) | 46.891G (101%) |
max memory usage (bytes) | 0.394G | 0.508G (129%) | 2.534G (643%) | 0.396G (100%) |
When increasing the capacity of the Dict, the amount of bytes read and written decreases, as improved usage of the Dict results in more compact representations. Memory usage increases as expected. Unfortunately, overall performance seems to degrade.
When disabling the Dict for newly written content, we see that slightly more content is written and read. This small change again indicates that the Dict is underused. However, we see that disabling the Dict, even at the cost of causing more I/O, slightly increases performance.
The performance decrease when increasing the Dict capacity could be due to the in-memory Dict (a Hashtbl) being pushed to its limits. A more efficient in-memory data structure might offset this and result in a performance increase.
We conclude that the Dict is underused. This observation has been made before (see mirage/irmin#1807). Increased usage of the Dict could lead to a considerably more compact on-disk representation and potential performance improvements. However, we note that the in-memory Dict seems to need considerable performance optimization before any overall performance improvement can be expected.
Note that existing Dict entries cannot be removed, and it is not clear how to improve usage of the Dict without rewriting existing data.
One suggestion would be to make the Dict local to individual chunks. Since version 3.5.0, irmin-pack supports splitting on-disk data into multiple smaller chunks (see mirage/irmin#2115). Instead of having one single global Dict, we could implement a Dict per chunk. This might allow the chunk-local Dicts to perform much better, as they hold entries for data that has been accessed recently. This design also seems to fit well with how the garbage collector works.
It might also be worth reconsidering the first-come-first-serve strategy. An improvement might be to only add a Dict entry for path segments that appear at least n times. This would prevent the addition of Dict entries for path segments that are accessed only once.
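As an illustration of the frequency-threshold idea (a sketch only; the threshold and the counting mechanism are assumptions, not an existing irmin-pack feature), a segment would only be admitted to the Dict once it has been seen n times:
(* Sketch of a frequency-threshold admission policy: a path segment is only
   added to the Dict after it has been seen threshold times. Illustration
   only; not an existing irmin-pack mechanism. *)
module Threshold_dict = struct
  type t = {
    threshold : int;
    counts : (string, int) Hashtbl.t;  (* segments seen but not yet admitted *)
    ids : (string, int) Hashtbl.t;     (* admitted segments -> identifier *)
  }

  let create ~threshold =
    { threshold; counts = Hashtbl.create 1024; ids = Hashtbl.create 1024 }

  (* Record an occurrence of segment and return its identifier if it is
     (or just became) an admitted Dict entry. *)
  let index t segment =
    match Hashtbl.find_opt t.ids segment with
    | Some id -> Some id
    | None ->
        let seen = 1 + Option.value ~default:0 (Hashtbl.find_opt t.counts segment) in
        if seen >= t.threshold then begin
          Hashtbl.remove t.counts segment;
          let id = Hashtbl.length t.ids in
          Hashtbl.add t.ids segment id;
          Some id
        end
        else begin
          Hashtbl.replace t.counts segment seen;
          None
        end
end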
In order to measure real disk I/O accesses, we add instrumentation to irmin-pack that allows us to measure how many bytes are read from and written to individual files, and in how many system calls. In addition, we trace calls to I/O reads and writes using the landmarks library.
 | Read | Write |
---|---|---|
Bytes per block (mean) | 591394.118 (56%) | 463528.962 (44%) |
Number of system calls per block (mean) | 9645.864 | 4.061 |
Bytes per system call (mean) | 61.310642 | 114141.58 |
Relative time | 13.6% | 0.4% |
Relative time in I/O | 96.5% | 3.5% |
We observe:
- Total I/O activity in bytes consists of about 56% reads and 44% writes. However, reads take 96.5% of the I/O time [fn:: We measure this by tracing the Irmin_pack_unix.Io.Util.really_read and Irmin_pack_unix.Io.Util.really_write functions].
- Reads are performed in many very small chunks (on average 61 bytes per system call).
- The total time spent doing I/O is only 14%. This is an upper bound for the effect that improving I/O can have on overall performance.
As reads dominate the time spent in I/O, we concentrate on read performance in the following.
Based on these observations we can create a baseline fio job description that simulates the read access pattern of irmin-pack:
[global]
rw=randread
filename=/home/adatario/dev/tclpa/.git/annex/objects/gx/17/SHA256E-s3691765475--13300581f2404cc24774da8615a5a3d3f0adb7d68c4c8034c4fa69e727706000/SHA256E-s3691765475--13300581f2404cc24774da8615a5a3d3f0adb7d68c4c8034c4fa69e727706000
[job1]
ioengine=psync
rw=randread
blocksize_range=50-100
size=591394B
loops=100
fio baseline-reads.ini
job1: (g=0): rw=randread, bs=(R) 50B-100B, (W) 50B-100B, (T) 50B-100B, ioengine=psync, iodepth=1
fio-3.35
Starting 1 process
job1: (groupid=0, jobs=1): err= 0: pid=153129: Thu Jun 15 10:14:00 2023
read: IOPS=328k, BW=21.0MiB/s (22.1MB/s)(56.4MiB/2681msec)
clat (nsec): min=210, max=3742.4k, avg=2833.00, stdev=20598.76
lat (nsec): min=230, max=3743.1k, avg=2858.85, stdev=20605.58
clat percentiles (nsec):
| 1.00th=[ 211], 5.00th=[ 221], 10.00th=[ 221], 20.00th=[ 221],
| 30.00th=[ 231], 40.00th=[ 231], 50.00th=[ 231], 60.00th=[ 231],
| 70.00th=[ 241], 80.00th=[ 402], 90.00th=[ 410], 95.00th=[ 676],
| 99.00th=[150528], 99.50th=[158720], 99.90th=[191488], 99.95th=[230400],
| 99.99th=[329728]
bw ( KiB/s): min=20846, max=21922, per=99.73%, avg=21483.80, stdev=449.08, samples=5
iops : min=317386, max=333862, avg=327161.20, stdev=6851.62, samples=5
lat (nsec) : 250=70.87%, 500=23.40%, 750=1.39%, 1000=1.64%
lat (usec) : 2=0.93%, 4=0.08%, 10=0.06%, 20=0.01%, 50=0.01%
lat (usec) : 250=1.59%, 500=0.03%, 750=0.01%, 1000=0.01%
lat (msec) : 4=0.01%
cpu : usr=12.13%, sys=9.74%, ctx=14307, majf=0, minf=10
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=879400,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=21.0MiB/s (22.1MB/s), 21.0MiB/s-21.0MiB/s (22.1MB/s-22.1MB/s), io=56.4MiB (59.1MB), run=2681-2681msec
Disk stats (read/write):
dm-1: ios=14145/0, merge=0/0, ticks=2116/0, in_queue=2116, util=96.36%, aggrios=14301/0, aggrmerge=0/0, aggrticks=2132/0, aggrin_queue=2132, aggrutil=95.57%
dm-0: ios=14301/0, merge=0/0, ticks=2132/0, in_queue=2132, util=95.57%, aggrios=14301/0, aggrmerge=0/0, aggrticks=1539/0, aggrin_queue=1539, aggrutil=95.57%
nvme0n1: ios=14301/0, merge=0/0, ticks=1539/0, in_queue=1539, util=95.57%
We observe a read bandwidth of 21.0MiB/s, which seems to match our observations.
Having a compact on-disk representation is not only good for disk usage but can also improve overall performance, as smaller data can be loaded faster into memory closer to the processor.
We analyze the compactness of the on-disk representation of irmin-pack by compressing it with the Zstandard compression algorithm. An already compact representation yields a low compression ratio, whereas a less compact representation admits a higher compression ratio.
Store | Uncompressed (bytes) | Compressed (bytes) | Compression Ratio |
---|---|---|---|
Level 2981990 | 5108531200 | 3265930512 | 1.5641886 |
Level 3081990 | 32111902720 | 10332988078 | 3.1077073 |
Single suffix from level 2981990 | 5101987840 | 3261700639 | 1.5642109 |
Single suffix from level 3081990 | 3633868800 | 940228651 | 3.8648778 |
There seems to be room to improve the compactness of the on-disk representation.
We cannot explain the difference in compression ratios between the stores at levels 2981990 and 3081990.
Content Size (logarithmic with base 2) | Count | Percentage of Content |
---|---|---|
0 | 596661 | 0.74748116 |
1 | 23722650 | 29.719110 |
2 | 38698770 | 48.480798 |
3 | 1125580 | 1.4100969 |
4 | 3131048 | 3.9224944 |
5 | 4194650 | 5.2549469 |
6 | 7477058 | 9.3670611 |
7 | 532486 | 0.66708442 |
8 | 194262 | 0.24336631 |
9 | 35410 | 0.044360714 |
10 | 35600 | 0.044598741 |
11 | 71535 | 0.089617161 |
12 | 4289 | 5.3731461e-3 |
13 | 2583 | 3.2359143e-3 |
14 | 260 | 3.2572114e-4 |
15 | 9 | 1.1274963e-5 |
16 | 1 | 1.2527736e-6 |
17 | 3 | 3.7583209e-6 |
18 | 7 | 8.7694154e-6 |
We observe that most content in the Tezos store is very small.
We perform some audits in order to understand which paths in the Tezos Context are accessed and how frequently.
We analyze the 10000 blocks starting at block level 2981990 and record which paths are accessed by read operations such as Context.find_tree, Context.mem or Context.find:
Path | Read operations * |
---|---|
contracts/global_counter | 2121152 |
big_maps/index/347453/total_bytes | 614589 |
big_maps/index/347455/total_bytes | 614589 |
big_maps/index/347456/total_bytes | 614589 |
big_maps/index/103258/total_bytes | 79037 |
big_maps/index/149768/total_bytes | 46549 |
endorsement_branch | 40004 |
grand_parent_branch | 40004 |
v1/constants | 40004 |
v1/cycle_eras | 40004 |
version | 40004 |
big_maps/index/149787/total_bytes | 32835 |
big_maps/index/149771/total_bytes | 29009 |
big_maps/index/149772/total_bytes | 29009 |
first_level_of_protocol | 20002 |
votes/current_period | 20000 |
big_maps/index/103260/total_bytes | 19991 |
cycle/558/random_seed | 16183 |
cycle/558/selected_stake_distribution | 16183 |
We observe that certain paths are hit very often. The nodes on these commonly accessed paths should be kept in the irmin-pack LRU cache and should not incur any disk I/O.
We also analyze a single block (block level 2981990), breaking the accesses down into mem, read, and write operations. Here are the top 15 paths, sorted by read and then mem.
Path | mem | read | write |
---|---|---|---|
contracts/index/0001f46d...63a0/balance | 444 | 888 | 444 |
contracts/index/01e12d94...da00/used_bytes | 222 | 443 | 222 |
contracts/index/0001f46d...63a0/delegate | 0 | 443 | 0 |
contracts/global_counter | 225 | 224 | 225 |
contracts/index/01e12d94...da00/len/storage | 222 | 222 | 222 |
contracts/index/0001f46d...63a0/counter | 222 | 222 | 222 |
contracts/index/01e12d94...da00/paid_bytes | 222 | 221 | 222 |
contracts/index/01e12d94...da00/balance | 0 | 221 | 0 |
big_maps/index/347456/total_bytes | 111 | 110 | 111 |
big_maps/index/347455/total_bytes | 111 | 110 | 111 |
big_maps/index/347453/total_bytes | 111 | 110 | 111 |
big_maps/index/149776/contents/af067ef3...5bad/data | 6 | 6 | 1 |
contracts/index/00006027...b77a/balance | 3 | 6 | 3 |
big_maps/index/149776/contents/af067ef3...5bad/len | 0 | 6 | 1 |
contracts/index/0002358c...8636/delegate_desactivation | 0 | 5 | 2 |
… | |||
Total | 2363 | 3892 | 3180 |
We note a symmetry in the number of mem and write operations to certain paths.
We run a replay of 100000 Tezos blocks with three different CPU frequencies:
- 3.40GHz: Run on an Equinix Metal c3.small.x86 machine with an Intel® Xeon® E-2278G CPU with 8 cores running at 3.40 GHz.
- 2.5GHz: Run on an Equinix Metal m3.large.x86 machine with an AMD EPYC 7502P CPU with 32 cores running at 2.5 GHz.
- 1.5GHz: Run on an Equinix Metal m3.large.x86 machine with an AMD EPYC 7502P CPU with 32 cores at 1.5 GHz. The CPU frequency was set using cpupower frequency-set 1.5GHz.
CPU Frequency | 3.40GHz | 2.5GHz | 1.5GHz |
---|---|---|---|
CPU time elapsed | 73m27s | 103m26s (141%) | 194m37s (265%) |
TZ-transactions per sec | 1043.919 | 741.288 (71%) | 393.969 (38%) |
TZ-operations per sec | 6818.645 | 4841.928 (71%) | 2573.316 (38%) |
This seems to indicate that irmin-pack is CPU-bound. Improving CPU efficiency should directly improve overall performance.
Operating systems implement their own cache mechanism for disk access. This can cost additional CPU cycles, which are unnecessary as irmin-pack.unix implements its own cache.
We can prevent the operating system from caching blocks by using the fadvise system call with FADV_NOREUSE. This stops the operating system from keeping blocks in its page cache, saving CPU cycles.
We can tell fio to do this with the fadvise_hint option:
[global]
rw=randread
filename=/home/adatario/dev/tclpa/.git/annex/objects/gx/17/SHA256E-s3691765475--13300581f2404cc24774da8615a5a3d3f0adb7d68c4c8034c4fa69e727706000/SHA256E-s3691765475--13300581f2404cc24774da8615a5a3d3f0adb7d68c4c8034c4fa69e727706000
[job1]
ioengine=psync
rw=randread
blocksize_range=50-100
size=591394B
loops=100
fadvise_hint=noreuse
fio fadvise-noreuse.ini
job1: (g=0): rw=randread, bs=(R) 50B-100B, (W) 50B-100B, (T) 50B-100B, ioengine=psync, iodepth=1
fio-3.35
Starting 1 process
job1: (groupid=0, jobs=1): err= 0: pid=156889: Thu Jun 15 10:55:44 2023
read: IOPS=432k, BW=27.7MiB/s (29.1MB/s)(56.4MiB/2035msec)
clat (nsec): min=210, max=3652.8k, avg=2107.02, stdev=18068.27
lat (nsec): min=230, max=3653.4k, avg=2131.96, stdev=18076.02
clat percentiles (nsec):
| 1.00th=[ 211], 5.00th=[ 221], 10.00th=[ 221], 20.00th=[ 221],
| 30.00th=[ 231], 40.00th=[ 231], 50.00th=[ 231], 60.00th=[ 231],
| 70.00th=[ 241], 80.00th=[ 402], 90.00th=[ 410], 95.00th=[ 442],
| 99.00th=[121344], 99.50th=[152576], 99.90th=[224256], 99.95th=[254976],
| 99.99th=[346112]
bw ( KiB/s): min=27723, max=28868, per=99.72%, avg=28298.25, stdev=476.32, samples=4
iops : min=422174, max=439618, avg=430933.00, stdev=7326.27, samples=4
lat (nsec) : 250=71.97%, 500=24.01%, 750=1.26%, 1000=0.95%
lat (usec) : 2=0.51%, 4=0.07%, 10=0.06%, 20=0.01%, 50=0.02%
lat (usec) : 100=0.02%, 250=1.06%, 500=0.05%, 750=0.01%, 1000=0.01%
lat (msec) : 4=0.01%
cpu : usr=12.00%, sys=15.34%, ctx=10101, majf=0, minf=11
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=879400,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=27.7MiB/s (29.1MB/s), 27.7MiB/s-27.7MiB/s (29.1MB/s-29.1MB/s), io=56.4MiB (59.1MB), run=2035-2035msec
Disk stats (read/write):
dm-1: ios=11489/0, merge=0/0, ticks=1868/0, in_queue=1868, util=94.89%, aggrios=12303/0, aggrmerge=0/0, aggrticks=2016/0, aggrin_queue=2016, aggrutil=94.15%
dm-0: ios=12303/0, merge=0/0, ticks=2016/0, in_queue=2016, util=94.15%, aggrios=12303/0, aggrmerge=0/0, aggrticks=1629/0, aggrin_queue=1628, aggrutil=94.15%
nvme0n1: ios=12303/0, merge=0/0, ticks=1629/0, in_queue=1628, util=94.15%
We observe a read bandwidth of 27.7MiB/s. This seems to be a considerable performance improvement (around 30% over the baseline) that could be implemented fairly easily.
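For illustration, such a call could look as follows in OCaml (a sketch only: it assumes that the ExtUnix library exposes a posix_fadvise binding as ExtUnix.All.fadvise with a POSIX_FADV_NOREUSE advice constructor; the exact module path, constructor names and availability should be checked against the ExtUnix version in use):
(* Sketch: advise the kernel not to keep pages of the pack file in the page
   cache after reading them. The ExtUnix binding and its signature are
   assumptions to be verified. *)
let open_pack_file_no_reuse path =
  let fd = Unix.openfile path [ Unix.O_RDONLY ] 0 in
  (* offset 0 and length 0 conventionally apply the advice to the whole file *)
  ExtUnix.All.fadvise fd 0 0 ExtUnix.All.POSIX_FADV_NOREUSE;
  fd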
Unfortunately we cannot observe any performance improvement when testing these changes on the Tezos Context replay benchmarks:
 | 3.7.1 | POSIX_FADV_NOREUSE | POSIX_FADV_RANDOM |
---|---|---|---|
CPU time elapsed | 104m07s | 104m50s (101%) | 103m51s (100%) |
TZ-transactions per sec | 736.349 | 731.387 (99%) | 738.298 (100%) |
TZ-operations per sec | 4809.664 | 4777.256 (99%) | 4822.395 (100%) |
We note that most content in the Tezos Context Store is very small (see Content Size distribution). These small pieces of content are stored in individual leaf nodes of the tree with their own hash, incurring a large overhead. An optimization would be to inline the small contents, and possibly inodes, into the parent node. Some exploration of this idea has already been done (see mirage/irmin#884).
In order to approximate potential performance gains we measure the time for reading and decoding objects from disk. We measure these times for 100 blocks from Lima. Total runtime of the measurement is 14s.
We can consider inlining objects that are smaller than a certain number of bytes. In our analysis, we consider 64, 128, 256, and 512 bytes. These byte lengths are measures of on-disk data, which means that they are inclusive of the 32 byte hash.
When considering 64 bytes as the boundary, we observe the following distribution of objects.
type | < 64B | % < 64B | >= 64B |
---|---|---|---|
contents | 12445 | (71.9%) | 4865 |
inode | 5650 | (4.2%) | 127642 |
Note: this percentage of contents is different from the content size distribution analysis. The current analysis considers the entire length of bytes on disk, which includes the 32 bytes of hash, but the content distribution analysis only looks at the content length. When accounting for the hash size, the content distribution analysis shows that 89.5% of contents are less than 32 bytes. This is more than the current analysis of 71.9%, but the current analysis is only contents that are read as a part of processing the 100 blocks, whereas the content distribution analysis is for the entire context. The current analysis only sees 17310 (possibly with some duplicates) of 79822881 contents objects in the context, which is only 0.02%. This is the most likely explanation for the difference in observed ratios.
When going to 512 bytes, we observe the following distribution.
type | < 512B | % < 512B | >=512B |
---|---|---|---|
contents | 16415 | (94.8%) | 895 |
inode | 98070 | (73.6%) | 35222 |
To simulate the possible performance improvements of inlining contents or inodes, we assume that inlined objects save all of the observed read time, and we then consider different percentages of decode-time savings to calculate the overall saving in total runtime.
For 64 bytes:
Decode time saved (%) | Saving in find time | Saving in total runtime |
---|---|---|
0 | 0.5% | 0.1% |
20 | 0.7% | 0.1% |
50 | 0.9% | 0.2% |
75 | 1.1% | 0.2% |
100 | 1.2% | 0.2% |
For 512 bytes:
Decode time saved (%) | Saving in find time | Saving in total runtime |
---|---|---|
0 | 4.9% | 1.0% |
20 | 19.9% | 3.9% |
50 | 42.4% | 8.3% |
75 | 61.1% | 12.0% |
100 | 79.9% | 15.7% |
If we assume we can save between 0% and 50% of decoding time, we would expect an overall improvement in the range of 0.1-8.3%, depending on configuration. We also think there is a chance of further, unobserved improvements in the overall functioning of the system, so this could improve performance in the 0-10% range.
Access to content in Irmin requires descending a path. Every node on the path needs to be de-referenced, read from disk and unpacked individually. The cost of accessing a piece of content is therefore a function of the path length. It might make sense to optimize the tree descent.
We note that read accesses to the Tezos Context Store are focused on few paths (see Access Patterns).
The existing LRU cache in irmin-pack should prevent re-fetching these paths from disk, increasing performance considerably. The cache maintains a mapping from position in the pack file to tree object. When fetching a path, we still need to descend it segment by segment, performing one cache lookup per node until we reach the final object. These should be mostly in-memory operations as the tree objects are cached, but they still cost a lot of CPU cycles.
A potential optimization would be to use a cache that maintains a mapping from path to tree object. When fetching a path that is cached, only a single lookup would then be required to get the final tree object.
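A minimal sketch of such a path-keyed cache follows (an illustration of the idea only; the key type, eviction policy and integration with irmin-pack are all assumptions):
(* Sketch of a path-keyed cache: a full path maps directly to the object
   found at the end of the descent, so a cached path costs one lookup
   instead of one lookup per path segment. Illustration only. *)
module Path_cache (Obj : sig type t end) = struct
  type t = {
    capacity : int;
    table : (string list, Obj.t) Hashtbl.t;  (* path -> tree object *)
    mutable order : string list list;        (* crude FIFO eviction order *)
  }

  let create capacity = { capacity; table = Hashtbl.create capacity; order = [] }

  let find t path = Hashtbl.find_opt t.table path

  let add t path obj =
    if not (Hashtbl.mem t.table path) then begin
      (* evict the oldest entry when full; a real implementation would use a
         proper LRU or similar policy *)
      if Hashtbl.length t.table >= t.capacity then begin
        match List.rev t.order with
        | oldest :: _ ->
            Hashtbl.remove t.table oldest;
            t.order <- List.filter (fun p -> p <> oldest) t.order
        | [] -> ()
      end;
      Hashtbl.add t.table path obj;
      t.order <- path :: t.order
    end
end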
Tezos usage of irmin-pack goes through the tezos_context library, but also through various storage functors implemented in tezos-protocol-015-PtLimaPt.environment. This is where commonly appearing keywords like index or total_bytes are defined.
It may be possible that the way these storage functors use irmin-pack (indirectly via the tezos_context library) could be improved. In particular, it seems that many small files are created with long, nested paths. It may be more efficient to create larger files with shorter paths.
A concrete proposal would be to add additional Irmin types for the various storage abstractions offered by the storage functors. Irmin allows variable content types; currently the only type used is bytes. Instead, one could define more complex types such as:
module Contents = struct
module StringMap = Map.Make(String)
type t =
| Bytes of bytes
| KVMap of string StringMap.t
end
The KVMap case would be serialized to a single, larger content object in the Irmin store. This might improve locality of data as well as increase the size of content objects, which would improve I/O bandwidth (see Batching Reads). This proposal can be seen as a form of explicit inlining (see Inline small objects) that uses the existing Irmin API.
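For example, the balance, counter and delegate of a contract (field names taken from the access-pattern tables above; the values shown are made up and in practice would be the encoded values currently stored as bytes) could be grouped into one KVMap leaf instead of three separate single-value leaves:
(* Hypothetical usage of the proposed type: group several small per-contract
   values into a single KVMap content object. *)
let contract_record =
  let open Contents in
  KVMap
    (StringMap.of_seq
       (List.to_seq
          [ ("balance", "1234567");
            ("counter", "42");
            ("delegate", "tz1...") ]))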
It would also be interesting to investigate the symmetry between mem and write accesses observed in the section Operations in a single block.
Read performance seems to be poor because of the large number of small reads that are performed with individual system calls. The system call overhead appears to be significant, and reducing it should increase performance.
Proposals such as Inline small objects would also help increase the size of reads and quite possibly improve read performance.
Performance could be improved by batching read operations into larger blocks. For example if we use 4KiB blocks:
[global]
rw=randread
filename=/home/adatario/dev/tclpa/.git/annex/objects/gx/17/SHA256E-s3691765475--13300581f2404cc24774da8615a5a3d3f0adb7d68c4c8034c4fa69e727706000/SHA256E-s3691765475--13300581f2404cc24774da8615a5a3d3f0adb7d68c4c8034c4fa69e727706000
[job1]
ioengine=psync
rw=randread
blocksize=4KiB
size=591394B
loops=100
fio batching-reads.ini
job1: (g=0): rw=randread, bs=(R) 4000B-4000B, (W) 4000B-4000B, (T) 4000B-4000B, ioengine=psync, iodepth=1
fio-3.35
Starting 1 process
job1: (groupid=0, jobs=1): err= 0: pid=10702: Fri Jun 16 10:32:29 2023
read: IOPS=8941, BW=34.1MiB/s (35.8MB/s)(56.1MiB/1644msec)
clat (nsec): min=310, max=3684.8k, avg=110624.46, stdev=91545.42
lat (nsec): min=330, max=3685.4k, avg=110715.69, stdev=91573.99
clat percentiles (nsec):
| 1.00th=[ 422], 5.00th=[ 628], 10.00th=[ 820], 20.00th=[ 1256],
| 30.00th=[ 3472], 40.00th=[122368], 50.00th=[134144], 60.00th=[142336],
| 70.00th=[152576], 80.00th=[164864], 90.00th=[199680], 95.00th=[236544],
| 99.00th=[317440], 99.50th=[415744], 99.90th=[667648], 99.95th=[700416],
| 99.99th=[905216]
bw ( KiB/s): min=32390, max=37875, per=100.00%, avg=34973.67, stdev=2756.26, samples=3
iops : min= 8292, max= 9696, avg=8953.33, stdev=705.52, samples=3
lat (nsec) : 500=2.02%, 750=6.08%, 1000=6.39%
lat (usec) : 2=11.84%, 4=4.03%, 10=1.27%, 20=0.33%, 50=0.01%
lat (usec) : 100=0.06%, 250=64.49%, 500=3.18%, 750=0.27%, 1000=0.02%
lat (msec) : 4=0.01%
cpu : usr=1.58%, sys=7.00%, ctx=10010, majf=0, minf=11
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=14700,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=34.1MiB/s (35.8MB/s), 34.1MiB/s-34.1MiB/s (35.8MB/s-35.8MB/s), io=56.1MiB (58.8MB), run=1644-1644msec
Disk stats (read/write):
dm-1: ios=8537/0, merge=0/0, ticks=1280/0, in_queue=1280, util=93.46%, aggrios=10000/0, aggrmerge=0/0, aggrticks=1496/0, aggrin_queue=1496, aggrutil=93.96%
dm-0: ios=10000/0, merge=0/0, ticks=1496/0, in_queue=1496, util=93.96%, aggrios=10000/0, aggrmerge=0/0, aggrticks=1149/0, aggrin_queue=1149, aggrutil=93.90%
nvme0n1: ios=10000/0, merge=0/0, ticks=1149/0, in_queue=1149, util=93.90%
We observe a read bandwidth of 34.1MiB/s (a 60% performance improvement over the baseline). However, it is not clear how reads could be batched without a major re-organization of the on-disk representation.
As noted in the section CPU boundness, improving the CPU efficiency of irmin-pack may have a great effect on the overall performance of the Tezos Context Store.
Some potential improvements are described in the following.
The LRU cache currently implemented in irmin-pack uses doubly-linked lists, which requires non-negligible data structure manipulation even when only accessing objects already in the cache.
Other cache replacement policies and cache data structures exist that might decrease the number of required CPU cycles. Furthermore, they may provide better cache hit rates, which would also improve overall performance.
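As one example of an alternative, a CLOCK-style cache approximates LRU while only setting a reference bit on a hit, avoiding list manipulation on the hot path. The following is a minimal sketch of the policy (an illustration only, not a proposal for the actual irmin-pack implementation):
(* Minimal CLOCK cache sketch: a hit only sets a reference bit; eviction
   sweeps a clock hand, clearing and skipping recently referenced slots. *)
module Clock_cache (K : Hashtbl.HashedType) = struct
  module H = Hashtbl.Make (K)

  type 'v slot = { key : K.t; value : 'v; mutable referenced : bool }

  type 'v t = {
    capacity : int;
    slots : 'v slot option array;
    index : int H.t;      (* key -> slot position *)
    mutable hand : int;   (* position of the clock hand *)
  }

  let create capacity =
    { capacity; slots = Array.make capacity None; index = H.create capacity; hand = 0 }

  let find t key =
    match H.find_opt t.index key with
    | None -> None
    | Some i ->
        (match t.slots.(i) with
         | Some s -> s.referenced <- true; Some s.value  (* hit: just set the bit *)
         | None -> None)

  (* Advance the hand until a slot whose reference bit is cleared is found,
     clearing bits along the way; the returned position is reused. *)
  let rec evict t =
    match t.slots.(t.hand) with
    | None -> t.hand
    | Some s when s.referenced ->
        s.referenced <- false;
        t.hand <- (t.hand + 1) mod t.capacity;
        evict t
    | Some s ->
        H.remove t.index s.key;
        t.hand

  let add t key value =
    if not (H.mem t.index key) then begin
      let pos = evict t in
      t.slots.(pos) <- Some { key; value; referenced = false };
      H.replace t.index key pos;
      t.hand <- (pos + 1) mod t.capacity
    end
end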
In order to simulate the potential performance improvement of using multiple cores, we use a fio job similar to the baseline, but with N cores that in total read the same amount of bytes:
[global]
rw=randread
filename=/home/adatario/dev/tclpa/.git/annex/objects/gx/17/SHA256E-s3691765475--13300581f2404cc24774da8615a5a3d3f0adb7d68c4c8034c4fa69e727706000/SHA256E-s3691765475--13300581f2404cc24774da8615a5a3d3f0adb7d68c4c8034c4fa69e727706000
loops=100
group_reporting
thread
[job1]
ioengine=psync
rw=randread
blocksize_range=50-100
size=591394 / N
numjobs=N
Using the options thread and numjobs we create N threads that randomly read from the file.
We observe the following read bandwidths:
Number of Threads | Read bandwidth (MiB/s) |
---|---|
1 | 21 |
2 | 58.8 |
3 | 89.4 |
4 | 109 |
5 | 120 |
6 | 141 |
7 | 148 |
8 | 162 |
9 | 168 |
10 | 185 |
11 | 196 |
12 | 201 |
13 | 213 |
14 | 219 |
15 | 228 |
16 | 234 |
Increasing the number of threads/CPU cores utilized seems to lead to considerable performance improvements.
Note, however, that we simulate N independent threads. In irmin-pack, and Irmin in general, reads may need to be sequential, as they descend a tree by de-referencing nodes individually.
Modern PCIe-attached solid-state drives (SSDs) offer high throughput and large capacity at low cost. They are exposed to the system using the same block-based APIs as traditional disks or SSDs attached via serial buses. However, in order to utilize the full performance of modern hardware, the way I/O operations are performed needs to be changed.
In order to illustrate the capabilities of modern hardware, we run fio using the Linux io_uring API, which allows asynchronous I/O operations. We also increase the size of the blocks read from a few bytes to 64KiB. This increases the latency of individual reads but allows a much higher bandwidth.
[global]
rw=randread
filename=/home/adatario/dev/tclpa/.git/annex/objects/gx/17/SHA256E-s3691765475--13300581f2404cc24774da8615a5a3d3f0adb7d68c4c8034c4fa69e727706000/SHA256E-s3691765475--13300581f2404cc24774da8615a5a3d3f0adb7d68c4c8034c4fa69e727706000
loops=100
group_reporting
thread
[job1]
ioengine=io_uring
iodepth=16
rw=randread
blocksize=64KiB
size=100MiB
numjobs=8
fio read-io_uring.ini
job1: (g=0): rw=randread, bs=(R) 62.5KiB-62.5KiB, (W) 62.5KiB-62.5KiB, (T) 62.5KiB-62.5KiB, ioengine=io_uring, iodepth=16
...
fio-3.33
Starting 8 threads
job1: (groupid=0, jobs=8): err= 0: pid=44365: Fri May 19 14:25:25 2023
read: IOPS=147k, BW=8955MiB/s (9390MB/s)(74.5GiB/8517msec)
slat (nsec): min=290, max=9700.4k, avg=9392.25, stdev=73806.67
clat (nsec): min=130, max=23784k, avg=767189.99, stdev=1005905.48
lat (usec): min=4, max=23786, avg=776.58, stdev=1007.13
clat percentiles (usec):
| 1.00th=[ 17], 5.00th=[ 30], 10.00th=[ 43], 20.00th=[ 62],
| 30.00th=[ 83], 40.00th=[ 125], 50.00th=[ 215], 60.00th=[ 400],
| 70.00th=[ 914], 80.00th=[ 1795], 90.00th=[ 2278], 95.00th=[ 2606],
| 99.00th=[ 3884], 99.50th=[ 4490], 99.90th=[ 6390], 99.95th=[ 7439],
| 99.99th=[10028]
bw ( MiB/s): min= 7688, max=10172, per=100.00%, avg=9047.06, stdev=83.12, samples=129
iops : min=125968, max=166674, avg=148226.72, stdev=1361.83, samples=129
lat (nsec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (usec) : 2=0.01%, 4=0.02%, 10=0.10%, 20=1.72%, 50=11.80%
lat (usec) : 100=21.71%, 250=17.36%, 500=10.21%, 750=4.71%, 1000=3.36%
lat (msec) : 2=12.41%, 4=15.67%, 10=0.88%, 20=0.01%, 50=0.01%
cpu : usr=2.68%, sys=14.62%, ctx=559563, majf=0, minf=0
IO depths : 1=0.1%, 2=0.1%, 4=0.3%, 8=0.5%, 16=99.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=99.9%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=1249600,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
READ: bw=8955MiB/s (9390MB/s), 8955MiB/s-8955MiB/s (9390MB/s-9390MB/s), io=74.5GiB (80.0GB), run=8517-8517msec
Disk stats (read/write):
dm-1: ios=345735/95, merge=0/0, ticks=597292/4, in_queue=597296, util=98.72%, aggrios=349972/95, aggrmerge=0/0, aggrticks=599656/4, aggrin_queue=599660, aggrutil=98.56%
dm-0: ios=349972/95, merge=0/0, ticks=599656/4, in_queue=599660, util=98.56%, aggrios=349972/87, aggrmerge=0/8, aggrticks=542294/6, aggrin_queue=542303, aggrutil=97.77%
nvme0n1: ios=349972/87, merge=0/8, ticks=542294/6, in_queue=542303, util=97.77%
We observe a read bandwidth of about 8955MiB/s, an improvement of more than 40000% compared to the baseline. Note that this is pure I/O read bandwidth without any data management.
Ongoing research is investigating how the performance of such hardware can be used in database systems (see LeanStore: In-Memory Data Management Beyond Main Memory). We note that the main ingredients for doing this in the OCaml ecosystem are present or are being actively worked on: OCaml 5 with multicore support and algebraic effects, as well as the eio library. The question seems to be: What Are We Waiting For? Let’s Use Coroutines for Asynchronous I/O to Hide I/O Latencies and Maximize the Read Bandwidth!
In this report we have identified storage performance bottlenecks of Irmin and irmin-pack as used by Octez and propose potential improvements.
A selection of the different scenarios considered, with their expected impact on performance, is provided in the following table:
Description | I/O | Overall | Effort (FTE) | On-disk format |
---|---|---|---|---|
Improve Index usage | - | 0% | - | No change |
Improve Dict usage | 10-20% | -20% | 8-16w | Potential change |
Disable OS page cache | 30% | 0-3% | 2-4w | No change |
Inline small objects | 0-40% | 0-10% | 8-16w | Change |
Path Cache | - | 5% | 8-16w | No change |
More compact on-disk representation | 300% | 10% | 16-32w | Change |
More efficient cache data structure | - | | 8-16w | No change |
Optimize Tezos Storage Functors | - | | 16-32w | No change |
Batching reads | 60% | 5% | 16-32w | Change |
Multicore I/O [fn:multicore_io] | 700% | 12% | 32-64w | No change |
Asynchronous I/O with modern APIs | 40000% | | 64w+ | Change |
Some details on the columns:
- I/O: Potential increase in I/O throughput (i.e. in bytes/s), measured and estimated with fio.
- Overall: Potential improvement in overall performance of the Tezos Context store, based on the performance of the current implementation and the structure of the Context Store, measured as the decrease in time to replay a recording of 100000 blocks. The performance improvements are indicative and purely theoretical; the experiments on the Dict and the OS page cache have shown that some results may be counter-intuitive.
- Effort: Estimated effort to implement the improvement, in FTE weeks.
- On-disk format: Indication of whether an on-disk format change is needed.
Note that performance improvements may not be additive, and individual improvements may have unexpected effects on other improvements.
We conclude with the following recommendations:
- Implement Inlining of small objects: This seems to be an obvious optimization. Even if the estimated performance gains from decreased I/O might be low, we can expect further gains from doing larger reads (see Batching Reads), improved CPU efficiency while decoding content (see CPU efficiency), as well as a more shallow tree (see Optimize tree descent).
- Investigate CPU usage: We observe that the Tezos Context Storage seems to be CPU-bound (see CPU boundness). Currently we have little insight into what the bottlenecks are and which areas should be optimized for the best overall performance improvement. We propose using the OCaml 5.1 Runtime_events library for observing running applications (see ocaml/ocaml#11474). This approach can also be used for observing overall Octez code and identifying performance bottlenecks beyond the Tezos Context Store.
- Explore Multicore and Asynchronous APIs: Modern hardware, APIs such as io_uring and multicore capabilities in OCaml 5 allow massively improved performance for storage-intensive applications. However, they require a fundamental redesign of the storage backends (see Asynchronous I/O). To remain competitive with regards to performance in the medium to longer term, it is imperative to explore such ideas now.
[fn:multicore_io] Only considering I/O. We expect even larger performance improvements when CPU intensive operations are parallelized over multiple cores.