benchmarking number of threads
The intention is to saturate warcprox under realistic load.
- ran a brozzler crawl of some popular sites that should not be affected by a burst of traffic from us: https://partner.archive-it.org/3/collections/261/crawl/561147
- filtered into a set of evenly-sized request-only warcs; we will replay the requests from these warcs, and the purpose of splitting them up this way is to make sure we can replay many requests quickly in parallel
a. filtered request records into separate warcs using the warc cli on this branch: https://github.com/nlevitt/warc/tree/cli

```
$ for f in /var/tmp/warcs/* ; do
>   warc filter warc-type:request $f > ${f%.warc.gz}-REQUESTS_ONLY.warc
> done
```
b. combine and then split into evenly sized warcs:

```
$ cat *.warc > one-huge.warc
$ python split-em.py
```
Source of split-em.py:

```python
#!/usr/bin/env python3
# split one-huge.warc into smaller warcs of 6702 records each
warc_no = -1
record_no = -1
warc = None
with open('req-only/one-huge.warc', 'rb') as f:
    for line in f:
        if line == b'WARC/1.0\r\n':
            record_no += 1
            if record_no % 6702 == 0:
                # every 6702 records, close the current output warc
                # and start a new one
                if warc:
                    warc.close()
                warc_no += 1
                warc = open('%s.warc' % warc_no, 'wb')
                print('writing to %s.warc' % warc_no)
        warc.write(line)
```
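One way to sanity-check the split (not part of the original workflow, just a suggestion) is to reuse split-em.py's line-scanning trick to count the records in each output warc and confirm they came out evenly sized:

```python
#!/usr/bin/env python3
# count 'WARC/1.0' record headers in each split warc to confirm the
# splits are even (same record-detection trick as split-em.py)
import glob

for filename in sorted(glob.glob('*.warc')):
    with open(filename, 'rb') as f:
        n = sum(1 for line in f if line == b'WARC/1.0\r\n')
    print('%s: %s records' % (filename, n))
```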
Running warcprox on wbgrp-svc203 in cwd /tmp.
First set these sysctl variables, so that the TCP listen backlog is large enough to absorb bursts of parallel connections:

```
wbgrp-svc203$ sudo sysctl -w net.core.somaxconn=4096
wbgrp-svc203$ sudo sysctl -w net.ipv4.tcp_max_syn_backlog=4096
```
Running warcprox always with these options:
```
--address=0.0.0.0
--gzip
--base32
--dedup-db-file=/dev/null
--stats-db-file=/dev/null
```
and varying --max-threads and --writer-threads.
Then running replay-warc-requests (https://github.com/nlevitt/replay-warc-requests) on wbgrp-svc203:

```
timeout 2m nice replay-warc-requests --proxy=http://wbgrp-svc203:8000 /var/tmp/warcs/*.warc
```
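Roughly speaking, the replay amounts to something like the following sketch. This is a simplification, not the actual replay-warc-requests code; it assumes the warcio and requests libraries are available, and re-issues every recorded request as a plain GET, where the real tool presumably replays the recorded request more faithfully:

```python
#!/usr/bin/env python3
# rough sketch: replay warc request records through a proxy
import sys
import requests
from warcio.archiveiterator import ArchiveIterator

PROXIES = {'http': 'http://wbgrp-svc203:8000',
           'https': 'http://wbgrp-svc203:8000'}

for filename in sys.argv[1:]:
    with open(filename, 'rb') as f:
        for record in ArchiveIterator(f):
            if record.rec_type != 'request':
                continue
            url = record.rec_headers.get_header('WARC-Target-URI')
            try:
                # verify=False because warcprox man-in-the-middles https
                requests.get(url, proxies=PROXIES, timeout=60, verify=False)
            except requests.RequestException:
                pass  # ignore failures and keep replaying
```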
Because of queueing, warcprox sometimes finishes writing records a few seconds after the two minutes are up. To determine the elapsed time, I looked at the timestamp on the log line of the last record written to disk.
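A small script along these lines can pull that elapsed time out of the log mechanically. This is a sketch with assumptions: it expects warcprox's default python-logging timestamp prefix (e.g. `2017-03-25 00:20:32,208`), and it assumes warcprox was restarted for each run so the log covers just that run; adjust the pattern if the actual log format differs:

```python
#!/usr/bin/env python3
# elapsed time between the first and last timestamped lines of a
# warcprox log, assuming python-logging asctime prefixes
import re
import sys
from datetime import datetime

TS = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})')

first = last = None
with open(sys.argv[1]) as f:
    for line in f:
        m = TS.match(line)
        if m:
            t = datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S,%f')
            if first is None:
                first = t
            last = t

print('elapsed:', last - first)
```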
Results, one run per row:

| elapsed | urls | warc bytes written |
|---------|------|--------------------|
| 2m0s min | 104002 | |
| 2m0s | 1052 (8.77 urls/s) | 21611963 (180099 bytes/s) |
| 2m0s | 8836 (73.6 urls/s) | 201653177 (1680443 bytes/s) |
| 2m4s | 12743 (102.8 urls/s) | 330544775 (2665683 bytes/s) |
| 2m3s | 12383 (100.6 urls/s) | 323689997 (2631626 bytes/s) |
| 2m4s | 12417 (100.1 urls/s) | 328685433 (2650688 bytes/s) |
| 2m3s | 12027 (97.8 urls/s) | 295704025 (2404098 bytes/s) |
| 2m1s | 12366 (102.2 urls/s) | 323261586 (2671583 bytes/s) |
| 2m4s | 12630 (101.8 urls/s) | 331888758 (2676522 bytes/s) |
| 2m3s | 12628 (102.7 urls/s) | 333533547 (2711655 bytes/s) |
| 2m4s | 12595 (101.6 urls/s) | 329635952 (2658354 bytes/s) |
| 2m9s | 12225 (94.8 urls/s) | 324936789 (2518890 bytes/s) |
| 2m0s | 11959 (99.7 urls/s) | 289077720 (2408981 bytes/s) |
In this scenario, at least, there was no appreciable performance gain from running more than one writer thread, and performance was basically the same with 50, 100, or 150 proxy threads. Recommending --max-threads=100 --writer-threads=1.
There are many more things we could test:
- other stuff happening on the machine (e.g. a process moving warcs into permanent storage, store-warcs.py in the archive-it case)
- enable stats, try the different implementations
- enable dedup, try the different implementations
- disable gzip
- what happens running multiple instances of warcprox on a vm
- for archive-it, enable archive-it plugins
Warcprox pegs neither disk nor cpu, so what is the bottleneck? Maybe the GIL? I would like to try refactoring warcprox to use multiprocessing instead of multithreading and see how that affects performance. Asyncio is an option, but that would be a major undertaking; for a rewrite of that scale I would rather do it in golang.
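To make the multiprocessing idea concrete, a rough sketch might look like the following. This is hypothetical, not warcprox code; it pretends the CPU-bound work per record is gzip compression plus payload digesting, and hands that off to a pool of worker processes so it can use multiple cores despite the GIL:

```python
#!/usr/bin/env python3
# hypothetical sketch, not warcprox code: offload cpu-bound
# gzip/digest work to worker processes to sidestep the GIL
import gzip
import hashlib
from concurrent.futures import ProcessPoolExecutor

def compress_and_digest(record_bytes):
    # runs in a worker process, on its own core
    return gzip.compress(record_bytes), hashlib.sha1(record_bytes).digest()

if __name__ == '__main__':
    # stand-in for records coming off the proxy's queue
    records = [b'fake record %d\r\n' % i for i in range(10000)]
    with ProcessPoolExecutor(max_workers=4) as pool:
        for compressed, digest in pool.map(compress_and_digest, records):
            pass  # a single writer would append `compressed` to the warc
```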