benchmarking number of threads
The intention is to saturate warcprox under realistic load.
- ran a brozzler crawl of some popular sites that should not be affected by a burst of traffic from us: https://partner.archive-it.org/3/collections/261/crawl/561147
- filtered into a set of evenly-sized request-only warcs; we will replay the requests from these warcs, and the purpose of splitting them up this way is to make sure we can replay many requests quickly in parallel
a. filtered request records into separate warcs using the warc cli on this branch: https://github.com/nlevitt/warc/tree/cli

```
$ for f in /var/tmp/warcs/* ; do
>   warc filter warc-type:request $f > ${f%.warc.gz}-REQUESTS_ONLY.warc
> done
```
b. combine and then split into evenly sized warcs:

```
$ cat *.warc > one-huge.warc
$ python split-em.py
```
Source of split-em.py:

```python
#!/usr/bin/env python3
# split one-huge.warc into smaller warcs of 6702 records each
warc_no = -1
record_no = -1
warc = None
with open('req-only/one-huge.warc', 'rb') as f:
    for line in f:
        if line == b'WARC/1.0\r\n':
            record_no += 1
            if record_no % 6702 == 0:
                # every 6702 records, close the current output warc
                # and start a new one
                if warc:
                    warc.close()
                warc_no += 1
                warc = open('%s.warc' % warc_no, 'wb')
                print('writing to %s.warc' % warc_no)
        warc.write(line)
```
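One way to sanity-check the split (not part of the original workflow, just a suggestion) is to reuse split-em.py's line-scanning trick to count the records in each output warc and confirm they came out evenly sized:

```python
#!/usr/bin/env python3
# count 'WARC/1.0' record headers in each split warc to confirm the
# splits are even (same record-detection trick as split-em.py)
import glob

for filename in sorted(glob.glob('*.warc')):
    with open(filename, 'rb') as f:
        n = sum(1 for line in f if line == b'WARC/1.0\r\n')
    print('%s: %s records' % (filename, n))
```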
Running warcprox on wbgrp-svc203 in cwd /tmp.
First set these sysctl variables, so that the TCP listen backlog is large enough to absorb bursts of parallel connections:

```
wbgrp-svc203$ sudo sysctl -w net.core.somaxconn=4096
wbgrp-svc203$ sudo sysctl -w net.ipv4.tcp_max_syn_backlog=4096
```
Running warcprox always with these options:
```
--address=0.0.0.0
--gzip
--base32
--dedup-db-file=/dev/null
--stats-db-file=/dev/null
```
and varying --max-threads and --writer-threads.
Then running replay-warc-requests (https://github.com/nlevitt/replay-warc-requests) on wbgrp-svc203:

```
timeout 2m nice replay-warc-requests --proxy=http://wbgrp-svc203:8000 /var/tmp/warcs/*.warc
```
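Roughly speaking, the replay amounts to something like the following sketch. This is a simplification, not the actual replay-warc-requests code; it assumes the warcio and requests libraries are available, and re-issues every recorded request as a plain GET, where the real tool presumably replays the recorded request more faithfully:

```python
#!/usr/bin/env python3
# rough sketch: replay warc request records through a proxy
import sys
import requests
from warcio.archiveiterator import ArchiveIterator

PROXIES = {'http': 'http://wbgrp-svc203:8000',
           'https': 'http://wbgrp-svc203:8000'}

for filename in sys.argv[1:]:
    with open(filename, 'rb') as f:
        for record in ArchiveIterator(f):
            if record.rec_type != 'request':
                continue
            url = record.rec_headers.get_header('WARC-Target-URI')
            try:
                # verify=False because warcprox man-in-the-middles https
                requests.get(url, proxies=PROXIES, timeout=60, verify=False)
            except requests.RequestException:
                pass  # ignore failures and keep replaying
```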
Because of queueing, warcprox sometimes finishes writing records a few seconds after the two minutes are up. To determine the elapsed time, I looked at the timestamp on the log line of the last record written to disk.
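A small script along these lines can pull that elapsed time out of the log mechanically. This is a sketch with assumptions: it expects warcprox's default python-logging timestamp prefix (e.g. `2017-03-25 00:20:32,208`), and it assumes warcprox was restarted for each run so the log covers just that run; adjust the pattern if the actual log format differs:

```python
#!/usr/bin/env python3
# elapsed time between the first and last timestamped lines of a
# warcprox log, assuming python-logging asctime prefixes
import re
import sys
from datetime import datetime

TS = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})')

first = last = None
with open(sys.argv[1]) as f:
    for line in f:
        m = TS.match(line)
        if m:
            t = datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S,%f')
            if first is None:
                first = t
            last = t

print('elapsed:', last - first)
```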
Results, one run per row:

| elapsed | urls | warc bytes written |
|---------|------|--------------------|
| 2m0s min | 104002 | |
| 2m0s | 1052 (8.77 urls/s) | 21611963 (180099 bytes/s) |
| 2m0s | 8836 (73.6 urls/s) | 201653177 (1680443 bytes/s) |
| 2m4s | 12743 (102.8 urls/s) | 330544775 (2665683 bytes/s) |
| 2m3s | 12383 (100.6 urls/s) | 323689997 (2631626 bytes/s) |
| 2m4s | 12417 (100.1 urls/s) | 328685433 (2650688 bytes/s) |
| 2m3s | 12027 (97.8 urls/s) | 295704025 (2404098 bytes/s) |
| 2m1s | 12366 (102.2 urls/s) | 323261586 (2671583 bytes/s) |
| 2m4s | 12630 (101.8 urls/s) | 331888758 (2676522 bytes/s) |
| 2m3s | 12628 (102.7 urls/s) | 333533547 (2711655 bytes/s) |
| 2m4s | 12595 (101.6 urls/s) | 329635952 (2658354 bytes/s) |
| 2m9s | 12225 (94.8 urls/s) | 324936789 (2518890 bytes/s) |
| 2m0s | 11959 (99.7 urls/s) | 289077720 (2408981 bytes/s) |
In this scenario, at least, there was no appreciable performance gain from running more than one writer thread, and performance was basically the same with 50, 100, or 150 proxy threads. Recommending --max-threads=100 --writer-threads=1.
There are many more things we could test:
- other stuff happening on the machine (e.g. a process moving warcs into permanent storage, store-warcs.py in the archive-it case)
- enable stats, try the different implementations
- enable dedup, try the different implementations
- disable gzip
- what happens running multiple instances of warcprox on a vm
- for archive-it, enable archive-it plugins
Warcprox pegs neither disk nor cpu, so what is the bottleneck? Maybe the GIL? I would like to try refactoring warcprox to use multiprocessing instead of multithreading and see how that affects performance. Asyncio is an option, but that would be a major undertaking; for a rewrite of that scale I would rather do it in golang.
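To make the multiprocessing idea concrete, a rough sketch might look like the following. This is hypothetical, not warcprox code; it pretends the CPU-bound work per record is gzip compression plus payload digesting, and hands that off to a pool of worker processes so it can use multiple cores despite the GIL:

```python
#!/usr/bin/env python3
# hypothetical sketch, not warcprox code: offload cpu-bound
# gzip/digest work to worker processes to sidestep the GIL
import gzip
import hashlib
from concurrent.futures import ProcessPoolExecutor

def compress_and_digest(record_bytes):
    # runs in a worker process, on its own core
    return gzip.compress(record_bytes), hashlib.sha1(record_bytes).digest()

if __name__ == '__main__':
    # stand-in for records coming off the proxy's queue
    records = [b'fake record %d\r\n' % i for i in range(10000)]
    with ProcessPoolExecutor(max_workers=4) as pool:
        for compressed, digest in pool.map(compress_and_digest, records):
            pass  # a single writer would append `compressed` to the warc
```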