
Add a dataset write benchmark #123

Open

jonkeane opened this issue Dec 6, 2022 · 10 comments

Comments

@jonkeane
Contributor

jonkeane commented Dec 6, 2022

With pyarrow, one can (re)write a dataset (partitioned or not) without reading the full thing into memory. We currently have a benchmark that runs a filter on datasets.

We should create a new benchmark that is similar to the filtering one, but that, on top of filtering, also writes the results out to a new dataset (a sketch follows the list below), instead of pulling into a table like we do at

return lambda: dataset.to_table(

We might parameterize this over:

  • how selective the filter is (and we should definitely keep the 100% selectivity case, where we do not remove any rows, especially for the (re)partitioning below)
  • the format of the output (parquet, arrow file)
  • partitioning (add a new partitioning column to partition by?)
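
A minimal sketch of what such a benchmark case could execute, streaming the filtered data instead of materializing it. Paths, the filter column, and the partitioning column are hypothetical placeholders, not the actual benchmark datasets:

# Sketch only: filter a dataset and stream the result into a new dataset,
# without pulling the full table into memory. Paths and column names are
# placeholders.
import pyarrow.dataset as ds

source = ds.dataset("/path/to/source-dataset", format="parquet")

ds.write_dataset(
    # A Scanner streams record batches through the (optional) filter.
    source.scanner(filter=ds.field("total_amount") > 0),
    base_dir="/path/to/output-dataset",
    format="parquet",               # or "arrow"/"feather"/"ipc", "csv"
    partitioning=["payment_type"],  # optional (re)partitioning column
)
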
@jgehrcke
Contributor

jgehrcke commented Dec 8, 2022

Questions to self (not expecting a timely answer from Jon):

Provocative point of view: knowing almost nothing, I feel like this is an 'extend feature coverage in a test suite' kind of idea, and not so much a benchmarking topic in the traditional sense. Am I on the right track, or am I missing something?

Or do we already care about performance aspects from the start, such as the wall time it takes for an operation to complete?

also writes the results out to a new dataset

Does that imply 'to the file system'? I am asking because then it would be interesting to see how much the run time (duration) is affected by the I/O performance of the test environment, and especially by the volatility of that I/O performance.

@jgehrcke
Contributor

jgehrcke commented Dec 9, 2022

FYI, I am taking this on because Jon recommended this as a good first issue to get involved here 🎈.

@jgehrcke
Contributor

jgehrcke commented Dec 9, 2022

Documenting preparation of environment:

# check out repo
$ git clone https://github.com/voltrondata-labs/benchmarks
$ mv benchmarks voltrondata-labs-benchmarks
$ cd voltrondata-labs-benchmarks/

# prepare new python environment, install benchmark deps
$ pyenv virtualenv 3.10.8 3108-vd-benchmarks
$ pyenv activate 3108-vd-benchmarks
$ pyenv local 3108-vd-benchmarks
$ pip install .

# install HEAD of Conbench into this environment
$ git clone https://github.com/conbench/conbench
$ cd conbench
$ python setup.py install

# conbench cmd inspection:
$ command -v conbench
/home/jp/.pyenv/shims/conbench
$ conbench --version
[221209-14:58:31.471] [24409] [benchclients.logging] INFO: Initializing adapter
Usage: conbench [OPTIONS] COMMAND [ARGS]...
Try 'conbench --help' for help.

Error: No such option: --version

I then tried running a specific benchmark:

$ conbench dataset-selectivity nyctaxi_multi_parquet_s3
[221209-14:53:34.388] [23796] [benchclients.logging] INFO: Initializing adapter

Benchmark result:
{
    "batch_id": "094255e7381d428fa01eb58748fe9e1b",
    "context": {
        "arrow_compiler_flags": " -fdiagnostics-color=always -O2 -DNDEBUG -ftree-vectorize",
        "benchmark_language": "Python"
    },
    "github": {
        "commit": "",
        "pr_number": null,
        "repository": "https://github.com/apache/arrow"
    },
    "info": {
        "arrow_compiler_id": "GNU",
        "arrow_compiler_version": "10.2.1",
        "arrow_version": "10.0.1",
        "benchmark_language_version": "Python 3.10.8"
    },
    "machine_info": {
        "architecture_name": "x86_64",
        "cpu_core_count": "12",
        "cpu_frequency_max_hz": "4800000000",
        "cpu_l1d_cache_bytes": "458752",
        "cpu_l1i_cache_bytes": "655360",
        "cpu_l2_cache_bytes": "9437184",
        "cpu_l3_cache_bytes": "18874368",
        "cpu_model_name": "12th Gen Intel(R) Core(TM) i7-1270P",
        "cpu_thread_count": "16",
        "gpu_count": "0",
        "gpu_product_names": [],
        "kernel_name": "6.0.11-300.fc37.x86_64",
        "memory_bytes": "32212254720",
        "name": "fedora",
        "os_name": "Linux",
        "os_version": "6.0.11-300.fc37.x86_64-x86_64-with-glibc2.36"
    },
    "optional_benchmark_info": {},
    "run_id": "afec3d7926ef478093d0098d8ebdb0a4",
    "stats": {
        "data": [
            "0.728294"
        ],
        "iqr": "0.000000",
        "iterations": 1,
        "max": "0.728294",
        "mean": "0.728294",
        "median": "0.728294",
        "min": "0.728294",
        "q1": "0.728294",
        "q3": "0.728294",
        "stdev": 0,
        "time_unit": "s",
        "times": [],
        "unit": "s"
    },
    "tags": {
        "cpu_count": null,
        "dataset": "nyctaxi_multi_parquet_s3",
        "name": "dataset-selectivity",
        "selectivity": "1%"
    },
    "timestamp": "2022-12-09T13:57:08.989601+00:00"
}

Interesting.

Thoughts:

  • no log output before the download, and the download took rather long
  • q1, q3, mean, median, stdev: these numbers are not meaningful with only one sample; maybe they should be null instead.
  • or: maybe iterations should default to more than 1, perhaps 6. Only then do the above-mentioned properties start to become meaningful.

Exploring cmdline options.

  --iterations INTEGER         [default: 1]

Did this again with 10 iterations. Got this:

...
    "stats": {
        "data": [
            "0.719354",
            "0.703909",
            "0.715284",
            "0.732232",
            "0.735586",
            "0.750671",
            "0.759200",
            "0.779739",
            "0.768349",
            "0.791532"
        ],
        "iqr": "0.043489",
        "iterations": 10,
        "max": "0.791532",
        "mean": "0.745586",
        "median": "0.743129",
        "min": "0.703909",
        "q1": "0.722573",
        "q3": "0.766062",
        "stdev": "0.029114",
        "time_unit": "s",
        "times": [],
        "unit": "s"
    },
...

Given all those aggregates, what is missing is the standard error of the mean, which we could use to plot meaningful error bars (that is a very canonical thing to do: plot the mean and then the standard error of the mean).

And then maybe there could be a printable result, like T = (0.7456 +/- 0.0092) s.
This uses the standard error of the mean, 0.029114 / math.sqrt(10).
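
For illustration, a minimal sketch (not Conbench code) of computing that summary from the per-iteration durations reported above:

# Compute mean +/- standard error of the mean (SEM) from the reported
# per-iteration durations.
import math
import statistics

durations_s = [
    0.719354, 0.703909, 0.715284, 0.732232, 0.735586,
    0.750671, 0.759200, 0.768349, 0.779739, 0.791532,
]

mean = statistics.mean(durations_s)
stdev = statistics.stdev(durations_s)      # sample standard deviation
sem = stdev / math.sqrt(len(durations_s))  # standard error of the mean

print(f"T = ({mean:.4f} +/- {sem:.4f}) s")  # T = (0.7456 +/- 0.0092) s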

Of course, the assumption that things are normally distributed might be flawed, and maybe the minimum value plus the volatility are more interesting than the mean value.

Anyway. Just exploring, I know some of this is super off topic from the purpose of this ticket.

@jgehrcke
Contributor

jgehrcke commented Dec 9, 2022

On topic. Looks like we want to use pyarrow.dataset.write_dataset() (docs).

Interestingly, there is no corresponding method on Dataset itself (docs).

write_dataset() takes a path to a directory. To focus this benchmark on serialization time, I think it makes sense not to write to a disk-backed file system, but to a tmpfs backed by machine-local RAM.

On Linux, /dev/shm/ can be used for that (a RAM-backed tmpfs that any user can use; also see here). It's just a little unclear to me on which platforms this benchmark needs to be able to work.
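
To handle the platform question, a small sketch of how the output directory could be chosen (purely illustrative, not existing benchmark code):

# Pick a RAM-backed tmpfs directory when available (Linux), otherwise fall
# back to the regular temporary directory.
import os
import tempfile

def output_base_dir() -> str:
    shm = "/dev/shm"
    if os.path.isdir(shm) and os.access(shm, os.W_OK):
        return tempfile.mkdtemp(dir=shm)  # RAM-backed on Linux
    return tempfile.mkdtemp()             # disk-backed fallback elsewhere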

@jgehrcke
Contributor

Yesterday I spent more time investigating the method to choose.

I investigated more about how the dataset/table/scanner abstractions in pyarrow actually behave.

Confirmed empirically that dataset and scanner abstractions can be used to build

reading -> deserializing -> filtering -> serializing -> writing

(RDFSW) while retaining only tiny chunks in memory. (A cool benchmark for this pipeline would confirm that memory usage stays tiny!)

I also found that this RDFSW pipeline is dominated by disk read I/O performance, given the kinds of datasets we use here. That means that RDFSW would be a boring, if not useless, benchmark: the write phase is probably shorter than the fluctuations in the read phase.

That is, towards the goal of benchmarking serialization and writing, I propose (see the sketch after the list):

  • read dataset into memory first
  • time serialization and writing
  • write to tmpfs: so that this is at least a rather constant contribution. If the write phase contributes significantly to the timing, then this benchmark is measuring hardware performance more than software performance.
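
A sketch of that structure (placeholder paths; only the write_dataset() call would sit inside the timed section):

# Read the source dataset fully into memory first (untimed), then time only
# serialization and writing to a RAM-backed tmpfs directory.
import time
import pyarrow.dataset as ds

table = ds.dataset("/path/to/source-dataset", format="parquet").to_table()

t0 = time.monotonic()
ds.write_dataset(table, base_dir="/dev/shm/bench-output", format="parquet")
print(f"serialize+write took {time.monotonic() - t0:.4f} s")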

I tweaked the timing reporting by the Conbench CLI a bit and discovered a cool library called sigfig: conbench/conbench#538

@jgehrcke
Contributor

That means that RDFSW would be a boring, if not useless, benchmark: the write phase is probably shorter than the fluctuations in the read phase.

I want to support that with data. For example, for a case ('10pc', 'feather') the read phase took:

read source dataset into memory in 1.4584 s

and the serialize/write phase took ~1/10th of that time:

        "data": [
            "0.258376",
            "0.204958",
            "0.185941",
            "0.179702",
            "0.178900",
            "0.171137"
        ],

@jgehrcke
Contributor

My attempt to summarize: I want to start adding a benchmark that focuses on serialization and writing-to-tmpfs, starting with the data already in memory.

That is, the timing of reading-from-whatever-disk-and-then-filtering does not contribute to the benchmark duration. That simplifies reasoning and allows for drawing stronger conclusions (compared with a benchmark that exercises the entire information flow).

Given that, on my machine, I see that writing the same kind of data

  • takes ~0.1 s for arrow, feather, ipc
  • takes ~4.0 s for csv
  • takes ~2.0 s for parquet

These are all with default settings, and I find the differences quite remarkable. From here, it's interesting to see how csv and parquet writing could be improved by changing parameters, as @joosthooz investigated elsewhere.

It's also interesting to write more data, and of course to see how this behaves in CI as opposed to on my machine. So, I am working towards completing the patch to have something to iterate on.
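
For reference, a minimal sketch of how such a format comparison could be run (placeholder paths; this is illustrative, not the actual benchmark implementation):

# Compare write_dataset() timing across output formats, starting from an
# in-memory table so that reading does not contribute to the timing.
import time
import pyarrow.dataset as ds

table = ds.dataset("/path/to/source-dataset", format="parquet").to_table()

for fmt in ("arrow", "feather", "ipc", "csv", "parquet"):
    t0 = time.monotonic()
    ds.write_dataset(table, base_dir=f"/dev/shm/bench-{fmt}", format=fmt)
    print(f"{fmt}: {time.monotonic() - t0:.2f} s")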

@jgehrcke
Contributor

Update: this ticket did, after all, motivate me to land a rather specific (focused) benchmark, now called dataset-serialize, which just ran in the context of arrow-benchmarks-ci and had its results posted to conbench.ursa.dev. Also see voltrondata-labs/arrow-benchmarks-ci#92 (comment).

@jonkeane
Contributor Author

I'm super glad to see the dataset-serialize benchmarks. Those are great + I learned a lot from the discussion in #124 around how to target the serialization itself.

As alluded to above, I do think it might still be worth doing an end-to-end version that interacts with the disk (even if that is dominated by disk reads (or writes)!), since that represents a non-trivial real-world chunk of work (e.g. I've got a raw dataset and I want to (re)partition it to fit a pattern that is useful for querying) that we want to make sure doesn't get slower.

@jgehrcke
Contributor

jgehrcke commented Jan 5, 2023

it might still be worth doing an end-to-end version that interacts with the disk [...]. Since that represents a non-trivial real-world chunk of work [...] that we want to make sure doesn't get slower.

The high level motivation is absolutely reasonable!

I have just spent a bit of time consolidating some thoughts around that. This goes a little too deep for this ticket, but I will write down my thoughts anyway.

When working with a multi-stage benchmark, two challenges make such a benchmark difficult to extract value from:

  1. The sensitivity for detecting performance regressions easily enters the 'boring' regime because of the noise from I/O volatility.
  2. If a performance regression is big enough to stand out from the noise, then one needs to do quite a bit of follow-up benchmarking to see which part exactly got slower (because the benchmark method is known to cover N consecutive steps/stages).

A multi-stage benchmark's noise level is the sum of the noise levels of the individual stages. The noise level of one of those stages may easily be larger than the expected duration of a single substage.

This is interesting to compare with testing. I like end-to-end functional/integration tests because the signal stays strong. That is, end-to-end tests do not suffer from signal weakening (1). With proper logging/debug information, end-to-end tests also often do not really suffer from (2) because there is insight into the complete flow. In contrast, the value of a benchmark quickly dilutes with the number of stages it covers.

I think a good strategy is therefore to build benchmarks that are known to be dominated by a certain stage, and to call that out.

I want to ack: there is value in doing end-to-end (multi-stage) benchmarking. An end-to-end benchmark can certainly serve as a sanity check that can uncover drastic performance changes. But there is far more value in covering individual stages via focused benchmarks.

Based on these thoughts, a recommendable strategy for benchmarking a multi-stage information flow is:
a) build benchmarks for individual stages, for a decent performance signal
b) build an end-to-end benchmark as a low-resolution sanity check; it can add value on top of (a) by catching a missed stage or an unaccounted-for interaction between stages
