[BE] Unify `buffer_size` across datapipes #335

pmeier · 2022-03-28T17:36:32Z

The buffer_size parameter is currently fairly inconsistent across datapipes:

name	default `buffer_size`	infinite `buffer_size`	warn on infinite
Demultiplexer	1e3	-1	yes
Forker	1e3	-1	yes
Grouper	1e4	N/A	N/A
Shuffler	1e4	N/A	N/A
MaxTokenBucketizer	1e3	N/A	N/A
UnZipper	1e3	-1	yes
IterKeyZipper	1e4	None	no

Here are my suggestion on how to unify this:

Use the same default buffer_size everywhere. It makes little difference whether we use 1e3 or 1e4 given that it is tightly coupled with the data we know nothing about. Given today's hardware / datasets, I would go with 1e4, but no strong opinion.
Give every datapipe with buffer the ability for an infinite buffer. Otherwise users will just be annoyed and use a workaround. For example, torchvision simply uses INFINITE_BUFFER_SIZE = 1_000_000_000, which for all intents and purposes lives up to its name. Which sentinel we use, i.e. -1 or None, again makes little difference. I personally would use None to have a clear separation, but again no strong opinion other than being consistent.
Do not warn on infinite buffer sizes. Especially since infinite buffer is not the default behavior, the user is expected to know what they are doing when setting buffer_size=None. I'm all for having a warning like this in the documentation, but I'm strongly against a runtime warning. For example, torchvision datasets need to use an infinite buffer everywhere. Thus, by using the infinite buffer sentinel, users would always get runtime warnings although neither them nor we did anything wrong.

The text was updated successfully, but these errors were encountered:

pmeier · 2022-04-05T07:55:53Z

@NivekT does the 👍 mean you agree with my suggestions and I can send a PR?

ejguan · 2022-04-05T13:37:15Z

@pmeier Sorry I missed this issue before. We are currently exploring a plan to add a BufferSpec class (configuration) for buffer. We should provide an instance for default value of all DataPipe. And, the infinite_buffer should be another special instance of such class. WDYT about this idea

pmeier · 2022-04-05T13:39:54Z

We are currently exploring a plan to add a BufferSpec class (configuration) for buffer

Is there a design document / issue for this? If not, what would a an instance of this class contain?

ejguan · 2022-04-05T13:45:57Z

We are currently exploring a plan to add a BufferSpec class (configuration) for buffer

Is there a design document / issue for this? If not, what would a an instance of this class contain?

Not for now. This is a requirement from internal usage that they want to provide some special buffer instance because they would be able to replace a certain instance of DataPipe in the graph but using the BufferSpec instance. For OSS side, this class is really straight forward.

class BufferSpec:
    buffer_size: int

If users use an integer as an argument, we should automatically change it into an instance of BufferSpec.

pmeier · 2022-04-05T14:04:31Z

I can't comment on the internal use case, but if the public API still accepts int's and any sentinel for the infinite buffer, I'm ok with that.

ejguan · 2022-04-05T14:09:40Z

Yeah. And, for this issue, we should definitely unify the buffer_size.
For the infinite one, I agree None is better than -1. WDYT whether or not we should raise Error for negative value. If not raising value, we are kind supporting both sentinels for infinite.

pmeier · 2022-04-05T14:13:17Z

WDYT whether or not we should raise Error for negative value. If not raising value, we are kind supporting both sentinels for infinite.

Let's not have two ways to do the same thing. If we decide on a sentinel, everything else should error if not isinstance(buffer_size, int) and buffer_size > 1.

ejguan · 2022-04-05T14:14:42Z

Makes sense to me

…#335

pmeier mentioned this issue Mar 30, 2022

add test to enforce infinite buffer size for all applicable datapipes pytorch/vision#5707

Merged

VitalyFedyunin changed the title ~~Unify buffer_size across datapipes~~ [BE] Unify buffer_size across datapipes Jul 6, 2022

VitalyFedyunin added the Better Engineering label Jul 6, 2022

alexanderbattig added a commit to alexanderbattig/data that referenced this issue Mar 11, 2023

Unify buffer_size, following the convention suggested in pytorch#335

90aa8bd

alexanderbattig added a commit to alexanderbattig/pytorch that referenced this issue Mar 11, 2023

Unify buffer_size, following the convention suggested in pytorch/data…

c3c7f5e

…#335

ejguan linked a pull request Mar 13, 2023 that will close this issue

Unify buffer_size across TorchData DataPipes #1077

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BE] Unify `buffer_size` across datapipes #335

[BE] Unify `buffer_size` across datapipes #335

pmeier commented Mar 28, 2022

pmeier commented Apr 5, 2022

ejguan commented Apr 5, 2022

pmeier commented Apr 5, 2022

ejguan commented Apr 5, 2022

pmeier commented Apr 5, 2022

ejguan commented Apr 5, 2022

pmeier commented Apr 5, 2022

ejguan commented Apr 5, 2022

[BE] Unify buffer_size across datapipes #335

[BE] Unify buffer_size across datapipes #335

Comments

pmeier commented Mar 28, 2022

pmeier commented Apr 5, 2022

ejguan commented Apr 5, 2022

pmeier commented Apr 5, 2022

ejguan commented Apr 5, 2022

pmeier commented Apr 5, 2022

ejguan commented Apr 5, 2022

pmeier commented Apr 5, 2022

ejguan commented Apr 5, 2022

[BE] Unify `buffer_size` across datapipes #335

[BE] Unify `buffer_size` across datapipes #335