Allow users to specify Packer's internal buf_size #547
Conversation
Giving this flexibility to users means that internal reallocations can be avoided if the buffer size is large enough, at the expense of potentially allocating more memory than needed. Together with reusing a Packer object, this means that multiple serialisations can end up requiring no memory allocations other than the initial buffer creation, which can be a big win in some situations. The default value is still 1MB, keeping this backwards compatible.

Signed-off-by: Rodrigo Tobar <[email protected]>
@methane a gentle ping to bring attention to this fairly simple PR I opened about 10 days ago. I saw that you have been contributing the most to this file lately, so I imagine you are a good person to review it.
Adding a public API is not "simple". I need to consider many aspects of it. If you want to help the review, please show a benchmark result that demonstrates how important reallocation is.
Reusing Packer objects also reduces buffer growth even without this change, so this doesn't provide a big win.
@methane thanks for the fast answer, much appreciated.
True, I think my previous statement was an oversimplification. I was mostly hoping that, since it was a well-constrained and backwards-compatible change, it wouldn't be too much of a burden to include it. Of course you will have a better view of what else adding this would entail.
True as well; I was thinking this would be more of a win for users of …
This is a smaller version of the benchmark I was running. We are using xarrays, but you can simplify it further if you simply use a numpy array. The version here doesn't include the …

```python
import msgpack
import msgpack_numpy as mnp
import numpy as np
import xarray as xa
import timeit
import functools
import seaborn as sns
import pandas as pd
import matplotlib.pyplot

to_dict = lambda x: x.to_dict(data="array")
mser = lambda x: msgpack.packb(x, default=mnp.encode)

packer = msgpack.Packer(default=mnp.encode, autoreset=False)
def mser_packer(x):
    packer.reset()
    packer.pack(x)
    return packer

all_benchmarks = {
    "write_only": {
        'Dataset -> dict + msgpack.packb': [to_dict, mser],
        'Dataset -> dict + msgpack.Packer': [to_dict, mser_packer],
    }
}

def benchmark_for_sizes(functions, nitems):
    results = []
    for nitem in nitems:
        arr = np.random.rand(nitem)
        xarr = xa.Dataset({"x": arr, "y": arr + 30})
        size = xarr.nbytes
        print(f"  {nitem=}, {size=}")
        timer = timeit.Timer('functools.reduce(lambda v, f: f(v), functions, xarr)',
                             setup="import functools", globals=locals())
        n_executions, total_duration = timer.autorange()
        duration = total_duration / n_executions
        results.append((size, duration))
    return results

def run_benchmarks(nitems, benchmarks):
    results = {}
    for name, functions in benchmarks.items():
        print(f"Benchmarking {name}")
        results[name] = benchmark_for_sizes(functions, nitems)
    flat_results = list((name, size, duration)
                        for name, values in results.items()
                        for size, duration in values)
    df = pd.DataFrame(flat_results, columns=("Benchmark", "Size", "Duration"))
    df['Size [MB]'] = df['Size'] / 1024 / 1024
    df['Speed [MB/s]'] = df['Size'] / 1024 / 1024 / df["Duration"]
    return df

def run_all():
    sns.set_theme()
    nitems = list(range(10000000, 2000000, -120000))
    dfs = []
    for group, benchmarks in all_benchmarks.items():
        df = run_benchmarks(nitems, benchmarks)
        df["Group"] = group
        dfs.append(df)
    df = pd.concat(dfs)
    sns.relplot(data=df, x='Size [MB]', y='Speed [MB/s]', kind='line',
                hue='Benchmark', col='Group')
    matplotlib.pyplot.savefig('benchmark_results.png')

if __name__ == '__main__':
    run_all()
```
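As a side note, the same comparison can be reduced to a dependency-free sketch: no xarray or msgpack_numpy, with a raw bytes payload standing in for the encoded array. The function names here are purely illustrative.

```python
import timeit
import msgpack

payload = b"\x00" * (8 * 1024 * 1024)  # ~8 MB stand-in for an encoded array

packer = msgpack.Packer(autoreset=False)

def with_packb():
    # Allocates and grows a fresh internal buffer on every call.
    return msgpack.packb(payload)

def with_reused_packer():
    # Reuses the Packer's buffer; once it has grown, no further reallocation.
    packer.reset()
    packer.pack(payload)
    return packer.bytes()

for fn in (with_packb, with_reused_packer):
    n_runs, total = timeit.Timer(fn).autorange()
    print(f"{fn.__name__}: {total / n_runs * 1e3:.2f} ms/call")
```

Both functions produce identical msgpack output; only the allocation behaviour differs, which is what the throughput gap in the plot above reflects.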
@methane: a gentle ping to learn whether you have any further thoughts on accepting this change, given the evidence above and the hopefully small maintenance burden?
Thank you for the detailed report.
There is no schedule for v1.1 yet. |
Thanks @methane, much appreciated |
Committed in #604.