More efficient row filtering #20

Open
andynsd opened this issue Jul 22, 2024 · 1 comment

Comments

@andynsd

andynsd commented Jul 22, 2024

I am using rowStart and rowEnd to filter rows, which works as advertised, but I am seeing some performance problems. It looks like the library assembles all of the data from the relevant row groups and then slices off the unwanted portion after the fact. If I want just a single row but my row group size is relatively large (e.g. 1 GB), the heap still grows very large, so there doesn't seem to be much memory benefit to using rowStart or rowEnd.

Looking through the code, it seems like the library could avoid holding onto the rows that fall outside the requested row window. Does this problem resonate at all? Are there any plans to make this more efficient? I might be able to get some bandwidth to help with a fix if it seems doable and useful.
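
To make the idea concrete, here is a minimal sketch of what "not holding onto rows outside the window" could look like. It is not the library's actual code: `collectWindow` and the `pages` input are hypothetical names, the column is assumed to be flat (one value per row), and the window is assumed to be half-open, `[rowStart, rowEnd)`.

```javascript
// Hypothetical sketch, not the library's implementation.
// `pages` is any iterable of decoded pages for ONE flat column,
// where each value corresponds to exactly one row.
function collectWindow(pages, rowStart, rowEnd) {
  const out = []
  let rowIndex = 0 // row index of the first value in the current page
  for (const page of pages) {
    const pageEnd = rowIndex + page.length
    if (pageEnd > rowStart && rowIndex < rowEnd) {
      // Copy only the part of this page that overlaps [rowStart, rowEnd)
      const from = Math.max(rowStart - rowIndex, 0)
      const to = Math.min(rowEnd - rowIndex, page.length)
      for (let i = from; i < to; i++) out.push(page[i])
    }
    rowIndex = pageEnd
    if (rowIndex >= rowEnd) break // later pages cannot overlap the window
  }
  return out
}

const pages = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
console.log(collectWindow(pages, 4, 7)) // [ 4, 5, 6 ]
```

If `pages` is produced lazily (for example by a generator that decodes one page at a time), only the current page and the requested rows need to be on the heap at once, rather than the whole row group.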

@platypii
Collaborator

This is absolutely something that I would like to see improved! There is already a rowLimit parameter on the readColumn function, which helps stop parsing early when not all of the rows are needed. But I agree that it could be improved.

One thing to be careful of is that raw column data may have a different length than the actual row start and end, because it gets assembled into lists and structs. That being said, I'm pretty sure that clever tricks could save significantly on heap size.
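
As an illustration of that length mismatch, here is a rough sketch for a single list column. In Parquet, a repetition level of 0 marks the first value of a new row, so a row window has to be translated into a value range before anything can be dropped. The function name `valueRangeForRows` is hypothetical, and the sketch assumes there are no nulls or empty lists, so that values line up one-to-one with repetition levels.

```javascript
// Hypothetical sketch, not the library's implementation.
// Maps a half-open row window [rowStart, rowEnd) to a half-open range of
// value indices, using repetition levels (0 = first value of a new row).
// Assumes no nulls or empty lists, so values[i] pairs with repetitionLevels[i].
function valueRangeForRows(repetitionLevels, rowStart, rowEnd) {
  const n = repetitionLevels.length
  let row = -1
  let start = n // stays at n if rowStart is past the last row
  let end = n // stays at n if rowEnd is past the last row
  for (let i = 0; i < n; i++) {
    if (repetitionLevels[i] === 0) {
      row++
      if (row === rowStart) start = i
      if (row === rowEnd) {
        end = i
        break
      }
    }
  }
  return { start, end }
}

// Rows: [[1, 2], [3], [4, 5, 6]] -> 3 rows but 6 raw values
const repetitionLevels = [0, 1, 0, 0, 1, 1]
const values = [1, 2, 3, 4, 5, 6]
const { start, end } = valueRangeForRows(repetitionLevels, 1, 3)
console.log(values.slice(start, end)) // [ 3, 4, 5, 6 ]
```

Here rows 1 and 2 cover four raw values, so slicing the raw column by row indices directly would return the wrong data. A real fix would also need to account for definition levels and nested structs, which this sketch leaves out.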

Contributions are most welcome! Happy to further discuss strategies here too.
