Excessive memory consumption during SIA-PA download #64
Thanks for reporting this bug @heber-augusto. The solution shouldn't be hard to implement; Datasus now decided to split large files into multiple smaller parts.
Hello @fccoelho, I wrote a possible solution to read split files inside issue #27, but I think the excessive memory consumption is an issue that exists regardless. I can help with the download function (issue #27); something along the lines of the sketch below. I tried to contact the Datasus team to check if they have a solution to avoid excessive memory consumption but haven't had an answer yet. :(
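(A rough illustration of the split-file reading idea, not the exact code from issue #27: the part file names below are hypothetical, and read_dbc is PySUS's reader for DATASUS .dbc files.)

```python
# Illustration only: file names are hypothetical placeholders for the
# a, b, ... pieces that Datasus produces for one large month.
import pandas as pd
from pysus.utilities.readdbc import read_dbc

parts = ["PASP2008a.dbc", "PASP2008b.dbc"]
df = pd.concat(
    (read_dbc(p, encoding="iso-8859-1") for p in parts),
    ignore_index=True,
)
```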
Hi @heber-augusto, I have just opened a PR to fix this situation. It also allows PySUS to automatically handle downloading the a, b, c, … parts of large files when necessary. As soon as it passes review and gets merged, I'll make a new release of PySUS. There is also new documentation showing how to handle these large SIA files.
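(For illustration only: a hypothetical call using PySUS's SIA download helper; the exact signature may vary between releases, so treat this as a sketch of the intended behavior rather than the PR's confirmed API.)

```python
# Hypothetical post-PR usage: the split a, b, c parts of a large month
# are assumed to be fetched and stitched together transparently.
from pysus.online_data.SIA import download

df = download('SP', 2008, 1)  # state, year, month; signature is an assumption
```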
Hi @heber-augusto, my PR solves issue #27 as well! Take a look at it.
Thanks a lot @fccoelho!
Hello @fccoelho, how are you? You can reproduce the error with the following code:
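(A minimal sketch, not the original snippet, which was lost: it assumes the failing call is PySUS's read_dbc on a large SIA-PA file fetched manually from the FTP; the file name is hypothetical.)

```python
# Sketch only: replace the hypothetical file name with a large SIA-PA .dbc
# file downloaded manually from the DATASUS FTP.
from pysus.utilities.readdbc import read_dbc

# read_dbc loads the entire decompressed file into a single DataFrame,
# which is where memory consumption blows up on the big monthly files.
df = read_dbc("PASP2008a.dbc", encoding="iso-8859-1")
print(df.shape)
```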
The exception message is "unpack requires a buffer of 32 bytes". I am creating a new repo using PySUS to help with downloading files (forgive me if I am not using the correct reference yet; I will do that soon) and solved the problem with some customization. I found two possible bugs here.
You can see how I solved the single file download inside this file, just take a look at …
Thanks for reporting this bug, @heber-augusto. I'll take a look when I have some time.
There are some SIA-PA files which are huge, and it seems that converting them from dbc to parquet consumes too much memory.
Although the following example is somewhat related to issue #27 (which occurs when a given month has a lot of data), to simulate the error I collected the file manually through the SUS FTP and called only the function that is failing.
I was using Google Colab and even subscribed to the Pro version for a month to increase the memory limit, but it wasn't enough.
I don't know exactly how to solve this with the dbf file. With CSVs and pandas, I usually read the file in chunks to avoid excessive memory consumption; a sketch of that idea applied here follows below.
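(A sketch of that chunked approach, assuming the .dbc has first been decompressed to .dbf, e.g. with pysus.utilities.readdbc.dbc2dbf; dbfread yields records lazily, so only one batch sits in memory at a time. File names are hypothetical.)

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from dbfread import DBF

CHUNK_SIZE = 100_000

def write_chunk(records, writer, path):
    # pandas may infer slightly different dtypes per chunk; casting every
    # batch to the first chunk's schema keeps the parquet file consistent.
    table = pa.Table.from_pandas(pd.DataFrame(records))
    if writer is None:
        writer = pq.ParquetWriter(path, table.schema)
    writer.write_table(table.cast(writer.schema))
    return writer

writer, chunk = None, []
# DBF iterates the file record by record instead of loading it whole.
for record in DBF("PASP2008a.dbf", encoding="iso-8859-1"):
    chunk.append(record)
    if len(chunk) == CHUNK_SIZE:
        writer = write_chunk(chunk, writer, "PASP2008a.parquet")
        chunk = []
if chunk:
    writer = write_chunk(chunk, writer, "PASP2008a.parquet")
if writer is not None:
    writer.close()
```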