
Excessive memory consumption during SIA-PA download #64

Open
heber-augusto opened this issue Jan 5, 2022 · 10 comments


heber-augusto commented Jan 5, 2022

There are some SIA-PA files which are huge, and it seems that converting them from dbc to parquet consumes too much memory.
Although the following example is somewhat related to issue #27 (which occurs when a given month has a lot of data), to reproduce the error I downloaded the file manually from the DATASUS FTP and called only the function that is failing.

from pysus.utilities.readdbc import dbc2dbf, read_dbc

infile = 'PASP2003a.dbc'
outfile = 'PASP2003a.dbf'

# decompressing the .dbc into a .dbf
dbc2dbf(infile, outfile)

# the exception is raised by the line below, while loading the
# whole table into a DataFrame
df = read_dbc(infile)

I was using Google Colab and even subscribed to the Pro version for a month to increase the memory limit, but even so, it wasn't enough.

I don't know how to solve this exactly for the dbf file. With CSVs and pandas, I usually read the file in chunks to avoid excessive memory consumption, as in the sketch below.
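
For reference, this is the chunked pattern I mean with pandas (a minimal sketch; the file name and chunk size are placeholders):

import pandas as pd

# Read the CSV in fixed-size chunks instead of loading it all at once;
# each chunk can be processed and discarded before the next one is read.
parts = []
for chunk in pd.read_csv('PASP2003a.csv', chunksize=100_000):
    parts.append(chunk)  # or filter/aggregate here and keep only the result
df = pd.concat(parts, ignore_index=True)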


fccoelho self-assigned this Jan 9, 2022

fccoelho commented Jan 9, 2022

Thanks for reporting this bug, @heber-augusto.
We already apply a chunked solution to the COVID vaccination data.

The solution shouldn't be hard to implement; the sketch below shows the general idea.
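
A minimal sketch of that idea, assuming dbfread (which reads DBF records lazily) and a parquet engine such as pyarrow; the file name, encoding, and batch size are illustrative, not PySUS's actual code:

from dbfread import DBF
import pandas as pd

# Iterate over the DBF lazily and write parquet in parts, so the full
# table never has to sit in memory at once.
table = DBF('PASP2003a.dbf', encoding='iso-8859-1')
batch, part = [], 0
for record in table:
    batch.append(record)
    if len(batch) == 100_000:
        pd.DataFrame(batch).to_parquet(f'PASP2003a-{part}.parquet')
        batch, part = [], part + 1
if batch:
    pd.DataFrame(batch).to_parquet(f'PASP2003a-{part}.parquet')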


fccoelho commented Feb 1, 2022

Datasus now decided to split large files into <filename>a.dbc, <filename>b.dbc, etc. The download function needs to be adapted to look for these variant names; a sketch of the lookup is below.
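
A hypothetical sketch of that lookup: list the FTP directory once and collect every part of a split file. The host is DATASUS's public FTP; the directory path and file prefix are illustrative:

from ftplib import FTP

ftp = FTP('ftp.datasus.gov.br')
ftp.login()
ftp.cwd('/dissemin/publicos/SIASUS/200801_/Dados')  # illustrative path

# Collect PASP2003.dbc plus any split parts (PASP2003a.dbc, PASP2003b.dbc, ...).
prefix = 'PASP2003'
parts = sorted(
    name for name in ftp.nlst()
    if name.upper().startswith(prefix) and name.upper().endswith('.DBC')
)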

@heber-augusto

Hello @fccoelho, I wrote a possible solution for reading split files in issue #27, but I think the excessive memory consumption is a separate issue that exists regardless. I can help with the download function (issue #27).

I tried to contact the Datasus team to check whether they have a solution to avoid the excessive memory consumption, but I have not had an answer yet. :(


fccoelho commented Feb 2, 2022

Hi @heber-augusto, I have just opened a PR to fix this situation.

It also allows PySUS to automatically handle downloading of the a, b, c, … parts of large files when necessary.

As soon as it passes the review and gets merged, I'll make a new release of PySUS. There is also new documentation showing how to handle these large SIA files.


fccoelho commented Feb 2, 2022

> Hello @fccoelho, I wrote a possible solution for reading split files in issue #27, but I think the excessive memory consumption is a separate issue that exists regardless. I can help with the download function (issue #27).
>
> I tried to contact the Datasus team to check whether they have a solution to avoid the excessive memory consumption, but I have not had an answer yet. :(

Hi @heber-augusto, my PR solves issue #27 as well! Take a look at it.

@heber-augusto

Thanks a lot @fccoelho!

@heber-augusto

@fccoelho, I did a review and left some comments on the PR.

@heber-augusto

Hello @fccoelho, how are you?
I was testing the new version and discovered that the download of a single file (when the file is not split) is broken.

You can reproduce the error with the following code:

from pysus.online_data.SIA import download as download_sia

df_pa = download_sia(
    'ES',
    2020,
    3,
    cache=True,
    group=['PA'],
)

The exception message is "unpack requires a buffer of 32 bytes".

I am creating a new repo that uses PySUS to help with downloading files (forgive me if I am not using the correct reference yet; I will fix that soon) and I solved the problem with some customization. I found two possible bugs here:

  1. Line 151 from SIA.py is trying to create a dbf file, but there is no code that downloads the dbc file first (no ftp.retrbinary() and dbc2dbf() calls); see the first sketch below.
  2. I think Line 117 from SIA.py should be changed to if cache and (df is not None). Depending on the Python version, the present code may raise an exception; see the second snippet below.
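
To illustrate item 1, this is roughly the missing step (a hypothetical sketch; host, directory, and file name are illustrative, following the SIA naming pattern):

from ftplib import FTP
from pysus.utilities.readdbc import dbc2dbf

# Fetch the .dbc from the FTP before any .dbf is created or read.
ftp = FTP('ftp.datasus.gov.br')
ftp.login()
ftp.cwd('/dissemin/publicos/SIASUS/200801_/Dados')  # illustrative path
with open('PAES2003.dbc', 'wb') as fh:
    ftp.retrbinary('RETR PAES2003.dbc', fh.write)
dbc2dbf('PAES2003.dbc', 'PAES2003.dbf')

And for item 2, assuming Line 117 currently tests the DataFrame's truthiness directly (something like if cache and df:), the explicit comparison avoids the ambiguity error:

import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
cache = True

# 'if cache and df:' raises
#   ValueError: The truth value of a DataFrame is ambiguous.
# because a DataFrame has no single boolean value.
if cache and df is not None:
    print('safe to use the cached DataFrame')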

You can see how I solved the single file download inside this file; just take a look at _fetch_file(), download_single_file() and download_multiples(). Please note that there are some customizations here.

@fccoelho

Thanks for reporting this bug, @heber-augusto. I'll take a look when I have some time.
