
Excessive memory consumption during SIA-PA download #64

Open
heber-augusto opened this issue Jan 5, 2022 · 10 comments


heber-augusto commented Jan 5, 2022

There are some SIA-PA files which are huge, and it seems that converting them from dbc to parquet consumes too much memory.
Although the following example is somewhat related to issue #27 (which occurs when a given month has a lot of data), to reproduce the error I downloaded the file manually from the DATASUS FTP and called only the function that is failing.

from pysus.utilities.readdbc import dbc2dbf, read_dbc

infile = 'PASP2003a.dbc'
outfile = 'PASP2003a.dbf'

# decompressing the .dbc into a .dbf
dbc2dbf(infile, outfile)

# the exception is raised by the line below, while loading the
# whole table into a DataFrame
df = read_dbc(infile)

I was using Google Colab and even subscribed to the Pro version for a month to increase the memory limit, but even so, it wasn't enough.

I don't know how to solve this exactly for the dbf file. With CSVs and pandas, I usually read the file in chunks to avoid excessive memory consumption, as in the sketch below.
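
For reference, this is the chunked pattern I mean with pandas (a minimal sketch; the file name and chunk size are placeholders):

import pandas as pd

# Read the CSV in fixed-size chunks instead of loading it all at once;
# each chunk can be processed and discarded before the next one is read.
parts = []
for chunk in pd.read_csv('PASP2003a.csv', chunksize=100_000):
    parts.append(chunk)  # or filter/aggregate here and keep only the result
df = pd.concat(parts, ignore_index=True)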


fccoelho self-assigned this Jan 9, 2022

fccoelho commented Jan 9, 2022

Thanks for reporting this bug, @heber-augusto.
We already apply a chunked solution to the COVID vaccination data.

The solution shouldn't be hard to implement; the sketch below shows the general idea.
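
A minimal sketch of that idea, assuming dbfread (which reads DBF records lazily) and a parquet engine such as pyarrow; the file name, encoding, and batch size are illustrative, not PySUS's actual code:

from dbfread import DBF
import pandas as pd

# Iterate over the DBF lazily and write parquet in parts, so the full
# table never has to sit in memory at once.
table = DBF('PASP2003a.dbf', encoding='iso-8859-1')
batch, part = [], 0
for record in table:
    batch.append(record)
    if len(batch) == 100_000:
        pd.DataFrame(batch).to_parquet(f'PASP2003a-{part}.parquet')
        batch, part = [], part + 1
if batch:
    pd.DataFrame(batch).to_parquet(f'PASP2003a-{part}.parquet')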


fccoelho commented Feb 1, 2022

Datasus now decided to split large files into <filename>a.dbc, <filename>b.dbc, etc. The download function needs to be adapted to look for these variant names; a sketch of the lookup is below.
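
A hypothetical sketch of that lookup: list the FTP directory once and collect every part of a split file. The host is DATASUS's public FTP; the directory path and file prefix are illustrative:

from ftplib import FTP

ftp = FTP('ftp.datasus.gov.br')
ftp.login()
ftp.cwd('/dissemin/publicos/SIASUS/200801_/Dados')  # illustrative path

# Collect PASP2003.dbc plus any split parts (PASP2003a.dbc, PASP2003b.dbc, ...).
prefix = 'PASP2003'
parts = sorted(
    name for name in ftp.nlst()
    if name.upper().startswith(prefix) and name.upper().endswith('.DBC')
)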

@heber-augusto

Hello @fccoelho, I wrote a possible solution for reading split files in issue #27, but I think the excessive memory consumption is a separate issue that exists regardless. I can help with the download function (issue #27).

I tried to contact the Datasus team to check whether they have a solution to avoid the excessive memory consumption, but I have not had an answer yet. :(


fccoelho commented Feb 2, 2022

Hi @heber-augusto, I have just opened a PR to fix this situation.

It also allows PySUS to automatically handle downloading of the a, b, c, … parts of large files when necessary.

As soon as it passes the review and gets merged, I'll make a new release of PySUS. There is also new documentation showing how to handle these large SIA files.


fccoelho commented Feb 2, 2022

> Hello @fccoelho, I wrote a possible solution for reading split files in issue #27, but I think the excessive memory consumption is a separate issue that exists regardless. I can help with the download function (issue #27).
>
> I tried to contact the Datasus team to check whether they have a solution to avoid the excessive memory consumption, but I have not had an answer yet. :(

Hi @heber-augusto, my PR solves issue #27 as well! Take a look at it.

@heber-augusto

Thanks a lot @fccoelho!

@heber-augusto

@fccoelho, I did a review and left some comments on the PR.

@heber-augusto

Hello @fccoelho, how are you?
I was testing the new version and discovered that the download of a single file (when the file is not split) is broken.

You can reproduce the error with the following code:

from pysus.online_data.SIA import download as download_sia

df_pa = download_sia(
    'ES',
    2020,
    3,
    cache=True,
    group=['PA'],
)

The exception message is "unpack requires a buffer of 32 bytes".

I am creating a new repo that uses PySUS to help with downloading files (forgive me if I am not using the correct reference yet; I will fix that soon) and I solved the problem with some customization. I found two possible bugs here:

  1. Line 151 from SIA.py is trying to create a dbf file, but there is no code that downloads the dbc file first (no ftp.retrbinary() and dbc2dbf() calls); see the first sketch below.
  2. I think Line 117 from SIA.py should be changed to if cache and (df is not None). Depending on the Python version, the present code may raise an exception; see the second snippet below.
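
To illustrate item 1, this is roughly the missing step (a hypothetical sketch; host, directory, and file name are illustrative, following the SIA naming pattern):

from ftplib import FTP
from pysus.utilities.readdbc import dbc2dbf

# Fetch the .dbc from the FTP before any .dbf is created or read.
ftp = FTP('ftp.datasus.gov.br')
ftp.login()
ftp.cwd('/dissemin/publicos/SIASUS/200801_/Dados')  # illustrative path
with open('PAES2003.dbc', 'wb') as fh:
    ftp.retrbinary('RETR PAES2003.dbc', fh.write)
dbc2dbf('PAES2003.dbc', 'PAES2003.dbf')

And for item 2, assuming Line 117 currently tests the DataFrame's truthiness directly (something like if cache and df:), the explicit comparison avoids the ambiguity error:

import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
cache = True

# 'if cache and df:' raises
#   ValueError: The truth value of a DataFrame is ambiguous.
# because a DataFrame has no single boolean value.
if cache and df is not None:
    print('safe to use the cached DataFrame')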

You can see how I solved the single file download inside this file; just take a look at _fetch_file(), download_single_file() and download_multiples(). Please note that there are some customizations here.

@fccoelho

Thanks for reporting this bug, @heber-augusto. I'll take a look when I have some time.
