
Very high memory usage with write_html.py #15

Open
ghost opened this issue Oct 11, 2019 · 10 comments


ghost commented Oct 11, 2019

On my system, running write_html.py without arguments requires too much memory and takes too long. After more than 30 minutes I had to stop it manually because my system had become unresponsive. Memory usage grew slowly but relentlessly until write_html.py had used all 8 gigabytes of RAM plus 5 gigabytes of swap.

My data directory is currently 3.1 gigabytes. It will continue to expand in the future because I'm always fetching new subreddits.

How can I help debug this?

P.S. My knowledge of Python is still very limited...

@libertysoft3 libertysoft3 added the bug Something isn't working label Oct 12, 2019
libertysoft3 (Owner) commented Oct 12, 2019

I think you could add something like this at the start of every function in write_html.py and try to determine which function grows memory the most:

import os
import psutil

# at the top of each function:
process = psutil.Process(os.getpid())
print('function X runs, memory used: %d KiB' % (process.memory_info().rss // 1024))
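If pulling in psutil is a hassle, a rough alternative is the standard library's tracemalloc module, which traces Python-level allocations. A minimal sketch (the decorator name is mine, not anything from write_html.py):

```python
import tracemalloc
from functools import wraps

def report_memory(func):
    """Print traced allocation stats after each call to func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        try:
            return func(*args, **kwargs)
        finally:
            current, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            print('%s: current %d KiB, peak %d KiB'
                  % (func.__name__, current // 1024, peak // 1024))
    return wrapper

@report_memory
def build_big_list():
    # stand-in for one of write_html.py's functions
    return list(range(100_000))
```

Note tracemalloc only sees Python object allocations, not the full RSS, so the psutil numbers are closer to what the OS reports.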

ghost commented Oct 12, 2019

I launched the following command for one minute only:

timeout 60 ./write_html.py > write_html.log 2>&1

Attachment: write_html.zip

libertysoft3 (Owner):
@fturco Awesome man, thanks for putting in the effort. I'll try to fix this this weekend.

libertysoft3 (Owner):
Can you try commenting out these two lines and seeing how the memory goes?

sub_links.append(l)
user_index[l['author']].append(l)

The script is intentionally loading all content for a sub into memory, so it's kind of a big logical failure. I've gotta rewrite a bit of it and maybe a lot of it. Not having all of the comments in memory at once might be enough to get by.
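The streaming version of that idea might look like the sketch below: read links one at a time and render each page immediately, instead of appending every row to sub_links first. This is a hedged illustration; the links.csv column names and the render callback are assumptions, not taken from the repository.

```python
import csv

def iter_links(csv_path):
    """Yield one link dict at a time instead of building a full list."""
    with open(csv_path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            yield row

def write_sub_pages(csv_path, render):
    # Peak memory now depends on a single row, not the whole subreddit.
    for link in iter_links(csv_path):
        render(link)  # write this link's HTML page, then discard the row
```

With this shape, memory stays flat no matter how large links.csv grows, at the cost of losing the in-memory list that the index pages are built from.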


ghost commented Nov 5, 2019

After commenting out those lines, write_html.py no longer uses too much memory.
But I noticed that the index.html files for each subreddit are now missing, so I can't browse the archives.


libertysoft3 commented Nov 12, 2019

Okay I pushed an update. Not everything was optimized, but I think it should be a lot better.

If it's still bad, can you try commenting out this line:

user_index[l['author']].append(l)


ghost commented Nov 12, 2019

I tried running write_html.py again after updating it, and it seems to have successfully generated all the HTML pages. While processing posts, my system peaked at 2.9 GiB of used RAM, then dropped back to 1.1 GiB after write_html.py finished. So that's a lot better.

I haven't yet tried commenting out the line you specified.
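For reference, a peak figure like this can also be read from inside Python itself on Unix via the standard resource module, without watching a system monitor (note ru_maxrss is reported in KiB on Linux but in bytes on macOS):

```python
import resource

# Peak resident set size of the current process so far.
usage = resource.getrusage(resource.RUSAGE_SELF)
print('peak RSS: %d KiB (on Linux)' % usage.ru_maxrss)
```

Printing this at the end of write_html.py would make the peak comparable across runs.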


ghost commented Nov 12, 2019

To give you a better idea, I have already archived 16 subreddits:

$ ./write_html.py
...
all done. 581830 links filtered to 581830

$ du -sm data
4715    data

libertysoft3 (Owner):
Thanks for the stats. Well, as it stands now, I'm basically loading all of the data/*/links.csv data into memory. So once you have 13 GB of link data in your archive (links only, not counting comments), you won't be able to generate the HTML at all.

So, uh, I don't know. Maybe we'll leave this open and I'll do more optimization in the future. Bug me when you get to 8 GB of data?
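One possible optimization along those lines would be to keep only each link's id in user_index rather than the full parsed dict, and re-read the matching rows when rendering user pages. A rough sketch with hypothetical field names (the real links.csv schema may differ):

```python
import csv
from collections import defaultdict

def build_user_index(csv_path):
    """Map author -> list of link ids, not full link dicts.

    Storing one short id string per link keeps the index small
    even when links.csv itself would not fit in memory.
    """
    user_index = defaultdict(list)
    with open(csv_path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            user_index[row['author']].append(row['id'])
    return user_index
```

This trades extra disk reads at user-page render time for a much smaller resident index.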


ghost commented Nov 13, 2019

OK, sure.
