
Very high memory usage with write_html.py #15

Open
ghost opened this issue Oct 11, 2019 · 10 comments


ghost commented Oct 11, 2019

On my system, running write_html.py without arguments requires too much memory and takes too long. After more than 30 minutes I had to stop it manually because my system had become unresponsive. Memory usage grew slowly but relentlessly until write_html.py had used all 8 gigabytes of RAM plus 5 gigabytes of swap.

My data directory is currently 3.1 gigabytes. It will continue to expand in the future because I'm always fetching new subreddits.

How can I help debug this?

P.S. My knowledge of Python is still very limited...

@libertysoft3 libertysoft3 added the bug Something isn't working label Oct 12, 2019
libertysoft3 (Owner) commented Oct 12, 2019

I think you could add something like this at the start of every function in write_html.py and try to determine which function grows memory the most:

import os
import psutil

# at the top of each function:
process = psutil.Process(os.getpid())
print('function X runs, memory used: %d KiB' % (process.memory_info().rss // 1024))
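If pulling in psutil is a hassle, a rough alternative is the standard library's tracemalloc module, which traces Python-level allocations. A minimal sketch (the decorator name is mine, not anything from write_html.py):

```python
import tracemalloc
from functools import wraps

def report_memory(func):
    """Print traced allocation stats after each call to func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        try:
            return func(*args, **kwargs)
        finally:
            current, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            print('%s: current %d KiB, peak %d KiB'
                  % (func.__name__, current // 1024, peak // 1024))
    return wrapper

@report_memory
def build_big_list():
    # stand-in for one of write_html.py's functions
    return list(range(100_000))
```

Note tracemalloc only sees Python object allocations, not the full RSS, so the psutil numbers are closer to what the OS reports.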

ghost commented Oct 12, 2019

I launched the following command for one minute only:

timeout 60 ./write_html.py > write_html.log 2>&1

Attachment: write_html.zip

libertysoft3 (Owner):
@fturco Awesome man, thanks for putting in the effort. I'll try to fix this this weekend.

libertysoft3 (Owner):
Can you try commenting out these two lines and seeing how the memory goes?

sub_links.append(l)
user_index[l['author']].append(l)

The script is intentionally loading all content for a sub into memory, so it's kind of a big logical failure. I've gotta rewrite a bit of it and maybe a lot of it. Not having all of the comments in memory at once might be enough to get by.
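The streaming version of that idea might look like the sketch below: read links one at a time and render each page immediately, instead of appending every row to sub_links first. This is a hedged illustration; the links.csv column names and the render callback are assumptions, not taken from the repository.

```python
import csv

def iter_links(csv_path):
    """Yield one link dict at a time instead of building a full list."""
    with open(csv_path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            yield row

def write_sub_pages(csv_path, render):
    # Peak memory now depends on a single row, not the whole subreddit.
    for link in iter_links(csv_path):
        render(link)  # write this link's HTML page, then discard the row
```

With this shape, memory stays flat no matter how large links.csv grows, at the cost of losing the in-memory list that the index pages are built from.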


ghost commented Nov 5, 2019

After commenting out those lines, write_html.py no longer uses too much memory.
But I noticed that the index.html files for each subreddit are now missing, so I can't browse the archives.


libertysoft3 commented Nov 12, 2019

Okay I pushed an update. Not everything was optimized, but I think it should be a lot better.

If it's still bad, can you try commenting out this line:

user_index[l['author']].append(l)


ghost commented Nov 12, 2019

I tried running write_html.py again after updating it, and it seems to have successfully generated all the HTML pages. While processing posts, my system peaked at 2.9 GiB of used RAM, then dropped back to 1.1 GiB after write_html.py finished. So that's a lot better.

I haven't yet tried commenting out the line you specified.
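For reference, a peak figure like this can also be read from inside Python itself on Unix via the standard resource module, without watching a system monitor (note ru_maxrss is reported in KiB on Linux but in bytes on macOS):

```python
import resource

# Peak resident set size of the current process so far.
usage = resource.getrusage(resource.RUSAGE_SELF)
print('peak RSS: %d KiB (on Linux)' % usage.ru_maxrss)
```

Printing this at the end of write_html.py would make the peak comparable across runs.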


ghost commented Nov 12, 2019

To give you a better idea, I have already archived 16 subreddits:

$ ./write_html.py
...
all done. 581830 links filtered to 581830

$ du -sm data
4715    data

libertysoft3 (Owner):
Thanks for the stats. Well, as it stands now, I'm basically loading all of the data/*/links.csv data into memory. So once you have 13 GB of link data in your archive (links only, not counting comments), you won't be able to generate the HTML at all.

So, uh, I don't know. Maybe we'll leave this open and I'll do more optimization in the future. Bug me when you get to 8 GB of data?
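One possible optimization along those lines would be to keep only each link's id in user_index rather than the full parsed dict, and re-read the matching rows when rendering user pages. A rough sketch with hypothetical field names (the real links.csv schema may differ):

```python
import csv
from collections import defaultdict

def build_user_index(csv_path):
    """Map author -> list of link ids, not full link dicts.

    Storing one short id string per link keeps the index small
    even when links.csv itself would not fit in memory.
    """
    user_index = defaultdict(list)
    with open(csv_path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            user_index[row['author']].append(row['id'])
    return user_index
```

This trades extra disk reads at user-page render time for a much smaller resident index.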


ghost commented Nov 13, 2019

OK, sure.
