Memory consumption: Steps forward #182

Closed
3 tasks done
reginafcompton opened this issue Mar 1, 2018 · 2 comments
reginafcompton commented Mar 1, 2018

The Councilmatic server had significant memory issues, beginning around midnight on February 28. /var/log/syslog shows that the kernel's OOM killer started to kill processes (python invoked oom-killer) around 12:50 (06:50 in the syslog timestamps below), shortly after the LA Metro cron (40 past the hour) and the Chicago cron (45 past the hour) started.

Mar  1 06:40:01 ip-10-0-0-124 CRON[849]: (datamade) CMD (/usr/bin/flock -n /tmp/lametro_dataload.lock -c 'cd $APPDIR && $PYTHONDIR manage.py import_data >> /tmp/lametro-loaddata.log 2>&1 && $PYTHONDIR manage.py compile_pdfs >> /tmp/lametro-compilepdfs.log 2>&1 && $PYTHONDIR manage.py update_index >> /tmp/lametro-updateindex.log 2>&1 && $PYTHONDIR manage.py data_integrity >> /tmp/lametro-integrity.log')

...

Mar  1 06:45:01 ip-10-0-0-124 CRON[2042]: (datamade) CMD (/usr/bin/flock -n /tmp/chicago_dataload.lock -c 'cd $APPDIR && $PYTHONDIR manage.py import_data >> /tmp/chicago-loaddata.log 2>&1 && $PYTHONDIR manage.py update_index --batch-size=50 --age=1 >> /tmp/chicago_updateindex.log 2>&1 && $PYTHONDIR manage.py send_notifications >> /tmp/chicago_sendnotifications.log 2>&1')

...

Mar  1 06:50:01 ip-10-0-0-124 kernel: [6173922.727963] python invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
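The correlation described above (cron jobs starting a few minutes before the OOM kill) can also be checked mechanically. What follows is a minimal sketch, not something from the original investigation: the function names `parse_line` and `cron_starts_before_oom`, the 15-minute default window, and the `year` argument (syslog timestamps omit the year) are all assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches the classic syslog prefix: "Mar  1 06:50:01 host rest-of-line".
SYSLOG_LINE = re.compile(r"^(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+(.*)$")


def parse_line(line, year):
    """Return (timestamp, message) for a syslog line, or None if it does not match."""
    match = SYSLOG_LINE.match(line)
    if match is None:
        return None
    stamp = " ".join(match.group(1).split())  # collapse the padded day ("Mar  1")
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, match.group(2)


def cron_starts_before_oom(lines, year, window_minutes=15):
    """For each oom-killer event, list the CRON commands started within the window before it."""
    window = timedelta(minutes=window_minutes)
    cron_starts = []  # (timestamp, message) for every CRON "CMD" line seen so far
    events = []
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue
        when, message = parsed
        if message.startswith("CRON[") and "CMD (" in message:
            cron_starts.append((when, message))
        elif "invoked oom-killer" in message:
            recent = [m for t, m in cron_starts if timedelta(0) <= when - t <= window]
            events.append((when, recent))
    return events
```

Passing an open file works as well as a list of strings, for example `cron_starts_before_oom(open("/var/log/syslog"), year=2018)`. Each returned pair holds the time of one oom-killer event and the cron commands that started shortly before it.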

@evz and I rebooted the server and then watched memory consumption as the cron tasks executed. We noticed that the LA Metro update_index process required considerable memory: the Jetty (Java) process consumed about 15% of memory, and that doubled to around 30% while data was being inserted into the Solr index. Such memory use could be hazardous if it overlaps with other indexing processes (i.e., NYC and Chicago).
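Memory can be watched this way without eyeballing `top` by sampling `ps` while the cron tasks run. This is only a sketch under stated assumptions: the helper names `parse_ps` and `top_memory_consumers` are hypothetical, and it assumes a Linux `ps` that accepts `-eo pid,pmem,comm`.

```python
import subprocess


def parse_ps(output):
    """Parse `ps -eo pid,pmem,comm` output into (pid, percent_mem, command) tuples."""
    rows = []
    for line in output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        pid, pmem, command = parts
        rows.append((int(pid), float(pmem), command.strip()))
    return rows


def top_memory_consumers(limit=5):
    """Return the `limit` processes using the largest share of physical memory."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pmem,comm"],
        capture_output=True, text=True, check=True,
    )
    rows = parse_ps(result.stdout)
    return sorted(rows, key=lambda row: row[1], reverse=True)[:limit]
```

Calling `top_memory_consumers()` every few seconds during an update_index run would show whether the Java process grows in the way described here.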

We identified several next steps:


Additionally, we noticed that the RTF conversion script for NYC sometimes takes longer than 15 minutes to complete, which delays NYC data imports. Let's replace the RTF-to-HTML conversion with the actual PDFs. It should be possible via this PR.

@reginafcompton reginafcompton self-assigned this Mar 1, 2018
reginafcompton commented

What is age?

The SearchIndex class provides structured data to the search engine. (Note: the search engine is document-based: each document is a single text blob that gets tokenized, analyzed, and indexed, much like a key-value store.) A SearchIndex subclass can define a get_updated_field method, which tells Haystack which model field holds the "updated" timestamp. For Councilmatic, the bill model has an updated_at field, and we tell Haystack all about it. Hence, we can use the --age argument, which limits update_index to records updated within the given number of hours.
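The effect of --age can be illustrated with a small standalone model. This is a sketch of the idea only, not Haystack's implementation: `BillIndexSketch` is a hypothetical stand-in for a SearchIndex subclass, `select_for_reindex` is an invented helper, and records are plain dictionaries rather than Django model instances.

```python
from datetime import datetime, timedelta


class BillIndexSketch:
    """Stand-in for a Haystack SearchIndex subclass (hypothetical, for illustration)."""

    def get_updated_field(self):
        # Name of the model field holding the "last updated" timestamp.
        return "updated_at"


def select_for_reindex(index, records, age_hours, now):
    """Keep only records whose updated field falls within the last `age_hours` hours."""
    field = index.get_updated_field()
    cutoff = now - timedelta(hours=age_hours)
    return [record for record in records if record[field] >= cutoff]
```

With `age_hours=1`, only bills touched in the last hour are selected, which mirrors what `update_index --age=1` does in the Chicago cron. Without an age limit every record is reindexed, so the cost grows with the size of the whole table rather than with the size of the latest import.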

It looks like Chicago had a big data import day, resulting in some unusually large bill counts. I queried our Councilmatic database for bills updated in the last hour and found 1704. This number matches what I saw in the update_index log (also 1704).

In short, the --age argument works as expected and should be adopted in LA Metro (and in any other Councilmatic instances, including staging sites, that do not yet use it).

reginafcompton commented

Closing - I moved the last bullet point to issue #184
