
Solr process eventually runs out of memory #538

Closed
hancush opened this issue Jan 16, 2020 · 12 comments

Comments

@hancush
Collaborator

hancush commented Jan 16, 2020

Offshoot of #534, related to #535.

After running without issue for about seven months, the production Solr process ran out of memory and could no longer accept new updates. Restarting the process freed up enough memory to resolve the issue; however, that is only a temporary fix.

By default, Solr caps memory use ("heap size") at about half a gig. The docs suggest that this will be insufficient for most production setups. We probably don't need the recommended 10GB, but a middle ground may be more appropriate, especially given the size of our documents.

This thread contains some guidance on getting a handle on the memory consumption of Solr processes. This may help us determine a saner value.

This post also looks like a good resource on growing the heap size and other ways of addressing the problem.
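
For reference, heap numbers are also exposed outside the admin UI via the metrics API (added in Solr 6.4), which should make spot checks easier. A minimal sketch, assuming Solr is reachable at localhost:8983; the exact JSON shape varies a bit across versions, so this just dumps the relevant section:

```python
# Quick spot check of JVM heap use via the Solr metrics API (Solr 6.4+).
import json
import requests

resp = requests.get(
    "http://localhost:8983/solr/admin/metrics",
    params={"group": "jvm", "prefix": "memory.heap"},
)
print(json.dumps(resp.json().get("metrics"), indent=2))
```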

@hancush
Collaborator Author

hancush commented Feb 17, 2020

This happened again. I'd like to escalate the priority of this issue.

@hancush
Collaborator Author

hancush commented Feb 17, 2020

Potentially related: datamade/django-councilmatic#205

@hancush
Collaborator Author

hancush commented Feb 18, 2020

This article suggests that frequent updates require a bigger heap size. The staging Solr index is updated once per day. The production Solr index is updated every 15 minutes, or 96 times per day, fully a quarter of which reindex every bill in the database. That could be one reason we're seeing this on production, but not staging.

I monitored the production Solr instance while a full reindex was taking place. Heap use hovered between 40 and 60% of the allocated memory (half a gig). That doesn't seem high enough to cause an out-of-memory error, so I wonder if there's a leak somewhere that gradually increases heap use. In that case, increasing the heap size would only be a band-aid. I've increased the heap size on a branch, but I'd like to hold off on merging and check in once a week for a few weeks to get a handle on whether heap use is creeping up, or whether our errors come from a sudden shock to the system.
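
For the weekly check, I'm thinking of something along these lines, run from cron: poll the metrics API and append a timestamped snapshot to a log so we can compare week over week. A rough sketch (the log path and host are placeholders, and this assumes Solr 6.4+ for the metrics API):

```python
# Append a timestamped heap snapshot to a log for week-over-week comparison.
import datetime
import json
import requests

resp = requests.get(
    "http://localhost:8983/solr/admin/metrics",
    params={"group": "jvm", "prefix": "memory.heap"},
)
snapshot = json.dumps(resp.json().get("metrics"))

with open("/var/log/solr-heap-checks.log", "a") as f:
    f.write("{} {}\n".format(datetime.datetime.utcnow().isoformat(), snapshot))
```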

@fgregg
Collaborator

fgregg commented Feb 18, 2020

This does sound like a memory leak. The first thing I would try in this case is to upgrade Solr.

@fgregg
Collaborator

fgregg commented Feb 18, 2020

I think your monitoring plan is also good.

@hancush
Collaborator Author

hancush commented Mar 16, 2020

This happened again after three weeks.

@hancush
Collaborator Author

hancush commented Jun 24, 2020

Yikes, this happened again on the new server. I'd like to escalate this issue in the next month or two.

@hancush
Collaborator Author

hancush commented Aug 12, 2020

Whoa! This blog post is very, very helpful for tuning Solr's memory needs. In particular, it offers an explanation of how Solr uses memory. Most notably:

As you can see, a large portion of heap memory is used by multiple caches... The maximum memory a cache uses is controlled by individual cache size configured in solrconfig.xml.

So, a compelling explanation for why production index updates eventually fail is that Solr's various caches grow large enough that there is no longer sufficient heap space left to apply updates. This would also explain why restarting Solr frees up space. And since the staging site is used far less than the production site, it would also explain why we don't see this on staging.

I think solving this will take a combination of limiting the maximum size of the caches and perhaps giving the production Solr instance a bit more memory to work with. I'll continue reading and update this thread.
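
To make the first part concrete, the cache limits live in the <query> section of solrconfig.xml. A sketch of the kind of change I have in mind (the element names are standard Solr, but these particular sizes are placeholders until we've looked at our actual cache stats):

```xml
<query>
  <!-- Sizes below are illustrative, not recommendations. -->
  <filterCache class="solr.FastLRUCache" size="256" initialSize="64" autowarmCount="32"/>
  <queryResultCache class="solr.LRUCache" size="256" initialSize="64" autowarmCount="32"/>
  <!-- Roughly: documentCache size x average document size = heap spent on cached docs. -->
  <documentCache class="solr.LRUCache" size="256" initialSize="64" autowarmCount="0"/>
</query>
```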

@hancush
Collaborator Author

hancush commented Aug 12, 2020

You can view stats on the various caches in the Solr admin by selecting your core in the left-hand menu, then navigating to Plugins / Stats > Cache.
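
The same numbers are also available from the command line via the metrics API, which is handy for spot checks. A sketch, assuming Solr 6.4+ at localhost:8983:

```python
# Dump searcher cache stats (documentCache, filterCache, queryResultCache, ...)
# for every core; the same figures shown under Plugins / Stats > Cache.
import json
import requests

resp = requests.get(
    "http://localhost:8983/solr/admin/metrics",
    params={"group": "core", "prefix": "CACHE.searcher"},
)
print(json.dumps(resp.json().get("metrics"), indent=2))
```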

The big one for us is the document cache, which is at its maximum size of 512 entries. With 2,514 docs totaling 652.8 MB, we can estimate that each of our documents weighs about 0.25 MB. That means our document cache is around 128 MB, or a quarter of our available heap space.

There are some items in the query and filter caches as well, but neither is close to full. According to this article, those are the caches that can potentially get quite big. I could spend a lot more time spelunking here, but I think we'd see diminishing returns on precision relative to the time spent.

I'm going to bump the production Solr heap size up to 1 GB (double its current allocation), continue monitoring, and update this thread.
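
For the record, the change itself is small, assuming production Solr is started through bin/solr with the stock include script (the exact file location depends on the install):

```sh
# solr.in.sh (path varies; /etc/default/solr.in.sh on a typical service install)
# Double the JVM heap from the ~512 MB default.
SOLR_HEAP="1g"

# One-off equivalent when restarting by hand:
#   bin/solr restart -m 1g
```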

@jeancochrane
Contributor

Excellent research so far! Three questions:

  1. Do you have a sense of what causes the cache to expand? Is the cache populated by index updates, or only by direct queries?
  2. Is there a way for us to automatically expire the cache on a schedule, as we do with Django?
  3. Are there any opportunities for us to add automated monitoring and alerting for heap usage? It seems like Solr offers metrics reporting and a logging API; is there a way we could perhaps hook these into Sentry?

@hancush
Collaborator Author

hancush commented Aug 18, 2020

Thank you for these excellent prompts, @jeancochrane! I've increased Solr's memory in production, and I'll keep an eye on this issue. If we wind up needing further work, I'll start with these questions.
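
If we do pick this back up, question 3 seems like the most tractable starting point. A very rough sketch of what that could look like: a scheduled job that polls the metrics API and reports to Sentry via sentry-sdk when heap use crosses a threshold. The 90% cutoff, DSN, and response parsing are all placeholder assumptions; the JSON shape here matches Solr 7.x's compact output.

```python
import requests
import sentry_sdk

sentry_sdk.init(dsn="https://<key>@sentry.io/<project>")  # placeholder DSN

resp = requests.get(
    "http://localhost:8983/solr/admin/metrics",
    params={"group": "jvm", "prefix": "memory.heap"},
)
heap = resp.json()["metrics"]["solr.jvm"]  # assumes compact (Solr 7.x) output

used, heap_max = heap["memory.heap.used"], heap["memory.heap.max"]
if heap_max and used / heap_max > 0.9:  # arbitrary 90% threshold
    sentry_sdk.capture_message(
        "Solr heap use at {:.0%} of {} bytes".format(used / heap_max, heap_max)
    )
```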

@antidipyramid
Collaborator

Closing since we're on ElasticSearch now 🙂
