Solr service entering unrecoverable state under load #3603
Comments
Ruminating on potential causes here: Solr is running in Fargate, and each pod will only get 20GB of attached ephemeral storage. I think our current config tries to store a copy of every shard on every replica. That would mean that the total capacity of the cluster won't go above 20GB, no matter how many replicas we add. Assuming that is the problem (which I don't know yet without more investigation) here are a couple options:
We discovered that the catalog Solr index was not ~300GB, as previously rumored, but more like ~30GB. That made this option much more viable:
@nickumia-reisys increased the number of nodes in the SolrCloud, split our collection into shards, then set the number of replicas for each shard to 2. With this configuration, each individual node holds only part of the collection, but each part is replicated across at least two nodes for resiliency. Since making this change, we have been able to complete a sync of the catalog Solr index and run load testing without any interruption in the availability of the Solr service.
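For anyone who wants to sanity-check the storage arithmetic, here is a rough sketch comparing the two layouts. The 20GB per-pod limit and the ~30GB index size come from this thread; the node and shard counts are made-up illustrative values, not our actual configuration, and the helper `per_node_usage_gb` is hypothetical. It also assumes equal-sized shards spread evenly across nodes, which Solr does not guarantee.

```python
import math

NODE_DISK_GB = 20.0  # Fargate ephemeral storage per pod (from this thread)
INDEX_GB = 30.0      # approximate catalog index size (from this thread)


def per_node_usage_gb(index_gb, num_nodes, num_shards, replication_factor):
    """Estimate disk usage on the busiest node.

    Assumes equal-sized shards and shard replicas spread as evenly as
    possible across nodes (a simplification for illustration).
    """
    if replication_factor > num_nodes:
        raise ValueError("cannot place more replicas of a shard than there are nodes")
    shard_gb = index_gb / num_shards
    total_replicas = num_shards * replication_factor
    replicas_on_busiest_node = math.ceil(total_replicas / num_nodes)
    return shard_gb * replicas_on_busiest_node


# Old layout: every node holds a full copy, so adding nodes adds no capacity.
# (3 nodes is an assumed, illustrative count.)
full_copy = per_node_usage_gb(INDEX_GB, num_nodes=3, num_shards=1, replication_factor=3)

# New layout: shards with 2 replicas each.
# (4 shards over 4 nodes is an assumed, illustrative count.)
sharded = per_node_usage_gb(INDEX_GB, num_nodes=4, num_shards=4, replication_factor=2)

print(f"full copy per node: {full_copy} GB, fits: {full_copy <= NODE_DISK_GB}")
print(f"sharded per node:   {sharded} GB, fits: {sharded <= NODE_DISK_GB}")
```

Under these assumptions, the full-copy layout needs the whole index on every node, which exceeds the per-pod disk, while the sharded layout with two replicas per shard stays under it. Real usage will be higher during merges and shard splits, so treat this as a lower bound.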
How to reproduce
ckan search-index rebuild -i -o
Expected behavior
The index rebuild operation completes uneventfully.
Actual behavior
The Solr service goes into a bad state, and gives 500 errors.
Sketch