Solr service entering unrecoverable state under load #3603

Closed
mogul opened this issue Dec 16, 2021 · 3 comments

@mogul
Contributor

mogul commented Dec 16, 2021

How to reproduce

  1. Create a Solr service.
  2. Bind it to a catalog.data.gov app with a large dataset.
  3. Subject the service to load by running a task with the command ckan search-index rebuild -i -o (a sketch of running this task follows the list).
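
For concreteness, here is one way that task might be run, assuming a Cloud Foundry deployment like cloud.gov; the app and task names are hypothetical:

    # Hypothetical app/task names; uses cf CLI v7+ run-task syntax
    cf run-task catalog-web --command "ckan search-index rebuild -i -o" --name reindex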

Expected behavior

The index rebuild operation completes uneventfully.

Actual behavior

The Solr service enters a bad state and returns 500 errors.

Sketch

  • Logs on the SolrCloud side indicate that the nodes disagree about which one is the leader. This may be an indication that most of the nodes went down and recovery did not happen correctly.
  • We should figure out whether this command can be used to get the cluster functional again (though it may not prevent data loss...!), ideally on the service side without user intervention. (A sketch of the relevant API calls follows this list.)
  • We should figure out the root cause for the cluster entering that state in the first place, and prevent it!
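
For reference, a sketch of the sort of Solr Collections API calls involved; the host, collection, and shard names are assumptions, and the specific command referenced in the bullet above may differ:

    # Inspect which replica each shard currently considers its leader
    curl "http://solr-host:8983/solr/admin/collections?action=CLUSTERSTATUS"

    # Force a leader election for a shard stuck without an agreed leader.
    # WARNING: FORCELEADER is a last-resort operation and can lose uncommitted updates.
    curl "http://solr-host:8983/solr/admin/collections?action=FORCELEADER&collection=catalog&shard=shard1"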
@mogul mogul self-assigned this Dec 16, 2021
@mogul
Contributor Author

mogul commented Dec 17, 2021

Ruminating on potential causes here: Solr is running in Fargate, and each pod will only get 20GB of attached ephemeral storage. I think our current config tries to store a copy of every shard on every replica. That would mean that the total capacity of the cluster won't go above 20GB, no matter how many replicas we add.
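
One way to check where the disk is actually going, a sketch assuming the node's Metrics API is reachable (the host name is hypothetical):

    # Report the on-disk index size of each core hosted on a node
    curl "http://solr-host:8983/solr/admin/metrics?group=core&prefix=INDEX.sizeInBytes"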

Assuming that is the problem (which I can't confirm without more investigation), here are a couple of options:

  • Do a better job configuring shards vs replicas
    • Since there's not a lot of room for a better configuration of shards vs replicas in such a small space per node, this seems like the wrong approach.
  • It just became possible to configure EKS to use EBS to satisfy PersistentVolumeClaims.
    • I think the way forward is to get that EBS CSI add-on installed, then provision our Solr clusters to use PVCs instead of ephemeral storage. (A sketch follows this list.)
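
A sketch of that setup, assuming eksctl is in use; the cluster and StorageClass names are hypothetical:

    # Install the EBS CSI driver add-on on the cluster
    eksctl create addon --cluster solr-eks --name aws-ebs-csi-driver

    # Define an EBS-backed StorageClass that Solr PVCs can request
    kubectl apply -f - <<'EOF'
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: solr-ebs
    provisioner: ebs.csi.aws.com
    volumeBindingMode: WaitForFirstConsumer
    EOF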

@mogul
Contributor Author

mogul commented Dec 19, 2021

@mogul
Contributor Author

mogul commented Jan 11, 2022

We discovered that the catalog Solr index was not ~300GB, as previously rumored, but more like ~30GB. That made this option from the earlier comment much more viable:

  • Do a better job configuring shards vs replicas

The earlier caveat, that there's not much room per node for a better shards-vs-replicas configuration, no longer applied at this size.

@nickumia-reisys increased the number of nodes in the SolrCloud, split our collection into shards, and set the number of replicas for each shard to 2. In this configuration each node holds only part of the collection, but each part is replicated across at least two nodes for resiliency. (A sketch follows.)
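
A sketch of expressing that layout through the Collections API; the host, collection name, and shard count are assumptions:

    # Create a collection spread over 4 shards, each with 2 replicas
    curl "http://solr-host:8983/solr/admin/collections?action=CREATE&name=catalog&numShards=4&replicationFactor=2"

    # An existing single-shard collection can instead be divided with SPLITSHARD
    curl "http://solr-host:8983/solr/admin/collections?action=SPLITSHARD&collection=catalog&shard=shard1"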

Since making this change, we were able to complete a sync of the catalog Solr index, and we have run load testing without any interruption in the availability of the Solr service.
