Solr service entering unrecoverable state under load #3603

Closed
mogul opened this issue Dec 16, 2021 · 3 comments

@mogul
Contributor

mogul commented Dec 16, 2021

How to reproduce

  1. Create a Solr service.
  2. Bind it to a catalog.data.gov app with a large dataset.
  3. Subject the service to load by running a task with the command ckan search-index rebuild -i -o (a sketch of running this task follows the list).
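
For concreteness, here is one way that task might be run, assuming a Cloud Foundry deployment like cloud.gov; the app and task names are hypothetical:

    # Hypothetical app/task names; uses cf CLI v7+ run-task syntax
    cf run-task catalog-web --command "ckan search-index rebuild -i -o" --name reindex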

Expected behavior

The index rebuild operation completes uneventfully.

Actual behavior

The Solr service enters a bad state and returns 500 errors.

Sketch

  • Logs on the SolrCloud side indicate that the nodes disagree about which one is the leader. This may be an indication that most of the nodes went down and recovery did not happen correctly.
  • We should figure out whether this command can be used to get the cluster functional again (though it may not prevent data loss...!), ideally on the service side without user intervention. (A sketch of the relevant API calls follows this list.)
  • We should figure out the root cause for the cluster entering that state in the first place, and prevent it!
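
For reference, a sketch of the sort of Solr Collections API calls involved; the host, collection, and shard names are assumptions, and the specific command referenced in the bullet above may differ:

    # Inspect which replica each shard currently considers its leader
    curl "http://solr-host:8983/solr/admin/collections?action=CLUSTERSTATUS"

    # Force a leader election for a shard stuck without an agreed leader.
    # WARNING: FORCELEADER is a last-resort operation and can lose uncommitted updates.
    curl "http://solr-host:8983/solr/admin/collections?action=FORCELEADER&collection=catalog&shard=shard1"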
@mogul mogul self-assigned this Dec 16, 2021
@mogul
Contributor Author

mogul commented Dec 17, 2021

Ruminating on potential causes here: Solr is running in Fargate, and each pod will only get 20GB of attached ephemeral storage. I think our current config tries to store a copy of every shard on every replica. That would mean that the total capacity of the cluster won't go above 20GB, no matter how many replicas we add.
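
One way to check where the disk is actually going, a sketch assuming the node's Metrics API is reachable (the host name is hypothetical):

    # Report the on-disk index size of each core hosted on a node
    curl "http://solr-host:8983/solr/admin/metrics?group=core&prefix=INDEX.sizeInBytes"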

Assuming that is the problem (which I can't confirm without more investigation), here are a couple of options:

  • Do a better job configuring shards vs replicas
    • Since there's not a lot of room for a better configuration of shards vs replicas in such a small space per node, this seems like the wrong approach.
  • It just became possible to configure EKS to use EBS to satisfy PersistentVolumeClaims.
    • I think the way forward is to get that EBS CSI add-on installed, then provision our Solr clusters to use PVCs instead of ephemeral storage. (A sketch follows this list.)
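
A sketch of that setup, assuming eksctl is in use; the cluster and StorageClass names are hypothetical:

    # Install the EBS CSI driver add-on on the cluster
    eksctl create addon --cluster solr-eks --name aws-ebs-csi-driver

    # Define an EBS-backed StorageClass that Solr PVCs can request
    kubectl apply -f - <<'EOF'
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: solr-ebs
    provisioner: ebs.csi.aws.com
    volumeBindingMode: WaitForFirstConsumer
    EOF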

@mogul
Contributor Author

mogul commented Dec 19, 2021

@mogul
Contributor Author

mogul commented Jan 11, 2022

We discovered that the catalog Solr index was not ~300GB, as previously rumored, but more like ~30GB. That made this option from the earlier comment much more viable:

  • Do a better job configuring shards vs replicas

The earlier caveat, that there's not much room per node for a better shards-vs-replicas configuration, no longer applied at this size.

@nickumia-reisys increased the number of nodes in the SolrCloud, split our collection into shards, and set the number of replicas for each shard to 2. In this configuration each node holds only part of the collection, but each part is replicated across at least two nodes for resiliency. (A sketch follows.)
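
A sketch of expressing that layout through the Collections API; the host, collection name, and shard count are assumptions:

    # Create a collection spread over 4 shards, each with 2 replicas
    curl "http://solr-host:8983/solr/admin/collections?action=CREATE&name=catalog&numShards=4&replicationFactor=2"

    # An existing single-shard collection can instead be divided with SPLITSHARD
    curl "http://solr-host:8983/solr/admin/collections?action=SPLITSHARD&collection=catalog&shard=shard1"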

Since making this change, we were able to complete a sync of the catalog Solr index, and we have run load testing without any interruption in the availability of the Solr service.
