Upgrade to CouchDb 3.4.x #9303

Closed
dianabarsan opened this issue Aug 5, 2024 · 27 comments

@dianabarsan
Member

What feature do you want to improve?
CouchDb 3.4.0 will be released soon.
It includes some changes that could improve things for the CHT significantly such as:

  • #4736: Stop client process and clean up if client disconnects.
  • #4503: Make timeouts for _view and _search configurable.
  • #4627: Add QuickJS as a JavaScript engine option.

Full release notes: https://docs.couchdb.org/en/latest/whatsnew/3.4.html

Describe the improvement you'd like
Upgrade CHT to use CouchDb 3.4.0

@dianabarsan dianabarsan added the Type: Improvement Make something better label Aug 5, 2024
@dianabarsan dianabarsan self-assigned this Aug 5, 2024
@garethbowen garethbowen changed the title Upgrade to CouchDb 3.4.0 Upgrade to CouchDb 3.4.x Oct 23, 2024
@garethbowen garethbowen added this to the 4.14.0 milestone Oct 23, 2024
@garethbowen
Member

Added to 4.14 to at least investigate it.

We need to test for performance improvements and regressions, particularly when querying the changes feed with over 1,000 doc IDs.
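
For anyone setting up that test: a minimal sketch of how a changes-feed request filtered by doc IDs could be built, assuming CouchDB's built-in `_doc_ids` filter is the query shape in question. The base URL, the generated IDs, and the helper name `build_changes_request` are hypothetical; the request is only constructed, not sent.

```python
import json
import urllib.request


def build_changes_request(base_url, db, doc_ids, since="0"):
    """Build (but do not send) a POST /{db}/_changes request filtered by doc IDs."""
    url = f"{base_url.rstrip('/')}/{db}/_changes?filter=_doc_ids&since={since}"
    # The doc IDs go in the JSON body, so a long list does not inflate the URL.
    body = json.dumps({"doc_ids": list(doc_ids)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical IDs, just over the 1,000 mark mentioned above.
doc_ids = [f"doc-{i}" for i in range(1001)]
request = build_changes_request("http://localhost:5984", "medic", doc_ids)
print(request.get_method(), request.full_url)
print(len(json.loads(request.data)["doc_ids"]), "doc IDs in body")
```

Timing `urllib.request.urlopen(request)` against a 3.3.3 and a 3.4.2 instance holding the same data would give a first rough comparison.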

Use this issue as the MVP upgrade - we can look at turning on the optional new features later.

@mrjones-plip
Contributor

mrjones-plip commented Oct 23, 2024

Related: CHT Core is looking to bifurcate online and offline search, which will leverage features in the latest version of couch. We're currently on couch 3.3.3, which doesn't have the latest search features. The new couch version, 3.4.2, has a stable version of Nouveau search. Upgrading core ahead of the bifurcation would be great!

@sugat009 sugat009 assigned sugat009 and unassigned dianabarsan Oct 24, 2024
@sugat009 sugat009 moved this from Todo to This Week's commitments in Product Team Activities Oct 24, 2024
@sugat009 sugat009 linked a pull request Oct 24, 2024 that will close this issue
@sugat009
Member

sugat009 commented Oct 25, 2024

I've done a round of tests on upgrading CouchDB to version 3.4.2 on CHT.

  1. With the docker helper 4.x setup, I updated the CouchDB container to version 3.4.2. According to the logs, all the services seem to be working. I also created a few users, contacts, etc.
  2. I ran unit tests, integration tests, and wdio tests locally and in the CI by creating a PR. One of the tests is failing at the moment (CI Run). The test checks whether the /<database>/_explain endpoint is restricted for offline users. The response from the database changed slightly between versions 3.3.3 and 3.4.2 (responses pasted below). The "fields" key is the one causing the test to fail at the moment.
    a. /<database>/_explain endpoint response v3.3.3
{
    "dbname": "medic",
    "index": {
        "ddoc": null,
        "name": "_all_docs",
        "type": "special",
        "def": {
            "fields": [
                {
                    "_id": "asc"
                }
            ]
        }
    },
    "partitioned": "undefined",
    "selector": {
        "type": {
            "$eq": "person"
        }
    },
    "opts": {
        "use_index": [],
        "bookmark": "nil",
        "limit": 25,
        "skip": 0,
        "sort": {},
        "fields": "all_fields",
        "partition": "",
        "r": [
            49
        ],
        "conflicts": false,
        "stale": false,
        "update": true,
        "stable": false,
        "execution_stats": false
    },
    "limit": 25,
    "skip": 0,
    "fields": "all_fields",
    "mrargs": {
        "include_docs": true,
        "view_type": "map",
        "reduce": false,
        "partition": null,
        "start_key": null,
        "end_key": "<MAX>",
        "direction": "fwd",
        "stable": false,
        "update": true,
        "conflicts": "undefined"
    }
}

    b. /<database>/_explain endpoint response v3.4.2

{
    "dbname": "medic",
    "index": {
        "ddoc": null,
        "name": "_all_docs",
        "type": "special",
        "def": {
            "fields": [
                {
                    "_id": "asc"
                }
            ]
        }
    },
    "partitioned": false,
    "selector": {
        "type": {
            "$eq": "person"
        }
    },
    "opts": {
        "use_index": [],
        "bookmark": "nil",
        "limit": 25,
        "skip": 0,
        "sort": {},
        "fields": [],
        "partition": "",
        "r": 1,
        "conflicts": false,
        "stale": false,
        "update": true,
        "stable": false,
        "execution_stats": false
    },
    "limit": 25,
    "skip": 0,
    "fields": [],
    "index_candidates": [],
    "selector_hints": [
        {
            "type": "json",
            "indexable_fields": [
                "type"
            ],
            "unindexable_fields": []
        }
    ],
    "mrargs": {
        "include_docs": true,
        "view_type": "map",
        "reduce": false,
        "partition": null,
        "start_key": null,
        "end_key": "<MAX>",
        "direction": "fwd",
        "stable": false,
        "update": true,
        "conflicts": "undefined"
    },
    "covering": false
}
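
To make the differences between the two responses easier to see, here is a small sketch that compares them key by key. The helper `diff_explain` is hypothetical and not part of CHT or CouchDB. The two dictionaries are trimmed copies of the responses above, keeping only the fields that differ.

```python
def diff_explain(old, new):
    """Return key paths added, removed, and changed between two _explain responses."""
    report = {"added": [], "removed": [], "changed": []}

    def walk(a, b, prefix):
        for key in sorted(set(a) | set(b)):
            path = f"{prefix}{key}"
            if key not in a:
                report["added"].append(path)
            elif key not in b:
                report["removed"].append(path)
            elif isinstance(a[key], dict) and isinstance(b[key], dict):
                # Recurse into nested objects such as "opts".
                walk(a[key], b[key], f"{path}.")
            elif a[key] != b[key]:
                report["changed"].append(path)

    walk(old, new, "")
    return report


# Trimmed copies of the v3.3.3 and v3.4.2 responses pasted above.
v333 = {
    "partitioned": "undefined",
    "opts": {"fields": "all_fields", "r": [49]},
    "fields": "all_fields",
}
v342 = {
    "partitioned": False,
    "opts": {"fields": [], "r": 1},
    "fields": [],
    "index_candidates": [],
    "selector_hints": [
        {"type": "json", "indexable_fields": ["type"], "unindexable_fields": []}
    ],
    "covering": False,
}

report = diff_explain(v333, v342)
for kind, paths in report.items():
    print(kind, paths)
```

With these inputs, "fields" shows up as changed at the top level and under "opts", which matches the key the failing test trips on.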

TODO: performance tests

@sugat009 sugat009 pinned this issue Oct 25, 2024
@sugat009 sugat009 unpinned this issue Oct 25, 2024
@Hareet
Member

Hareet commented Oct 25, 2024

@sugat009
Thinking of production scenarios: is it worth testing an upgrade from couchdb 2.3.1 with existing data (cht-core 3.x) to couchdb 3.4.2? Do you feel that's already covered? Thanks!

@mrjones-plip
Contributor

Seconding Hareet's suggestion to test pre-couch 3.x upgrades. Since Core 4.4 added Couch 3.x, maybe try Core 4.2 -> Core branch @ ~master with couch 3.4.x?

@sugat009
Member

@Hareet @mrjones-plip yes, we should try that if it's one of the production cases.

@github-project-automation github-project-automation bot moved this from This Week's commitments to Done in Product Team Activities Oct 28, 2024
@dianabarsan dianabarsan reopened this Oct 28, 2024
@dianabarsan dianabarsan moved this from Done to This Week's commitments in Product Team Activities Oct 28, 2024
@sugat009
Member

I ran an upgrade test on a CHT 4.13 instance, going from CouchDB 3.3.3 to CouchDB 3.4.2, with 250K docs in the medic database. No documents were lost during the upgrade. To check, I stored a hash of every document before the upgrade and compared it with the hash recomputed afterwards. I only checked the medic database, as it's the largest one, and the outcome probably holds for the other databases as well.

Next: clustered upgrade test
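
A minimal sketch of such a hash check, for anyone who wants to repeat it. This is not the script used for the test above. It assumes each document is hashed as canonical JSON (sorted keys), and the helper names and sample docs are hypothetical.

```python
import hashlib
import json


def doc_hash(doc):
    """SHA-256 of a document serialised with sorted keys, so key order does not matter."""
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def snapshot(docs):
    """Map each document's _id to its content hash."""
    return {doc["_id"]: doc_hash(doc) for doc in docs}


def compare(before, after):
    """Return IDs missing after the upgrade and IDs whose hash changed."""
    missing = sorted(set(before) - set(after))
    changed = sorted(i for i in before if i in after and before[i] != after[i])
    return missing, changed


# Hypothetical documents standing in for the contents of the medic database.
docs_before = [
    {"_id": "a", "_rev": "1-x", "type": "person"},
    {"_id": "b", "_rev": "1-y", "type": "person"},
    {"_id": "c", "_rev": "1-z", "type": "clinic"},
]
docs_after = [
    {"type": "person", "_rev": "1-x", "_id": "a"},  # same content, different key order
    {"_id": "b", "_rev": "2-y", "type": "person"},  # revision changed
]

missing, changed = compare(snapshot(docs_before), snapshot(docs_after))
print("missing:", missing)
print("changed:", changed)
```

An upgrade with no document loss would leave both lists empty.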

@lorerod lorerod modified the milestones: 4.14.0, 4.15.0 Oct 30, 2024
@lorerod
Contributor

lorerod commented Oct 30, 2024

Moved to 4.15.0 so as not to hold up the release.

@sugat009
Member

The upgrade test from a CHT instance with version 4.13 and clustered CouchDB version 3.3.3 to 3.4.2 was successful without any document loss. The test procedure is the same as above for a single-node CouchDB.

Next: Performance tests

@sugat009
Member

Performance tests for purging and replication have been done.
The test scenario is as follows:

  1. Deploy an instance of EKS with CouchDB v3.3.3.
  2. Add ~5M docs using test-data-generator, with each CHW having ~15K docs (mostly reports).
  3. Log in through a client device (a different browser or a phone).
  4. Purge ~10K of those docs (reports).
  5. Log in through the same device from step 3 and sync.
  6. Upgrade CouchDB to v3.4.2
  7. Delete the purge databases
  8. Do steps 3-6 again for v3.4.2

The timing metrics are as follows.

  1. 3.3.3
    1. Replication before purging
      1. Polling data: 6.83s
      2. Actual Replication: 50.76s
    2. Purging
      1. Time taken for purging: 60.46 minutes
    3. Sync after purging
      1. Time taken: 32s
  2. 3.4.2
    1. Replication before purging
      1. Polling data: 7.03s
      2. Actual replication: 55.94s
    2. Purging
      1. Time taken for purging: 1148.24 minutes ≈ 19.14 hours
    3. Sync after purging
      1. Polling data: 5.01 minutes
      2. Actual replication: 37s
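
As a quick sanity check on the purge timings listed above, the unit conversion and the slowdown factor can be worked out as follows. Only the two purge durations come from the measurements; the variable names are illustrative.

```python
# Purge durations in minutes, taken from the list above.
purge_minutes = {"3.3.3": 60.46, "3.4.2": 1148.24}

# Convert the 3.4.2 purge time to hours.
purge_hours_342 = purge_minutes["3.4.2"] / 60

# How many times longer purging took on 3.4.2 than on 3.3.3.
slowdown = purge_minutes["3.4.2"] / purge_minutes["3.3.3"]

print(f"3.4.2 purge: {purge_hours_342:.2f} hours")
print(f"slowdown vs 3.3.3: {slowdown:.1f}x")
```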

The v3.3.3 and v3.4.2 metrics differ substantially in purging time. Should we perform another test to confirm the validity of these timing measurements?
In the meantime, I'm checking server logs for anything unusual.
CC: @jkuester @m5r @mrjones-plip

@dianabarsan
Member Author

I'm seriously worried about two metrics here:

  • time taken to purge docs in 3.4.2
  • polling data in 3.4.2

I think we should at least re-run the tests and check if we get comparable times. And if yes, it's possible we might need to re-evaluate what happens for both these actions.

@sugat009
Member

After checking the logs of Sentinel and Couch, I'm guessing the major bottleneck is the batch size of the purge documents. The batch size dropped from 1000 to a minimum of 15. From there on, the processing is normal but slow. I've deleted the purge DBs and rerun the purge to check whether this was a one-time thing.
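
A rough illustration of why the smaller batch size hurts: the number of batches, and with it the number of round trips, grows in inverse proportion to the batch size. The item count below is hypothetical and only there to show the scale; the real workload and per-batch cost are not measured here.

```python
import math


def batches_needed(total_items, batch_size):
    """Number of batches required to process total_items at a given batch size."""
    return math.ceil(total_items / batch_size)


# Hypothetical number of items to process; not taken from the test run.
total_items = 10_000

at_1000 = batches_needed(total_items, 1000)
at_15 = batches_needed(total_items, 15)
print(f"batch size 1000: {at_1000} batches")
print(f"batch size 15:   {at_15} batches")
```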

@dianabarsan
Member Author

yea, that was my suspicion, that we need to rework purging to hit different endpoints that are more efficient now.

@dianabarsan
Member Author

I've created an issue for this: #9642

@mrjones-plip
Contributor

@dianabarsan - OK to make #9642 a sub-issue of this ticket? I think we don't want to release the couch v3.4.2 upgrade without using new endpoints and sub-issues are a nice new feature of GH that we can leverage to show the dependencies!

no biggie

@dianabarsan
Member Author

Added it as a sub issue

@sugat009
Member

The second run of purging in deployment with CouchDB version 3.4.2 has been completed. The metrics and logs are similar.

  1. Purging
    1. Time taken: 1179.79 minutes = 19.66 hours
  2. Sync after purging
    1. Polling data: 5.3 minutes
    2. Actual replication: 24s

@latin-panda
Contributor

Moving it to 4.16 to not block 4.15

@latin-panda latin-panda modified the milestones: 4.15.0, 4.16.0 Nov 18, 2024
@sugat009
Member

Update:
Changes from #9651 made the purge take only 47.108 minutes in CouchDB version 3.4.2 compared to ~19 hours from before.

@dianabarsan
Member Author

Yes, good news is that we don't need to make any code changes to have quick purging, we just need to adjust the changes optimization counter, so minimal effort required here.

@dianabarsan
Member Author

I think we should just merge this as-is, with the adjustment so we don't take a performance hit for purging.
This way we unblock the work for follow-up issues.
Any objections?

@dianabarsan dianabarsan self-assigned this Nov 25, 2024
@garethbowen
Member

+1 from me. If I understand correctly the perf numbers are all approximately the same, right? If so, then it makes sense to get this first step merged sooner rather than later.

@dianabarsan
Member Author

@sugat009 what do you think?

@sugat009
Member

Yes, the numbers seem alright to do the upgrade. However, we planned on doing a quick upgrade test from older active versions like the one Hareet mentioned above to 3.4.2. We could continue with the upgrade after that.

@dianabarsan
Member Author

We have an e2e test that covers an upgrade from 4.2.2 to current branch.

@sugat009
Member

In that case, I think we can move ahead with the upgrade.
CC: @mrjones-plip @jkuester @m5r

@dianabarsan
Member Author

I've just merged the commit that upgrades couch to v3.4.2 and bumps the changes optimization threshold.
I'm going to close this issue and separate all the other sub-issues that required this work.
