Support for adding an item to an array/collection field in a document #16917

kowndinyav · 2024-12-30T04:48:02Z

Feature

Proposal for adding support for updating array fields by appending a new item without requiring a document read, update the field in memory and write back to OpenSearch. Provide a similar capability to append values to an array field in a document without retrieving and re-indexing the entire document.

Example (Inspired by MongoDB):

Insert Document:

db.students.insertOne( { _id: 1, scores: [ 44, 78, 38, 80 ] } )

Append to Array Field:

db.students.updateOne(
   { _id: 1 },
   { $push: { scores: 89 } }
)

Background and Context

Our product uses OpenSearch primarily for the following purposes:

Filtering – Efficiently filter documents based on user queries.
Lexical Search – Support full-text search across multiple fields.
Semantic Search – Enable similarity-based search.

To ensure efficient filtering and search, we intentionally denormalize data (which is a recommended practice). Below are some examples to explain the design.

Example 1: Certification and Course Search

Below document contains course and skill names to support denormalized search across related entities.

Index: /learningObject
Certification Document
{
	"id": "100_certification",
	"loType": "certification",
	"certificationName_en": "Java Certification",
	"courseNames_en_US": ["Java Core", "Java Spring Essentials", "Java ORM Basics", "Java Maven Basics", "Java AOP Basics"],
	"courseNames_de_DE": ["Java Core(DE)", "Java Spring Essentials(DE)", "Java ORM Basics(DE)", "Java Maven Basics(DE)", "Java AOP Basics(DE)"],
	"skillNames": ["Java Basics", "Maven Basics", "Spring Basics"]
}

Example 2: User Enrollment Search

To list courses or certifications where a user is enrolled, we search in the course index and filter based on the user’s enrolled_courses field by doing an index lookup.

Index: /learningObject
Document: Course
{
	"id": "200_course",
	"loType": "course",
	"courseName_en_US": "Java Advanced",
	"courseName_de_DE": "Java Advanced (DE)",
	"skillNames": ["Spring Essentials", "ORM"]
}

User Document with Enrollment Info:
Index: /user
Document: user
{
	"id": "1000_user",
	"name": "",
	"enrolled_courses": ["200_course", "201_course", "202_course", "203_course"]
}

Problem Statement

In our product, frequent updates to user and learning object documents are required. Currently, updates involve the following steps:

Fetch the Document – Retrieve the document from OpenSearch.
Modify the Array Field – Update the array locally in memory.
Write Back the Document – Send the updated document back to OpenSearch.

Challenges with Current Approach:

Performance Issues – High-volume updates lead to a large number of reads and writes, even with optimizations like batch processing.
Functional Issues
- We introduced an application-level caching layer to minimize I/O operations.
- The caching layer merges updates and batches reads/writes, but due to application specific logical issues, it causes inconsistencies in the document. Sometimes, the array does not get updated with new values. Sometimes, new values overwrite the entire array
Complexity – The caching and patching logic adds complexity to the application, requiring additional testing and maintenance.

Proposed Solution

We propose introducing a feature to directly append values to array fields without fetching and reindexing the entire document.

Benefits

Improved Performance: Eliminates unnecessary reads and writes, especially during bulk updates.
Reduced Complexity: Removes the need for custom caching and patching mechanisms in applications.
Consistency: Avoids functional inconsistencies caused by application-level caching layers.

API Suggestion

Similar to MongoDB’s $push operator, OpenSearch can introduce an operation like:

Inserts

# single document update using $each
POST /user/_doc/1000_user
{
	{ "$push": { "enrolled_courses": { "$each": [ "200_course", "201_course" ] } } },
	{ "$push": { "enrolled_certitifications": { "$each": [ "300_cerftification", "301_cerftification" ] } } },
	{ "$push": { "enrolled_jobaids": "200_course" } }	
}

# single document update using $pushAll
POST /user/_doc/1000_user
{
	{ "$pushAll": { "enrolled_courses": [ "200_course", "201_course" ] } },
	{ "$pushAll": { "enrolled_certitifications": [ "300_cerftification", "301_cerftification" ] } },
	{ "$push": { "enrolled_jobaids": "200_course" } }
}

# bulk document update using $each
POST /_bulk
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "doc" : { "$push": { "enrolled_courses": { "$each": [ "200_course", "201_course" ] } } } }
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "$push": { "enrolled_certitifications": { "$each": [ "300_cerftification", "301_cerftification" ] } } },
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "$push": { "enrolled_jobaids": "200_course" } }


# bulk document update using $pushAll
POST /_bulk
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "doc" : { "$pushAll": { "enrolled_courses": [ "200_course", "201_course" ] } } }
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "$pushAll": { "enrolled_certitifications": [ "300_cerftification", "301_cerftification" ] } },
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "$push": { "enrolled_jobaids": "200_course" } }

Deletions

# single document update using $each
POST /user/_doc/1000_user
{
	{ "$pop": { "enrolled_courses": { "$each": [ "200_course", "201_course" ] } } },
	{ "$pop": { "enrolled_certitifications": { "$each": [ "300_cerftification", "301_cerftification" ] } } },
	{ "$pop": { "enrolled_jobaids": "200_course" } }	
}

# single document update using $popAll
POST /user/_doc/1000_user
{
	{ "$popAll": { "enrolled_courses": [ "200_course", "201_course" ] } },
	{ "$popAll": { "enrolled_certitifications": [ "300_cerftification", "301_cerftification" ] } },
	{ "$pop": { "enrolled_jobaids": "200_course" } }
}

# bulk document update using $each
POST /_bulk
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "doc" : { "$pop": { "enrolled_courses": { "$each": [ "200_course", "201_course" ] } } } }
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "$pop": { "enrolled_certitifications": { "$each": [ "300_cerftification", "301_cerftification" ] } } },
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "$pop": { "enrolled_jobaids": "200_course" } }

# bulk document update using $popAll
POST /_bulk
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "doc" : { "$popAll": { "enrolled_courses": [ "200_course", "201_course" ] } } }
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "$popAll": { "enrolled_certitifications": [ "300_cerftification", "301_cerftification" ] } },
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "$pop": { "enrolled_jobaids": "200_course" } }

This uses an inline script to append values to an array field (enrolled_courses).

Business Impact

Scalability: Reduces overhead on application servers, enabling better scalability for high-traffic environments.
Operational Efficiency: Simplifies document update logic, speeding up development cycles and reducing maintenance costs.
Reliability: Ensures consistency without requiring additional caching and synchronization logic.

Conclusion

We believe that adding native support for array field updates will significantly improve performance and simplify application logic, not just for our use case but also for other OpenSearch users dealing with denormalized data models.

We are happy to provide more details or collaborate with the OpenSearch team on refining this proposal further.

Thank you for considering this request.

Related component

Indexing

Describe alternatives you've considered

No response

Additional context

While inspecting the opensearch audit logs, we could see logs like below that sort of indicate that this support might be already existing internally

[
    {
        "op": "add",
        "path": "/name_en_US",
        "value": "testCourse1"
    },
    {
        "op": "add",
        "path": "/courseOverview_en_US",
        "value": ""
    },
    {
        "op": "add",
        "path": "/courseName_en_US",
        "value": "testCourse1"
    },
    {
        "op": "add",
        "path": "/courseDescription_en_US",
        "value": ""
    }
]

{
    "audit_compliance_operation": "UPDATE",
    "audit_cluster_name": "081720342125:cp-acap-unified-search",
    "audit_node_name": "942da20cdfdf1ead18cd6ea67884771f",
    "audit_category": "COMPLIANCE_DOC_WRITE",
    "audit_request_origin": "REST",
    "audit_compliance_doc_version": 2,
    "audit_request_body": "{\"courseEffectiveness\":0,\"filterDataUpdatedAt\":3553119522917154,\"creationDate\":\"2024-12-02T09:51:37.000Z\",\"learningObjectId\":5726442,\"loRating\":0,\"my_join_field\":{\"name\":\"course\"},\"accountId\":65128,\"lastModificationDate\":\"2024-12-02T09:51:37.000Z\",\"price\":0.0,\"cptype\":\"course\",\"createdForCertification\":false,\"ratingsCount\":0,\"subType\":\"SELF_ENROLL\",\"id\":\"3156300\",\"state\":\"UNDER_CONSTRUCTION\",\"isShared\":false,\"name_en_US\":\"testCourse1\",\"courseOverview_en_US\":\"\",\"courseName_en_US\":\"testCourse1\",\"courseDescription_en_US\":\"\"}",
    "audit_node_id": "4rRagdnDTH2KTIltUF511Q",
    "@timestamp": "2024-12-03T10:00:44.902+00:00",
    "audit_compliance_diff_is_noop": false,
    "audit_format_version": 4,
    "audit_request_remote_address": "34.192.26.255",
    "audit_trace_doc_id": "3156300_course",
    "audit_compliance_diff_content": "[{\"op\":\"add\",\"path\":\"/name_en_US\",\"value\":\"testCourse1\"},{\"op\":\"add\",\"path\":\"/courseOverview_en_US\",\"value\":\"\"},{\"op\":\"add\",\"path\":\"/courseName_en_US\",\"value\":\"testCourse1\"},{\"op\":\"add\",\"path\":\"/courseDescription_en_US\",\"value\":\"\"}]",
    "audit_request_effective_user": "arn:aws:iam::081720342125:role/cp-iam-qe-ec2role-3RORS774YP48",
    "audit_trace_shard_id": 0,
    "audit_trace_indices": [
        "learningobject"
    ],
    "audit_trace_resolved_indices": [
        "learningobject"
    ]
}

No response

The text was updated successfully, but these errors were encountered:

gaobinlong · 2024-12-31T08:15:34Z

Have you tried using painless script to append value to an array? It can be achieved by something like this:

POST test1/_doc/1?refresh
{
  "a":["1", "2"]
}

POST test1/_update/1
{
  "script": {
    "source": "ctx._source.a.add(params.value)",
    "params": {
      "value": "3"
    }
  }
}

# now check the field value
GET test1/_doc/1

{
  "_index": "test1",
  "_id": "1",
  "_version": 2,
  "_seq_no": 1,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "a": [
      "1",
      "2",
      "3"
    ]
  }
}

kowndinyav · 2025-01-01T02:53:40Z

Thanks for the input. Well I can try but it is mentioned in the documentation that painless script translates to a bit of additional overhead as it involves script compilation. Is my understanding correct?

kowndinyav · 2025-01-01T02:54:30Z

Would it scale if we use script in "_bulk" with large number of updates?

gaobinlong · 2025-01-02T06:29:26Z

Thanks for the input. Well I can try but it is mentioned in the documentation that painless script translates to a bit of additional overhead as it involves script compilation. Is my understanding correct?

Yes, painless script introduces some overhead because of the complication and has performance issue, but there're multiple methods to reduce the compilation times and improve the performance, such as:

Use stored script instead of defining script in the update API directly, see https://opensearch.org/docs/latest/api-reference/script-apis/create-stored-script/
Use parameters in the script instead of hard-coding value to reduce the compilation times.
Shorten the script as much as possible.

Would it scale if we use script in "_bulk" with large number of updates?

There're some dynamic settings for the script, such as script.max_compilations_rate and script.cache.max_size, see https://opensearch.org/docs/latest/install-and-configure/configuring-opensearch/circuit-breaker/. You can have a load test on your workload.

kowndinyav · 2025-01-02T13:37:36Z

If I use below scripts, four scripts can be stored upfront and I can use them generically across all indices. If we go this route, I believe max_compilations_rate may not be of concern. Is my understanding correct?

PUT _scripts/add_one_item
{
  "script": {
      "lang": "painless",
      "source": "ctx._source[params.field].add(params.value)"
  }
}

PUT _scripts/delete_one_item
{
  "script": {
      "lang": "painless",
      "source": "ctx._source[params.field].add(params.value)"
  }
}

PUT _scripts/add_multiple_items
{
  "script": {
      "lang": "painless",
      "source": "ctx._source[params.field].addAll(params.value)"
  }
}

PUT _scripts/delete_multiple_items
{
  "script": {
      "lang": "painless",
      "source": "ctx._source[params.field].removeAll(params.value)"
  }
}


POST /_bulk
{ "update": {"index": "user", "_id": 1 } }
{ "script": { "id": "add_one_item", "params": { "field": "a", "value": 3 } }
{ "update": {"index": "user", "_id": 2 } }
{ "script": { "id": "delete_one_item", "params": { "field": "a", "value": 3 } }
{ "update": { "index": "user","_id": <> } }
{ "script": { "id": "add_multiple_items", "params": { "field": "a.b.c", "value": [3,4,5] } }
{ "update": { "index": "user", "_id": <> } }
{ "script": { "id": "delete_multiple_items", "params": { "field": "a.b.c", "value": [3,4] } }

gaobinlong · 2025-01-03T02:58:01Z

If we go this route, I believe max_compilations_rate may not be of concern. Is my understanding correct?

Yes, the stored script is cached and won't be compiled every time when using.

kowndinyav · 2025-01-03T04:16:10Z

Thank you, it looks promising for our use-case. Do you have any other recommendations to keep in mind?

kowndinyav added enhancement Enhancement or improvement to existing feature or request untriaged labels Dec 30, 2024

github-actions bot added the Indexing Indexing, Bulk Indexing and anything related to indexing label Dec 30, 2024

kowndinyav changed the title ~~Support for Adding an Item to an Array Field in a Document~~ Support for adding an item to an array/collection field in a document Dec 30, 2024

soosinha removed the untriaged label Jan 6, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for adding an item to an array/collection field in a document #16917

Support for adding an item to an array/collection field in a document #16917

kowndinyav commented Dec 30, 2024 •

edited

Loading

gaobinlong commented Dec 31, 2024

kowndinyav commented Jan 1, 2025

kowndinyav commented Jan 1, 2025

gaobinlong commented Jan 2, 2025

kowndinyav commented Jan 2, 2025

gaobinlong commented Jan 3, 2025

kowndinyav commented Jan 3, 2025

Support for adding an item to an array/collection field in a document #16917

Support for adding an item to an array/collection field in a document #16917

Comments

kowndinyav commented Dec 30, 2024 • edited Loading

Feature

Example (Inspired by MongoDB):

Background and Context

Example 1: Certification and Course Search

Example 2: User Enrollment Search

Problem Statement

Proposed Solution

Benefits

API Suggestion

Inserts

Deletions

Business Impact

Conclusion

Related component

Describe alternatives you've considered

Additional context

gaobinlong commented Dec 31, 2024

kowndinyav commented Jan 1, 2025

kowndinyav commented Jan 1, 2025

gaobinlong commented Jan 2, 2025

kowndinyav commented Jan 2, 2025

gaobinlong commented Jan 3, 2025

kowndinyav commented Jan 3, 2025

kowndinyav commented Dec 30, 2024 •

edited

Loading