Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for adding an item to an array/collection field in a document #16917

Open
kowndinyav opened this issue Dec 30, 2024 · 7 comments
Open
Labels
enhancement Enhancement or improvement to existing feature or request Indexing Indexing, Bulk Indexing and anything related to indexing

Comments

@kowndinyav
Copy link

kowndinyav commented Dec 30, 2024

Feature

Proposal for adding support for updating array fields by appending a new item without requiring a document read, update the field in memory and write back to OpenSearch. Provide a similar capability to append values to an array field in a document without retrieving and re-indexing the entire document.

Example (Inspired by MongoDB):

Insert Document:

db.students.insertOne( { _id: 1, scores: [ 44, 78, 38, 80 ] } )

Append to Array Field:

db.students.updateOne(
   { _id: 1 },
   { $push: { scores: 89 } }
)

Background and Context

Our product uses OpenSearch primarily for the following purposes:

  • Filtering – Efficiently filter documents based on user queries.
  • Lexical Search – Support full-text search across multiple fields.
  • Semantic Search – Enable similarity-based search.

To ensure efficient filtering and search, we intentionally denormalize data (which is a recommended practice). Below are some examples to explain the design.

Example 1: Certification and Course Search

Below document contains course and skill names to support denormalized search across related entities.

Index: /learningObject
Certification Document
{
	"id": "100_certification",
	"loType": "certification",
	"certificationName_en": "Java Certification",
	"courseNames_en_US": ["Java Core", "Java Spring Essentials", "Java ORM Basics", "Java Maven Basics", "Java AOP Basics"],
	"courseNames_de_DE": ["Java Core(DE)", "Java Spring Essentials(DE)", "Java ORM Basics(DE)", "Java Maven Basics(DE)", "Java AOP Basics(DE)"],
	"skillNames": ["Java Basics", "Maven Basics", "Spring Basics"]
}

Example 2: User Enrollment Search

To list courses or certifications where a user is enrolled, we search in the course index and filter based on the user’s enrolled_courses field by doing an index lookup.

Index: /learningObject
Document: Course
{
	"id": "200_course",
	"loType": "course",
	"courseName_en_US": "Java Advanced",
	"courseName_de_DE": "Java Advanced (DE)",
	"skillNames": ["Spring Essentials", "ORM"]
}

User Document with Enrollment Info:
Index: /user
Document: user
{
	"id": "1000_user",
	"name": "",
	"enrolled_courses": ["200_course", "201_course", "202_course", "203_course"]
}

Problem Statement

In our product, frequent updates to user and learning object documents are required. Currently, updates involve the following steps:

  • Fetch the Document – Retrieve the document from OpenSearch.
  • Modify the Array Field – Update the array locally in memory.
  • Write Back the Document – Send the updated document back to OpenSearch.

Challenges with Current Approach:

  • Performance Issues – High-volume updates lead to a large number of reads and writes, even with optimizations like batch processing.
  • Functional Issues
    • We introduced an application-level caching layer to minimize I/O operations.
    • The caching layer merges updates and batches reads/writes, but due to application specific logical issues, it causes inconsistencies in the document. Sometimes, the array does not get updated with new values. Sometimes, new values overwrite the entire array
  • Complexity – The caching and patching logic adds complexity to the application, requiring additional testing and maintenance.

Proposed Solution

We propose introducing a feature to directly append values to array fields without fetching and reindexing the entire document.

Benefits

  • Improved Performance: Eliminates unnecessary reads and writes, especially during bulk updates.
  • Reduced Complexity: Removes the need for custom caching and patching mechanisms in applications.
  • Consistency: Avoids functional inconsistencies caused by application-level caching layers.

API Suggestion

Similar to MongoDB’s $push operator, OpenSearch can introduce an operation like:

Inserts
# single document update using $each
POST /user/_doc/1000_user
{
	{ "$push": { "enrolled_courses": { "$each": [ "200_course", "201_course" ] } } },
	{ "$push": { "enrolled_certitifications": { "$each": [ "300_cerftification", "301_cerftification" ] } } },
	{ "$push": { "enrolled_jobaids": "200_course" } }	
}

# single document update using $pushAll
POST /user/_doc/1000_user
{
	{ "$pushAll": { "enrolled_courses": [ "200_course", "201_course" ] } },
	{ "$pushAll": { "enrolled_certitifications": [ "300_cerftification", "301_cerftification" ] } },
	{ "$push": { "enrolled_jobaids": "200_course" } }
}

# bulk document update using $each
POST /_bulk
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "doc" : { "$push": { "enrolled_courses": { "$each": [ "200_course", "201_course" ] } } } }
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "$push": { "enrolled_certitifications": { "$each": [ "300_cerftification", "301_cerftification" ] } } },
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "$push": { "enrolled_jobaids": "200_course" } }


# bulk document update using $pushAll
POST /_bulk
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "doc" : { "$pushAll": { "enrolled_courses": [ "200_course", "201_course" ] } } }
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "$pushAll": { "enrolled_certitifications": [ "300_cerftification", "301_cerftification" ] } },
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "$push": { "enrolled_jobaids": "200_course" } }
Deletions
# single document update using $each
POST /user/_doc/1000_user
{
	{ "$pop": { "enrolled_courses": { "$each": [ "200_course", "201_course" ] } } },
	{ "$pop": { "enrolled_certitifications": { "$each": [ "300_cerftification", "301_cerftification" ] } } },
	{ "$pop": { "enrolled_jobaids": "200_course" } }	
}

# single document update using $popAll
POST /user/_doc/1000_user
{
	{ "$popAll": { "enrolled_courses": [ "200_course", "201_course" ] } },
	{ "$popAll": { "enrolled_certitifications": [ "300_cerftification", "301_cerftification" ] } },
	{ "$pop": { "enrolled_jobaids": "200_course" } }
}

# bulk document update using $each
POST /_bulk
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "doc" : { "$pop": { "enrolled_courses": { "$each": [ "200_course", "201_course" ] } } } }
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "$pop": { "enrolled_certitifications": { "$each": [ "300_cerftification", "301_cerftification" ] } } },
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "$pop": { "enrolled_jobaids": "200_course" } }

# bulk document update using $popAll
POST /_bulk
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "doc" : { "$popAll": { "enrolled_courses": [ "200_course", "201_course" ] } } }
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "$popAll": { "enrolled_certitifications": [ "300_cerftification", "301_cerftification" ] } },
{ "update": { "_index": "user", "_id": "1000_user" } }
{ "$pop": { "enrolled_jobaids": "200_course" } }

This uses an inline script to append values to an array field (enrolled_courses).

Business Impact

  • Scalability: Reduces overhead on application servers, enabling better scalability for high-traffic environments.
  • Operational Efficiency: Simplifies document update logic, speeding up development cycles and reducing maintenance costs.
  • Reliability: Ensures consistency without requiring additional caching and synchronization logic.

Conclusion

We believe that adding native support for array field updates will significantly improve performance and simplify application logic, not just for our use case but also for other OpenSearch users dealing with denormalized data models.

We are happy to provide more details or collaborate with the OpenSearch team on refining this proposal further.

Thank you for considering this request.

Related component

Indexing

Describe alternatives you've considered

No response

Additional context

While inspecting the opensearch audit logs, we could see logs like below that sort of indicate that this support might be already existing internally

[
    {
        "op": "add",
        "path": "/name_en_US",
        "value": "testCourse1"
    },
    {
        "op": "add",
        "path": "/courseOverview_en_US",
        "value": ""
    },
    {
        "op": "add",
        "path": "/courseName_en_US",
        "value": "testCourse1"
    },
    {
        "op": "add",
        "path": "/courseDescription_en_US",
        "value": ""
    }
]
{
    "audit_compliance_operation": "UPDATE",
    "audit_cluster_name": "081720342125:cp-acap-unified-search",
    "audit_node_name": "942da20cdfdf1ead18cd6ea67884771f",
    "audit_category": "COMPLIANCE_DOC_WRITE",
    "audit_request_origin": "REST",
    "audit_compliance_doc_version": 2,
    "audit_request_body": "{\"courseEffectiveness\":0,\"filterDataUpdatedAt\":3553119522917154,\"creationDate\":\"2024-12-02T09:51:37.000Z\",\"learningObjectId\":5726442,\"loRating\":0,\"my_join_field\":{\"name\":\"course\"},\"accountId\":65128,\"lastModificationDate\":\"2024-12-02T09:51:37.000Z\",\"price\":0.0,\"cptype\":\"course\",\"createdForCertification\":false,\"ratingsCount\":0,\"subType\":\"SELF_ENROLL\",\"id\":\"3156300\",\"state\":\"UNDER_CONSTRUCTION\",\"isShared\":false,\"name_en_US\":\"testCourse1\",\"courseOverview_en_US\":\"\",\"courseName_en_US\":\"testCourse1\",\"courseDescription_en_US\":\"\"}",
    "audit_node_id": "4rRagdnDTH2KTIltUF511Q",
    "@timestamp": "2024-12-03T10:00:44.902+00:00",
    "audit_compliance_diff_is_noop": false,
    "audit_format_version": 4,
    "audit_request_remote_address": "34.192.26.255",
    "audit_trace_doc_id": "3156300_course",
    "audit_compliance_diff_content": "[{\"op\":\"add\",\"path\":\"/name_en_US\",\"value\":\"testCourse1\"},{\"op\":\"add\",\"path\":\"/courseOverview_en_US\",\"value\":\"\"},{\"op\":\"add\",\"path\":\"/courseName_en_US\",\"value\":\"testCourse1\"},{\"op\":\"add\",\"path\":\"/courseDescription_en_US\",\"value\":\"\"}]",
    "audit_request_effective_user": "arn:aws:iam::081720342125:role/cp-iam-qe-ec2role-3RORS774YP48",
    "audit_trace_shard_id": 0,
    "audit_trace_indices": [
        "learningobject"
    ],
    "audit_trace_resolved_indices": [
        "learningobject"
    ]
}

No response

@kowndinyav kowndinyav added enhancement Enhancement or improvement to existing feature or request untriaged labels Dec 30, 2024
@github-actions github-actions bot added the Indexing Indexing, Bulk Indexing and anything related to indexing label Dec 30, 2024
@kowndinyav kowndinyav changed the title Support for Adding an Item to an Array Field in a Document Support for adding an item to an array/collection field in a document Dec 30, 2024
@gaobinlong
Copy link
Collaborator

Have you tried using painless script to append value to an array? It can be achieved by something like this:

POST test1/_doc/1?refresh
{
  "a":["1", "2"]
}

POST test1/_update/1
{
  "script": {
    "source": "ctx._source.a.add(params.value)",
    "params": {
      "value": "3"
    }
  }
}

# now check the field value
GET test1/_doc/1

{
  "_index": "test1",
  "_id": "1",
  "_version": 2,
  "_seq_no": 1,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "a": [
      "1",
      "2",
      "3"
    ]
  }
}

@kowndinyav
Copy link
Author

Thanks for the input. Well I can try but it is mentioned in the documentation that painless script translates to a bit of additional overhead as it involves script compilation. Is my understanding correct?

@kowndinyav
Copy link
Author

Would it scale if we use script in "_bulk" with large number of updates?

@gaobinlong
Copy link
Collaborator

Thanks for the input. Well I can try but it is mentioned in the documentation that painless script translates to a bit of additional overhead as it involves script compilation. Is my understanding correct?

Yes, painless script introduces some overhead because of the complication and has performance issue, but there're multiple methods to reduce the compilation times and improve the performance, such as:

  1. Use stored script instead of defining script in the update API directly, see https://opensearch.org/docs/latest/api-reference/script-apis/create-stored-script/
  2. Use parameters in the script instead of hard-coding value to reduce the compilation times.
  3. Shorten the script as much as possible.

Would it scale if we use script in "_bulk" with large number of updates?

There're some dynamic settings for the script, such as script.max_compilations_rate and script.cache.max_size, see https://opensearch.org/docs/latest/install-and-configure/configuring-opensearch/circuit-breaker/. You can have a load test on your workload.

@kowndinyav
Copy link
Author

If I use below scripts, four scripts can be stored upfront and I can use them generically across all indices. If we go this route, I believe max_compilations_rate may not be of concern. Is my understanding correct?

PUT _scripts/add_one_item
{
  "script": {
      "lang": "painless",
      "source": "ctx._source[params.field].add(params.value)"
  }
}

PUT _scripts/delete_one_item
{
  "script": {
      "lang": "painless",
      "source": "ctx._source[params.field].add(params.value)"
  }
}

PUT _scripts/add_multiple_items
{
  "script": {
      "lang": "painless",
      "source": "ctx._source[params.field].addAll(params.value)"
  }
}

PUT _scripts/delete_multiple_items
{
  "script": {
      "lang": "painless",
      "source": "ctx._source[params.field].removeAll(params.value)"
  }
}


POST /_bulk
{ "update": {"index": "user", "_id": 1 } }
{ "script": { "id": "add_one_item", "params": { "field": "a", "value": 3 } }
{ "update": {"index": "user", "_id": 2 } }
{ "script": { "id": "delete_one_item", "params": { "field": "a", "value": 3 } }
{ "update": { "index": "user","_id": <> } }
{ "script": { "id": "add_multiple_items", "params": { "field": "a.b.c", "value": [3,4,5] } }
{ "update": { "index": "user", "_id": <> } }
{ "script": { "id": "delete_multiple_items", "params": { "field": "a.b.c", "value": [3,4] } }

@gaobinlong
Copy link
Collaborator

If we go this route, I believe max_compilations_rate may not be of concern. Is my understanding correct?

Yes, the stored script is cached and won't be compiled every time when using.

@kowndinyav
Copy link
Author

Thank you, it looks promising for our use-case. Do you have any other recommendations to keep in mind?

@soosinha soosinha removed the untriaged label Jan 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Indexing Indexing, Bulk Indexing and anything related to indexing
Projects
None yet
Development

No branches or pull requests

3 participants