
[Feature Request] Use of Binary DocValue for high cardinality fields to improve aggregations performance #16837

rishabhmaurya opened this issue Dec 12, 2024 · 7 comments


rishabhmaurya commented Dec 12, 2024

Is your feature request related to a problem? Please describe

The doc value type for the keyword field is always set to SORTED_SET. This works well for low/medium cardinality fields; for high cardinality fields, however, it is an overhead, since it unnecessarily iterates over ordinals and looks them up via the term dictionary.
Lucene 9 also started always compressing the term dictionaries for sorted doc values (https://issues.apache.org/jira/browse/LUCENE-9843), disregarding the compression mode associated with the codec. This makes ordinal lookups even slower when sorted doc values are used, making high cardinality aggregation queries slower still.
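
To make the difference concrete, here is a minimal Lucene-level sketch (not OpenSearch code) of the two read paths, assuming a LeafReader, a field name, and a doc ID:

import java.io.IOException;

import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.util.BytesRef;

// Minimal sketch of the two read paths described above.
final class DocValueReadPaths {

    // SORTED_SET: every value is an ordinal that has to be resolved through the
    // (always-compressed, since Lucene 9) term dictionary.
    static void readSortedSet(LeafReader leaf, String field, int docId) throws IOException {
        SortedSetDocValues dv = DocValues.getSortedSet(leaf, field);
        if (dv.advanceExact(docId)) {
            for (int i = 0; i < dv.docValueCount(); i++) {
                BytesRef term = dv.lookupOrd(dv.nextOrd()); // extra ordinal -> term hop per value
            }
        }
    }

    // BINARY: the value bytes are read directly; no ordinal or term-dictionary hop.
    static void readBinary(LeafReader leaf, String field, int docId) throws IOException {
        BinaryDocValues dv = DocValues.getBinary(leaf, field);
        if (dv.advanceExact(docId)) {
            BytesRef value = dv.binaryValue();
        }
    }
}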

Describe the solution you'd like

Using binary doc values for high cardinality fields can significantly improve performance for the cardinality aggregation and other aggregations too. The catch is that the doc value type is an index-time setting, and we can't set both types, as that would significantly increase the index size for keyword fields.

We can do one of the following (feel free to add any other solution):

  1. Introduce a new field type for such high cardinality fields and use the binary doc value type for them.
  2. Introduce a configuration within the keyword field (this is what I did for the PoC, as a hack); I'm against this solution due to the complexity it adds to the keyword field type.

Shortcomings of having just a binary doc value for a given field type compared to a sorted set DV:

  1. Larger index size, depending on the amount of duplication present. Also, since Lucene 9 always compresses the term dictionary for sorted DV, which is not the case for binary DV, that will also add to the index size when the default best_speed compression mode is used.
  2. Aggregations, or any other code path involving ordinals such as the OrdinalsCollector for the cardinality aggregation, can never be used. I believe that for high cardinality fields this will be the case anyway, since the ordinals overhead will always be very high and ordinals shouldn't be used.

Related component

Search:Performance

Describe alternatives you've considered

No response

Additional context

I tweaked the code to add both sorted set and binary doc values for the keyword field type, and also added a way to configure which one FieldData (used for aggregations) reads from; a rough sketch of the index-time change is shown below.
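
Illustrative only (not the actual KeywordFieldMapper diff), assuming Lucene's document API:

import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.util.BytesRef;

final class DualDocValuesExample {

    // Index the same keyword value under both doc value types so the FieldData
    // layer can pick either read path at query time.
    static void addKeywordDocValues(Document doc, String field, BytesRef value) {
        doc.add(new SortedSetDocValuesField(field, value)); // ordinals + term dictionary
        doc.add(new BinaryDocValuesField(field, value));    // raw bytes per document
    }
}

Note that BinaryDocValuesField only allows one value per document per field, so multi-valued keyword fields would need extra handling.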
On running OSB against the Big5 workload for a high cardinality field, the improvement was significant - almost 10x, from 28.8 sec to 3.2 sec:

Query:

{ 
  "size": 0, 
  "aggs": {
    "agent": {
      "cardinality": {
        "field": "event.id.keyword"
      }
    }
  }
}

Using sorted set doc values:

{
  "took" : 28851,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "agent" : {
      "value" : 180250
    }
  }
}

Using binary doc values:

{
  "took" : 3266,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "agent" : {
      "value" : 180250
    }
  }
}

kkewwei commented Dec 12, 2024

@rishabhmaurya Good idea - that is to say, we sacrifice compression rate to speed up reading. If we don't use term dictionaries, the storage will be a bit bloated; can you do a benchmark to compare?


rishabhmaurya commented Dec 12, 2024

Good idea, that is to say, we sacrifice compression rate to speed up reading.

@kkewwei There might be some impact, since even high cardinality fields contain some duplicates, so it all depends on how much duplication is present. Also, it would be interesting to see the size difference when binary DVs are encoded as well. I'm currently analyzing the sizes for the event.id field used in the Big5 workload and will post an update soon.


rishabhmaurya commented Dec 12, 2024

sorted set:

  "dvd" : {
    "size_in_bytes" : 331618470,
    "description" : "DocValues"
  },

binary:

  "dvd" : {
    "size_in_bytes" : 982887189,
    "description" : "DocValues"
  },

An increase of ~650 MB, i.e. the size of the DV nearly tripled on replacing sorted set with binary; but this compares a compressed term dictionary in sorted set against uncompressed binary.
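
For reference, per-extension file sizes like the dvd entries above can be pulled from the index stats API (index names as used in this test):

curl "localhost:9200/big5cardssdv/_stats?include_segment_file_sizes=true&pretty"
curl "localhost:9200/big5cardbdv/_stats?include_segment_file_sizes=true&pretty"

The dvd entry appears under segments.file_sizes in the response.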

index size -

curl localhost:9200/_cat/indices
yellow open big5cardssdv O4SqYFCrRNSSUlJMS-pkiA 1 1 69223950 0  2.5gb  2.5gb
yellow open big5cardbdv  SlzNDVjwQ_KirkVG0DywWQ 1 1 69223950 0    3gb    3gb

increase of ~500 mb.

Next step: I will check how size and speed are impacted using best_compression as the codec with binary DV.
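
For reference, the codec has to be set at index creation time (or while the index is closed); a minimal example, with an illustrative index name:

curl -X PUT "localhost:9200/big5cardbdv_bestcomp" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.codec": "best_compression"
  }
}
'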


rishabhmaurya commented Dec 13, 2024

Surprisingly, on using best_compression, the size of the dvd file remained the same -

    "dvd" : {
      "size_in_bytes" : 983490211,
      "description" : "DocValues"
    },

So we are definitely trading off storage size with the use of binary doc values. How much? That totally depends on the amount of duplication in the high cardinality field.
It might still be worth it for some users to use this format for the speed gains it provides. I need input from other folks as well on what they think about introducing a new field type in the keyword field family for such high cardinality cases.
cc @msfroh @andrross @reta @jainankitk

bugmakerrrrrr commented

Surprisingly, on using best_compression, the size of dvd file remained same -

AFAIK, the compression is not applied to binary doc values. Lucene 8 added compression for binary doc value fields, but it was removed in Lucene 9. Maybe we could consider adding it back as a custom codec. In addition, we could also consider using ZSTD to compress the binary/sorted set/sorted doc values.
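
A rough sketch of the custom-codec idea, assuming a hypothetical compressing DocValuesFormat that would still have to be written:

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.DocValuesFormat;
import org.apache.lucene.codecs.FilterCodec;

// Sketch only: delegate everything to the default codec but swap in a doc values
// format that compresses binary doc values. The compressing format itself is
// hypothetical and is passed in here.
public class CompressedBinaryDVCodec extends FilterCodec {

    private final DocValuesFormat compressingDocValuesFormat;

    public CompressedBinaryDVCodec(DocValuesFormat compressingDocValuesFormat) {
        super("CompressedBinaryDVCodec", Codec.getDefault());
        this.compressingDocValuesFormat = compressingDocValuesFormat;
    }

    @Override
    public DocValuesFormat docValuesFormat() {
        return compressingDocValuesFormat;
    }
}

Wiring this in would also need the usual codec SPI registration and a way to select it per index.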


manojfaria commented Jan 24, 2025

Regarding the introduction of a new field type in the keyword field family, may I propose adding an expert-level mapping parameter called "doc_values_type" to the existing keyword field type. I understand that @rishabhmaurya finds this config option complex, but I assume that the operational experience of changing doc_values_type would be similar to changing the existing doc_values parameter. Is this feasible and acceptable? One would need to add a check to allow the doc_values_type config only when doc_values is set to true (and it should be limited to the keyword field type, as far as I understand).

If the above solution approach is acceptable, the "doc_values_type" parameter for the "keyword" field type could have a default value of "AUTO", which for the time being would act the same as SORTED_SET. We would allow doc_values_type to be switched to BINARY for high-cardinality use cases where lower latency is prioritized over storage space. The storage trade-off for BINARY doc values could be further optimized by using ZSTD compression, as mentioned by @bugmakerrrrrr.

In the future, as we find a way to automatically derive and set doc_values_type based on a field's high cardinality data, the behavior of doc_values_type=AUTO can evolve to choose between SORTED_SET and BINARY doc values. For use cases where a manual override is required, doc_values_type may be set to "SORTED_SET" or "BINARY" based on the use case's needs.
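
A hypothetical mapping illustrating the proposal (doc_values_type is not an existing mapping parameter today; the index name and values shown are only examples):

PUT /big5
{
  "mappings": {
    "properties": {
      "event.id": {
        "type": "keyword",
        "doc_values": true,
        "doc_values_type": "BINARY"
      }
    }
  }
}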


asimmahmood1 commented Feb 3, 2025

I took a look through the cardinality code and landed on the same path as #15269 - does OrdinalsCollector perform better than Direct?

I also found a stale PR that allows users to specify it: opensearch-project/documentation-website#8265

Using big5 with a single shard and xmx 8g, I see:

  1. no hint - 77ms
  2. execution_hint: ordinals - 2ms
  3. execution_hint: direct - 77ms

Short term: I would propose that we pull in PR #8265 to allow any power user who is having issues to work around them using this execution_hint.

Medium term: Follow up on #15269, where we tune the 10-year-old heuristic that picks Ordinals vs Direct, so it benefits most users (a rough sketch of such a heuristic follows the pros/cons below).

  • Pros: both of these options need no index-time change and should work for both old and new indices. Simply counting unique ordinals does not need a term lookup.
  • Cons: doesn't help with other aggregations that do need term lookups and are slowed down by the compression.
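
For reference, a hedged sketch of the kind of memory-based heuristic being referred to (illustrative only, not the actual OpenSearch CardinalityAggregator code):

// Illustrative ordinals-vs-direct decision: collect over global ordinals only
// when a bitset over all ordinals is clearly cheaper than the HLL sketch.
final class CollectorChoice {

    // maxOrd: number of unique terms (global ordinals) visible to the aggregation.
    // precision: HyperLogLog++ precision (number of register index bits).
    static boolean useOrdinalsCollector(long maxOrd, int precision) {
        long ordinalsBytes = maxOrd / 8;     // ~1 bit per ordinal in a bitset
        long hllBytes = 1L << precision;     // ~1 byte per HLL register
        return ordinalsBytes < hllBytes / 4; // prefer ordinals only when much cheaper
    }
}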

No Hint

curl -X GET "http://localhost:9200/big5/_search?pretty=true&timeout=10m" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "agent": {
      "cardinality": {
        "field": "event.id"
      }
    }
  }
}
'
{
  "took" : 80984,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "agent" : {
      "value" : 180250
    }
  }
}

With ordinal hint

curl -X GET "http://localhost:9200/big5/_search?pretty=true&timeout=10m" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "agent": {
      "cardinality": {
        "execution_hint": "ordinals",
        "field": "event.id"
      }
    }
  }
}
'
{
  "took" : 2481,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "agent" : {
      "value" : 180250
    }
  }
}
