Selective field indexing during field type conflicts #5561

esatterwhite · 2024-11-22T20:37:00Z

Is your feature request related to a problem? Please describe.

We run a SaaS product collecting log data from thousands of customers and applications. Their logs are in various formats and styles.
We have an index template that defines some well-known fields with specific types, which gives us and our customers additional functionality in our product, such as sorting and range queries. We do our best to normalize data; however, we cannot account for all cases. When a mapped field receives a value of a different type, or a value that cannot be coerced, the entire document is rejected. In that case we are forced to remove as much extraneous data as possible in an effort to index what we feel is absolutely critical. For logs, this is mainly the level and message. This isn't ideal, as we are throwing away customer data, and in many cases entire subsets of log lines.

A prime example is timestamps. They are great for sorting and range queries, but they show up in a seemingly endless number of formats. We have been reactively adding date formats to accommodate a growing number of customers who use non-standard date formats, which otherwise typically results in us having to drop their data.

Currently one of our datetime fields has grown as such:

    , {
        name: 'timestamp'
      , type: 'datetime'
      , indexed: true
      , precision: 'milliseconds'
      , fast: true
      , input_formats: [
          'unix_timestamp'
        , 'iso8601'
        , 'rfc3339'
        , 'rfc2822'
        , '%Y/%m/%d %H:%M:%S'
        , '%Y-%m-%d %H:%M' // 2024-04-02 14:28
        , '%Y-%m-%dT%H:%M' // 2024-04-02T14:28
        , '%Y-%m-%dT%H:%M:%S' // 2024-04-02T14:28:01
        , '%Y-%m-%d %H:%M:%S' // 2024-04-08 19:40:48
        , '%Y-%m-%d %H:%M.%S' // 2024-04-08 19:40.48
        , '%Y-%m-%d %H:%M:%S,%f' // 2024-11-06 18:22:56,672
        , '%Y-%m-%d %H:%M,%f%z' // 2024-04-08 19:40,177+0000
        , '%Y-%m-%dT%H:%M:%S.%f' // 2024-04-02T18:28:23.655961
        , '%Y-%m-%d %H:%M:%S.%f' // 2024-04-02 18:28:23.655961
        , '%Y-%m-%d %H:%M:%S:%f' // 2024-04-02 14:29:30:2930
        , '%Y-%m-%d %H:%M:%S %z' // 2024-05-02 09:20:12 -0700
        , '%m-%d-%Y %H:%M:%S.%f' // 05-02-2024 16:22:25.065
        , '%m/%d/%Y %H:%M:%S' // 11/19/2024 02:15:30
        , '%m/%d/%Y %H:%M:%S.%f' // 11/19/2024 02:15:30
        , '%d/%b/%Y:%H:%M:%S %z' // 02/Apr/2024:14:27:14 +0000
        , '%d/%b/%Y:%H:%M:%S.%f' // 02/Apr/2024:14:31:16.215
        , '%d-%m-%Y %H:%M:%S' // 21-11-2024 11:56:28
        , '%d-%m-%Y %H:%M:%S.%f' // 21-11-2024 11:56:28.2395
        , '%b %e %H:%M:%S' // Apr 2 10:26:40
        ]
      , output_format: 'iso8601'
      }

This is still not enough to accommodate what our customers are sending us, and it is a never-ending problem.
Removing the field from the index mapping means we lose certain sets of functionality. Keeping it means we're almost certainly losing data.
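
To illustrate why the list keeps growing, here is a minimal Python sketch. It assumes the formats are tried in order until one matches, which is an assumption about the behavior rather than something confirmed in this issue. `INPUT_FORMATS` and `parse_timestamp` are hypothetical names, and the list is only a small subset of the formats above.

```python
from datetime import datetime

# Hypothetical subset of the input formats from the field definition above.
INPUT_FORMATS = [
    "%Y-%m-%d %H:%M:%S",     # 2024-04-08 19:40:48
    "%Y-%m-%dT%H:%M:%S.%f",  # 2024-04-02T18:28:23.655961
    "%Y-%m-%d %H:%M:%S %z",  # 2024-05-02 09:20:12 -0700
]


def parse_timestamp(value, formats=INPUT_FORMATS):
    """Try each format in order; return a datetime, or None if none match."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
    return None
```

Any value that falls outside the list, for example `08.04.2024 19:40`, comes back as `None`. Today that is the point where the whole document gets rejected.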

Describe the solution you'd like
Elasticsearch has an ignore_malformed setting, available on the root template and at the field level, that ignores such conflicts field by field and indexes what is possible rather than rejecting the entire document. We would like an equivalent option here.
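
A minimal Python sketch of the requested behavior, for illustration only. It is not Elasticsearch's implementation or a proposed design. `COERCERS` and `index_leniently` are hypothetical names, and the `status` field is made up for the example. Each mapped field is coerced on its own; a field that fails is skipped and reported, and the rest of the document is kept.

```python
from datetime import datetime

# Hypothetical coercers keyed by mapped field name. Each one raises
# ValueError or TypeError when the value cannot be converted.
COERCERS = {
    "timestamp": datetime.fromisoformat,
    "status": int,
}


def index_leniently(doc, coercers=COERCERS):
    """Return (indexable_doc, ignored_fields) instead of rejecting the doc."""
    indexable, ignored = {}, []
    for field, value in doc.items():
        coerce = coercers.get(field)
        if coerce is None:
            indexable[field] = value  # unmapped field: pass through unchanged
            continue
        try:
            indexable[field] = coerce(value)
        except (ValueError, TypeError):
            ignored.append(field)  # malformed value: drop this field only
    return indexable, ignored


doc = {"level": "error", "message": "disk full", "timestamp": "yesterday-ish", "status": "503"}
indexable, ignored = index_leniently(doc)
```

Here `ignored` contains only `timestamp`, while `level`, `message`, and the coerced `status` are still indexed. Returning the list of skipped fields mirrors how Elasticsearch records them in its `_ignored` metadata field, so affected documents can still be found later.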

Describe alternatives you've considered

Removing field mappings for problematic fields.
Adding more date formats to capture as many of the formats we see as possible.
Stripping ingested documents down to the bare essential fields when there is a field conflict. Even then, the stripped document can still have conflicts, in which case we have to discard the entire document.

Additional context
https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-malformed.html#_dealing_with_malformed_fields

fulmicoton · 2024-11-25T01:52:50Z

@esatterwhite perfect feature request (explanation, details, etc.)

esatterwhite added the enhancement label (New feature or request) Nov 22, 2024

rdettai self-assigned this Dec 5, 2024
