Selective field indexing during field type conflicts #5561

Open
esatterwhite opened this issue Nov 22, 2024 · 1 comment
Labels: enhancement (New feature or request)

Comments

@esatterwhite
Collaborator

Is your feature request related to a problem? Please describe.

We run a SaaS product that collects log data from thousands of customers and applications. Their logs arrive in a wide variety of formats and styles.
We have an index template that defines some well-known fields with specific types. These fields give us and our customers additional functionality in the product, such as sorting and range queries. We do our best to normalize data, but we cannot account for every case. When a mapped field receives a value of a different type, or a value that cannot be coerced, the entire document is rejected. In that case we are forced to strip as much extraneous data as possible in an effort to index what we consider absolutely critical. For logs, this is mainly the level and message. This isn't ideal, because we are throwing away customer data and, in many cases, entire subsets of log lines.

A prime example is timestamps. They are great for sorting and range queries, but they show up in a seemingly endless number of formats. We have been reactively adding date formats to accommodate a growing number of customers who use nonstandard date formats. An unrecognized format typically results in us having to drop their data.

Currently, one of our datetime fields has grown to this:

    , {
        name: 'timestamp'
      , type: 'datetime'
      , indexed: true
      , precision: 'milliseconds'
      , fast: true
      , input_formats: [
          'unix_timestamp'
        , 'iso8601'
        , 'rfc3339'
        , 'rfc2822'
        , '%Y/%m/%d %H:%M:%S'
        , '%Y-%m-%d %H:%M' // 2024-04-02 14:28
        , '%Y-%m-%dT%H:%M' // 2024-04-02T14:28
        , '%Y-%m-%dT%H:%M:%S' // 2024-04-02T14:28:01
        , '%Y-%m-%d %H:%M:%S' // 2024-04-08 19:40:48
        , '%Y-%m-%d %H:%M.%S' // 2024-04-08 19:40.48
        , '%Y-%m-%d %H:%M:%S,%f' // 2024-11-06 18:22:56,672
        , '%Y-%m-%d %H:%M,%f%z' // 2024-04-08 19:40,177+0000
        , '%Y-%m-%dT%H:%M:%S.%f' // 2024-04-02T18:28:23.655961
        , '%Y-%m-%d %H:%M:%S.%f' // 2024-04-02 18:28:23.655961
        , '%Y-%m-%d %H:%M:%S:%f' // 2024-04-02 14:29:30:2930
        , '%Y-%m-%d %H:%M:%S %z' // 2024-05-02 09:20:12 -0700
        , '%m-%d-%Y %H:%M:%S.%f' // 05-02-2024 16:22:25.065
        , '%m/%d/%Y %H:%M:%S' // 11/19/2024 02:15:30
        , '%m/%d/%Y %H:%M:%S.%f' // 11/19/2024 02:15:30
        , '%d/%b/%Y:%H:%M:%S %z' // 02/Apr/2024:14:27:14 +0000
        , '%d/%b/%Y:%H:%M:%S.%f' // 02/Apr/2024:14:31:16.215
        , '%d-%m-%Y %H:%M:%S' // 21-11-2024 11:56:28
        , '%d-%m-%Y %H:%M:%S.%f' // 21-11-2024 11:56:28.2395
        , '%b %e %H:%M:%S' // Apr 2 10:26:40
        ]
      , output_format: 'iso8601'
      }

This is still not enough to accommodate what our customers send us, and it is a never-ending problem.
Removing the field from the index mapping means we lose certain functionality. Keeping it means we are almost certainly losing data.

Describe the solution you'd like
Elasticsearch has an ignore_malformed setting, available on the root template and at the field level, that ignores such conflicts field by field and indexes what it can rather than rejecting the entire document. We would like something similar.
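
To make the requested behavior concrete, here is a minimal sketch of per-field handling in plain JavaScript. It only models the semantics and is not an implementation proposal. Every name in it (`coercers`, `indexDocument`, the `ignore_malformed` option, the `_ignored` key) is hypothetical and chosen for illustration. The `_ignored` key is loosely modeled on the Elasticsearch metadata field of the same name. The datetime check accepts only epoch numbers and one strict ISO 8601 shape, which is an assumption made to keep the example short.

```javascript
'use strict'

// Hypothetical per-type coercers. Each returns the indexed value or throws.
const ISO_8601 = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$/

const coercers = {
  datetime(value) {
    // Assumption: numbers are already epoch milliseconds.
    if (typeof value === 'number' && Number.isFinite(value)) return value
    if (typeof value === 'string' && ISO_8601.test(value)) {
      const ms = Date.parse(value)
      if (!Number.isNaN(ms)) return ms
    }
    throw new TypeError(`malformed datetime: ${JSON.stringify(value)}`)
  }
, text(value) {
    if (typeof value !== 'string') throw new TypeError('expected a string')
    return value
  }
}

// `mapping` maps field name -> type name. Unmapped fields pass through as-is.
function indexDocument(doc, mapping, {ignore_malformed = false} = {}) {
  const indexed = {}
  const ignored = []
  for (const [name, value] of Object.entries(doc)) {
    const type = mapping[name]
    if (type === undefined) {
      indexed[name] = value
      continue
    }
    try {
      indexed[name] = coercers[type](value)
    } catch (err) {
      // Without the option, one bad field rejects the whole document.
      if (!ignore_malformed) throw err
      // With the option, skip only the offending field and record its name.
      ignored.push(name)
    }
  }
  if (ignored.length > 0) indexed._ignored = ignored
  return indexed
}
```

With `ignore_malformed: true`, a document whose `timestamp` is `'11/19/2024 02:15:30'` would still have its `message` and `level` indexed, and `timestamp` would be listed under `_ignored` instead of causing the whole document to be rejected.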

Describe alternatives you've considered

  • Removing field mappings for problematic fields
  • Adding more date formats to cover as many as we come across
  • Stripping ingested documents down to the bare essential fields when there is a field conflict. Even then, the stripped document can still have conflicts, in which case we have to discard the entire document.
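
For reference, the stripping workaround in the last bullet looks roughly like the sketch below. This is a simplified illustration, not our production code: `indexWithFallback`, `ESSENTIAL_FIELDS`, and the `indexFn` callback are hypothetical names, and `indexFn` stands in for whatever call submits a document and throws when it is rejected.

```javascript
'use strict'

// Fields we consider critical for logs (assumption for this sketch).
const ESSENTIAL_FIELDS = ['level', 'message']

// Try the full document first, then retry with only the essential fields.
// `indexFn` is a hypothetical callback that throws when a document is rejected.
function indexWithFallback(doc, indexFn) {
  try {
    return {status: 'indexed', result: indexFn(doc)}
  } catch (err) {
    const stripped = {}
    for (const name of ESSENTIAL_FIELDS) {
      if (Object.prototype.hasOwnProperty.call(doc, name)) {
        stripped[name] = doc[name]
      }
    }
    try {
      return {status: 'stripped', result: indexFn(stripped)}
    } catch (strippedErr) {
      // Even the essential fields conflict, so the document is lost.
      return {status: 'discarded', error: strippedErr.message}
    }
  }
}
```

Both the `stripped` and `discarded` outcomes lose customer data, which is why per-field handling would be preferable.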

Additional context
https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-malformed.html#_dealing_with_malformed_fields

@esatterwhite esatterwhite added the enhancement New feature or request label Nov 22, 2024
@fulmicoton
Contributor

@esatterwhite perfect feature request (explanation, details, etc.)

@rdettai rdettai self-assigned this Dec 5, 2024