Add related.entity field #2360
Documentation changes preview: https://ecs_bk_2360.docs-preview.app.elstc.co/diff |
d3f25b9
to
53dc6fb
Compare
Think you linked an internal ticket. Do you expect this to duplicate with e.g. related.user or related.ip or is it only for leftover entities not representable via the existing |
@@ -70,3 +70,15 @@ | |||
identifiers include FQDNs, domain names, workstation names, or aliases. | |||
normalize: | |||
- array | |||
|
|||
- name: entity |
entity is an extremely broad category. The danger with using this is it will mean different things to different people, and become a bucket that will hold almost anything.
This would reduce the effectiveness of having a common schema, as this field will be used by different users to hold different types of data, and cause problems with writing queries, doing data normalization, etc. Already in the description, there's resource IDs, email addresses, and hostnames, which are three different things.
I think you'll need to consider the use-cases for this, and refine the definition of what this is intended to hold. Maybe just cloud_resource_names? Or have multiple fields for the different types of data that could be related.
Hey Michael, I see where you're coming from.
However, our need is indeed very broad. What we want is to be able to find any event related to an entity. What is an entity? It can be pretty much anything: a workstation, a bare metal machine, a user, an ec2 instance, a database. Anything a SOC team is concerned about.
But then why not specify cloud_resource_name or cloud_entity? Ideally, from a user experience perspective, a user doesn't need to know all the ECS field types to search for something, and doesn't need to think twice or look anything up before typing their search. I do see the point about data organisation and having clearly separated buckets, but from a search perspective, that degrades the experience. Beyond that, some concepts simply overlap. We have related.hosts and related.ip, which both hold information about a machine that can be seen as an entity. So where does the data about that specific host live? We believe it would be easier to just have all the data in related.entity and search from there.
With that said, you mentioned that having it all in one field would reduce the effectiveness of data. Can you expand on that? Why would it cause problems writing queries and doing data normalisation?
Tagging @tinnytintin10 so he can give his two cents as product (if he wishes).
Thanks for your thoughtful analysis of the proposal, @mjwolf!
You're right that "entity" is an extremely broad category, and that's intentional. Let me explain our reasoning and address your concerns:
- Regarding data consistency, as related.entity is of keyword type, consistency in data format isn't a concern for searchability. All values stored will be searchable as keywords, regardless of the identifier format.
- Regarding query performance, given that related.entity will contain identifiers (such as ARNs, emails, hostnames, etc.) and is mapped to the keyword type, we don't anticipate significant performance issues. Querying over keyword fields is generally efficient in Elasticsearch, especially for exact matches which is the primary use case here.
- Regarding data analysis, the introduction of this field should not complicate data analysis. In fact, it may simplify certain types of analysis by providing a unified field for correlation across different entity types. For more specific analyses, users can still rely on the more targeted related fields and other event details.
- This approach also lends itself to future extensibility. If certain entity types require more specific handling in the future (e.g., implicit entity type fields like the host and user ECS fields), we can introduce additional fields without breaking the functionality of related.entity.
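To make the exact-match point in the bullets above concrete, here is a minimal sketch in plain Python rather than Elasticsearch. The sample events, the identifiers in them, and the helper `find_related` are all hypothetical and only illustrate that identifiers of different shapes can share one keyword-style field and still be matched exactly.

```python
# Hypothetical events: ARNs, emails and hostnames all live in related.entity.
EVENTS = [
    {"id": 1, "related": {"entity": ["arn:aws:iam::123456789012:user/alice", "i-000000000"]}},
    {"id": 2, "related": {"entity": ["alice@example.com"]}},
    {"id": 3, "related": {"entity": ["i-000000000", "web-01.example.com"]}},
]


def find_related(events, identifier):
    """Return the events whose related.entity contains the identifier exactly.

    This mimics keyword semantics: whole-value equality, no tokenizing
    and no prefix matching.
    """
    return [
        event
        for event in events
        if identifier in event.get("related", {}).get("entity", [])
    ]


matching_ids = [event["id"] for event in find_related(EVENTS, "i-000000000")]
print(matching_ids)
```

A partial value such as "i-0000" matches nothing here, which is the behaviour one would expect from an exact match on a keyword field.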
Regarding alternatives (like the one I mentioned in the last bullet above), creating implicit entity fieldsets for each possible entity type would be a significant undertaking (especially in the cloud). If we were to follow the pattern of existing fields like "host" and "user", we'd quickly run into an explosion of entity types. Consider this non-exhaustive list of potential generic entity types we'd need to account for/introduce:
a few of these might have some ecs fields available...
- ACCESS_ROLE
- API_GATEWAY
- BACKUP_SERVICE
- BUCKET
- CICD_SERVICE
- CLOUD_LOG_CONFIGURATION
- CDN
- CONFIG_MAP
- CONTAINER_IMAGE
- CONTAINER_REGISTRY
- CONTAINER_REPOSITORY
- DATA_WORKFLOW
- DATA_WORKLOAD
- DATABASE
- DNS_RECORD
- DNS_ZONE
- DOMAIN
- EMAIL_SERVICE
- ENCRYPTION_KEY
- FILE_SYSTEM_SERVICE
- FIREWALL
- GATEWAY
- GOVERNANCE_POLICY
- LOAD_BALANCER
- MANAGED_CERTIFICATE
- MANAGEMENT_SERVICE
- MAP_REDUCE_CLUSTER
- MESSAGING_SERVICE
- MONITOR_ALERT
- NETWORK_ADDRESS
- NETWORK_INTERFACE
- PEERING
- PRIVATE_ENDPOINT
- PRIVATE_LINK
- RAW_ACCESS_POLICY
- REGION
- REGISTERED_DOMAIN
- RESOURCE_GROUP
- ROUTE_TABLE
- SEARCH_INDEX
- SECRET
- SECRET_CONTAINER
- SERVERLESS
- SERVERLESS_PACKAGE
- SERVICE_CONFIGURATION
- SERVICE_USAGE_TECHNOLOGY
- SNAPSHOT
- STORAGE_ACCOUNT
- SUBNET
- SUBSCRIPTION
- VIRTUAL_NETWORK
- VOLUME
- WEB_SERVICE
This list doesn't even include some of the entity types we already have ECS fields for, such as those related to hosts, users, and Kubernetes (which ECS calls orchestrator).
Creating and maintaining fields for each of these entity types would not only take considerable time to implement but would also result in a proliferation of field types. This approach would place a substantial cognitive burden on users, requiring them to remember a large number of specific fields for different entity types.
The related.entity field addresses this challenge by providing a single, unified field for correlation. Users don't need to know the implicit entity type for each resource to correlate events, greatly simplifying the process. For instance, they wouldn't need to know that a bucket is for blob storage or that an ARN identifies an AWS resource; they could simply use related.entity to find all events related to that entity. In short, related.entity offers a user-friendly way to correlate events across diverse entity types without overwhelming users or complicating the schema unnecessarily.
As we move forward, we'll continue to evaluate and adapt based on the evolving needs of our users and the insights we gain from this implementation.
Thanks @tinnytintin10 for the excellent explanation, I think this makes sense for achieving what you want to achieve
@mjwolf what do we need to wrap this PR up? If you approve, can we merge, or must it be discussed in other forums?
Indeed, I've been looking into how we could implement this in OpenTelemetry. @trisch-me has been supporting me in the process. The summary is:
Concepts that have some correlation with what we want to do are under discussion. OTel has the concept of Resources, which "represents the entity producing telemetry as resource attributes". The concept of entity is under discussion here.
However, this discussion doesn't fully cover what we want, because it focuses mainly on the problem of which entity produced the telemetry observation. What we want to observe is different: we want to know which entities are related to the emitted event (as actor or target). An event can have multiple related entities, and the emitting entity might have nothing to do with the information we are after. Example: the event "romulo created the ec2 instance i-001" was emitted by the trail monitor-elastic. In this case, only the entities romulo and i-001 are relevant to the security use case. The entity trail monitor-elastic is supporting information about how that event was reported, but it is of no security interest.
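The example above could be sketched roughly as follows. This is only an illustration under assumptions: the helper `build_event` and the top-level `reported_by` key are hypothetical and not part of ECS; the only point is that the actor and the target go into related.entity while the emitting trail does not.

```python
def build_event(actor, target, reported_by):
    """Build an ECS-like event dict for illustration.

    Only the actor and the target are treated as related entities.
    The reporter is kept as supporting information under a hypothetical
    key (`reported_by` is NOT an ECS field).
    """
    return {
        "message": f"{actor} created the ec2 instance {target}",
        "reported_by": reported_by,  # hypothetical, supporting info only
        "related": {"entity": [actor, target]},
    }


event = build_event("romulo", "i-001", "monitor-elastic")
print(event["related"]["entity"])
```

Searching related.entity for romulo or i-001 would then surface this event, while a search for monitor-elastic would not.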
I have discussed our use case in both the Semantics SIG and the Entity SIG. Both groups agreed that our use case is legitimate. But the Entity SIG (the correct group to have this discussion) has other priorities right now. They suggested that we, as Elastic, open an OTEP and prepare our case to be discussed in the Entity SIG.
Because I believe this goes beyond just CDR, and other teams might be interested, I want to take this discussion further and find out how we, as Elastic, want to approach this topic. My next point of discussion will be at the Elastic OpenTelemetry Office Hours, where I'll raise what we want to do and see if more people in that group are interested.
There isn't, however, any timeline for when this discussion will properly start or end. @trisch-me and I agreed to merge this PR independently of any OTel discussion or outcome, because the pace doesn't seem too promising.
Yes, I believe such a broad topic will take months in OTel just to formalise correctly, not to mention implementation.
I also want to add that we are talking here only about entities inside related. But the whole concept of the related namespace will not be ported to OTel as is; there is no place for it in this format. So before we could have entities there, we should think about the parent namespace related first.
That said, I believe we should proceed with this topic in ECS first.
@tinnytintin10 hey, thanks a lot for this explanation, but I still have doubts. Let's say I'm a developer who is using ECS for my case (for example, we at Elastic are using ECS in Endgame). How should I determine whether something from my event/log entry is an entity or not? Should I just throw into the bucket everything I think can be an entity?
I understand it's a broader topic, but I would like to have more clarity and examples there, also to give everyone reading this an understanding of the field and the data put into it.
@trisch-me do you have examples of what you mean by "I think can be entity"? I'm not sure which use cases you had in mind.
I understand an entity as:
An "entity" in our context refers to any discrete component within an IT environment that can be uniquely identified and monitored. This broad term encompasses both managed and unmanaged elements.
Machines, virtual or physical, are entities. Instances of tooling/services/components such as queueing topics/subscriptions, databases, networking components, object storage, authentication and authorization components (and others) are entities. So are users and their representations in different systems, like Okta, AWS, Azure, Active Directory and others.
Did you think of something I didn't mention here, @trisch-me?
Regardless, I agree that we as Elastic should have a formal definition written in our docs of what entities are and what they are not. That will help us move forward with fewer doubts and less friction.
When I said "I think", I was referring to a developer who might not understand or know what entities are, especially if they work in another, non-security domain. So my request is to have a concrete proposal on what an entity is, directly in the description or notes. Even the sentences you wrote above are a better explanation than what we currently have in the proposal. We might also have this definition somewhere else and just link to it.
Hey @Samrose-Ahmed! The related.entity field is designed to complement, not replace, these existing fields. While there may be some overlap, the primary purpose of related.entity is to provide a unified field for correlation across a wide range of entity types, including those not currently represented in other related fields. We recommend that data producers continue to populate specific related fields (like related.ip for IP addresses) in addition to related.entity. This approach ensures backward compatibility and allows for more specific queries when needed, while also enabling broad correlation queries using related.entity. The goal is to enhance search capabilities rather than create redundancy. In cases where an identifier could be placed in both a specific related field and related.entity, populating both will maximize search flexibility. Wdyt?
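As a rough sketch of the "populate both" recommendation above: the helper `build_related` below is hypothetical, and using Python's standard `ipaddress` module to decide what also belongs in related.ip is an assumption made purely for illustration, not something this PR prescribes.

```python
import ipaddress


def build_related(identifiers):
    """Populate related.entity with every identifier, and additionally
    copy anything that parses as an IP address into related.ip."""
    related = {"entity": [], "ip": []}
    for value in identifiers:
        related["entity"].append(value)
        try:
            ipaddress.ip_address(value)
        except ValueError:
            continue  # not an IP address, so it stays only in related.entity
        related["ip"].append(value)
    return related


related = build_related(["10.0.0.5", "i-000000000"])
print(related)
```

Existing queries on related.ip keep working, and the same IP is also reachable through a broad related.entity search.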
6ef772b
to
8cc868c
Compare
Hi! We just realized that we haven't looked into this PR in a while. We're labeling this PR as Thank you for your contribution! |
@romulets could you please address the comments in this PR? |
Background
Elastic Cloud Security Team has been focusing, this past year, on Cloud Detection and Response (CDR). One of the first steps towards the CDR vision is to enhance investigation workflows for the Cloud Security use-case in SIEM.
As part of enhancing investigation workflows it's necessary to be able to correlate events and entities. Meaning, if an alert is triggered on the ec2 instance i-000000000, it is of great value to easily be able to search all the events related to that entity, across multiple indices, with one query. Therefore we are working on extracting entities and enabling them to be correlated.
Why related.entity
With this background, we've researched a few options on what would be the best approach to enable such a feature, and arrived at the ECS field related. Based on the related description:
To add a broad related.entity field that can hold any needed identifier to pivot data on seems to be well fitted. This would enable customers to simply run related.entity: "i-000000000" and get all the hits to that specific cloud resource.
What is an entity?
An "entity" in our context refers to any discrete component within an IT environment that can be uniquely identified and monitored. This broad term encompasses both managed and unmanaged elements.
The term "entity" is broader than the current set of available fields under related. Although ip, user and hosts can be identities, there is a lack of space to represent messaging queues, load balancers, storage systems, databases and others. Therefore the proposal to add a new field.
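For illustration, the related.entity: "i-000000000" search described under "Why related.entity" might translate into an Elasticsearch query body along these lines. This is a sketch only: the helper `entity_query` is hypothetical, and which indices the body is sent to is left open.

```python
import json


def entity_query(identifier):
    """Build a query body that exact-matches one identifier on related.entity.

    Uses the Elasticsearch `term` query, which suits keyword fields.
    """
    return {"query": {"term": {"related.entity": identifier}}}


body = entity_query("i-000000000")
print(json.dumps(body, indent=2))
```

Because the same body can be sent to any index that maps related.entity as a keyword, one query can pivot across multiple indices, which is the workflow the background section describes.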