Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new fields file.origin_referrer_url & file.origin_url #2348

Draft
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

AsuNa-jp
Copy link
Contributor

@AsuNa-jp AsuNa-jp commented Jun 27, 2024

Checklist

  • Have you signed the contributor license agreement?
    • YES
  • Have you followed the contributor guidelines?
    • YES
  • For proposing substantial changes or additions to the schema, have you reviewed the RFC process?
    • Not substantial changes
  • If submitting code/script changes, have you verified all tests pass locally using make test?
    • Not code/script changes changes
  • If submitting schema/fields updates, have you generated new artifacts by running make and committed those changes?
    • YES
  • Is your pull request against main? Unless there is a good reason otherwise, we prefer pull requests against main and will backport as needed.
    • YES
  • Have you added an entry to the CHANGELOG.next.md?
    • YES

PR summary

This PR adds file.origin_referrer_url and file.origin_url

To reviewers

Size of the fields
The two fields added in this PR are intended to store URL. Therefore, the default field size of 1024 bytes is insufficient. As a result, we want to set ignore_above to 8k(8192) bytes, if possible.

Default Field
This field will not necessarily be included in all file events.
Therefore, I set it as default_field: false, but please let me know if this is incorrect.

@AsuNa-jp AsuNa-jp requested a review from a team as a code owner June 27, 2024 08:19
@AsuNa-jp AsuNa-jp self-assigned this Jun 27, 2024
@AsuNa-jp AsuNa-jp added New Field Request review endpoint Relevant to elastic endpoint security Team: ECS labels Aug 6, 2024
@AsuNa-jp AsuNa-jp requested review from jdu2600 and mjwolf August 13, 2024 14:09
Copy link
Contributor

@mjwolf mjwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks generally pretty good, just some small things to fix up

CHANGELOG.next.md Outdated Show resolved Hide resolved
- name: origin_referrer_url
level: extended
type: keyword
ignore_above: 8192
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fairly large ignore_above value, do you expect that there will be data this size? Have you looked into the performance impact?

Copy link
Contributor Author

@AsuNa-jp AsuNa-jp Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you expect that there will be data this size?

Yes. When defining the URL size, I did some research and came across a discussion from a past ECS thread about URL size, which I used as a reference. There was a discussion that the URL field should be at least 4096 bytes, and ideally 8192 bytes based on the actual URL log data, so I followed that suggestion. In addition, after discussing it within the team, we concluded that since the typical maximum URL size accepted by servers is 8KB, 8KB seemed like the most reasonable choice.

Have you looked into the performance impact?

However, I haven't look at the performance impact of setting ignore_above as 8192. If performance testing is required, I would be happy to carry it out. If there is any documentation on how to do it, could you please share it with me?

schemas/file.yml Outdated
level: extended
type: keyword
ignore_above: 8192
description: The url of the webpage that linked to the file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capitalize URL here and in the below description

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! I have changed url to URL!

schemas/file.yml Outdated
level: extended
type: keyword
ignore_above: 8192
description: The url where the file is hosted.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered redirects? For example, when using CDNs, the initial URL can be redirected to various download URLs. Is that meaningful for your use case? Should the description specify if this is the initial request or a redirected URL?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have implemented this field (origin_url) to retrieve the actual URL from which the file was downloaded using the Mark of the Web (MoTW) information in Windows. (MoTW is metadata added to files downloaded from the internet in Windows. )

For example, if you download ejabberd_20.12-0_amd64.deb from https://www.process-one.net/downloads/downloads-action.php?file=/20.12/ejabberd_20.12-0_amd64.deb, the download is redirected, and the file is actually retrieved from https://static.process-one.net/ejabberd/downloads/20.12/ejabberd_20.12-0_amd64.deb.

In MoTW, as shown in the attached image, the redirect URL is saved as the file's host URL.

image

@AsuNa-jp
Copy link
Contributor Author

AsuNa-jp commented Sep 4, 2024

@mjwolf
Thank you for the detailed comments/feedbacks🙇🏻‍♀️ & I’m sorry for the delay in responding, as I was out of the office from mid to late August. I’ve updated the PR based on your feedback, and I’ve also answered your questions above. I’d appreciate it if you could take a look.🙇🏻‍♀️

level: extended
type: keyword
ignore_above: 8192
description: The URL of the webpage that linked to the file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand what does it mean? Could you elaborate it a bit, give some examples maybe? Is it url, where file was located before downloading?

Copy link
Contributor Author

@AsuNa-jp AsuNa-jp Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @trisch-me,
Thank you for the comments/feedbacks!🙇🏻‍♀️
Following is the actual example of the file.origin_referrer_url and file.origin_url.

When you download an image file (for example, news30_img3.png) from the following web page,

the file.origin_url will be the url for the downloaded file, and the file.origin_referrer_url will be the url which hosted the file (http://girls.seccon.jp/news30.html).

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for explanation, can you add more information into description/examples of the actual fields?

@trisch-me
Copy link
Contributor

trisch-me commented Sep 12, 2024

Hey @AsuNa-jp I think it would be great to follow RFC process for this change, because it's a new addition and it should usually go through the process.

Currently looking into fields I'm not sure we can add those to the file namespace because from the semantic logic they are of type url and would be great to have them as such.

in Otel we have this differentiation between schema, i.e. https://github.com/open-telemetry/semantic-conventions/blob/main/model/registry/url.yaml and usage of schema (or how we call it - usecases), where you have fields from different namespaces together in 1 place, describing the usecase (https://github.com/open-telemetry/semantic-conventions/blob/main/model/trace/database.yaml#L263)

Translating this into your PR and new fields, I would argue that they should be a part of the file namespace (though it's possible too) but rather be a part of that mix for specific usecase. Also another possibility to have them not as strings but as a reference to a file namespace. Why - because then file.url will have the same restrictions as all other urls (for example redacting sensitive data in url)

In current Otel version one can't rename referenced attributes, it means url.full is always url.full but there will be a change soon for this possibility

@AsuNa-jp
Copy link
Contributor Author

Hi @trisch-me!

I’ve finally come to understand the differences between ECS and OpenTelemetry, so I’d like to answer your previous question.

Whether or not it should be placed in the file namespace

ECS

Based on the definition of the file fields in ECS, I believe file.origin_referrer_url and file.origin_url
should be defined in the file namespace.

A file is defined as a set of information that has been created on, or has existed on a filesystem.
File objects can be associated with host events, network events, and/or file events (e.g., those produced by File Integrity Monitoring [FIM] products or services). File fields provide details about the affected file associated with the event or metric.
https://www.elastic.co/guide/en/ecs/current/ecs-file.html

This is because this origin_referrer_url and origin_url information is added automatically to the file's Alternate Data Stream (ADS) in NTFS (the Windows file system) when a file is downloaded. It is not information retrieved from a web request (eg. GET request), but rather from the file system. According to the ECS definition, this type of information should be placed in the file namespace.

For example, when you download an image file (image17.webp) from this webpage using a web browser, the download source URL is automatically added to the file's Alternate Data Stream (ADS) as following.

  • file's ADS
image
  • Inside image17.webp:Zone.Identifier:$DATA
image (ReferrerUrl -> `origin_referrer_url` . HostUrl->`origin_url` )

OpenTelemetry

As for OpenTelemetry (registry.yaml), it seems to limit the file namespace to file metadata (such as size), so I understand your doubts about whether it should be included in the file namespace. However, if we are to unify the naming with ECS, I believe that file.origin_referrer_url and file.origin_url should also be included in the file namespace on the OpenTelemetry side.

If there is no need to unify the naming between ECS and OpenTelemetry, then I think it's fine to handle it as a mixed use case(currently, spans.yaml), as you suggested. However, this field itself is not intended for any specific use case, but rather it could be included in logs for any file-related operation. Therefore, it's difficult to define a single (or a few)use case. In such a case, what would be the best approach?

@AsuNa-jp
Copy link
Contributor Author

@magermark
If you have any thoughts or suggestions from endpoint (elastic defend) perspective, I would be grateful if you could kindly share them.

@trisch-me
Copy link
Contributor

hey @AsuNa-jp great explanation, I agree with you that in that case we might add those fields into file namespace.

Do you foresee there any private information in the url path? I don't think we are in control of it, but we might have it, and we might have to or want to remove it. Should we add this to the description?

We aim to have the same names for both otel and ecs fields, otherwise it will be a breaking change for us (small one but better to avoid). I'm not opposing this change in the ECS, but I would like to encourage you to create a similar PR in Otel (I can give you some hints and can show how to) to get initial response on the field names.

@AsuNa-jp
Copy link
Contributor Author

AsuNa-jp commented Sep 20, 2024

@trisch-me Thank you for your reply.
I had also been thinking about how to address the points you raised.

Do you foresee there any private information in the url path?

Yes, if a password is included in the URL of a web page, it can be exposed as shown below.
image
(Above is the URL to access the e-ticket I purchased, but it has already expired, so it is no longer accessible.)

We aim to have the same names for both otel and ecs fields

If the field needs to be the same between ECS and Otel, how about first adding file.origin_referrer_url and file.origin_url to ECS. and then, once Otel supports renaming referenced attributes, we can update spans.yaml by adding ref: url.full(and rename or alias as file.origin_referrer_url). How does that sound?

@trisch-me
Copy link
Contributor

trisch-me commented Sep 23, 2024

I think we can start with just adding fields to the otel as we want to have them in ECS, I don't see any problems in starting this discussion already. Can you do it? Should I create a PR?

And we should add a note about possible sensitive data inside url

@AsuNa-jp AsuNa-jp marked this pull request as draft September 25, 2024 09:18
@AsuNa-jp AsuNa-jp removed the request for review from jdu2600 September 27, 2024 05:19
Copy link

Hi!

We just realized that we haven't looked into this PR in a while. We're
sorry!

We're labeling this PR as Stale to make it hit our filters and
make sure we get back to it as soon as possible. In the meantime, it'd
be extremely helpful if you could take a look at it as well and confirm its
relevance. A simple comment with a nice emoji will be enough :+1.

Thank you for your contribution!

@github-actions github-actions bot added the stale Stale issues and pull requests label Nov 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
endpoint Relevant to elastic endpoint security New Field Request review stale Stale issues and pull requests Team: ECS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants