Implement alternative analyzers that favor speed over accuracy #8361

Open
sschuberth opened this issue Feb 29, 2024 · 2 comments
Labels
analyzer About the analyzer tool new feature Issues that are considered to be new features

Comments

@sschuberth
Member

sschuberth commented Feb 29, 2024

Historically, the ORT analyzer has been pedantic about getting things right (i.e. resolving exactly the same dependencies as the build system does), and gathering all metadata known about a package (even if irrelevant in pretty much every compliance check).

To do so, the ORT analyzer requires the build system configuration of the project under analysis to be self-contained, in good shape, and to follow best practices WRT the build system's own recommendations (e.g. for Gradle, to not evaluate environment variables at task configuration time).

While it's somewhat nice that the ORT analyzer implicitly checks for adhering to those best practices, which can be seen as a build health analysis feature, there are cases where it hinders getting the required information, e.g. when analyzing legacy code bases and projects that are in maintenance mode.

Also, for many compliance checks the hierarchy in a dependency graph does not really matter. A flat list of packages would be enough, as only the combination of distributed packages and their licenses might matter.

So the proposal is to implement more lenient "quick & dirty" analyzers that are much faster than the current analyzers, which are eager to get things fully correct at all costs. Ideas for making things faster include:

  • lockfile-only analyzers
  • only get a flat list of packages with metadata, but no dependency graph
  • statically parse files instead of using the original build system
  • leverage information from deps.dev
  • fully offline analyzers that require no network connection
  • ...
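As a rough illustration of the first three ideas combined, here is a minimal sketch of a lockfile-only, statically parsing "analyzer" that yields a flat package list without a dependency graph. It assumes an npm `package-lock.json` with a `packages` map (lockfile version 2 or 3); the function name `flat_packages_from_lockfile` and the example package names are hypothetical and not part of ORT.

```python
import json


def flat_packages_from_lockfile(lockfile_text):
    """Return a sorted, de-duplicated list of (name, version, license) tuples.

    Assumption: the lockfile has a top-level "packages" map whose keys are
    install paths such as "node_modules/a/node_modules/b".
    """
    lock = json.loads(lockfile_text)
    packages = set()
    for path, entry in lock.get("packages", {}).items():
        # The empty key is the root project itself; entries without a
        # version (e.g. workspace links) are skipped in this sketch.
        if not path or "version" not in entry:
            continue
        # The package name is whatever follows the last "node_modules/".
        name = entry.get("name") or path.rpartition("node_modules/")[2]
        # "license" is optional in the lockfile, so it may be None here.
        packages.add((name, entry["version"], entry.get("license")))
    return sorted(packages, key=lambda p: (p[0], p[1]))


# Hypothetical lockfile content, for illustration only.
EXAMPLE_LOCKFILE = """
{
  "lockfileVersion": 3,
  "packages": {
    "": {"name": "demo", "version": "0.1.0"},
    "node_modules/example-a": {"version": "2.0.0", "license": "Apache-2.0"},
    "node_modules/@example/b": {"version": "1.0.0", "license": "MIT"},
    "node_modules/example-a/node_modules/@example/b": {"version": "1.0.0", "license": "MIT"}
  }
}
"""

flat = flat_packages_from_lockfile(EXAMPLE_LOCKFILE)
for name, version, license_id in flat:
    print(name, version, license_id)
```

The nesting of `@example/b` below `example-a` is deliberately thrown away: the same package at the same version appears once, which is all a license-per-distributed-package check would need. Neither the build system nor the network is involved.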

References: #4112 (comment), #5175 (comment), #8278

@sschuberth sschuberth added analyzer About the analyzer tool new feature Issues that are considered to be new features labels Feb 29, 2024
@heliocastro
Contributor

heliocastro commented Mar 1, 2024

First of all, I think this is a dangerous approach.

One of the main reasons ORT is being adopted more and more is the accuracy and level of detail of the information we can provide.
I know the cost and the burden of keeping this up, but it is indeed what makes the whole software simply better.

By going down the "easy path" of good enough, we would simply downgrade the entire project to the same level as other tools, including commercial ones, and give people enough ammunition to say anything about an OSS project, like the usual arguments.

That said, there is actually a compromise that could be made: caching the analyzer process, similar to how it is done on the scan front.
Technically, unless there is some attack vector that replaces a package in a registry under the same version, the package does not change, and therefore neither do its transitive dependencies. We act on direct dependencies.

So, let's say that package A=1.0.0 depends on 20 levels of transitive dependencies. On the first analyzer job, we go like we do today, through all 20 levels. Slow. One time.
On the second run, we detect A=1.0.0. Does the package match the same cached hash? All good, get the cached info, go ahead.
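To make the two runs concrete, here is a minimal sketch of such a cache. It assumes results are keyed by the direct dependency's identifier plus a content hash; the class `AnalyzerCache` and the `slow_resolve` stand-in are hypothetical and do not reflect any existing ORT API.

```python
class AnalyzerCache:
    """Caches resolved dependency trees per (package id, content hash)."""

    def __init__(self, resolve):
        self._resolve = resolve  # the slow, full resolution (assumed given)
        self._entries = {}       # (package_id, content_hash) -> resolved deps
        self.misses = 0          # how often the slow path had to run

    def dependencies(self, package_id, content_hash):
        key = (package_id, content_hash)
        if key not in self._entries:
            # First time this exact package content is seen: resolve fully.
            self.misses += 1
            self._entries[key] = self._resolve(package_id)
        return self._entries[key]


def slow_resolve(package_id):
    # Stand-in for walking all levels of transitive dependencies.
    return [f"{package_id}/dep-{level}" for level in range(1, 21)]


cache = AnalyzerCache(slow_resolve)
first_run = cache.dependencies("A@1.0.0", "sha256:aaa")   # slow, one time
second_run = cache.dependencies("A@1.0.0", "sha256:aaa")  # served from cache
print(len(first_run), cache.misses)
```

Because the hash is part of the key, a package whose content changes under the same version simply misses the cache and gets resolved again.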

So, the two situations in which the cache would not work would be:

1 - The developer decided not to pin a fixed version, so a new one is always picked if it exists. But that is something we have no control over.

2 - A supply chain attack in which the direct dependency suddenly has new code injected under the same version. The hash changes, so we scan again. BUT this then brings an interesting unintended side effect. On the ORT side we should raise an error if an already cached version suddenly changes. But this info will be EXTREMELY valuable on the security front.
Imagine that we do continuous scanning at several companies today. That means that, even though it is not intended as a security/vulnerability tool, ORT will be the first one to detect that something is wrong with a dependency in a specific registry, and unless it is a hiccup, it is an immediate, clear sign that someone is poisoning the origin.
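The "raise an error if an already cached version suddenly changes" part could look roughly like the sketch below. The names `KnownHashes` and `HashChangedError` are hypothetical, and where the hash comes from (registry metadata, downloaded artifact) is left open as an assumption.

```python
class HashChangedError(Exception):
    """Raised when a known package version shows up with a different hash."""


class KnownHashes:
    def __init__(self):
        self._hashes = {}  # package id ("name@version") -> first hash seen

    def verify(self, package_id, content_hash):
        """Record the hash on first sight; raise if a later run disagrees."""
        known = self._hashes.get(package_id)
        if known is None:
            self._hashes[package_id] = content_hash
            return "new"
        if known != content_hash:
            # Same version, different content: either a registry hiccup or
            # a sign that the origin is being tampered with.
            raise HashChangedError(
                f"{package_id}: expected {known}, got {content_hash}"
            )
        return "unchanged"


hashes = KnownHashes()
print(hashes.verify("A@1.0.0", "sha256:aaa"))  # first run
print(hashes.verify("A@1.0.0", "sha256:aaa"))  # later run, same content
try:
    hashes.verify("A@1.0.0", "sha256:bbb")     # same version, new content
except HashChangedError as error:
    print("ALERT:", error)
```

Failing loudly instead of silently re-resolving is what turns the cache into the early-warning signal described above.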

This approach would then raise the following question:
"But if you are only looking at direct dependencies for the cache, won't the transitive dependencies suffer from the same issue, with a hash change deep down in the tree going unnoticed?"

Yes and no. On the first complete scan, we will have the entire tree with correct hashes, and even if something is compromised in the future, the cached version of the transitive deps will always be the original.

We do run the risk that the first big scan catches the compromised version, though. So in this case the question is valid, but again, given the diversity of companies using the same dependencies in their projects, chances are that someone will catch this error, and then it is the job of security teams to deal with that. We are happily able to provide them with the version, hash, and details first-hand.

Honestly, a proper caching mechanism is more beneficial than downplaying our good results, which carries the risk of simply trivializing the entire project as "just another one".

@sschuberth
Member Author

A different idea to approach this would be to simply leverage the various existing tools that produce the GitHub dependency graph JSON format and create an ORT command to convert these to analyzer results. As ORT analyzer results probably contain more mandatory metadata than provided by GitHub's format, additional sources of metadata, like package registries, might need to be queried as part of the conversion.
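A minimal sketch of the conversion step, assuming the input looks like a GitHub dependency submission snapshot, with a `manifests` map whose `resolved` entries carry a `package_url`. The output is a plain flat list, not the real ORT analyzer result format. The functions `parse_purl` and `packages_from_snapshot` are hypothetical, and the purl parsing only handles the simple `pkg:type/name@version` shape.

```python
from urllib.parse import unquote


def parse_purl(purl):
    """Tiny purl parser for pkg:type/[namespace/]name@version only."""
    _, _, body = purl.partition(":")
    body = body.split("#")[0].split("?")[0]  # drop subpath and qualifiers
    pkg_type, _, rest = body.partition("/")
    if "@" in rest:
        name, version = rest.rsplit("@", 1)
    else:
        name, version = rest, None
    return {"type": pkg_type, "name": unquote(name), "version": version}


def packages_from_snapshot(snapshot):
    """Flatten all manifests into one de-duplicated package list."""
    seen = {}
    for manifest in snapshot.get("manifests", {}).values():
        for dependency in manifest.get("resolved", {}).values():
            purl = dependency.get("package_url")
            if purl:
                seen[purl] = parse_purl(purl)
    return [seen[purl] for purl in sorted(seen)]


# Hypothetical snapshot content, for illustration only.
EXAMPLE_SNAPSHOT = {
    "manifests": {
        "package-lock.json": {
            "name": "package-lock.json",
            "resolved": {
                "example-a": {
                    "package_url": "pkg:npm/example-a@2.0.0",
                    "relationship": "direct",
                },
                "@example/b": {
                    "package_url": "pkg:npm/%40example/b@1.0.0",
                    "relationship": "indirect",
                },
            },
        }
    }
}

converted = packages_from_snapshot(EXAMPLE_SNAPSHOT)
for package in converted:
    print(package)
```

Note what is missing from the result: licenses, source locations, and other metadata are not in the snapshot, which is where the additional registry queries mentioned above would come in.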
