Implement alternative analyzers that favor speed over accuracy #8361

sschuberth · 2024-02-29T08:56:29Z

Historically, the ORT analyzer has been pedantic about getting things right (i.e. resolving exactly the same dependencies as the build system does), and gathering all metadata known about a package (even if irrelevant in pretty much every compliance check).

To do so, the ORT analyzer requires the build system configuration of the project under analysis to be self-contained, in good shape, and to follow best practices WRT the build system's own recommendations (e.g. for Gradle, to not evaluate environment variables at task configuration time).

While it's somewhat nice that the ORT analyzer implicitly checks for adhering to those best practices, which can be seen as a build health analysis feature, there are cases where it hinders getting the required information, e.g. when analyzing legacy code bases and projects that are in maintenance mode.

Also, for many compliance checks the hierarchy in a dependency graph does not really matter; a flat list of packages would be enough, as only the combination of distributed packages and their licenses might matter.

So the proposal is to implement more lenient "quick & dirty" analyzers that are much faster than the current analyzers, which are eager to get things fully correct at all costs. Ideas for making things faster include:

lockfile-only analyzers
only get a flat list of packages with metadata, but no dependency graph
statically parse files instead of using the original build system
leverage information from deps.dev
fully offline analyzers that require no network connection
...
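
As an illustration of the first three ideas combined (lockfile-only, flat list, static parsing), here is a minimal sketch, not ORT code. It assumes an npm `package-lock.json` with `lockfileVersion` 2 or 3, where the top-level `packages` object is keyed by install path. The function name `flat_packages_from_lockfile` is hypothetical, and no metadata beyond name and version is collected.

```python
import json


def flat_packages_from_lockfile(lockfile_text):
    """Return a sorted, de-duplicated list of (name, version) pairs.

    Assumes an npm lockfile (lockfileVersion 2 or 3). The build system is
    never invoked and no network access is needed.
    """
    lock = json.loads(lockfile_text)
    packages = set()

    for path, entry in lock.get("packages", {}).items():
        if not path:
            continue  # The empty key describes the root project itself.
        if entry.get("link"):
            continue  # Symlinked workspace entries carry no version of their own.

        # "node_modules/a/node_modules/@scope/b" -> "@scope/b"
        name = entry.get("name") or path.rpartition("node_modules/")[2]
        version = entry.get("version")
        if version:
            packages.add((name, version))

    # The dependency hierarchy is deliberately dropped: only a flat list remains.
    return sorted(packages)
```

The same package appearing at several nesting levels collapses into a single entry, which is exactly the information loss this proposal accepts in exchange for speed.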

References: #4112 (comment), #5175 (comment), #8278

heliocastro · 2024-03-01T09:12:50Z

First of all, I think this is a dangerous approach.

One of the main reasons ORT is being adopted more and more is the accuracy and level of detail of the information we can provide.
I know the cost and the burden of keeping this up, but it is indeed what makes the whole software simply better.

By taking the "easy path" of good enough, we would simply be downgrading the entire project to the same level as other tools, including commercial ones, and giving critics enough ammunition to bring the usual arguments against an OSS project.

That said, there is actually a compromise that could be made: caching the analyzer process, similar to how it is done on the scanner side.
Technically, unless some attack replaces a package in a registry under the same version, the package does not change, and therefore neither do its transitive dependencies. We would act on direct dependencies only.

So, let's say that package A=1.0.0 has 20 levels of transitive dependencies. In the first analyzer run, we proceed as we do today and resolve all 20 levels. Slow, but only once.
In the second run, we detect A=1.0.0. Does the package match the cached hash? All good, take the cached info and go ahead.

So, the two situations in which the cache would not work are:

1 - The developer decided not to pin a fixed version, so a new one is always picked if it exists. But that is something we have no control over.

2 - A supply chain attack in which the direct dependency suddenly has new code injected under the same version. The hash changes, so we scan again. BUT this brings an interesting unintended side effect. On the ORT side, we should raise an error if an already cached version suddenly changes, and this info will be EXTREMELY valuable on the security front.
Imagine that we do continuous scanning at several companies today. That means that, even though it is not intended as a security/vulnerability tool, ORT would be the first to detect that something is wrong with a dependency in a specific registry, and unless it is a hiccup, that is an immediate, clear sign that someone is poisoning the origin.
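
To make the caching idea concrete, here is a minimal sketch under stated assumptions, not an existing ORT feature. The names `AnalyzerCache`, `HashMismatchError`, and `slow_resolve` are hypothetical, the cache lives in memory only, and the artifact hash is assumed to be supplied by the caller (for example from the registry or the downloaded artifact).

```python
class HashMismatchError(Exception):
    """Raised when a cached package version shows up with a different hash."""


class AnalyzerCache:
    """Caches resolved transitive dependencies per direct dependency."""

    def __init__(self):
        # Maps a package id such as "A@1.0.0" to (artifact hash, resolved deps).
        self._entries = {}

    def resolve(self, package_id, artifact_hash, slow_resolve):
        cached = self._entries.get(package_id)

        if cached is None:
            # First run: do the full, slow resolution once and remember it.
            deps = slow_resolve(package_id)
            self._entries[package_id] = (artifact_hash, deps)
            return deps

        cached_hash, deps = cached
        if cached_hash != artifact_hash:
            # Same version, different content: surface this as an error
            # instead of silently re-resolving.
            raise HashMismatchError(
                f"{package_id}: cached hash {cached_hash}, got {artifact_hash}"
            )

        # Later runs: the hash matches, so the cached result is reused.
        return deps
```

Raising instead of refreshing the entry is the design choice that turns the cache into the early-warning signal described above; a real implementation would also need persistence and a way to acknowledge legitimate republished artifacts.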

This approach would then raise the following question:
"But if you are only looking at direct dependencies for the cache, won't the transitive dependencies suffer from the same issue, with a hash change deep down in the tree going unnoticed?"

Yes and no. After the first complete scan, we will have the entire tree with correct hashes, and even if something is compromised in the future, the cached version of the transitive deps will always be the original.

We do run the risk that the first big scan catches the compromised version, though. So in this case the question is valid, but again, given the diversity of companies using the same dependencies in their projects, chances are that someone will catch this error, and then it is the job of the security teams to deal with it. We are happily able to provide them with the version, hash, and details first-hand.

Honestly, a proper caching mechanism is more beneficial than downgrading our good results, which carries the risk of simply trivializing the entire project as "just another one".

sschuberth · 2024-11-11T17:02:37Z

A different idea to approach this would be to simply leverage the various existing tools that produce the GitHub dependency graph JSON format and create an ORT command to convert their output to analyzer results. As ORT analyzer results probably contain more mandatory metadata than GitHub's format provides, additional sources of metadata, like package registries, might need to be queried as part of the conversion.
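
A rough sketch of the first half of such a conversion, under assumptions: the input is taken to have the snapshot shape used by GitHub's dependency submission API (`manifests` → `resolved` → entries with `package_url` and `relationship`), and the output is a hypothetical flat intermediate list, not ORT's actual analyzer result model. The function name `purls_from_snapshot` is made up for illustration.

```python
import json


def purls_from_snapshot(snapshot_text):
    """Collect (package URL, relationship) pairs from a dependency snapshot.

    Assumes the GitHub dependency submission snapshot layout. A package is
    reported as "direct" if any manifest lists it as a direct dependency.
    """
    snapshot = json.loads(snapshot_text)
    relationships = {}

    for manifest in snapshot.get("manifests", {}).values():
        for dep in manifest.get("resolved", {}).values():
            purl = dep.get("package_url")
            if purl is None:
                continue  # Nothing to identify the package by.

            relationship = dep.get("relationship", "unknown")
            if relationships.get(purl) != "direct":
                relationships[purl] = relationship

    return sorted(relationships.items())
```

Everything ORT needs beyond the package URL (declared licenses, source artifact locations, VCS information, and so on) is absent here and would have to come from the additional metadata sources mentioned above.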

sschuberth added the analyzer and new feature labels Feb 29, 2024

sschuberth mentioned this issue Mar 4, 2024

feat(scanner): add '--packages-depth' parameter. #8372

Open

sschuberth mentioned this issue Apr 10, 2024

Allow usage of GOPROXY variable for go module downloads #8504

Closed

sschuberth mentioned this issue Jul 1, 2024

RFC: analysis result caching #5186

Closed
