Implement alternative analyzers that favor speed over accuracy #8361
First of all, I think this is a dangerous approach. One of the main reasons ORT is being adopted more and more is the accuracy and level of detailed information we can provide. Taking the "easy path" of good enough would simply downplay the entire project to the same level of other tools, including commercial ones, and hand over enough ammunition for the usual arguments against an OSS project.

That said, there is actually a compromise: caching the analyzer results, similar to what is already done on the scan front. Say package A=1.0.0 depends on 20 levels of transitive dependencies. On the first analyzer run we go like we go today, through all 20 levels. Slow, but only once. The two situations where the cache would not work are:

1. The developer decided not to pin a fixed version, so a newer one is always picked up if it exists. That is something we have no control over anyway.
2. A supply chain attack where a direct dependency suddenly has new code injected under the same version. The hash changes, so we analyze again.

This brings an interesting, unintended side effect: on the ORT side we should raise an error if an already cached version suddenly changes, and that information would be extremely valuable for the security front.

This approach does raise a follow-up question, though, and the answer is yes and no. On the first complete scan we will have the entire tree with correct hashes, and even if something is compromised in the future, the cached version of the transitive dependencies will always be the original. We do run the risk of the first big scan catching an already compromised version, so in that case the question is valid. But given the diversity of companies using the same dependencies in their projects, the chances are that someone will catch this error, and then it is the job of security teams to deal with it. We are happily able to provide the version, hash, and details to them first hand.

Honestly, a proper caching mechanism is more beneficial than downplaying our good results, which risks trivializing the entire project as "just another one".
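As an illustration of that caching idea, here is a minimal Kotlin sketch. It is not existing ORT code; the `AnalyzerCache` and `CachedDependencyInfo` names are made up for this example. It caches resolved dependencies by identifier and raises an error when the artifact hash changes for an already cached version:

```kotlin
// Minimal sketch of the proposed analyzer cache, using hypothetical types
// (CachedDependencyInfo, AnalyzerCache) that do not exist in ORT today.

import java.util.concurrent.ConcurrentHashMap

// A cache entry for a fully resolved (transitive) dependency.
data class CachedDependencyInfo(
    val id: String,          // e.g. "Maven:com.example:lib:1.0.0"
    val artifactHash: String // hash of the artifact as seen at first analysis
)

class HashMismatchException(message: String) : Exception(message)

class AnalyzerCache {
    private val entries = ConcurrentHashMap<String, CachedDependencyInfo>()

    /**
     * Returns the cached info for [id] if the observed hash still matches.
     * A mismatch for the same version is treated as an error, as it may
     * indicate a supply chain attack (same version, different code).
     */
    fun getOrResolve(
        id: String,
        observedHash: String,
        resolve: () -> CachedDependencyInfo
    ): CachedDependencyInfo {
        val cached = entries[id]

        if (cached != null) {
            if (cached.artifactHash != observedHash) {
                throw HashMismatchException(
                    "Artifact hash for $id changed from ${cached.artifactHash} to $observedHash."
                )
            }

            return cached
        }

        // Cache miss: do the full (slow) resolution once and remember the result.
        return resolve().also { entries[id] = it }
    }
}
```

The slow, full resolution only happens on a cache miss; a hash mismatch surfaces the potential supply chain issue mentioned above instead of being silently re-resolved.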
A different idea to approach this would be to simply leverage the various existing tools that produce the GitHub dependency graph JSON format and create an ORT command to convert these to analyzer results. As ORT analyzer results probably contain more mandatory metadata than GitHub's format provides, additional sources of metadata, like package registries, might need to be queried as part of the conversion.
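A rough sketch of what such a conversion could look like, assuming the JSON shape of GitHub's dependency submission snapshots (manifests with resolved entries carrying a `package_url`); the field names and the `FlatPackage` type are assumptions for illustration, not an existing ORT command:

```kotlin
// Sketch of converting a GitHub dependency graph snapshot into a flat package
// list. The field names ("manifests", "resolved", "package_url") follow the
// dependency submission snapshot format; treat them as an assumption here.

import kotlinx.serialization.json.Json
import kotlinx.serialization.json.jsonObject
import kotlinx.serialization.json.jsonPrimitive

// A minimal stand-in for whatever ORT analyzer result structure would be built.
data class FlatPackage(val purl: String)

fun convertDependencySnapshot(jsonText: String): Set<FlatPackage> {
    val root = Json.parseToJsonElement(jsonText).jsonObject
    val manifests = root["manifests"]?.jsonObject ?: return emptySet()

    return manifests.values
        .mapNotNull { it.jsonObject["resolved"]?.jsonObject }
        .flatMap { resolved ->
            resolved.values.mapNotNull { dep ->
                dep.jsonObject["package_url"]?.jsonPrimitive?.content
            }
        }
        .map { FlatPackage(it) }
        .toSet()
}
```

Since such a snapshot essentially only carries package URLs, the declared licenses, source artifacts, and other metadata that ORT analyzer results require would indeed have to be queried from the respective package registries afterwards.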
Historically, the ORT analyzer has been pedantic about getting things right (i.e. resolving exactly the same dependencies as the build system does), and gathering all metadata known about a package (even if irrelevant in pretty much every compliance check).
To do so, the ORT analyzer requires the build system configuration of the project under analysis to be self-contained, in good shape, and to follow best practices with respect to the build system's own recommendations (e.g. for Gradle, to not evaluate environment variables at task configuration time).
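For the Gradle example mentioned above, a minimal sketch of that best practice in a `build.gradle.kts` (the `RELEASE_CHANNEL` variable and `printChannel` task are made-up names for illustration):

```kotlin
// build.gradle.kts — read environment variables lazily via the Provider API
// instead of eagerly via System.getenv() at configuration time.

// Discouraged: evaluated whenever the build is configured, which makes the
// configuration environment-dependent and hurts configuration caching.
val eagerChannel: String? = System.getenv("RELEASE_CHANNEL")

// Preferred: a lazy provider that is only queried when a task actually runs.
val lazyChannel = providers.environmentVariable("RELEASE_CHANNEL").orElse("stable")

tasks.register("printChannel") {
    doLast {
        println("Release channel: ${lazyChannel.get()}")
    }
}
```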
While it's somewhat nice that the ORT analyzer implicitly checks for adhering to those best practices, which can be seen as a build health analysis feature, there are cases where it hinders getting the required information, e.g. when analyzing legacy code bases and projects that are in maintenance mode.
Also, for many compliance checks the hierarchy in a dependency graph does not really matter; a flat list of packages would be enough, as only the combination of distributed packages and their licenses is relevant.
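To illustrate, here is a small standalone Kotlin sketch (not ORT's actual data model; `DependencyNode` is a made-up type) that collapses a dependency tree into a flat map of packages to declared licenses, which is all such a check needs:

```kotlin
// Illustration of why hierarchy can often be dropped: for a "which licenses do
// we distribute?" check, collapsing the tree into a flat map of packages to
// licenses loses nothing that the check needs.

data class DependencyNode(
    val id: String,
    val declaredLicense: String,
    val dependencies: List<DependencyNode> = emptyList()
)

// Collect every package and its license, ignoring depth and parent/child edges.
fun flattenLicenses(roots: List<DependencyNode>): Map<String, String> {
    val result = mutableMapOf<String, String>()

    fun visit(node: DependencyNode) {
        // Visit shared (diamond) dependencies only once.
        if (node.id in result) return
        result[node.id] = node.declaredLicense
        node.dependencies.forEach { visit(it) }
    }

    roots.forEach { visit(it) }
    return result
}
```

The parent/child structure is discarded entirely, and shared dependencies are reported only once.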
So the proposal is to implement a more lenient "quick & dirty" analyzer that is much faster than the current analyzer, which is eager to get things fully correct at all costs. Ideas for making things faster include:
References: #4112 (comment), #5175 (comment), #8278