Cavil has three primary components:
- Web Application providing the UI and REST API
- Job queue for processing background jobs
- AI text classification server
The web application and the job queue communicate via a PostgreSQL database. All other components communicate via HTTP.
          +---------+     +---------------------+     +----------------+
          |         |     |                     |     |                |
 User --> |         |     |                     |     |                |
          |  Nginx  | --> |   Web Application   | --> |   PostgreSQL   |
 Bots --> |         |     |                     |     |                |
          |         |     |                     |     |                |
          +---------+     +---------------------+     |                |
                                                      |                |
                          +---------------------+     |                |
                          |                     |     |                |
                          |                     | --> |                |
                          |                     |     |                |
                          |                     |     +----------------+
 OBS <------------------- |      Job Queue      |
                          |                     |     +----------------+
                          |                     |     |                |
                          |                     | --> |       AI       |
                          |                     |     |                |
                          +---------------------+     +----------------+
In addition to PostgreSQL, a significant amount of data is stored on the file system. The location is configurable, but the structure is the same everywhere. These files are the actual packages Cavil has been tasked with creating legal reviews for, along with the metadata created for those packages.
checkouts                                     # Configurable root directory
|- gcc                                        # Root for all "gcc" packages
|  |- 047f96de986898d51f855f99d475b9b6        # Checkout of "gcc" version with that checksum
|  |- 29c482ed6d887d9aff78dff785a940b5        # Another checkout of "gcc" with a different checksum
|  +- ...
|
|- perl-Mojolicious
|  |- 0297a570e088c24551da36a4d31d785e
|  +- 11f5ece47a018082eaaedf9f9d038148
|     |- Mojolicious-9.31.tar.gz              # Compressed archive in checkout
|     |- perl-Mojolicious.changes
|     |- perl-Mojolicious.spec                # Specfile in checkout
|     |- .postprocessed.json                  # Cavil metadata
|     |- .report.spdx                         # Cached SPDX report generated by Cavil for this checkout
|     |- .unpacked.json                       # Cavil metadata
|     +- .unpacked                            # Recursively unpacked archives from this checkout
|        |- Mojolicious-9.31                  # "Mojolicious-9.31.tar.gz" unpacked
|        |  |- Changes
|        |  |- LICENSE
|        |  |- Makefile.PL
|        |  |- Makefile.processed.PL
|        |  +- ...
|        |- perl-Mojolicious.changes
|        |- perl-Mojolicious.spec             # Copy of file that did not need to be unpacked
|        +- .unpacked.json                    # Cavil metadata
|
+- ...
Both the web application and the job queue interact with these files and can make changes to them.
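The layout above can be navigated with a few path helpers. This is only a minimal Python sketch of the directory convention (root, then package name, then checkout checksum); the function names `checkout_dir`, `unpacked_dir` and `list_checkouts` are made up for illustration and are not part of Cavil.

```python
from pathlib import Path


def checkout_dir(root: Path, package: str, checksum: str) -> Path:
    """Return the directory of one checkout: <root>/<package>/<checksum>."""
    return root / package / checksum


def unpacked_dir(checkout: Path) -> Path:
    """Return the ".unpacked" directory that holds the recursively unpacked archives."""
    return checkout / ".unpacked"


def list_checkouts(root: Path) -> dict[str, list[str]]:
    """Map every package name below the root to the sorted checksums of its checkouts."""
    checkouts = {}
    for package in sorted(p for p in root.iterdir() if p.is_dir()):
        checkouts[package.name] = sorted(c.name for c in package.iterdir() if c.is_dir())
    return checkouts
```

Because both the web application and the job queue can modify these files, a helper like this should only be used for reading unless access is coordinated.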
The use of machine learning models for text classification is entirely optional, but strongly recommended, because the pattern matching system used for identifying clusters of legal keywords (snippets) has a false-positive rate of about 80%. Even a simple model can identify almost all of these false positives.
There are currently two example implementations for a companion text classification server application:
Cavil is designed for a human-driven review process of RPM package sources. Package sources are imported (usually from OBS), recursively unpacked, and a legal report is created for them. This report is then reviewed by a human expert or lawyer and accepted or rejected. Under certain conditions the report may also be automatically accepted or rejected by Cavil.
Legal reports can be in one of five states:
- new: Initial state, report ready for review.
- acceptable: Reviewed and accepted by a human expert or automated system, but not a lawyer.
- acceptable_by_lawyer: Same as acceptable, but the review was performed by a lawyer.
- unacceptable: Review by a lawyer that resulted in rejecting the report.
- obsolete: Report no longer exists.
The entire human-driven workflow happens in the web UI. These are the most important menu items:
- Open Reviews: Lists all reports that are currently being prepared or are ready for review. A report can be reviewed once the link with the unique report checksum becomes visible. From the report, new license patterns can be created for newly identified snippets of potential legal text.
- Recently Reviewed: Lists all review results from the past 3 months.
- Snippets: Lists recently identified snippets of potential legal text and the associated AI text classification results, if available. These results can be validated here to create new training data for future AI models.
- Products: Lists all product codestreams and their associated packages that have been synchronized from OBS into Cavil.
- Licenses: Lists known licenses and the associated license patterns. License patterns without a name are considered keyword patterns and are used to identify new snippets of potential legal text.
Which features are available to logged-in users depends on the roles that have been assigned to them. These roles are currently available:
- user: Minimal access level, can view reports and review results. Not allowed to browse checkouts or make changes.
- classifier: Human expert who can validate AI text classification results to create new training data.
- contributor: Human expert who can propose new license patterns to be added for reports. These proposals need to be reviewed and accepted by an admin before they become active.
- manager: Human expert who can change reports from the state new to acceptable.
- lawyer: Lawyers can change reports from the state new to acceptable_by_lawyer.
- admin: Full access to all features.
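The relationship between roles and report states can be sketched as a small permission check. This Python example is an illustration only: it models just the transitions spelled out in the role descriptions above (rejecting a report is not covered), and the names `State`, `ROLE_TRANSITIONS` and `may_change_state` are hypothetical, not Cavil's actual implementation.

```python
from enum import Enum


class State(str, Enum):
    """The five report states."""
    NEW = "new"
    ACCEPTABLE = "acceptable"
    ACCEPTABLE_BY_LAWYER = "acceptable_by_lawyer"
    UNACCEPTABLE = "unacceptable"
    OBSOLETE = "obsolete"


# Transitions taken only from the role descriptions; "admin" is handled
# separately because it has full access to all features.
ROLE_TRANSITIONS = {
    "manager": {(State.NEW, State.ACCEPTABLE)},
    "lawyer": {(State.NEW, State.ACCEPTABLE_BY_LAWYER)},
}


def may_change_state(roles: set[str], current: State, target: State) -> bool:
    """Return True if any of the given roles allows the state change."""
    if "admin" in roles:
        return True
    return any((current, target) in ROLE_TRANSITIONS.get(role, set()) for role in roles)
```

For example, `may_change_state({"manager"}, State.NEW, State.ACCEPTABLE)` is allowed, while the same call with only the `user` role is not.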
Reports may be automatically accepted by the system under these conditions:
- Previous Result: A previous report with the same checksum (based on licenses and keyword matches) exists for the same package. In this case the previous result will be inherited.
- No Differences: A previous report exists where the checksum does not match, but there are no significant differences between licenses and unique keyword matches.
- Low Risk: The maximum risk of any given license in the report is not higher than 3, and there are no unresolved keyword matches with a risk higher than 3. The resulting state can only be acceptable.
- Package Name: The package has been configured to always be acceptable. For SUSE instances of OBS this is usually done for empty metadata packages like 000product.
All of these conditions, however, require that a prior review by a human expert or lawyer is already present in the system.
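As an illustration, the "Low Risk" condition can be written as a short check. This is a hedged Python sketch, not Cavil's code: the `KeywordMatch` type, its `resolved` flag, and the function names are assumptions made for this example, and only the threshold of 3 and the resulting state come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

MAX_LOW_RISK = 3  # threshold named in the "Low Risk" condition


@dataclass
class KeywordMatch:
    risk: int
    resolved: bool  # hypothetical flag: True once the match has been resolved


def is_low_risk(license_risks: list[int], keyword_matches: list[KeywordMatch]) -> bool:
    """No license above the threshold and no unresolved keyword match above it."""
    if any(risk > MAX_LOW_RISK for risk in license_risks):
        return False
    return not any(m.risk > MAX_LOW_RISK and not m.resolved for m in keyword_matches)


def auto_accept_low_risk(
    license_risks: list[int],
    keyword_matches: list[KeywordMatch],
    has_prior_review: bool,
) -> Optional[str]:
    """Return "acceptable" if the report qualifies, otherwise None."""
    # A prior human review must already exist for any automatic acceptance.
    if has_prior_review and is_low_risk(license_risks, keyword_matches):
        return "acceptable"
    return None
```

Note that the sketch never returns acceptable_by_lawyer, matching the rule that this condition can only result in the state acceptable.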
Report creation is triggered via the REST API, usually by an OBS bot like legal-auto. This results in various background jobs being created that are then performed by the job queue. Jobs can often be processed in parallel to make the best use of all available resources.
These jobs are involved in report creation and usually run in the listed order:
- obs_import: Checks out the package sources from OBS.
- unpack: Recursively unpacks all archives contained in the package sources.
- index: Creates file lists and splits them up into batches for parallel processing.
- index_batch: Performs two-phase pattern matching on all files in the batch with license and keyword patterns. There can be thousands of index_batch jobs at the same time.
- indexed: Synchronizes all pattern matching results.
- analyze: Combines pattern matching results to create the license report.
- analyzed: Checks the report for reasons to automatically accept it.
- spdx_report: Creates the report in SPDX format.
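The batching step performed by the index job can be pictured as slicing a file list into fixed-size chunks, each of which would become one index_batch job. The following Python sketch only illustrates that idea; the function name `split_into_batches` and the batch size are assumptions, and how Cavil actually sizes its batches is not described here.

```python
from typing import Iterator


def split_into_batches(files: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of the file list with at most batch_size entries each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(files), batch_size):
        yield files[start:start + batch_size]
```

Each batch is independent of the others, which is what allows many index_batch jobs to run in parallel before the indexed job synchronizes their results.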
If AI text classification has been configured, there will also be another background job running at irregular intervals. This one is not specific to a single package checkout.
- classify: Sends all unclassified snippets of potential legal text to the text classification server and, if necessary, updates reports.
Text classification via machine learning model is implemented as an optional HTTP service that needs to be configured before it can be used. It usually runs on port 5000.
classifier => 'http://localhost:5000'
The classify background job needs to be created with the classify command, which can be triggered via cron job or systemd timer at regular intervals. Every 5 minutes has worked well for us in the past.
$ ./script/cavil classify
The only information submitted is the raw snippet of potential legal text. A JSON document is returned, containing a boolean value indicating whether the model believes this to be legal text and a percentage value indicating its confidence in that decision.
$ curl -X POST --data '# SPDX-License-Identifier: GPL-2.0-only' http://127.0.0.1:5000
{"license": true, "confidence": 87.8}
The confidence value is presented to users in the UI and is meant to help select the best possible candidates for new training data for future versions of the model.
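The same exchange can be reproduced from a script. The following Python sketch mirrors the curl call above using only the standard library; it assumes the response always contains the `license` and `confidence` fields shown in the example, and the function names are hypothetical.

```python
import json
import urllib.request


def build_request(url: str, snippet: str) -> urllib.request.Request:
    """Build a POST request with the raw snippet as the request body."""
    return urllib.request.Request(url, data=snippet.encode("utf-8"), method="POST")


def parse_result(body: bytes) -> tuple[bool, float]:
    """Extract (is_legal_text, confidence_percent) from the JSON response."""
    result = json.loads(body)
    return bool(result["license"]), float(result["confidence"])


def classify_snippet(url: str, snippet: str, timeout: float = 10.0) -> tuple[bool, float]:
    """Send one snippet to the classification server and return its verdict."""
    with urllib.request.urlopen(build_request(url, snippet), timeout=timeout) as response:
        return parse_result(response.read())
```

With a classification server listening locally, `classify_snippet("http://127.0.0.1:5000", "# SPDX-License-Identifier: GPL-2.0-only")` would return the boolean and the confidence as a tuple.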