Issue/72 new api allow for full fledged processing of protection profiles #466

Draft
wants to merge 7 commits into
base: main

Conversation

adamjanovsky
Collaborator

Closes #72

@adamjanovsky adamjanovsky self-assigned this Jan 23, 2025
@adamjanovsky
Collaborator Author

@J08nY the first batch of commits that refactored the Dataset classes is in. Unfortunately, I had to merge fresh main into this, so a lot of changes are underway; better to look just at the 2165a66 commit to assess the design.

Some of my early design notes:

  • Dataset class will have an aux_handlers attribute that accepts a list of instances that implement the AuxiliaryDatasetHandler interface (in the form of an ABC base class; these days I’d opt for Protocol, but to be coherent with the old implementation, let’s stick with inheritance).
  • The AuxiliaryDatasetHandler interface defines process_dataset().
  • ProtectionProfile dataset can thus inherit from Dataset class and implement no handlers.
  • Each auxiliary dataset will come with its own handler. This enables code re-use between FIPSDataset and CCDataset classes. Any subclass of Dataset class will simply populate its handlers with the required logic.
  • Computation of individual heuristics is outsourced into functions (not part of any class).
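
A minimal sketch of the shape these notes describe, under stated assumptions. Only Dataset, aux_handlers, AuxiliaryDatasetHandler, process_dataset and process_auxiliary_datasets come from this PR; ToyHandler, the processed flag and the constructor signature are hypothetical and do not reproduce the code in 2165a66. The list of handlers is indexed by type here so that a handler can be looked up by its class.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class AuxiliaryDatasetHandler(ABC):
    """Interface that every auxiliary dataset handler implements (an ABC, as in the notes)."""

    @abstractmethod
    def process_dataset(self) -> None:
        """Build or load the auxiliary dataset this handler owns."""


class ToyHandler(AuxiliaryDatasetHandler):
    """Hypothetical handler; it only records that it ran."""

    def __init__(self) -> None:
        self.processed = False

    def process_dataset(self) -> None:
        self.processed = True


class Dataset:
    """Hypothetical stand-in for the real Dataset base class."""

    def __init__(self, aux_handlers: list[AuxiliaryDatasetHandler] | None = None) -> None:
        # Accept a list of instances, but index them by type so that a
        # handler can be retrieved as dset.aux_handlers[HandlerClass].
        self.aux_handlers = {type(h): h for h in (aux_handlers or [])}

    def process_auxiliary_datasets(self) -> None:
        for handler in self.aux_handlers.values():
            handler.process_dataset()


class ProtectionProfileDataset(Dataset):
    """Inherits from Dataset and registers no handlers."""


cc_like = Dataset(aux_handlers=[ToyHandler()])
cc_like.process_auxiliary_datasets()

pp_like = ProtectionProfileDataset()
pp_like.process_auxiliary_datasets()  # no handlers registered, so nothing runs
```

Subclasses differ only in which handlers they register, which is where the re-use between the CC and FIPS datasets would come from.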

Member

@J08nY J08nY left a comment

Looks OK. But still has conflicts with main. Is the merge commit a real merge commit?

src/sec_certs/dataset/cc.py (resolved)
src/sec_certs/dataset/cc.py (resolved)
src/sec_certs/dataset/cc.py (outdated, resolved)
src/sec_certs/dataset/cc.py (outdated, resolved)
@adamjanovsky
Collaborator Author

> Looks OK. But still has conflicts with main. Is the merge commit a real merge commit?

Meh, something was left out, should be fixed by now.


codecov bot commented Jan 23, 2025

Codecov Report

Attention: Patch coverage is 60.31281% with 406 lines in your changes missing coverage. Please review.

Project coverage is 66.94%. Comparing base (7407773) to head (29ab25f).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/sec_certs/dataset/protection_profile.py | 41.49% | 134 Missing ⚠️ |
| ...rc/sec_certs/dataset/auxiliary_dataset_handling.py | 55.56% | 84 Missing ⚠️ |
| src/sec_certs/sample/protection_profile.py | 58.64% | 79 Missing ⚠️ |
| src/sec_certs/utils/label_studio_utils.py | 0.00% | 56 Missing ⚠️ |
| src/sec_certs/dataset/dataset.py | 48.15% | 14 Missing ⚠️ |
| src/sec_certs/dataset/cc.py | 72.10% | 12 Missing ⚠️ |
| src/sec_certs/heuristics/common.py | 85.34% | 11 Missing ⚠️ |
| src/sec_certs/heuristics/cc.py | 90.81% | 8 Missing ⚠️ |
| src/sec_certs/dataset/fips.py | 76.48% | 4 Missing ⚠️ |
| src/sec_certs/sample/document_state.py | 92.50% | 3 Missing ⚠️ |
| ... and 1 more | | |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #466      +/-   ##
==========================================
- Coverage   68.55%   66.94%   -1.60%     
==========================================
  Files          62       68       +6     
  Lines        7934     8333     +399     
==========================================
+ Hits         5438     5578     +140     
- Misses       2496     2755     +259     


@adamjanovsky
Collaborator Author

adamjanovsky commented Jan 25, 2025

Hey, the initial draft of the functionality is implemented. Some notes below.

Sample usage

Create and fully process PP dataset

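# Assumed import path, based on src/sec_certs/dataset/protection_profile.py
from sec_certs.dataset.protection_profile import ProtectionProfileDataset

pp_dset = ProtectionProfileDataset(root_dir="/path/to/pp/directory")
pp_dset.get_certs_from_web()
pp_dset.process_auxiliary_datasets()
pp_dset.download_all_artifacts()
pp_dset.convert_all_pdfs()
pp_dset.analyze_certificates()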

Access to PP Dataset from CC Dataset

  • A path to an already-processed PP dataset can be provided in
cc_dset.process_auxiliary_datasets(processed_pp_dataset_root_dir="/path/to/pp/directory")

In that case, the PPDataset instance is not fully processed but just copied into cc_dset.auxiliary_datasets_dir. The instance can be accessed via cc_dset.aux_handlers[ProtectionProfileDatasetHandler].dset once cc_dset.process_auxiliary_datasets() completes.

Alternatively, a CCDataset instance is capable of invoking full processing of the PPDataset (as it already does for maintenance updates) when process_auxiliary_datasets() is called without any argument.

When CCDataset processing is complete, the ProtectionProfile instances linked to specific CCCertificates are listed as digests in cert.heuristics.protection_profiles.
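
A rough sketch of the two paths described above: copy an already-processed dataset, or process from scratch. It assumes nothing about the real handler; obtain_pp_dataset, the protection_profiles directory name and the dataset.json file name are hypothetical and only illustrate the control flow.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path


def obtain_pp_dataset(aux_dir: Path, processed_pp_dataset_root_dir: Path | None = None) -> str:
    """Hypothetical helper: copy a finished PP dataset, or process one from scratch."""
    target = aux_dir / "protection_profiles"  # hypothetical directory name
    if processed_pp_dataset_root_dir is not None:
        # Reuse path: no processing, just copy the finished dataset over.
        shutil.copytree(processed_pp_dataset_root_dir, target, dirs_exist_ok=True)
        return "copied"
    # Full path: stand-in for running the whole PP pipeline here.
    target.mkdir(parents=True, exist_ok=True)
    return "processed"


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    processed = tmp_path / "pp"
    processed.mkdir()
    (processed / "dataset.json").write_text("{}")  # hypothetical file name

    mode_with_path = obtain_pp_dataset(tmp_path / "aux_a", processed)
    copied_file_exists = (tmp_path / "aux_a" / "protection_profiles" / "dataset.json").exists()
    mode_without_path = obtain_pp_dataset(tmp_path / "aux_b")
```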

Notes on PP processing

  • Primary key of a ProtectionProfile instance (also the input to its dgst property implementation) is a three-tuple: (category, name, version)
  • Linking from CC to PP is done purely on identity of the PP link in both CCCertificate and ProtectionProfile objects.
  • Some collaborative PPs have multiple certification reports. I parsed only a single link.
  • Some identical PPs are certified under multiple schemes. This is ignored for now; only a single scheme is taken into account.
  • CSV files are not parsed. The only data entry missing as a result is the expected archival date of active PPs.
  • PP ID not computed.
  • Collaborative PPs with SDs pending review for compliance with the CC/CEM are not processed.
  • Maintenance updates of PP are left unprocessed
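
To make the first two notes concrete, here is a small sketch, not the implementation in this PR. The three-tuple key and linking by identical PP link come from the notes; the hashing scheme (SHA-256 truncated to 16 hex characters), the pp_link field name, the link_by_pp_url helper and all sample values are assumptions.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtectionProfile:
    category: str
    name: str
    version: str
    pp_link: str  # hypothetical field name for the PP document URL

    @property
    def dgst(self) -> str:
        # Assumption: SHA-256 over the joined primary key, truncated to 16 hex chars.
        key = "|".join((self.category, self.name, self.version))
        return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def link_by_pp_url(cert_pp_links: list[str], profiles: list[ProtectionProfile]) -> set[str]:
    """Return digests of profiles whose link is identical to one of the certificate's PP links."""
    by_link = {pp.pp_link: pp.dgst for pp in profiles}
    return {by_link[link] for link in cert_pp_links if link in by_link}


# Sample values are made up for illustration.
pp_a = ProtectionProfile("Example Category", "Example PP", "1.0", "https://example.org/pp_a.pdf")
pp_b = ProtectionProfile("Example Category", "Example PP", "2.0", "https://example.org/pp_b.pdf")
linked = link_by_pp_url(
    ["https://example.org/pp_b.pdf", "https://example.org/unknown.pdf"],
    [pp_a, pp_b],
)
```

Because the match is on exact link identity, a certificate whose PP link differs even slightly from the one stored on the profile would not be linked, which is what the regression check in the next steps is meant to catch.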

Next steps

  • @adamjanovsky: Write tests for PP Processing (soonish)
  • @J08nY: Test the current design with web, propose changes or start integrating.
  • @adamjanovsky and @J08nY allow for downloading a processed PP snapshot from sec-certs.org
  • @adamjanovsky Regression tests: check how many PPs are linked. Check if we miss any previously linked certs.

Comment on lines +57 to +60
pp_latest_full_archive: AnyHttpUrl = Field(
    "https://sec-certs.org/cc/pp.tar.gz",
    description="URL from where to fetch the latest full archive of fully processed PP dataset.",
)
Member

The pp_latest_snapshot config also needs to change. It will no longer live in the /static/ subdir, but will have the same layout as the CC and FIPS datasets. Could you make the change, please?

Collaborator Author

Sure, I wanted to discuss this first before changing this.

"Trusted Computing",
]

CC_PORTAL_BASE_URL = "https://www.commoncriteriaportal.org"
Member

I remember seeing a bunch of URLs to CC portal stuff duplicated (I guess in the CC sample and dataset class). Is it now unified under subpaths of this base URL? Now is a good time to unify.

Collaborator Author

Yes, I already did some unification, will double check!
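
For illustration, one way such unification could look. Only CC_PORTAL_BASE_URL and its value come from the diff; the portal_url helper and the placeholder subpath are hypothetical and are not real portal paths.

```python
from urllib.parse import urljoin

CC_PORTAL_BASE_URL = "https://www.commoncriteriaportal.org"


def portal_url(subpath: str) -> str:
    """Hypothetical helper: build a portal URL from a subpath of the base URL."""
    return urljoin(CC_PORTAL_BASE_URL + "/", subpath.lstrip("/"))


# Placeholder subpath, not an actual resource on the portal.
example = portal_url("/placeholder/file.csv")
```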

@@ -25,17 +24,25 @@


class AuxiliaryDatasetHandler(ABC):
    def __init__(self, root_dir: str | Path) -> None:
        self.root_dir = Path(root_dir)
    RELATIVE_DIR: ClassVar[str | None] = None
Member

Why the all caps?

Collaborator Author

It's a constant

)

if skip_schemes:
    del self.aux_handlers[CCSchemeDatasetHandler]
Member

@J08nY J08nY Jan 25, 2025

Is it a good idea for it to delete the handler? It modifies the dataset state and I think that is quite unexpected.

Edit: Maybe modify the only_schemes attr accordingly instead of deleting the handler?

Collaborator Author

Thanks for the tip. I needed something to commit yesterday, and I want to revisit this. It's a bad idea and I'm thinking about a better design atm.
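
One possible non-mutating alternative, sketched under assumptions: skip the handler for a single call instead of deleting it from the dataset. This is neither the only_schemes approach suggested above nor what the PR ends up doing. Only aux_handlers, skip_schemes, CCSchemeDatasetHandler and process_dataset appear in the diff; ToyDataset, OtherHandler and the ran flag are hypothetical.

```python
class CCSchemeDatasetHandler:
    """Hypothetical stand-in; only the class name comes from the diff."""

    def __init__(self) -> None:
        self.ran = False

    def process_dataset(self) -> None:
        self.ran = True


class OtherHandler(CCSchemeDatasetHandler):
    """Hypothetical second handler, so there is something left to run."""


class ToyDataset:
    def __init__(self) -> None:
        self.aux_handlers = {
            CCSchemeDatasetHandler: CCSchemeDatasetHandler(),
            OtherHandler: OtherHandler(),
        }

    def process_auxiliary_datasets(self, skip_schemes: bool = False) -> None:
        # Decide what to skip for this call only; self.aux_handlers is left untouched.
        skipped = {CCSchemeDatasetHandler} if skip_schemes else set()
        for handler_type, handler in self.aux_handlers.items():
            if handler_type in skipped:
                continue
            handler.process_dataset()


dset = ToyDataset()
dset.process_auxiliary_datasets(skip_schemes=True)
```

The handler stays registered, so a later call without skip_schemes would still process the schemes.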

@@ -479,7 +493,7 @@ def _get_primary_key_str(row: Tag):
 x.st_link,
 None,
 None,
-profiles.get(x.dgst, None),
+None,
Member

Why the change? Does this mean we will have no links to PPs in the df?

Collaborator Author

Hm, I will enrich the DF with one extra attribute for that!

Comment on lines +226 to +227
"protection_profiles": null,
"eal": null
Member

Why are these entries null, while the "protection_profile_links" entry has a single PP element? Is it that the toy dataset was not processed fully?

Collaborator Author

Yes. It is in the state after get_certs_from_web(), IMO. These are heuristics attributes, filled in only in the analyze_dset() step.

@@ -2,6 +2,7 @@ accessible-pygments==0.0.4
# via pydata-sphinx-theme
aiohappyeyeballs==2.4.0
# via aiohttp
aiohttp==3.10.11
Member

This merge commit messed this up a bit. Same for the vulnerabilities notebook, but that is now fixed.

Collaborator Author

Yep, sorry for that. I'll bump the reqs once more before merging, also taking care of some open PRs.

@J08nY
Member

J08nY commented Jan 25, 2025

Could you please add some test(s) that run the PP pipeline at least once? I.e. improve the coverage in the dataset/protection_profile.py and sample/protection_profile.py files here.

@J08nY J08nY added the enhancement (New feature or request) and cc (Related to CC certification) labels Jan 25, 2025
@adamjanovsky
Collaborator Author

> Could you please add some test(s) that run the PP pipeline at least once? I.e. improve the coverage in the dataset/protection_profile.py and sample/protection_profile.py files here.

Sure 🙃, see

> @adamjanovsky: Write tests for PP Processing (soonish)

Labels
cc (Related to CC certification), enhancement (New feature or request)
Development

Successfully merging this pull request may close these issues.

New API: Allow for full-fledged processing of protection profiles
2 participants