Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support for multiple file checksums, file types #200

Closed
wants to merge 6 commits into from

Conversation

jotterson
Copy link
Contributor

This change allows multiple checksums and file types on Files. This improves compliance with the SPDX 2.1 and newer specifications.

This was done to allow support for modern cryptographic hashes, e.g. sha256.

All 250 of the tests runs. The multiple file checksum and file types works with all storage types.

Signed-off-by: Jeffrey Otterson [email protected]

@dholth
Copy link
Contributor

dholth commented Nov 29, 2021

I have an overlapping pull request at #197 but I did not implement parsing.

Nice to see that you were able to implement parsing for all formats.

Were you able to validate the json output against https://github.com/spdx/spdx-spec/blob/development/v2.2.2/schemas/spdx-schema.json ?

@jotterson
Copy link
Contributor Author

I have an overlapping pull request at #197 but I did not implement parsing.

Nice to see that you were able to implement parsing for all formats.

Were you able to validate the json output against https://github.com/spdx/spdx-spec/blob/development/v2.2.2/schemas/spdx-schema.json ?

I do not know how to do that. I read the schema, and I would say that I do not believe that my changes are compatible with the schema, but that is because I believe the schema is not compliant with the SPDX spec. I am not confident in my ability to read that schema, but my interpretation of that schema only allows one type for a file, and the specification says the cardinality is 0..* (https://spdx.github.io/spdx-spec/file-information/#83-file-type-field) so ??? What is more correct? Jeff

@dholth
Copy link
Contributor

dholth commented Nov 29, 2021

I have an overlapping pull request at #197 but I did not implement parsing.
Nice to see that you were able to implement parsing for all formats.
Were you able to validate the json output against https://github.com/spdx/spdx-spec/blob/development/v2.2.2/schemas/spdx-schema.json ?

I do not know how to do that. I read the schema, and I would say that I do not believe that my changes are compatible with the schema, but that is because I believe the schema is not compliant with the SPDX spec. I am not confident in my ability to read that schema, but my interpretation of that schema only allows one type for a file, and the specification says the cardinality is 0..* (https://spdx.github.io/spdx-spec/file-information/#83-file-type-field) so ??? What is more correct? Jeff

I don't know about file type cardinality but there are a couple of more mundane details like the placement of the checksum field and name of algorithms in json, and the presence or absence of certain fields.

    from jsonschema import validate
    validate(spdx_data, json.load(open("spdx-schema.json")))

@jotterson
Copy link
Contributor Author

I have an overlapping pull request at #197 but I did not implement parsing.
Nice to see that you were able to implement parsing for all formats.
Were you able to validate the json output against https://github.com/spdx/spdx-spec/blob/development/v2.2.2/schemas/spdx-schema.json ?

I do not know how to do that. I read the schema, and I would say that I do not believe that my changes are compatible with the schema, but that is because I believe the schema is not compliant with the SPDX spec. I am not confident in my ability to read that schema, but my interpretation of that schema only allows one type for a file, and the specification says the cardinality is 0..* (https://spdx.github.io/spdx-spec/file-information/#83-file-type-field) so ??? What is more correct? Jeff

I don't know about file type cardinality but there are a couple of more mundane details like the placement of the checksum field and name of algorithms in json, and the presence or absence of certain fields.

    from jsonschema import validate
    validate(spdx_data, json.load(open("spdx-schema.json")))

Thanks for that. I still don't have the schema verifying, but I have fixed some of the schema validation problems (extra 'checksumAlgorithm_' text, etc.) I will merge what I have into my branch soon. I tried to put the schema validation right into the tests, but I think that unless I have your flatten and unflatten functions, that is not going to work right. I will note that I do have read/write for all file formats working. Jeff.

@pombredanne
Copy link
Member

@jotterson Thank you ++ .. let me review in details possibly later this week

@0t1s1
Copy link

0t1s1 commented Jan 28, 2022

What is the status of this PR? I can assist if needed or I can make a separate PR from work I have done late last year that I believe solves the same issue.

@jotterson
Copy link
Contributor Author

I would still like to get this PR committed, somebody has to approve it. The changes are useful, and the tests all pass. Jeff.

@0t1s1
Copy link

0t1s1 commented Feb 4, 2022

@pombredanne is this still under review?

@jotterson
Copy link
Contributor Author

I sure would like to merge this. What's it going to take? @pombredanne ???

@nicoweidner
Copy link
Collaborator

Hi @jotterson . Sorry this got sidelined for so long!
We are trying to renew activity in this repo and this is certainly an important fix (or rather addition). There is a backlog of open PRs though, and #197 seems to have overlap. We'll try to get that one merged first, then figure out which changes in this PR should still go in.

@nicoweidner nicoweidner added the blocked Blocked by some other PR or issue. Details should be mentioned in comments. label Oct 25, 2022
Copy link
Collaborator

@meretp meretp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your effort! I think these are great additions eventhough it would have been easier to review if the changes were in seperate PRs (one for the multiple checksums, one for file types and for one for the changes concerning xml-parser/writer).

This PR still depends on PR #254 (superseding #197) and there will be some conflicts when rebasing but to speed things up I started the review and added some comments. I have not yet looked closely at the changes for the XML parser and writer but I like the approach. I think that this will go well with the planned refactoring in the direction of separating the builder and parser.

spdx/checksum.py Outdated Show resolved Hide resolved
spdx/file.py Outdated Show resolved Hide resolved
spdx/parsers/jsonyamlxmlbuilders.py Outdated Show resolved Hide resolved
spdx/parsers/lexers/tagvalue.py Outdated Show resolved Hide resolved
spdx/writers/jsonyamlxml.py Show resolved Hide resolved
spdx/writers/jsonyamlxml.py Show resolved Hide resolved
spdx/writers/jsonyamlxml.py Show resolved Hide resolved
spdx/writers/jsonyamlxml.py Outdated Show resolved Hide resolved
spdx/writers/jsonyamlxml.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@nicoweidner nicoweidner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jotterson, thanks for the big contribution! We definitely want to get this merged, as there are a lot of valuable fixes in this PR.

I left quite a lot of comments; I apologize in advance if some of those comments concern issues that are better fixed separately (I am not super familiar with the current state of the tools yet). If you see cases of this, please feel free to point it out, then we can move them to separate issues.

As this PR is quite old by now and didn't get much attention for a long time, could you give an indication whether you want to continue working on it? Else I would try to make the necessary fixes and finally get it merged.

edit: One more general note: Usually, I would request some more unit tests for added functionality, so we can slowly improve test coverage of the Python tools. However, I feel that many of those tests would become redundant with the planned data model refactoring, so it's probably not worth it spending a lot of time on tests right now.

spdx/checksum.py Outdated Show resolved Hide resolved
spdx/checksum.py Outdated Show resolved Hide resolved
spdx/file.py Outdated Show resolved Hide resolved
spdx/file.py Outdated Show resolved Hide resolved
spdx/file.py Outdated Show resolved Hide resolved
tests/data/formats/SPDXXmlExample.xml Outdated Show resolved Hide resolved
Comment on lines +111 to +114
if isinstance(v[0], dict):
v = sorted(v, key=lambda x: json.dumps(x, sort_keys=True))
else:
v = sorted(v)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get the logic. Why don't we just call sort_nested(v) here?
Instead, the sorting used here should be applied to data in the case where it is a list

del(data['SpdxDocument']['Document']['creationInfo'])
parser = xmlparser.Parser(Builder(), StandardLogger())
with io.open(location, encoding='utf-8') as f:
document, _ = parser.parse(f)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should raise an error in case parsing fails?

tests/utils_test.py Show resolved Hide resolved
tests/utils_test.py Show resolved Hide resolved
@nicoweidner nicoweidner removed the blocked Blocked by some other PR or issue. Details should be mentioned in comments. label Nov 3, 2022
@jotterson
Copy link
Contributor Author

I'm working on this. There are many requested changes, plus a potential rebase. It's gonna take a few days.

If you're in a hurry, I'm about half done, all the tests are currently passing, but the example data is still old. Let me know, and I will make a new commit with the fixes that I have.

@nicoweidner
Copy link
Collaborator

I'm working on this. There are many requested changes, plus a potential rebase. It's gonna take a few days.

If you're in a hurry, I'm about half done, all the tests are currently passing, but the example data is still old. Let me know, and I will make a new commit with the fixes that I have.

I'd like to get this merged fairly soon, but there is no immediate hurry. I'd say a timeline of about a week would be better than a month. I feel bad creating pressure here, since this PR was ignored for so long, sorry about that! Our plan is to get all outstanding fixes (and maybe a few additional ones) in and do an as-is release, since the last release was like 3 years ago. Then we plan to tackle some larger structural changes to the tools.
In case you feel you'll be short on time, I can also try to help out with some of the fixes - in particular the rebase may take some work. I believe @meretp already had a look at that and noticed quite a few conflicts.

Also, sorry for the pile of comments, I tend to be very verbose in reviews 😅. I do this with the mindset that it's ok to disagree with comments, I am very rarely dogmatic (some kind of reason should be given though).

@meretp meretp mentioned this pull request Nov 9, 2022
@jotterson
Copy link
Contributor Author

The tag/value parser is going to need a real enema. There are some horrible assumptions in there related to sequencing, package before files, etc. The latest tag/value files from the tools-java project really demonstrate that problem.

IMHO the current yacc solution is excessively complicated.

@nicoweidner
Copy link
Collaborator

nicoweidner commented Nov 14, 2022

The tag/value parser is going to need a real enema. There are some horrible assumptions in there related to sequencing, package before files, etc. The latest tag/value files from the tools-java project really demonstrate that problem.

IMHO the current yacc solution is excessively complicated.

The tag/value parser is a catastrophe. Configuring the parser via docstrings is honestly the most dubious thing I have seen in my coding life so far.
I am not sure how a reasonably simple solution could look like, though. I think tag/value is inherently a bad format for SPDX, because SPDX documents are deeply nested and tag/value has no native support for nesting. The tools-java library uses a custom-built solution that uses less black magic, but it's not really shorter either (see parser and builder)

That said, @meretp, @armintaenzertng and myself dealt with it (read: had to deal with it...) recently while adding some new properties. If it's causing you headaches, feel free to skip tag/value for now and we'll try to finish it afterwards.

json, yaml, xml formats appear to be working.
tag/value and rdf is not working yet.
updated test/data/formats files from tools-java project.
@jotterson
Copy link
Contributor Author

Here is an update to my commit of 51 weeks ago (!).

The CircleCI tests are failing because the tag/value and RDF formats are not working. JSON, YAML, XML all work, spdx data can be converted between those three formats with no apparent loss of fidelity.

@nicoweidner
Copy link
Collaborator

Hi @jotterson. I am afraid I have to step in here, because with your recent massive changes, I see no way to get this PR rebased and merged. It is way too big for a single PR and addresses a variety of issues instead of focusing on a single one (several of those issues are already fixed on main). I do realize that the missing review for this PR (for about a year) played a large part in the current situation. I'm really sorry about that, but it's neither something I can change nor something that changes the situation.

In order to get this merged, the PR needs to be broken up into smaller chunks, such as:

  • support for multiple checksums and file types, as the name suggests (could even be broken up into two)
  • adding example files from tools-java
  • annotation parsing changes (didn't look at them in detail, but I don't think they belong in this PR)
  • changes to external references
  • changes to the xml parser (including a description of how this is preferrable to the previous simple solution that converted the xml to a plain dict)

With the changes concentrated in very few, massive commits, I don't see a way to easily cherry-pick specific aspects. The latest commit is also missing a signoff, though that is probably the smallest problem.

Issues that are already fixed on main:

  • renaming licenseInfoFromSnippet
  • files on document-level instead of nested inside packages
  • adding SPDX v2.3 properties such as packagePrimaryPurpose (2.3 support issues are tracked in Support SPDX 2.3 #227)
  • snippet byte/line range

Things that shouldn't be changed:

  • document_describes and has_files are Json-specific fields (+yaml/xml), they should not be added to the data model. They have to be parsed from Json and written to Json when de-/serializing

I am happy to help get smaller PRs reviewed quickly so the changes can be merged, but rebasing this PR in the current state will simply not be possible, I'm afraid.

@jotterson jotterson closed this Nov 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants