add support for multiple file checksums, file types #200

jotterson · 2021-11-26T14:53:09Z

This change allows multiple checksums and file types on Files. This improves compliance with the SPDX 2.1 and newer specifications.

This was done to allow support for modern cryptographic hashes, e.g. sha256.

All 250 of the tests runs. The multiple file checksum and file types works with all storage types.

Signed-off-by: Jeffrey Otterson [email protected]

Signed-off-by: Jeffrey Otterson <[email protected]>

dholth · 2021-11-29T15:44:58Z

I have an overlapping pull request at #197 but I did not implement parsing.

Nice to see that you were able to implement parsing for all formats.

Were you able to validate the json output against https://github.com/spdx/spdx-spec/blob/development/v2.2.2/schemas/spdx-schema.json ?

jotterson · 2021-11-29T22:23:13Z

I have an overlapping pull request at #197 but I did not implement parsing.

Nice to see that you were able to implement parsing for all formats.

Were you able to validate the json output against https://github.com/spdx/spdx-spec/blob/development/v2.2.2/schemas/spdx-schema.json ?

I do not know how to do that. I read the schema, and I would say that I do not believe that my changes are compatible with the schema, but that is because I believe the schema is not compliant with the SPDX spec. I am not confident in my ability to read that schema, but my interpretation of that schema only allows one type for a file, and the specification says the cardinality is 0..* (https://spdx.github.io/spdx-spec/file-information/#83-file-type-field) so ??? What is more correct? Jeff

dholth · 2021-11-29T22:49:29Z

I have an overlapping pull request at #197 but I did not implement parsing.
Nice to see that you were able to implement parsing for all formats.
Were you able to validate the json output against https://github.com/spdx/spdx-spec/blob/development/v2.2.2/schemas/spdx-schema.json ?

I do not know how to do that. I read the schema, and I would say that I do not believe that my changes are compatible with the schema, but that is because I believe the schema is not compliant with the SPDX spec. I am not confident in my ability to read that schema, but my interpretation of that schema only allows one type for a file, and the specification says the cardinality is 0..* (https://spdx.github.io/spdx-spec/file-information/#83-file-type-field) so ??? What is more correct? Jeff

I don't know about file type cardinality but there are a couple of more mundane details like the placement of the checksum field and name of algorithms in json, and the presence or absence of certain fields.

    from jsonschema import validate
    validate(spdx_data, json.load(open("spdx-schema.json")))

jotterson · 2021-12-01T20:08:04Z

I have an overlapping pull request at #197 but I did not implement parsing.
Nice to see that you were able to implement parsing for all formats.
Were you able to validate the json output against https://github.com/spdx/spdx-spec/blob/development/v2.2.2/schemas/spdx-schema.json ?

I do not know how to do that. I read the schema, and I would say that I do not believe that my changes are compatible with the schema, but that is because I believe the schema is not compliant with the SPDX spec. I am not confident in my ability to read that schema, but my interpretation of that schema only allows one type for a file, and the specification says the cardinality is 0..* (https://spdx.github.io/spdx-spec/file-information/#83-file-type-field) so ??? What is more correct? Jeff

I don't know about file type cardinality but there are a couple of more mundane details like the placement of the checksum field and name of algorithms in json, and the presence or absence of certain fields.
    from jsonschema import validate
    validate(spdx_data, json.load(open("spdx-schema.json")))

Thanks for that. I still don't have the schema verifying, but I have fixed some of the schema validation problems (extra 'checksumAlgorithm_' text, etc.) I will merge what I have into my branch soon. I tried to put the schema validation right into the tests, but I think that unless I have your flatten and unflatten functions, that is not going to work right. I will note that I do have read/write for all file formats working. Jeff.

Signed-off-by: Jeffrey Otterson <[email protected]>

pombredanne · 2021-12-15T14:59:32Z

@jotterson Thank you ++ .. let me review in details possibly later this week

0t1s1 · 2022-01-28T02:34:09Z

What is the status of this PR? I can assist if needed or I can make a separate PR from work I have done late last year that I believe solves the same issue.

jotterson · 2022-02-01T00:00:08Z

I would still like to get this PR committed, somebody has to approve it. The changes are useful, and the tests all pass. Jeff.

0t1s1 · 2022-02-04T18:38:35Z

@pombredanne is this still under review?

jotterson · 2022-06-28T12:49:39Z

I sure would like to merge this. What's it going to take? @pombredanne ???

nicoweidner · 2022-10-25T09:55:04Z

Hi @jotterson . Sorry this got sidelined for so long!
We are trying to renew activity in this repo and this is certainly an important fix (or rather addition). There is a backlog of open PRs though, and #197 seems to have overlap. We'll try to get that one merged first, then figure out which changes in this PR should still go in.

meretp

Thanks for your effort! I think these are great additions eventhough it would have been easier to review if the changes were in seperate PRs (one for the multiple checksums, one for file types and for one for the changes concerning xml-parser/writer).

This PR still depends on PR #254 (superseding #197) and there will be some conflicts when rebasing but to speed things up I started the review and added some comments. I have not yet looked closely at the changes for the XML parser and writer but I like the approach. I think that this will go well with the planned refactoring in the direction of separating the builder and parser.

spdx/checksum.py

spdx/file.py

spdx/parsers/jsonyamlxmlbuilders.py

spdx/parsers/lexers/tagvalue.py

spdx/writers/jsonyamlxml.py

nicoweidner

Hi @jotterson, thanks for the big contribution! We definitely want to get this merged, as there are a lot of valuable fixes in this PR.

I left quite a lot of comments; I apologize in advance if some of those comments concern issues that are better fixed separately (I am not super familiar with the current state of the tools yet). If you see cases of this, please feel free to point it out, then we can move them to separate issues.

As this PR is quite old by now and didn't get much attention for a long time, could you give an indication whether you want to continue working on it? Else I would try to make the necessary fixes and finally get it merged.

edit: One more general note: Usually, I would request some more unit tests for added functionality, so we can slowly improve test coverage of the Python tools. However, I feel that many of those tests would become redundant with the planned data model refactoring, so it's probably not worth it spending a lot of time on tests right now.

spdx/checksum.py

spdx/file.py

tests/data/formats/SPDXXmlExample.xml

nicoweidner · 2022-10-31T15:19:01Z

tests/utils_test.py

+                if isinstance(v[0], dict):
+                    v = sorted(v, key=lambda x: json.dumps(x, sort_keys=True))
+                else:
+                    v = sorted(v)


I don't get the logic. Why don't we just call sort_nested(v) here?
Instead, the sorting used here should be applied to data in the case where it is a list

nicoweidner · 2022-10-31T15:35:59Z

tests/utils_test.py

-        del(data['SpdxDocument']['Document']['creationInfo'])
+    parser = xmlparser.Parser(Builder(), StandardLogger())
+    with io.open(location, encoding='utf-8') as f:
+        document, _ = parser.parse(f)


Maybe we should raise an error in case parsing fails?

tests/utils_test.py

jotterson · 2022-11-05T01:57:57Z

I'm working on this. There are many requested changes, plus a potential rebase. It's gonna take a few days.

If you're in a hurry, I'm about half done, all the tests are currently passing, but the example data is still old. Let me know, and I will make a new commit with the fixes that I have.

nicoweidner · 2022-11-08T11:43:42Z

I'm working on this. There are many requested changes, plus a potential rebase. It's gonna take a few days.

If you're in a hurry, I'm about half done, all the tests are currently passing, but the example data is still old. Let me know, and I will make a new commit with the fixes that I have.

I'd like to get this merged fairly soon, but there is no immediate hurry. I'd say a timeline of about a week would be better than a month. I feel bad creating pressure here, since this PR was ignored for so long, sorry about that! Our plan is to get all outstanding fixes (and maybe a few additional ones) in and do an as-is release, since the last release was like 3 years ago. Then we plan to tackle some larger structural changes to the tools.
In case you feel you'll be short on time, I can also try to help out with some of the fixes - in particular the rebase may take some work. I believe @meretp already had a look at that and noticed quite a few conflicts.

Also, sorry for the pile of comments, I tend to be very verbose in reviews 😅. I do this with the mindset that it's ok to disagree with comments, I am very rarely dogmatic (some kind of reason should be given though).

jotterson · 2022-11-13T15:02:57Z

The tag/value parser is going to need a real enema. There are some horrible assumptions in there related to sequencing, package before files, etc. The latest tag/value files from the tools-java project really demonstrate that problem.

IMHO the current yacc solution is excessively complicated.

nicoweidner · 2022-11-14T14:40:29Z

The tag/value parser is going to need a real enema. There are some horrible assumptions in there related to sequencing, package before files, etc. The latest tag/value files from the tools-java project really demonstrate that problem.

IMHO the current yacc solution is excessively complicated.

The tag/value parser is a catastrophe. Configuring the parser via docstrings is honestly the most dubious thing I have seen in my coding life so far.
I am not sure how a reasonably simple solution could look like, though. I think tag/value is inherently a bad format for SPDX, because SPDX documents are deeply nested and tag/value has no native support for nesting. The tools-java library uses a custom-built solution that uses less black magic, but it's not really shorter either (see parser and builder)

That said, @meretp, @armintaenzertng and myself dealt with it (read: had to deal with it...) recently while adding some new properties. If it's causing you headaches, feel free to skip tag/value for now and we'll try to finish it afterwards.

json, yaml, xml formats appear to be working. tag/value and rdf is not working yet. updated test/data/formats files from tools-java project.

jotterson · 2022-11-20T20:24:45Z

Here is an update to my commit of 51 weeks ago (!).

The CircleCI tests are failing because the tag/value and RDF formats are not working. JSON, YAML, XML all work, spdx data can be converted between those three formats with no apparent loss of fidelity.

nicoweidner · 2022-11-22T17:30:45Z

Hi @jotterson. I am afraid I have to step in here, because with your recent massive changes, I see no way to get this PR rebased and merged. It is way too big for a single PR and addresses a variety of issues instead of focusing on a single one (several of those issues are already fixed on main). I do realize that the missing review for this PR (for about a year) played a large part in the current situation. I'm really sorry about that, but it's neither something I can change nor something that changes the situation.

In order to get this merged, the PR needs to be broken up into smaller chunks, such as:

support for multiple checksums and file types, as the name suggests (could even be broken up into two)
adding example files from tools-java
annotation parsing changes (didn't look at them in detail, but I don't think they belong in this PR)
changes to external references
changes to the xml parser (including a description of how this is preferrable to the previous simple solution that converted the xml to a plain dict)

With the changes concentrated in very few, massive commits, I don't see a way to easily cherry-pick specific aspects. The latest commit is also missing a signoff, though that is probably the smallest problem.

Issues that are already fixed on main:

renaming licenseInfoFromSnippet
files on document-level instead of nested inside packages
adding SPDX v2.3 properties such as packagePrimaryPurpose (2.3 support issues are tracked in Support SPDX 2.3 #227)
snippet byte/line range

Things that shouldn't be changed:

document_describes and has_files are Json-specific fields (+yaml/xml), they should not be added to the data model. They have to be parsed from Json and written to Json when de-/serializing

I am happy to help get smaller PRs reviewed quickly so the changes can be merged, but rebasing this PR in the current state will simply not be possible, I'm afraid.

jotterson and others added 3 commits November 26, 2021 09:47

add support for multiple file checksums, file types

c6043bf

Signed-off-by: Jeffrey Otterson <[email protected]>

Merge branch 'main' into checksums_final

715f992

support python before 3.9

7bbecc0

Signed-off-by: Jeffrey Otterson <[email protected]>

jotterson force-pushed the checksums_final branch from c3c610d to 7bbecc0 Compare November 26, 2021 15:41

allow multiple checksums for package, too.

d217936

Signed-off-by: Jeffrey Otterson <[email protected]>

Merge branch 'main' into checksums_final

7383eb8

nicoweidner added the blocked Blocked by some other PR or issue. Details should be mentioned in comments. label Oct 25, 2022

nicoweidner mentioned this pull request Oct 25, 2022

ISSUES-196 parse spdx 2.2 #250

Closed

meretp mentioned this pull request Oct 27, 2022

[issue-184] validate against json-spec #254

Merged

meretp reviewed Oct 27, 2022

View reviewed changes

nicoweidner requested changes Oct 31, 2022

View reviewed changes

nicoweidner removed the blocked Blocked by some other PR or issue. Details should be mentioned in comments. label Nov 3, 2022

meretp mentioned this pull request Nov 9, 2022

Support SPDX 2.3 #227

Closed

support spdx 2.3

674312e

json, yaml, xml formats appear to be working. tag/value and rdf is not working yet. updated test/data/formats files from tools-java project.

jotterson closed this Nov 22, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add support for multiple file checksums, file types #200

add support for multiple file checksums, file types #200

jotterson commented Nov 26, 2021

dholth commented Nov 29, 2021 •

edited

Loading

jotterson commented Nov 29, 2021

dholth commented Nov 29, 2021

jotterson commented Dec 1, 2021

pombredanne commented Dec 15, 2021

0t1s1 commented Jan 28, 2022

jotterson commented Feb 1, 2022

0t1s1 commented Feb 4, 2022

jotterson commented Jun 28, 2022

nicoweidner commented Oct 25, 2022

meretp left a comment

nicoweidner left a comment •

edited

Loading

nicoweidner Oct 31, 2022

nicoweidner Oct 31, 2022

jotterson commented Nov 5, 2022

nicoweidner commented Nov 8, 2022

jotterson commented Nov 13, 2022

nicoweidner commented Nov 14, 2022 •

edited

Loading

jotterson commented Nov 20, 2022

nicoweidner commented Nov 22, 2022

add support for multiple file checksums, file types #200

add support for multiple file checksums, file types #200

Conversation

jotterson commented Nov 26, 2021

dholth commented Nov 29, 2021 • edited Loading

jotterson commented Nov 29, 2021

dholth commented Nov 29, 2021

jotterson commented Dec 1, 2021

pombredanne commented Dec 15, 2021

0t1s1 commented Jan 28, 2022

jotterson commented Feb 1, 2022

0t1s1 commented Feb 4, 2022

jotterson commented Jun 28, 2022

nicoweidner commented Oct 25, 2022

meretp left a comment

Choose a reason for hiding this comment

nicoweidner left a comment • edited Loading

Choose a reason for hiding this comment

nicoweidner Oct 31, 2022

Choose a reason for hiding this comment

nicoweidner Oct 31, 2022

Choose a reason for hiding this comment

jotterson commented Nov 5, 2022

nicoweidner commented Nov 8, 2022

jotterson commented Nov 13, 2022

nicoweidner commented Nov 14, 2022 • edited Loading

jotterson commented Nov 20, 2022

nicoweidner commented Nov 22, 2022

dholth commented Nov 29, 2021 •

edited

Loading

nicoweidner left a comment •

edited

Loading

nicoweidner commented Nov 14, 2022 •

edited

Loading