Reject document if the JSON is not UTF-8 encoded #604

Open
bernhardreiter opened this issue Dec 10, 2024 · 1 comment

Comments

@bernhardreiter
Member

Currently, documents that are not UTF-8 encoded are downloaded and (most likely) not flagged by the checker.

We found examples of latin-1 encoding.

But JSON exchanged over the internet must be UTF-8 encoded.

The standard https://datatracker.ietf.org/doc/html/rfc8259#section-8.1 is clear on this:

JSON text exchanged between systems that are not part of a closed
ecosystem MUST be encoded using UTF-8

curl -O https://ftp.suse.com/pub/projects/security/csaf/suse-su-2024_1139-1.json
python3
Python 3.11.2 (main, Sep 14 2024, 03:00:30) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> with open("suse-su-2024_1139-1.json") as f:
...   json.load(f)
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.11/json/__init__.py", line 293, in load
    return loads(fp.read(),
                 ^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 1591: invalid start byte
>>> b=open("suse-su-2024_1139-1.json","rb")
>>> b.read()[1591]
man latin1 | grep " AE "
       256   174   AE     ®     REGISTERED SIGN
@bernhardreiter
Member Author

A call like utf8.Valid (https://pkg.go.dev/unicode/utf8#Valid) could help.
