
Handle BOMs when loading HRA criteria tables #1461

Merged

6 commits merged into natcap:main on Dec 11, 2023

Conversation

@phargogh (Member) commented Dec 5, 2023

This PR corrects (and tests for) loading the HRA criteria table using the UTF-8-SIG encoding.

Fixes #1460

Checklist

- [ ] Updated HISTORY.rst and link to any relevant issue (if these changes are user-facing)
- [ ] Updated the user's guide (if needed)
- [ ] Tested the Workbench UI (if relevant)

…460-hra-csv-bom-loading-issue

Conflicts:
	HISTORY.rst
	src/natcap/invest/hra.py
@phargogh (Member, Author) commented Dec 5, 2023

With the update to use utils.read_csv_to_dataframe, the only value-add here is the test to make sure we can handle the BOM when parsing.
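
(For context, a minimal sketch of the general pattern, assuming the helper ultimately delegates to pandas.read_csv; the real utils.read_csv_to_dataframe has a different signature and does more than what is shown here.)

import pandas

def read_csv_to_dataframe(path, **kwargs):
    # 'utf-8-sig' reads plain UTF-8 unchanged, but transparently strips a
    # leading BOM (bytes 0xef 0xbb 0xbf) if one is present, so CSVs saved by
    # Excel or Notepad parse the same as BOM-free ones.
    return pandas.read_csv(path, encoding='utf-8-sig', **kwargs)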

@dcdenu4 (Member) left a comment:

BOMs away!

@dcdenu4 dcdenu4 self-requested a review December 6, 2023 19:52
@dcdenu4 (Member) left a comment:

Oops, I was a little premature with my approval. Looks like the Windows tests are failing on this.

@dcdenu4 (Member) left a comment:

Thanks @phargogh, still some funky Windows things unfortunately!

@@ -331,12 +331,11 @@ def test_criteria_table_parsing_with_bom(self):
         from natcap.invest import hra

         criteria_table_path = os.path.join(self.workspace_dir, 'criteria.csv')
-        with open(criteria_table_path, 'w') as criteria_table:
-            bom_char = "\uFEFF"  # byte-order marker in 16-bit hex value
+        with open(criteria_table_path, 'w', encoding='utf-8-sig') as criteria_table:
@dcdenu4 (Member):

utf-8-sig includes the BOM when writing the file. Cool!
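
(A standalone snippet to illustrate the point above; the file name is made up.)

with open('bom-demo.csv', 'w', encoding='utf-8-sig') as demo:
    demo.write('HABITAT NAME,eelgrass\n')

with open('bom-demo.csv', 'rb') as demo:
    # The utf-8-sig codec prepended the three BOM bytes when writing.
    print(demo.read()[:3])  # b'\xef\xbb\xbf'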

Comment on lines 355 to 357
bom_char = "\uFEFF"  # byte-order marker in 16-bit hex value
with open(criteria_table_path) as criteria_table:
    assert criteria_table.read().startswith(bom_char)
@dcdenu4 (Member):

This bit is still failing on Windows and I think it's because of the following from Python's codec docs (https://docs.python.org/3/library/codecs.html#encodings-and-unicode)

On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.

So maybe it's being stripped out when opening and decoding? Although this is contradictory to what I'm seeing in my terminal:

>>> with open("test-bom.csv", 'w', encoding='utf-8-sig') as my_table:
...     my_table.write("HABITAT NAME,eelgrass,,,hardbottom,,,CRITERIA TYPE")
>>>
>>> with open("test-bom.csv") as table:
...     print(table.read())
...
HABITAT NAME,eelgrass,,,hardbottom,,,CRITERIA TYPE

Which makes sense because:

To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it’s rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to

LATIN SMALL LETTER I WITH DIAERESIS
RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence.

So, I'm not sure why the test assertion error looks like the BOM is being decoded and stripped, whereas when I open and read it I get the funky character mapping...
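
(One possible explanation, offered as an editorial aside rather than something established in the thread: text-mode open() without an explicit encoding falls back to the locale's preferred encoding, which differs across platforms, so the same bytes can decode differently on Windows than on Linux or macOS.)

import locale

# Commonly cp1252 on Windows and UTF-8 on Linux/macOS; under cp1252 the BOM
# bytes decode to 'ï»¿' rather than '\ufeff', so startswith(bom_char) fails.
print(locale.getpreferredencoding(False))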

@phargogh phargogh requested a review from dcdenu4 December 8, 2023 04:57
@phargogh (Member, Author) commented Dec 8, 2023

Thanks @dcdenu4! It looks like the missing part was to open it up in binary mode ... that's going to bypass any interpretation or OS-specific assumptions about whether to strip those leading bytes.
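
(A minimal sketch of that fix; the exact assertion in the merged test may read differently. criteria_table_path is the test CSV written above.)

UTF8_BOM = b'\xef\xbb\xbf'  # the UTF-8 encoding of "\uFEFF"

with open(criteria_table_path, 'rb') as criteria_table:
    # Binary mode does no decoding at all, so the leading BOM bytes stay
    # visible regardless of the platform's default text encoding.
    assert criteria_table.read().startswith(UTF8_BOM)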

@phargogh phargogh mentioned this pull request Dec 9, 2023
@dcdenu4 (Member) left a comment:

Thanks James. There are some Workbench-related dependencies failing, but they're unrelated.

@dcdenu4 dcdenu4 merged commit 9ec52a8 into natcap:main Dec 11, 2023
23 of 25 checks passed
Successfully merging this pull request may close these issues:

- HRA criteria table loading fails cryptically when CSV has a BOM (#1460)