
Handle BOMs when loading HRA criteria tables #1461

Merged

6 commits merged into natcap:main on Dec 11, 2023

Conversation

@phargogh (Member) commented Dec 5, 2023

This PR corrects (and tests for) loading the HRA criteria table using the UTF-8-SIG encoding.

Fixes #1460

Checklist

- [ ] Updated HISTORY.rst and link to any relevant issue (if these changes are user-facing)
- [ ] Updated the user's guide (if needed)
- [ ] Tested the Workbench UI (if relevant)

…460-hra-csv-bom-loading-issue

Conflicts:
	HISTORY.rst
	src/natcap/invest/hra.py
@phargogh (Member, Author) commented Dec 5, 2023

With the update to use utils.read_csv_to_dataframe, the only value-add here is the test to make sure we can handle the BOM when parsing.
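
(For context, a minimal sketch of the general pattern, assuming the helper ultimately delegates to pandas.read_csv; the real utils.read_csv_to_dataframe has a different signature and does more than what is shown here.)

import pandas

def read_csv_to_dataframe(path, **kwargs):
    # 'utf-8-sig' reads plain UTF-8 unchanged, but transparently strips a
    # leading BOM (bytes 0xef 0xbb 0xbf) if one is present, so CSVs saved by
    # Excel or Notepad parse the same as BOM-free ones.
    return pandas.read_csv(path, encoding='utf-8-sig', **kwargs)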

@dcdenu4 (Member) left a comment:

BOMs away!

@dcdenu4 dcdenu4 self-requested a review December 6, 2023 19:52
@dcdenu4 (Member) left a comment:

Oops, I was a little premature with my approval. Looks like the Windows tests are failing on this.

@dcdenu4 (Member) left a comment:

Thanks @phargogh, still some funky Windows things unfortunately!

@@ -331,12 +331,11 @@ def test_criteria_table_parsing_with_bom(self):
         from natcap.invest import hra

         criteria_table_path = os.path.join(self.workspace_dir, 'criteria.csv')
-        with open(criteria_table_path, 'w') as criteria_table:
-            bom_char = "\uFEFF"  # byte-order marker in 16-bit hex value
+        with open(criteria_table_path, 'w', encoding='utf-8-sig') as criteria_table:
@dcdenu4 (Member):

utf-8-sig includes the BOM when writing the file. Cool!
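
(A standalone snippet to illustrate the point above; the file name is made up.)

with open('bom-demo.csv', 'w', encoding='utf-8-sig') as demo:
    demo.write('HABITAT NAME,eelgrass\n')

with open('bom-demo.csv', 'rb') as demo:
    # The utf-8-sig codec prepended the three BOM bytes when writing.
    print(demo.read()[:3])  # b'\xef\xbb\xbf'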

Comment on lines 355 to 357
bom_char = "\uFEFF"  # byte-order marker in 16-bit hex value
with open(criteria_table_path) as criteria_table:
    assert criteria_table.read().startswith(bom_char)
@dcdenu4 (Member):

This bit is still failing on Windows and I think it's because of the following from Python's codec docs (https://docs.python.org/3/library/codecs.html#encodings-and-unicode)

On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.

So maybe it's being stripped out when opening and decoding? Although this is contradictory to what I'm seeing in my terminal:

>>> with open("test-bom.csv", 'w', encoding='utf-8-sig') as my_table:
...     my_table.write("HABITAT NAME,eelgrass,,,hardbottom,,,CRITERIA TYPE")
>>>
>>> with open("test-bom.csv") as table:
...     print(table.read())
...
HABITAT NAME,eelgrass,,,hardbottom,,,CRITERIA TYPE

Which makes sense because:

To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it’s rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to

LATIN SMALL LETTER I WITH DIAERESIS
RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence.

So, I'm not sure why the test assertion error looks like the BOM is being decoded and stripped, whereas when I open and read it I get the funky character mapping...
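
(One possible explanation, offered as an editorial aside rather than something established in the thread: text-mode open() without an explicit encoding falls back to the locale's preferred encoding, which differs across platforms, so the same bytes can decode differently on Windows than on Linux or macOS.)

import locale

# Commonly cp1252 on Windows and UTF-8 on Linux/macOS; under cp1252 the BOM
# bytes decode to 'ï»¿' rather than '\ufeff', so startswith(bom_char) fails.
print(locale.getpreferredencoding(False))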

@phargogh phargogh requested a review from dcdenu4 December 8, 2023 04:57
@phargogh (Member, Author) commented Dec 8, 2023

Thanks @dcdenu4! It looks like the missing part was to open it up in binary mode ... that's going to bypass any interpretation or OS-specific assumptions about whether to strip those leading bytes.
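
(A minimal sketch of that fix; the exact assertion in the merged test may read differently. criteria_table_path is the test CSV written above.)

UTF8_BOM = b'\xef\xbb\xbf'  # the UTF-8 encoding of "\uFEFF"

with open(criteria_table_path, 'rb') as criteria_table:
    # Binary mode does no decoding at all, so the leading BOM bytes stay
    # visible regardless of the platform's default text encoding.
    assert criteria_table.read().startswith(UTF8_BOM)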

@phargogh phargogh mentioned this pull request Dec 9, 2023
@dcdenu4 (Member) left a comment:

Thanks James. There are some Workbench-related dependencies failing, but they're unrelated.

@dcdenu4 dcdenu4 merged commit 9ec52a8 into natcap:main Dec 11, 2023
23 of 25 checks passed
Successfully merging this pull request may close these issues:

- HRA criteria table loading fails cryptically when CSV has a BOM (#1460)