Handle BOMs when loading HRA criteria tables #1461
Conversation
…460-hra-csv-bom-loading-issue
Conflicts:
    HISTORY.rst
    src/natcap/invest/hra.py
With the update to use …
BOMs away!
Oops, I was a little premature with my approval. Looks like the Windows tests are failing on this.
Thanks @phargogh , still some funky Windows things unfortunately!
@@ -331,12 +331,11 @@ def test_criteria_table_parsing_with_bom(self):
from natcap.invest import hra

criteria_table_path = os.path.join(self.workspace_dir, 'criteria.csv')
with open(criteria_table_path, 'w') as criteria_table:
bom_char = "\uFEFF" # byte-order marker in 16-bit hex value
with open(criteria_table_path, 'w', encoding='utf-8-sig') as criteria_table:
utf-8-sig includes the BOM when writing the file. Cool!
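A quick way to confirm that outside the test (a throwaway sketch; the file name here is made up) is to read the file back in binary mode, which shows the three BOM bytes the utf-8-sig codec prepends:

>>> with open("bom-demo.csv", 'w', encoding='utf-8-sig') as my_table:
...     my_table.write("HABITAT NAME,eelgrass,,,hardbottom,,,CRITERIA TYPE")
...
>>> with open("bom-demo.csv", 'rb') as my_table:
...     print(my_table.read()[:3])
...
b'\xef\xbb\xbf'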
tests/test_hra.py (Outdated)
bom_char = "\uFEFF" # byte-order marker in 16-bit hex value
with open(criteria_table_path) as criteria_table:
    assert criteria_table.read().startswith(bom_char)
This bit is still failing on Windows and I think it's because of the following from Python's codec docs (https://docs.python.org/3/library/codecs.html#encodings-and-unicode):
On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
So maybe it's being stripped out when opening and decoding? Although that contradicts what I'm seeing in my terminal:
>>> with open("test-bom.csv", 'w', encoding='utf-8-sig') as my_table:
... my_table.write("HABITAT NAME,eelgrass,,,hardbottom,,,CRITERIA TYPE")
>>>
>>> with open("test-bom.csv") as table:
... print(table.read())
...
HABITAT NAME,eelgrass,,,hardbottom,,,CRITERIA TYPE
Which makes sense because:
To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it’s rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to
LATIN SMALL LETTER I WITH DIAERESIS
RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
INVERTED QUESTION MARK
So, I'm not sure why the test assertion error looks like the BOM is being decoded and stripped, whereas when I open and read it I get the funky character mapping...
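One factor that could explain the difference (a guess; the snippet below is just a throwaway sketch reusing the test-bom.csv written above): open() without an explicit encoding falls back to locale.getpreferredencoding(), which is typically cp1252 on Windows and UTF-8 on macOS/Linux, so the same leading bytes decode differently depending on the platform:

>>> open("test-bom.csv", encoding='utf-8').read()[:1]
'\ufeff'
>>> open("test-bom.csv", encoding='utf-8-sig').read()[:1]
'H'
>>> open("test-bom.csv", encoding='cp1252').read()[:3]
'ï»¿'

With plain utf-8 the BOM survives as '\ufeff', utf-8-sig strips it, and cp1252 turns the three bytes into the ï»¿ characters named above.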
Thanks @dcdenu4! It looks like the missing part was to open it up in binary mode ... that's going to bypass any interpretation or OS-specific assumptions of whether to strip those leading bytes.
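Concretely, the read-back check from the outdated snippet above then turns into a raw-bytes comparison, roughly like this (a sketch of the approach, not the exact committed test code):

    with open(criteria_table_path, 'rb') as criteria_table:
        # Binary mode returns the raw bytes, so no codec gets a chance
        # to strip the leading BOM before we check for it.
        assert criteria_table.read().startswith(b'\xef\xbb\xbf')  # UTF-8 BOM bytes

(codecs.BOM_UTF8 is the same three bytes, if a named constant reads better.)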
Thanks James. There are some Workbench-related dependencies failing, but they're unrelated.
This PR corrects (and tests for) loading the HRA criteria table using the UTF-8-SIG encoding.
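In sketch form, the fix amounts to decoding the criteria CSV with the utf-8-sig codec, which drops a leading BOM when present and reads plain UTF-8 files unchanged (the pandas call below is illustrative, not the exact hra.py code):

    import pandas

    criteria_table_path = 'criteria.csv'  # e.g. an Excel-exported CSV with a BOM

    # 'utf-8-sig' behaves like UTF-8 but silently skips a leading BOM,
    # so BOM'd and BOM-less criteria tables parse identically.
    criteria_df = pandas.read_csv(
        criteria_table_path, header=None, encoding='utf-8-sig')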
Fixes #1460
Checklist
- [ ] Updated the user's guide (if needed)
- [ ] Tested the Workbench UI (if relevant)