Fix time zone handling in index values and metadata #1866
base: master
Conversation
- Add another check for the tests
- Address PR comments and add more test cases
- Fix edge case on Windows
- Fix broken test on Windows
- Recreate the timestamp with timezone
- Coerce all timezones to datetime.timezone
- Add metadata to compat tests
- Add meta to compat tests
- Add a check for the buff len from denormalizing timestamps
- Add a note in the docs about using the time zone info to create other time zone types
- Add comment about the old PD_TIMESTAMP normalization
- Handle case when we have a valid tz from a past version
- Fix tz in meta in backwards compat test
- Fix install of old protobuf version
- Remove outdated assert
- Add 0 offset case
- Add write of metadata
- Fix interop test
- Add metadata to append
- Rebase
- Fix test with string only
- Fix comparison of time zone information in test_normalization.py
- Normalize time zone information in _normalization.py
- Normalize time zone information in _normalization.py and fix crash during normalization of time zones
- Fix some details
- Don't create libs with v2 encoding
- Update test cases
a8d3bd7 to b117618
- Modify normalize_tz function to handle DatetimeIndex objects properly
```diff
@@ -317,16 +373,16 @@ def _normalize_single_index(index, index_names, index_norm, dynamic_strings=None
             index_norm.fake_field_pos.append(0)
             log.debug("Index has no name, defaulting to 'index'")
         if isinstance(index, DatetimeIndex) and index.tz is not None:
-            index_tz = get_timezone(index.tz)
+            index_tz = index
```
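To illustrate the normalization step being discussed, here is a hedged sketch (the helper name and exact mechanics are assumptions for illustration, not the PR's actual code) of decomposing a tz-aware `DatetimeIndex` into UTC epoch nanoseconds plus the zone name:

```python
# Hypothetical sketch: split a tz-aware DatetimeIndex into UTC epoch
# nanoseconds plus the zone name, so both can be persisted separately.
# Not ArcticDB's actual implementation.
import pandas as pd

def normalize_tz_index(index):
    """Return (utc_nanos, tz_name) for a tz-aware DatetimeIndex."""
    assert isinstance(index, pd.DatetimeIndex) and index.tz is not None
    tz_name = str(index.tz)                    # e.g. "Europe/Amsterdam"
    utc_nanos = index.tz_convert("UTC").asi8   # int64 epoch nanoseconds
    return utc_nanos, tz_name

idx = pd.date_range("2024-03-31", periods=3, tz="Europe/Amsterdam")
nanos, name = normalize_tz_index(idx)
print(name)  # Europe/Amsterdam
```

Storing the zone as its name (rather than a serialized tz object) is what makes the representation independent of which tz library produced the index.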
I think the naming has gone a bit weird here; `index_tz` being a `pd.Index` or `np.ndarray` is odd.
```python
@@ -18,8 +18,10 @@
@pytest.fixture(params=[pytest.param("REAL_S3", marks=REAL_S3_TESTS_MARK)])
def shared_persistent_arctic_client(real_s3_storage_without_clean_up, encoding_version):
    return real_s3_storage_without_clean_up.create_arctic(encoding_version=encoding_version)
# DONT MERGE: there is a problem with interop tests and encoding_version 2
```
TODO
ArcticDB takes a stricter approach to time zone handling than Pandas. While Pandas supports multiple types for storing time zone information, ArcticDB coerces all time zone information to `pytz.tzfile`, as this is also the default representation in Pandas. This ensures that all time zone information is stored consistently and preserved when data is written to and read from ArcticDB.
When writing data with time zone information to ArcticDB, we preserve the time zone name as a string (e.g. "Europe/Amsterdam"). When reading data with time zone information from ArcticDB, this string is used to create a `pytz.tzfile` object.
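A minimal sketch of the read path just described, assuming only that the zone-name string was persisted on write (variable names here are illustrative, not ArcticDB internals):

```python
# Illustrative read-path sketch: rebuild a pytz-based index from stored
# UTC nanoseconds and the persisted zone name. Names are hypothetical.
import pandas as pd
import pytz

stored_tz_name = "Europe/Amsterdam"    # persisted alongside the data on write
stored_nanos = [1711843200000000000]   # example UTC epoch nanoseconds

idx = pd.to_datetime(stored_nanos, utc=True).tz_convert(pytz.timezone(stored_tz_name))
print(idx.tz)  # Europe/Amsterdam
```

The reconstructed zone is whatever the reader's `pytz` zone database maps that name to, which is the round-trip concern raised in the comment below.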
What happens if the writer uses a timezone file that is not in the reader's zone database? How does this work across operating systems? I assume the database files are quite different across Linux and Windows, there are quite complex mappings between them. I think this makes it fundamentally impossible to round-trip timezones correctly - at least without relying on some standard database and validating that what is written is in there 🤔
Ah I see - this is all handled by `pytz`. It would be interesting to know how this stuff behaves for zones that are not in the Olson database. It would also be interesting to know how this behaves if the reader has an older `pytz` and the timezone is not in their database.
!!! note
    This means that, regardless of the time zone types written to ArcticDB, all time zone types are normalised to `pytz.tzfile`.
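To make the normalisation concrete, here is a hedged sketch (the `coerce_tz` helper is hypothetical, not ArcticDB's API) of mapping stdlib fixed-offset zones and named zones to `pytz` equivalents, including the zero-offset case mentioned in the commit list:

```python
# Hypothetical helper illustrating normalisation of different tz types
# toward pytz: fixed offsets map to pytz.FixedOffset, zero offset to UTC,
# named zones (e.g. zoneinfo.ZoneInfo) are looked up by name.
from datetime import timezone, timedelta
import pytz

def coerce_tz(tz):
    if isinstance(tz, timezone):  # stdlib fixed-offset zone
        minutes = int(tz.utcoffset(None).total_seconds() // 60)
        return pytz.FixedOffset(minutes) if minutes else pytz.UTC
    return pytz.timezone(str(tz))  # named zone, looked up in the Olson database

print(coerce_tz(timezone(timedelta(hours=2))))
```

Note the special-casing of a zero offset: `pytz.timezone` would reject a string like "UTC+02:00", so fixed offsets cannot go through the name-based path.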
I think it's important that we round-trip correctly - this could be quite a disruptive change otherwise. At least requiring a new major version - unless I'm misunderstanding? I notice the PR description says that this keeps the interface the same.
I can't repro the crash in issue #1406 anymore
### How does ArcticDB handle `time zone` information?
ArcticDB takes a stricter approach to time zone handling than Pandas. While Pandas supports multiple types for storing time zone information, ArcticDB coerces all time zone information to `pytz.tzfile`, as this is also the default representation in Pandas. This ensures that all time zone information is stored consistently and preserved when data is written to and read from ArcticDB.
Is `pytz` actually the default in Pandas? https://github.com/pandas-dev/pandas/blob/main/pyproject.toml#L33
It seems to have changed a lot in Pandas 2: https://github.com/pandas-dev/pandas/blob/2a10e04a099d5f1633abcdfbb2dd9fdf09142f8d/doc/source/whatsnew/v2.0.0.rst#L555
Reference Issues/PRs
Fixes #1406 and #1827
What does this implement or fix?
This is done by extending the current approach to decompose the timestamp into:
This approach is beneficial as it:
The main problem is that we need to do some post processing to coerce the time zone info to be consistent (done here)
Any other comments?
An alternative approach was considered in a past PR here, but that had a couple of limitations:
datetime