Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix time zone handling in index values and metadata #1866

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

G-D-Petrov
Copy link
Collaborator

@G-D-Petrov G-D-Petrov commented Oct 1, 2024

Reference Issues/PRs

Fixes #1406 and #1827

What does this implement or fix?

  • Unify the way time zones are handled for indexes and metadata, as much as possible
  • Extend the support for time zone in the indexes/metadata to cover all cases supported by pandas
  • Update the documentation to be more explicit about how/where time zones are supported

This is done by extending the current approach to decompose the timestamp into:

  • its value - int
  • its time zone info - str

This approach is beneficial as it:

  • keeps the interface the same
  • yields the most consistent results

The main problem is that we need to do some post processing to coerce the time zone info to be consistent (done here)

Any other comments?

An alternative approach was considered in a past PR here, but that had a couple of limitation:

  • required an interface break
  • was inconsistent because:
    • getting the offset value in some cases was incorrect inputs of type datetime
    • still needed to do some post processing to make sure that the time zone info is in a consistent format

Add another chak for the tests

Address PR comments and add more test cases

Fix edge case on Windows

Fix broken test on windows

Recreate the timestamp with timezone

Coerce all timezone to datetime.timezone

Add metadata to compat tests

Add meta to compat tests

Add a check for the buff len from denormalizing timestamps

Add a note in the docs about using the time zone info to create other time zone types

Add comment about the old PD_TIMESTAMP normalization

Handle case when we have a valid tz from a past version

Fix tz in meta in backwards compat test

Fix install of old protobuf version

Remove outdated assert

Add 0 offset case

Add write of meta data

Fix interop test

Add metadata to append

Rebase

Fix

Test with string only

Fix comparison of time zone information in test_normalization.py

Normalize time zone information in _normalization.py

Normalize time zone information in _normalization.py and fix crash during normalization of time zones

Fix some details

Don;t create libs with v2 encoding

Update test cases
@G-D-Petrov G-D-Petrov marked this pull request as ready for review October 2, 2024 07:05
@@ -317,16 +373,16 @@ def _normalize_single_index(index, index_names, index_norm, dynamic_strings=None
index_norm.fake_field_pos.append(0)
log.debug("Index has no name, defaulting to 'index'")
if isinstance(index, DatetimeIndex) and index.tz is not None:
index_tz = get_timezone(index.tz)
index_tz = index
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the naming has gone a bit weird here, index_tz being pd.Index or np.ndarray is odd

@@ -18,8 +18,10 @@


@pytest.fixture(params=[pytest.param("REAL_S3", marks=REAL_S3_TESTS_MARK)])
def shared_persistent_arctic_client(real_s3_storage_without_clean_up, encoding_version):
return real_s3_storage_without_clean_up.create_arctic(encoding_version=encoding_version)
# DONT MERGE: there is a problem with interop tests and encoding_version 2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO


ArcticDB takes a more strict approach to time zone handling than Pandas. While Pandas support multiple types for storing time zone information, ArcticDB coerces all time zone information to `pytz.tzfile` as this is the default representation for Pandas as well. This way we can ensure that all time zone information is stored in a consistent manner, and that all time zone information is preserved when data is written to and read from ArcticDB.

When writing data with time zone information to ArcticDB, we preserve the time name as string (e.g. "Europe/Amsterdam"). When reading data with time zone information from ArcticDB, this data is used to create a `pytz.tzfile` object.
Copy link
Collaborator

@poodlewars poodlewars Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the writer uses a timezone file that is not in the reader's zone database? How does this work across operating systems? I assume the database files are quite different across Linux and Windows, there are quite complex mappings between them. I think this makes it fundamentally impossible to round-trip timezones correctly - at least without relying on some standard database and validating that what is written is in there 🤔

Ah I see - this is all handled by pytz. It would be interesting to know how this stuff behaves for zones that are not in the Olson database. It would also be interesting to know how this behaves if the reader has an older pytz and the timezone is not in their database.


!!! note

This means that regardles of the timezone types being written in ArcticDB, all time zone types are normalised to `pytz.tzfile`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's important that we round-trip correctly - this could be quite a disruptive change otherwise. At least requiring a new major version - unless I'm misunderstanding? I notice the PR description says that this keeps the interface the same.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't repro the crash in issue #1406 anymore


### How does ArcticDB handle `time zone` information?

ArcticDB takes a more strict approach to time zone handling than Pandas. While Pandas support multiple types for storing time zone information, ArcticDB coerces all time zone information to `pytz.tzfile` as this is the default representation for Pandas as well. This way we can ensure that all time zone information is stored in a consistent manner, and that all time zone information is preserved when data is written to and read from ArcticDB.
Copy link
Collaborator

@poodlewars poodlewars Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ArcticDB crashes when when trying to normalize timezone data with MsgPack
3 participants