Fix time zone handling in index values and metadata #1866
base: master
Conversation
- Add another check for the tests
- Address PR comments and add more test cases
- Fix edge case on Windows
- Fix broken test on Windows
- Recreate the timestamp with timezone
- Coerce all timezones to datetime.timezone
- Add metadata to compat tests
- Add meta to compat tests
- Add a check for the buff len from denormalizing timestamps
- Add a note in the docs about using the time zone info to create other time zone types
- Add comment about the old PD_TIMESTAMP normalization
- Handle case when we have a valid tz from a past version
- Fix tz in meta in backwards compat test
- Fix install of old protobuf version
- Remove outdated assert
- Add 0 offset case
- Add write of metadata
- Fix interop test
- Add metadata to append
- Rebase
- Fix test with string only
- Fix comparison of time zone information in test_normalization.py
- Normalize time zone information in _normalization.py
- Normalize time zone information in _normalization.py and fix crash during normalization of time zones
- Fix some details
- Don't create libs with v2 encoding
- Update test cases
a8d3bd7 to b117618
- Modify normalize_tz function to handle DatetimeIndex objects properly
```diff
@@ -317,16 +373,16 @@ def _normalize_single_index(index, index_names, index_norm, dynamic_strings=None
             index_norm.fake_field_pos.append(0)
             log.debug("Index has no name, defaulting to 'index'")
         if isinstance(index, DatetimeIndex) and index.tz is not None:
-            index_tz = get_timezone(index.tz)
+            index_tz = index
```
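To illustrate the normalization step being discussed, here is a hedged sketch (the helper name and exact mechanics are assumptions for illustration, not the PR's actual code) of decomposing a tz-aware `DatetimeIndex` into UTC epoch nanoseconds plus the zone name:

```python
# Hypothetical sketch: split a tz-aware DatetimeIndex into UTC epoch
# nanoseconds plus the zone name, so both can be persisted separately.
# Not ArcticDB's actual implementation.
import pandas as pd

def normalize_tz_index(index):
    """Return (utc_nanos, tz_name) for a tz-aware DatetimeIndex."""
    assert isinstance(index, pd.DatetimeIndex) and index.tz is not None
    tz_name = str(index.tz)                    # e.g. "Europe/Amsterdam"
    utc_nanos = index.tz_convert("UTC").asi8   # int64 epoch nanoseconds
    return utc_nanos, tz_name

idx = pd.date_range("2024-03-31", periods=3, tz="Europe/Amsterdam")
nanos, name = normalize_tz_index(idx)
print(name)  # Europe/Amsterdam
```

Storing the zone as its name (rather than a serialized tz object) is what makes the representation independent of which tz library produced the index.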
I think the naming has gone a bit weird here; `index_tz` being a `pd.Index` or `np.ndarray` is odd.
```python
@@ -18,8 +18,10 @@
@pytest.fixture(params=[pytest.param("REAL_S3", marks=REAL_S3_TESTS_MARK)])
def shared_persistent_arctic_client(real_s3_storage_without_clean_up, encoding_version):
    return real_s3_storage_without_clean_up.create_arctic(encoding_version=encoding_version)
# DONT MERGE: there is a problem with interop tests and encoding_version 2
```
TODO
ArcticDB takes a stricter approach to time zone handling than Pandas. While Pandas supports multiple types for storing time zone information, ArcticDB coerces all time zone information to `pytz.tzfile`, as this is also the default representation in Pandas. This ensures that all time zone information is stored consistently and preserved when data is written to and read from ArcticDB.
When writing data with time zone information to ArcticDB, we preserve the time zone name as a string (e.g. "Europe/Amsterdam"). When reading data with time zone information from ArcticDB, this string is used to create a `pytz.tzfile` object.
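A minimal sketch of the read path just described, assuming only that the zone-name string was persisted on write (variable names here are illustrative, not ArcticDB internals):

```python
# Illustrative read-path sketch: rebuild a pytz-based index from stored
# UTC nanoseconds and the persisted zone name. Names are hypothetical.
import pandas as pd
import pytz

stored_tz_name = "Europe/Amsterdam"    # persisted alongside the data on write
stored_nanos = [1711843200000000000]   # example UTC epoch nanoseconds

idx = pd.to_datetime(stored_nanos, utc=True).tz_convert(pytz.timezone(stored_tz_name))
print(idx.tz)  # Europe/Amsterdam
```

The reconstructed zone is whatever the reader's `pytz` zone database maps that name to, which is the round-trip concern raised in the comment below.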
What happens if the writer uses a timezone file that is not in the reader's zone database? How does this work across operating systems? I assume the database files are quite different across Linux and Windows, there are quite complex mappings between them. I think this makes it fundamentally impossible to round-trip timezones correctly - at least without relying on some standard database and validating that what is written is in there 🤔
Ah I see - this is all handled by `pytz`. It would be interesting to know how this stuff behaves for zones that are not in the Olson database. It would also be interesting to know how this behaves if the reader has an older `pytz` and the timezone is not in their database.
!!! note
    This means that, regardless of the time zone types written to ArcticDB, all time zone types are normalised to `pytz.tzfile`.
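To make the normalisation concrete, here is a hedged sketch (the `coerce_tz` helper is hypothetical, not ArcticDB's API) of mapping stdlib fixed-offset zones and named zones to `pytz` equivalents, including the zero-offset case mentioned in the commit list:

```python
# Hypothetical helper illustrating normalisation of different tz types
# toward pytz: fixed offsets map to pytz.FixedOffset, zero offset to UTC,
# named zones (e.g. zoneinfo.ZoneInfo) are looked up by name.
from datetime import timezone, timedelta
import pytz

def coerce_tz(tz):
    if isinstance(tz, timezone):  # stdlib fixed-offset zone
        minutes = int(tz.utcoffset(None).total_seconds() // 60)
        return pytz.FixedOffset(minutes) if minutes else pytz.UTC
    return pytz.timezone(str(tz))  # named zone, looked up in the Olson database

print(coerce_tz(timezone(timedelta(hours=2))))
```

Note the special-casing of a zero offset: `pytz.timezone` would reject a string like "UTC+02:00", so fixed offsets cannot go through the name-based path.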
I think it's important that we round-trip correctly - this could be quite a disruptive change otherwise. At least requiring a new major version - unless I'm misunderstanding? I notice the PR description says that this keeps the interface the same.
I can't repro the crash in issue #1406 anymore
### How does ArcticDB handle `time zone` information?
ArcticDB takes a stricter approach to time zone handling than Pandas. While Pandas supports multiple types for storing time zone information, ArcticDB coerces all time zone information to `pytz.tzfile`, as this is also the default representation in Pandas. This ensures that all time zone information is stored consistently and preserved when data is written to and read from ArcticDB.
Is `pytz` actually the default in Pandas? https://github.com/pandas-dev/pandas/blob/main/pyproject.toml#L33
It seems to have changed a lot in Pandas 2: https://github.com/pandas-dev/pandas/blob/2a10e04a099d5f1633abcdfbb2dd9fdf09142f8d/doc/source/whatsnew/v2.0.0.rst#L555
Reference Issues/PRs
Fixes #1406 and #1827
What does this implement or fix?
This is done by extending the current approach to decompose the timestamp into:
This approach is beneficial as it:
The main problem is that we need to do some post processing to coerce the time zone info to be consistent (done here)
Any other comments?
An alternative approach was considered in a past PR here, but that had a couple of limitations:
datetime