Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

validate column: PostgreSQL exception with --wildcard-include-timestamp and old timestamps #1390

Closed
nj1973 opened this issue Jan 8, 2025 · 4 comments · Fixed by #1397
Closed
Assignees
Labels
type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@nj1973
Copy link
Contributor

nj1973 commented Jan 8, 2025

PostgreSQL test table (note the old date used):

CREATE TABLE dvt_test.tab_old_ts
( id         int primary key
, col_old_ts timestamp(0));
INSERT INTO dvt_test.tab_old_ts VALUES (1,'1900-01-01'::date);

Test command:

data-validation -v validate column -sc pg -tc pg \
 --tables-list=dvt_test.tab_old_ts --sum="col_old_ts" --wildcard-include-timestamp
...
sqlalchemy.exc.DataError: (psycopg2.errors.NumericValueOutOfRange) integer out of range

[SQL: SELECT count(*) AS count, sum(t0.epoch_seconds__col_old_ts) AS sum__epoch_seconds__col_old_ts
FROM (SELECT t1.id AS id, t1.col_old_ts AS col_old_ts, CAST(EXTRACT(epoch FROM t1.col_old_ts) AS INTEGER) AS epoch_seconds__col_old_ts
FROM dvt_test.tab_old_ts AS t1) AS t0]
(Background on this error at: https://sqlalche.me/e/14/9h9h)

PostgreSQL test:

postgres=# select CAST(EXTRACT(epoch FROM date'1901-01-01') AS integer);
ERROR:  integer out of range
postgres=# select CAST(EXTRACT(epoch FROM date'1905-01-01') AS integer);
   extract
-------------
 -2051222400

The number of seconds between 1901 and the Unix epoch overflows a 32 bit integer.

The same would be true of dates far beyond the Unix epoch, for example 9999-12-31, which is sometimes used in applications as a placeholder for an open ended date range.

@nj1973 nj1973 added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Jan 8, 2025
@nj1973
Copy link
Contributor Author

nj1973 commented Jan 8, 2025

My first thought was that we could change the expression used from:

CAST(EXTRACT(epoch FROM col) AS integer)

to:

CAST(EXTRACT(epoch FROM col) AS bigint)

That allows the SQL statement to finish successfully. But unfortunately it has the effect of changing the result from a Python int into a Decimal. The Decimal is in turn converted to float as part of a Pandas conversion and we have potential for wrong results.

@nj1973
Copy link
Contributor Author

nj1973 commented Jan 13, 2025

I think we ultimately have to override this expression with the BIGINT version. The natural solution to issue 1391 is to cast the output of any SUM aggregations to string which would then deal with the side effect of this proposed solution. It's a shame that Python Decimal is being coerced to float by Ibis/Pandas.

@nj1973 nj1973 self-assigned this Jan 14, 2025
@nj1973
Copy link
Contributor Author

nj1973 commented Jan 14, 2025

I've started adding tests across supported engines and we have a problem with Teradata and SQL Server too.
Teradata:

[Version 20.0.0.21] [Session 19561] [Teradata Database] [Error 2617] Overflow occurred computing an expression involving dvt_tricky_dates.col_dt_low

SQL Server:

E       sqlalchemy.exc.DataError: (pyodbc.DataError) ('22003', '[22003] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The datediff function resulted in an overflow. The number of dateparts separating two date/time instances is too large. Try to use datediff with a less precise datepart. (535) (SQLExecDirectW)')

@nj1973
Copy link
Contributor Author

nj1973 commented Jan 14, 2025

SQL Server research:

1> select datediff(s,'1970-01-01 00:00:00',col_ts_low) from pso_data_validator.dvt_tricky_dates;
2> go
Msg 535, Level 16, State 1, Server 969116d95f4d397, Line 1
The datediff function resulted in an overflow. The number of dateparts separating two date/time instances is too large. Try to use datediff with a less precise datepart.

datediff_big() FTW:

1> select datediff_big(s,'1970-01-01 00:00:00',col_ts_low) from pso_data_validator.dvt_tricky_dates;
2> go

--------------------
        -30610224000

@nj1973 nj1973 linked a pull request Jan 14, 2025 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
1 participant