Allow partial date parsing when simple datetime formatter is used #11386

Closed
51 changes: 35 additions & 16 deletions velox/docs/functions/spark/datetime.rst
@@ -82,12 +82,9 @@ These functions support TIMESTAMP and DATE input types.

Adjusts ``unixTime`` (elapsed seconds since UNIX epoch) to configured session timezone, then
converts it to a formatted time string according to ``format``. Only supports BIGINT type for
``unixTime``. Using `Simple <https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html>`_
date formatter in lenient mode that is align with Spark legacy date parser behavior or
`Joda <https://www.joda.org/joda-time/>`_ date formatter depends on ``spark.legacy_date_formatter`` configuration.
``unixTime``.
`Valid patterns for date format
<https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. When `Simple` date formatter is used,
null is returned for invalid ``format``; otherwise, exception is thrown. This function will convert input to
<https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. This function will convert input to
milliseconds, and integer overflow is allowed in the conversion, which aligns with Spark. In the third
example below, which uses INT64_MAX, INT64_MAX * 1000 overflows and produces -1000 milliseconds. ::

@@ -112,17 +109,11 @@ These functions support TIMESTAMP and DATE input types.
Returns timestamp by parsing ``string`` according to the specified ``dateFormat``.
The format follows Spark's
`Datetime patterns
<https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_.
Using `Simple <https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html>`_
date formatter in lenient mode that is align with Spark legacy date parser behavior or
`Joda <https://www.joda.org/joda-time/>`_ date formatter depends on ``spark.legacy_date_formatter`` configuration.
Returns NULL for parsing error or NULL input. When `Simple` date formatter is used, null is returned for invalid
``dateFormat``; otherwise, exception is thrown. ::
<https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. ::

SELECT get_timestamp('1970-01-01', 'yyyy-MM-dd'); -- timestamp `1970-01-01`
SELECT get_timestamp('1970-01-01', 'yyyy-MM'); -- NULL (parsing error)
SELECT get_timestamp('1970-01-01', null); -- NULL
SELECT get_timestamp('2020-06-10', 'A'); -- (throws exception)

.. spark:function:: hour(timestamp) -> integer

@@ -291,10 +282,7 @@ These functions support TIMESTAMP and DATE input types.

.. spark:function:: unix_timestamp() -> integer

Returns the current UNIX timestamp in seconds. Using
`Simple <https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html>`_ date formatter in lenient mode
that is align with Spark legacy date parser behavior or `Joda <https://www.joda.org/joda-time/>`_ date formatter
depends on the ``spark.legacy_date_formatter`` configuration.
Returns the current UNIX timestamp in seconds.

.. spark:function:: unix_timestamp(string) -> integer
:noindex:
@@ -337,3 +325,34 @@ These functions support TIMESTAMP and DATE input types.
part of the 53rd week of year 2004, so the result is 2004. Only supports DATE type.

SELECT year_of_week('2005-01-02'); -- 2004

Simple vs. Joda Date Formatter
------------------------------

To align with Spark, Velox supports both `Simple <https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html>`_
and `Joda <https://www.joda.org/joda-time/>`_ date formatters to parse and format the timestamp and date strings
used in functions :spark:func:`from_unixtime`, :spark:func:`unix_timestamp`, :spark:func:`make_date`
and :spark:func:`to_unix_timestamp`.
If the configuration setting :doc:`spark.legacy_date_formatter <../../configs>` is true, the
`Simple` date formatter in lenient mode is used; otherwise, `Joda` is used. Note that the two
formatters behave differently in some cases.

For :spark:func:`unix_timestamp` and :spark:func:`get_timestamp`, the `Simple` date formatter permits partial date parsing,
which means that the format may match only the beginning of the input string. For example, the input string
``2015-07-22 10:00:00`` can be parsed using the format ``yyyy-MM-dd`` because the parser does not require the entire
input to be consumed. In contrast, the `Joda` date formatter performs strict checks to ensure that the
format completely matches the string. If there is any mismatch, an exception is thrown. ::

SELECT get_timestamp('2015-07-22 10:00:00', 'yyyy-MM-dd'); -- timestamp `2015-07-22` (for Simple date formatter)
SELECT get_timestamp('2015-07-22 10:00:00', 'yyyy-MM-dd'); -- (throws exception) (for Joda date formatter)
SELECT unix_timestamp('2016-04-08 00:00:00', 'yyyy-MM-dd'); -- 1460041200 (for Simple date formatter)
SELECT unix_timestamp('2016-04-08 00:00:00', 'yyyy-MM-dd'); -- (throws exception) (for Joda date formatter)

For :spark:func:`from_unixtime` and :spark:func:`get_timestamp`, when the `Simple` date formatter is used, NULL is
returned for an invalid format; otherwise, an exception is thrown. ::

SELECT from_unixtime(100, '!@#$%^&*'); -- NULL (parsing error) (for Simple date formatter)
SELECT from_unixtime(100, '!@#$%^&*'); -- (throws exception) (for Joda date formatter)
SELECT get_timestamp('1970-01-01', '!@#$%^&*'); -- NULL (parsing error) (for Simple date formatter)
SELECT get_timestamp('1970-01-01', '!@#$%^&*'); -- (throws exception) (for Joda date formatter)
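
The invalid-format rule documented above (NULL under `Simple`, an exception otherwise) can be sketched roughly as follows. This is an illustration only: `FormatterKind`, `isSupportedPattern`, and `validatedPattern` are hypothetical names, and the set of accepted pattern characters is a deliberately tiny assumption rather than the real pattern grammar.

```cpp
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical stand-in for the two formatter families named in the docs.
enum class FormatterKind { Simple, Joda };

// Toy validity check (assumption): accept only a handful of pattern letters
// and separators. The real formatters support far more.
bool isSupportedPattern(std::string_view format) {
  constexpr std::string_view kAllowed = "yMdHms-:. ";
  if (format.empty()) {
    return false;
  }
  for (char c : format) {
    if (kAllowed.find(c) == std::string_view::npos) {
      return false;
    }
  }
  return true;
}

// Returns the pattern when it is valid. For an invalid pattern, returns
// std::nullopt (standing in for SQL NULL) under Simple and throws under Joda.
std::optional<std::string> validatedPattern(
    std::string_view format,
    FormatterKind kind) {
  if (!isSupportedPattern(format)) {
    if (kind == FormatterKind::Simple) {
      return std::nullopt;
    }
    throw std::invalid_argument("Invalid format: " + std::string(format));
  }
  return std::string(format);
}
```

Under these assumptions, `validatedPattern("!@#$%^&*", FormatterKind::Simple)` yields an empty optional, while the same call with `FormatterKind::Joda` throws `std::invalid_argument`.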

5 changes: 3 additions & 2 deletions velox/functions/lib/DateTimeFormatter.cpp
@@ -1588,8 +1588,9 @@ Expected<DateTimeResult> DateTimeFormatter::parse(
}
}

// Ensure all input was consumed.
if (cur < end) {
// Ensure all input was consumed unless type_ is a simple datetime formatter.
if (type_ != DateTimeFormatterType::LENIENT_SIMPLE &&
type_ != DateTimeFormatterType::STRICT_SIMPLE && cur < end) {
return parseFail(input, cur, end);
}
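
To see the effect of the relaxed check outside Velox, here is a self-contained toy parser for the fixed pattern `yyyy MM`. It is a sketch under stated assumptions, not the actual `DateTimeFormatter`: `FormatterType`, `YearMonth`, `readNumber`, and `parseYearMonth` are hypothetical, the `JODA` enumerator is assumed (only the two `SIMPLE` names appear in the diff), and failure is reported as an empty optional rather than through `parseFail`.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string_view>

// Stand-in for the formatter kinds. Only LENIENT_SIMPLE and STRICT_SIMPLE
// appear in the diff; JODA is an assumed name for a non-simple formatter.
enum class FormatterType { JODA, STRICT_SIMPLE, LENIENT_SIMPLE };

struct YearMonth {
  int year;
  int month;
};

// Reads up to maxDigits decimal digits starting at cur and advances cur.
// Returns std::nullopt if no digit is found.
std::optional<int>
readNumber(std::string_view input, size_t& cur, size_t maxDigits) {
  int value = 0;
  size_t count = 0;
  while (cur < input.size() && count < maxDigits &&
         std::isdigit(static_cast<unsigned char>(input[cur]))) {
    value = value * 10 + (input[cur] - '0');
    ++cur;
    ++count;
  }
  if (count == 0) {
    return std::nullopt;
  }
  return value;
}

// Toy parser for the fixed pattern "yyyy MM".
std::optional<YearMonth> parseYearMonth(
    std::string_view input,
    FormatterType type) {
  size_t cur = 0;
  const auto year = readNumber(input, cur, 4);
  if (!year || cur >= input.size() || input[cur] != ' ') {
    return std::nullopt;
  }
  ++cur; // Skip the separator.
  const auto month = readNumber(input, cur, 2);
  if (!month || *month < 1 || *month > 12) {
    return std::nullopt;
  }
  // Mirrors the check in the diff: leftover input is an error only for
  // formatters that are not simple datetime formatters.
  const bool isSimple = type == FormatterType::LENIENT_SIMPLE ||
      type == FormatterType::STRICT_SIMPLE;
  if (!isSimple && cur < input.size()) {
    return std::nullopt;
  }
  return YearMonth{*year, *month};
}
```

With the input `2024 08 01 5`, both simple types return year 2024 and month 8 and ignore the trailing ` 01 5`, which matches the intent of the `parseUsingPartialInput` test in this PR; the assumed `JODA` type rejects the same input because part of it is left unconsumed.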

9 changes: 9 additions & 0 deletions velox/functions/lib/tests/DateTimeFormatterTest.cpp
@@ -2441,4 +2441,13 @@ TEST_F(SimpleDateTimeFormatterTest, formatWeekOfMonth) {
}
}

TEST_F(SimpleDateTimeFormatterTest, parseUsingPartialInput) {
EXPECT_EQ(
fromTimestampString("2024-08-01"),
parseSimple("2024 08 01 5", "yyyy MM", true).timestamp);
EXPECT_EQ(
fromTimestampString("2024-08-01"),
parseSimple("2024 08 01 5", "yyyy MM", false).timestamp);
}

} // namespace facebook::velox::functions