Optimize cast(JSON as ROW) #9449

Closed

mbasmanova
Contributor

Summary:
boost::algorithm::to_lower shows up very high in the profile. In most cases field names are all ASCII, so we can use the more efficient folly::toLowerAscii.

With this optimization a query that's heavy on this cast runs 3x faster.

Differential Revision: D56014538

facebook-github-bot added the CLA Signed and fb-exported labels Apr 11, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D56014538

mbasmanova added a commit to mbasmanova/velox-1 that referenced this pull request Apr 11, 2024
@mbasmanova mbasmanova requested review from Yuhta and xiaoxmeng April 11, 2024 11:54
// boost::algorithm::to_lower is very slow. Use much faster
// folly::toLowerAscii if possible.
if (allFieldsAreAscii) {
  folly::toLowerAscii(lowerCaseKey);
Contributor

Is toLowerAscii safe to call on non-ASCII input? If not, we may need to check that lowerCaseKey is ASCII as well.

Contributor Author

I think it is: "Leaves all other characters unchanged, including those with the 0x80 bit set."

/**
 * Convert ascii to lowercase, in-place.
 *
 * Leaves all other characters unchanged, including those with the 0x80
 * bit set.
 * @param str String to convert
 * @param length Length of str, in bytes
 */
void toLowerAscii(char* str, size_t length);

inline void toLowerAscii(std::string& str) {
  // str[0] is legal also if the string is empty.
  toLowerAscii(&str[0], str.size());
}
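
To make the quoted contract concrete, here is a minimal stand-in written against the standard library only. It is not folly's implementation: the name `toLowerAsciiSketch` and the `example` function are hypothetical, and the real `folly::toLowerAscii` is likely implemented differently. The sketch only shows the documented behavior: bytes in 'A'..'Z' are lowered, and every other byte, including UTF-8 bytes with the 0x80 bit set, is left unchanged.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for illustration only; not folly's code.
void toLowerAsciiSketch(std::string& str) {
  for (char& c : str) {
    // Only the 26 ASCII uppercase letters are modified.
    if (c >= 'A' && c <= 'Z') {
      c = static_cast<char>(c - 'A' + 'a');
    }
  }
}

// Small usage example (hypothetical name).
void example() {
  // UTF-8 bytes 0xC3 0x89 followed by ASCII "AB". The literal is split
  // so that the hex escape does not absorb the following letters.
  std::string mixed = "\xC3\x89" "AB";
  toLowerAsciiSketch(mixed);
  // The two high-bit bytes are untouched; only 'A' and 'B' change.
  assert(mixed == "\xC3\x89" "ab");
}
```

Under this contract, calling the function on non-ASCII input is safe in the sense that it never corrupts multi-byte sequences; it simply does not lowercase non-ASCII letters.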

@xiaoxmeng xiaoxmeng left a comment

@mbasmanova nice optimization. Thanks!

@@ -38,6 +38,17 @@ namespace facebook::velox {

namespace {

bool isAscii(const std::string& str) {
Contributor

nit: shall we consider putting this into a common utility such as common/base/String.h or StrUtil.h? This seems to be a general-purpose function. Thanks!

Contributor Author

Let me switch to the existing functions::stringCore::isAscii from velox/functions/lib/string/StringCore.h.

bool allFieldsAreAscii = true;
const auto size = rowType.size();
for (auto i = 0; i < size; ++i) {
  allFieldsAreAscii &= isAscii(rowType.nameOf(i));
Contributor

nit: can we break out of the loop on the first non-ascii field name?
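
One way the early exit suggested here could look, as a sketch only. It uses a plain `std::vector<std::string>` in place of the row type's field names, and the helper names `isAsciiSketch`, `allNamesAreAscii`, and `example` are hypothetical; the thread indicates the PR uses `functions::stringCore::isAscii`, whose signature is not shown here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: true if no byte has the 0x80 bit set.
bool isAsciiSketch(const std::string& str) {
  for (char c : str) {
    if (static_cast<unsigned char>(c) & 0x80) {
      return false;
    }
  }
  return true;
}

// Returns as soon as the first non-ASCII name is found, so later
// names are never scanned.
bool allNamesAreAscii(const std::vector<std::string>& names) {
  for (const auto& name : names) {
    if (!isAsciiSketch(name)) {
      return false;
    }
  }
  return true;
}

// Small usage example (hypothetical name).
void example() {
  const std::vector<std::string> names = {"id", "Name", "created_at"};
  assert(allNamesAreAscii(names));
}
```

Since the check runs once per row type rather than once per JSON key, the early exit mostly matters for wide row types with a non-ASCII name near the front.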

Summary:

CAST(JSON as ROW(ARRAY()) used to fail with

```
OUT_OF_ORDER_ITERATION: Objects and arrays can only be iterated when they are first encountered.
```

According to the simdjson documentation (https://github.com/simdjson/simdjson/blob/master/doc/basics.md), object values must not be stored for later processing; they must be consumed or copied before proceeding.

Also fixed the behavior when a JSON object contains duplicate keys: Presto throws, but the previous implementation allowed duplicates.

Also fixed the test so that it actually verifies JSON objects with mixed-case keys.

Reviewed By: xiaoxmeng, Yuhta

Differential Revision: D56013293
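
A duplicate-key check of the kind described in this summary could be sketched as follows. This is an illustration, not the code from the commit: the names `checkNoDuplicateKeys` and `example`, the exception type, and the error message are all hypothetical, and the thread does not say whether keys are compared before or after lowercasing, so the sketch compares them exactly as given.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: throws if the same key appears more than once.
void checkNoDuplicateKeys(const std::vector<std::string>& keys) {
  std::unordered_set<std::string> seen;
  for (const auto& key : keys) {
    // insert() returns a pair whose second member is false when the
    // key was already present.
    if (!seen.insert(key).second) {
      throw std::invalid_argument("Duplicate key: " + key);
    }
  }
}

// Small usage example (hypothetical name).
void example() {
  bool threw = false;
  try {
    checkNoDuplicateKeys({"a", "b", "a"});
  } catch (const std::invalid_argument&) {
    threw = true;
  }
  assert(threw);
}
```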

@facebook-github-bot

This pull request has been merged in efb0213.

Conbench analyzed the 1 benchmark run on commit efb0213a.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

yanngyoung pushed a commit to yanngyoung/velox that referenced this pull request Apr 12, 2024
Summary:
Pull Request resolved: facebookincubator#9449

boost::algorithm::to_lower is very high on the profile. In most cases field names are all-ASCII, hence, we can use more efficient folly::toLowerAscii.

With this optimization a query that's heavy on this cast runs 3x faster.

 {F1484171823}

Reviewed By: xiaoxmeng, Yuhta

Differential Revision: D56014538

fbshipit-source-id: cedfb5f58b59f29ce02344d12fdd79b2fe8fbb21