Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add word_stem Presto function #9363

Closed
wants to merge 1 commit into from

Conversation

yhwang
Copy link
Contributor

@yhwang yhwang commented Apr 4, 2024

Add snowball libstemmer v2.2.0 as one of the dependencies.
And use it to implement the word_stem() as a scalar UDF.

When using the libstemmer API, each language creates an sb_stemmer
instance which consumes 114 bytes, including the default 10 bytes for the output stem.
It uses the realloc to increase the memory block for the output stem if needed.

Fixes #8487

@facebook-github-bot
Copy link
Contributor

Hi @yhwang!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

Copy link

netlify bot commented Apr 4, 2024

Deploy Preview for meta-velox ready!

Name Link
🔨 Latest commit 2c14f09
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/662adbf9c773e70008720409
😎 Deploy Preview https://deploy-preview-9363--meta-velox.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 4, 2024
@yhwang yhwang force-pushed the word_stem-impl branch 2 times, most recently from bedcb88 to 768f5b1 Compare April 4, 2024 22:43
@yhwang
Copy link
Contributor Author

yhwang commented Apr 9, 2024

/cc @mbasmanova @aditi-pandit

/// or create a new one if it doesn't exist. Return NULL if the
/// specified lang is not supported.
static Stemmer* getStemmer(const char* lang) {
thread_local std::map<std::string, Stemmer*> stemmers;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't clean up this thread-local map which contains the Stemmer instances for each language. I assume a working thread is shutting down with the Presto process. Not sure if it's worth adding the cleanup code while tearing down a thread.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it is a good idea to use thread local variable here. CC: @bikramSingh91

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/zvelo/libstemmer

Creating a stemmer is a relatively expensive operation - the expected
usage pattern is that a new stemmer is created when needed, used
to stem many words, and deleted after some time.

Stemmers are re-entrant, but not threadsafe.  In other words, if
you wish to access the same stemmer object from multiple threads,
you must ensure that all access is protected by a mutex or similar
device.

Based on this information, it seems we could add a member variable to 'WordStemFunction' to store Stemmer instances.

Let's use std::unique_ptr to make cleanup automatic.

Similar to regex functions, we may consider only supporting constant values for 'lang' and create a single Stemmer instance in 'initialize'.

Alternatively, we may allow up to of different languages per function instance. See https://facebookincubator.github.io/velox/functions/presto/regexp.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on this information, it seems we could add a member variable to 'WordStemFunction' to store Stemmer instances.

that's the reason I use thread-local to store the lang<-->stemmer map

Let's use std::unique_ptr to make cleanup automatic.

let me change the map to <string, unique_ptr<Stemmer>>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To add more context against using thread_local variables liberally is that we observed that storing thread local variables inside objects that can be moved between threads can cause an unexpected increase in memory usage as whenever they are accessed in another thread, they would be created and would exist for the lifetime of the thread. eg of an issue we recently fixed: #7646

In this case as well, the driver executing expression eval (and this code) can bounce between multiple available threads (we use folly::executor to execute drivers), and each time it would end up creating new instances. Unlike #7646, I do not expect this hold on to quite as much memory but nevertheless it would be a good idea to avoid this kind of usage pattern unless there is no other alternative. Using a unique_ptr as @mbasmanova suggested would help us to have tighter control over its lifetime by tying it to the lifetime of the Expression and therefore more predictable memory usage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bikramSingh91 I need more clarification on this. In the code here, the thread-local variable is not inside any objects nor be moved between threads. Since the possible number of stemmer instances is fixed, depends on how many languages are used in a running system and we only support 20 languages. The cost in each thread should be a fixed amount of memory. Each time a stemmer is used, it resues an allocated memory to store the stem results, and then the result is copied to the StringWrite. I guess it's different from the PR you referred to. I had a commit to make the map to <string, unique_ptr<Stemmer>>, how do you think after the change?

std::string lowerOutput;
stringImpl::lower<isAscii>(lowerOutput, input);
auto rev = stemmer->stem(lowerOutput);
if (rev == NULL) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I could put UNLIKELY here.

@mbasmanova mbasmanova requested review from kgpai and assignUser April 9, 2024 23:22
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kgpai @assignUser Krishna, Jacob, would you help review build changes?

@yhwang let's add the new dependency to CMake/resolve_dependency_modules/README.md

@kgpai Krishna, would you help look into what would it take to add this dependency to Meta's repo?

velox/docs/functions/presto/string.rst Outdated Show resolved Hide resolved
velox/docs/functions/presto/string.rst Outdated Show resolved Hide resolved
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yhwang Thank you for working on this. Some initial questions. Primarily, I'd like to understand how we are allocating memory and what may cause OOM scenario.

CC: @xiaoxmeng

velox/functions/prestosql/StringFunctions.h Outdated Show resolved Hide resolved
velox/functions/prestosql/StringFunctions.h Outdated Show resolved Hide resolved
velox/functions/prestosql/StringFunctions.h Outdated Show resolved Hide resolved
/// or create a new one if it doesn't exist. Return NULL if the
/// specified lang is not supported.
static Stemmer* getStemmer(const char* lang) {
thread_local std::map<std::string, Stemmer*> stemmers;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it is a good idea to use thread local variable here. CC: @bikramSingh91

velox/functions/prestosql/StringFunctions.h Outdated Show resolved Hide resolved
velox/functions/prestosql/StringFunctions.h Outdated Show resolved Hide resolved
velox/functions/prestosql/StringFunctions.h Outdated Show resolved Hide resolved
velox/functions/prestosql/StringFunctions.h Outdated Show resolved Hide resolved
stringImpl::lower<isAscii>(lowerOutput, input);
auto rev = stemmer->stem(lowerOutput);
if (rev == NULL) {
throw std::runtime_error("out of memory");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you explain how memory allocations are happening here? We need to make sure any large allocations go through Velox's Memory Pool. How can we run out of memory here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enclosed its API doc here:

/** Stem a word.
 *
 *  The return value is owned by the stemmer - it must not be freed or
 *  modified, and it will become invalid when the stemmer is called again,
 *  or if the stemmer is freed.
 *
 *  The length of the return value can be obtained using sb_stemmer_length().
 *
 *  If an out-of-memory error occurs, this will return NULL.
 */
const sb_symbol *   sb_stemmer_stem(struct sb_stemmer * stemmer,
                                    const sb_symbol * word, int size);

that's why I throw an out of memory exception when the returned value is NULL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use VELOX_CHECK_NOT_NULL(stem, "Stemmer library returned a NULL (out-of-memory)."

We need to make it clear that the error comes from Stemmer library and not from Velox core.

Let's also figure out how much memory Stemmer instances use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use VELOX_CHECK_NOT_NULL(stem, "Stemmer library returned a NULL (out-of-memory)."

Sure thing!

Let's also figure out how much memory Stemmer instances use.

I guess this would be a tricky one. I will dig more on this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can write a simple program that uses stemmer lib to stem some words and capture process-wide memory usage before creating stemmer instance, after, then again after processing some words, then after deleting stemmer instance.

https://stackoverflow.com/questions/63166/how-to-determine-cpu-and-memory-consumption-from-inside-a-process

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can also look into its implementation and see how and how much it allocates for storing the stem. since each stemmer reuses the memory block, it shouldn't consume that much.

velox/functions/prestosql/tests/StringFunctionsTest.cpp Outdated Show resolved Hide resolved
@mbasmanova mbasmanova changed the title Add word_stem() implementation Add word_stem Presto function Apr 9, 2024
@mbasmanova
Copy link
Contributor

CC: @amitkdutta

Copy link
Collaborator

@assignUser assignUser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments for the cmake. I checked the timestamps in the debug build and stemmer build in a few seconds, so we can leave it on BUNDLED for now.

CMake/resolve_dependency_modules/stemmer.cmake Outdated Show resolved Hide resolved
velox/functions/prestosql/CMakeLists.txt Outdated Show resolved Hide resolved
CMakeLists.txt Outdated Show resolved Hide resolved
CMakeLists.txt Show resolved Hide resolved
CMake/resolve_dependency_modules/stemmer.cmake Outdated Show resolved Hide resolved
@yhwang
Copy link
Contributor Author

yhwang commented Apr 10, 2024

thanks for the comments. I will address them in separate commits for an easier review process. I will squash all commits into one when the code is ready for merge.

@assignUser I addressed your comments in this commit: Address Cmake comments

I will work on other comments later.

@yhwang yhwang requested a review from assignUser April 10, 2024 06:27
@mbasmanova
Copy link
Contributor

@rui-mo @PHILO-HE Folks, does Spark have a similar function?

@assignUser
Copy link
Collaborator

The python extension is always build as a shared library, could we build libstemmer with fpic? https://github.com/facebookincubator/velox/actions/runs/8626883438/job/23645998539?pr=9363#step:13:3508

@yhwang
Copy link
Contributor Author

yhwang commented Apr 10, 2024

The python extension is always build as a shared library, could we build libstemmer with fpic? https://github.com/facebookincubator/velox/actions/runs/8626883438/job/23645998539?pr=9363#step:13:3508

@assignUser good question. I didn't think about that since libstemmer comes with the Makefile that only builds static lib. Do you think we should tweak it to build a so?

@yhwang
Copy link
Contributor Author

yhwang commented Apr 10, 2024

@mbasmanova thanks for the comments. I added another commit here to address most of your comments: Update word_stem impl to address comments

I guess the things left are:

  • estimate or know the memory allocation in libstemmer
  • use local_thread or not

we observed that storing thread local variables inside objects that can be moved between threads can cause an unexpected increase in memory usage as whenever they are accessed in another thread

The thread-local variable here is not inside any objects and is not moved between threads. Or my understanding is wrong?

@mbasmanova
Copy link
Contributor

@yhwang

use local_thread or not

Definitely not. Please, take a look at @bikramSingh91's explanation.

@yhwang
Copy link
Contributor Author

yhwang commented Apr 10, 2024

@yhwang

use local_thread or not

Definitely not. Please, take a look at @bikramSingh91's explanation.

I guess my point is I don't see a direct relationship between this change and the thread-local issue mentioned above. I'd like to get more clarification. The stemmers in the use case here would only take a fixed amount of memory, it won't grow infinitely. Just need to know if that's the main concern.

@assignUser
Copy link
Collaborator

@yhwang it doesn't look like it supports shared library but patching the make file with

- CFLAGS=-Iinclude
+ CFLAGS=-Iinclude -fPIC

should be enough for the linker to figure stuff out when building the extension .so

@bikramSingh91
Copy link
Contributor

@yhwang

use local_thread or not

Definitely not. Please, take a look at @bikramSingh91's explanation.

I guess my point is I don't see a direct relationship between this change and the thread-local issue mentioned above. I'd like to get more clarification. The stemmers in the use case here would only take a fixed amount of memory, it won't grow infinitely. Just need to know if that's the main concern.

It appears that a static method is responsible for maintaining the thread-local variables specific to each thread. These variables are accessed each time the simple function is invoked. Although these variables are not contained within an object, the driver that carries this expression can be executed on a different thread for the subsequent batch of input vectors, leading to the creation of additional instances.

The maximum memory these objects can hold is approximately equal to (num_of_langs * threads). While this might constitute a small amount, the presence of thread-local variables complicates the understanding of their lifespan, ownership, and the number of copies created. The alternative we proposed ties the ownership to the Expression object itself, thereby making these aspects more predictable while ensuring thread-safe access. TLDR: the concern is less about the memory utilization and more about developing code that we can confidently reason about.

We would appreciate any insights into specific use-cases or advantages that we might have overlooked, which would make the use of thread-local variables more appealing. For instance, is their purpose to ensure that these objects are shared among potentially different expressions objects within the same thread to save on the cost of creating the stemmer objects?

@yhwang
Copy link
Contributor Author

yhwang commented Apr 11, 2024

@bikramSingh91 thanks for the clarification. let me simplify it to what the lifespan of a stemmer should be.

  • with the expression:
    initialize stemmer(s) in each expression and clean them up after the expression ends
  • or with the thread:
    lazy initialize stemmers when needed and use unique_ptr for it. Basically, they go down with the thread

For me, maintaining the logic in the WordStem UDF struct is more straightforward and clean. Not sure if expression usage makes the code separate in multiple places. But I am good with either choice.

We would appreciate any insights into specific use-cases or advantages that we might have overlooked, which would make the use of thread-local variables more appealing.

This is the API doc from libstemmer:

Creating a stemmer is a relatively expensive operation - the expected
usage pattern is that a new stemmer is created when needed, used
to stem many words, and deleted after some time.

Stemmers are re-entrant, but not threadsafe.  In other words, if
you wish to access the same stemmer object from multiple threads,
you must ensure that all access is protected by a mutex or similar
device.

@mbasmanova posted this above and this is also the main reason I use thread-local map to store the stemmers, avoid mutex, and keep the code simple to read. Again, I am good with either expression or thread-local, since we agree memory shouldn't be the concern, and appreciate your clarification.

@mbasmanova mbasmanova added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Apr 12, 2024
@yhwang
Copy link
Contributor Author

yhwang commented Apr 12, 2024

@mbasmanova

Velox repo works differently from PrestoDB repo.

Understood. Different repos usually have different rule/bot. I worked on other repos having bots to squash commits and use the PR description and commit messages as the final commit. Good to know rule/settings here.

I know that @assignUser likes separate commits for addressing comments, but I don't share that preference.

I guess there is a way to fulfill both, I could squash reviewed commits into one and have only new commits for review :-) . I use single commit from time to time. But I found the force push diff is not that user-friendly. If the change is small, I prefer a single commit too.

Thanks for the review and comments (@assignUser too). I learned a lot. It's also a good practice for me. I updated the comment and squashed my commits (I know I don't need to do that)

@pedroerp
Copy link
Contributor

@yhwang looks like our internal fuzzer caught a crash with your function:

I0418 07:57:34.180382 1011044 ExpressionFuzzerVerifier.cpp:312] ==============================> Started iteration 101 (seed: 2144156392)                                                       
I0418 07:57:34.189090 1011044 ExpressionVerifier.cpp:51] Executing expression 0 : word_stem("c0",MJw$$A,*3z;D)                                                                                 
=================================================================                                                                                                                              
==1011044==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7f258f990910 at pc 0x00000032eb39 bp 0x7fffc1c98860 sp 0x7fffc1c98020                                                   
READ of size 13 at 0x7f258f990910 thread T0                                                                                                                                                    
SCARINESS: 41 (multi-byte-read-stack-buffer-overflow)                                                                                                                                          
    #0 0x32eb38 in __interceptor_strlen.part.0 ubsan.c                                                                                                                                         
    #1 0x7f25ccb86507 in unsigned long folly::detail::constexpr_strlen_internal<char, 0ul>(char const*, int) fbcode/folly/portability/Constexpr.h:50                                           
    #2 0x7f25ccb864e6 in unsigned long folly::constexpr_strlen<char>(char const*) fbcode/folly/portability/Constexpr.h:106                                                                     
    #3 0x7f25ccb85209 in folly::Range<char const*>::Range(char const*) fbcode/folly/Range.h:303                                                                                                
    #4 0x7f25d6c69e62 in unsigned long folly::detail::TransparentRangeHash<char>::operator()<char const*>(char const* const&) const fbcode/folly/container/HeterogeneousAccess.h:125           
    #5 0x7f25d6c69a94 in unsigned long folly::f14::detail::VectorContainerPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::unique_ptr<facebook::velo
x::functions::detail::Stemmer, std::default_delete<facebook::velox::functions::detail::Stemmer>>, void, void, void, std::integral_constant<bool, true>>::computeKeyHash<char const*>(char const
* const&) const fbcode/folly/container/detail/F14Policy.h:1135                                                                                                                                 
    #6 0x7f25d6c678bb in find<const char *> fbcode/folly/container/detail/F14Table.h:1585                                                                                                      
    #7 0x7f25d6c678bb in find<const char *> fbcode/folly/container/F14Map.h:886                                                                                                                
    #8 0x7f25d6c678bb in facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec>::getStemmer(char const*) fbcode/velox/functions/prestosql/WordStem.h:120              
    #9 0x7f25d6d1b51f in doCall<true> fbcode/velox/functions/prestosql/WordStem.h:101                                                                                                          
    #10 0x7f25d6d1b51f in callAscii fbcode/velox/functions/prestosql/WordStem.h:93                                                                                                             
    #11 0x7f25d6d1b51f in callAsciiImpl fbcode/velox/core/SimpleFunctionMetadata.h:954                                                                                                         
    #12 0x7f25d6d1b51f in callAscii fbcode/velox/core/SimpleFunctionMetadata.h:904
    #13 0x7f25d6d1b51f in doApplyAsciiNotNull<2UL, facebook::velox::StringView, facebook::velox::StringView, 0> fbcode/velox/expression/SimpleFunctionAdapter.h:866
    #14 0x7f25d6d1b51f in doApplyAsciiNotNull<1UL, const facebook::velox::exec::ConstantVectorReader<facebook::velox::Varchar>, facebook::velox::StringView, 0> fbcode/velox/expression/SimpleF
unctionAdapter.h:855
    #15 0x7f25d6d1b51f in doApplyAsciiNotNull<0UL, facebook::velox::exec::FlatVectorReader<facebook::velox::Varchar>, facebook::velox::exec::ConstantVectorReader<facebook::velox::Varchar>, 0>
 fbcode/velox/expression/SimpleFunctionAdapter.h:855
    #16 0x7f25d6d1b51f in operator()<facebook::velox::exec::StringWriter<false>, int> fbcode/velox/expression/SimpleFunctionAdapter.h:685
    #17 0x7f25d6d1b51f in operator()<int> fbcode/velox/expression/SimpleFunctionAdapter.h:721
    #18 0x7f25d6d1b51f in operator()<int> fbcode/velox/expression/EvalCtx.h:102

You can reproduce it by running fuzzer with ubsan enabled.

@yhwang
Copy link
Contributor Author

yhwang commented Apr 18, 2024

@pedroerp thanks for the info. the code in velox/functions/prestosql/WordStem.h:120 is:

if (auto found = stemmers_.find(lang); found != stemmers_.end()) {

It's kind of interesting to me. Let me reproduce it and see how to fix it.

Edit
I just saw the expression: Executing expression 0 : word_stem("c0",MJw$$A,*3z;D)
Is the second parameter a string?

@pedroerp
Copy link
Contributor

Is the second parameter a string?

I suppose so. Try to reproduce by running on that exact same version and same fuzzer seed. You'll probably need to run using clang and enable ubsan.

@yhwang
Copy link
Contributor Author

yhwang commented Apr 22, 2024

@pedroerp sorry for the late response, I was on vacation last Friday. I tried to build velox_expression_fuzzer_test with UBSan today, but I hit some undefined symbol errors in the linking stage. Can you share the steps/instructions to build the fuzzer with UBSan enabled?

@pedroerp
Copy link
Contributor

@yhwang unfortunately we don't have this set up in our github CI (yet), but you need to first compile using clang with the ubsan flags enabled, then run join fuzzer. These are the instructions on how to enable ubsan:

https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html

What is the issue you are seeing?

@yhwang
Copy link
Contributor Author

yhwang commented Apr 23, 2024

@pedroerp

https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html

Thanks and this is the same doc that I followed. The unresolved symbols issue may be caused by the flex that comes with xcode. I am using a Linux VM to build the velox now. The build looks good so far.

@pedroerp
Copy link
Contributor

The unresolved symbols issue may be caused by the flex that comes with xcode

I think you need to compile all dependencies with the same ubsan flags. @kgpai and @assignUser would know this better.

@yhwang
Copy link
Contributor Author

yhwang commented Apr 23, 2024

@pedroerp

I ran the following command:
_build/debug/velox/expression/tests/velox_expression_fuzzer_test --seed=2144156392 --only=word_stem --logtostdout=1 --steps=1000

and I only saw this from time to time:

I20240423 16:/home/yhwang/WS_git/presto/presto-native-execution/velox/velox/common/base/BitUtil.cpp:204:24: runtime error: load of misaligned address 0x7ffe3eb09834 for type 'const long unsigned int', which requires 8 byte alignment
0x7ffe3eb09834: note: pointer points here
  0a 00 00 00 59 28 46 76  27 77 30 3d 4a 66 00 00  00 00 00 00 00 00 00 00  87 98 b0 3e fe 7f 00 00
              ^~

and although I specified -seed, it only uses it for the first iteration. Do you know how to use the same seed across all iterations?

I added this to the very top CMakeLists.xt:

add_link_options(-fsanitize=undefined)
add_compile_options(-fsanitize=undefined)

And set VELOX_DEPENDENCY_SOURCE to BUNDLED

I guess this way it builds most of the deps as bundles and using the UBSan flag.

@pedroerp
Copy link
Contributor

Do you know how to use the same seed across all iterations?

The seed is used for the first iteration, then a new seed is generated for the next iteration. They should deterministically reproduce the same chain of seeds if you're using the same binary with the same set of functions available.

@yhwang
Copy link
Contributor Author

yhwang commented Apr 23, 2024

Then I can only for-loop the command to see if I can hit that error with the same seed. So far, I haven't seen it yet.

@kgpai
Copy link
Contributor

kgpai commented Apr 24, 2024

@yhwang You can see the flags we use here : https://github.com/facebookincubator/velox/blob/main/.github/workflows/scheduled.yml#L360

You can also set the flag in helper-functions.sh without modifying your cmake and set the deps to BUNDLED which will ensure everything has UBSAN (like you did!).

@yhwang
Copy link
Contributor Author

yhwang commented Apr 24, 2024

I applied all the flags, but I only hit this error still:

/home/yhwang/WS_git/presto/presto-native-execution/velox/./velox/common/base/BitUtil.h:826:12: runtime error: store to misaligned address 0x564e2ecaefc3 for type 'short unsigned int', which requires 2 byte alignment
0x564e2ecaefc3: note: pointer points here
 ff  fd ff e7 ff ff ff ff ff  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff
              ^

Based on the error that Pedro posted earlier, I guess it's caused by the const char* retrieving from StringViewer::data(). Let me see if passing std::string would avoid the error.

Add snowball libstemmer v2.2.0 as one of the dependencies.
And use it to implement the word_stem() as a scalar UDF.

When using the libstemmer API, each language creates an sb_stemmer
instance which consumes 114 bytes, including the default 10 bytes for
the output stem.
It uses the realloc to increase the memory block for the output stem if
needed.

Signed-off-by: Yihong Wang <[email protected]>
@yhwang
Copy link
Contributor Author

yhwang commented Apr 25, 2024

@pedroerp I kept trying to recreate the error you posted above. But I can't reproduce it. The only error I saw is the one I mentioned: BitUtil.h:826. So the only thing I can do is change the lang param from const char* to std::string&. Could you help me to verify the change at your end?

Another thing I feel weird is the stack trace in your comment is different from what I saw in my system. I paused at the word_stem function and stepped into the function call one by one to simulate the stack trace you posted. and this is what I got (I didn't go all the way to the very top stack trace you had):

folly::f14::detail::VectorContainerPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<facebook::velox::functions::detail::Stemmer, std::default_delete<facebook::velox::functions::detail::Stemmer> >, void, void, void, std::integral_constant<bool, true> >::computeKeyHash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(const folly::f14::detail::VectorContainerPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<facebook::velox::functions::detail::Stemmer, std::default_delete<facebook::velox::functions::detail::Stemmer> >, void, void, void, std::integral_constant<bool, true> > * const this, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > & key) (/home/yhwang/WS_git/presto/presto-native-execution/velox/_build/debug/_deps/folly-src/folly/container/detail/F14Policy.h:1134)
folly::f14::detail::F14Table<folly::f14::detail::VectorContainerPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<facebook::velox::functions::detail::Stemmer, std::default_delete<facebook::velox::functions::detail::Stemmer> >, void, void, void, std::integral_constant<bool, true> > >::find<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > & key, const folly::f14::detail::F14Table<folly::f14::detail::VectorContainerPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<facebook::velox::functions::detail::Stemmer, std::default_delete<facebook::velox::functions::detail::Stemmer> >, void, void, void, std::integral_constant<bool, true> > > * const this) (/home/yhwang/WS_git/presto/presto-native-execution/velox/_build/debug/_deps/folly-src/folly/container/detail/F14Table.h:1580)
folly::f14::detail::F14BasicMap<folly::f14::detail::VectorContainerPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<facebook::velox::functions::detail::Stemmer, std::default_delete<facebook::velox::functions::detail::Stemmer> >, void, void, void, std::integral_constant<bool, true> > >::find(const folly::f14::detail::F14BasicMap<folly::f14::detail::VectorContainerPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<facebook::velox::functions::detail::Stemmer, std::default_delete<facebook::velox::functions::detail::Stemmer> >, void, void, void, std::integral_constant<bool, true> > >::key_type & key, folly::f14::detail::F14BasicMap<folly::f14::detail::VectorContainerPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<facebook::velox::functions::detail::Stemmer, std::default_delete<facebook::velox::functions::detail::Stemmer> >, void, void, void, std::integral_constant<bool, true> > > * const this) (/home/yhwang/WS_git/presto/presto-native-execution/velox/_build/debug/_deps/folly-src/folly/container/F14Map.h:854)
facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec>::getStemmer(facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec> * const this, const std::string & lang) (/home/yhwang/WS_git/presto/presto-native-execution/velox/velox/functions/prestosql/WordStem.h:120)
facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec>::doCall<true>(const std::string & lang, const facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec>::arg_type & input, facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec>::out_type & result, facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec> * const this) (/home/yhwang/WS_git/presto/presto-native-execution/velox/velox/functions/prestosql/WordStem.h:101)
facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec>::callAscii(const facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec>::arg_type & input, facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec>::out_type & result, facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec> * const this) (/home/yhwang/WS_git/presto/presto-native-execution/velox/velox/functions/prestosql/WordStem.h:79)

You can see the line numbers in those folly files are different. I am using the folly version from the main branch which is v2024.04.01.00. Is it possible the error is from the different folly version?

@facebook-github-bot
Copy link
Contributor

@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pedroerp
Copy link
Contributor

@yhwang I run some more tests. The problem was because on the previous version you were taking a std::string or StringView's buffer (which do not necessarily have a \0 at the end), and using it in the function sb_stemmer_new() which expects a C-like string ended by a \0.

Using c_str() instead like what you're doing now fixes the issue. I'll get it merged today.

@yhwang
Copy link
Contributor Author

yhwang commented Apr 27, 2024

@pedroerp thanks for the verification and explanation. So my guess is right :-), although I can’t reproduce the error.

Copy link

Conbench analyzed the 1 benchmark run on commit 47970417.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

zhztheplayer added a commit to zhztheplayer/velox that referenced this pull request Apr 28, 2024
@yhwang yhwang deleted the word_stem-impl branch April 30, 2024 20:52
@facebook-github-bot
Copy link
Contributor

@pedroerp merged this pull request in 4797041.

Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024
Summary:
Add snowball libstemmer v2.2.0 as one of the dependencies.
And use it to implement the word_stem() as a scalar UDF.

When using the libstemmer API, each language creates an `sb_stemmer`
instance which consumes 114 bytes, including the default 10 bytes for the output stem.
It uses the `realloc` to increase the memory block for the output stem if needed.

Fixes facebookincubator#8487

Pull Request resolved: facebookincubator#9363

Reviewed By: amitkdutta

Differential Revision: D56059511

Pulled By: pedroerp

fbshipit-source-id: b3a66956c3809e3f3dadfc8cc7b397b7116996d5
zhli1142015 pushed a commit to zhli1142015/velox that referenced this pull request Jul 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add word_stem Presto function
8 participants