Add word_stem Presto function #9363

yhwang · 2024-04-04T16:39:47Z

Add Snowball libstemmer v2.2.0 as a dependency
and use it to implement word_stem() as a scalar UDF.

When using the libstemmer API, each language creates an sb_stemmer
instance which consumes 114 bytes, including the default 10 bytes for the output stem.
libstemmer uses realloc to grow the memory block for the output stem if needed.

Fixes #8487

facebook-github-bot · 2024-04-04T16:39:54Z

Hi @yhwang!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

netlify · 2024-04-04T16:40:03Z

✅ Deploy Preview for meta-velox ready!

Name	Link
🔨 Latest commit	`2c14f09`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/662adbf9c773e70008720409
😎 Deploy Preview	https://deploy-preview-9363--meta-velox.netlify.app

yhwang · 2024-04-09T18:02:30Z

/cc @mbasmanova @aditi-pandit

yhwang · 2024-04-09T21:39:44Z

velox/functions/prestosql/StringFunctions.h

+  /// or create a new one if it doesn't exist. Return NULL if the
+  /// specified lang is not supported.
+  static Stemmer* getStemmer(const char* lang) {
+    thread_local std::map<std::string, Stemmer*> stemmers;


I don't clean up this thread-local map, which contains the Stemmer instances for each language. I assume a worker thread only shuts down together with the Presto process. Not sure if it's worth adding cleanup code for thread teardown.

I'm not sure if it is a good idea to use thread local variable here. CC: @bikramSingh91

https://github.com/zvelo/libstemmer

Creating a stemmer is a relatively expensive operation - the expected usage pattern is that a new stemmer is created when needed, used to stem many words, and deleted after some time. Stemmers are re-entrant, but not threadsafe. In other words, if you wish to access the same stemmer object from multiple threads, you must ensure that all access is protected by a mutex or similar device.

Based on this information, it seems we could add a member variable to 'WordStemFunction' to store Stemmer instances.

Let's use std::unique_ptr to make cleanup automatic.

Similar to regex functions, we may consider only supporting constant values for 'lang' and create a single Stemmer instance in 'initialize'.

Alternatively, we may allow up to some fixed number of different languages per function instance. See https://facebookincubator.github.io/velox/functions/presto/regexp.html

Based on this information, it seems we could add a member variable to 'WordStemFunction' to store Stemmer instances.

that's the reason I use thread-local to store the lang<-->stemmer map

Let's use std::unique_ptr to make cleanup automatic.

let me change the map to <string, unique_ptr<Stemmer>>

To add more context on why we avoid using thread_local variables liberally: we observed that storing thread-local variables inside objects that can be moved between threads can cause an unexpected increase in memory usage, because whenever they are accessed in another thread, they are created anew and exist for the lifetime of that thread. An example of an issue we recently fixed: #7646

In this case as well, the driver executing expression eval (and this code) can bounce between multiple available threads (we use folly::executor to execute drivers), and each time it would end up creating new instances. Unlike #7646, I do not expect this to hold on to quite as much memory, but it would nevertheless be a good idea to avoid this kind of usage pattern unless there is no other alternative. Using a unique_ptr as @mbasmanova suggested would give us tighter control over its lifetime by tying it to the lifetime of the Expression, and therefore more predictable memory usage.
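A minimal sketch of the ownership model suggested here, assuming a stand-in class rather than the real libstemmer wrapper: the cache is a member of the function object and holds each stemmer through std::unique_ptr, so every stemmer is destroyed together with its owner. `FakeStemmer`, `WordStemSketch`, and `numStemmers()` are hypothetical names used only for illustration; they are not the code in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a wrapper around libstemmer's sb_stemmer.
// It only counts live instances so that ownership can be observed.
class FakeStemmer {
 public:
  explicit FakeStemmer(std::string lang) : lang_(std::move(lang)) {
    ++liveCount();
  }
  ~FakeStemmer() {
    --liveCount();
  }
  FakeStemmer(const FakeStemmer&) = delete;
  FakeStemmer& operator=(const FakeStemmer&) = delete;

  const std::string& lang() const {
    return lang_;
  }

  // Number of FakeStemmer objects currently alive in the process.
  static int& liveCount() {
    static int count = 0;
    return count;
  }

 private:
  std::string lang_;
};

// Hypothetical function object. The stemmer cache is a member, so its
// lifetime is tied to this object instead of to whichever thread runs it.
class WordStemSketch {
 public:
  // Returns the cached stemmer for 'lang', creating it on first use.
  FakeStemmer* getStemmer(const std::string& lang) {
    auto it = stemmers_.find(lang);
    if (it != stemmers_.end()) {
      return it->second.get();
    }
    auto result = stemmers_.emplace(lang, std::make_unique<FakeStemmer>(lang));
    return result.first->second.get();
  }

  std::size_t numStemmers() const {
    return stemmers_.size();
  }

 private:
  std::unordered_map<std::string, std::unique_ptr<FakeStemmer>> stemmers_;
};
```

With this shape, repeated calls for the same language reuse one instance, and no explicit cleanup code is needed: destroying the `WordStemSketch` object releases all of its stemmers.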

@bikramSingh91 I need more clarification on this. In the code here, the thread-local variable is not inside any object, nor is it moved between threads. The possible number of stemmer instances is bounded: it depends on how many languages are used in a running system, and we only support 20 languages. So the cost in each thread should be a fixed amount of memory. Each time a stemmer is used, it reuses an already allocated memory block to store the stem result, and then the result is copied to the StringWriter. I guess it's different from the PR you referred to. I have a commit that changes the map to <string, unique_ptr<Stemmer>>; what do you think after that change?

yhwang · 2024-04-09T21:55:02Z

velox/functions/prestosql/StringFunctions.h

+    std::string lowerOutput;
+    stringImpl::lower<isAscii>(lowerOutput, input);
+    auto rev = stemmer->stem(lowerOutput);
+    if (rev == NULL) {


I guess I could put UNLIKELY here.

mbasmanova

@kgpai @assignUser Krishna, Jacob, would you help review build changes?

@yhwang let's add the new dependency to CMake/resolve_dependency_modules/README.md

@kgpai Krishna, would you help look into what it would take to add this dependency to Meta's repo?

velox/docs/functions/presto/string.rst

mbasmanova

@yhwang Thank you for working on this. Some initial questions. Primarily, I'd like to understand how we are allocating memory and what may cause an OOM scenario.

CC: @xiaoxmeng

velox/functions/prestosql/StringFunctions.h

mbasmanova · 2024-04-09T23:28:52Z

velox/functions/prestosql/StringFunctions.h

+  /// or create a new one if it doesn't exist. Return NULL if the
+  /// specified lang is not supported.
+  static Stemmer* getStemmer(const char* lang) {
+    thread_local std::map<std::string, Stemmer*> stemmers;


I'm not sure if it is a good idea to use thread local variable here. CC: @bikramSingh91

velox/functions/prestosql/StringFunctions.h

mbasmanova · 2024-04-09T23:33:15Z

velox/functions/prestosql/StringFunctions.h

+    stringImpl::lower<isAscii>(lowerOutput, input);
+    auto rev = stemmer->stem(lowerOutput);
+    if (rev == NULL) {
+      throw std::runtime_error("out of memory");


Would you explain how memory allocations are happening here? We need to make sure any large allocations go through Velox's Memory Pool. How can we run out of memory here?

enclosed its API doc here:

/** Stem a word.
 *
 *  The return value is owned by the stemmer - it must not be freed or
 *  modified, and it will become invalid when the stemmer is called again,
 *  or if the stemmer is freed.
 *
 *  The length of the return value can be obtained using sb_stemmer_length().
 *
 *  If an out-of-memory error occurs, this will return NULL.
 */
const sb_symbol * sb_stemmer_stem(struct sb_stemmer * stemmer, const sb_symbol * word, int size);

that's why I throw an out of memory exception when the returned value is NULL.

Let's use VELOX_CHECK_NOT_NULL(stem, "Stemmer library returned a NULL (out-of-memory).")

We need to make it clear that the error comes from Stemmer library and not from Velox core.

Let's also figure out how much memory Stemmer instances use.

Let's use VELOX_CHECK_NOT_NULL(stem, "Stemmer library returned a NULL (out-of-memory).")

Sure thing!

Let's also figure out how much memory Stemmer instances use.

I guess this would be a tricky one. I will dig more on this.

You can write a simple program that uses the stemmer library to stem some words and captures process-wide memory usage at four points: before creating the stemmer instance, after creating it, after processing some words, and after deleting the stemmer instance.

https://stackoverflow.com/questions/63166/how-to-determine-cpu-and-memory-consumption-from-inside-a-process

I can also look into its implementation and see how, and how much, it allocates for storing the stem. Since each stemmer reuses the memory block, it shouldn't consume that much.
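The measurement suggested above could be sketched roughly as follows, assuming a POSIX system where getrusage() is available. The helpers `peakRss()`, `peakRssGrowth()`, and `touchMemory()` are hypothetical names, and `touchMemory()` is only a stand-in workload; a real measurement would create a stemmer and stem some words in its place. Note that ru_maxrss is a peak value with page granularity, so it can show growth but not the release of memory after a stemmer is deleted, and it is too coarse to resolve an object of roughly a hundred bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/resource.h>
#include <vector>

// Peak resident set size of the current process as reported by getrusage().
// The unit is platform-dependent (kilobytes on Linux, bytes on macOS).
// Returns -1 if the call fails.
long peakRss() {
  rusage usage{};
  if (getrusage(RUSAGE_SELF, &usage) != 0) {
    return -1;
  }
  return usage.ru_maxrss;
}

// Runs 'work' and returns how much the peak RSS grew while it ran.
template <typename F>
long peakRssGrowth(F&& work) {
  const long before = peakRss();
  work();
  const long after = peakRss();
  return after - before;
}

// Stand-in workload (an assumption for this sketch): allocates 'bytes' and
// writes to one byte per 4 KiB page so the memory becomes resident.
void touchMemory(std::size_t bytes) {
  std::vector<char> block(bytes);
  for (std::size_t i = 0; i < block.size(); i += 4096) {
    block[i] = 1;
  }
}
```

Calling `peakRssGrowth([] { touchMemory(64 << 20); })` reports the growth caused by the stand-in workload; replacing the lambda body with stemmer creation and a loop of stem calls would give the numbers asked for in this thread.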

velox/functions/prestosql/tests/StringFunctionsTest.cpp

mbasmanova · 2024-04-09T23:35:35Z

CC: @amitkdutta

velox/functions/prestosql/StringFunctions.h

assignUser

Some comments on the CMake changes. I checked the timestamps in the debug build and stemmer builds in a few seconds, so we can leave it on BUNDLED for now.

CMake/resolve_dependency_modules/stemmer.cmake

velox/functions/prestosql/CMakeLists.txt

CMakeLists.txt

CMake/resolve_dependency_modules/stemmer.cmake

yhwang · 2024-04-10T06:26:53Z

thanks for the comments. I will address them in separate commits for an easier review process. I will squash all commits into one when the code is ready for merge.

@assignUser I addressed your comments in this commit: Address Cmake comments

I will work on other comments later.

mbasmanova · 2024-04-10T08:16:49Z

@rui-mo @PHILO-HE Folks, does Spark have a similar function?

assignUser · 2024-04-10T17:14:48Z

The python extension is always built as a shared library, could we build libstemmer with -fPIC? https://github.com/facebookincubator/velox/actions/runs/8626883438/job/23645998539?pr=9363#step:13:3508

yhwang · 2024-04-10T17:28:56Z

The python extension is always built as a shared library, could we build libstemmer with -fPIC? https://github.com/facebookincubator/velox/actions/runs/8626883438/job/23645998539?pr=9363#step:13:3508

@assignUser good question. I didn't think about that since libstemmer comes with a Makefile that only builds a static lib. Do you think we should tweak it to build a .so?

yhwang · 2024-04-10T18:13:51Z

@mbasmanova thanks for the comments. I added another commit here to address most of your comments: Update word_stem impl to address comments

I guess the things left are:

estimate or determine the memory allocation in libstemmer
use thread_local or not

we observed that storing thread local variables inside objects that can be moved between threads can cause an unexpected increase in memory usage as whenever they are accessed in another thread

The thread-local variable here is not inside any object and is not moved between threads. Or is my understanding wrong?

mbasmanova · 2024-04-10T19:14:53Z

@yhwang

use thread_local or not

Definitely not. Please, take a look at @bikramSingh91's explanation.

yhwang · 2024-04-10T19:37:40Z

@yhwang

use thread_local or not

Definitely not. Please, take a look at @bikramSingh91's explanation.

I guess my point is I don't see a direct relationship between this change and the thread-local issue mentioned above. I'd like to get more clarification. The stemmers in the use case here would only take a fixed amount of memory, it won't grow infinitely. Just need to know if that's the main concern.

assignUser · 2024-04-10T21:12:12Z

@yhwang it doesn't look like it supports shared library but patching the make file with

- CFLAGS=-Iinclude
+ CFLAGS=-Iinclude -fPIC

should be enough for the linker to figure stuff out when building the extension .so

bikramSingh91 · 2024-04-10T21:57:33Z

@yhwang

use thread_local or not

Definitely not. Please, take a look at @bikramSingh91's explanation.

I guess my point is I don't see a direct relationship between this change and the thread-local issue mentioned above. I'd like to get more clarification. The stemmers in the use case here would only take a fixed amount of memory, it won't grow infinitely. Just need to know if that's the main concern.

It appears that a static method is responsible for maintaining the thread-local variables specific to each thread. These variables are accessed each time the simple function is invoked. Although these variables are not contained within an object, the driver that carries this expression can be executed on a different thread for the subsequent batch of input vectors, leading to the creation of additional instances.

The maximum memory these objects can hold is approximately equal to (num_of_langs * threads). While this might constitute a small amount, the presence of thread-local variables complicates the understanding of their lifespan, ownership, and the number of copies created. The alternative we proposed ties the ownership to the Expression object itself, thereby making these aspects more predictable while ensuring thread-safe access. TLDR: the concern is less about the memory utilization and more about developing code that we can confidently reason about.

We would appreciate any insights into specific use-cases or advantages that we might have overlooked, which would make the use of thread-local variables more appealing. For instance, is their purpose to ensure that these objects are shared among potentially different expressions objects within the same thread to save on the cost of creating the stemmer objects?
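To make the (num_of_langs * threads) bound above concrete, here is a small sketch, assuming a counting stand-in instead of a real stemmer. `CountingStemmer`, `getThreadLocalStemmer()`, and `countCreatedStemmers()` are hypothetical names; the lookup function only mirrors the shape of the static getStemmer() under discussion.

```cpp
#include <atomic>
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <thread>
#include <vector>

// Total number of stand-in stemmers constructed, across all threads.
std::atomic<int> gCreated{0};

// Hypothetical stand-in for a stemmer; it only counts constructions.
struct CountingStemmer {
  CountingStemmer() {
    ++gCreated;
  }
};

// Each thread that calls this gets its own map, and therefore its own
// stemmer per language.
CountingStemmer* getThreadLocalStemmer(const std::string& lang) {
  thread_local std::map<std::string, std::unique_ptr<CountingStemmer>> stemmers;
  auto& slot = stemmers[lang];
  if (!slot) {
    slot = std::make_unique<CountingStemmer>();
  }
  return slot.get();
}

// Simulates the same expression being evaluated on 'numThreads' different
// threads, three batches each, and returns how many stemmers were created.
int countCreatedStemmers(int numThreads, const std::vector<std::string>& langs) {
  gCreated = 0;
  std::vector<std::thread> threads;
  for (int i = 0; i < numThreads; ++i) {
    threads.emplace_back([&langs] {
      for (int batch = 0; batch < 3; ++batch) {
        for (const auto& lang : langs) {
          getThreadLocalStemmer(lang);
        }
      }
    });
  }
  for (auto& t : threads) {
    t.join();
  }
  return gCreated.load();
}
```

In this sketch, four threads and two languages yield eight instances regardless of the number of batches: reuse happens only within a thread, and the copies live until their thread exits.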

yhwang · 2024-04-11T00:38:27Z

@bikramSingh91 thanks for the clarification. let me simplify it to what the lifespan of a stemmer should be.

with the expression:
initialize stemmer(s) in each expression and clean them up after the expression ends
or with the thread:
lazily initialize stemmers when needed and hold them in unique_ptr, so they basically go down with the thread

For me, maintaining the logic in the WordStem UDF struct is more straightforward and clean. I'm not sure whether tying it to the expression would spread the code across multiple places. But I am good with either choice.

We would appreciate any insights into specific use-cases or advantages that we might have overlooked, which would make the use of thread-local variables more appealing.

This is the API doc from libstemmer:

Creating a stemmer is a relatively expensive operation - the expected
usage pattern is that a new stemmer is created when needed, used
to stem many words, and deleted after some time.

Stemmers are re-entrant, but not threadsafe.  In other words, if
you wish to access the same stemmer object from multiple threads,
you must ensure that all access is protected by a mutex or similar
device.

@mbasmanova posted this above, and it is also the main reason I use a thread-local map to store the stemmers: it avoids a mutex and keeps the code simple to read. Again, I am good with either expression or thread-local, since we agree memory shouldn't be the concern, and I appreciate your clarification.

yhwang · 2024-04-12T16:05:55Z

@mbasmanova

Velox repo works differently from PrestoDB repo.

Understood. Different repos usually have different rules/bots. I have worked on other repos that have bots to squash commits and use the PR description and commit messages as the final commit. Good to know the rules/settings here.

I know that @assignUser likes separate commits for addressing comments, but I don't share that preference.

I guess there is a way to satisfy both: I could squash reviewed commits into one and keep only new commits for review :-) . I use a single commit from time to time, but I find the force-push diff is not that user-friendly. If the change is small, I prefer a single commit too.

Thanks for the review and comments (@assignUser too). I learned a lot. It's also a good practice for me. I updated the comment and squashed my commits (I know I don't need to do that)

pedroerp · 2024-04-18T15:16:58Z

@yhwang looks like our internal fuzzer caught a crash with your function:

I0418 07:57:34.180382 1011044 ExpressionFuzzerVerifier.cpp:312] ==============================> Started iteration 101 (seed: 2144156392)                                                       
I0418 07:57:34.189090 1011044 ExpressionVerifier.cpp:51] Executing expression 0 : word_stem("c0",MJw$$A,*3z;D)                                                                                 
=================================================================                                                                                                                              
==1011044==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7f258f990910 at pc 0x00000032eb39 bp 0x7fffc1c98860 sp 0x7fffc1c98020                                                   
READ of size 13 at 0x7f258f990910 thread T0                                                                                                                                                    
SCARINESS: 41 (multi-byte-read-stack-buffer-overflow)                                                                                                                                          
    #0 0x32eb38 in __interceptor_strlen.part.0 ubsan.c                                                                                                                                         
    #1 0x7f25ccb86507 in unsigned long folly::detail::constexpr_strlen_internal<char, 0ul>(char const*, int) fbcode/folly/portability/Constexpr.h:50                                           
    #2 0x7f25ccb864e6 in unsigned long folly::constexpr_strlen<char>(char const*) fbcode/folly/portability/Constexpr.h:106                                                                     
    #3 0x7f25ccb85209 in folly::Range<char const*>::Range(char const*) fbcode/folly/Range.h:303                                                                                                
    #4 0x7f25d6c69e62 in unsigned long folly::detail::TransparentRangeHash<char>::operator()<char const*>(char const* const&) const fbcode/folly/container/HeterogeneousAccess.h:125           
    #5 0x7f25d6c69a94 in unsigned long folly::f14::detail::VectorContainerPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::unique_ptr<facebook::velox::functions::detail::Stemmer, std::default_delete<facebook::velox::functions::detail::Stemmer>>, void, void, void, std::integral_constant<bool, true>>::computeKeyHash<char const*>(char const* const&) const fbcode/folly/container/detail/F14Policy.h:1135
    #6 0x7f25d6c678bb in find<const char *> fbcode/folly/container/detail/F14Table.h:1585                                                                                                      
    #7 0x7f25d6c678bb in find<const char *> fbcode/folly/container/F14Map.h:886                                                                                                                
    #8 0x7f25d6c678bb in facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec>::getStemmer(char const*) fbcode/velox/functions/prestosql/WordStem.h:120              
    #9 0x7f25d6d1b51f in doCall<true> fbcode/velox/functions/prestosql/WordStem.h:101                                                                                                          
    #10 0x7f25d6d1b51f in callAscii fbcode/velox/functions/prestosql/WordStem.h:93                                                                                                             
    #11 0x7f25d6d1b51f in callAsciiImpl fbcode/velox/core/SimpleFunctionMetadata.h:954                                                                                                         
    #12 0x7f25d6d1b51f in callAscii fbcode/velox/core/SimpleFunctionMetadata.h:904
    #13 0x7f25d6d1b51f in doApplyAsciiNotNull<2UL, facebook::velox::StringView, facebook::velox::StringView, 0> fbcode/velox/expression/SimpleFunctionAdapter.h:866
    #14 0x7f25d6d1b51f in doApplyAsciiNotNull<1UL, const facebook::velox::exec::ConstantVectorReader<facebook::velox::Varchar>, facebook::velox::StringView, 0> fbcode/velox/expression/SimpleFunctionAdapter.h:855
    #15 0x7f25d6d1b51f in doApplyAsciiNotNull<0UL, facebook::velox::exec::FlatVectorReader<facebook::velox::Varchar>, facebook::velox::exec::ConstantVectorReader<facebook::velox::Varchar>, 0> fbcode/velox/expression/SimpleFunctionAdapter.h:855
    #16 0x7f25d6d1b51f in operator()<facebook::velox::exec::StringWriter<false>, int> fbcode/velox/expression/SimpleFunctionAdapter.h:685
    #17 0x7f25d6d1b51f in operator()<int> fbcode/velox/expression/SimpleFunctionAdapter.h:721
    #18 0x7f25d6d1b51f in operator()<int> fbcode/velox/expression/EvalCtx.h:102

You can reproduce it by running fuzzer with ubsan enabled.

yhwang · 2024-04-18T18:47:52Z

@pedroerp thanks for the info. the code in velox/functions/prestosql/WordStem.h:120 is:

if (auto found = stemmers_.find(lang); found != stemmers_.end()) {

It's kind of interesting to me. Let me reproduce it and see how to fix it.

Edit
I just saw the expression: Executing expression 0 : word_stem("c0",MJw$$A,*3z;D)
Is the second parameter a string?

pedroerp · 2024-04-19T01:16:09Z

Is the second parameter a string?

I suppose so. Try to reproduce by running on that exact same version and same fuzzer seed. You'll probably need to run using clang and enable ubsan.

yhwang · 2024-04-22T21:06:25Z

@pedroerp sorry for the late response, I was on vacation last Friday. I tried to build velox_expression_fuzzer_test with UBSan today, but I hit some undefined symbol errors in the linking stage. Can you share the steps/instructions to build the fuzzer with UBSan enabled?

pedroerp · 2024-04-23T19:15:33Z

@yhwang unfortunately we don't have this set up in our github CI (yet), but you need to first compile using clang with the ubsan flags enabled, then run join fuzzer. These are the instructions on how to enable ubsan:

https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html

What is the issue you are seeing?

yhwang · 2024-04-23T22:44:29Z

@pedroerp

https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html

Thanks, this is the same doc that I followed. The unresolved symbols issue may be caused by the flex that comes with Xcode. I am using a Linux VM to build Velox now. The build looks good so far.

pedroerp · 2024-04-23T23:37:28Z

The unresolved symbols issue may be caused by the flex that comes with xcode

I think you need to compile all dependencies with the same ubsan flags. @kgpai and @assignUser would know this better.

yhwang · 2024-04-23T23:42:49Z

@pedroerp

I ran the following command:
_build/debug/velox/expression/tests/velox_expression_fuzzer_test --seed=2144156392 --only=word_stem --logtostdout=1 --steps=1000

and I only saw this from time to time:

I20240423 16:/home/yhwang/WS_git/presto/presto-native-execution/velox/velox/common/base/BitUtil.cpp:204:24: runtime error: load of misaligned address 0x7ffe3eb09834 for type 'const long unsigned int', which requires 8 byte alignment
0x7ffe3eb09834: note: pointer points here
  0a 00 00 00 59 28 46 76  27 77 30 3d 4a 66 00 00  00 00 00 00 00 00 00 00  87 98 b0 3e fe 7f 00 00
              ^~

and although I specified --seed, it is only used for the first iteration. Do you know how to use the same seed across all iterations?

I added this to the top-level CMakeLists.txt:

add_link_options(-fsanitize=undefined)
add_compile_options(-fsanitize=undefined)

And set VELOX_DEPENDENCY_SOURCE to BUNDLED

I guess this way it builds most of the deps as bundled and with the UBSan flag.

pedroerp · 2024-04-23T23:47:18Z

Do you know how to use the same seed across all iterations?

The seed is used for the first iteration, then a new seed is generated for the next iteration. They should deterministically reproduce the same chain of seeds if you're using the same binary with the same set of functions available.
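The seed chaining described above can be illustrated with a small sketch. This is not the Velox fuzzer's code: `seedChain()` is a hypothetical helper, and it assumes that each iteration reseeds a std::mt19937 and draws the next seed from it without any other draws in between. In a real fuzzer the number of draws per iteration depends on the expressions generated, which is why the chain only repeats when the binary and the set of available functions are the same.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Returns the seed used by each of 'iterations' iterations when only the
// first seed is supplied and every later seed is drawn from the generator.
std::vector<uint32_t> seedChain(uint32_t initialSeed, int iterations) {
  std::vector<uint32_t> seeds;
  std::mt19937 rng;
  uint32_t seed = initialSeed;
  for (int i = 0; i < iterations; ++i) {
    seeds.push_back(seed);
    rng.seed(seed);
    // A real iteration would generate and verify an expression here,
    // consuming random numbers from 'rng' along the way.
    seed = static_cast<uint32_t>(rng());
  }
  return seeds;
}
```

Because every later seed is a pure function of the first one, starting from 2144156392 reproduces the same chain on each run of this sketch, even though only the first iteration uses the value given on the command line.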

yhwang · 2024-04-23T23:53:11Z

Then I can only for-loop the command to see if I can hit that error with the same seed. So far, I haven't seen it yet.

kgpai · 2024-04-24T00:02:34Z

@yhwang You can see the flags we use here : https://github.com/facebookincubator/velox/blob/main/.github/workflows/scheduled.yml#L360

You can also set the flag in helper-functions.sh without modifying your cmake and set the deps to BUNDLED which will ensure everything has UBSAN (like you did!).

yhwang · 2024-04-24T06:40:06Z

I applied all the flags, but I only hit this error still:

/home/yhwang/WS_git/presto/presto-native-execution/velox/./velox/common/base/BitUtil.h:826:12: runtime error: store to misaligned address 0x564e2ecaefc3 for type 'short unsigned int', which requires 2 byte alignment
0x564e2ecaefc3: note: pointer points here
 ff  fd ff e7 ff ff ff ff ff  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff
              ^

Based on the error that Pedro posted earlier, I guess it's caused by the const char* retrieved from StringView::data(). Let me see if passing std::string would avoid the error.

Add snowball libstemmer v2.2.0 as one of the dependencies. And use it to implement the word_stem() as a scalar UDF. When using the libstemmer API, each language creates an sb_stemmer instance which consumes 114 bytes, including the default 10 bytes for the output stem. It uses the realloc to increase the memory block for the output stem if needed. Signed-off-by: Yihong Wang <[email protected]>

yhwang · 2024-04-25T22:54:13Z

@pedroerp I kept trying to recreate the error you posted above, but I can't reproduce it. The only error I saw is the one I mentioned: BitUtil.h:826. So the only thing I can do is change the lang param from const char* to std::string&. Could you help me verify the change on your end?

Another thing that seems odd is that the stack trace in your comment is different from what I see on my system. I paused at the word_stem function and stepped into the function calls one by one to simulate the stack trace you posted, and this is what I got (I didn't go all the way to the topmost frame you had):

folly::f14::detail::VectorContainerPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<facebook::velox::functions::detail::Stemmer, std::default_delete<facebook::velox::functions::detail::Stemmer> >, void, void, void, std::integral_constant<bool, true> >::computeKeyHash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(const folly::f14::detail::VectorContainerPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<facebook::velox::functions::detail::Stemmer, std::default_delete<facebook::velox::functions::detail::Stemmer> >, void, void, void, std::integral_constant<bool, true> > * const this, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > & key) (/home/yhwang/WS_git/presto/presto-native-execution/velox/_build/debug/_deps/folly-src/folly/container/detail/F14Policy.h:1134)
folly::f14::detail::F14Table<folly::f14::detail::VectorContainerPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<facebook::velox::functions::detail::Stemmer, std::default_delete<facebook::velox::functions::detail::Stemmer> >, void, void, void, std::integral_constant<bool, true> > >::find<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > & key, const folly::f14::detail::F14Table<folly::f14::detail::VectorContainerPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<facebook::velox::functions::detail::Stemmer, std::default_delete<facebook::velox::functions::detail::Stemmer> >, void, void, void, std::integral_constant<bool, true> > > * const this) (/home/yhwang/WS_git/presto/presto-native-execution/velox/_build/debug/_deps/folly-src/folly/container/detail/F14Table.h:1580)
folly::f14::detail::F14BasicMap<folly::f14::detail::VectorContainerPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<facebook::velox::functions::detail::Stemmer, std::default_delete<facebook::velox::functions::detail::Stemmer> >, void, void, void, std::integral_constant<bool, true> > >::find(const folly::f14::detail::F14BasicMap<folly::f14::detail::VectorContainerPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<facebook::velox::functions::detail::Stemmer, std::default_delete<facebook::velox::functions::detail::Stemmer> >, void, void, void, std::integral_constant<bool, true> > >::key_type & key, folly::f14::detail::F14BasicMap<folly::f14::detail::VectorContainerPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<facebook::velox::functions::detail::Stemmer, std::default_delete<facebook::velox::functions::detail::Stemmer> >, void, void, void, std::integral_constant<bool, true> > > * const this) (/home/yhwang/WS_git/presto/presto-native-execution/velox/_build/debug/_deps/folly-src/folly/container/F14Map.h:854)
facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec>::getStemmer(facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec> * const this, const std::string & lang) (/home/yhwang/WS_git/presto/presto-native-execution/velox/velox/functions/prestosql/WordStem.h:120)
facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec>::doCall<true>(const std::string & lang, const facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec>::arg_type & input, facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec>::out_type & result, facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec> * const this) (/home/yhwang/WS_git/presto/presto-native-execution/velox/velox/functions/prestosql/WordStem.h:101)
facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec>::callAscii(const facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec>::arg_type & input, facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec>::out_type & result, facebook::velox::functions::WordStemFunction<facebook::velox::exec::VectorExec> * const this) (/home/yhwang/WS_git/presto/presto-native-execution/velox/velox/functions/prestosql/WordStem.h:79)

You can see the line numbers in those folly files are different. I am using the folly version from the main branch which is v2024.04.01.00. Is it possible the error is from the different folly version?

facebook-github-bot · 2024-04-26T20:41:50Z

@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

pedroerp · 2024-04-27T00:00:43Z

@yhwang I ran some more tests. The problem was that in the previous version you were taking a std::string or StringView's buffer (which does not necessarily have a \0 at the end) and using it in the function sb_stemmer_new(), which expects a C-style string terminated by a \0.

Using c_str() instead like what you're doing now fixes the issue. I'll get it merged today.
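A minimal sketch of this failure mode, assuming a stand-in for the C API: `cApiSeenLength()` is a hypothetical function that, like any C function taking a const char*, reads until the first \0. The (data, size) pair models a string view whose bytes are not terminated at `size`. To keep the sketch free of undefined behavior, the backing buffer is a longer, properly terminated literal, so the over-read is visible as a wrong length rather than as a sanitizer report.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a C API that expects a NUL-terminated string.
// It reports how many characters it would read.
std::size_t cApiSeenLength(const char* name) {
  return std::strlen(name);
}

// Problematic pattern: the raw pointer of a (data, size) view is passed
// straight to the C API, which ignores 'size' and reads until the first '\0'.
std::size_t lengthSeenViaData(const char* data, std::size_t size) {
  (void)size; // The C API never learns the intended length.
  return cApiSeenLength(data);
}

// Fix: copy the bytes into a std::string, whose c_str() is guaranteed to be
// NUL-terminated exactly after 'size' characters.
std::size_t lengthSeenViaCStr(const char* data, std::size_t size) {
  const std::string owned(data, size);
  return cApiSeenLength(owned.c_str());
}
```

For a view of the first 7 bytes of the buffer "englishEXTRA", the first helper sees 12 characters while the second sees the intended 7. When the bytes after the view are not followed by a \0 at all, the first pattern reads past the buffer, which matches the strlen frame at the top of the sanitizer trace earlier in this thread.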

yhwang · 2024-04-27T00:20:42Z

@pedroerp thanks for the verification and explanation. So my guess is right :-), although I can’t reproduce the error.

conbench-facebook · 2024-04-27T03:38:46Z

Conbench analyzed the 1 benchmark run on commit 47970417.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

This reverts commit 4797041.

facebook-github-bot · 2024-04-30T22:13:33Z

@pedroerp merged this pull request in 4797041.

Summary: Add snowball libstemmer v2.2.0 as one of the dependencies. And use it to implement the word_stem() as a scalar UDF. When using the libstemmer API, each language creates an `sb_stemmer` instance which consumes 114 bytes, including the default 10 bytes for the output stem. It uses the `realloc` to increase the memory block for the output stem if needed. Fixes facebookincubator#8487 Pull Request resolved: facebookincubator#9363 Reviewed By: amitkdutta Differential Revision: D56059511 Pulled By: pedroerp fbshipit-source-id: b3a66956c3809e3f3dadfc8cc7b397b7116996d5

facebook-github-bot added the CLA Signed label Apr 4, 2024

yhwang force-pushed the word_stem-impl branch 2 times, most recently from bedcb88 to 768f5b1 Compare April 4, 2024 22:43

mbasmanova requested review from kgpai and assignUser April 9, 2024 23:22

mbasmanova reviewed Apr 9, 2024

velox/docs/functions/presto/string.rst (outdated, resolved)

velox/docs/functions/presto/string.rst (outdated, resolved)

mbasmanova reviewed Apr 9, 2024

mbasmanova changed the title ~~Add word_stem() implementation~~ Add word_stem Presto function Apr 9, 2024

mbasmanova reviewed Apr 10, 2024

velox/functions/prestosql/StringFunctions.h (outdated, resolved)

assignUser requested changes Apr 10, 2024

yhwang requested a review from assignUser April 10, 2024 06:27

assignUser approved these changes Apr 10, 2024

yhwang force-pushed the word_stem-impl branch from 4546f1e to d79dfe4 Compare April 10, 2024 18:09

mbasmanova added the ready-to-merge label Apr 12, 2024

yhwang force-pushed the word_stem-impl branch from 8e76469 to 45d5492 Compare April 12, 2024 16:04

yhwang force-pushed the word_stem-impl branch from 45d5492 to 2c14f09 Compare April 25, 2024 22:40

facebook-github-bot closed this in 4797041 Apr 27, 2024

zhztheplayer added a commit to zhztheplayer/velox that referenced this pull request Apr 28, 2024

Revert "Add word_stem Presto function (facebookincubator#9363)"

e2925d5

This reverts commit 4797041.

yhwang deleted the word_stem-impl branch April 30, 2024 20:52

facebook-github-bot added the Merged label Apr 30, 2024

zhli1142015 pushed a commit to zhli1142015/velox that referenced this pull request Jul 9, 2024

Revert "Add word_stem Presto function (facebookincubator#9363)"

b0698b1

This reverts commit 4797041.
