-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add word_stem Presto function #9363
Conversation
Hi @yhwang! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at [email protected]. Thanks! |
✅ Deploy Preview for meta-velox ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
bedcb88
to
768f5b1
Compare
/// or create a new one if it doesn't exist. Return NULL if the | ||
/// specified lang is not supported. | ||
static Stemmer* getStemmer(const char* lang) { | ||
thread_local std::map<std::string, Stemmer*> stemmers; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't clean up this thread-local map which contains the Stemmer instances for each language. I assume a working thread is shutting down with the Presto process. Not sure if it's worth adding the cleanup code while tearing down a thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it is a good idea to use thread local variable here. CC: @bikramSingh91
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/zvelo/libstemmer
Creating a stemmer is a relatively expensive operation - the expected
usage pattern is that a new stemmer is created when needed, used
to stem many words, and deleted after some time.
Stemmers are re-entrant, but not threadsafe. In other words, if
you wish to access the same stemmer object from multiple threads,
you must ensure that all access is protected by a mutex or similar
device.
Based on this information, it seems we could add a member variable to 'WordStemFunction' to store Stemmer instances.
Let's use std::unique_ptr to make cleanup automatic.
Similar to regex functions, we may consider only supporting constant values for 'lang' and create a single Stemmer instance in 'initialize'.
Alternatively, we may allow up to of different languages per function instance. See https://facebookincubator.github.io/velox/functions/presto/regexp.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on this information, it seems we could add a member variable to 'WordStemFunction' to store Stemmer instances.
that's the reason I use thread-local to store the lang<-->stemmer map
Let's use std::unique_ptr to make cleanup automatic.
let me change the map to <string, unique_ptr<Stemmer>>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To add more context against using thread_local variables liberally is that we observed that storing thread local variables inside objects that can be moved between threads can cause an unexpected increase in memory usage as whenever they are accessed in another thread, they would be created and would exist for the lifetime of the thread. eg of an issue we recently fixed: #7646
In this case as well, the driver executing expression eval (and this code) can bounce between multiple available threads (we use folly::executor to execute drivers), and each time it would end up creating new instances. Unlike #7646, I do not expect this hold on to quite as much memory but nevertheless it would be a good idea to avoid this kind of usage pattern unless there is no other alternative. Using a unique_ptr as @mbasmanova suggested would help us to have tighter control over its lifetime by tying it to the lifetime of the Expression and therefore more predictable memory usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bikramSingh91 I need more clarification on this. In the code here, the thread-local variable is not inside any objects nor be moved between threads. Since the possible number of stemmer instances is fixed, depends on how many languages are used in a running system and we only support 20 languages. The cost in each thread should be a fixed amount of memory. Each time a stemmer is used, it resues an allocated memory to store the stem results, and then the result is copied to the StringWrite. I guess it's different from the PR you referred to. I had a commit to make the map to <string, unique_ptr<Stemmer>>
, how do you think after the change?
std::string lowerOutput; | ||
stringImpl::lower<isAscii>(lowerOutput, input); | ||
auto rev = stemmer->stem(lowerOutput); | ||
if (rev == NULL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I could put UNLIKELY
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kgpai @assignUser Krishna, Jacob, would you help review build changes?
@yhwang let's add the new dependency to CMake/resolve_dependency_modules/README.md
@kgpai Krishna, would you help look into what would it take to add this dependency to Meta's repo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yhwang Thank you for working on this. Some initial questions. Primarily, I'd like to understand how we are allocating memory and what may cause OOM scenario.
CC: @xiaoxmeng
/// or create a new one if it doesn't exist. Return NULL if the | ||
/// specified lang is not supported. | ||
static Stemmer* getStemmer(const char* lang) { | ||
thread_local std::map<std::string, Stemmer*> stemmers; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it is a good idea to use thread local variable here. CC: @bikramSingh91
stringImpl::lower<isAscii>(lowerOutput, input); | ||
auto rev = stemmer->stem(lowerOutput); | ||
if (rev == NULL) { | ||
throw std::runtime_error("out of memory"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you explain how memory allocations are happening here? We need to make sure any large allocations go through Velox's Memory Pool. How can we run out of memory here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enclosed its API doc here:
/** Stem a word.
*
* The return value is owned by the stemmer - it must not be freed or
* modified, and it will become invalid when the stemmer is called again,
* or if the stemmer is freed.
*
* The length of the return value can be obtained using sb_stemmer_length().
*
* If an out-of-memory error occurs, this will return NULL.
*/
const sb_symbol * sb_stemmer_stem(struct sb_stemmer * stemmer,
const sb_symbol * word, int size);
that's why I throw an out of memory
exception when the returned value is NULL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use VELOX_CHECK_NOT_NULL(stem, "Stemmer library returned a NULL (out-of-memory)."
We need to make it clear that the error comes from Stemmer library and not from Velox core.
Let's also figure out how much memory Stemmer instances use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use VELOX_CHECK_NOT_NULL(stem, "Stemmer library returned a NULL (out-of-memory)."
Sure thing!
Let's also figure out how much memory Stemmer instances use.
I guess this would be a tricky one. I will dig more on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can write a simple program that uses stemmer lib to stem some words and capture process-wide memory usage before creating stemmer instance, after, then again after processing some words, then after deleting stemmer instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can also look into its implementation and see how and how much it allocates for storing the stem. since each stemmer reuses the memory block, it shouldn't consume that much.
CC: @amitkdutta |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments for the cmake. I checked the timestamps in the debug build and stemmer build in a few seconds, so we can leave it on BUNDLED for now.
thanks for the comments. I will address them in separate commits for an easier review process. I will squash all commits into one when the code is ready for merge. @assignUser I addressed your comments in this commit: Address Cmake comments I will work on other comments later. |
The python extension is always build as a shared library, could we build libstemmer with fpic? https://github.com/facebookincubator/velox/actions/runs/8626883438/job/23645998539?pr=9363#step:13:3508 |
@assignUser good question. I didn't think about that since libstemmer comes with the Makefile that only builds static lib. Do you think we should tweak it to build a so? |
@mbasmanova thanks for the comments. I added another commit here to address most of your comments: Update word_stem impl to address comments I guess the things left are:
The thread-local variable here is not inside any objects and is not moved between threads. Or my understanding is wrong? |
Definitely not. Please, take a look at @bikramSingh91's explanation. |
I guess my point is I don't see a direct relationship between this change and the thread-local issue mentioned above. I'd like to get more clarification. The stemmers in the use case here would only take a fixed amount of memory, it won't grow infinitely. Just need to know if that's the main concern. |
@yhwang it doesn't look like it supports shared library but patching the make file with - CFLAGS=-Iinclude
+ CFLAGS=-Iinclude -fPIC should be enough for the linker to figure stuff out when building the extension .so |
It appears that a static method is responsible for maintaining the thread-local variables specific to each thread. These variables are accessed each time the simple function is invoked. Although these variables are not contained within an object, the driver that carries this expression can be executed on a different thread for the subsequent batch of input vectors, leading to the creation of additional instances. The maximum memory these objects can hold is approximately equal to (num_of_langs * threads). While this might constitute a small amount, the presence of thread-local variables complicates the understanding of their lifespan, ownership, and the number of copies created. The alternative we proposed ties the ownership to the Expression object itself, thereby making these aspects more predictable while ensuring thread-safe access. TLDR: the concern is less about the memory utilization and more about developing code that we can confidently reason about. We would appreciate any insights into specific use-cases or advantages that we might have overlooked, which would make the use of thread-local variables more appealing. For instance, is their purpose to ensure that these objects are shared among potentially different expressions objects within the same thread to save on the cost of creating the stemmer objects? |
@bikramSingh91 thanks for the clarification. let me simplify it to what the lifespan of a stemmer should be.
For me, maintaining the logic in the WordStem UDF struct is more straightforward and clean. Not sure if expression usage makes the code separate in multiple places. But I am good with either choice.
This is the API doc from libstemmer:
@mbasmanova posted this above and this is also the main reason I use thread-local map to store the stemmers, avoid mutex, and keep the code simple to read. Again, I am good with either expression or thread-local, since we agree memory shouldn't be the concern, and appreciate your clarification. |
Understood. Different repos usually have different rule/bot. I worked on other repos having bots to squash commits and use the PR description and commit messages as the final commit. Good to know rule/settings here.
I guess there is a way to fulfill both, I could squash reviewed commits into one and have only new commits for review :-) . I use single commit from time to time. But I found the force push diff is not that user-friendly. If the change is small, I prefer a single commit too. Thanks for the review and comments (@assignUser too). I learned a lot. It's also a good practice for me. I updated the comment and squashed my commits (I know I don't need to do that) |
@yhwang looks like our internal fuzzer caught a crash with your function:
You can reproduce it by running fuzzer with ubsan enabled. |
@pedroerp thanks for the info. the code in
It's kind of interesting to me. Let me reproduce it and see how to fix it. Edit |
I suppose so. Try to reproduce by running on that exact same version and same fuzzer seed. You'll probably need to run using clang and enable ubsan. |
@pedroerp sorry for the late response, I was on vacation last Friday. I tried to build |
@yhwang unfortunately we don't have this set up in our github CI (yet), but you need to first compile using clang with the ubsan flags enabled, then run join fuzzer. These are the instructions on how to enable ubsan: https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html What is the issue you are seeing? |
Thanks and this is the same doc that I followed. The unresolved symbols issue may be caused by the flex that comes with xcode. I am using a Linux VM to build the velox now. The build looks good so far. |
I think you need to compile all dependencies with the same ubsan flags. @kgpai and @assignUser would know this better. |
I ran the following command: and I only saw this from time to time:
and although I specified I added this to the very top CMakeLists.xt:
And set I guess this way it builds most of the deps as bundles and using the UBSan flag. |
The seed is used for the first iteration, then a new seed is generated for the next iteration. They should deterministically reproduce the same chain of seeds if you're using the same binary with the same set of functions available. |
Then I can only for-loop the command to see if I can hit that error with the same seed. So far, I haven't seen it yet. |
@yhwang You can see the flags we use here : https://github.com/facebookincubator/velox/blob/main/.github/workflows/scheduled.yml#L360 You can also set the flag in helper-functions.sh without modifying your cmake and set the deps to BUNDLED which will ensure everything has UBSAN (like you did!). |
I applied all the flags, but I only hit this error still:
Based on the error that Pedro posted earlier, I guess it's caused by the |
Add snowball libstemmer v2.2.0 as one of the dependencies. And use it to implement the word_stem() as a scalar UDF. When using the libstemmer API, each language creates an sb_stemmer instance which consumes 114 bytes, including the default 10 bytes for the output stem. It uses the realloc to increase the memory block for the output stem if needed. Signed-off-by: Yihong Wang <[email protected]>
@pedroerp I kept trying to recreate the error you posted above. But I can't reproduce it. The only error I saw is the one I mentioned: Another thing I feel weird is the stack trace in your comment is different from what I saw in my system. I paused at the word_stem function and stepped into the function call one by one to simulate the stack trace you posted. and this is what I got (I didn't go all the way to the very top stack trace you had):
You can see the line numbers in those folly files are different. I am using the folly version from the main branch which is |
@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@yhwang I run some more tests. The problem was because on the previous version you were taking a std::string or StringView's buffer (which do not necessarily have a \0 at the end), and using it in the function sb_stemmer_new() which expects a C-like string ended by a \0. Using |
@pedroerp thanks for the verification and explanation. So my guess is right :-), although I can’t reproduce the error. |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
This reverts commit 4797041.
Summary: Add snowball libstemmer v2.2.0 as one of the dependencies. And use it to implement the word_stem() as a scalar UDF. When using the libstemmer API, each language creates an `sb_stemmer` instance which consumes 114 bytes, including the default 10 bytes for the output stem. It uses the `realloc` to increase the memory block for the output stem if needed. Fixes facebookincubator#8487 Pull Request resolved: facebookincubator#9363 Reviewed By: amitkdutta Differential Revision: D56059511 Pulled By: pedroerp fbshipit-source-id: b3a66956c3809e3f3dadfc8cc7b397b7116996d5
This reverts commit 4797041.
Add snowball libstemmer v2.2.0 as one of the dependencies.
And use it to implement the word_stem() as a scalar UDF.
When using the libstemmer API, each language creates an
sb_stemmer
instance which consumes 114 bytes, including the default 10 bytes for the output stem.
It uses the
realloc
to increase the memory block for the output stem if needed.Fixes #8487