-
-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: Adding additional languages support #27
base: main
Are you sure you want to change the base?
Conversation
Yeah, let's rebase this on the new main, that should remove some noise. |
@bleroy OK, we're down to the original 6 commits on this PR now. |
Thanks, I'll review ASAP! |
…ages for ease of porting
The day job fairy has bitten me again... but my priority is figuring out why the German stemmer port fails, I only know how for now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for this massive amount of work. This is glorious.
LunrCore/Multi/Among.cs
Outdated
|
||
namespace Lunr.Multi | ||
{ | ||
public readonly struct Among : IEquatable<Among> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some doc comments on all new classes and methods?
private const string Data = @"، اض امين اه اها اي ا اب اجل اجمع اخ اخذ اصبح اضحى اقبل اقل اكثر الا ام اما امامك امامك امسى اما ان انا انت انتم انتما انتن انت انشا انى او اوشك اولئك اولئكم اولاء اولالك اوه اي ايا اين اينما اي ان اي اف اذ اذا اذا اذما اذن الى اليكم اليكما اليكن اليك اليك الا اما ان انما اي اياك اياكم اياكما اياكن ايانا اياه اياها اياهم اياهما اياهن اياي ايه ان ا ابتدا اثر اجل احد اخرى اخلولق اذا اربعة ارتد استحال اطار اعادة اعلنت اف اكثر اكد الالاء الالى الا الاخيرة الان الاول الاولى التى التي الثاني الثانية الذاتي الذى الذي الذين السابق الف اللائي اللاتي اللتان اللتيا اللتين اللذان اللذين اللواتي الماضي المقبل الوقت الى اليوم اما امام امس ان انبرى انقلب انه انها او اول اي ايار ايام ايضا ب بات باسم بان بخ برس بسبب بس بشكل بضع بطان بعد بعض بك بكم بكما بكن بل بلى بما بماذا بمن بن بنا به بها بي بيد بين بس بله بئس تان تانك تبدل تجاه تحول تلقاء تلك تلكم تلكما تم تينك تين ته تي ثلاثة ثم ثم ثمة ثم جعل جلل جميع جير حار حاشا حاليا حاي حتى حرى حسب حم حوالى حول حيث حيثما حين حي حبذا حتى حذار خلا خلال دون دونك ذا ذات ذاك ذانك ذان ذلك ذلكم ذلكما ذلكن ذو ذوا ذواتا ذواتي ذيت ذينك ذين ذه ذي راح رجع رويدك ريث رب زيارة سبحان سرعان سنة سنوات سوف سوى ساء ساءما شبه شخصا شرع شتان صار صباح صفر صه صه ضد ضمن طاق طالما طفق طق ظل عاد عام عاما عامة عدا عدة عدد عدم عسى عشر عشرة علق على عليك عليه عليها عل عن عند عندما عوض عين عدس عما غدا غير ف فان فلان فو فى في فيم فيما فيه فيها قال قام قبل قد قط قلما قوة كانما كاين كاي كاين كاد كان كانت كذا كذلك كرب كل كلا كلاهما كلتا كلم كليكما كليهما كلما كلا كم كما كي كيت كيف كيفما كان كخ لئن لا لات لاسيما لدن لدى لعمر لقاء لك لكم لكما لكن لكنما لكي لكيلا للامم لم لما لما لن لنا له لها لو لوكالة لولا لوما لي لست لست لستم لستما لستن لست لسن لعل لكن ليت ليس ليسا ليستا ليست ليسوا لسنا ما ماانفك مابرح مادام ماذا مازال مافتئ مايو متى مثل مذ مساء مع معاذ مقابل مكانكم مكانكما مكانكن مكانك مليار مليون مما ممن من منذ منها مه مهما من من نحن نحو نعم نفس نفسه نهاية نخ نعما نعم ها هاؤم هاك هاهنا هب هذا هذه هكذا هل هلم هلا هم هما هن هنا هناك هنالك هو هي هيا هيت هيا هؤلاء هاتان هاتين هاته هاتي هج هذا هذان هذين هذه هذي هيهات و وا واحد واضاف واضافت واكد وان واها واوضح وراءك وفي وقال وقالت وقد وقف وكان وكانت ولا ولم ومن وهو وهي ويكان وي وشكان يكون يمكن يوم ايان"; | ||
private static readonly ISet<string> WordList = new Set<string>(Data.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bummer if it's impractical to replace that with direct initialization with an array of strings instead of having to split a string (not just for Arabic).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was an expedient to porting since the original uses strings. Also, Visual Studio is respecting RTL making splitting the Arabic a bit of a mind warp for me. Now that all stemmers are ported, I can fix it using a light codegen unit test or something similar. Stay tuned.
@bleroy update: I've addressed a good chunk of your feedback. The remaining work is in the specific idiomatic conversions for Snowball and its implementations. Some of it I can intuit, some of it is straight-up magic to me, so I need some time to read the source for the original code and decipher the intentions. |
indent_style=tab | ||
indent_size=tab |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aaaaargh! Nooooooo! Spaces, please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed one of the better parts of SV...
@danielcrenna Still working on this by chance? |
This and the mutable indexes work is still in my sight line, I've had a tumultuous last few months, with the project I'm using this. Thanks for the reminder, I've got it re-queued in my task list. Last time the stumbling block was some errors in the snowball port that are hard to decipher, as I'm not perfectly familiar with its inner operations. But your message has me back on it. |
@danielcrenna A quick 4 month checkin! Are you still looking to get this PR merged (if only in part). I was planning on some more performance related improvements and wanted to make sure we reduce our changes of a conflict. |
This PR follows #26 since it is close to merging. If you prefer I re-open this with just the multi-language commits, let me know, otherwise this can wait for a merge before reviewing.
This is an initial implementation with tests for multi-language support and a French implementation, both ported from https://github.com/MihaiValentin/lunr-languages.
My notes:
I left
Among
andSnowballProgram
variable and method naming conventions, but this is up to you how you want to deal with these, they are a little esoteric.The tests are light in the original project. I adopted your tests for trimmer and stop word list, there are no explicit tests for stemmers: they are only exercised via the search tests.
The original project has stemmer tests but they are pinned to Russian, so we could add them when Russian is ported.
I added
Index.Fr.Build...
convenient overload that is the same asIndex.Build
but plugs in the French language alternates without having to explicitly provide them. This is a matter of taste but intends to mimiclunr.fr...
We need to add attribution for https://github.com/MihaiValentin/lunr-languages in README.md, but it itself is a curation of multiple other language libraries, so I'll leave it to you to decide how to attribute properly.