Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Adding additional languages support #27

Draft
wants to merge 35 commits into
base: main
Choose a base branch
from

Conversation

danielcrenna
Copy link
Contributor

This PR follows #26 since it is close to merging. If you prefer I re-open this with just the multi-language commits, let me know, otherwise this can wait for a merge before reviewing.

This is an initial implementation with tests for multi-language support and a French implementation, both ported from https://github.com/MihaiValentin/lunr-languages.

My notes:

  • I left Among and SnowballProgram variable and method naming conventions, but this is up to you how you want to deal with these, they are a little esoteric.

  • The tests are light in the original project. I adopted your tests for trimmer and stop word list, there are no explicit tests for stemmers: they are only exercised via the search tests.

  • The original project has stemmer tests but they are pinned to Russian, so we could add them when Russian is ported.

  • I added Index.Fr.Build... convenient overload that is the same as Index.Build but plugs in the French language alternates without having to explicitly provide them. This is a matter of taste but intends to mimic lunr.fr...

  • We need to add attribution for https://github.com/MihaiValentin/lunr-languages in README.md, but it itself is a curation of multiple other language libraries, so I'll leave it to you to decide how to attribute properly.

@bleroy
Copy link
Owner

bleroy commented Jan 1, 2021

Yeah, let's rebase this on the new main, that should remove some noise.

@danielcrenna
Copy link
Contributor Author

@bleroy OK, we're down to the original 6 commits on this PR now.

@bleroy
Copy link
Owner

bleroy commented Jan 1, 2021

Thanks, I'll review ASAP!

LunrCore/Index.cs Outdated Show resolved Hide resolved
@danielcrenna
Copy link
Contributor Author

The day job fairy has bitten me again... but my priority is figuring out why the German stemmer port fails, I only know how for now.

@danielcrenna danielcrenna changed the title Proposal: Multi-language support + French Proposal: Adding additional languages support Jan 6, 2021
Copy link
Owner

@bleroy bleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this massive amount of work. This is glorious.


namespace Lunr.Multi
{
public readonly struct Among : IEquatable<Among>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some doc comments on all new classes and methods?

LunrCore/Multi/Among.cs Outdated Show resolved Hide resolved
LunrCore/Multi/Among.cs Outdated Show resolved Hide resolved
LunrCore/Multi/Among.cs Outdated Show resolved Hide resolved
Comment on lines 7 to 8
private const string Data = @"، اض امين اه اها اي ا اب اجل اجمع اخ اخذ اصبح اضحى اقبل اقل اكثر الا ام اما امامك امامك امسى اما ان انا انت انتم انتما انتن انت انشا انى او اوشك اولئك اولئكم اولاء اولالك اوه اي ايا اين اينما اي ان اي اف اذ اذا اذا اذما اذن الى اليكم اليكما اليكن اليك اليك الا اما ان انما اي اياك اياكم اياكما اياكن ايانا اياه اياها اياهم اياهما اياهن اياي ايه ان ا ابتدا اثر اجل احد اخرى اخلولق اذا اربعة ارتد استحال اطار اعادة اعلنت اف اكثر اكد الالاء الالى الا الاخيرة الان الاول الاولى التى التي الثاني الثانية الذاتي الذى الذي الذين السابق الف اللائي اللاتي اللتان اللتيا اللتين اللذان اللذين اللواتي الماضي المقبل الوقت الى اليوم اما امام امس ان انبرى انقلب انه انها او اول اي ايار ايام ايضا ب بات باسم بان بخ برس بسبب بس بشكل بضع بطان بعد بعض بك بكم بكما بكن بل بلى بما بماذا بمن بن بنا به بها بي بيد بين بس بله بئس تان تانك تبدل تجاه تحول تلقاء تلك تلكم تلكما تم تينك تين ته تي ثلاثة ثم ثم ثمة ثم جعل جلل جميع جير حار حاشا حاليا حاي حتى حرى حسب حم حوالى حول حيث حيثما حين حي حبذا حتى حذار خلا خلال دون دونك ذا ذات ذاك ذانك ذان ذلك ذلكم ذلكما ذلكن ذو ذوا ذواتا ذواتي ذيت ذينك ذين ذه ذي راح رجع رويدك ريث رب زيارة سبحان سرعان سنة سنوات سوف سوى ساء ساءما شبه شخصا شرع شتان صار صباح صفر صه صه ضد ضمن طاق طالما طفق طق ظل عاد عام عاما عامة عدا عدة عدد عدم عسى عشر عشرة علق على عليك عليه عليها عل عن عند عندما عوض عين عدس عما غدا غير ف فان فلان فو فى في فيم فيما فيه فيها قال قام قبل قد قط قلما قوة كانما كاين كاي كاين كاد كان كانت كذا كذلك كرب كل كلا كلاهما كلتا كلم كليكما كليهما كلما كلا كم كما كي كيت كيف كيفما كان كخ لئن لا لات لاسيما لدن لدى لعمر لقاء لك لكم لكما لكن لكنما لكي لكيلا للامم لم لما لما لن لنا له لها لو لوكالة لولا لوما لي لست لست لستم لستما لستن لست لسن لعل لكن ليت ليس ليسا ليستا ليست ليسوا لسنا ما ماانفك مابرح مادام ماذا مازال مافتئ مايو متى مثل مذ مساء مع معاذ مقابل مكانكم مكانكما مكانكن مكانك مليار مليون مما ممن من منذ منها مه مهما من من نحن نحو نعم نفس نفسه نهاية نخ نعما نعم ها هاؤم هاك هاهنا هب هذا هذه هكذا هل هلم هلا هم هما هن هنا هناك هنالك هو هي هيا هيت هيا هؤلاء هاتان هاتين هاته هاتي هج هذا هذان هذين هذه هذي هيهات و وا واحد واضاف واضافت واكد وان واها واوضح وراءك وفي وقال وقالت وقد وقف وكان وكانت ولا ولم ومن وهو وهي ويكان وي وشكان يكون يمكن يوم ايان";
private static readonly ISet<string> WordList = new Set<string>(Data.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bummer if it's impractical to replace that with direct initialization with an array of strings instead of having to split a string (not just for Arabic).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was an expedient to porting since the original uses strings. Also, Visual Studio is respecting RTL making splitting the Arabic a bit of a mind warp for me. Now that all stemmers are ported, I can fix it using a light codegen unit test or something similar. Stay tuned.

LunrCore/Globalization/fr/FrenchStemmer.cs Outdated Show resolved Hide resolved
LunrCore/Globalization/fr/FrenchStemmer.cs Outdated Show resolved Hide resolved
LunrCore/Index.cs Outdated Show resolved Hide resolved
LunrCoreLmdb/Globalization/fr/LmdbIndex.cs Outdated Show resolved Hide resolved
LunrCoreLmdb/LmdbIndex.cs Outdated Show resolved Hide resolved
@danielcrenna danielcrenna marked this pull request as draft January 9, 2021 20:52
@danielcrenna
Copy link
Contributor Author

@bleroy update: I've addressed a good chunk of your feedback.

The remaining work is in the specific idiomatic conversions for Snowball and its implementations. Some of it I can intuit, some of it is straight-up magic to me, so I need some time to read the source for the original code and decipher the intentions.

Comment on lines +4 to +5
indent_style=tab
indent_size=tab
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aaaaargh! Nooooooo! Spaces, please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed one of the better parts of SV...

@iamcarbon
Copy link
Contributor

@danielcrenna Still working on this by chance?

@danielcrenna
Copy link
Contributor Author

@danielcrenna Still working on this by chance?

This and the mutable indexes work is still in my sight line, I've had a tumultuous last few months, with the project I'm using this.

Thanks for the reminder, I've got it re-queued in my task list. Last time the stumbling block was some errors in the snowball port that are hard to decipher, as I'm not perfectly familiar with its inner operations.

But your message has me back on it.

@iamcarbon
Copy link
Contributor

@danielcrenna A quick 4 month checkin! Are you still looking to get this PR merged (if only in part). I was planning on some more performance related improvements and wanted to make sure we reduce our changes of a conflict.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants