v1.0 API Proposal #54

Open
polm opened this issue May 27, 2024 · 0 comments

Cutlet has been out for a few years now. I consider it basically functionally complete, but the API is a little awkward because it has evolved over time. Since the library is stable, I'd also like to release a v1.0 to signal that the API can be relied on going forward. This issue lays out my proposal and is also the place to give feedback.

This is not a full API proposal: most of the changes will be iterative and minor, like cleaning up which functions are public vs. private. The main thing I want to do is make the treatment of the different output options clearer. To that end, I propose that the Cutlet object have the following main public methods:

  • __call__ / to_doc: returns a CutletDoc (see below)
  • to_romaji: returns a human-readable string, like romaji now
  • to_slug: returns a machine-friendly string, like slug now
  • to_nodes: returns a list of nodes, like romaji_tokens now

A CutletDoc is inspired by a spaCy Doc object and contains:

  • raw input text
  • normalized input text
  • romaji/slug/nodes (lazily available, where appropriate)
  • a reference to the generating Cutlet object (so you can check config options)

The CutletDoc object has two main advantages. One is that if you need two of the above output formats, it lets you avoid duplicate computation (MeCab calls) without having to manage state yourself. The other is that it can codify how MeCab tokens are linked to romaji tokens. The linking is very simple, but it's a commonly requested feature (#34, #37, #40, etc.), and (partly due to lack of examples on my part) users often find it confusing, so it would be good to provide a canonical process.
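To make the shape of this concrete, here is a minimal sketch of how a lazily evaluated CutletDoc might behave. Nothing here is the actual implementation: `FakeTagger` and `FakeNode` are stand-ins for MeCab, the attribute names (`text`, `normalized`, `nodes`, `romaji`, `slug`, `cutlet`) are guesses based on the list above, and NFKC normalization is only an assumption about what "normalized" could mean.

```python
import unicodedata
from dataclasses import dataclass
from functools import cached_property


@dataclass(frozen=True)
class FakeNode:
    """Stand-in for a MeCab token; real nodes carry far more information."""
    surface: str
    reading: str  # romanized form, hard-coded for this sketch


class FakeTagger:
    """Stand-in for MeCab that counts how many times it is called."""
    LEXICON = {"カツ": "katsu", "カレー": "karee"}

    def __init__(self):
        self.calls = 0

    def __call__(self, text):
        self.calls += 1
        # Whitespace splitting is a placeholder for real tokenization.
        return [FakeNode(w, self.LEXICON.get(w, w)) for w in text.split()]


class CutletDoc:
    """Hypothetical doc object: every output is derived lazily and cached."""

    def __init__(self, text, cutlet):
        self.text = text      # raw input text
        self.cutlet = cutlet  # reference to the generating object

    @cached_property
    def normalized(self):
        # Assumption: NFKC stands in for whatever normalization is used.
        return unicodedata.normalize("NFKC", self.text).strip()

    @cached_property
    def nodes(self):
        # Each entry links a source token to its romaji.
        return [(n, n.reading) for n in self.cutlet.tagger(self.normalized)]

    @cached_property
    def romaji(self):
        return " ".join(r for _, r in self.nodes).capitalize()

    @cached_property
    def slug(self):
        return "-".join(r for _, r in self.nodes)


class Cutlet:
    """Hypothetical v1.0 surface: every method goes through to_doc."""

    def __init__(self, tagger):
        self.tagger = tagger

    def to_doc(self, text):
        return CutletDoc(text, self)

    __call__ = to_doc

    def to_romaji(self, text):
        return self.to_doc(text).romaji

    def to_slug(self, text):
        return self.to_doc(text).slug

    def to_nodes(self, text):
        return self.to_doc(text).nodes


tagger = FakeTagger()
katsu = Cutlet(tagger)
doc = katsu("カツ カレー")
print(doc.romaji, doc.slug, tagger.calls)
```

In this sketch, reading both `doc.romaji` and `doc.slug` triggers only one tagger call, because both are derived from the cached `nodes` list, and each entry in `nodes` pairs the source token with its romaji.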

Separately, I will try making RomajiTokens proxy classes for MeCab tokens. I think this will work without issue, but it's possible that MeCab Nodes being Cython objects will be a problem.
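As a rough illustration of the proxy idea, the sketch below wraps a token by composition and forwards unknown attribute lookups to it. `RomajiTokenProxy` and its fields are hypothetical names, and a `SimpleNamespace` stands in for a MeCab node, so this says nothing about whether real Cython-backed nodes behave the same way.

```python
from types import SimpleNamespace


class RomajiTokenProxy:
    """Hypothetical proxy: adds romaji, forwards the rest to the wrapped node."""

    __slots__ = ("_node", "romaji")

    def __init__(self, node, romaji):
        self._node = node
        self.romaji = romaji

    def __getattr__(self, name):
        # __getattr__ only runs when normal lookup fails, so the proxy's
        # own fields take priority over the node's attributes.
        if name == "_node":
            # Guard against infinite recursion if _node was never set.
            raise AttributeError(name)
        return getattr(self._node, name)


# Stand-in for a MeCab node; the attribute names here are illustrative only.
node = SimpleNamespace(surface="カツ", pos="名詞")
token = RomajiTokenProxy(node, "katsu")
print(token.romaji, token.surface, token.pos)
```

Wrapping rather than subclassing means the proxy never needs to construct or inherit from the node type; whether that is enough to avoid trouble with real MeCab nodes is untested here.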

While the API will change, the internal code will not change much as part of this process. This will take a few months at the earliest, and a new version that emits DeprecationWarnings will be released. If you have a stable application and are happy with the current API, please be sure to use version guards.
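For reference, a deprecation shim of the kind described usually looks something like the sketch below. The class name `TransitionalCutlet`, the warning message, and the placeholder conversion are all made up for illustration; only the old name `romaji` and the proposed name `to_romaji` come from this proposal.

```python
import warnings


class TransitionalCutlet:
    """Hypothetical transitional class that keeps the old method name alive."""

    def to_romaji(self, text):
        # Placeholder body; the real conversion is not part of this sketch.
        return text.strip()

    def romaji(self, text):
        # Old name: still works, but tells callers to migrate.
        warnings.warn(
            "romaji() is deprecated; use to_romaji() instead",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return self.to_romaji(text)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    TransitionalCutlet().romaji(" katsu ")

print([w.category.__name__ for w in caught])
```

On the user side, a version guard would be an upper bound in your dependency specification, for example `cutlet<1.0`, assuming the breaking changes land in 1.0.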
