Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor. Translate 2.0 #30

Open
adomasven opened this issue Oct 25, 2023 · 4 comments
Open

Refactor. Translate 2.0 #30

adomasven opened this issue Oct 25, 2023 · 4 comments

Comments

@adomasven
Copy link
Member

adomasven commented Oct 25, 2023

After having a brief attempt at uncluttering the messy code in this codebase while trying to fix zotero/zotero#3182, I have decided that it is pointless.

The actual translate flow is actually quite straightforward, but it is massively complicated by everything being super stateful, and execution flow being controlled by tens of flags set externally and internally.

There shouldn't be a need to instantiate any object when translating, instead, you should choose your Translate (Web, Export, Import, Search), then call translate(), detect() (for Web only?) and getTranslators(). 99% use cases would simply call translate(). If you pass in no translators into translate(), then it runs detect() before translation. If you pass in no translators into detect(), then it runs getTranslators() before translation.

All methods in current translation architecture that begin with set*() should become parameters that are passed into the execution calls above.

Events/handlers should likely disappear from translate at all. Translate should be strictly concerned with returning metadata from a given translate object. Any further processing like item saving or progress display should be performed by the translate calling code. This is simple to do in practice, since most of the saving functionality is already done in other classes per current architecture, specifically Translate.ItemSaver, which all environments that use Translate override.

I propose rewriting the translate code from scratch and providing it as a separate file, then gradually exchanging all existing codebases depending on Translate to use the new architecture, and finally phasing out Translate 1.0 after some years and allowing others to switch over.

This would probably be a huge timesaver going forward, since every bug we encounter in translation, and every time we need to change something in the logic usually means a huge amount of time spent analyzing this complex piece of code and further debugging when a change produces unexpected behavior.

CC @dstillman @AbeJellinek @zoe-translates

@AbeJellinek
Copy link
Member

Things we've discussed:

  • We'd need to keep the child translator API exposed to translators unchanged. That will mostly be straightforward (it's already implemented as a wrapper over the real translation API), but we'll either need to keep setHandler() in the top-level translation API or implement it for child translators as a shim over options to translate().
  • Might need to track pending promises with a waitUntil-style API to support legacy callback APIs (doGet/doPost/processDocuments/child translation), might not, depending on whether we allow simultaneous child translators.
  • done could be replaced by the translate() promise resolving.
  • error could be replaced by the translate() promise rejecting.
  • debug and select could be functions passed as options.
  • itemDone and collectionDone could be replaced by a async generator functions or kept as callbacks.
  • Export translation needs noWait mode, but export translators are sort of weird: they don't yield items, have a bunch of methods that aren't available to other translator types, and are rarely called from other translators. Do they need to be translators at all?

Other potential considerations:

  • We should keep RemoteTranslate in mind while designing the new API. For the most part, if the API is stateless, it should work fine.
  • translatorTester.js will need to be updated to use the new API, and maybe deserves a refactor as well. Scaffold uses it to generate item diffs, sanitize items for display, run all test types, and create web tests, but it doesn't create import/export/search tests, so Scaffold implements those itself. We could incorporate it into Scaffold (and remove it from the Connector), make it handle all test types, and add a command-line option to the client that lets you run translator tests headless. That way we wouldn't need Selenium for translator CI anymore.

@adomasven
Copy link
Member Author

Case in point d7299a7

I've spent a lot of time troubleshooting this and trying to come up with solutions, which in the end turned out just removing error handling there completely. translate.complete(false) is an absolute disaster of the current codebase. E.g.:

_detect: async function() {
// there won't be any translators if we need an RPC call
if(!this._potentialTranslators.length) {
this.complete(true);
return null;
}
try {
await this._loadTranslator(this._potentialTranslators[0]);
return await this._detectTranslatorLoaded();
}
catch (e) {
this.complete(false, e);
}
},

Before the above change this._loadTranslator() may have called this.complete(false), effectively meaning that the current translate operation should be terminated, but since it wasn't throwing, L1692 continued happily executing, to make matters worse, for Import translators, _loadTranslator() is also overloaded:

Zotero.Translate.Import.prototype._loadTranslator = function(translator) {
return Zotero.Translate.Base.prototype._loadTranslator.call(this, translator)
.then(function() {
return this._loadTranslatorPrepareIO(translator);
}.bind(this));
}

So the Base.prototype._loadTranslator() calls this.complete(false), but then this overloaded codepath continues executing, and then the codepath in _detect() also continues executing.


A bug like this shouldn't even exist in the first place if the code was architectured correctly, but even if it did, with proper execution flow figuring out the issue and fixing it should be a matter of minutes, not hours.

@AbeJellinek
Copy link
Member

Horrifying!

@adomasven
Copy link
Member Author

After additional internal discussions I've decided to attempt to approach this by not rewriting Translate, but instead adding a wrapper on top of it with a better, less-stateful API. See https://github.com/zotero/zotero-connectors/pull/527/files#diff-43186027602a6bd9542db518a998d0f44f51101d62fd250a1e00288cd6966e2b

This was done out of necessity to separate ItemSaver and its related logic away from Translate, since translation occurs on a different sandboxed JS context than where progress is reported and where we generally need to run ItemSaver.

Some observations:

  1. Deproxifying URLs is somewhat intermingled with translators and as such the translate process and results. When running detection for a proxied page, some translators (generic like COinS or EM) will detect on unproxied URL, while others will only detect on proxied urls. Translate will automatically unproxify items if translating with a translator that would run on unproxied url. Further, we may want to pass that proxy to ItemSaver or somewhere else. As such, translate() and possibly getTranslators() needs to return the corresponding proxy. Otherwise, we should consider returning the corresponding proxy as part of a translator object from getTranslators()
  2. I was hoping we could reduce the number of listeners that we would attach to translation. Some are not needed, such as attachmentProgress as now non-translate code is supposed to handle saving of attachments. On the other hand, itemSaving e.g. needs keeping since when translating a multiple page, multiple http queries may need to made before the items are translated and as such can be displayed in the progress UI.

Overall I'm happy with the first attempt at simplifying the translate code. This approach does nothing for refactoring or simplifying the internal logic of translate machinery and the especially spaghetti parts like nested translators, but it's a good starting point. Not having to pass ItemSaver properties through a translate instance simplifies code and allows for more flexibility, without needing rewriting too many things. I only needed to edit the existing ItemSaver minimally to adjust it to work as an independent piece of code rather than part of translate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants