Create a token/embedding creation preprocessing pipeline using tf-transform #124

Open
iislucas opened this issue Jul 2, 2018 · 1 comment

Comments

@iislucas
Contributor

iislucas commented Jul 2, 2018

Issue:
We currently depend on pretrained vocabularies, such as GloVe embeddings, that:

  1. Are biased in odd ways (although once you backprop into the embeddings, their initial bias matters much less),
  2. Must be consistent with the tokenizer we use,
  3. Don't necessarily contain the same words as our actual text.
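
To make points 2 and 3 concrete, here is a minimal sketch of how one might measure the mismatch between a pretrained vocabulary and a corpus. The `oov_report` helper, the whitespace tokenizer, and the toy data are all hypothetical and only illustrate the idea; they are not part of this repository.

```python
from collections import Counter


def oov_report(texts, pretrained_vocab, tokenize=str.split):
    """Return (type OOV rate, token OOV rate, most frequent missing tokens).

    `tokenize` is a stand-in for whatever tokenizer the model uses; a
    different tokenizer will generally give different rates.
    """
    counts = Counter(tok for text in texts for tok in tokenize(text))
    # Tokens seen in the corpus that the pretrained vocabulary lacks.
    missing = {tok: n for tok, n in counts.items() if tok not in pretrained_vocab}
    type_rate = len(missing) / len(counts) if counts else 0.0
    total = sum(counts.values())
    token_rate = sum(missing.values()) / total if total else 0.0
    # Most frequent missing tokens first, ties broken alphabetically.
    top_missing = sorted(missing, key=lambda t: (-missing[t], t))[:5]
    return type_rate, token_rate, top_missing


# Toy data: the vocabulary has no entry for "you're" or "lol".
texts = ["you're wrong , lol", "lol you are right"]
pretrained_vocab = {"you", "are", "wrong", "right", ","}
type_rate, token_rate, top_missing = oov_report(texts, pretrained_vocab)
print(type_rate, token_rate, top_missing)
```

A high token-level rate would suggest that the tokenizer and the pretrained vocabulary disagree, or that the corpus uses words the vocabulary never saw.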

Proposed solution:
Use https://github.com/tensorflow/transform to develop text preprocessing pipelines that, for example, select the tokens that occur sufficiently frequently in our data and create either random or smarter word embeddings for them.
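
As a rough illustration of the two steps named above (frequency-based token selection, then random embedding initialization), here is a pure-Python sketch. It does not use tf-transform; the function names, the `min_count` threshold, and the reserved unknown id 0 are assumptions made for this example. In tf-transform, the selection step would presumably map to a vocabulary analyzer such as `tft.compute_and_apply_vocabulary` with its `frequency_threshold` argument, computed over the full dataset.

```python
import random
from collections import Counter


def build_vocab(texts, min_count=2, tokenize=str.split):
    """Map each token seen at least `min_count` times to an integer id.

    Id 0 is reserved for unknown tokens (an assumption of this sketch).
    """
    counts = Counter(tok for text in texts for tok in tokenize(text))
    kept = sorted(
        (tok for tok, n in counts.items() if n >= min_count),
        key=lambda tok: (-counts[tok], tok),  # frequent first, then alphabetical
    )
    return {tok: i + 1 for i, tok in enumerate(kept)}


def random_embeddings(vocab, dim=4, scale=0.1, seed=0):
    """Create one small random vector per id, including the unknown id 0."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-scale, scale) for _ in range(dim)]
        for _ in range(len(vocab) + 1)
    ]


def encode(text, vocab, tokenize=str.split):
    """Convert text to ids, sending out-of-vocabulary tokens to 0."""
    return [vocab.get(tok, 0) for tok in tokenize(text)]


texts = ["the cat sat", "the dog sat", "a cat ran"]
vocab = build_vocab(texts, min_count=2)
table = random_embeddings(vocab, dim=4)
ids = encode("the bird sat", vocab)
print(vocab, ids)
```

Because the vocabulary is built from the same tokenizer and the same text the model trains on, the consistency and coverage concerns from the issue description do not arise; the trade-off is that the embeddings start with no pretrained signal.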

@iislucas iislucas changed the title Create a preprocessing pipeline using https://github.com/tensorflow/transform Create a token/embedding creation preprocessing pipeline using https://github.com/tensorflow/transform Jul 2, 2018
@iislucas iislucas changed the title Create a token/embedding creation preprocessing pipeline using https://github.com/tensorflow/transform Create a token/embedding creation preprocessing pipeline using tf-transform Jul 2, 2018
@fprost
Collaborator

fprost commented Jul 17, 2018

FYI: Not sure if this helps, but here is a basic example using tft: https://github.com/tensorflow/transform/blob/master/examples/sentiment_example.py

ipavlopoulos pushed a commit to ipavlopoulos/conversationai-models that referenced this issue Mar 2, 2019
…eaks

tweaking docs to be clearer and better formatted