
Add support for embedding pruning #26

Open
danieldk opened this issue Jul 31, 2019 · 1 comment

danieldk commented Jul 31, 2019

Add support for pruning embeddings, where only N embeddings are retained. Words whose embeddings are removed are mapped to their nearest neighbor among the retained embeddings.

This should provide more or less the same functionality as pruning in spaCy:

https://spacy.io/api/vocab#prune_vectors

I encourage some investigation here. Some ideas:

  1. The most basic version could simply retain the embeddings of the N most frequent words and map all remaining words to their nearest neighbor among the N retained embeddings (see the first sketch below).

  2. Select the retained vectors such that the similarities to the pruned vectors are maximized. The challenge here is making this tractable.

  3. An approach similar to quantization, where k-means clustering is performed with N clusters. The embedding matrix is then replaced by the matrix of cluster centroids, and each word maps to the cluster it belongs to (see the second sketch below). This could reuse the KMeans implementation from reductive, which is already a dependency of finalfusion.

I would focus on (1) and (3) first.
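
A minimal sketch of option (1), assuming the rows of the embedding matrix are ordered by descending word frequency (as in typical word2vec/fastText-style storage) and using plain numpy rather than any finalfusion API. The names `prune_by_frequency`, `n_keep`, and `remap` are illustrative only:

```python
# Sketch of idea (1): keep the N most frequent embeddings and remap every
# pruned word to its nearest retained neighbor by cosine similarity.
# Assumes rows of `embeddings` are ordered by descending word frequency.
import numpy as np

def prune_by_frequency(embeddings: np.ndarray, n_keep: int):
    """Return (pruned_matrix, remap) where remap[i] is the row in the
    pruned matrix that word i should use."""
    # L2-normalize so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / np.maximum(norms, 1e-12)

    kept = normalized[:n_keep]    # retained (most frequent) rows
    pruned = normalized[n_keep:]  # rows to be dropped

    # Retained words keep their own row; pruned words point at the most
    # similar retained row.
    remap = np.empty(embeddings.shape[0], dtype=np.int64)
    remap[:n_keep] = np.arange(n_keep)
    if pruned.shape[0] > 0:
        similarities = pruned @ kept.T  # shape: (n_pruned, n_keep)
        remap[n_keep:] = similarities.argmax(axis=1)

    return embeddings[:n_keep].copy(), remap
```

The returned `remap` is exactly the indirection a lookup would go through afterwards: word index → `remap` → row in the smaller matrix.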
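A corresponding sketch of option (3). The issue proposes reusing the KMeans implementation from reductive; scikit-learn's `KMeans` is used here purely as a stand-in to show the shape of the result (a centroid matrix plus a word-to-centroid mapping):

```python
# Sketch of idea (3): replace the embedding matrix by N k-means centroids
# and map each word to its cluster. scikit-learn stands in for reductive.
import numpy as np
from sklearn.cluster import KMeans

def prune_by_clustering(embeddings: np.ndarray, n_clusters: int, seed: int = 42):
    """Return (centroids, remap) where remap[i] is the centroid row for word i."""
    kmeans = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    kmeans.fit(embeddings)
    centroids = kmeans.cluster_centers_.astype(embeddings.dtype)
    remap = kmeans.labels_.astype(np.int64)  # word index -> centroid row
    return centroids, remap
```
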

Benefits:

  • Compresses the embedding matrix.
  • Faster than quantized embedding matrices, because simple lookups are used.
  • Could later be applied to @sebpuetz's non-hashed subword n-grams as well.
  • Could perhaps be combined with quantization for even better compression.
danieldk added the feature (New feature or request) label on Jul 31, 2019
@sebpuetz

Somewhat related: mapping all untrained subword embeddings to a NULL vector could also be done if we get some indirection for lookups (which all of the above options would introduce). The subword embeddings could be filtered by going through the vocabulary items, extracting their corresponding subword indices, and tracking which indices never appear. Indices that never appear could be mapped to the same vector (whatever that should be...) or removed without replacement; see the sketch below.

In some cases this would massively reduce model size.
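
A rough sketch of that filtering, assuming only a vocabulary, a subword embedding matrix, and a hypothetical `subword_indices(word)` helper that yields the n-gram indices for a word (hashed or explicit); none of these names come from the existing code:

```python
# Sketch of subword filtering: walk the known vocabulary, record which
# subword indices are ever produced, and remap every index that never
# appears to a single shared NULL row at the end of the filtered matrix.
import numpy as np
from typing import Callable, Iterable, List

def filter_subwords(
    subword_matrix: np.ndarray,
    vocab: Iterable[str],
    subword_indices: Callable[[str], List[int]],
):
    """Return (filtered_matrix, remap); unused indices map to a final NULL row."""
    used = np.zeros(subword_matrix.shape[0], dtype=bool)
    for word in vocab:
        used[subword_indices(word)] = True

    kept_rows = np.flatnonzero(used)
    # Filtered matrix: the used rows, plus one all-zero NULL row at the end.
    null_row = np.zeros((1, subword_matrix.shape[1]), dtype=subword_matrix.dtype)
    filtered = np.concatenate([subword_matrix[kept_rows], null_row], axis=0)

    # Every original index points either at its new position or at the NULL row.
    remap = np.full(subword_matrix.shape[0], len(kept_rows), dtype=np.int64)
    remap[kept_rows] = np.arange(len(kept_rows))
    return filtered, remap
```

Whether unused indices should share a zero row or be dropped entirely is the open question raised above; the sketch just shows the shared-row variant.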
