Refactor `initialize_encoder` to `LaserEncoderPipeline` (#256)
Changes from all commits: d13174b, 9b6f9cd, 049f2e2, 67ba8bb, a8efad2, d56a3a8, 3fc5ea2
@@ -21,7 +21,11 @@
 import numpy as np
 import pytest

-from laser_encoders import initialize_encoder, initialize_tokenizer
+from laser_encoders import (
+    LaserEncoderPipeline,
+    initialize_encoder,
+    initialize_tokenizer,
+)


 @pytest.fixture
@@ -35,6 +39,27 @@ def input_text() -> str:
     return "This is a test sentence."


+@pytest.fixture
+def test_readme_params() -> dict:
+    return {
+        "lang": "igbo",
+        "input_sentences": ["nnọọ, kedu ka ị mere"],
+        "expected_embedding_shape": (1, 1024),
+        "expected_array": [
+            0.3807628,
+            -0.27941525,
+            -0.17819545,
+            0.44144684,
+            -0.38985375,
+            0.04719935,
+            0.20238206,
+            -0.03934783,
+            0.0118901,
+            0.28986093,
+        ],
+    }
+
+
 def test_tokenize(tokenizer, input_text: str):
     expected_output = "▁this ▁is ▁a ▁test ▁sent ence ."
     assert tokenizer.tokenize(input_text) == expected_output

[Review comment on the `expected_array` values] Just FYI, due to the update from Siddharth on Sacremoses preprocessing, these numbers will likely change.

[Reply] Let's fix them in the PR where we update the interaction with Sacremoses!
@@ -175,3 +200,36 @@ def test_sentence_encoder(
     assert isinstance(sentence_embedding, np.ndarray)
     assert sentence_embedding.shape == (1, 1024)
     assert np.allclose(expected_array, sentence_embedding[:, :10], atol=1e-3)
+
+
+def test_laser_encoder_pipeline(tmp_path: Path, test_readme_params: dict):
+    lang = test_readme_params["lang"]
+    input_sentences = test_readme_params["input_sentences"]
+    expected_embedding_shape = test_readme_params["expected_embedding_shape"]
+    expected_array = test_readme_params["expected_array"]
+
+    encoder = LaserEncoderPipeline(model_dir=tmp_path, lang=lang)
+    embeddings = encoder.encode_sentences(input_sentences)
+
+    assert isinstance(embeddings, np.ndarray)
+    assert embeddings.shape == expected_embedding_shape
+    assert np.allclose(expected_array, embeddings[:, :10], atol=1e-3)
+
+
+def test_separate_initialization_and_encoding(
+    tmp_path, tokenizer, test_readme_params: dict
+):
+    lang = test_readme_params["lang"]
+    input_sentences = test_readme_params["input_sentences"]
+    expected_embedding_shape = test_readme_params["expected_embedding_shape"]
+    expected_array = test_readme_params["expected_array"]
+
+    tokenized_sentence = tokenizer.tokenize(input_sentences[0])
+    sentence_encoder = initialize_encoder(model_dir=tmp_path, lang=lang)
+
+    # Encode tokenized sentences into embeddings
+    embeddings = sentence_encoder.encode_sentences([tokenized_sentence])
+
+    assert isinstance(embeddings, np.ndarray)
+    assert embeddings.shape == expected_embedding_shape
+    assert np.allclose(expected_array, embeddings[:, :10], atol=1e-3)
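The assertions in these tests compare only the first ten embedding dimensions against the recorded reference values, and they do so within an absolute tolerance rather than requiring exact equality. A minimal standalone sketch of that comparison pattern, reusing the first few Igbo reference numbers from the fixture:

```python
import numpy as np

# First three reference values from the test_readme_params fixture.
expected = np.array([0.3807628, -0.27941525, -0.17819545])

# A perturbation smaller than atol=1e-3 still passes the check...
assert np.allclose(expected, expected + 5e-4, atol=1e-3)

# ...while a perturbation clearly larger than atol fails it.
assert not np.allclose(expected, expected + 5e-3, atol=1e-3)
```

The tolerance is what makes the tests stable across minor numeric drift (different BLAS builds, hardware, etc.) while still catching real regressions in the encoder output.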
[Review comment] Can you please add a unit test covering this case? And by the way, make sure that all the other examples in the readme are covered by tests.
[Reply] Alright, will do that.
[Comment] Hi @avidale @heffernankevin,

I have just added the tests to the PR. I am seeking clarification on a couple of things:

1. Should I also include tests for the "Downloading the pre-trained models" section of the readme? It involves running some commands in the terminal and using `SentenceEncoder` and `LaserTokenizer`, which I believe are already covered by the current tests. Please let me know your thoughts.
2. Was it an intentional decision to use a temporary directory (`tmp_path`) for the tests? It seems like the same files may be downloaded for every test, which extends the time it takes to run them.

Thank you!
[Reply] However, if you feel that this significantly slows down your development process, we can think of a better way to run the tests. For example, we could bundle the tests into one class and use class-level setup and teardown methods for creating and destroying the cache directory. I don't think it is a must, but you can do this if it helps you.
[Reply] Alright, thanks for clarifying. Please let me know if there are any more changes to be made in this PR.