-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MPNet implementation #1
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good ob on the implementations, just some minor comments and suggestions if you want to take a look.
index/mapping.py
Outdated
@@ -2,16 +2,19 @@ | |||
import numpy as np | |||
|
|||
from index.embedding import EmbeddingModel | |||
from index.db.model import Terminology, Mapping, Concept, Variable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here might be an issue - the model file was here moved to the db package. Weird that this does not throw an error in the parser test or python package though, I will have to take a look at that.
Could you adjust this?
index/embedding.py
Outdated
@@ -28,19 +27,42 @@ def get_embedding(self, text: str, model="text-embedding-ada-002"): | |||
return None | |||
if isinstance(text, str): | |||
text = text.replace("\n", " ") | |||
return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding'] | |||
return openai.Embedding.create(input=[text], model=model)["data"][0][ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really a problem syntax wise but I am wondering about this formatting choice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using Black Formatter on VScode and it automatically formats on save. I will try to disable it.
index/visualisation.py
Outdated
save_plot=False, | ||
save_dir="resources/results/plots", | ||
): | ||
if ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be done a bit shorter ;)
if not len(A) == len(B) == len(C) do stuff
index/visualisation.py
Outdated
descriptions_table2 = np.concatenate([table.joined_mapping_table["description"] for table in tables2_cleaned]) | ||
boundaries1 = np.array([table.joined_mapping_table["embedding"].index.size for table in tables1_cleaned]) | ||
boundaries2 = np.array([table.joined_mapping_table["embedding"].index.size for table in tables2_cleaned]) | ||
table1.joined_mapping_table.dropna( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I like to format on one line if possible, but this is fine too
index/evaluation.py
Outdated
raise NotImplementedError( | ||
"Specified matching method is not implemented!" | ||
) | ||
min_distance_indices = np.argsort(np.array(distances))[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
formatting error?
@@ -177,3 +271,69 @@ def score_mappings(matches: pd.DataFrame) -> float: | |||
matches = matches[matches["target_concept_label"].notnull()] | |||
accuracy = matches["correct"].sum() / len(matches) | |||
return accuracy | |||
|
|||
|
|||
def evaluate( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely made sense to move this here instead
index/evaluation.py
Outdated
source, target, matching_method=MatchingMethod.FUZZY_STRING_MATCHING | ||
) | ||
if target == "jadni": | ||
print("check") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this check for? :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was there in the original code, so I just left it :D
index/evaluation.py
Outdated
fuzzy = pd.DataFrame(data_fuzzy, index=labels).T | ||
return gpt, fuzzy | ||
|
||
elif model == "mpnet": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could probably move code out of each if clause to a higher common level to have less code duplication
index/evaluation.py
Outdated
data_mpnet[labels[idx]] = acc_mpnet | ||
# transpose to have from -> to | row -> column like in the paper | ||
mpnet = pd.DataFrame(data_mpnet, index=labels).T | ||
return mpnet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a function return a different number of values for different arguments is not a good idea, since it makes the behaviour somewhat less predictable to the user.
Better solution would be here to have it return either gpt/fuzzy/mpnet results (since you already have a model paprameter) .
No description provided.