Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MPNet implementation #1

Merged
merged 3 commits into from
Feb 15, 2024
Merged

MPNet implementation #1

merged 3 commits into from
Feb 15, 2024

Conversation

mehmetcanay
Copy link
Member

No description provided.

@mehmetcanay mehmetcanay changed the title update: MPNet implementation MPNet implementation Feb 14, 2024
Copy link
Member

@tiadams tiadams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good ob on the implementations, just some minor comments and suggestions if you want to take a look.

index/mapping.py Outdated
@@ -2,16 +2,19 @@
import numpy as np

from index.embedding import EmbeddingModel
from index.db.model import Terminology, Mapping, Concept, Variable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here might be an issue - the model file was here moved to the db package. Weird that this does not throw an error in the parser test or python package though, I will have to take a look at that.

Could you adjust this?

@@ -28,19 +27,42 @@ def get_embedding(self, text: str, model="text-embedding-ada-002"):
return None
if isinstance(text, str):
text = text.replace("\n", " ")
return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']
return openai.Embedding.create(input=[text], model=model)["data"][0][
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really a problem syntax wise but I am wondering about this formatting choice

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using Black Formatter on VScode and it automatically formats on save. I will try to disable it.

index/visualisation.py Outdated Show resolved Hide resolved
save_plot=False,
save_dir="resources/results/plots",
):
if (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be done a bit shorter ;)

if not len(A) == len(B) == len(C) do stuff

descriptions_table2 = np.concatenate([table.joined_mapping_table["description"] for table in tables2_cleaned])
boundaries1 = np.array([table.joined_mapping_table["embedding"].index.size for table in tables1_cleaned])
boundaries2 = np.array([table.joined_mapping_table["embedding"].index.size for table in tables2_cleaned])
table1.joined_mapping_table.dropna(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I like to format on one line if possible, but this is fine too

raise NotImplementedError(
"Specified matching method is not implemented!"
)
min_distance_indices = np.argsort(np.array(distances))[
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formatting error?

@@ -177,3 +271,69 @@ def score_mappings(matches: pd.DataFrame) -> float:
matches = matches[matches["target_concept_label"].notnull()]
accuracy = matches["correct"].sum() / len(matches)
return accuracy


def evaluate(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely made sense to move this here instead

source, target, matching_method=MatchingMethod.FUZZY_STRING_MATCHING
)
if target == "jadni":
print("check")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this check for? :D

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was there in the original code, so I just left it :D

fuzzy = pd.DataFrame(data_fuzzy, index=labels).T
return gpt, fuzzy

elif model == "mpnet":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could probably move code out of each if clause to a higher common level to have less code duplication

data_mpnet[labels[idx]] = acc_mpnet
# transpose to have from -> to | row -> column like in the paper
mpnet = pd.DataFrame(data_mpnet, index=labels).T
return mpnet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a function return a different number of values for different arguments is not a good idea, since it makes the behaviour somewhat less predictable to the user.

Better solution would be here to have it return either gpt/fuzzy/mpnet results (since you already have a model paprameter) .

@tiadams tiadams merged commit b1f500a into main Feb 15, 2024
4 checks passed
@tiadams tiadams deleted the mpnet-implementation branch February 15, 2024 08:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants