-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add abbreviation replacement data augmentation op and test #732
base: master
Are you sure you want to change the base?
Conversation
A couple of preparations for the PR:
|
Codecov Report
@@ Coverage Diff @@
## master #732 +/- ##
==========================================
+ Coverage 80.94% 80.98% +0.03%
==========================================
Files 249 251 +2
Lines 18664 18725 +61
==========================================
+ Hits 15108 15164 +56
- Misses 3556 3561 +5
Continue to review full report at Codecov.
|
forte/processors/data_augment/algorithms/abbreviation_replacement_op.py
Outdated
Show resolved
Hide resolved
|
||
class AbbreviationReplacementOp(SingleAnnotationAugmentOp): | ||
r""" | ||
This class is a replacement op utilizing a pre-defined |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring should be more comprehensive. This is what the user is going to see if they want to use this DA op.
@@ -0,0 +1,104 @@ | |||
# Copyright 2020 The Forte Authors. All Rights Reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2022*
super().__init__(configs) | ||
if "dict_path" in configs.keys(): | ||
self.dict_path = configs["dict_path"] | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An if-else loop is not needed here as you are already setting a default value in the default_configs
self, input_anno: Annotation | ||
) -> Tuple[bool, str]: | ||
r""" | ||
This function replaces a word from an abbreviation dictionary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, we should add a better description of what this function will do.
# If the replacement does not happen, return False. | ||
if random.random() > self.configs.prob: | ||
return False, input_anno.text | ||
if input_anno.text in self.data.keys(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you are returning from the function if the program enters the earlier if
statement, you dont need to add this if
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I am not sure is this check (input_anno.text in self.data.keys()
) is necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking if the input phrase does not have a corresponding abbreviation, an error will occur.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- When checking dict existence, use
text in self.data
, don't need to call thekeys()
. - Now we can see that the
prob
only applies to the annotation that has an abbreviation, which should probably be specified in the class docstring.
- dict_path: the `url` or the path to the pre-defined | ||
abbreviation json file. The key is a word / phrase we want | ||
to replace. The value is an abbreviated word of the | ||
corresponding key. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend adding the default value of dict_path
in the docstring as well since this is what will be rendered in the documentation and it would be easier for users to see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your implementation of the Op seems fine but it looks like you might have not gotten the underlying intricacies of how to modify the SingleAnnotationAugmentOp
for different type of annotations. Just take a closer look at that once.
A dictionary with the default config for this processor. | ||
Following are the keys for this dictionary: | ||
- prob: The probability of replacement, | ||
should fall in [0, 1]. Default value is 0.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default value below is 0.5. Make sure you check the documentation thoroughly.
data_pack = DataPack() | ||
text = "see you later" | ||
data_pack.set_text(text) | ||
token = Token(data_pack, 0, len(text)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Token
class is generally used for a single word. When annotating the whole sequence, you should use the Sentence
class. Also, if your SingleAnnotationOp augments an annotation other than a Token
, you must specify that in the default_configs
. For your reference, look at the implementation of the BackTranslationOp
and its test cases. Even that Op augments sentences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also have Document https://github.com/asyml/forte/blob/master/ft/onto/base_ontology.py#L136 for the whole article.
I know it is just a test case so it doesn't matter too much, but still worth noting.
augmented_data_pack = self.abre.perform_augmentation(data_pack) | ||
|
||
augmented_token = list( | ||
augmented_data_pack.get("ft.onto.base_ontology.Token") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should take the comment above into consideration and rework your test cases accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the changes look good.
abbreviation dictionary to replace word or phrase | ||
with an abbreviation. The abbreviation dictionary can | ||
be user-defined, we also provide a default dictionary. | ||
`prob` indicates the probability of replacement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "probability of replacement" mean? For example, if prob
is 0.4, is the replacement happen 40% of the case or the other way around. Does it mean that 40% of the words will be replaced, etc. Let's specify this clearly.
# If the replacement does not happen, return False. | ||
if random.random() > self.configs.prob: | ||
return False, input_anno.text | ||
if input_anno.text in self.data.keys(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- When checking dict existence, use
text in self.data
, don't need to call thekeys()
. - Now we can see that the
prob
only applies to the annotation that has an abbreviation, which should probably be specified in the class docstring.
if random.random() > self.configs.prob: | ||
return False, input_anno.text | ||
if input_anno.text in self.data.keys(): | ||
result: str = self.data[input_anno.text] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something about this replacement:
- Do we need to consider the case? Maybe we should lower case your dictionary and user input.
- How about substrings? For example, in
"see you later": "syl8r"
, what if we have an input "i will see you later", it looks like we won't replace this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you need to consider using an Aho-Corasick data sturcture here: https://pyahocorasick.readthedocs.io/en/latest/
to replace. The value is an abbreviated word of the | ||
corresponding key. Default dictionary is from a web-scraped | ||
slang dictionary | ||
("https://github.com/abbeyyyy/JsonFiles/blob/main/abbreviate.json"). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we adopt the dictionary from another library?
data_pack_1 = DataPack() | ||
text_1 = "I will see you later!" | ||
data_pack_1.set_text(text_1) | ||
phrase_1 = Phrase(data_pack_1, 7, len(text_1) - 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that you have to first identify the phrase before doing the match, which is not a very typical use case.
This PR fixes #736.
Description of changes
Add Abbreviation Replacement Augmentation Method
Possible influences of this PR.
This PR provides a new replacement-based data augmentation method
Test Conducted
Test cases included in abbreviation_replacement_op_test.py