pre-processing of text #890 issue #990
#890 added pre-processing of text
There are some things in your PR which are not exactly in line with what aima-python aims to do
- "wordseq = words(federalist)\n",
- "wordseq = wordseq[114:-3098]"
+ "wordseqs = words(federalist)\n",
+ "wordseqs = wordseqs[114:-3098]"
Was it necessary to change the name of the variable?
I haven't changed the name of the variable; I actually created a new variable.
Wasn't wordseq already present in the repository?
Anyway, it's fine if it makes things simpler.
"outputs": [],
"source": [
"#removing stopwords\n",
"from nltk.corpus import stopwords\n",
We want to try to minimize the use of third-party libraries. The point of the nlp module is to have basic implementations of standard functions used in the domain. Importing from nltk is the opposite of what we want to do.
Alright, I'll try to write a new function to replace the nltk import in my next contribution, to minimize the use of third-party libraries.
Thanks!
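For illustration, stopword removal without nltk could look roughly like the sketch below. The names `STOPWORDS` and `remove_stopwords` are hypothetical and not part of the existing nlp module, and the word list is a tiny hand-picked sample rather than a complete stopword list.

```python
# Hypothetical, hand-picked stopword set; a usable list would be much longer.
STOPWORDS = frozenset({
    "a", "an", "and", "the", "of", "to", "in", "is", "it", "that",
})


def remove_stopwords(words, stopwords=STOPWORDS):
    """Return the words that are not stopwords, preserving their order."""
    return [w for w in words if w.lower() not in stopwords]


sample = "the powers of the union are vested in a congress".split()
print(remove_stopwords(sample))
# ['powers', 'union', 'are', 'vested', 'congress']
```

Keeping the list in a `frozenset` makes each membership check constant time, which matters when filtering a long word sequence.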
"source": [
"#stemming and lemmatization\n",
"from nltk.stem.wordnet import WordNetLemmatizer\n",
"lmtzr = WordNetLemmatizer()\n",
Again, we shouldn't use lemmatizers from third parties. Instead, we could have a lemmatizer within the repository, however basic it may be. The point of this repository is to be able to explain the underlying concepts of these algorithms, not directly import from other modules.
@sagar-sehgal We can, but make sure you read their license first. We might have to cite or acknowledge them. If the license allows, I think we can save a copy of the file in aima-data and carry on from there.
Okay, I'll try to do that. Thank you!
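As a rough idea of what a basic in-repository lemmatizer might look like, here is a sketch under stated assumptions. It is a crude suffix-stripping normalizer with a small exception table, not WordNet lemmatization. The names `IRREGULAR`, `SUFFIX_RULES` and `basic_lemmatize` are hypothetical, and the rules will over-strip some words (for example, "this" becomes "thi").

```python
# Hypothetical table of irregular forms; a fuller one could live in a data file.
IRREGULAR = {
    "men": "man", "women": "woman", "children": "child",
    "was": "be", "were": "be",
}

# (suffix, replacement) pairs, tried in order; more specific suffixes come first.
SUFFIX_RULES = [("ies", "y"), ("sses", "ss"), ("ing", ""), ("ed", ""), ("s", "")]


def basic_lemmatize(word, min_stem=3):
    """Map a word to an approximate base form using lookup, then suffix rules."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ss"):
        # Words like "congress" are not plurals; leave them untouched.
        return word
    for suffix, replacement in SUFFIX_RULES:
        stem = word[:-len(suffix)]
        # Only strip when enough of the word is left to be a plausible stem.
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem + replacement
    return word


for w in ["parties", "classes", "governing", "congress", "children"]:
    print(w, "->", basic_lemmatize(w))
```

The exception table is checked first so that irregular forms never reach the suffix rules, and `min_stem` keeps very short words such as "is" from being stripped.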
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "' '.join(wordseq[:100])"
+ "' '.join(wordseqs[:100])"
This was fine already
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
- "wordseq = [w for w in wordseq if w != 'publius']"
+ "wordseqs = [w for w in wordseqs if w != 'publius']"
And so was this.
@@ -531,7 +559,7 @@
"(4, 16, 52)"
]
},
- "execution_count": 6,
+ "execution_count": 41,
A slightly picky complaint, but you can rerun a notebook to serialize the execution counts.
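Rerunning the notebook from a fresh kernel is the usual way to get sequential counts. As an alternative sketch, the counts can also be renumbered directly in the notebook JSON without executing anything. The function name `renumber_execution_counts` is hypothetical, the code assumes the version 4 notebook layout (a top-level `cells` list), and it numbers every code cell, including ones that were never run.

```python
def renumber_execution_counts(notebook):
    """Renumber code cells 1, 2, 3, ... in place and return the notebook dict."""
    count = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells carry no execution count
        count += 1
        cell["execution_count"] = count
        # An execute_result output repeats its cell's execution count.
        for output in cell.get("outputs", []):
            if output.get("output_type") == "execute_result":
                output["execution_count"] = count
    return notebook


# Small in-memory stand-in for a loaded .ipynb file.
nb = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Text"]},
        {"cell_type": "code", "execution_count": 41, "source": [],
         "outputs": [{"output_type": "execute_result", "execution_count": 41}]},
        {"cell_type": "code", "execution_count": 7, "source": [], "outputs": []},
    ]
}
renumber_execution_counts(nb)
print([c.get("execution_count") for c in nb["cells"]])
# [None, 1, 2]
```

To apply this to a real file, the dict would be read with `json.load` and written back with `json.dump`. Unlike an actual rerun, this leaves the stored outputs as they were.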
No description provided.