Issue #58: Adding Grammar Analyzer feature to GatorMiner #90

Open · wants to merge 37 commits into master
Conversation

@Mai1902 (Collaborator) commented Apr 21, 2021

What is the current behavior?

We believe that grammar is one of the important criteria for judging the quality of a reflection; hence, our team wants to add a grammar error analyzer as a new feature to GatorMiner. This is the implementation of issue #58.

Purpose of this feature:

The Grammar Analyzer is a tool that scans an assignment and outputs two things: where grammar errors occur and which words are involved, and a score/grade based on the number of grammatical errors. This tool will be a great addition to GatorMiner because it adds another dimension to assessing an assignment's meticulousness, quality, and integrity.

What is the new behavior if this PR is merged?

As of right now, the source code of the grammar analyzer has been fully implemented and tested; it returns the correct number of errors and the percentage of grammar errors per number of words in a text. However, this feature only works with short texts, not long texts like a full reflection.
We are also adding a new page to streamlit_web.py under the title Grammar Checker, but it doesn't return anything yet, since we are still struggling with implementing an appropriate data frame.

Type of change

Please describe the pull request as one of the following:

  • Bug fix
  • Breaking change
  • New feature
  • Documentation update
  • Other

Other information: Full documentation on our pending work is as follows:

Current outcome of the latest push:

The current outcome of the implementation is that the code is functional but inefficient. What we were hoping to accomplish was a properly working analyzer that scans through all the input values and posts a table with the student ID, the number of errors, and the percentage of errors. The code works; however, we are having issues with the efficiency of the library we are using because it takes too long to run. Though we got what we wanted in terms of functioning code, it doesn't run as efficiently as we were hoping.
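
For illustration, here is a minimal sketch of the table we are aiming for, assuming pandas and Streamlit and that per-student results have already been computed (the `results` data and column names below are placeholders, not code from this PR):

```python
# Hypothetical sketch of the intended output table; `results` maps a
# student ID to (error_count, error_percentage) and is assumed here.
import pandas as pd
import streamlit as st

results = {"student_1": (4, 2.5), "student_2": (1, 0.8)}  # placeholder data

df = pd.DataFrame(
    [
        {"Student ID": sid, "Errors": err, "Error %": pct}
        for sid, (err, pct) in results.items()
    ]
)
st.dataframe(df)  # render the table on the Streamlit page
```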

Implementation of source code explanation:

The source code adopts the grammar-checking tool from the language-tool-python library. The only method in this file takes the input text as its parameter and checks for grammar errors in each line of the text. It returns the number of grammar errors in the text and the error percentage per number of words.
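
A minimal sketch of that method, assuming the language_tool_python API (the function name `analyze_grammar` is illustrative; this is not the exact source file):

```python
# A minimal sketch of the approach, not the exact code in this PR.
import re

import language_tool_python


def analyze_grammar(text):
    """Return (error count, error percentage per word) for `text`."""
    tool = language_tool_python.LanguageTool("en-US")
    err_num = 0
    for line in str(text).splitlines():
        matches = tool.check(line)  # rule matches found on this line
        err_num += len(matches)
    # Tokenize on non-alphanumeric characters, as in the diff below
    words = re.sub("[^0-9a-zA-Z]+", " ", str(text)).lower().split()
    err_percent = 100 * err_num / len(words) if words else 0.0
    return err_num, err_percent
```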

Current issue:

The code in this PR has been tested and proven correct in terms of grammar-error counting. However, it only functions normally with a small input text (one to two paragraphs at most). When my team tried to parse a large input text (a reflection), the code ran for an extremely long time and failed to produce any output, even though there are no bugs in the code.
After some trial, we realized that the language-tool-python library is a fork of the language_check library, which is already outdated. Hence, the tool is not efficient, resulting in a seemingly infinite run time.

Possible solution for future implementation:

  1. Alternative solution 1: Put the grammar analyzer inside the interactive feature, where the input can be smaller, so the tool will be able to check its grammar. This feature should display a warning message informing the user that it only works with a small text input (about 200 words).
  2. Alternative solution 2: Fork the language-tool-python library and create a new, more efficient tool based on it, e.g., by updating the trained data inside it.
  3. Alternative solution 3: Split the big input text into smaller paragraphs (200 words per paragraph) and use suitable logic to parse each paragraph into the tool (see the sketch after this list).
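
A rough sketch of what alternative solution 3 could look like (`chunk_text` and `count_errors` are hypothetical helper names, not code from this PR):

```python
# Hypothetical sketch of alternative solution 3: split the reflection into
# ~200-word chunks, then run the checker on each chunk and sum the results.
def chunk_text(text, chunk_size=200):
    words = str(text).split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

# total_errors = sum(count_errors(chunk) for chunk in chunk_text(reflection))
```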

This PR has:

  • Commit messages that are correctly formatted
  • Tests for newly introduced code
  • Docstrings for newly introduced code

Developers

@Mai1902 @Kevin487 @TheShiny1 @Batmunkh0419

m and others added 30 commits April 7, 2021 05:14
@Mai1902 added the enhancement (New feature or request), question (Further information is requested), and Team 4 labels on Apr 21, 2021
@enpuyou (Contributor) left a comment
Hi @Mai1902, thanks for working on this feature! It seems like there are still a couple of errors in the code that need to be fixed so that the program can be executed. See the details below.

Additionally, I noticed that there is a language model being installed during the execution. It is quite large. Perhaps it would be best to clarify somewhere that the grammar feature requires the installation of such models. I also want to report that the plots seem to be taking forever to load; is there a bug here, or is that just part of the feature?

Resolved review threads (outdated): streamlit_web.py (×3), src/grammar_analyzer.py
# Tally the grammar errors the tool found on this line
err_num = err_num + len(matches)

# Store all alphanumeric characters in the reflection in a list
words = re.sub('[^0-9a-zA-Z]+', ' ', str(text)).lower().split()
@enpuyou (Contributor) commented:
I'm not great at regex, but is this line tokenizing the text?

@Mai1902 (Collaborator, Author) replied:

This line is expected to replace every non-alphanumeric character in the text with whitespace, then tokenize the text and store the tokens in the list called words.
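
For example, in an interpreter:

```python
>>> import re
>>> re.sub('[^0-9a-zA-Z]+', ' ', "Hello, world! It's 2021.").lower().split()
['hello', 'world', 'it', 's', '2021']
```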

@enpuyou (Contributor) replied:

Okay, thanks! That's what I thought! If that's the case, the tokens are already processed and stored in the data frame when markdown documents are being imported. See if you can just reuse them instead of retokenizing from the text.
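
For instance, reuse could look something like this (the column name here is just a guess, not GatorMiner's confirmed schema):

```python
# Hedged sketch: reuse the already-tokenized column instead of re-running
# the regex; "TOKEN" is an assumed column name, not a confirmed schema.
word_count = main_df["TOKEN"].apply(len).sum()  # total tokens across docs
```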

Resolved review threads (outdated): src/grammar_analyzer.py, Pipfile
@enpuyou (Contributor) commented Apr 21, 2021

Also, don't forget to resolve all the merge conflicts. The Pipfile and streamlit_web.py should be rather easy. As for the Pipfile.lock, just remove it and regenerate one based on the updated Pipfile.

codecov bot commented Apr 21, 2021

Codecov Report

Merging #90 (46aef18) into master (1677535) will increase coverage by 0.42%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #90      +/-   ##
==========================================
+ Coverage   91.66%   92.09%   +0.42%     
==========================================
  Files           6        7       +1     
  Lines         240      253      +13     
==========================================
+ Hits          220      233      +13     
  Misses         20       20              
Impacted Files             Coverage Δ
src/grammar_analyzer.py    100.00% <100.00%> (ø)

@corlettim (Collaborator) commented:
Please update your branch/PR with master

@corlettim (Collaborator) commented:
Please make sure to resolve your conflicts in streamlit_web.py
