This is an implementation of System_Programming_Project in python.
The lexer parsing of text captcha is done in 4 different phases.
-
First phase is ’Fetching’ wherein we download the text captcha file, convert it into string, parse the XML that we downloaded. Here the XML string is parsed into a Document Object Model (DOM) called as minidom. Then we retrieve the first XML tag that the parser finds with name tagName. Then send it to next phase.
-
The second phase is Cleaning. Here we mainly try to extract numbers and expected extra words like ”what”, ”who”, ”digits”, ”which”, ”number”, ”these”,etc.
-
The third phase is Scanning. Here we scan different formats of numbers,names, colours, positions, mathematical equations signs, total count, comparison words, body parts, day reference, list reference,etc which forms the most important part of the project.
-
The fourth phase is Head selection. Here we categorize into digits, relations, range of values, count of body parts, colour,name, mathematical, positional,etc based upon the results obtained from fetch and clean phases and return the processed data. Here in this phase we mainly make use of retrieval based NLP where in required data sets are already provided in the form of list for head selection process.
-
The last part is translation which combines all 4 phases together and helps in solving text captcha challenge.
-
First we obtain text captcha from ”http://api.textcaptcha.com”.
-
Send it to fetch and clean phase to obtain required tokens.
-
Then send the token one by one to scan phase to know what kind of token it is.
-
The scanned tokens are then sent to head selection to know which all categories/levels, the tokens fall into.
-
After the head selection is done the required answer for the challenge is ready and is properly printed on the terminal window.