Lexiful is a lightweight natural language processing engine for text matching, suggestion, and spelling correction. It combines TF-IDF similarity, fuzzy and phonetic matching, and n-gram statistics to match free-form input against predefined entries, and it targets industry-specific vocabularies in particular.
- Text Matching: Utilizes TF-IDF vectorization and cosine similarity for matching results.
- Fuzzy Matching: Implements configurable fuzzy matching algorithms for flexible text comparison.
- Basic Spelling Correction: Offers spelling correction using Levenshtein distance, phonetic matching, and limited context consideration, with customizable edit distance thresholds.
- Abbreviation Handling: Generates and processes various types of abbreviations based on predefined rules.
- Phonetic Matching: Employs Soundex and Metaphone algorithms for sound-based text comparison.
- N-gram Frequency Analysis: Uses n-gram frequency to support context-based word selection.
- Word Embedding Integration: Incorporates Word2Vec embeddings for word representation.
- Configurable: Customizable via YAML configuration file.
- Updatable Model: Supports model updates with new descriptions and user-defined corrections.
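
The Text Matching feature above names TF-IDF vectorization and cosine similarity. The following is a minimal sketch of that technique using only the Python standard library. It is not Lexiful's implementation: the function names (`tokenize`, `build_index`, `cosine`, `rank`) and the smoothed IDF formula are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(descriptions):
    """Return (idf, vectors): IDF weights and one sparse TF-IDF vector per description."""
    docs = [Counter(tokenize(d)) for d in descriptions]
    n = len(docs)
    # Document frequency: iterating a Counter yields each distinct term once per document.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF (an assumed variant) so that no weight is zero or undefined.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, descriptions, idf, vectors):
    """Score every description against the query, best first."""
    counts = Counter(tokenize(query))
    # Terms never seen in the descriptions carry no weight.
    q = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scored = [(cosine(q, v), d) for d, v in zip(descriptions, vectors)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


descriptions = [
    "Acute Myocardial Infarction",
    "Coronary Artery Disease",
    "Congestive Heart Failure",
]
idf, vectors = build_index(descriptions)
for score, text in rank("coronary artery", descriptions, idf, vectors):
    print(f"{score:.2f}  {text}")
```

Because this sketch compares whole tokens, a partial term such as "myo" would score zero here; handling partial or misspelled input is what the fuzzy and phonetic features are for.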

Lexiful is engineered as a robust solution for industry-specific scenarios where matching user input against predefined data is crucial. It excels in:
- Targeted Matching: Optimized for specific industry terminologies and data structures.
- Data Consistency: Reduces free-type errors by matching user input to standardized entries.
- Efficiency: Faster and more resource-efficient than broad AI models for specific matching tasks.
- Customizability: Easily adaptable to various industries and specific organizational needs.
- Privacy-Focused: Operates on local, predefined datasets without relying on external knowledge bases.

1. Clone the repository:

       git clone https://github.com/alvinmurimi/lexiful.git
       cd lexiful

2. Install the required dependencies:

       pip install -r requirements.txt

3. Download NLTK data:

       python -c "import nltk; nltk.download('stopwords')"

Customize the `config.yaml` file to adjust Lexiful's behavior:

    input_file: 'text.txt'
    csv_description_column: 1
    csv_encodings: ['utf-8', 'iso-8859-1', 'windows-1252']
    conjunctions: ['and', '&', '+', '/']
    fuzzy_match_algorithm: 'token_set_ratio'
    ngram_size: 3
    embedding_size: 100
    window_size: 5
    max_edit_distance: 2
    model_file: 'model.pkl'
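
The `csv_encodings` list suggests that the input file is decoded by trying each encoding in order until one succeeds. The source does not describe this behavior, so the sketch below is an assumption about how such a setting could be used. `decode_with_fallback` is a hypothetical helper, not part of Lexiful.

```python
def decode_with_fallback(raw, encodings):
    """Decode bytes with the first encoding that works.

    Returns (text, encoding_used). Hypothetical helper for illustration.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"none of the encodings {list(encodings)} could decode the input")


# Same order as the csv_encodings entry in the configuration above.
encodings = ["utf-8", "iso-8859-1", "windows-1252"]

# Bytes written by a tool that used ISO-8859-1 rather than UTF-8.
raw = "café".encode("iso-8859-1")
text, used = decode_with_fallback(raw, encodings)
print(text, used)
```

Order matters in a list like this. In Python, `iso-8859-1` accepts every byte sequence, so any entry listed after it is never reached by this sketch.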

Basic usage from Python:

    from lexiful import Lexiful

    # Initialize Lexiful from the YAML configuration file
    lexiful = Lexiful('config.yaml')

    # Match input text
    matches = lexiful.match("Your input text", threshold=60, max_matches=5)
    print(matches)

    # Teach the model a user-defined correction
    lexiful.learn_correction("original_word", "corrected_word")

    # Update the model with new descriptions
    new_descriptions = ["New description 1", "New description 2"]
    lexiful.update_model(new_descriptions)

    # Save model
    lexiful.save_model("model.pkl")

    # Load model
    loaded_lexiful = Lexiful.load_model("model.pkl")
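
The `threshold` and `max_matches` arguments in the usage above are not documented in this README. One plausible reading is a minimum score on a 0 to 100 scale plus a cap on the number of results. The sketch below illustrates that reading with the standard library's `difflib`. The scoring function and the name `match_entries` are assumptions, and Lexiful's own scoring may differ.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity on an assumed 0-100 scale."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_entries(text, entries, threshold=60, max_matches=5):
    """Return up to max_matches entries scoring at least threshold, best first."""
    scored = [(similarity(text, entry), entry) for entry in entries]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in kept[:max_matches]]


entries = ["Hypertension", "Asthma", "Pneumonia"]
print(match_entries("hipertension", entries))
```

Raising `threshold` trades recall for precision, while `max_matches` only limits how many of the surviving candidates are returned.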

We use `test.py` to evaluate our model's performance on medical terminology. The model is trained on data from `descriptions.csv`, which contains 11 medical terms.
- Standard Inputs: Tests partial terms and common medical phrases.
- Abbreviation: Checks recognition of medical acronyms.
- Fuzzy Matching: Evaluates handling of misspellings and typos.
- Phonetic Matching: Tests ability to match phonetically similar inputs.

Below are the test results:

## Standard Input Tests
| Input | Matches |
|:------------------------|:--------------------------------------|
| acute myo inf | Acute Myocardial Infarction |
| COPD | Chronic Obstructive Pulmonary Disease |
| gastro reflux | Gastroesophageal Reflux Disease |
| rheumatoid arth | Rheumatoid Arthritis |
| diabetus type 2 | Diabetes Mellitus Type 2 |
| hyper tension | Hypertension |
| coronary artery dis | Coronary Artery Disease |
| congestive heart failur | Congestive Heart Failure |
| osteo arthritis | Osteoarthritis, Rheumatoid Arthritis |
| bronchial asthma | Asthma |
## Abbreviation Tests
| Input | Matches |
|:--------|:----------------------------|
| AMI | Acute Myocardial Infarction |
| RA | Rheumatoid Arthritis |
| CAD | Coronary Artery Disease |
| CHF | Congestive Heart Failure |
| OA | Osteoarthritis |
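
Most of the abbreviations in this table are initialisms of multi-word terms. As a sketch of one rule that could produce them, the code below takes the first letter of each word and skips the tokens listed under `conjunctions` in the configuration. The helper names are hypothetical, and Lexiful's actual abbreviation rules are not shown in this README. In particular, this rule does not explain a single-word case such as "OA" for Osteoarthritis.

```python
def initialism(phrase, conjunctions=("and", "&", "+", "/")):
    """Build an initialism from a phrase, ignoring conjunction tokens."""
    skip = {c.lower() for c in conjunctions}
    words = [w for w in phrase.split() if w.lower() not in skip]
    return "".join(w[0].upper() for w in words)


def build_abbreviation_index(descriptions, conjunctions=("and", "&", "+", "/")):
    """Map each initialism to the descriptions that produce it."""
    index = {}
    for description in descriptions:
        key = initialism(description, conjunctions)
        index.setdefault(key, []).append(description)
    return index


terms = [
    "Acute Myocardial Infarction",
    "Coronary Artery Disease",
    "Congestive Heart Failure",
]
abbreviations = build_abbreviation_index(terms)
print(abbreviations.get("CAD"))
```

Storing a list per key keeps the index usable when two descriptions share the same initialism.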
## Fuzzy Matching Tests
| Input | Matches |
|:-------------------------------|:--------------------------------|
| acut myocardial infraction | Acute Myocardial Infarction |
| gastroesophagal reflux desease | Gastroesophageal Reflux Disease |
| rheumatoid arthritus | Rheumatoid Arthritis |
| diebetes mellitus | Diabetes Mellitus Type 2 |
| hipertension | Hypertension |
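
The feature list mentions Levenshtein distance with a configurable threshold (`max_edit_distance: 2` in the sample configuration). The following is a minimal sketch of that distance and of filtering a vocabulary by it. The function names are hypothetical and this is not Lexiful's own code.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, and substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution or match
            ))
        previous = current
    return previous[-1]


def candidates(word, vocabulary, max_edit_distance=2):
    """Vocabulary entries within the edit-distance limit, closest first."""
    scored = [(levenshtein(word, entry), entry) for entry in vocabulary]
    return sorted(pair for pair in scored if pair[0] <= max_edit_distance)


print(candidates("arthritus", ["arthritis", "asthma"]))
print(levenshtein("hipertension", "hypertension"))
```

Plain Levenshtein distance counts a swapped pair of letters, as in "infraction" for "infarction", as two edits, so a limit of 2 still admits that kind of typo.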
## Phonetic Matching Tests
| Input | Matches |
|:-------------|:-------------------------|
| nimonia | Pneumonia |
| asma | Asthma |
| dayabites | Diabetes Mellitus Type 2 |
| athraitis | Osteoarthritis |
| hipertenshun | Hypertension |
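
Soundex is one of the two phonetic algorithms named in the feature list. Below is a minimal sketch of standard American Soundex, written for illustration and not taken from Lexiful. It maps a word to a letter followed by three digits, so words that sound alike often share a code.

```python
def soundex(word):
    """Return the four-character American Soundex code of a word."""
    codes = {}
    groups = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
              ("l", "4"), ("mn", "5"), ("r", "6"))
    for letters, digit in groups:
        for letter in letters:
            codes[letter] = digit

    chars = [c for c in word.lower() if c.isalpha()]
    if not chars:
        return ""

    result = chars[0].upper()
    previous = codes.get(chars[0], "")
    for char in chars[1:]:
        if char in "hw":
            continue  # h and w do not separate letters with the same code
        code = codes.get(char, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code is kept
    return (result + "000")[:4]


for typed, term in [("hipertenshun", "hypertension"), ("dayabites", "diabetes")]:
    print(typed, soundex(typed), term, soundex(term))
```

Soundex alone does not cover every row of the table above. For example, "asma" and "asthma" receive different codes (A250 and A235), so a matcher would need to combine it with another signal.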

We also provide a simple web interface for testing Lexiful's matching capabilities. This interface is implemented using Flask and can be found in `app.py`.

To run the web interface:

1. Ensure you have Flask installed:

       pip install flask

2. Run the Flask application in `app.py`.

3. Open a web browser and navigate to http://localhost:5000

The web interface provides a simple input field where you can enter text. As you type, suggestions will appear based on Lexiful's matching algorithm.
This web interface is particularly useful for quick, interactive testing and demonstrations of Lexiful's capabilities.

Lexiful provides a solid starting point for text matching and entity recognition. Key areas for potential enhancements include:
- Implementing more sophisticated pre-processing steps in the `preprocess` method
- Adding new matching algorithms to the `match` method
- Expanding language support by incorporating multilingual resources
- Optimizing performance for large datasets through efficient data structures
- Fully integrating word embeddings into the matching process

Contributions are welcome! Please feel free to submit a Pull Request.

This project is licensed under the MIT License - see the LICENSE file for details.

For any questions or feedback, please open an issue or contact Alvin Mayende.