Similar Records' clustering tool for databases

Identifying and eliminating duplicate records in very large databases is a major problem for organizations that work with big data. Exact duplicates can, in large part, be found using hash tables. The problem is harder when we need to find not only exact duplicates but also records that differ slightly, for example through typos, different wording, or other anomalies. We address this by clustering similar records according to a similarity threshold and by providing a verification interface for manually eliminating similar records. We use LSH (Locality Sensitive Hashing) to identify records whose similarity meets the given threshold. We also provide a Malayalam-to-English transliteration module for processing Malayalam Unicode text in the records. The tool is served as a desktop application: the main module is written in Python and the UI is built with ElectronJS, which lets the application run cross-platform.
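To make the thresholded LSH idea concrete, here is a minimal sketch of MinHash signatures with banding, followed by an exact Jaccard check against the threshold. It uses only the Python standard library so the mechanics are visible; the project itself lists datasketch, which provides MinHash and LSH classes for this purpose. This is not the project's actual code: all function names, the shingle size, and the band/row settings are illustrative assumptions.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of character k-grams of a whitespace-normalised, lowercased record."""
    s = " ".join(text.lower().split())
    if len(s) <= k:
        return {s}
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def minhash(tokens, num_perm=64):
    """MinHash signature: for each seed, the smallest salted hash over all tokens."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for t in tokens
        ))
    return signature


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def candidate_pairs(records, bands=16, rows=4):
    """Bucket records by signature bands; records sharing any bucket become candidates."""
    buckets = defaultdict(set)
    for key, text in records.items():
        sig = minhash(shingles(text), num_perm=bands * rows)
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(key)
    pairs = set()
    for keys in buckets.values():
        ordered = sorted(keys)
        for i, x in enumerate(ordered):
            for y in ordered[i + 1:]:
                pairs.add((x, y))
    return pairs


def similar_pairs(records, threshold=0.5):
    """Keep candidate pairs whose exact Jaccard similarity meets the threshold."""
    result = []
    for x, y in sorted(candidate_pairs(records)):
        score = jaccard(shingles(records[x]), shingles(records[y]))
        if score >= threshold:
            result.append((x, y, score))
    return result


# Illustrative records (made up for this sketch).
records = {
    1: "John Smith, Kochi",
    2: "John Smith, Kochi",
    3: "Jon Smith, Kochi",
    4: "completely different entry",
}
for x, y, score in similar_pairs(records):
    print(x, y, round(score, 2))
```

Banding trades recall for speed: more bands with fewer rows each make near-duplicates more likely to share a bucket, at the cost of more candidate pairs to verify. The final Jaccard check ensures that only pairs meeting the threshold are reported, and the manual verification step described above would then operate on those pairs.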

Libraries Used

NodeJS: Electron, zerorpc.
Python3: datasketch, zerorpc, ezodf, pandas, nltk.
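Since zerorpc appears on both the NodeJS and the Python side, the Python module is presumably exposed to the Electron UI as a zerorpc server. The sketch below shows what such a server could look like, under that assumption. The class name, the method, and the address are hypothetical and are not taken from this repository; the method also illustrates the hash-table approach to exact duplicates mentioned above.

```python
class ClusteringApi:
    """Hypothetical API object; zerorpc exposes its public methods as remote calls."""

    def exact_duplicates(self, records):
        """Group indices of records that are identical after whitespace/case normalisation."""
        groups = {}
        for index, text in enumerate(records):
            key = " ".join(text.lower().split())
            groups.setdefault(key, []).append(index)
        # Only groups with more than one member are duplicates.
        return [indices for indices in groups.values() if len(indices) > 1]


def serve(address="tcp://127.0.0.1:4242"):
    """Start a blocking zerorpc server (address is an assumed example value)."""
    import zerorpc  # third-party package listed above; imported lazily

    server = zerorpc.Server(ClusteringApi())
    server.bind(address)
    server.run()


# Calling serve() would block and wait for requests from a zerorpc client,
# for example one created in the Electron process.
```

Keeping the API object free of zerorpc imports means its methods can be tested directly without starting a server.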

Steps for Execution

Clone this repo.

For Python3 Part

Create a virtual environment for Python3.
Install the Python packages listed above.
Edit line 42 of main.js and set it to the path of the virtualenv's Python interpreter.

For NodeJS Part

Run npm install to install the NodeJS dependencies.

Authors:

Deepu Shaji, Research Assistant ICFOSS
Anzi A S, Research Assistant ICFOSS

