This repository consists of text files containing 150k+ Urdu words for all your dictionary and word-based projects, e.g. auto-completion, auto-suggestion, embedding networks, and tagging.
I pulled the words out into simple newline-delimited text files, which are more useful when building apps or importing into databases.
- words.txt Contains all Urdu words.
- bigram_words.txt Contains all Urdu bigram words.
- trigram_words.txt Contains all Urdu trigram words.
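As a rough illustration of the auto-completion use case, here is a minimal Python sketch. It assumes the files are UTF-8 encoded with one entry per line; the function names `read_words` and `complete` are hypothetical and not part of this repository.

```python
from bisect import bisect_left


def read_words(lines):
    """Collect non-empty, stripped entries from an iterable of lines.

    An open text file works as `lines`, for example:
        with open("words.txt", encoding="utf-8") as handle:
            words = sorted(read_words(handle))
    (UTF-8 encoding is an assumption about the files.)
    """
    return [line.strip() for line in lines if line.strip()]


def complete(sorted_words, prefix, limit=5):
    """Return up to `limit` words from a sorted list that start with `prefix`."""
    # Words sharing a prefix are contiguous in a sorted list, so binary
    # search finds the first candidate and a linear scan collects the rest.
    start = bisect_left(sorted_words, prefix)
    matches = []
    for word in sorted_words[start:]:
        if len(matches) >= limit or not word.startswith(prefix):
            break
        matches.append(word)
    return matches
```

Sorting once and using binary search keeps each lookup cheap even with 150k+ entries. A trie would be an alternative if memory is not a concern.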
I have added word lists for labelling Named Entity Recognition (NER) data. They contain words for categories such as persons, locations, organizations, and dates, and give a good starting point for labelling NER data. The files below contain the words for each label.
- locations.txt Contains locations from across the world.
- persons.txt Contains person names.
- organizations.txt Contains organization names.
- dates.txt Contains time- and date-related words.
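One way to use these lists as a starting point for labelling is a simple dictionary (gazetteer) lookup. The sketch below is only an illustration under stated assumptions: whitespace tokenization, one entry per line, and the label names (`PERSON`, `LOCATION`, and so on) are hypothetical choices, as are the function names. Its output should be reviewed by hand, since names can be ambiguous.

```python
def build_gazetteer(entries_by_label):
    """Map each entry, split into a tuple of tokens, to its label.

    `entries_by_label` is a dict such as {"LOCATION": [...], "PERSON": [...]},
    where each list could hold the lines of locations.txt, persons.txt, etc.
    """
    gazetteer = {}
    for label, entries in entries_by_label.items():
        for entry in entries:
            key = tuple(entry.split())
            if key:
                # If an entry appears under two labels, the first one wins.
                gazetteer.setdefault(key, label)
    return gazetteer


def tag_tokens(tokens, gazetteer, outside="O"):
    """Greedy longest-match lookup; returns one label per token."""
    max_len = max((len(key) for key in gazetteer), default=1)
    labels = [outside] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest span first so multi-word entries are preferred.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            label = gazetteer.get(tuple(tokens[i:i + size]))
            if label is not None:
                labels[i:i + size] = [label] * size
                i += size
                break
        else:
            i += 1  # no entry starts at this token
    return labels
```

For example, with a gazetteer built from a few person and location entries, `tag_tokens("علی سعودی عرب سے لاہور آیا".split(), gazetteer)` labels each token with its category or with `O` when no entry matches.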
All contributions are more than welcome. A contribution may close an issue, fix a bug (reported or not), improve the existing code, and so on. If you would like to add a word or a new set of words, send a PR.
Have a bug or a feature request? If you wish to remove or update some of the words, please file an issue before sending a PR to the repo. [Please open a new issue]
Special thanks to everyone who helped get the Urdu hack to its current state. Thanks to the Center for Language Engineering for providing the word list.
Thank you to all our backers! 🙏 [Become a backer]
Support this project by becoming a sponsor. [Become a sponsor]
Code released under the MIT License.