This repository is dedicated to the intersection of Sanskrit Manuscriptology and Computational Linguistics, focusing on the application of modern AI and OCR techniques to the study and preservation of Sanskrit manuscripts.
It is estimated that only about one-tenth of Sanskrit literature has come to light. Deciphering the vast knowledge still hidden in manuscripts could take hundreds of years. If AI can be used to get these texts into readable form, that time can be cut dramatically. This repository is an attempt in that direction.
तमसो मा ज्योतिर्गमय । (Lead me from darkness to light.)
- Introduction
- Key References
- AI and OCR Techniques
- Institutes and Research Centers
- Notable Researchers
- Websites and Online Resources
- Video Playlists and Lectures
- Personal Perspective: Why Pursue Sanskrit Manuscriptology?
- How to Get Started
- Technical Approach
- Contributing
- License
Sanskrit Manuscriptology is the study of Sanskrit manuscripts, their history, preservation, and interpretation. With the advent of computational linguistics and AI technologies, new avenues have opened up for the analysis, digitization, and understanding of these ancient texts.
- Post Graduate Diploma In Manuscriptology And Palaeography (PGDMP)
- Online Diploma Program in Manuscriptology and Paleography
- Sahoo, J., & Mohanty, B. (2015). "Digitization of Indian manuscripts heritage: Role of the National Mission for Manuscripts." IFLA Journal, 41(3), 237-250.
- Hellwig, O. (2010). "Improving the Morphological Analysis of Classical Sanskrit." In Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation.
- Goyal, P., & Huet, G. (2016). "Design and analysis of a lean interface for Sanskrit corpus annotation." Journal of Language Modelling, 4(2), 117-144.
- Kulkarni, A., & Huet, G. (2009). "Sanskrit computational linguistics." In Third International Symposium, Hyderabad, India, January 15-17, 2009, Proceedings (Vol. 5402). Springer.
- SanskritOCR: An OCR system specifically designed for Sanskrit manuscripts - GitHub: SanskritOCR
- Tesseract OCR with Sanskrit support - GitHub: tesseract-ocr/tesseract
- Devanagari Optical Character Recognition using Convolutional Neural Networks - Paper: arXiv:1708.03543
- NLTK Sanskrit Library - GitHub: sanskrit-nltk
- National Mission for Manuscripts - Website
- Indian Institute of Advanced Study, Shimla, India - Website
- Bhandarkar Oriental Research Institute, Pune, India - Website
- French Institute of Pondicherry, India - Website
- Sanskrit Department, Harvard University, USA - Website
- Oxford Centre for Hindu Studies, UK - Website
- SAMHiTA (South Asian Manuscript Histories and Textual Archive) - Website
- Sangrah - part of Dharohar, working to make India's ancient wisdom easily accessible to scholars around the world.
- SRI Sanskrit OCR
- Vedvaapi
- Sunbird Anuvaad - bootstrapped by EkStep Foundation in late 2019 to enable easier translation of legal documents between English and Indic languages.
- Prof. Amba Kulkarni - Sanskrit Computational Linguistics - Profile
- Dr. Oliver Hellwig - Digital Sanskrit Philology - Profile
- Prof. Gérard Huet - Sanskrit Heritage Site - Website
- Dr. Pawan Goyal - Sanskrit NLP and Digital Humanities - Profile
- Malhar Arvind Kulkarni - Profile
- Dharmapuri Vedaratna - LinkedIn
- Dr. Diwakar Mishra - LinkedIn
- Girish Nath Jha - LinkedIn
- Anil Kumar
- Prof. Ganesh Ramakrishnan, OCR
- Ayush Maheshwari - pe-ocr-sanskrit: source and data for the EMNLP paper 'A Benchmark and Dataset for Post-OCR text correction in Sanskrit'
- Sanskrit Documents - Website
- sanskrit-ocr
- DevDigitizer Project (Sanskrit OCR) - aims to build state-of-the-art Optical Character Recognition software for Sanskrit/Samskritam (Devanagari script).
- SRI, Sanskrit OCR Tool
- ShareOCR 1, ShareOCR 2 - The End-to-End OCR for Indic Content
- GRETIL - Göttingen Register of Electronic Texts in Indian Languages - Website
- Sanskrit Heritage Site - Website
- Digital Corpus of Sanskrit - Website
- SARIT - Search and Retrieval of Indic Texts - Website
- Manuscriptology & Paleography National Workshop SAMSKRITAM & BHARATIYASAMSKRITI 1
- कार्यशाला- ग्रंथ संधानम्
- Manuscriptology - I, Editing Process
- Manuscriptology: Introduction, Definition of Manuscript, Manuscript composition. (Common elements of Manuscript)
- 18 CME Dr Mohan Joshi - Basics of Manuscriptology
- Manuscripts Treasure of India : Repository of our Heritage || Dr. Sarwarul Haque ||
- भारतीय पाण्डुलिपि विज्ञान | डॉ० कीर्ति कान्त शर्मा | कलानिधि, IGNCA
- NYCIKS 2023 - Workshop on Manuscriptology – Prof. Gauri Mahulikar
- Manuscriptology - Prof. Malhar Kulkarni
- 61. Manuscriptology - Grantha Script - Dr.Krishnamachari
- leap-pe-tool - a framework for assisting humans in correcting translation/OCR errors in documents, mostly dedicated to Indian languages (Udaan Project)
- Sanskrit and Indian Manuscriptology Series by IIAS Shimla - YouTube Playlist
- Computational Sanskrit and Digital Humanities by IIT Kharagpur - NPTEL Course
- Introduction to Sanskrit Computational Linguistics by Amba Kulkarni - YouTube Playlist
Sanskrit Manuscriptology is a field that offers unique opportunities and challenges:
- It's a much-needed area of study with significant potential for research and development.
- The work is monk-like, requiring dedication and a lifelong commitment to learning.
- There is considerable scope and need for AI applications in this field.
- It can be an "ikigai" - a reason for being that combines passion, mission, profession, and vocation.
- It involves specific knowledge that's not widely available, making it a valuable niche.
- The field has international relevance, with opportunities for collaboration in countries like Germany and the US.
- There are ample chances to write research papers and books for both academic and general audiences.
- Projects like Namami are unearthing vast knowledge, providing opportunities to become an expert in the field.
- Pursue a course or degree in Sanskrit Manuscriptology:
  - Bhandarkar Oriental Research Institute (BORI)
  - Savitribai Phule Pune University (SPPU)
  - Online playlists and courses (see Video Playlists and Lectures)
- Focus on various scripts:
  - Devanagari (commonly used for Sanskrit)
  - Modi
  - Sharada
  - Study the works of experts like Shrinand Bapat
- Develop AI skills:
  - Learn libraries like spaCy, OpenCV, scikit-learn, and PyTorch
  - Focus on handwriting recognition models and custom OCR techniques (e.g., pytesseract)
- Approach the field as real R&D:
  - Embrace the lack of pressure and view it as an "ikigai"
  - Develop specific knowledge in the intersection of AI and manuscriptology
- Learn Sanskrit on the side:
  - Continuously improve your language skills while working on AI and manuscriptology projects
- Consider coaching or teaching AI applications in this field
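For the "Develop AI skills" item above, a first OCR experiment with pytesseract might look like the sketch below. It assumes that Pillow, pytesseract, and the Tesseract binary with its Sanskrit language data (`san`) are installed. The function names and the choice of page segmentation mode 6 are illustrative, not something this repository prescribes. Tesseract's stock models are trained largely on printed text, so output on handwritten folios will likely need manual correction.

```python
from pathlib import Path


def tesseract_config(psm: int = 6) -> str:
    """Build the Tesseract option string; --psm selects the page segmentation mode."""
    return f"--psm {psm}"


def ocr_page(image_path: str, lang: str = "san", psm: int = 6) -> str:
    """Run Tesseract on one scanned page and return the recognised text."""
    # Imported lazily so this module still loads where the OCR stack is missing.
    from PIL import Image, ImageOps
    import pytesseract

    image = Image.open(Path(image_path))
    # Grayscale plus contrast stretching is a common, minimal cleanup step.
    gray = ImageOps.autocontrast(ImageOps.grayscale(image))
    return pytesseract.image_to_string(gray, lang=lang, config=tesseract_config(psm))
```

Calling `ocr_page("folio_001.png")` (a hypothetical file name) would return the recognised text as a Python string, ready for post-OCR correction.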
A typical workflow for manuscriptology using AI might include:
- Image Processing:
  - Input: Manuscript image
  - Process: Edge detection
  - Output: SVG (Scalable Vector Graphics)
- Feature Recognition:
  - Input: Vector graphics
  - Process: AI-based feature recognition
  - Output: JSON data structure
- Knowledge Extraction:
  - Input: JSON data
  - Process: RAG (Retrieval-Augmented Generation)
  - Output: Structured information and insights from the manuscript
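As a rough illustration of the first two steps above, the sketch below takes a small grayscale pixel grid to an SVG and then to a JSON summary, using only the Python standard library. All function names are hypothetical. The neighbour-difference threshold is a simplified stand-in for a real edge detector (such as those in OpenCV), and the bounding-box summary is only a placeholder for AI-based feature recognition.

```python
import json
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def edge_mask(gray, threshold=100):
    """Flag pixels whose intensity jump to the right or lower neighbour exceeds threshold."""
    height, width = len(gray), len(gray[0])
    mask = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            right = gray[y][min(x + 1, width - 1)]
            below = gray[min(y + 1, height - 1)][x]
            jump = max(abs(right - gray[y][x]), abs(below - gray[y][x]))
            mask[y][x] = jump > threshold
    return mask


def mask_to_svg(mask):
    """Serialise each edge pixel as a 1x1 <rect> in an SVG document."""
    attrs = {"xmlns": SVG_NS, "width": str(len(mask[0])), "height": str(len(mask))}
    svg = ET.Element("svg", attrs)
    for y, row in enumerate(mask):
        for x, is_edge in enumerate(row):
            if is_edge:
                ET.SubElement(svg, "rect", {"x": str(x), "y": str(y), "width": "1", "height": "1"})
    return ET.tostring(svg, encoding="unicode")


def svg_to_features(svg_text):
    """Placeholder for AI-based recognition: summarise the vector shapes as JSON."""
    root = ET.fromstring(svg_text)
    points = [(int(r.get("x")), int(r.get("y"))) for r in root.iter(f"{{{SVG_NS}}}rect")]
    if not points:
        return json.dumps({"edge_pixels": 0, "bbox": None})
    xs, ys = zip(*points)
    return json.dumps({"edge_pixels": len(points), "bbox": [min(xs), min(ys), max(xs), max(ys)]})
```

Chaining the calls as `svg_to_features(mask_to_svg(edge_mask(gray)))` yields a JSON string that a later step can index. In practice the pixel grid would come from a scanned folio and the last function would be replaced by a trained model.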
This approach combines computer vision, machine learning, and natural language processing techniques to extract and understand information from ancient manuscripts.
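The retrieval half of the Knowledge Extraction step can be sketched with the standard library alone. The example below ranks JSON-style records against a question using bag-of-words cosine similarity and assembles a prompt for a language model. The record fields (`folio`, `text`), the function names, and the prompt wording are assumptions made for illustration. A real system would more likely use learned embeddings and would pass the prompt to an actual model, which is omitted here.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenisation; real Sanskrit text needs proper segmentation."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, records, k=2):
    """Return the k records whose text is most similar to the query."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(
        records,
        key=lambda rec: cosine(query_vec, Counter(tokenize(rec["text"]))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, passages):
    """Assemble retrieved passages and the question into one prompt string."""
    context = "\n".join(f"[{p['folio']}] {p['text']}" for p in passages)
    return f"Answer using only the context.\n\nContext:\n{context}\n\nQuestion: {query}"
```

Given a list of records such as `{"folio": "2v", "text": "..."}`, `build_prompt(question, retrieve(question, records))` produces the text that would be sent to the generation step.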
We welcome contributions to this repository. Please read our CONTRIBUTING.md file for guidelines on how to submit issues, feature requests, and pull requests.
This project is licensed under the MIT License - see the LICENSE.md file for details.