Awesome Sanskrit Manuscriptology

This repository is dedicated to the intersection of Sanskrit Manuscriptology and Computational Linguistics, focusing on the application of modern AI and OCR techniques to the study and preservation of Sanskrit manuscripts.

It is estimated that only about one-tenth of Sanskrit literature has been brought to light. At the current pace, deciphering the knowledge still locked in manuscripts would take hundreds of years. If AI can be leveraged to convert these manuscripts into readable form, that period could be reduced dramatically. This repository is an attempt in that direction.

तमसो मा ज्योतिर्गमय । ("Lead me from darkness to light.")

Table of Contents

  1. Introduction
  2. Key References
  3. AI and OCR Techniques
  4. Institutes and Research Centers
  5. Notable Researchers
  6. Websites and Online Resources
  7. Video Playlists and Lectures
  8. Personal Perspective: Why Pursue Sanskrit Manuscriptology?
  9. How to Get Started
  10. Technical Approach
  11. Contributing
  12. License

Introduction

Sanskrit Manuscriptology is the study of Sanskrit manuscripts, their history, preservation, and interpretation. With the advent of computational linguistics and AI technologies, new avenues have opened up for the analysis, digitization, and understanding of these ancient texts.

Degrees and courses

Key References

  1. Sahoo, J., & Mohanty, B. (2015). "Digitization of Indian manuscripts heritage: Role of the National Mission for Manuscripts." IFLA Journal, 41(3), 237-250.

  2. Hellwig, O. (2010). "Improving the Morphological Analysis of Classical Sanskrit." In Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation.

  3. Goyal, P., & Huet, G. (2016). "Design and analysis of a lean interface for Sanskrit corpus annotation." Journal of Language Modelling, 4(2), 117-144.

  4. Kulkarni, A., & Huet, G. (Eds.). (2009). Sanskrit Computational Linguistics: Third International Symposium, Hyderabad, India, January 15-17, 2009, Proceedings (Lecture Notes in Computer Science, Vol. 5406). Springer.

AI and OCR Techniques

  • SanskritOCR: An OCR system specifically designed for Sanskrit manuscripts - GitHub: SanskritOCR
  • Tesseract OCR with Sanskrit support - GitHub: tesseract-ocr/tesseract
  • Devanagari Optical Character Recognition using Convolutional Neural Networks - Paper: arXiv:1708.03543
  • NLTK Sanskrit Library - GitHub: sanskrit-nltk
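
When comparing the output of OCR systems such as those listed above against a hand-checked transcription, one widely used measure is the character error rate (CER): the edit distance between the two strings divided by the length of the reference. The following is a minimal sketch in plain Python, not code from any of the listed projects. The function names and the simulated OCR output are hypothetical, and distances are counted in Unicode code points after NFC normalization rather than in aksharas.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, counted in Unicode code points."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length, after NFC normalization."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


reference = "तमसो मा ज्योतिर्गमय"
# Hypothetical OCR output: simulate a recognizer that dropped the "र्" cluster.
ocr_output = reference.replace("र्", "")
print(f"CER: {character_error_rate(reference, ocr_output):.3f}")
```

A lower CER means the OCR output is closer to the reference. For Devanagari, counting code points treats a dropped vowel sign and a dropped consonant alike, so a project may prefer to segment into aksharas before measuring.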

Institutes and Research Centers

  • National Mission for Manuscripts Website
  • Indian Institute of Advanced Study, Shimla, India Website
  • Bhandarkar Oriental Research Institute, Pune, India Website
  • French Institute of Pondicherry, India Website
  • Sanskrit Department, Harvard University, USA Website
  • Oxford Centre for Hindu Studies, UK Website
  • SAMHiTA South Asian Manuscript Histories and Textual Archive Website
  • Sangrah, a part of Dharohar, working to make India’s ancient wisdom easily and globally accessible to scholars.
  • SRI Sanskrit OCR
  • Vedvaapi
  • Sunbird Anuvaad, bootstrapped by EkStep Foundation in late 2019 as a solution to enable easier translation of legal documents from English to Indic languages and vice versa.

Notable Researchers

  • Prof. Amba Kulkarni - Sanskrit Computational Linguistics - Profile
  • Dr. Oliver Hellwig - Digital Sanskrit Philology - Profile
  • Prof. Gérard Huet - Sanskrit Heritage Site - Website
  • Dr. Pawan Goyal - Sanskrit NLP and Digital Humanities - Profile
  • Malhar Arvind Kulkarni - Profile
  • Dharmapuri Vedaratna - LinkedIn
  • Dr. Diwakar Mishra - LinkedIn
  • Girish Nath Jha - LinkedIn
  • Anil Kumar
  • Prof. Ganesh Ramakrishnan - OCR
  • Ayush Maheshwari - pe-ocr-sanskrit: source and data for the EMNLP paper 'A Benchmark and Dataset for Post-OCR text correction in Sanskrit'

Websites and Online Resources

Video Playlists and Lectures

  • Sanskrit and Indian Manuscriptology Series by IIAS Shimla - YouTube Playlist
  • Computational Sanskrit and Digital Humanities by IIT Kharagpur - NPTEL Course
  • Introduction to Sanskrit Computational Linguistics by Amba Kulkarni - YouTube Playlist

Personal Perspective: Why Pursue Sanskrit Manuscriptology?

Sanskrit Manuscriptology is a field that offers unique opportunities and challenges:

  • It's a much-needed area of study with significant potential for research and development.
  • The work is monk-like, requiring dedication and a lifelong commitment to learning.
  • There is substantial scope and need for AI applications in this field.
  • It can be an "ikigai" - a reason for being that combines passion, mission, profession, and vocation.
  • It involves specific knowledge that's not widely available, making it a valuable niche.
  • The field has international relevance, with opportunities for collaboration in countries like Germany and the US.
  • There are ample chances to write research papers and books for both academic and general audiences.
  • Projects like Namami (the National Mission for Manuscripts) are unearthing vast knowledge, providing opportunities to become an expert in the field.

How to Get Started

  1. Pursue a course or degree in Sanskrit Manuscriptology:

    • Bhandarkar Oriental Research Institute (BORI)
    • Savitribai Phule Pune University (SPPU)
    • Online playlists and courses (see Video Playlists and Lectures)
  2. Focus on various scripts:

    • Devanagari (the script most commonly used for Sanskrit)
    • Modi
    • Sharada
    • Study the works of experts like Shrinand Bapat
  3. Develop AI skills:

    • Learn libraries like spaCy, OpenCV, scikit-learn, and PyTorch
    • Focus on handwriting recognition models and custom OCR techniques (e.g., with pytesseract)
  4. Approach the field as real R&D:

    • Embrace the lack of pressure and view it as an "ikigai"
    • Develop specific knowledge in the intersection of AI and manuscriptology
  5. Learn Sanskrit on the side:

    • Continuously improve your language skills while working on AI and manuscriptology projects
  6. Consider coaching or teaching AI applications in this field
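
As a small first exercise combining the script and programming steps above, the sketch below guesses which script a transcribed passage is written in by checking each character against the Unicode blocks for Devanagari, Sharada, and Modi. It uses only the Python standard library. The function names are hypothetical, and only the three main blocks are covered (extension blocks such as Devanagari Extended are ignored), so treat it as a starting point rather than a complete detector.

```python
from collections import Counter
from typing import Optional

# Inclusive Unicode block ranges for the scripts named in step 2.
SCRIPT_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Sharada": (0x11180, 0x111DF),
    "Modi": (0x11600, 0x1165F),
}


def script_of(ch: str) -> Optional[str]:
    """Return the script whose block contains this character, if any."""
    code_point = ord(ch)
    for name, (low, high) in SCRIPT_BLOCKS.items():
        if low <= code_point <= high:
            return name
    return None


def dominant_script(text: str) -> Optional[str]:
    """Return the script with the most characters in text, or None."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    return counts.most_common(1)[0][0] if counts else None


# Spaces and Latin characters fall outside all three blocks and are skipped.
print(dominant_script("तमसो मा ज्योतिर्गमय ।"))
```

A check like this is useful for sorting transcriptions before choosing an OCR or language model, since a model trained on Devanagari will not transfer directly to Sharada or Modi text.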

Technical Approach

A typical workflow for manuscriptology using AI might include:

  1. Image Processing:

    • Input: Manuscript image
    • Process: Edge detection
    • Output: SVG (Scalable Vector Graphics)
  2. Feature Recognition:

    • Input: Vector graphics
    • Process: AI-based feature recognition
    • Output: JSON data structure
  3. Knowledge Extraction:

    • Input: JSON data
    • Process: RAG (Retrieval-Augmented Generation)
    • Output: Structured information and insights from the manuscript

This approach combines computer vision, machine learning, and natural language processing techniques to extract and understand information from ancient manuscripts.
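
The three stages can be sketched end to end in plain Python. This is an illustrative toy under stated assumptions, not the repository's implementation: the image is a small grayscale grid instead of a scanned folio, edge detection is a simple neighbor-difference test, the "feature recognition" step is a placeholder that only reports a bounding box where a trained model would go, and only the retrieval half of RAG is shown (by word overlap, with no generation step). All function names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET
from typing import List, Tuple

SVG_NS = "http://www.w3.org/2000/svg"


def detect_edges(image: List[List[int]], threshold: int = 128) -> List[Tuple[int, int]]:
    """Stage 1a: flag pixels that differ sharply from the right or lower neighbor."""
    edges = []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            right = row[x + 1] if x + 1 < len(row) else value
            below = image[y + 1][x] if y + 1 < len(image) else value
            if abs(value - right) >= threshold or abs(value - below) >= threshold:
                edges.append((x, y))
    return edges


def edges_to_svg(edges: List[Tuple[int, int]], width: int, height: int) -> str:
    """Stage 1b: serialize edge pixels as 1x1 rectangles in an SVG document."""
    rects = "".join(
        f'<rect x="{x}" y="{y}" width="1" height="1"/>' for x, y in edges
    )
    return f'<svg xmlns="{SVG_NS}" viewBox="0 0 {width} {height}">{rects}</svg>'


def recognise_features(svg: str) -> str:
    """Stage 2 placeholder: read the SVG back and emit a bounding box as JSON."""
    root = ET.fromstring(svg)
    points = [
        (int(r.get("x")), int(r.get("y"))) for r in root.iter(f"{{{SVG_NS}}}rect")
    ]
    if not points:
        return json.dumps({"regions": []})
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    region = {"x_min": min(xs), "y_min": min(ys), "x_max": max(xs), "y_max": max(ys)}
    return json.dumps({"regions": [region]})


def retrieve(query: str, passages: List[str]) -> str:
    """Stage 3 placeholder: return the passage sharing the most words with the query."""
    query_words = set(query.lower().split())
    return max(passages, key=lambda p: len(query_words & set(p.lower().split())))


# Toy 4x4 "manuscript": 255 is blank leaf, 0 is ink.
image = [
    [255, 255, 255, 255],
    [255, 0, 0, 255],
    [255, 0, 0, 255],
    [255, 255, 255, 255],
]
svg = edges_to_svg(detect_edges(image), width=4, height=4)
print(recognise_features(svg))
```

In a real pipeline, stage 1 would more likely use a library such as OpenCV, stage 2 a trained recognition model, and stage 3 an embedding-based retriever feeding a language model. The value of the sketch is the contract between stages: image in, SVG out; SVG in, JSON out; JSON and a query in, relevant text out.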

Contributing

We welcome contributions to this repository. Please read our CONTRIBUTING.md file for guidelines on how to submit issues, feature requests, and pull requests.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.