Dawn will break in time.
graph TD
A["Images of Books (e.g., National Diet Library Digital Collections)"] -->|OCR| B
B["Text File (MD? XML?; UTF-8? Shift-JIS?)"] -->|"Parser? (Should I do this?)"| C
B --> |"Iterate + Modify by Human (Editor in Browser or Git; cf. Wiki, Qiita, Zenn)"| B
C["Aozora Bunko File Format?"] --> D
D["Publish to Aozora Bunko?"]
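
The "Parser?" step and the UTF-8/Shift-JIS question in the diagram can be made concrete with a small sketch. It assumes a hypothetical intermediate markup `{base|reading}` for ruby (this syntax is invented here for illustration and is not part of the project), rewrites it into Aozora Bunko ruby notation (`｜base《reading》`), and encodes the result as Shift_JIS with CRLF line endings, which is the convention Aozora Bunko text files are generally understood to follow. The function names are likewise illustrative.

```python
import re

# Hypothetical intermediate markup: {base|reading}, e.g. {吾輩|わがはい}.
# This syntax is an assumption made for the sketch, not a project decision.
RUBY_PATTERN = re.compile(r"\{([^{}|]+)\|([^{}|]+)\}")


def to_aozora_ruby(text: str) -> str:
    """Rewrite {base|reading} spans as Aozora Bunko ruby: ｜base《reading》."""
    return RUBY_PATTERN.sub(lambda m: f"｜{m.group(1)}《{m.group(2)}》", text)


def encode_for_aozora(text: str) -> bytes:
    """Encode text as Shift_JIS with CRLF line endings.

    Raises UnicodeEncodeError if a character has no Shift_JIS mapping,
    so unmappable characters are reported instead of silently dropped.
    """
    normalized = text.replace("\r\n", "\n").replace("\n", "\r\n")
    return normalized.encode("shift_jis")


if __name__ == "__main__":
    source = "{吾輩|わがはい}は猫である。\n名前はまだ無い。"
    converted = to_aozora_ruby(source)
    print(converted)
    data = encode_for_aozora(converted)
    # Round-trip check: decoding gives back the converted text with CRLF endings.
    print(data.decode("shift_jis") == converted.replace("\n", "\r\n"))
```

Keeping the working text in UTF-8 and converting only at the export step would let the browser editor and Git handle the files without extra configuration, at the cost of having to flag characters that Shift_JIS cannot represent.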
- Use Existing Aozora Bunko Files as Training Data
  - Each Aozora Bunko file cites the edition it was transcribed from (the "底本", or source text), so the original printed text can be located.
  - Use this data for supervised learning
- Text Recognition
  - OCR with Python
  - Aim to generate text accurately and quickly, including from vertically typeset Japanese
- Viewer/Editor
  - A simple, fast viewer and editor that runs in the browser
  - Anyone can correct the generated text, either in the built-in editor or on GitHub (can the original page images and the generated text be compared?)
  - Can this editor be built with Python as well?
- Text Matching Game
  - A matching game for Japanese text
  - Aims to improve OCR accuracy (and to be fun, of course!)
  - The game could double as study material for learners of Japanese (like the original concept of Duolingo)
  - cf. Google reCAPTCHA
- aozorahack
  - Web Page
  - ideathon: There are many ideas similar to this project!
  - kosakuin: Aozora Bunko Editor (MIT License)
  - aozora-cli: Aozora Bunko CLI (MIT License)
  - aozora-parser.js
  - aozoraflow
- kyukyunyorituryo/AozoraEditor: Aozora Bunko editor
- kyukyunyorituryo/html2aozora
- gearsns/AozoraJavaScriptParser
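
Several of the ideas above (training on existing Aozora Bunko files, improving OCR accuracy, comparing generated text with the original) need a way to score OCR output against a trusted transcription. A minimal sketch, assuming the Aozora Bunko text serves as the reference: it computes the character error rate (Levenshtein edit distance divided by reference length) and lists the differing spans so an editor could highlight them. The function names are illustrative and not taken from any of the projects listed.

```python
from difflib import SequenceMatcher


def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete ref_char
                current[j - 1] + 1,      # insert hyp_char
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def mismatches(reference: str, hypothesis: str) -> list[tuple[str, str, str]]:
    """Return (operation, reference_span, hypothesis_span) for each difference."""
    matcher = SequenceMatcher(None, reference, hypothesis, autojunk=False)
    return [
        (tag, reference[i1:i2], hypothesis[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    reference = "吾輩は猫である"   # trusted transcription (e.g. from Aozora Bunko)
    hypothesis = "吾輩は描である"  # made-up OCR output with one misread character
    print(character_error_rate(reference, hypothesis))
    print(mismatches(reference, hypothesis))
```

The same mismatch list could also feed the matching game: spans where the OCR output and the reference disagree are natural candidates to show to players.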