Firoj Alam edited this page Aug 13, 2017 · 13 revisions

Welcome to the Katha: Bangla Text to Speech wiki!

Katha: Bangla Text to Speech is a software package for the Bangla language that can help tackle illiteracy, empower the visually impaired, and broaden the possibilities for human-machine interaction. The project aims to develop a TTS system for Bangla using diphone and unit selection concatenation techniques, based on the Festival speech synthesis system.

Modules of the TTS project:

Phoneme inventory:

A well-defined phoneme inventory is important for any language. It matters not only for phonetic analysis but also for speech processing applications such as TTS and ASR. There have been several past studies, mostly based on articulatory phonetics. We concentrated instead on the acoustic characteristics of Bangla phonemes, obtained by analyzing recordings of male and female voices. The goal of this task was to determine the total number of phonemes in Bangla and their acoustic properties. For this purpose we collected text in different formats and had it recorded. We hired a gender-balanced group of professional and non-professional speakers, along with a recording studio. Acoustic analysis was then carried out on the recorded speech. We concluded with an inventory of 30 consonants, 14 vowels, and 21 diphthongs. Details are given in the paper.

Text Normalization:

We have developed a rule-based text normalization system in two technologies: Java and Festival Scheme. The job of a text normalization system is to convert non-standard word representations into their standard form. The system currently handles numbers, phone numbers, ordinals, cardinals, acronyms, and abbreviations. Details are given in the paper.
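
As a rough illustration of rule-based normalization, the Python sketch below classifies each token and expands it. It is not the project's Java or Festival Scheme code: the names `DIGIT_WORDS`, `ABBREVIATIONS`, `classify` and `normalize` are hypothetical, the abbreviation table holds a single illustrative entry, and numbers are read digit by digit (the way a phone number would be read) rather than as full cardinals.

```python
import re

# Spoken form of each Bangla digit. Illustrative table, not taken from Katha.
DIGIT_WORDS = {
    "০": "শূন্য", "১": "এক", "২": "দুই", "৩": "তিন", "৪": "চার",
    "৫": "পাঁচ", "৬": "ছয়", "৭": "সাত", "৮": "আট", "৯": "নয়",
}

# Hypothetical abbreviation table with one example entry.
ABBREVIATIONS = {"ডা.": "ডাক্তার"}

# A token made up only of Bangla digits (U+09E6 to U+09EF).
BANGLA_DIGITS = re.compile(r"[০-৯]+")


def classify(token):
    """Return the class of a whitespace-delimited token."""
    if token in ABBREVIATIONS:
        return "abbreviation"
    if BANGLA_DIGITS.fullmatch(token):
        return "digits"
    return "word"


def normalize(text):
    """Expand non-standard tokens; leave ordinary words untouched."""
    out = []
    for token in text.split():
        kind = classify(token)
        if kind == "abbreviation":
            out.append(ABBREVIATIONS[token])
        elif kind == "digits":
            # Digit-by-digit reading, as for a phone number.
            out.extend(DIGIT_WORDS[d] for d in token)
        else:
            out.append(token)
    return " ".join(out)


if __name__ == "__main__":
    print(normalize("ডা. ০১২"))
```

A real system would need further classes (ordinals, cardinals, acronyms) and context to decide, for example, whether a digit string is a phone number or a quantity; the sketch only shows the classify-then-expand shape.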

Letter to sound system/lexicon:

Our team developed a rule-based pronunciation generator for Bangla words. It takes a word and finds the pronunciations for the graphemes of the word. A grapheme is a unit in writing that cannot be analyzed into smaller components. Resolving the pronunciation of a polyphone grapheme (i.e. a grapheme that can generate more than one phoneme) is the major hurdle that the Automated Pronunciation Generator (APG) encounters. Bangla is partially phonetic in nature, so rules can be defined to handle most cases. In addition, we still lack a balanced corpus that could be used to build a statistical pronunciation generator. The system is being extended continually to improve its accuracy. A pronunciation lexicon has also been developed, consisting of 93K lexical entries.
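
The idea of resolving a polyphone grapheme by context can be sketched in Python as an ordered rule list where the first matching rule wins. This is a toy, not the APG's rule set: the romanized graphemes, the phone symbols and the single context rule (a word-final "a" read differently from other positions) are all invented for illustration, as are the names `RULES` and `pronounce`.

```python
import re

# Each rule: (grapheme, regex the remaining right context must match, phone).
# Rules are tried in order, so specific contexts come before the default.
# All entries are hypothetical and do not describe real Bangla phonology.
RULES = [
    ("a", r"$", "o"),  # polyphone "a": word-final position
    ("a", r"", "O"),   # polyphone "a": default reading
    ("k", r"", "k"),
    ("l", r"", "l"),
]


def pronounce(word):
    """Map each grapheme of `word` to a phone using the first matching rule."""
    phones = []
    for i, grapheme in enumerate(word):
        right_context = word[i + 1:]
        for rule_grapheme, right_pattern, phone in RULES:
            if grapheme == rule_grapheme and re.match(right_pattern, right_context):
                phones.append(phone)
                break
        else:
            raise ValueError(f"no rule for grapheme {grapheme!r}")
    return phones


if __name__ == "__main__":
    print(pronounce("kala"))
```

With these toy rules the two occurrences of "a" in "kala" receive different phones, which is the polyphone behaviour the rules have to capture.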

Intonation Modeling:

In linguistics, intonation is the variation of pitch while speaking that is not used to distinguish words. Intonation and stress are the two main elements of linguistic prosody. Since no intonation system exists for Bangla, we are trying to build a statistical intonation model from a speech corpus. A read-speech corpus was developed for this purpose.

Diphone database for TTS:

We developed a diphone database consisting of 4355 diphones. The number of possible diphones grows as the square of the number of phones. We identified 30 consonants, 14 vowels and 21 diphthongs. The work included designing nonsense sentences from the diphone list, having them recorded by a professional speaker, and splitting and labeling the recordings. Please download the speech corpora.
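
The diphone count can be checked with a short Python sketch. The 65 phonemes alone give 65 × 65 = 4225 ordered pairs, which does not equal 4355. One reading that happens to reproduce the figure, and is only an assumption here since the page does not state it, is that a silence unit is added to the inventory and the silence-silence pair is dropped. The phone symbols below are placeholders, not the project's labels.

```python
from itertools import product

# Inventory sizes as reported on this page.
N_CONSONANTS, N_VOWELS, N_DIPHTHONGS = 30, 14, 21

# Placeholder symbols; the real phone labels are not given here.
phones = (
    [f"c{i}" for i in range(N_CONSONANTS)]
    + [f"v{i}" for i in range(N_VOWELS)]
    + [f"d{i}" for i in range(N_DIPHTHONGS)]
)

SILENCE = "sil"  # assumed extra unit, not stated in the source
units = phones + [SILENCE]

# Every ordered pair of units, excluding silence followed by silence.
diphones = [
    (left, right)
    for left, right in product(units, repeat=2)
    if (left, right) != (SILENCE, SILENCE)
]

if __name__ == "__main__":
    print("phones:", len(phones))
    print("pairs of phones only:", len(phones) ** 2)
    print("pairs with silence, minus sil-sil:", len(diphones))
```

Under that assumption the list has 66 × 66 − 1 = 4355 entries; a nonsense-sentence design would then need to cover each pair at least once.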
