aeneas
======
**aeneas** is a Python library and a set of tools to automagically
synchronize audio and text.
- Version: 1.2.0
- Date: 2015-09-27
- Developed by: `ReadBeyond <http://www.readbeyond.it/>`__
- Lead Developer: `Alberto Pettarin <http://www.albertopettarin.it/>`__
- License: the GNU Affero General Public License Version 3 (AGPL v3)
- Contact: [email protected]
Goal
----
**aeneas** automatically generates a **synchronization map** between a
list of text fragments and an audio file containing the narration of the
(same) text.
For example, given `this text
file <aeneas/tests/res/container/job/assets/p001.xhtml>`__ and `this
audio file <aeneas/tests/res/container/job/assets/p001.mp3>`__,
**aeneas** computes the following abstract map:
::

    [00:00:00.000, 00:00:02.680] <=> 1
    [00:00:02.680, 00:00:05.480] <=> From fairest creatures we desire increase,
    [00:00:05.480, 00:00:08.640] <=> That thereby beauty's rose might never die,
    [00:00:08.640, 00:00:11.960] <=> But as the riper should by time decease,
    [00:00:11.960, 00:00:15.279] <=> His tender heir might bear his memory:
    [00:00:15.279, 00:00:18.519] <=> But thou contracted to thine own bright eyes,
    [00:00:18.519, 00:00:22.760] <=> Feed'st thy light's flame with self-substantial fuel,
    [00:00:22.760, 00:00:25.719] <=> Making a famine where abundance lies,
    [00:00:25.719, 00:00:31.239] <=> Thy self thy foe, to thy sweet self too cruel:
    [00:00:31.239, 00:00:34.280] <=> Thou that art now the world's fresh ornament,
    [00:00:34.280, 00:00:36.960] <=> And only herald to the gaudy spring,
    [00:00:36.960, 00:00:40.640] <=> Within thine own bud buriest thy content,
    [00:00:40.640, 00:00:43.600] <=> And tender churl mak'st waste in niggarding:
    [00:00:43.600, 00:00:48.000] <=> Pity the world, or else this glutton be,
    [00:00:48.000, 00:00:53.280] <=> To eat the world's due, by the grave and thee.

The map can be output to file in several formats: SMIL for EPUB 3,
SRT/TTML/VTT for closed captioning, JSON/RBSE for Web usage, or raw
CSV/SSV/TSV/TXT/XML for further processing.
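To make the idea of rendering one abstract map in a concrete format more tangible, here is a minimal sketch that turns a few of the intervals above into SRT cues. The ``(begin, end, text)`` tuples and the function names ``to_srt_time`` and ``to_srt`` are assumptions made for this illustration; they are not part of the **aeneas** API, and the files actually written by **aeneas** may differ in detail.

```python
# Illustrative sketch only: these helpers are hypothetical, not aeneas code.

def to_srt_time(seconds):
    """Format a time in seconds as HH:MM:SS,mmm (SRT uses a comma)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3600000)
    minutes, millis = divmod(millis, 60000)
    secs, millis = divmod(millis, 1000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, secs, millis)


def to_srt(fragments):
    """Return SRT text for a list of (begin, end, text) tuples."""
    cues = []
    for index, (begin, end, text) in enumerate(fragments, start=1):
        cues.append("%d\n%s --> %s\n%s\n" % (
            index, to_srt_time(begin), to_srt_time(end), text))
    # Cues are separated by a blank line
    return "\n".join(cues)


# First three intervals of the abstract map shown above
fragments = [
    (0.000, 2.680, "1"),
    (2.680, 5.480, "From fairest creatures we desire increase,"),
    (5.480, 8.640, "That thereby beauty's rose might never die,"),
]
srt_text = to_srt(fragments)
print(srt_text)
```

The same list of tuples could be fed to a CSV or JSON writer instead, which is the sense in which the map is "abstract".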
System Requirements, Supported Platforms and Installation
---------------------------------------------------------
System Requirements
~~~~~~~~~~~~~~~~~~~
1. a reasonably recent machine (recommended 4 GB RAM, 2 GHz 64bit CPU)
2. ``ffmpeg`` and ``ffprobe`` executables available in your ``$PATH``
3. ``espeak`` executable available in your ``$PATH``
4. Python 2.7.x
5. Python modules ``BeautifulSoup``, ``lxml``, ``numpy``, and
``scikits.audiolab``
6. (Optional but strongly suggested) Python C headers to compile the
Python C extensions
Depending on the format(s) of audio files you work with, you might need
to install additional audio codecs for ``ffmpeg``. Similarly, you might
need to install additional voices for ``espeak``, depending on the
language(s) you work on. (Installing *all* the codecs and *all* the
voices available might be a good idea.)
If installing the above dependencies proves difficult on your OS,
consider using the `Vagrant box <http://www.vagrantup.com>`__ created by
`aeneas-vagrant <https://github.com/readbeyond/aeneas-vagrant>`__.
Supported Platforms
~~~~~~~~~~~~~~~~~~~
**aeneas** has been developed and tested on **Debian 64bit**, which is
the **only supported OS** at the moment.
However, **aeneas** has been confirmed to work on other Linux
distributions (Ubuntu, Slackware), on Mac OS X (with developer tools
installed) and on Windows Vista/7/8.1/10.
Whatever your OS is, make sure ``ffmpeg``, ``ffprobe`` (which is part of
the ``ffmpeg`` distribution), and ``espeak`` are properly installed and
callable by the ``subprocess`` Python module. One way to ensure this is
to add the directories containing these three executables to your ``$PATH``.
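A quick way to check the ``$PATH`` requirement is sketched below. It only looks for the three executable names in the directories listed in ``$PATH``; it does not verify codecs or voices, and it is not how the bundled ``check_dependencies.py`` script works. The helper names ``find_on_path`` and ``missing_tools`` are hypothetical.

```python
import os


def find_on_path(name, path=None):
    """Return the full path of executable `name`, or None if not found."""
    if path is None:
        path = os.environ.get("PATH", "")
    # On Windows, executables carry an extension such as .exe
    extensions = [""]
    if os.name == "nt":
        extensions += os.environ.get("PATHEXT", ".EXE").split(os.pathsep)
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        for extension in extensions:
            candidate = os.path.join(directory, name + extension)
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                return candidate
    return None


def missing_tools(names=("ffmpeg", "ffprobe", "espeak")):
    """Return the names that cannot be found on the PATH."""
    return [name for name in names if find_on_path(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("ffmpeg, ffprobe and espeak are all on the PATH")
```

The sketch scans ``$PATH`` by hand rather than relying on a library helper, so that it runs unchanged on Python 2.7 and on later versions.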
If installing **aeneas** natively on your OS proves difficult, you can
use VirtualBox and `Vagrant <http://www.vagrantup.com>`__ to run
**aeneas** inside a virtualized Debian image, using
`aeneas-vagrant <https://github.com/readbeyond/aeneas-vagrant>`__.
Installation
~~~~~~~~~~~~
Linux and Mac OS X
^^^^^^^^^^^^^^^^^^
1. If you are a user of a ``deb``-based Linux distribution (e.g., Debian
   or Ubuntu), you can install all the dependencies by running the
   provided `install_dependencies.sh <install_dependencies.sh>`__
   script:

   .. code:: bash

       $ sudo bash install_dependencies.sh

2. If you have another Linux distribution or Mac OS X, just make sure
   you have ``ffmpeg``, ``ffprobe`` (part of the ``ffmpeg`` package),
   and ``espeak`` installed and available on your command line. You also
   need Python 2.7.x and its "developer" package containing the C headers.

3. Run the following commands:

   .. code:: bash

       $ git clone https://github.com/readbeyond/aeneas.git
       $ cd aeneas
       $ pip install -r requirements.txt
       $ python setup.py build_ext --inplace
       $ python check_dependencies.py

If the last command prints a success message, you have all the required
dependencies installed and you can confidently run **aeneas** in
production.
Windows
^^^^^^^
Please read the installation instructions contained in the `"Using
aeneas for Audio-Text Synchronization"
PDF <http://software.sil.org/scriptureappbuilder/resources/>`__, based
on `these
directions <https://groups.google.com/d/msg/aeneas-forced-alignment/p9cb1FA0X0I/8phzUgIqBAAJ>`__,
written by Richard Margetts.
Usage
-----
1. Install ``aeneas`` as described above. (Only the first time!)

2. Open a command prompt/shell/terminal and go to the root directory of
   the aeneas repository, that is, the one containing this ``README.md``
   file.

3. To compute a synchronization map ``map.json`` for a pair
   (``audio.mp3``, ``text.txt`` in ``plain`` format), you can run:

   .. code:: bash

       $ python -m aeneas.tools.execute_task audio.mp3 text.txt "task_language=en|os_task_file_format=json|is_text_type=plain" map.json

   The third parameter (the *configuration string*) can specify several
   parameters/options. See the
   `documentation <http://www.readbeyond.it/aeneas/docs/>`__ for
   details.

4. To compute a synchronization map ``map.smil`` for a pair
   (``audio.mp3``, ``page.xhtml`` containing fragments marked by ``id``
   attributes like ``f001``), you can run:

   .. code:: bash

       $ python -m aeneas.tools.execute_task audio.mp3 page.xhtml "task_language=en|os_task_file_format=smil|os_task_file_smil_audio_ref=audio.mp3|os_task_file_smil_page_ref=page.xhtml|is_text_type=unparsed|is_text_unparsed_id_regex=f[0-9]+|is_text_unparsed_id_sort=numeric" map.smil

5. If you have several tasks to run, you can create a job container and
   a configuration file, and run them all at once:

   .. code:: bash

       $ python -m aeneas.tools.execute_job job.zip /tmp/

   File ``job.zip`` should contain a ``config.txt`` or ``config.xml``
   configuration file, providing **aeneas** with all the information
   needed to parse the input assets and format the output sync map
   files. See the
   `documentation <http://www.readbeyond.it/aeneas/docs/>`__ for
   details.

You might want to run ``execute_task`` or ``execute_job`` without
arguments to get a usage message and some examples:

.. code:: bash

    $ python -m aeneas.tools.execute_task
    $ python -m aeneas.tools.execute_job

See the `documentation <http://www.readbeyond.it/aeneas/docs/>`__ for an
introduction to the concepts of ``task`` and ``job``, and for a list of
the available options.
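The configuration strings passed to ``execute_task`` above are ``key=value`` pairs joined by ``|``. When calling the tool from a script, it can be less error-prone to assemble that string from a list of pairs. The sketch below does so and builds the matching command line. The helper names ``build_config_string`` and ``build_task_command`` are hypothetical, and rejecting values that contain ``|`` is a simplifying assumption of this sketch, not documented **aeneas** behavior.

```python
import sys


def build_config_string(options):
    """Join (key, value) pairs as key=value, separated by '|'."""
    parts = []
    for key, value in options:
        value = str(value)
        # Assumption of this sketch: refuse values that contain the separator
        if "|" in value:
            raise ValueError("value for %s contains the '|' separator" % key)
        parts.append("%s=%s" % (key, value))
    return "|".join(parts)


def build_task_command(audio, text, options, output):
    """Return the argument list for 'python -m aeneas.tools.execute_task'."""
    return [sys.executable, "-m", "aeneas.tools.execute_task",
            audio, text, build_config_string(options), output]


options = [
    ("task_language", "en"),
    ("os_task_file_format", "json"),
    ("is_text_type", "plain"),
]
command = build_task_command("audio.mp3", "text.txt", options, "map.json")
print(command[5])
# prints: task_language=en|os_task_file_format=json|is_text_type=plain

# To actually run the task from the repository root:
#   import subprocess
#   subprocess.call(command)
```

Passing the command as a list avoids shell quoting problems with the ``|`` characters in the configuration string.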
Documentation
-------------
Online: http://www.readbeyond.it/aeneas/docs/
Generated from the source (requires ``sphinx``):

.. code:: bash

    $ git clone https://github.com/readbeyond/aeneas.git
    $ cd aeneas/docs
    $ make html

Tutorial: `A Practical Introduction To The aeneas
Package <http://www.albertopettarin.it/blog/2015/05/21/a-practical-introduction-to-the-aeneas-package.html>`__
Mailing list: https://groups.google.com/d/forum/aeneas-forced-alignment
Changelog: http://www.readbeyond.it/aeneas/docs/changelog.html
Supported Features
------------------
- Input text files in plain, parsed, subtitles, or unparsed format
- Text extraction from XML (e.g., XHTML) files using ``id`` and
``class`` attributes
- Arbitrary text fragment granularity (single word, subphrase, phrase,
paragraph, etc.)
- Input audio file formats: all those supported by ``ffmpeg``
- Batch processing
- Output sync map formats: CSV, JSON, RBSE, SMIL, SRT, SSV, TSV, TTML,
  TXT, VTT, XML
- Tested languages: BG, CA, CY, DA, DE, EL, EN, ES, ET, FA, FI, FR, GA,
GRC, HR, HU, IS, IT, LA, LT, LV, NL, NO, RO, RU, PL, PT, SK, SR, SV,
SW, TR, UK
- Robust against misspelled/mispronounced words, local rearrangements
of words, background noise/sporadic spikes
- Code suitable for a Web app deployment (e.g., on-demand AWS
instances)
- Adjustable splitting times, including a max character/second
constraint for CC applications
- Automated detection of audio head/tail
- MFCC and DTW computed as Python C extensions to reduce the processing
time
Limitations and Missing Features
--------------------------------
- Audio should match the text: large portions of spurious text or audio
might produce a wrong sync map
- Audio is assumed to be spoken: not suitable/YMMV for song captioning
- No protection against memory thrashing if you feed extremely long
  audio files
TODO List
---------
- Improving robustness against music in background
- Isolate non-speech intervals (music, prolonged silence)
- Automated text fragmentation based on audio analysis
- Auto-tuning DTW parameters
- Reporting the alignment score
- Reducing (removing?) the dependency on the ``espeak``, ``ffmpeg``,
  and ``ffprobe`` executables
- Multilevel sync map granularity (e.g., multilevel SMIL output)
- Supporting input text encodings other than UTF-8
- Better documentation
- Testing other approaches, like HMM
- Publishing the package on PyPI
How Does This Thing Work?
-------------------------
One Word Explanation
~~~~~~~~~~~~~~~~~~~~
Math.
One Sentence Explanation (Layman Edition)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A good deal of math and computer science, a handful of software
engineering and some optimization tricks.
One Sentence Explanation (Pro Edition)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Using the Sakoe-Chiba Band Dynamic Time Warping (DTW) algorithm to align
the Mel-frequency cepstral coefficient (MFCC) representations of the
given (real) audio wave and of the audio wave obtained by synthesizing
the text fragments with a TTS engine, and finally mapping the computed
alignment back onto the (real) time domain.
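As a rough illustration of the sentence above, the following pure-Python sketch runs DTW restricted to a Sakoe-Chiba band on two short sequences and then maps a boundary from the synthesized sequence back onto the real one. It is a toy under stated assumptions: each frame is a single number compared by absolute difference (a stand-in for a distance between MFCC vectors), the band is the plain ``|i - j| <= band`` constraint, and the names ``dtw_sakoe_chiba`` and ``map_boundary`` are made up for this example. The actual **aeneas** code is likely to differ in all of these respects.

```python
def dtw_sakoe_chiba(real, synth, band):
    """Return (total_cost, path) aligning two sequences of numbers.

    Only cells (i, j) with |i - j| <= band are explored.
    """
    n, m = len(real), len(synth)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    if band < abs(n - m):
        raise ValueError("band too narrow to reach the end of both sequences")
    inf = float("inf")
    # cost[i][j]: minimal accumulated cost of aligning real[:i+1], synth[:j+1]
    cost = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(max(0, i - band), min(m, i + band + 1)):
            d = abs(real[i] - synth[j])
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            best = inf
            if i > 0:
                best = min(best, cost[i - 1][j])
            if j > 0:
                best = min(best, cost[i][j - 1])
            if i > 0 and j > 0:
                best = min(best, cost[i - 1][j - 1])
            cost[i][j] = d + best
    # Backtrack from the last cell to (0, 0), following the cheapest predecessor
    i, j = n - 1, m - 1
    path = [(i, j)]
    while i > 0 or j > 0:
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return cost[n - 1][m - 1], path


def map_boundary(path, synth_index):
    """Return the first real index aligned with synth_index (or later)."""
    for real_i, synth_j in path:
        if synth_j >= synth_index:
            return real_i
    return path[-1][0]


# The "real" sequence is a stretched version of the "synthesized" one
real = [0, 0, 1, 2, 3, 3, 2, 1]
synth = [0, 1, 2, 3, 2, 1]
total_cost, path = dtw_sakoe_chiba(real, synth, band=3)
print(total_cost)
print(path)
# A fragment starting at synthesized frame 3 starts at this real frame:
print(map_boundary(path, 3))
```

Restricting the search to a band keeps the number of explored cells proportional to the sequence length times the band width, instead of the full product of the two lengths.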
Extended Explanation
~~~~~~~~~~~~~~~~~~~~
To be written. Eventually. Some day.
License
-------
**aeneas** is released under the terms of the GNU Affero General Public
License Version 3. See the `LICENSE <LICENSE>`__ file for details.
The code for computing the MFCCs (`aeneas/mfcc.py <aeneas/mfcc.py>`__)
is a verbatim copy from the `CMU Sphinx III
project <http://cmusphinx.sourceforge.net/>`__.
Audio files contained in the unit tests ``aeneas/tests/res/`` directory
are adapted from recordings produced by the `LibriVox
Project <http://www.librivox.org>`__ and they are in the public domain.
Text files contained in the unit tests ``aeneas/tests/res/`` directory
are adapted from files produced by the `Project
Gutenberg <http://www.gutenberg.org>`__ and they are in the public
domain.
No copy rights were harmed in the making of this project.
Supporting and Contributing
---------------------------
Sponsors
~~~~~~~~
- **July 2015**: `Michele
Gianella <https://plus.google.com/+michelegianella/about>`__
generously supported the development of the boundary adjustment code
(v1.0.4)
- **August 2015**: `Michele
Gianella <https://plus.google.com/+michelegianella/about>`__
partially sponsored the port of the MFCC/DTW code to C (v1.1.0)
- **September 2015**: friends in West Africa partially sponsored the
development of the head/tail detection code (v1.2.0)
Supporting
~~~~~~~~~~
Would you like to support the development of **aeneas**?
We are open to accepting sponsorships to
- fix bugs,
- add new features,
- improve the quality and the performance of the code,
- port the code to other languages/platforms,
- support third-party installations, and
- improve the documentation.
Feel free to `get in touch <mailto:[email protected]>`__.
Contributing
~~~~~~~~~~~~
If you are able to contribute code directly, that's great!
Please do not work on the ``master`` branch. Instead, please create a
new branch, and open a pull request from there. I will be glad to have a
look at it!
Please make your code consistent with the existing code base style (see
the `Google Python Style
Guide <https://google-styleguide.googlecode.com/svn/trunk/pyguide.html>`__),
and test your contributed code against the unit tests before opening the
pull request. Ideally, add some unit tests for the code you contribute.
**Please note that, by opening a pull request, you automatically agree
to apply the AGPL v3 license to the code you contribute.**
If you think you found a bug, please use the `GitHub issue
tracker <https://github.com/readbeyond/aeneas/issues>`__ to file a bug
report.
Development History
-------------------
**Early 2012**: Nicola Montecchio and Alberto Pettarin co-developed an
initial experimental package to align audio and text, intended to be run
locally to compute Media Overlay (SMIL) files for EPUB 3 Audio-eBooks
**Late 2012-June 2013**: Alberto Pettarin continued engineering and
tuning the alignment tool, making it faster and memory efficient,
writing the I/O functions for batch processing of multiple audio/text
pairs, and started producing the first EPUB 3 Audio-eBooks with Media
Overlays (SMIL files) computed automatically by this package
**July 2013**: incorporation of ReadBeyond Srl
**July 2013-March 2014**: development of ReadBeyond Sync, a SaaS version
of this package, exposing the alignment function via APIs and a Web
application
**March 2014**: launch of ReadBeyond Sync beta
**April 2015**: ReadBeyond Sync beta ended
**May 2015**: release of this package on GitHub
**August 2015**: release of v1.1.0, including Python C extensions to
speed up the computation of audio/text alignment
**September 2015**: release of v1.2.0, including code to automatically
detect the audio head/tail
Acknowledgments
---------------
Many thanks to **Nicola Montecchio**, who suggested using MFCCs and DTW,
and co-developed the first experimental code for aligning audio and
text.
**Paolo Bertasi**, who developed the APIs and Web application for
ReadBeyond Sync, helped shape the structure of this package for its
asynchronous usage.
All the mighty `GitHub
contributors <https://github.com/readbeyond/aeneas/graphs/contributors>`__,
and the members of the `Google
Group <https://groups.google.com/d/forum/aeneas-forced-alignment>`__.