jyomaj edited this page May 28, 2017 · 5 revisions

Welcome to the search-prototype wiki!

Initial Goal

To make the video subtitles of the LTI repository searchable, first in English, with other languages to follow.

Long-term Goal

(As described by Ray)

The intention is to provide the entire world with a single place to look up a variety of things, and get the results back in a variety of formats. For example:

In what materials can I find the following keywords? Give them back to me with the timestamps or page numbers included.

How far along is the Arabic translation of [any project]? And tell me how I can join the effort.

How many people are currently part of the Serbian Team? And tell me how to join.

Where can I find the official public distribution of [any project]? And tell me if there is one specific to my language.

Does the [any language] Team have an active Twitter (or Facebook, etc.) account? Take me there so I can subscribe to it.

And so on.

Current Status

  • Search prototype has been upgraded to use Ubuntu 17.04 and Elasticsearch 5.4.

  • The import script scrapes the contents of http://wiki.linguisticteam.org/w/Video_Repository and imports each video's metadata and English subtitles (English only for the time being) into an Elasticsearch index. The subtitles are stored as attachments. Preliminary searches, run through a JSON-aware front end for Elasticsearch (e.g. sense.qbox.io/gist) on keywords such as RBE, creativity, humanity, Venus, Jacque Fresco and Zeitgeist, yield results from the title, description and file contents, along with the timestamp at which the word is found.

  • An index (equivalent to a database) containing data for 304 videos takes up 40.3 MB. This hasn't been tested for performance or optimised, but it does the job for a prototype.
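
As an illustration of the kind of keyword search described above, here is a minimal sketch of a search body built in Python. The field names (title, description, attachment.content) are assumptions about how the index is mapped, not taken from the prototype; the multi_match query and the highlight section are standard parts of the Elasticsearch query DSL. The resulting JSON could be pasted into a JSON-aware front end and sent to the index's _search endpoint.

```python
import json


def build_keyword_query(term, size=10):
    """Return an Elasticsearch search body for a keyword lookup.

    The field names below are hypothetical; adjust them to match the
    actual mapping of the index.
    """
    fields = ["title", "description", "attachment.content"]
    return {
        "size": size,
        "query": {
            # Match the term against all three fields at once.
            "multi_match": {
                "query": term,
                "fields": fields,
            }
        },
        # Ask Elasticsearch to return the matching fragments per field.
        "highlight": {
            "fields": {name: {} for name in fields}
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_keyword_query("Jacque Fresco"), indent=2))
```

Highlighting only returns the matching text fragments; mapping a hit back to a subtitle timestamp depends on how the subtitle text was stored.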

Next steps

  • Fix "Parsing of undecoded UTF-8 will give garbage when decoding entities..." warning
  • Try importing subtitles for a different language
  • Generate a time-coded link to the video at the point where the search term(s) occur
  • Show results for terms entered in a search text box
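
One possible approach to the time-coded link step is sketched below. It assumes the subtitles are in SubRip (SRT) format and that the video host accepts a start offset in seconds through a "t" query parameter; both are assumptions, and the function names and the example.org URL are hypothetical.

```python
import re
from urllib.parse import urlencode

# Matches the start time of an SRT cue line, e.g. "00:01:15,500 --> ..."
CUE_TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})\s*-->")


def find_term_offsets(srt_text, term):
    """Return the start offsets (whole seconds) of cues containing term."""
    offsets = []
    # SRT cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = CUE_TIME.match(line.strip())
            if match:
                # Everything after the timing line is the cue text.
                text = " ".join(lines[i + 1:])
                if term.lower() in text.lower():
                    h, m, s, _ms = (int(g) for g in match.groups())
                    offsets.append(h * 3600 + m * 60 + s)
                break
    return offsets


def time_coded_link(base_url, seconds):
    """Append a start offset to base_url (the "t" parameter is assumed)."""
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode({"t": seconds})


SAMPLE_SRT = """1
00:00:01,000 --> 00:00:04,000
Welcome to the lecture.

2
00:01:15,500 --> 00:01:19,000
Jacque Fresco discusses Venus.
"""

if __name__ == "__main__":
    for offset in find_term_offsets(SAMPLE_SRT, "venus"):
        print(time_coded_link("https://example.org/watch?v=abc", offset))
```

The match is a plain case-insensitive substring test, so it will not find a phrase split across two cues; a real implementation would probably rely on the search engine's own analysis instead.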