GitHub - uyjco0/ted-scimu: The TED-SCIMU project

       TED-SCIMU

==============================

The system can do the following:

Extract the English subtitles of all the available TED talks [1] and generate a local corpus (the TED talks corpus). The corpus is stored in files and can optionally also be stored in a PostgreSQL database.
Retrieve the data available from the Science Museum collections [2] and generate a local corpus (the Science Museum corpus). The corpus is stored in files and can optionally also be stored in a PostgreSQL database.
Pre-process the generated corpus to produce a better representation for later analysis. The pre-processing includes:
- Replacing non-ASCII characters with their closest ASCII equivalent
- Removing punctuation and stopwords
- Tokenization, tagging (including filtering by tag) and lemmatization
The pre-processing is done with the excellent NLTK software [3].
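The first two pre-processing steps can be illustrated with a minimal sketch that uses only the Python standard library. It is not the project's code: the project relies on NLTK [3], while the function names `to_ascii` and `preprocess` and the tiny `STOPWORDS` set below are hypothetical, and tagging and lemmatization are left out.

```python
import string
import unicodedata

# Hypothetical, deliberately tiny stopword list (assumption for illustration);
# a real run would use a full stopword list such as the one shipped with NLTK.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "to"}


def to_ascii(text):
    """Replace accented characters with their closest ASCII equivalent."""
    # NFKD splits "é" into "e" plus a combining accent, which is then dropped.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def preprocess(text):
    """ASCII-fold, strip punctuation, lowercase, split, and drop stopwords."""
    text = to_ascii(text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.lower().split()
    return [tok for tok in tokens if tok not in STOPWORDS]


print(preprocess("The café of Zürich is a naïve idea!"))
# → ['cafe', 'zurich', 'naive', 'idea']
```

Characters that have no ASCII decomposition are simply dropped by this approach, so a production pipeline may want an explicit mapping for them.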
For each document in the available local corpus, augment the document information with the English abstracts obtained from DBpedia Spotlight annotations [4].
Generate models for the available local corpus in different Vector Space Models (VSMs). The Vector Space Models used are:
- Term Frequency–Inverse Document Frequency (TF-IDF) [5]
- Latent Dirichlet Allocation (LDA) [6]
These Vector Space Models are generated with the excellent gensim software [7].
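To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the project's implementation: the project builds its models with gensim [7], whose exact weighting and normalization may differ. The function name `tfidf` and the toy corpus are hypothetical; the weight used is the raw term count multiplied by `log2(N / document frequency)`.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Return one {term: weight} dict per tokenized document.

    weight = (term count in the document) * log2(N / document frequency)
    """
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    vectors = []
    for doc in corpus:
        counts = Counter(doc)
        vectors.append({
            term: count * math.log2(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy corpus of already pre-processed documents.
corpus = [
    ["steam", "engine", "museum"],
    ["steam", "talk", "science"],
    ["science", "museum", "science"],
]
for vector in tfidf(corpus):
    print(vector)
```

A term that occurs in every document gets weight 0 under this formula, which is why very common words contribute nothing to similarity between documents.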
Using all the functionality described above, TED-SCIMU enables a new way to navigate and discover TED talks: the user is shown a list of images of objects from the Science Museum collection, and when the user selects an image according to their interests and preferences, a related TED talk can be shown. The relationship between the images and the TED talks is found through the main topics discovered in the TED talks by the Latent Dirichlet Allocation model trained on the TED talks corpus.
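One way to picture the final matching step is as a nearest-neighbour search over topic distributions. The sketch below assumes that each museum object and each TED talk has already been projected onto the same LDA topics, and it uses cosine similarity to pick the closest talk. The source does not state which similarity measure the project uses, so the measure, the function names and the example vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_related_talk(object_topics, talk_topics):
    """Return the id of the talk whose topic vector is closest to the object's."""
    return max(
        talk_topics,
        key=lambda talk_id: cosine(object_topics, talk_topics[talk_id]),
    )


# Hypothetical topic distributions over three LDA topics.
talk_topics = {
    "talk_energy": [0.8, 0.1, 0.1],
    "talk_medicine": [0.1, 0.8, 0.1],
    "talk_space": [0.1, 0.1, 0.8],
}
steam_engine_object = [0.7, 0.2, 0.1]

print(most_related_talk(steam_engine_object, talk_topics))
# → talk_energy
```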

[1] http://on.ted.com/23
[2] http://api.sciencemuseum.org.uk/documentation/collections
[3] http://nltk.org
[4] https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki
[5] http://en.wikipedia.org/wiki/Tf%E2%80%93idf
[6] http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
[7] https://github.com/piskvorky/gensim

scimu_data
sql-scripts/postgresql
tests
.gitignore
COPYING
LICENCE
README
README.md
base.py
dbpedia.py
generate_relationships_ted_scimu
generate_scimu_corpus
generate_ted_corpus
generate_vsm.py
generate_vsm_scimu
script_tests
similarity_query_scimu.py
ted_extra_stopwords.txt
text_augmentation.py
train_lda_model_ted
train_tfidf_model_scimu
