This directory contains everything needed to set up a formal RDF knowledge base for analyzing the historical presence of politicians in the main Italian TV shows.
The data ingestion process is managed by the LinkedData.Center SDaaS platform community edition, according to the KEES specifications.
The following data sources are considered:
- AGCOM open data
- AUDITEL public data
- Linked Open Data from Camera dei deputati and Senato
- Linked Open Data from Wikidata
- Linked Open Data from UK services
- KEES configuration data
- LODMAP configuration data
AGCOM publishes periodic reports on its web site about the presence of politicians in the main TV shows. See this page as an example.
AGCOM collects data about the speaking time of a person in a specific political or institutional role, detected during a specific TV show in a reference period. AGCOM reports distinguish between the main news programs and other in-depth journalistic programs.
For example, AGCOM collects the total speaking time in February 2019 of Matteo Salvini in the institutional role of Government Minister in the main news program (TG1) of the RAI 1 broadcast network. In the same period, in the same TV show, the same subject can also speak in the political role of Lega party leader. In this case AGCOM produces two distinct records (i.e. observations).
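As a minimal sketch of how such observations could be explored once loaded (the agcom prefix and the property names below are illustrative placeholders, not necessarily the terms defined by the agcom vocabulary):

```sparql
# Hypothetical sketch: list the speaking time per person, role and TV programme.
# The agcom: namespace and its property names are illustrative placeholders.
PREFIX qb:    <http://purl.org/linked-data/cube#>
PREFIX agcom: <http://example.org/agcom#>

SELECT ?person ?role ?programme ?speakingTime
WHERE {
  ?obs a qb:Observation ;
       agcom:person       ?person ;        # e.g. Matteo Salvini
       agcom:role         ?role ;          # institutional or political role
       agcom:programme    ?programme ;     # e.g. TG1
       agcom:speakingTime ?speakingTime .  # speaking time in the reference period
}
```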
AUDITEL is a private consortium that collects data about the audience of Italian TV shows. It publishes some aggregated data on its web site as PDF tables.
This project uses the Sintesi Annuale 2018 file to estimate the audience of the broadcast networks referenced in the AGCOM data. The data are manually extracted from the AUDITEL site and stored in the data/2018_auditel.ttl file.
The names and pictures of all Italian Parliament members of the XVIII legislatura are extracted from the official SPARQL endpoints.
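As an illustration only (the actual classes and properties exposed by the official endpoints may differ; check their published ontologies before reuse), a query for names and pictures could look like:

```sparql
# Hypothetical sketch against the Camera dei deputati SPARQL endpoint.
# The ocd: class and the properties used here are assumptions: verify them
# against the ontology actually published by the endpoint.
PREFIX ocd:  <http://dati.camera.it/ocd/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?member ?name ?picture
WHERE {
  ?member a ocd:deputato ;
          rdfs:label ?name .
  OPTIONAL { ?member foaf:depiction ?picture }
}
LIMIT 10
```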
Wikidata provides pictures and descriptions of persons, TV shows, networks and editors.
The UK reference linked data resources are used to formally describe the observation periods.
The LODMAP Bubble Graph Ontology is used to describe the graphical objects that represent the presence of politicians on TV.
Configuration data are contained in the data/agcom-strings.json and data/g0v_app.ttl files. The configuration information is intended to be used with any LODMAP-2D compliant application, such as the g0v.it web-budget.
The data/kees.ttl file contains metadata about the knowledge base according to the KEES specifications.
Reference periods are published as linked data by the GOV.UK team.
Raw data are annotated with the agcom and auditel vocabularies, which extend the RDF Data Cube Vocabulary.
Raw data are translated into an RDF Turtle data stream through simple PHP gateways, and the resulting triples are stored in an RDF graph database.
For each AGCOM observation, a normalized daily speaking time (nst) is calculated using the formula nst := seconds_in(speakingTime) / days_in(refPeriod).
For each AGCOM observation, the broadcast weight index (bwi) is a subjective rank related to the estimated audience of TV programs. It is computed starting from the potential audience data provided by AUDITEL, according to the formula bwi(observation) = COALESCE( avg_audience(observation.context), avg_audience(observation.context.network) ), which considers the potential audience of a specific TV program (e.g. Tg1) or, as a fallback, of the whole TV channel (e.g. Rai 1).
For each AGCOM observation, the daily listening time (dlt) is defined as dlt := bwi(observation) * nst(observation).
Some heuristics and guidelines estimate that a speaker pronounces words at an average rate of 100-125 words per minute, that is, roughly 2 words per second. Since an average sentence is composed of about 10 words, a sentence takes about 5 seconds to deliver. We therefore introduce a metric called TV impressions (tvi), computed by dividing the daily listening time by 5, that is tvi := bwi(observation) * nst(observation) / 5.
In other words, the TV impressions are just a rough estimate of the daily number of sentences delivered to all potential TV watchers.
For example, a 30-second daily speech in a TV program with an audience of 1,000,000 people is equivalent to 6,000,000 TV impressions (i.e. 30 * 1,000,000 / 5).
The tvi, bwi and nst metrics are computed by specific SPARQL axioms.
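For instance, a simplified axiom in the spirit of those in the axioms directory might materialize tvi like this (the agcom prefix and property names are placeholders, not the real vocabulary terms):

```sparql
# Hypothetical sketch of a SPARQL Update axiom that materializes the tvi metric.
# The agcom: namespace and its property names are illustrative placeholders.
PREFIX agcom: <http://example.org/agcom#>

INSERT { ?obs agcom:tvi ?tvi }
WHERE {
  ?obs agcom:bwi ?bwi ;
       agcom:nst ?nst .
  BIND( ?bwi * ?nst / 5 AS ?tvi )   # tvi := bwi * nst / 5
}
```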
The knowledge base build process requires you to:
- if needed, edit the files in the data directory, adding known linked data facts;
- if needed, develop the gateways that transform web resources into linked data (see the gateways doc);
- if needed, write new axioms and rules to generate new data (see the axioms doc);
- edit the build script that drives the data ingestion process, adding the new resources to ingest;
- run the sdaas agent;
- query the resulting knowledge base (see the example query after this list).
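For example, once the build completes, you could run a query like the following against the SPARQL interface (the agcom prefix and property names are placeholders for the terms actually defined by the agcom vocabulary):

```sparql
# Hypothetical example: top 10 politicians by total TV impressions.
# The agcom: namespace and its property names are illustrative placeholders.
PREFIX agcom: <http://example.org/agcom#>

SELECT ?person (SUM(?tvi) AS ?totalImpressions)
WHERE {
  ?obs agcom:person ?person ;
       agcom:tvi    ?tvi .
}
GROUP BY ?person
ORDER BY DESC(?totalImpressions)
LIMIT 10
```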
Testing the build script requires at least 2 GB of RAM available to the Docker machine:
docker run -d -p 9999:8080 -v $PWD/.:/workspace --name kb linkeddatacenter/sdaas-ce
docker exec -ti kb bash
apk --no-cache add php7 php7-mbstring
# run build process
sdaas --debug -f build.sdaas --reboot
# Access the workbench by pointing your browser to http://localhost:9999/sdaas
exit
docker rm -f kb
Log info and debug traces will be created in the .cache directory.
You can pack data and services with:
docker build . -t sdaas
docker run -d -p 8889:8080 --name datastore sdaas
The resulting container provides a read-only distribution of the whole knowledge base in a stand-alone graph database with a SPARQL interface.
- the build.sdaas file is a script that populates the knowledge base from scratch. It requires the sdaas platform community edition 2.0+
- the axioms directory contains inferences to be computed during reasoning windows.
- the data directory contains local data files
- the linked_data directory contains the constructors for linked data imported from external SPARQL endpoints
- the gateways directory contains the code to transform raw data into linked data
- the tests directory contains some axioms that must be verified when the knowledge base build terminates, in order to validate the whole build process (see the sketch after this list)
- the .cache temporary directory contains logs and debugging info; it is not saved in the repo
- the sdaas platform is derived from the LinkedData.Center SDaaS Product and is licensed under CC BY-NC-ND by LinkedData.Center to the g0v community
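As mentioned for the tests directory, a validation test could be as simple as a SPARQL ASK query that must return true at the end of the build (a sketch with placeholder names):

```sparql
# Hypothetical sketch of a validation test: the build is considered valid only
# if at least one observation carries the computed tvi metric.
# The agcom: namespace and its property name are illustrative placeholders.
PREFIX agcom: <http://example.org/agcom#>

ASK { ?obs agcom:tvi ?tvi }
```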