This research project aims to create a large and open evidence base on Collective Intelligence (CI) research and its intersection with AI.
The work in this repository is organised in a metaflow pipeline with the following steps:
- Create a PostgreSQL database and the required tables as shown in the ER diagram. If they already exist, the initialisation is skipped.
- Collect papers from MAG based on Fields of Study (FoS). The pickled responses are stored locally in data/raw/.
- Parse the MAG API responses and store them in the PostgreSQL database.
- Collect the level of a Field of Study in MAG's hierarchy.
- Tag papers as CI and AI+CI. This method could be modified to divide a dataset into core and control groups.
- Geocode author affiliations using the Google Places API.
- Tag journals as open access based on a seed list.
- Find the type (industry, non-industry) of affiliations based on a seed list.
- Process the data used in EDA. This involves changing data types, merging and grouping tables.
- Exploratory data analysis of the CI research landscape. Produce Altair plots and store them in reports/figures as HTML pages (some of them are interactive).
  - Annual publication increase (base year: 2000)
  - Annual sum of citations
  - Publications by industry and non-industry affiliations
  - International collaborations: % of cross-country teams in CI, AI+CI
  - Industry - academia collaborations: % in CI, AI+CI
  - Adoption of open access by CI, AI+CI
  - Field of study comparison for CI, AI+CI. Produce plots for levels 1, 2 and 3 of the MAG hierarchy. Also produce a plot for a pre-selected list of Fields of Study.
  - Annual publications in conferences and journals.
  - Number of annual publications in CI, AI+CI.
- Collect metadata (publication date, title, abstract etc.) for papers referenced by CI papers. The pickled responses are stored locally in data/interim/.
- Calculate annual research diversity from the papers' Fields of Study using the Shannon diversity index. This produces an Altair plot which is stored in reports/figures as an HTML page.
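The research diversity step can be illustrated with a short sketch. It assumes that each paper comes as a (year, list of Fields of Study) pair and that diversity is computed per year over the pooled Fields of Study; the function names and the input shape are hypothetical and not taken from the pipeline code.

```python
import math
from collections import Counter, defaultdict


def shannon_diversity(labels):
    """Shannon index H = sum(p_i * ln(1 / p_i)) over the label frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((n / total) * math.log(total / n) for n in counts.values())


def annual_diversity(papers):
    """Group Fields of Study by publication year and score each year.

    `papers` is an iterable of (year, fields_of_study) pairs (assumed shape).
    """
    fields_by_year = defaultdict(list)
    for year, fields in papers:
        fields_by_year[year].extend(fields)
    return {
        year: shannon_diversity(fields)
        for year, fields in sorted(fields_by_year.items())
    }
```

A year in which every paper carries the same Field of Study scores 0, and the score grows as publications spread more evenly over more fields.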
- You can use the same pipeline to query MAG with a conference or journal name as described in Orion's docs.
- All of the parameters are stored in the model_config.yaml file. Exception: parameters of Altair plots, like width and height, are hardcoded.
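As an illustration of the CI and AI+CI tagging step of the pipeline, here is a minimal sketch. It assumes that tagging is a set-membership test on a paper's Fields of Study; the seed lists and the function name below are placeholders made up for the example, and the lists the project actually uses would presumably come from its configuration.

```python
# Placeholder seed lists (assumptions, not the project's real configuration).
CI_FIELDS = {"collective intelligence", "crowdsourcing"}
AI_FIELDS = {"machine learning", "artificial intelligence"}


def tag_paper(fields_of_study, ci_fields=CI_FIELDS, ai_fields=AI_FIELDS):
    """Return "AI+CI", "CI" or None for a paper's Fields of Study."""
    fields = {name.lower() for name in fields_of_study}
    is_ci = bool(fields & ci_fields)
    is_ai = bool(fields & ai_fields)
    if is_ci and is_ai:
        return "AI+CI"
    if is_ci:
        return "CI"
    return None
```

Passing two different seed lists to the same function is one way the method could split a dataset into core and control groups.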
- Clone the repository.
$ git clone https://github.com/nestauk/ci_mapping
- Change your working directory to ci_mapping/ and, in an Anaconda environment, install the requirements.
$ pip install -r requirements.txt
- Obtain access to Microsoft Knowledge and Google Places APIs.
- Create a .env file and add your secrets. You can use .env.example as an example.
- Run the metaflow pipeline.
$ python ci_mapping/run_pipeline.py --no-pylint run
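For the .env step above, the sketch below shows one way a file of KEY=value lines could be read with the standard library only. The helper name is hypothetical and the project may well rely on a library such as python-dotenv instead; the variable names it expects are listed in .env.example and are not reproduced here.

```python
def load_env(path=".env"):
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop surrounding whitespace and optional quotes around the value.
            values[key.strip()] = value.strip().strip("'\"")
    return values
```

The returned dictionary can then be passed to whichever client needs a key, which keeps secrets out of the source code and out of version control.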
The project assumes you have a working PostgreSQL distribution installed and running locally.
Project based on the Nesta cookiecutter data science project template.