The RICardo dataset compiles trade statistics from primary sources, secondary sources and recent estimations, covering bilateral international trade flows of the 19th century.
We created a web application to visually explore this dataset. This application is not only a final product but also a research tool: it helped us curate the dataset by providing feedback on data quality, and it supports research work.
This dataset is meant to evolve. You can follow our work on the RICardo hypothèses.org blog.
Dedinger, Béatrice, and Paul Girard. 2017. "Exploring trade globalization in the long run: The RICardo project." Historical Methods: A Journal of Quantitative and Interdisciplinary History 50 (1): 30‑48. doi:10.1080/01615440.2016.1220269.
the paper at Historical Methods
our preprint version: 01-May-2016
To download the data, you can:
- use the published dataset, available through the DOI below or in the release section;
- get the latest data version by cloning this repository and using the database scripts to combine the data (see the dedicated section).
Béatrice Dedinger, & Paul Girard. (2017). RICardo dataset 2017.12 (Version 2017.12) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.1119592
The RICardo dataset is made available under the Open Database License: http://opendatacommons.org/licenses/odbl/1.0/. Any rights in individual contents of the database are licensed under the Database Contents License: http://opendatacommons.org/licenses/dbcl/1.0/
To get the latest data version processed by our deduplication algorithm, use the scripts in database_scripts.
First prepare your python environment:
$ pyenv virtualenv 3.8 ricardo_data
$ pyenv activate ricardo_data
$ pip install -r requirements.txt
Only the pip install step is mandatory, but using pyenv and virtualenv is strongly recommended.
Aggregate the many data/flows/source(s).csv into one data/flows.csv
$ cd database_scripts
$ python flows.py aggregate
Deduplicate trade flows (primary sources, general/special...)
$ python flows.py deduplicate
This script outputs a sqlite database and RICardo_trade_flows_deduplicated.csv. These deduplicated data are the ones used in our data exploration website.
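As an illustration of the aggregation step only, here is a minimal sketch that concatenates per-source CSV files into a single file. It is not the code of flows.py: the function name `aggregate_flows` and the choice to take the union of all column names are assumptions made for this example, and the real script may validate or transform rows along the way.

```python
import csv
from pathlib import Path


def aggregate_flows(flows_dir, out_path):
    """Concatenate every CSV in flows_dir into one CSV at out_path.

    Hypothetical helper, not part of the RICardo scripts.
    Returns the number of data rows written.
    """
    files = sorted(Path(flows_dir).glob("*.csv"))

    # Build the union of column names in first-seen order, since the
    # per-source files are not assumed to share exactly the same header.
    fieldnames = []
    for path in files:
        with path.open(newline="", encoding="utf-8") as f:
            for name in csv.DictReader(f).fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)

    rows_written = 0
    with Path(out_path).open("w", newline="", encoding="utf-8") as out:
        # restval fills columns missing from a given source file;
        # extrasaction="ignore" drops cells beyond a file's own header.
        writer = csv.DictWriter(
            out, fieldnames=fieldnames, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        for path in files:
            with path.open(newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

From the repository root, a call such as `aggregate_flows("data/flows", "data/flows.csv")` would mirror the paths mentioned above, assuming the per-source files sit directly in that folder.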
Data are provided in csv format (utf-8, comma separated):
- flows/: the trade flows transcribed from sources, one CSV file per source. See ./database_scripts to learn how to combine the flow data
- sources.csv: volumes of statistics, books or research papers used to compile the flows table
- RICentities.csv: RICentities are the unified nomenclature of trade reporting and partner names
- RICentities_group.csv: Some RICentities are of type 'group'. This table shows which entities are part of RICentities groups
- entity_names.csv: This table documents how the original partner and reporting names found in sources have been translated into the unified nomenclature
- exchange_rates.csv: exchange rates used to convert trade flows to pound sterling
- currencies.csv: currencies translation table
- expimp_spegen.csv: export/import and special/general translation table
The precise format (list and types of fields) of these csv files is described in the datapackage.json file. Learn more about data packages on the frictionless data website.
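To inspect that description programmatically, the following sketch reads a data package descriptor with the standard library and lists each resource's fields and types. It assumes the descriptor follows the usual Data Package layout (a top-level `resources` list whose items carry an inline `schema` with `fields`); the function name `describe_datapackage` is made up for this example.

```python
import json


def describe_datapackage(path="datapackage.json"):
    """Return {resource name: [(field name, field type), ...]}.

    Hypothetical helper: assumes each resource embeds its schema inline.
    """
    with open(path, encoding="utf-8") as f:
        package = json.load(f)

    description = {}
    for resource in package.get("resources", []):
        schema = resource.get("schema")
        # A schema may also be given as a path or URL string; skip those here.
        fields = schema.get("fields", []) if isinstance(schema, dict) else []
        description[resource.get("name")] = [
            (field.get("name"), field.get("type")) for field in fields
        ]
    return description
```

Calling `describe_datapackage()` from the folder that holds datapackage.json returns a dictionary that can be printed or compared against the header of each csv file.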
This folder contains Python and bash scripts used to:
- deduplicate_flows.py: prepare and filter flows data and combine them into a sqlite database ready to serve the RICardo online exploration tool. This script also creates the few csv exports included in the tool.
- deploy_data.sh: copy RICardo data into the RICardo web application folder specified in the config.py configuration file.
and more to be documented soon...
- RICardo_sqlite_creation.py: compile the data csv files into a sqlite database (see RICardo_schema.sql)
- update_csv_from_sqlite.py: update the data folder from the RICardo sqlite database. This script is used to update the data folder after having edited data in batch through sql queries. Some examples of such scripts can be found in the update_data_scripts folder.
- test folder: a series of python scripts which apply automatic tests to the RICardo_viz.sqlite database and output various data quality reports in the out_data folder
This folder documents the data update sessions: original files, data update sql queries, notes... Note that not all modifications are listed in this folder. To track every change made to the data, use the git history.
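The round trip from the sqlite database back to csv files can be pictured with the short sketch below, which dumps one table to a csv file using only the standard library. It is an illustration under assumptions, not the code of update_csv_from_sqlite.py: the function name `export_table` is hypothetical, and table names must be looked up in RICardo_schema.sql.

```python
import csv
import sqlite3


def export_table(db_path, table, csv_path):
    """Write every row of `table` to `csv_path`, header first.

    Hypothetical helper. Returns the list of column names.
    """
    conn = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as SQL parameters, so quote the identifier.
        quoted = '"' + table.replace('"', '""') + '"'
        cursor = conn.execute(f"SELECT * FROM {quoted}")
        header = [column[0] for column in cursor.description]
        with open(csv_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            # NULL values are written as empty cells by csv.writer.
            writer.writerows(cursor)
    finally:
        conn.close()
    return header
```

Note that this writes NULL and empty strings the same way, so a real update script may need a more careful convention for missing values.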
This work has been supported by the Agence Nationale de la Recherche under the reference RICARDO ANR-06-BLAN-0332 and by the Sciences Po Scientific Advisory Board.