Creating a Dataset for the ODN

Lane Aasen edited this page Sep 15, 2016 · 5 revisions

The first step in adding data to the ODN is creating a dataset that the ODN can use. It is important for this process to be repeatable so that someone else can regenerate or update the data if necessary.

Download Source Files

First, find the source files for your dataset and download them. If possible, write a script that will do this automatically or at least document the process.
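A download script can be very small. The following is a minimal sketch using only the Python standard library; the function name `download_sources` and the filename-to-URL mapping are illustrative assumptions, not part of any ODN tooling.

```python
import shutil
import urllib.request
from pathlib import Path


def download_sources(sources, dest_dir):
    """Download each URL in `sources` (filename -> URL) into `dest_dir`.

    `sources` is a hypothetical mapping you maintain for your dataset.
    Returns the list of paths written, in the order of `sources`.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, url in sources.items():
        target = dest / filename
        # Stream the response to disk so large files are not held in memory.
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
        written.append(target)
    return written
```

Calling `download_sources({"population.csv": "<source URL>"}, "source")` would save the file as `source/population.csv`. Keeping the URL list in the script documents where the data came from.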

Update Entities

If your dataset is about US Census regions, then you should not need to update the ODN Entities dataset. If it isn't, or if you aren't sure, see Updating ODN Entities.

Format Data

Next, transform the data into a format that the ODN can understand.

ODN Dataset Schema

Each row in an ODN dataset must contain an entity, a variable, and a value, and may also contain a set of constraints.

ODN dataset schema:

  • id (required): Entity ID (e.g. 0400000US53)
  • name (optional): Entity name. This is not required by the ODN, but it is recommended because it makes the dataset easier for others to use. (e.g. Washington)
  • type (required): Fully qualified entity type. (e.g. region.state)
  • variable (required): Variable ID. (e.g. count)
  • value (required): Value of the variable for the entity. (e.g. 6919450)
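Before uploading, it can help to check a file against the required columns listed above. The following is a rough sketch, not an official ODN validator; the function name `find_schema_problems` is made up for illustration, and the check only covers the presence of required columns and non-empty required fields.

```python
import csv
import io

# Required columns from the ODN dataset schema; `name` is optional.
REQUIRED_COLUMNS = ("id", "type", "variable", "value")


def find_schema_problems(csv_text):
    """Return a list of human-readable problems found in an ODN-style CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = [
        f"missing required column: {column}"
        for column in REQUIRED_COLUMNS
        if column not in header
    ]
    if problems:
        return problems
    # Line 1 is the header, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        for column in REQUIRED_COLUMNS:
            # Short rows yield None for missing fields; treat those as empty.
            if not (row[column] or "").strip():
                problems.append(f"line {line_number}: empty {column}")
    return problems
```

An empty list means the file has every required column and no blank required fields. It does not check that entity IDs or types are known to the ODN.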

Constraints

Each dataset may also have constraint columns that are used to slice the data. For example, if we had a dataset containing population by year, we would use population as the variable and add a constraint column called year. The dataset would look like this:

id,name,type,year,variable,value
0400000US53,Washington,region.state,2014,population,6919450
0400000US53,Washington,region.state,2015,population,6970450
0400000US53,Washington,region.state,2016,population,7023970

Try to keep the number of variables in your dataset low and use constraints instead when possible.

Each dataset can have any number of constraint columns. For example, the occupation dataset has occupation and year constraints:

id,name,type,occupation,year,variable,value
0400000US53,Washington,region.state,Farming,2014,count,6346
0400000US53,Washington,region.state,Farming,2015,count,6330
0400000US53,Washington,region.state,Farming,2016,count,6290

Datasets can also have extra columns that are not constraints. This is useful when the customer requires certain fields that the ODN doesn't need. If a column is not registered as a constraint column, it will be ignored by the ODN.

Transform

Transform the data from its source format into the ODN dataset schema. If possible, write a script to do this and add it to the odn-pargen project. This will make it easier for people to reproduce your work in the future.
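As an illustration of such a script, the sketch below reshapes a "wide" source file, with one column per year, into ODN rows with a year constraint. The wide input layout and the function name `wide_to_odn` are assumptions made for this example; this is not code from odn-pargen, and real source files will usually need more cleanup.

```python
import csv
import io


def wide_to_odn(wide_csv_text, variable, constraint="year"):
    """Convert a wide CSV (one column per constraint value) into ODN rows.

    Assumes the input has `id`, `name`, and `type` columns and that every
    other column header is a value of the constraint (e.g. a year).
    """
    reader = csv.DictReader(io.StringIO(wide_csv_text))
    fixed = ["id", "name", "type"]
    constraint_values = [c for c in reader.fieldnames if c not in fixed]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(fixed + [constraint, "variable", "value"])
    for row in reader:
        # Emit one output row per constraint value in the wide row.
        for constraint_value in constraint_values:
            writer.writerow(
                [row[c] for c in fixed]
                + [constraint_value, variable, row[constraint_value]]
            )
    return out.getvalue()
```

Given a header of `id,name,type,2014,2015` and one row for Washington, `wide_to_odn(text, "population")` produces one output row per year, in the same shape as the population example in the Constraints section.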

Null Values

If a value is null, leave its row out of the dataset. If the row is included anyway, make sure that the value is actually null and not zero, since zero is a real value.
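One way to apply this rule during the transform step is to filter rows before writing them out. This is a small sketch; the set of strings treated as null is an assumption and should be adjusted to whatever placeholders your source files actually use.

```python
# Assumption: strings that the source uses to mean "no value".
NULL_MARKERS = {"", "null", "na", "n/a"}


def drop_null_rows(rows):
    """Return only the rows (dicts) whose `value` is not a null marker.

    A value of "0" is kept, because zero is a real value and not a null.
    """
    return [
        row
        for row in rows
        if (row.get("value") or "").strip().lower() not in NULL_MARKERS
    ]
```

The comparison is done on the raw string so that zero is never mistaken for a missing value.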

Upload

Upload your dataset to Socrata once it is formatted. If the dataset is hosted on odn.data.socrata.com, follow the naming convention: {nation} {topic} - {dataset}, e.g. "US Education - Student Teacher Ratios".

Now, you can move on to the next step: Adding Data to the ODN Backend