Creating a Dataset for the ODN
The first step in adding data to the ODN is creating a dataset that the ODN can use. It is important for this process to be repeatable so that someone else can regenerate or update the data if necessary.
First, find the source files for your dataset and download them. If possible, write a script that will do this automatically or at least document the process.
If your dataset is about US Census regions, then you should not need to update the ODN Entities dataset. If it isn't, or if you aren't sure, see Updating ODN Entities.
Next, transform the data into a format that the ODN can understand.
Each row in an ODN dataset must contain an entity, variable, value, and a set of constraints.
ODN dataset schema:
- id (required): Entity ID (e.g. 0400000US53)
- name (optional): Entity name. The ODN does not require this column, but it is recommended because it makes the dataset easier for others to use (e.g. Washington)
- type (required): Fully qualified entity type (e.g. region.state)
- variable (required): Variable ID (e.g. count)
- value (required): Value of the variable for the entity (e.g. 6919450)
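As a rough illustration of the schema, the following Python sketch checks that a CSV has the required columns and that no required field is empty. The function name `find_schema_problems` and the wording of the messages are hypothetical; this is not part of the ODN tooling, only a check you might run before uploading.

```python
import csv
import io

# Required columns from the schema above; "name" is optional.
REQUIRED_COLUMNS = ("id", "type", "variable", "value")


def find_schema_problems(csv_text):
    """Return a list of human-readable problems found in an ODN-style CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = [f"missing required column: {col}"
                for col in REQUIRED_COLUMNS if col not in header]
    if problems:
        return problems
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        for col in REQUIRED_COLUMNS:
            # Short rows yield None for missing fields, so guard against it.
            if not (row[col] or "").strip():
                problems.append(f"line {line_number}: empty {col}")
    return problems


sample = (
    "id,name,type,variable,value\n"
    "0400000US53,Washington,region.state,count,6919450\n"
)
print(find_schema_problems(sample))
```

An empty list means the sample passed these basic checks. It does not confirm that the entity IDs exist in the ODN Entities dataset.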
Each dataset may also have constraint columns that are used to slice the data.
For example, if we had a dataset containing population by year, we would use population as the variable and add a constraint column called year. The dataset would look like this:
id,name,type,year,variable,value
0400000US53,Washington,region.state,2014,population,6919450
0400000US53,Washington,region.state,2015,population,6970450
0400000US53,Washington,region.state,2016,population,7023970
Try to keep the number of variables in your dataset low and use constraints instead when possible.
Each dataset can have any number of constraint columns.
For example, the occupation dataset has occupation and year constraints:
id,name,type,year,occupation,variable,value
0400000US53,Washington,region.state,2014,Farming,count,6346
0400000US53,Washington,region.state,2015,Farming,count,6330
0400000US53,Washington,region.state,2016,Farming,count,6290
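To show what slicing by constraints means in practice, here is a small Python sketch that filters rows on constraint values. The helper `slice_rows` is hypothetical and only illustrates the idea. It is not how the ODN backend is implemented.

```python
import csv
import io

# Occupation rows in the column order id,name,type,year,occupation,variable,value.
DATA = """id,name,type,year,occupation,variable,value
0400000US53,Washington,region.state,2014,Farming,count,6346
0400000US53,Washington,region.state,2015,Farming,count,6330
0400000US53,Washington,region.state,2016,Farming,count,6290
"""


def slice_rows(rows, **constraints):
    """Keep only the rows whose constraint columns match the given values."""
    return [
        row for row in rows
        if all(row.get(column) == wanted for column, wanted in constraints.items())
    ]


rows = list(csv.DictReader(io.StringIO(DATA)))
selected = slice_rows(rows, year="2015", occupation="Farming")
print([row["value"] for row in selected])
```

Because every constraint is an ordinary column, adding another constraint to a dataset only adds another column to filter on.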
Datasets can also have extra columns that are not constraints. This is useful when the customer requires certain fields that the ODN doesn't need. If a column is not registered as a constraint column, it will be ignored by the ODN.
Transform the data from its source format into the ODN dataset schema. If possible, write a script to do this and add it to the odn-pargen project. This will make it easier for people to reproduce your work in the future.
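A transformation script might look roughly like the Python sketch below. It assumes a hypothetical source layout with one row per entity and one column per year, and reshapes it into the ODN schema with year as a constraint column. The function name `wide_to_odn_rows` and the source layout are illustrative assumptions, since real source files will differ.

```python
import csv
import io


def wide_to_odn_rows(source_text, entity_type, variable, constraint):
    """Yield ODN-style rows from a wide CSV with one column per constraint value."""
    reader = csv.DictReader(io.StringIO(source_text))
    for record in reader:
        for column, value in record.items():
            # "id" and "name" identify the entity; every other column is data.
            if column in ("id", "name"):
                continue
            yield {
                "id": record["id"],
                "name": record["name"],
                "type": entity_type,
                constraint: column,
                "variable": variable,
                "value": value,
            }


# Hypothetical source layout: one row per entity, one column per year.
source = "id,name,2014,2015\n0400000US53,Washington,6919450,6970450\n"

rows = list(wide_to_odn_rows(source, "region.state", "population", "year"))
output = io.StringIO()
writer = csv.DictWriter(
    output,
    fieldnames=["id", "name", "type", "year", "variable", "value"],
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)
print(output.getvalue())
```

Keeping the reshaping logic in one small function makes it easier to rerun the script when the source data is updated.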
If a value is null, omit that row from the dataset. If you do include the row, make sure that the value is null and not zero, because zero is a real value.
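One way to apply this rule is sketched below in Python. The set of strings treated as null ("", "NA", "null") is an assumption and should be adjusted to whatever the source data uses. The helper name `drop_null_rows` is hypothetical.

```python
def drop_null_rows(rows, null_markers=("", "NA", "null")):
    """Return only the rows whose value is present; zero counts as present."""
    kept = []
    for row in rows:
        value = row.get("value")
        # Treat a missing value or any assumed null marker as null.
        if value is None or str(value).strip() in null_markers:
            continue
        kept.append(row)
    return kept


rows = [
    {"id": "0400000US53", "variable": "count", "value": "6346"},
    {"id": "0400000US53", "variable": "count", "value": ""},
    {"id": "0400000US53", "variable": "count", "value": "0"},
    {"id": "0400000US53", "variable": "count", "value": None},
]
print(len(drop_null_rows(rows)))
```

The row with the value "0" is kept on purpose, so that a genuine zero is never confused with missing data.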
Upload your dataset to Socrata once it is formatted.
If the dataset is hosted on odn.data.socrata.com, follow the naming convention: {nation} {topic} - {dataset}, e.g. "US Education - Student Teacher Ratios".
Now you can move on to the next step: Adding Data to the ODN Backend.