This repo hosts Synthea-generated sample FHIR bulk export results, useful for testing downstream workflows. It also hosts a script for generating datasets of custom sizes; see the section on regenerating a dataset below for instructions.
The following pre-generated datasets are available:
- Small (10 patients, 1.9MB zipped, 19MB unzipped)
- Medium (100 patients, 17MB zipped, 129MB unzipped)
- Large (1,000 patients, 183MB zipped, 1.3GB unzipped)
The datasets cover these FHIR resource types:
- AllergyIntolerance
- Condition
- Device
- DiagnosticReport
- DocumentReference
- Encounter
- Immunization
- Medication
- MedicationRequest
- Observation
- Patient
- Procedure
The 100-patient dataset looks like this, for example:
sample-bulk-fhir-datasets-100-patients/
AllergyIntolerance.000.ndjson
Condition.000.ndjson
Device.000.ndjson
DiagnosticReport.000.ndjson
DocumentReference.000.ndjson
Encounter.000.ndjson
Immunization.000.ndjson
log.ndjson
MedicationRequest.000.ndjson
Observation.000.ndjson
Observation.001.ndjson
Patient.000.ndjson
Procedure.000.ndjson
Each file holds a list of FHIR JSON records (one per line), like:
{"resourceType":"Condition","id":"000023ef-c498-02cc-c9b7-20aab279b262",...}
Each file is also kept under 50MB to make it easier to work with. As you can see above, the Observations needed two files.
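Because every line is a standalone JSON record, the files can be read one line at a time without loading a whole file into memory. The following is a minimal sketch in Python, not part of this repo; the function names `iter_ndjson` and `count_by_type` are hypothetical and chosen only for illustration.

```python
import json
from collections import Counter


def iter_ndjson(path):
    """Yield one parsed JSON object per non-blank line of an NDJSON file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)


def count_by_type(paths):
    """Count records per resourceType across the given NDJSON files."""
    counts = Counter()
    for path in paths:
        for record in iter_ndjson(path):
            # Fall back to "unknown" for records without a resourceType field.
            counts[record.get("resourceType", "unknown")] += 1
    return counts
```

For example, passing both `Observation.000.ndjson` and `Observation.001.ndjson` to `count_by_type` would give a single combined Observation count.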
The log.ndjson file you see is a sample bulk export log, like those generated by other SMART on FHIR bulk export tools. It is included for verisimilitude and because some tools that process exported data may expect to see it.
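When processing a dataset folder yourself, you will probably want to treat log.ndjson separately from the resource files and to keep the numbered parts of each resource together. Here is one possible sketch, assuming file names follow the `<ResourceType>.<NNN>.ndjson` pattern seen in the listing above; `group_export_files` is a hypothetical helper, not something this repo provides.

```python
from collections import defaultdict
from pathlib import Path


def group_export_files(folder):
    """Map each resource type to its sorted list of NDJSON files.

    Assumes names like "Observation.001.ndjson"; log.ndjson is skipped
    because it holds the export log rather than FHIR resources.
    """
    groups = defaultdict(list)
    for path in sorted(Path(folder).glob("*.ndjson")):
        if path.name == "log.ndjson":
            continue
        # Everything before the first dot is taken as the resource type.
        resource_type = path.name.split(".", 1)[0]
        groups[resource_type].append(path)
    return dict(groups)
```

For the 100-patient dataset shown above, the "Observation" entry would hold two paths and every other resource type would hold one.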
The script that generates these datasets is licensed under Apache 2.0, but the datasets themselves can be treated as CC0 licensed (i.e. as close to public domain as possible).
The goals of this project are:
- No end-user generation of data should be necessary. All examples are pre-generated.
- Offer several different sized datasets as one-click downloads.
  - The exact definitions of those sizes are flexible.
  - Limits imposed by GitHub may affect our options.
- Each dataset should look like the plausible result of a FHIR bulk export.
- A reasonable effort will be made to keep data consistent over time.
  - That is, patient 1234 will not suddenly change addresses in a month.
  - This is not a guarantee, but a best-effort feature.
- This dataset does not need to serve everyone's needs.
  - Its primary purpose is to be useful to other SMART on FHIR projects, like Cumulus.
  - If you need something different (like a new resource), it's easy to generate your own with Synthea.
There are several other similar sample FHIR datasets or generators, with slightly different purposes:
- custom-sample-data (2017): focused on providing a few small validated JSON transaction bundles
- sample-patients (2018): focused on generating individual JSON files based off a custom text file format
- generated-sample-data (2021): focused on generating a JSON transaction bundle for insertion into a FHIR server
- ctakes-examples (2022): focused on realistic plaintext physician notes
- synthea (ongoing): and of course Synthea, the general purpose FHIR generator, used to generate this dataset
The following must be installed for the generation script to succeed:
- Java
- GNU sed
  - For Mac users, this script requires manual installation of GNU's sed command. We have had success with the gnu-sed Homebrew package.
- GNU split
  - For Mac users, this script requires manual installation of GNU's split command. We have had success with the coreutils Homebrew package, which includes split.
Then you should be able to clone and run the script:
git clone --single-branch git@github.com:smart-on-fhir/sample-bulk-fhir-datasets.git
cd sample-bulk-fhir-datasets
./generate.sh 10 # generates a ten patient dataset
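If the script fails, a quick first check is whether the prerequisites listed above can be found on your PATH. The sketch below is a hypothetical helper, not part of this repo. The command names `java`, `sed`, and `split` are an assumption about what the script invokes, and the check only confirms that a command exists, not that it is the GNU version.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    # Assumed command names; adjust them if your GNU tools are installed
    # under different names.
    missing = missing_tools(["java", "sed", "split"])
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the default `shutil.which` is enough.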