Sample FHIR Bulk Export Datasets

This repo hosts Synthea-generated sample FHIR bulk export results, useful for testing downstream workflows. It also hosts a script for generating datasets of custom sizes; see the Regenerating a Dataset section below for instructions.

Downloads

  • Small (10 patients, 1.9MB zipped, 19MB unzipped)
  • Medium (100 patients, 17MB zipped, 129MB unzipped)
  • Large (1,000 patients, 183MB zipped, 1.3GB unzipped)

Which FHIR Resources Are Included?

  • AllergyIntolerance
  • Condition
  • Device
  • DiagnosticReport
  • DocumentReference
  • Encounter
  • Immunization
  • Medication
  • MedicationRequest
  • Observation
  • Patient
  • Procedure

What Do the Contents of a Dataset Look Like?

The 100-patient dataset looks like this, for example:

sample-bulk-fhir-datasets-100-patients/
  AllergyIntolerance.000.ndjson
  Condition.000.ndjson
  Device.000.ndjson
  DiagnosticReport.000.ndjson
  DocumentReference.000.ndjson
  Encounter.000.ndjson
  Immunization.000.ndjson
  log.ndjson
  MedicationRequest.000.ndjson
  Observation.000.ndjson
  Observation.001.ndjson
  Patient.000.ndjson
  Procedure.000.ndjson

Each file holds a list of FHIR JSON records, one per line, like:

{"resourceType":"Condition","id":"000023ef-c498-02cc-c9b7-20aab279b262",...}

Each file is also kept under 50MB so that it is convenient to work with. As the listing above shows, two files were needed to hold all the Observations.
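Because a resource type can be split across several numbered files, tools that consume a dataset usually need to read every shard for that type. The following is a minimal shell sketch of that idea, assuming the `<ResourceType>.NNN.ndjson` naming shown in the listing above. The function names `list_resource_types` and `count_resources` are hypothetical helpers for illustration and are not part of this repo.

```shell
# Hypothetical helpers, not part of this repo.
# Assumes shard files are named <ResourceType>.NNN.ndjson, as in the listing above.

# List the resource types present in a dataset directory.
# The three-digit pattern skips log.ndjson, which is not a resource file.
list_resource_types() {
  for f in "$1"/*.[0-9][0-9][0-9].ndjson; do
    [ -e "$f" ] || continue   # the glob stays literal when nothing matches
    basename "$f" | cut -d. -f1
  done | sort -u
}

# Count the records of one resource type across all of its shard files.
# Each non-empty line is one FHIR record.
count_resources() {
  cat "$1/$2".*.ndjson 2>/dev/null | awk 'NF { n++ } END { print n + 0 }'
}
```

For example, `count_resources sample-bulk-fhir-datasets-100-patients Observation` would total the records in both Observation.000.ndjson and Observation.001.ndjson.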

The log.ndjson file is a sample bulk export log, like those generated by other SMART on FHIR bulk export tools. It is included for verisimilitude and because some tools that process exported data may expect to see it.

License

The script that generates these datasets is licensed under Apache 2.0, but the datasets themselves can be treated as CC0 licensed (i.e. as close to public domain as possible).

Goals of This Dataset Collection

  • No end-user generation of data should be necessary. All examples are pre-generated.
  • Offer several different sized datasets as one-click downloads.
    • The exact definitions of those sizes are flexible.
    • Limits imposed by GitHub may affect our options.
  • Each dataset should look like the plausible result of a FHIR bulk export.
  • A reasonable effort will be made to keep data consistent over time.
    • That is, patient 1234 will not suddenly change addresses in a month.
    • This is not a guarantee, but a best-effort feature.

Non-Goals

  • This dataset does not need to serve everyone's needs.
    • Its primary purpose is to be useful to other SMART on FHIR projects, like Cumulus.
    • If you need something different (such as a new resource type), it's easy to generate your own with Synthea.

Prior Art

There are several other similar sample FHIR datasets or generators, with slightly different purposes:

  • custom-sample-data (2017): focused on providing a few small validated JSON transaction bundles

  • sample-patients (2018): focused on generating individual JSON files based off a custom text file format

  • generated-sample-data (2021): focused on generating a JSON transaction bundle for insertion into a FHIR server

  • ctakes-examples (2022): focused on realistic plaintext physician notes

  • synthea (ongoing): and of course Synthea, the general-purpose FHIR generator used to generate this dataset

Regenerating a Dataset

The following must be installed for the generation script to succeed:

  • Java
  • GNU sed
    • On macOS, GNU sed must be installed manually. We have had success with the gnu-sed Homebrew package.
  • GNU split
    • On macOS, GNU split must be installed manually. We have had success with the coreutils Homebrew package, which includes split.
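Before running the script, it can help to confirm that the sed and split found on your PATH are the GNU versions. The sketch below is one way to check, not something the repo provides: `is_gnu_tool` and `check_gnu_tools` are hypothetical names, and the check assumes the script invokes the tools by their plain names `sed` and `split`, which is not stated here. It relies on GNU tools printing a banner containing "GNU" for `--version`, while the BSD versions shipped with macOS reject that option.

```shell
# Hypothetical pre-flight check, not part of this repo.

# Succeed if the named command identifies itself as a GNU tool.
# BSD sed and split reject --version, so they fail this check.
is_gnu_tool() {
  case "$("$1" --version 2>/dev/null)" in
    *GNU*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Report whether sed and split on the current PATH are GNU versions.
check_gnu_tools() {
  for tool in sed split; do
    if is_gnu_tool "$tool"; then
      echo "$tool: GNU"
    else
      echo "$tool: not GNU"
    fi
  done
}
```

Running `check_gnu_tools` prints one line per tool, which makes it easy to spot a missing GNU install before the generation step fails partway through.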

Then you should be able to clone and run the script:

git clone --single-branch git@github.com:smart-on-fhir/sample-bulk-fhir-datasets.git
cd sample-bulk-fhir-datasets
./generate.sh 10 # generates a ten-patient dataset