In this document we review the mapping (a.k.a. projection) from FHIR resources to Parquet schema. A few high-level reminders:
- The basic idea for the projection follows the Simplified SQL Projection of FHIR Resources proposal.
- The projection is done using a forked version of the Bunsen library. The entry point for this conversion logic is AvroConverter. As the name suggests, the conversion logic is from FHIR StructureDefinition to Apache Avro.
- Conversion from Avro to Parquet is done using the parquet-avro library.
If you want to look at examples before reading the details, see Patient_no-extension.schema for the projection of the base Patient resource and Patient_US-Core.schema for the US Core Patient profile. The intermediate Avro schema for the latter is in us-core-patient-schema.json.
Note: In the following subsections we cover the rules for mapping a FHIR type to a Parquet schema. As mentioned above, this involves the intermediate Avro types, which are covered as well. In all cases, the real Avro type is a union because all fields are nullable. So, for example, when we say the FHIR code type is mapped to Avro string, it is really the ["null", "string"] union type. This is not reiterated below, but it is also the reason all Parquet fields are optional. This is true even for fields whose cardinality is exactly one, like Observation.status.
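To make the nullable-union convention concrete, here is a minimal Python sketch that builds the Avro JSON for one field. The helper name `nullable` and the `"default": None` entry are our own assumptions for illustration; they are not taken from the Bunsen code.

```python
import json


def nullable(avro_type):
    """Wrap an Avro type in a union with "null" so that the field is optional."""
    return ["null", avro_type]


# Hypothetical field definition for Observation.status (FHIR code -> Avro string).
# Even though its cardinality is exactly one, the type is still a union with "null".
status_field = {"name": "status", "type": nullable("string"), "default": None}

print(json.dumps(status_field))
```

Because "null" is the first branch of the union, a null default is valid Avro, which is why the sketch uses it.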
The FHIR primitive types are mapped according to this table (code reference):
FHIR type | Avro type | Parquet type |
---|---|---|
base64Binary | string | STRING |
boolean | boolean | boolean |
canonical | string | STRING |
code | string | STRING |
date | string | STRING |
dateTime | string | STRING |
decimal | double | double * |
id | string | STRING |
instant | string | STRING |
integer | int | int32 |
markdown | string | STRING |
oid | string | STRING |
positiveInt | int | int32 |
string | string | STRING |
time | string | STRING |
unsignedInt | int | int32 |
xhtml | string | STRING |
uri | string | STRING |
url | string | STRING |
uuid | string | STRING |
* The original Bunsen used the Avro decimal type to represent FHIR decimal, but we changed this because of precision issues, as described in Issue #156.
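The primitive mapping can also be written as a simple lookup. The following Python sketch only restates the FHIR-to-Avro column of the table; the names `FHIR_PRIMITIVE_TO_AVRO` and `avro_type_for` are hypothetical and do not exist in the converter.

```python
# FHIR primitive type -> Avro primitive type, as listed in the table above.
FHIR_PRIMITIVE_TO_AVRO = {
    "base64Binary": "string",
    "boolean": "boolean",
    "canonical": "string",
    "code": "string",
    "date": "string",
    "dateTime": "string",
    "decimal": "double",  # not the Avro decimal logical type, see the footnote
    "id": "string",
    "instant": "string",
    "integer": "int",
    "markdown": "string",
    "oid": "string",
    "positiveInt": "int",
    "string": "string",
    "time": "string",
    "unsignedInt": "int",
    "xhtml": "string",
    "uri": "string",
    "url": "string",
    "uuid": "string",
}


def avro_type_for(fhir_type):
    """Return the nullable Avro union for a FHIR primitive type name."""
    return ["null", FHIR_PRIMITIVE_TO_AVRO[fhir_type]]


print(avro_type_for("decimal"))
print(avro_type_for("positiveInt"))
```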
A FHIR record type, i.e., a complex type that has one or more fields, is mapped to an Avro record, which in turn is mapped to a Parquet group. FHIR examples include any Complex Type, BackboneElement, and Resource. For example, a period field with the FHIR Period type is mapped to the following group in Parquet:
optional group period {
  optional binary start (STRING);
  optional binary end (STRING);
}
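For comparison, the intermediate Avro record for such a field could look roughly like the structure built below. This is a sketch: the record name "Period" without a namespace and the helper `optional_field` are assumptions, and the schema that the converter actually generates may name things differently.

```python
import json


def optional_field(name, avro_type):
    """Build an Avro field whose type is the nullable union of avro_type."""
    return {"name": name, "type": ["null", avro_type], "default": None}


# Avro record corresponding to the FHIR Period type (record name is an assumption).
period_record = {
    "type": "record",
    "name": "Period",
    "fields": [
        optional_field("start", "string"),
        optional_field("end", "string"),
    ],
}

# The enclosing `period` field is itself optional, matching `optional group period`.
period_field = optional_field("period", period_record)

print(json.dumps(period_field, indent=2))
```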
Many FHIR record types have fields that can be repeated. Each element with max cardinality higher than 1 is mapped to an Avro array, which in turn is mapped to a Parquet LIST. As an example, here is the schema for the address field of a Patient resource:
optional group address (LIST) {
  repeated group array {
    optional binary use (STRING);
    optional binary type (STRING);
    optional binary text (STRING);
    optional group line (LIST) {
      repeated binary array (STRING);
    }
    optional binary city (STRING);
    optional binary district (STRING);
    optional binary state (STRING);
    optional binary postalCode (STRING);
    optional binary country (STRING);
    optional group period {
      optional binary start (STRING);
      optional binary end (STRING);
    }
  }
}
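On the Avro side, a repeated element becomes an array type. The sketch below builds a trimmed-down version of the address field with only two of its members; the helper names and the record name "Address" are assumptions made for illustration.

```python
import json


def optional_field(name, avro_type):
    """Build an Avro field whose type is the nullable union of avro_type."""
    return {"name": name, "type": ["null", avro_type], "default": None}


def repeated(item_type):
    """Avro array type used for elements with max cardinality higher than 1."""
    return {"type": "array", "items": item_type}


# Trimmed-down Address record: a single-valued `city` and a repeated `line`.
address_record = {
    "type": "record",
    "name": "Address",
    "fields": [
        optional_field("city", "string"),
        optional_field("line", repeated("string")),
    ],
}

# Patient.address is itself repeated, so it is an optional array of Address records.
address_field = optional_field("address", repeated(address_record))

print(json.dumps(address_field, indent=2))
```

Note that the list itself is optional while the array items are not wrapped in a union here, which mirrors the `repeated binary array (STRING)` line in the Parquet schema above.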
A FHIR "choice type", i.e., a field whose name ends with [x] and which can take multiple types, is modeled as a record. The fields of the record are named after the possible types. For example, Patient.deceased[x] can be a boolean or a dateTime; hence it is modeled with the following Parquet schema:
optional group deceased {
  optional boolean boolean;
  optional binary dateTime (STRING);
}
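The choice-type rule can be sketched as a small function that creates one optional field per allowed type. The function `choice_record` and the generated record name are hypothetical; only the field naming (one field per FHIR type name) comes from the rule described above.

```python
def choice_record(element_name, allowed_types):
    """Model a FHIR choice element as a record with one optional field per type.

    allowed_types maps each FHIR type name to the Avro type used for it.
    """
    return {
        "type": "record",
        "name": element_name + "Choice",  # record name is an assumption
        "fields": [
            {"name": fhir_type, "type": ["null", avro_type], "default": None}
            for fhir_type, avro_type in allowed_types.items()
        ],
    }


# Patient.deceased[x] can be a boolean or a dateTime.
deceased = choice_record("deceased", {"boolean": "boolean", "dateTime": "string"})

print([field["name"] for field in deceased["fields"]])
```

In a given row, we would expect at most one of these fields to be non-null, since a FHIR choice element carries a single value.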
FHIR references are also records, but because they frequently participate in JOIN queries between different resource tables, they have some extra special fields. There is one such field for each resource type that the reference can refer to, which makes it easier to write JOIN queries. For example, Patient.generalPractitioner can be a reference to an Organization, Practitioner, or PractitionerRole. Therefore, it is mapped to the following Parquet schema (only the special fields are shown; note there might be multiple generalPractitioner values, hence the LIST):
optional group generalPractitioner (LIST) {
  repeated group array {
    optional binary organizationId (STRING);
    optional binary practitionerId (STRING);
    optional binary practitionerRoleId (STRING);
    ... [rest of the usual fields]
  }
}
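The names of the special fields in this example follow a visible pattern: the target resource type with its first letter lower-cased, followed by "Id". The sketch below reproduces that pattern; the function name is hypothetical, and the rule is inferred from the example above rather than taken from the converter code.

```python
def reference_id_field(resource_type):
    """Derive the special ID field name for a reference target type."""
    return resource_type[0].lower() + resource_type[1:] + "Id"


# Target types allowed for Patient.generalPractitioner.
targets = ["Organization", "Practitioner", "PractitionerRole"]

print([reference_id_field(t) for t in targets])
```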
When mapping FHIR types to Parquet schema, we sometimes need to break recursive structures. For example, a FHIR Reference has an identifier field, which has an assigner field, which is itself a reference. Therefore, there is a recursiveDepth configuration parameter that controls how many times a recursive type should be traversed in the same branch.
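One way to picture the effect of recursiveDepth is a traversal that counts how often each type has already appeared on the current branch and drops a field once the limit is reached. The sketch below uses a heavily simplified two-type model (`TYPES`) and a hypothetical `expand` function; it illustrates the idea only, and the actual pruning logic in the converter may differ in its details.

```python
# Simplified type model: field name -> primitive Avro type or another type name.
TYPES = {
    "Reference": {"reference": "string", "display": "string", "identifier": "Identifier"},
    "Identifier": {"system": "string", "value": "string", "assigner": "Reference"},
}


def expand(type_name, recursive_depth, seen=None):
    """Expand a type into nested fields, visiting each type at most
    recursive_depth times along a single branch."""
    seen = seen or {}
    count = seen.get(type_name, 0)
    if count >= recursive_depth:
        return None  # limit reached on this branch: prune the field
    branch = {**seen, type_name: count + 1}
    fields = {}
    for field_name, field_type in TYPES[type_name].items():
        if field_type in TYPES:
            child = expand(field_type, recursive_depth, branch)
            if child is not None:
                fields[field_name] = child
        else:
            fields[field_name] = field_type
    return fields


shallow = expand("Reference", 1)
deeper = expand("Reference", 2)

print(sorted(shallow["identifier"]))  # no assigner at depth 1
print(sorted(deeper["identifier"]))   # assigner is kept one level deeper
```

With a limit of 1, the identifier of a reference keeps only its primitive fields; with a limit of 2, identifier.assigner is expanded once more before the branch is cut.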
To make it easier to query extension fields, top-level fields are created for them. For example, in the US-Core Patient profile there is an extension for birthsex whose type is code; therefore we get the following field at the topmost level in the Patient Parquet schema:
optional binary birthsex (STRING);
The above example is a "simple" extension. For "complex" extensions, i.e., extensions that have nested extensions (and have no value), the same structure is repeated in the generated schema as well. For example, the US-Core Patient profile has a complex race extension, which has a list of ombCategory values, a list of detailed values, and a text. Therefore, the corresponding Parquet schema is:
optional group race {
  optional group ombCategory (LIST) {
    repeated group array {
      optional binary system (STRING);
      optional binary version (STRING);
      optional binary code (STRING);
      optional binary display (STRING);
      optional boolean userSelected;
    }
  }
  optional group detailed (LIST) {
    repeated group array {
      optional binary system (STRING);
      optional binary version (STRING);
      optional binary code (STRING);
      optional binary display (STRING);
      optional boolean userSelected;
    }
  }
  optional binary text (STRING);
}
As mentioned above, this race would be a top-level field, i.e., Patient.race.
In a profile, a single resource type may have multiple extension files, each with its own StructureDefinition. As long as these extensions are compatible (which is expected in a single profile), all of them are merged into a single schema. For example, if one extension adds a new field X on resource type R and another extension adds Y, the generated Parquet schema of R has both fields X and Y.
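The merge can be thought of as a union of the added fields. The following sketch assumes that "compatible" means that two extensions never define the same field with different types; that definition, and the function `merge_extension_fields`, are our assumptions for illustration and not the converter's actual check.

```python
def merge_extension_fields(*field_sets):
    """Merge the top-level fields added by several extensions into one mapping.

    Each field set maps a field name to its Avro type. A field defined twice
    with different types is treated as incompatible (an assumption).
    """
    merged = {}
    for fields in field_sets:
        for name, avro_type in fields.items():
            if name in merged and merged[name] != avro_type:
                raise ValueError(f"incompatible definitions for field {name!r}")
            merged[name] = avro_type
    return merged


# One extension adds X on resource type R, another adds Y.
schema_fields = merge_extension_fields({"X": "string"}, {"Y": "int"})

print(sorted(schema_fields))
```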