We have an external service whose database is not accessible to us. All we have is a UI, through which we've gathered some data. Our task is to load these data into our DB.
- Our DB already contains a lot of data, including in those tables into which we plan to load the external data. If a record from the external service is already present in our DB, we need to update it; if it’s not present, we need to create it.
- External data may contain errors and require validation. If external data requires additional processing that cannot be automated, we need to provide analysts with a convenient way of editing this data.
- The external portal data is versioned. Each month, a new version is released, which may contain not only new records but also old records with updated data. We need to retain information about each "layer" (i.e., "version") and provide convenient access to them.
- All external data has a unique id.
- Data
- Technical specification
- How to generate an ORM for the external data
- How to increase performance
- How to handle Enums
- How to implement version storage
- How to map data
- How to validate data
- How to store the original data on the backend
- How data is loaded into the DB
- How to ensure an Event model for each of our models
The external ORM will be built on `pydantic`. To auto-generate the file with its types, we use `datamodel-codegen`, which can be run with the `just gen_types` command. That same just-script will also run a `ruff` fix on the generated file. Be careful: `datamodel-codegen` consumes a huge amount of RAM.
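The generation step itself is just a CLI invocation; a minimal sketch of the equivalent steps in Python is shown below (the input and output paths are assumptions, the real ones live in the justfile):

```python
# Sketch: regenerate the external pydantic models and clean them up with ruff.
# The paths here are assumptions; the actual recipe lives in the justfile.
import subprocess

subprocess.run(
    [
        "datamodel-codegen",
        "--input", "external_data_sample.json",    # sample of the external JSON
        "--input-file-type", "json",
        "--output", "server/loader/ext_types.py",  # generated pydantic models
    ],
    check=True,
)

# Auto-fix lint issues in the generated file.
subprocess.run(["ruff", "check", "--fix", "server/loader/ext_types.py"], check=True)
```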
A `QuerySet` is lazy, meaning it does not store data in itself. For our task this is not beneficial, since we repeatedly filter, validate, and change data: the number of DB queries would run into the millions, and importing 3 GB of data would take days, if not weeks. To solve this, we have `LCache` and `HCache`: classes that store and filter data, collect `update`/`create` operations, and apply them with bulk requests.
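The real `LCache`/`HCache` are project specific; the sketch below only illustrates the general idea of keeping rows in memory, accumulating pending operations, and flushing them in bulk (the class and its fields are hypothetical):

```python
from collections import defaultdict


class BulkCache:
    """Illustration of the LCache idea: in-memory rows plus deferred bulk writes."""

    def __init__(self, model, key_field="ext_id"):
        self.model = model
        self.key_field = key_field
        # Load everything once; afterwards all lookups are in-memory.
        self.rows = {getattr(obj, key_field): obj for obj in model.objects.all()}
        self.to_create = []
        self.to_update = defaultdict(set)  # instance -> changed field names

    def get(self, key):
        return self.rows.get(key)

    def schedule_create(self, obj):
        self.rows[getattr(obj, self.key_field)] = obj
        self.to_create.append(obj)

    def schedule_update(self, obj, fields):
        self.to_update[obj].update(fields)

    def flush(self):
        # One INSERT and one UPDATE per model instead of millions of queries.
        self.model.objects.bulk_create(self.to_create, batch_size=1000)
        if self.to_update:
            fields = sorted({f for fs in self.to_update.values() for f in fs})
            self.model.objects.bulk_update(list(self.to_update), fields, batch_size=1000)
        self.to_create.clear()
        self.to_update.clear()
```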
The `ext_profiles` and `db_profiles` attributes are necessary because when looking up profiles we need to do a case-insensitive search, and the string method `lower()` is extremely expensive in Python. Hence, at the start of the import, we create a cache of the profiles that exist in our DB by storing their initials, the first letter of each initial, and an instance in `DBProfileData`. When loading data, we create a similar structure for the external data's full names: a dictionary whose key is the external full name and whose value is an instance of `ExtProfileData`.
Because of this logic, the next time we come across a similar full name, we can retrieve it by passing the key into that dictionary. If we can’t find a match in the DB in any way, we try searching by the first letters of the initials for a full match. If that doesn’t work, we try matching via Levenshtein distance.
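A simplified sketch of that matching order (the function and dictionary names are hypothetical; the real structures are `DBProfileData`/`ExtProfileData`), assuming the `python-Levenshtein` package for the distance fallback:

```python
import Levenshtein  # assumption: python-Levenshtein is available


def find_profile(full_name: str, ext_cache: dict, db_by_name: dict, db_by_initials: dict):
    """Resolve an external full name to a DB profile using the cached structures."""
    key = full_name.lower()  # lower() is applied once per distinct name, not per row

    # 1. Already resolved earlier in this import run.
    if key in ext_cache:
        return ext_cache[key]

    # 2. Exact (case-insensitive) match against DB profiles.
    profile = db_by_name.get(key)

    # 3. Match by the first letters of the initials.
    if profile is None:
        initials_key = "".join(part[0] for part in key.split() if part)
        profile = db_by_initials.get(initials_key)

    # 4. Fall back to the closest name by Levenshtein distance
    #    (the threshold of 2 is an arbitrary example).
    if profile is None:
        candidate = min(db_by_name, key=lambda name: Levenshtein.distance(name, key), default=None)
        if candidate is not None and Levenshtein.distance(candidate, key) <= 2:
            profile = db_by_name[candidate]

    ext_cache[key] = profile
    return profile
```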
Aside from the obvious point that, when handling large numbers of DB operations, you should use bulk requests, I want to mention a library whose performance surpasses Django's built-in bulk methods: `django-bulk-load`.
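Its usage is close to Django's own bulk API; a hedged sketch using the library's `bulk_insert_models`/`bulk_update_models` helpers (the model and its fields are hypothetical):

```python
from django_bulk_load import bulk_insert_models, bulk_update_models

from loader.models import MeasureUnit  # hypothetical import path

# Insert new rows in one query.
new_units = [MeasureUnit(name="kg"), MeasureUnit(name="pcs")]
bulk_insert_models(models=new_units)

# Update already-persisted rows in one query.
changed = list(MeasureUnit.objects.filter(name__in=["kg", "pcs"]))
for unit in changed:
    unit.name = unit.name.upper()
bulk_update_models(models=changed)
```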
- Splitting into layers and saving to disk: loading 3 GB of JSON data into the Python process means that any `Exception` that generates a traceback will take hours. Hence, we split the JSON file by layers and save each part into a temporary directory. At the same step, we check that there is enough free disk space to store this data (a sketch follows below).
- When serializing/deserializing JSON data, we use `orjson`, whose performance surpasses the built-in `json` library.
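A rough sketch of the splitting step (the file layout and helper name are assumptions):

```python
import shutil
import tempfile
from pathlib import Path

import orjson  # noticeably faster than the stdlib json module for (de)serialization


def split_into_layers(json_path: Path) -> list[Path]:
    """Split one huge JSON file into per-layer files stored in a temp directory."""
    raw = Path(json_path).read_bytes()

    # Make sure there is roughly enough free disk space for the split copies.
    free_bytes = shutil.disk_usage(tempfile.gettempdir()).free
    if free_bytes < len(raw) * 2:
        raise RuntimeError("Not enough free disk space to split the import file")

    layers = orjson.loads(raw)  # assumption: the top level is a list of layers
    tmp_dir = Path(tempfile.mkdtemp(prefix="loader_layers_"))

    paths = []
    for i, layer in enumerate(layers):
        path = tmp_dir / f"layer_{i}.json"
        path.write_bytes(orjson.dumps(layer))
        paths.append(path)
    return paths
```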
The project uses `snakeviz`, which gives a good visual representation of `Profiler` data.
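`snakeviz` visualizes standard `cProfile` output, so you can dump a profile from any slow code path and open it in the browser (the profiled function below is hypothetical, and the snippet assumes it is run as a standalone script):

```python
import cProfile
import pstats

from loader.services import run_import  # hypothetical slow code path

# Collect profiler data into a file...
cProfile.run("run_import()", "import.prof")

# ...optionally inspect it in the console...
pstats.Stats("import.prof").sort_stats("cumulative").print_stats(20)

# ...and visualize it with: snakeviz import.prof
```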
- "Enum-like" models are key/value pairs in a DB table. Our import often includes enum data (e.g. measurement units), but since we cannot know all enum values from the external data right away (they might be added with each new version), instead of hardcoding them in code we store them in a DB table so that analysts can supplement the values if needed. All enum-like models inherit from `loader.models.enum_like.EnumLikeModel`. Examples of such tables: `loader_measureunit`, `loader_repairtype`, `loader_signcode`.
- "Mapping" models store a set of values that map to a particular record of an `EnumLike` model. For every `EnumLikeModel`, a dedicated model inheriting `MappingModel` must exist (see the sketch after this list).
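The real base classes live in `loader/models/`; the sketch below only illustrates the idea, and the field names are assumptions:

```python
from django.db import models


class EnumLikeModel(models.Model):
    """Base for enum-like reference tables that analysts can extend via the DB."""
    name = models.CharField(max_length=255, unique=True)  # assumed field name

    class Meta:
        abstract = True


class MappingModel(models.Model):
    """Base for tables that map raw external values onto an enum-like record."""
    external_value = models.CharField(max_length=255, unique=True)  # assumed field name

    class Meta:
        abstract = True


class MeasureUnit(EnumLikeModel):        # backs the loader_measureunit table
    pass


class MeasureUnitMapping(MappingModel):  # maps external spellings to a MeasureUnit
    measure_unit = models.ForeignKey(MeasureUnit, on_delete=models.CASCADE)
```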
- Though `django-simple-history` is popular, we should not use it: its architecture (Django-level signals rather than database triggers) cannot track the results of bulk requests and negatively impacts performance. `django-pghistory` is faster and, thanks to Postgres triggers, can track bulk requests.
- Every record in the `Event` model represents a loaded layer of data. When we import data, we essentially update a given record. If we need the historical information about changes (layer/version details), we get that from the historical table.
- Since we need to know which historical-table record belongs to which version, in the main `Loader` model we add a one-to-many link to `Report`, which holds the layer info. All the tables involved in data loading have a `report` field.
- Since analysts/users of our portal must be able to edit the loaded data via the DB/API, a problem arises: how do we distinguish data loaded automatically from data loaded manually? The solution can be seen in the `Loader` model: we add an `auto_imported` field to all models that we load data into automatically, plus a pg-trigger that sets `auto_imported` based on the custom setting `custom.auto_import_flag`. During data load, we set `custom.auto_import_flag=true`; after loading completes, we revert it to `false`. Thus, if changes are made via the API/DB, the field stays `false`, but if the data is loaded automatically, it becomes `true` (a sketch of this setup follows below).
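A hedged sketch of how these pieces could fit together with `django-pghistory` and `django-pgtrigger`; the model fields, decorator arguments, and trigger body are illustrative, not the project's actual code:

```python
import pghistory
import pgtrigger
from django.db import models


@pghistory.track()  # assumption: default event tracking via postgres triggers
class Loader(models.Model):
    ext_id = models.CharField(max_length=64, unique=True)
    report = models.ForeignKey("loader.Report", on_delete=models.PROTECT)  # layer/version
    auto_imported = models.BooleanField(default=False)

    class Meta:
        triggers = [
            # Mark rows touched while the import session flag is set.
            pgtrigger.Trigger(
                name="set_auto_imported_flag",
                when=pgtrigger.Before,
                operation=pgtrigger.Insert | pgtrigger.Update,
                func=(
                    "NEW.auto_imported := COALESCE("
                    "NULLIF(current_setting('custom.auto_import_flag', true), ''),"
                    " 'false')::boolean; "
                    "RETURN NEW;"
                ),
            ),
        ]
```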
Schemas are generic classes that describe how imported data maps to our DB. They are an intermediate layer between the external data and our DB, allowing us to unify the external data.

Generic import schemas are described in `schemas/generics.py`. The main class, `GenericImportSchema`, inherits from the abstract `SchemaValidatorMeta`, which in turn inherits from the generic validators. This architecture allows the system to check the schemas when the Django app starts; if there is an error in a schema, the app cannot start. In `GenericImportSchema`, `model` is the model matching the imported entity, and `import_entity` is the entity being imported.
The schema takes a pydantic model for the external data, the corresponding Django model from our app, and a set of fields to be loaded.

`FieldMapping` is a mapping of a single field. It accepts the field name (string) from the pydantic model and a field from the Django model.

- `comparison`: fields used by the backend to search for a matching record in the DB. Usually this is the `ext_id` field. Assigning `ext_id` to most existing DB records was the analysts' job, because other fields aren't suitable for matching.
- `rest`: normal fields whose values must be transferred to the DB. If a field requires additional processing, the third argument of `FieldMapping` is a function that takes either the field's value, or the field's value plus `Context` if needed.
- `parent_field`: some entities have a field with the `id` of the parent entity. Because not all entities have such a field, specifying `ext_field` is optional, but specifying `db` is important. During data import, the backend can find the parent entity in the DB via nesting, and the system pulls the parent-child linking data from `schema.fields.parent`.
- `children`: takes schemas for ordinary child entities.
- `nested`: takes a tuple of `NestedField`, which must be passed a string from the pydantic model plus the schema (a hypothetical example follows below).
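The exact constructor signatures live in `schemas/generics.py`; the following is only a hypothetical illustration of how the field groups described above could fit together (import paths, the `SchemaFields` grouping, and the entity names are assumptions):

```python
# Hypothetical example only: the class names mirror the text above, but the
# real signatures are defined in schemas/generics.py.
from loader import ext_types, models  # generated pydantic models + Django models
from loader.schemas.generics import FieldMapping, GenericImportSchema, NestedField, SchemaFields


def normalize_date(value, context=None):
    """Extra processing hook referenced by a FieldMapping (illustrative)."""
    return value.strip() if isinstance(value, str) else value


work_schema = GenericImportSchema(
    model=models.Work,
    import_entity=ext_types.Work,
    fields=SchemaFields(
        comparison=(FieldMapping("id", "ext_id"),),
        rest=(FieldMapping("name", "name"),),
    ),
)

repair_schema = GenericImportSchema(
    model=models.Repair,
    import_entity=ext_types.Repair,
    fields=SchemaFields(
        comparison=(FieldMapping("id", "ext_id"),),
        rest=(
            FieldMapping("name", "name"),
            FieldMapping("date", "date", normalize_date),  # third arg: extra processing
        ),
        parent_field=FieldMapping(db="passport"),  # ext_field omitted, db is required
        nested=(NestedField("works", work_schema),),
    ),
)
```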
/loader/validate/{ivnb_file_id}/
The goal is to find data that requires manual handling by analysts. For instance, if the imported data references users not present on our backend.
- Validation can be run via `django` or `celery`.
- The validation endpoint requires the `id` of the JSON file uploaded to the backend that holds the data for the DB.
Validation processes each layer recursively, checking each entity per the scenario described in `ValidatePassportReports`. Each entity can have a different set of checks.
Error descriptions (also listed in the Swagger description of `/validate/`):
- `not_found`: if the DB record corresponding to the import entity is not found
- `multiple_records`: if multiple DB records match the import entity when only one is expected
- `cant_parse`: if a value cannot be parsed correctly
- `required`: if the value is required by the schema but doesn't exist in the data
- `mapping_not_found`: if the DB record of an enum-like model wasn't found either by mapping or by the original external value
- `duplicated_ext_ids`: if ext_ids are duplicated within one report
After collecting all errors, the backend removes duplicates by grouping them by (maintenance, version). It then returns them via the API (a sketch of this grouping follows below).
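The deduplication step boils down to grouping by a composite key; a small sketch, where the error dict structure is illustrative:

```python
from collections import defaultdict


def deduplicate_errors(errors: list[dict]) -> list[dict]:
    """Collapse repeated validation errors, keeping one entry per
    (maintenance, version, error code, entity) combination."""
    grouped = defaultdict(list)
    for err in errors:
        key = (err["maintenance"], err["version"], err["code"], err.get("entity"))
        grouped[key].append(err)

    deduped = []
    for (maintenance, version, code, entity), group in grouped.items():
        deduped.append({
            "maintenance": maintenance,
            "version": version,
            "code": code,          # e.g. "not_found", "mapping_not_found", ...
            "entity": entity,
            "occurrences": len(group),
        })
    return deduped
```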
/loader/gen_json_for_users_creation/
For some error types, solutions exist to speed up or simplify the analysts' work. `gen_json_for_users_creation` helps address a `not_found` error for a profile that doesn't exist in the `auth_user` table. The endpoint takes the output from `/importer/validate/` and extracts the errors about profiles from the JSON file that are missing in our DB. The output of this endpoint must be reviewed. If everything is fine, it can be passed to `/loader/users/`, which will create users from the provided data.
You need to create an SQL script that inserts records into the mapping tables.
/loader/file/
This endpoint lets you upload and download files with external-service data. The info about uploaded files is stored in the `loader_file` table. When a file is uploaded, a hash function is applied, after which the system checks whether a file with the same `hash` is already present in the DB. If so, an error is returned.

Because the original task handled large amounts of data (over 3 GB of JSON), the server supports working with a 7z archive that compresses 3 GB to a few dozen MB, so we don't have to send so much data to the server directly (see the sketch below).
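A hedged sketch of the upload-time checks; the model and field names are assumptions, and 7z handling is shown via the `py7zr` package:

```python
import hashlib
from pathlib import Path

import py7zr  # assumption: used for unpacking 7z archives


def file_sha256(path: Path) -> str:
    """Hash the uploaded file in chunks so 3 GB files don't have to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_upload(path: Path):
    from loader.models import File  # hypothetical model behind the loader_file table

    checksum = file_sha256(path)
    if File.objects.filter(hash=checksum).exists():
        raise ValueError("A file with the same hash already exists")

    if path.suffix == ".7z":
        with py7zr.SevenZipFile(path, mode="r") as archive:
            archive.extractall(path=path.parent)  # unpack the JSON next to the archive

    return File.objects.create(path=str(path), hash=checksum)
```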
/loader/load/{file_id}/
- This endpoint is atomic: the DB state changes only if the code completes with no errors.
- Like the validation endpoint, it requires the `id` of the JSON file with the external data uploaded to the backend.
- The import can be run via `django` or `celery`.

First, data validation is performed. If no errors are found, the import begins.
- We find all models involved in the import. We parse all models from the schemas. Some models, like `Profile`, aren't mentioned in the schemas, so they're hardcoded in `find_all_models()`.
- We run `SELECT set_config('custom.auto_loader_flag', 'true', false)`.
- We build caches for the models.
- We split the data by layers and store them on disk. Because loading 3 GB of JSON into Python produces an `Exception` whose traceback can take hours, we divide the JSON file by layers, saving each part in a temporary directory and verifying along the way that there's enough free disk space.
  - If the file is 7z, we extract it to get the JSON.
- We iterate over the layers using `ReportsIter`.
- `process_report` is a recursive function that takes an `HContext` with the current layer's data, user info, version info, a root entity, and the schema class of the entity being processed. It either creates a record in the DB if it doesn't already exist, or updates it if it does.
We initialize `LEntityProcessor`, a class with the methods needed to process import entities.

If the entity belongs to an `EnumLikeModel`, we handle it separately, trying to find the data in the `EnumLikeModel`-inherited table. If not found, we look in the `MappingModel`-inherited table.

If the entity has nested children, we recursively process each child. If the entity has normal (`rest`) fields, we add them to the lookups, plus we add the comparison fields. If a field needs extra processing, we call the function specified in the schema, passing it the field value.

To fill relationships, we use the `parent` field info, or just fill `parent` based on which entity is the parent.

We pass the lookups, the version info (`report`), and the DB lookup fields to `create_or_update_or_skip()`. This function updates the existing record if found, or schedules it for creation if not. If the record has `m2m` links, we update the `m2m` attribute of `LCache`.
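Putting the above steps together, a very rough skeleton of the recursive processing; every helper not named in the text above is hypothetical:

```python
def process_report(context, schema, parent=None):
    """Illustrative skeleton of the per-layer processing described above."""
    for raw_entity in context.entities_for(schema):           # hypothetical helper
        processor = LEntityProcessor(context, schema, raw_entity)

        if processor.is_enum_like():
            processor.resolve_enum_or_mapping()                # EnumLikeModel / MappingModel lookup
            continue

        lookups = processor.build_lookups()                    # rest + comparison fields
        if parent is not None:
            lookups[schema.fields.parent.db] = parent          # fill the parent relation

        record = processor.create_or_update_or_skip(
            lookups=lookups,
            report=context.report,                             # version info
            comparison_fields=schema.fields.comparison,
        )

        for nested in schema.fields.nested:                    # recurse into children
            process_report(context, nested.schema, parent=record)
```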
`./server/loader/checks.py` adds a custom check for the presence of Event models to the Django checks that run when we call `./manage.py`. `./server/loader/management/commands/add_pg_history_models.py` adds a `manage.py` command that automatically adds the `@track_history` decorator to models, auto-generates migration files, and runs them.

A simple way to run this command is via the `justfile` under the alias `just g`.
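For reference, a custom Django system check of this kind might look roughly like the sketch below; the naming convention used to detect event models is an assumption, the real check lives in `./server/loader/checks.py`:

```python
from django.apps import apps
from django.core.checks import Error, register


@register()
def check_event_models(app_configs, **kwargs):
    """Fail Django checks if a loader model has no pghistory event model."""
    errors = []
    loader_models = apps.get_app_config("loader").get_models()
    all_model_names = {m.__name__ for m in apps.get_models()}

    for model in loader_models:
        if model.__name__.endswith("Event"):
            continue  # skip the event models themselves
        # Assumption: event models follow the "<Model>Event" naming convention.
        if f"{model.__name__}Event" not in all_model_names:
            errors.append(
                Error(
                    f"Model {model.__name__} has no pghistory event model",
                    hint="Run the add_pg_history_models management command (just g).",
                    obj=model,
                    id="loader.E001",
                )
            )
    return errors
```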