HappyCthulhu/ext_load
What is this project?

We have an external service whose database is not accessible to us. All we have is a UI, through which we've gathered some data. Our task is to load these data into our DB.

  • Our DB already contains a lot of data, including in those tables into which we plan to load the external data. If a record from the external service is already present in our DB, we need to update it; if it’s not present, we need to create it.
  • External data may contain errors and require validation. If external data requires additional processing that cannot be automated, we need to provide analysts with a convenient way of editing this data.
  • The external portal data is versioned. Each month a new version is released, which may contain not only new records but also old records with updated data. We need to retain information about each "layer" (i.e., "version") and provide convenient access to them.
  • All external data has a unique id.

Click for Video explanation

Project Description


Data

How to generate an ORM for the external data

The external-data ORM is built on pydantic. To auto-generate the file with its types, we use datamodel-codegen, which can be run with the just gen_types command; the same just recipe also runs ruff's autofix on the generated file. Be careful: datamodel-codegen consumes a huge amount of RAM.

How to increase performance

Caching queries

A Django QuerySet is lazy and does not hold data itself, so every new filter means another DB query. For our task this is a drawback, since we repeatedly filter, validate, and change data: the number of DB queries would run into the millions, and importing 3 GB of data would take days, if not weeks. To solve this, we have LCache and HCache: classes that store and filter data in memory, collect update/create operations, and apply them with bulk requests.
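A minimal sketch of the idea behind such a cache, assuming a placeholder model and an ext_id key field (the real LCache/HCache classes carry more responsibilities):

```python
class SimpleModelCache:
    """Keeps all rows of one model in memory and batches writes.

    Illustrative only: the core idea is to filter in Python and write in bulk.
    """

    def __init__(self, model, key_field="ext_id"):
        self.model = model
        self.key_field = key_field
        # One SELECT up front instead of millions of per-row lookups later.
        self.by_key = {getattr(obj, key_field): obj for obj in model.objects.all()}
        self.pending_create = []
        self.pending_update = []

    def upsert(self, key, **fields):
        obj = self.by_key.get(key)
        if obj is None:
            obj = self.model(**{self.key_field: key}, **fields)
            self.by_key[key] = obj
            self.pending_create.append(obj)
        else:
            for name, value in fields.items():
                setattr(obj, name, value)
            self.pending_update.append(obj)

    def flush(self, update_fields):
        # Two bulk statements instead of one query per record.
        self.model.objects.bulk_create(self.pending_create, batch_size=1000)
        self.model.objects.bulk_update(self.pending_update, update_fields, batch_size=1000)
        self.pending_create.clear()
        self.pending_update.clear()
```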

The ext_profiles and db_profiles attributes are needed because profile lookups have to be case-insensitive, and calling the string method lower() millions of times is expensive in Python. So, at the start of the import, we build a cache of the profiles that already exist in our DB, storing their initials, the first letters of those initials, and an instance of DBProfileData. While loading the data, we build a similar structure for the external full names: a dictionary whose key is the external full name and whose value is an ExtProfileData instance.

Thanks to this, the next time we come across the same full name, we can retrieve it by simply looking up that key in the dictionary. If we can't find a DB match that way, we search by the first letters of the initials for an exact match; if that also fails, we fall back to matching by Levenshtein distance.
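A rough sketch of this matching cascade. The field layout of DBProfileData is illustrative, the cache here stores matched DB profiles rather than full ExtProfileData instances, and the edit-distance function is assumed to come from the python-Levenshtein package:

```python
from dataclasses import dataclass

from Levenshtein import distance  # assumption: any edit-distance implementation works


@dataclass
class DBProfileData:
    initials: str          # e.g. "ivanov i. i." (already lower-cased once)
    first_letters: str     # e.g. "iii"
    instance: object       # the corresponding profile/user row


def find_profile(full_name, db_profiles, ext_profiles, max_distance=2):
    """Cascade: cache hit -> exact initials match -> first letters -> Levenshtein."""
    key = full_name.lower()

    # 1. Already resolved for this external full name earlier in the import.
    if key in ext_profiles:
        return ext_profiles[key]

    # 2. Exact match against the lower-cased initials cached from the DB.
    for profile in db_profiles:
        if profile.initials == key:
            ext_profiles[key] = profile
            return profile

    # 3. Match by the first letters of the initials.
    letters = "".join(part[0] for part in key.split() if part)
    candidates = [p for p in db_profiles if p.first_letters == letters]
    if len(candidates) == 1:
        ext_profiles[key] = candidates[0]
        return candidates[0]

    # 4. Last resort: closest name within a small edit distance.
    best = min(db_profiles, key=lambda p: distance(p.initials, key), default=None)
    if best is not None and distance(best.initials, key) <= max_distance:
        ext_profiles[key] = best
        return best
    return None
```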

Bulk requests

Aside from the obvious advice to use bulk operations when optimizing large numbers of DB requests, I want to mention a library whose performance surpasses Django's built-in bulk methods: django-bulk-load.
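Roughly how the two compare in code. Book, new_books, and changed_books are placeholders, and the django-bulk-load calls follow its documented helpers; check the library's docs for the exact parameters:

```python
from django_bulk_load import bulk_insert_models, bulk_update_models

# Book, new_books and changed_books are placeholders for a real Django model and lists of instances.

# Django's built-ins already beat per-row save() calls by orders of magnitude:
Book.objects.bulk_create(new_books, batch_size=1000)
Book.objects.bulk_update(changed_books, ["title", "price"], batch_size=1000)

# django-bulk-load offers equivalent helpers that are faster on large batches:
bulk_insert_models(models=new_books)
bulk_update_models(models=changed_books)
```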

Working with JSON

  • Splitting into layers and saving to disk: with 3 GB of JSON loaded into a Python process, even generating a traceback for an Exception can take hours. So we split the JSON file by layers and save each part into a temporary directory, checking at the same step that there is enough free disk space to store the data (see the sketch below).
  • For serializing/deserializing JSON we use orjson, whose performance surpasses the built-in json library.
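A sketch of this splitting step, assuming the top-level JSON is a list of layers and using a hypothetical 5 GB free-space threshold:

```python
import shutil
import tempfile
from pathlib import Path

import orjson  # faster drop-in (de)serializer; the stdlib json module would also work


def split_reports_to_disk(json_path: str, min_free_bytes: int = 5 * 2**30) -> list[Path]:
    """Split one huge JSON file into per-layer files in a temporary directory.

    Illustrative sketch: assumes the top-level JSON is a list of layers
    ("reports"); the real splitting logic in the project may differ.
    """
    tmp_dir = Path(tempfile.mkdtemp(prefix="ext_load_"))

    # Refuse to start if there is not enough free disk space for the copies.
    if shutil.disk_usage(tmp_dir).free < min_free_bytes:
        raise RuntimeError("Not enough free disk space to split the report file")

    layers = orjson.loads(Path(json_path).read_bytes())
    paths = []
    for i, layer in enumerate(layers):
        part = tmp_dir / f"layer_{i}.json"
        part.write_bytes(orjson.dumps(layer))
        paths.append(part)
    return paths
```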

Performance investigation

The project uses snakeviz, which gives a good visual representation of profiler output.
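For example, profiling the import with cProfile (run_import here is a stand-in for the real entry point) and then opening the result in snakeviz:

```python
import cProfile
import pstats


def run_import():
    # Placeholder for the project's actual import entry point.
    sum(i * i for i in range(100_000))


with cProfile.Profile() as profiler:
    run_import()

profiler.dump_stats("import.prof")
pstats.Stats("import.prof").sort_stats("cumulative").print_stats(20)

# Explore the same file visually in the browser:
#   snakeviz import.prof
```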

How to handle Enums

  • "enum-like" models are key/value pairs stored in a DB table. The import often includes enum data (e.g. measurement units), but since we cannot know all enum values of the external data up front (new ones may appear with each version), we store them in a DB table instead of hardcoding them in code, so that analysts can add values when needed. All enum-like models inherit from loader.models.enum_like.EnumLikeModel. Examples of such tables: loader_measureunit, loader_repairtype, loader_signcode.

  • "mapping" models store a set of values that map to a particular record of an enum-like model. For every EnumLikeModel, a dedicated model inheriting from MappingModel must exist (see the sketch below).
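A rough sketch of what these base classes and one concrete pair might look like; the real definitions live in loader/models/enum_like.py and will differ in details:

```python
from django.db import models


class EnumLikeModel(models.Model):
    """Illustrative shape of the enum-like base class."""

    value = models.CharField(max_length=255, unique=True)

    class Meta:
        abstract = True


class MappingModel(models.Model):
    """Maps raw external spellings to one enum-like record."""

    external_value = models.CharField(max_length=255, unique=True)

    class Meta:
        abstract = True


class MeasureUnit(EnumLikeModel):
    """Backs the loader_measureunit table; analysts can add rows as needed."""


class MeasureUnitMapping(MappingModel):
    measure_unit = models.ForeignKey(MeasureUnit, on_delete=models.CASCADE)
```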

How to implement version storage

  • Though django-simple-history is popular, we should not use it: its Django-level change tracking cannot capture the results of bulk requests and it hurts performance. django-pghistory is faster and, because it relies on Postgres triggers, can track bulk requests as well.

  • Every record in the Event model represents a loaded layer of data. When we import data, we essentially update a given record; if we need historical information about the changes (layer/version details), we take it from the history table.

  • Since we need to know which history-table record belongs to which version, in the main Loader model we add a one-to-many link to Report, which holds the layer info. All the tables involved in data loading have a report field.

  • Since analysts/users of our portal must be able to edit the loaded data via the DB/API, a problem arises: how to distinguish data loaded automatically from data edited manually. The solution can be seen in the Loader model: we add an auto_imported field to every model we load data into automatically, plus a Postgres trigger that sets auto_imported based on the custom setting custom.auto_import_flag. During a data load we set custom.auto_import_flag=true; after the load completes, we revert it to false. Thus, if changes are made via the API/DB the field becomes false, while records written by the loader get true (see the sketch below).
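A sketch of this mechanism using django-pgtrigger. The README only says "pg-trigger", so the exact trigger definition is an assumption; LoadedEntity is a placeholder model and the setting name follows this section:

```python
import pgtrigger
from django.db import connection, models


class LoadedEntity(models.Model):
    """Placeholder model; the real loader models carry many more fields."""

    auto_imported = models.BooleanField(default=False)

    class Meta:
        triggers = [
            # Before every insert/update, copy the session-level flag into the row.
            pgtrigger.Trigger(
                name="set_auto_imported",
                when=pgtrigger.Before,
                operation=pgtrigger.Insert | pgtrigger.Update,
                func=(
                    "NEW.auto_imported := coalesce("
                    "nullif(current_setting('custom.auto_import_flag', true), ''), "
                    "'false')::boolean; RETURN NEW;"
                ),
            ),
        ]


def set_auto_import_flag(enabled: bool) -> None:
    # The loader turns the flag on before the import and off afterwards.
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT set_config('custom.auto_import_flag', %s, false)",
            ["true" if enabled else "false"],
        )
```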

How to map data

These are generic classes that describe how imported data is mapped onto our DB. The schemas are an intermediate layer between the external data and our DB, allowing us to unify the external data.

Generic import schemas are described in schemas/generics.py. The main class, GenericImportSchema, inherits from the abstract SchemaValidatorMeta, which in turn inherits from generic validators. This architecture lets the system check the schemas when the Django app starts: if a schema contains an error, the app will not start. In GenericImportSchema, model is the Django model matching the imported entity, and import_entity is the entity being imported.

GenericLoaderSchema

Takes a pydantic model for external data, the corresponding Django model from our app, and a set of fields to be loaded.

FieldMapping

A mapping for a single field. It takes the field name (a string) from the pydantic model and a field from the Django model.

FieldsMapping

  • comparison: fields the backend uses to search for a matching record in the DB. Usually this is the ext_id field; assigning ext_id to most existing DB records was the analysts' job, because other fields aren't suitable for matching.
  • rest: ordinary fields whose values must be transferred to the DB. If a field requires additional processing, the third argument of FieldMapping is a function that takes either the field's value, or the field's value plus a Context if needed.
  • parent_field: some entities have a field holding the id of the parent entity. Because not all entities have such a field, specifying ext_field is optional, but specifying db is important. During the import the backend can find the parent entity in the DB via nesting, and it takes the parent-child linking data from schema.fields.parent.
  • children: takes "schemas" for ordinary child entities.
  • nested: takes a tuple of NestedField, each of which must be given a field name (string) from the pydantic model plus the schema (see the sketch below).
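A purely illustrative example of how such a schema declaration might look. The argument names and order are assumptions, not the real signatures from schemas/generics.py, and the entity and helper names (Equipment, ExtEquipment, parse_date, repair_schema) are made up:

```python
# Hypothetical sketch only; the real constructors may differ.
equipment_schema = GenericImportSchema(
    model=Equipment,                # Django model for the imported entity
    import_entity=ExtEquipment,     # pydantic model generated from the external data
    fields=FieldsMapping(
        comparison=[FieldMapping("id", "ext_id")],
        rest=[
            FieldMapping("name", "name"),
            # Third argument: extra processing for the raw external value.
            FieldMapping("commissioning_date", "commissioned_at", parse_date),
        ],
        parent_field=FieldMapping("workshop_id", "workshop"),
        nested=(NestedField("repairs", repair_schema),),
    ),
)
```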

How to validate data

/loader/validate/{ivnb_file_id}/

The goal is to find data that requires manual handling by analysts. For instance, if the imported data references users not present on our backend.

  • Validation can be run via Django or Celery.
  • The validation endpoint requires the id of the JSON file with the external data that was uploaded to the backend.

Validation processes each layer recursively, checking each entity per the scenario described in ValidatePassportReports. Each entity can have a different set of checks.

Types of errors

Error descriptions (also listed in the swagger description of /validate/)

  • not_found: if the DB record corresponding to the import entity is not found
  • multiple_records: if multiple DB records match the import entity when only one is expected
  • cant_parse: if a value cannot be parsed correctly
  • required: if the value is required by the schema but doesn’t exist in the data
  • mapping_not_found: if the DB record of an enum-like model wasn’t found either by mapping or by the original external value
  • duplicated_ext_ids: if ext_ids are duplicated within one report

After collecting all errors, the backend removes duplicates by grouping them by (maintenance, version), and then returns them via the API.
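The error codes above and the dedupe-by-group idea, sketched in Python; the grouping keys follow the sentence above, and including the error code in the key is an assumption:

```python
from enum import Enum


class ErrorCode(str, Enum):
    NOT_FOUND = "not_found"
    MULTIPLE_RECORDS = "multiple_records"
    CANT_PARSE = "cant_parse"
    REQUIRED = "required"
    MAPPING_NOT_FOUND = "mapping_not_found"
    DUPLICATED_EXT_IDS = "duplicated_ext_ids"


def deduplicate(errors: list[dict]) -> list[dict]:
    """Keep one error per (maintenance, version, code) group."""
    seen: dict[tuple, dict] = {}
    for error in errors:
        key = (error["maintenance"], error["version"], error["code"])
        seen.setdefault(key, error)
    return list(seen.values())
```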

How to fix validation errors

not_found

/loader/gen_json_for_users_creation/

For some error types there are tools that speed up or simplify the analysts' work. gen_json_for_users_creation helps address a not_found error about a profile that doesn't exist in the auth_user table. The endpoint takes the output of /loader/validate/ and extracts the errors about profiles that are present in the JSON file but missing from our DB. The output of this endpoint must be reviewed; if everything looks fine, it can be passed to /loader/users/, which will create users from the provided data.

mapping_not_found

You need to create an SQL script that inserts records into the mapping tables.

How to store the original data on the backend

/loader/file/

This endpoint lets you upload and download files with external-service data. Info about uploaded files is stored in the loader_file table. When a file is uploaded, a hash is computed and the system checks whether a file with the same hash already exists in the DB; if so, an error is returned.

Because the original task involved large amounts of data (over 3 GB of JSON), the server also supports 7z archives, which compress those 3 GB down to a few dozen MB, so we don't have to send that much data to the server directly.
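A sketch of both steps. SHA-256 and py7zr are assumptions; the README only says "a hash function" and "7z archive":

```python
import hashlib
from pathlib import Path

import py7zr  # assumption: any 7z extractor would do


def file_sha256(path: Path) -> str:
    """Hash the uploaded file so duplicates can be rejected before saving."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_if_archive(path: Path, target_dir: Path) -> Path:
    """If the upload is a 7z archive, unpack it and return the inner JSON file."""
    if path.suffix != ".7z":
        return path
    with py7zr.SevenZipFile(path, mode="r") as archive:
        archive.extractall(path=target_dir)
    # Assumes the archive contains a single JSON file.
    return next(target_dir.glob("*.json"))
```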

How data is loaded into the DB

/loader/load/{file_id}/

  • This endpoint is atomic. The DB state changes only if the code completes with no errors.
  • The endpoint requires the id of the JSON file with the external data uploaded to the backend.
  • The import can be done via Django or Celery.

First, data validation is performed. If no error is found, the import begins.

  • We find all models involved in the import by parsing the schemas. Some models, like Profile, aren't mentioned in the schemas, so they're hardcoded in find_all_models().

  • We run SELECT set_config('custom.auto_loader_flag', 'true', false).

  • We build caches for models.

  • We split the data by layers and store them on disk. Because with 3 GB of JSON loaded into Python even generating an Exception traceback can take hours, we divide the JSON file by layers, saving each part in a temporary directory and verifying along the way that there's enough free disk space.

      • If the file is 7z, we extract it to get JSON.
  • We iterate over the layers using ReportsIter.

  • process_report is a recursive function that takes an HContext with the current layer's data, user info, version info, a root entity, and the schema class of the entity being processed. It either creates a record in the DB if it doesn't already exist, or updates it if it does.

We initialize LEntityProcessor—a class with methods needed to process import entities.

If the entity belongs to an EnumLikeModel, we handle it separately, trying to find data in the EnumLikeModel-inherited table. If not found, we look in the MappingModel-inherited table.

If the entity has nested children, we recursively process each child. If the entity has normal (rest) fields, we add them to lookups, plus we add the comparison fields. If a field needs extra processing, we call the function specified in the schema, passing it the field value.

To fill relationships, we use parent field info or just fill parent based on which entity is the parent.

We pass the lookups, version info (report), and DB lookup fields to create_or_update_or_skip(). This function updates the existing record if one is found, or schedules it for creation if not. If the record has m2m links, we update the LCache m2m attribute.
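Putting the steps above together, the overall load flow looks roughly like this. All helper names are placeholders standing in for the real functions mentioned in this section; only the structure follows the description:

```python
from django.db import transaction


def load_file(file_id: int, user) -> None:
    """Condensed sketch of the load flow described above; not the real code."""
    json_path = unpack_upload(file_id)      # placeholder: fetch the file, extract 7z if needed

    validate(json_path)                     # abort before touching the DB on errors

    models = find_all_models()              # schemas plus hardcoded extras such as Profile
    caches = build_caches(models)           # placeholder: LCache/HCache instances per model
    layer_paths = split_layers(json_path)   # placeholder: per-layer files in a temp dir

    with transaction.atomic():              # the endpoint is atomic
        set_auto_import_flag(True)          # session flag read by the pg-trigger
        try:
            for layer_path in layer_paths:  # ReportsIter in the real project
                report = load_layer(layer_path)
                for schema in root_schemas():
                    process_report(report, schema, caches, user)
            flush_caches(caches)            # apply collected create/update ops in bulk
        finally:
            set_auto_import_flag(False)
```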

How to ensure an Event model for each of our models

  • ./server/loader/checks.py adds a custom check for the presence of Event models to Django's system checks, which run when we call ./manage.py.
  • ./server/loader/management/commands/add_pg_history_models.py adds a manage.py command that automatically adds the @track_history decorator to models, auto-generates the migration files, and runs them.

A simple way to run this command is via the just g alias in the justfile.
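A sketch of what such a custom system check can look like. The real check lives in server/loader/checks.py and may detect event models differently; the "Event" suffix follows django-pghistory's default naming and the app label "loader" follows this README:

```python
from django.apps import apps
from django.core.checks import Error, Tags, register


@register(Tags.models)
def check_event_models(app_configs, **kwargs):
    """Fail ./manage.py if a loader model has no corresponding *Event model."""
    errors = []
    all_models = list(apps.get_app_config("loader").get_models())
    event_names = {m.__name__ for m in all_models if m.__name__.endswith("Event")}

    for model in all_models:
        name = model.__name__
        if name.endswith("Event"):
            continue
        if f"{name}Event" not in event_names:
            errors.append(
                Error(
                    f"{name} has no history (Event) model",
                    hint="Run the add_pg_history_models command (just g).",
                    obj=model,
                    id="loader.E001",
                )
            )
    return errors
```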
