Export Strategy

WiredSound edited this page Feb 3, 2022 · 2 revisions

Once Synth has generated some data, the manner in which that data is output is determined by the 'export strategy'. Synth has several different export strategies with each facilitating the writing of data to a different database system, or to the filesystem in a text-based format.

Selecting a Strategy

Internally, an export strategy is represented using the trait ExportStrategy, which is implemented by structures including PostgresExportStrategy, JsonFileExportStrategy, MongoExportStrategy, and so on. Which of these is used is determined by the --to command-line parameter (default is to output to STDOUT in JSON format).

As described in the user documentation here, the --to parameter is a URI. It is therefore the URI scheme that determines the strategy, with the full URI then passed to the strategy for it to use (e.g. the postgres:// in --to postgres://user@localhost:5678 indicates that a PostgresExportStrategy instance should be used, with the full URI then passed to that instance so it knows what database to connect to).
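
The scheme-based dispatch can be sketched roughly as below. This is an illustrative, simplified sketch, not Synth's actual code; the function name strategy_for and the returned strategy labels are assumptions for demonstration.

```rust
// Illustrative sketch only: map a --to URI to the name of the export
// strategy that would handle it. Synth's real implementation constructs
// trait objects rather than returning names.
fn strategy_for(uri: &str) -> &'static str {
    match uri.split("://").next() {
        Some("postgres") => "PostgresExportStrategy",
        Some("mongodb") => "MongoExportStrategy",
        // No recognised scheme: fall back to the default of JSON on STDOUT.
        _ => "JsonStdoutExportStrategy",
    }
}
```

For example, strategy_for("postgres://user@localhost:5678") selects the Postgres strategy, which then receives the full URI to establish its connection.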

Text-Based Formats

For each text-based format, there are two strategies - one for outputting to STDOUT (i.e. the terminal), and one for writing to a file. The former uses Rust's println! macro to output to the console while the latter uses the Rust standard library's filesystem API.
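
The split between the two output paths can be sketched as follows. The function names here are illustrative, not Synth's actual API; only the underlying mechanisms (println! versus std::fs) are taken from the text above.

```rust
use std::fs::File;
use std::io::Write;

// Sketch: STDOUT variant of a text-based export strategy.
fn export_to_stdout(data: &str) {
    println!("{}", data);
}

// Sketch: file variant, using the standard library's filesystem API.
fn export_to_file(path: &str, data: &str) -> std::io::Result<()> {
    let mut file = File::create(path)?;
    file.write_all(data.as_bytes())
}
```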

JSON

The implementation of the JSON export strategy is quite trivial - Synth's representation of data is already very similar to JSON, so conversion is as simple as performing a depth-first traversal of the data and recursively converting it to serde_json::Value instances. The serde_json::Value type implements the standard library's Display trait, meaning it can be converted to text automatically.
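
The depth-first conversion can be illustrated with a self-contained toy version. The Value enum and to_json function below are stand-ins invented for this sketch; the real code maps Synth's value type to serde_json::Value and relies on its Display implementation instead of building strings by hand.

```rust
// Toy stand-in for Synth's internal value representation.
enum Value {
    Null,
    Number(f64),
    String(String),
    Array(Vec<Value>),
    Object(Vec<(String, Value)>),
}

// Depth-first, recursive conversion to JSON text, mirroring the
// traversal described above.
fn to_json(v: &Value) -> String {
    match v {
        Value::Null => "null".to_string(),
        Value::Number(n) => n.to_string(),
        Value::String(s) => format!("\"{}\"", s),
        Value::Array(items) => {
            let inner: Vec<String> = items.iter().map(to_json).collect();
            format!("[{}]", inner.join(","))
        }
        Value::Object(fields) => {
            let inner: Vec<String> = fields
                .iter()
                .map(|(k, v)| format!("\"{}\":{}", k, to_json(v)))
                .collect();
            format!("{{{}}}", inner.join(","))
        }
    }
}
```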

JSON Lines

Outputting JSON Lines is again quite simple - since the values generated by Synth are contained within an array, each individual value in that array is converted to JSON by itself, rather than converting the array as a whole. When generating JSON object values from multiple collections, each object has an additional field inserted into it indicating which collection it was generated from.
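
A rough sketch of producing one line per object with the extra collection field spliced in. The field name "collection" and the helper name are assumptions for illustration; the actual field name Synth uses may differ.

```rust
// Illustrative helper: build one JSON Lines record for an object from
// the named collection. Fields are given as (key, raw JSON value) pairs.
fn jsonl_line(collection: &str, fields: &[(&str, &str)]) -> String {
    // Assumed field name recording the source collection.
    let mut parts = vec![format!("\"collection\":\"{}\"", collection)];
    for (k, v) in fields {
        parts.push(format!("\"{}\":{}", k, v));
    }
    format!("{{{}}}", parts.join(","))
}
```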

CSV

Compared to JSON and JSON Lines, CSV encoding is surprisingly complex. This is because Synth is capable of generating large, nested tree structures which are naturally quite difficult to properly represent in the rigid, flat structure of CSV records. While the current implementation is capable of flattening most structures into a set of scalar values as required by CSV, it is only able to handle one_of generators when all variants are scalar (with 'scalar' meaning not an object or an array).

The first task performed by the CSV export strategy after data generation is producing a set of headers. These headers are vital for correctly identifying the structure of data exported as CSV, should it later be imported. Headers are represented internally using a nested enum type CsvHeader, which has two variants - ArrayElement and ObjectProperty. Say, for example, the following generator was being converted into a header:

{
    "type": "object",
    "a": {
        "type": object",
        "b": {
            "type": "array",
            "length": 1,
            "content": {
                "type": "sting",
                "pattern": "hello world"
            }
        }
    }
}

The produced headers look like how one would access the generated value in JavaScript, were the data JSON-encoded. In this example, the header for the string value "hello world" would be a.b[0] (the 0th element of array b, which is a property of object a). This would be represented using the CsvHeader enum like so:

CsvHeader::ArrayElement {
    index: 0,
    parent: Some(Box::new(CsvHeader::ObjectProperty {
        key: "b".to_string(),
        parent: Some(Box::new(CsvHeader::ObjectProperty {
            key: "a".to_string(),
            parent: None
        }))
    }))
}
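
Rendering such a header back to its a.b[0] text form is a simple recursive walk up the parent chain. The enum definition below matches the shape shown above; the render function is a sketch of how the string could be produced, not necessarily Synth's actual code.

```rust
// CsvHeader as shown in the example above.
enum CsvHeader {
    ArrayElement { index: usize, parent: Option<Box<CsvHeader>> },
    ObjectProperty { key: String, parent: Option<Box<CsvHeader>> },
}

// Sketched rendering: walk up through parents, producing the
// JavaScript-style access path (e.g. "a.b[0]").
fn render(h: &CsvHeader) -> String {
    match h {
        CsvHeader::ArrayElement { index, parent } => {
            let prefix = parent.as_ref().map(|p| render(p)).unwrap_or_default();
            format!("{}[{}]", prefix, index)
        }
        CsvHeader::ObjectProperty { key, parent } => match parent {
            Some(p) => format!("{}.{}", render(p), key),
            None => key.clone(),
        },
    }
}
```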

A set of CsvHeader instances like the above is produced by recursively traversing a collection schema (not the generated data, as that could cause headers for optional values to be missed). Next, the CSV records for the actual generated data are produced by simultaneously traversing each value and the collection schema. The schema is traversed at the same time in order to ensure the appropriate number of null values are inserted for any optional values and for any arrays of variable length.
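
The null-padding step can be sketched like so: the schema-derived headers fix the column set, and any header with no corresponding value in a generated record gets a null cell. The function name and the flat string representation are simplifications invented for this example.

```rust
use std::collections::HashMap;

// Sketch: produce one CSV record. `headers` comes from the schema
// traversal; `value` maps the headers present in this particular
// generated value to their rendered cell text. Missing headers
// (optional fields, short variable-length arrays) become "null".
fn to_csv_row(headers: &[&str], value: &HashMap<&str, String>) -> String {
    headers
        .iter()
        .map(|h| value.get(h).cloned().unwrap_or_else(|| "null".to_string()))
        .collect::<Vec<_>>()
        .join(",")
}
```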

Database Integrations

To limit code duplication, Synth has a RelationalDataSource trait which, given implementations of a set of lower-level methods, provides a method for easily inserting data into any SQL database. Implementors of the trait need to supply methods for things like executing a SQL query, fetching the names of tables in the database, and so on. Thanks to this system, export strategies for relational databases simply need to call a create_and_insert_values function, passing in the generated data and an instance of a type implementing RelationalDataSource for their particular database.
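
The division of labour can be sketched as below. This is a deliberately simplified, hypothetical rendering: the real trait in Synth has different method names and signatures (and is async), and the row representation here is invented for illustration.

```rust
// Sketch of the backend-specific surface each database must implement.
trait RelationalDataSource {
    fn execute(&mut self, sql: &str) -> Result<(), String>;
    fn table_names(&mut self) -> Result<Vec<String>, String>;
}

// One generic insertion routine then serves every SQL backend.
// Rows are (table name, VALUES clause) pairs for illustration only.
fn create_and_insert_values<D: RelationalDataSource>(
    db: &mut D,
    rows: &[(&str, &str)],
) -> Result<(), String> {
    for (table, values) in rows {
        db.execute(&format!("INSERT INTO {} VALUES {}", table, values))?;
    }
    Ok(())
}
```

With this shape, PostgresExportStrategy and MySqlExportStrategy only differ in which RelationalDataSource implementor they construct.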

PostgreSQL

For Postgres, the type that implements RelationalDataSource is the structure PostgresDataSource. In PostgresExportStrategy, an instance of PostgresDataSource is created and the function create_and_insert_values handles all the logic of inserting values into the database appropriately.

MySQL

Similar to Postgres above, for MySQL there is a type MySqlDataSource which implements RelationalDataSource and is passed to create_and_insert_values in MySqlExportStrategy.

MongoDB

MongoDB is a NoSQL database, so the RelationalDataSource trait and the create_and_insert_values function cannot be relied upon to implement MongoExportStrategy. Fortunately, the similarity between the values generated by Synth and BSON (the data format used by MongoDB) means the MongoDB export is still not particularly complex. The generated Synth Value instances are recursively traversed and mapped to Bson instances (provided by the bson crate). The BSON values are then inserted into a Mongo database using the mongodb crate in the implementation of MongoExportStrategy.
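
The recursive Value-to-Bson mapping can be illustrated with toy stand-ins. Both enums below are simplified substitutes invented for this sketch - the real code maps Synth's actual Value type to the bson crate's Bson type, which have more variants than shown here.

```rust
// Toy stand-in for Synth's value representation (subset of variants).
enum Value {
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<Value>),
}

// Toy stand-in for the bson crate's Bson type.
#[derive(Debug, PartialEq)]
enum Bson {
    Boolean(bool),
    Double(f64),
    String(String),
    Array(Vec<Bson>),
}

// Recursive traversal mapping each value to its BSON counterpart,
// mirroring the conversion described above.
fn to_bson(v: &Value) -> Bson {
    match v {
        Value::Bool(b) => Bson::Boolean(*b),
        Value::Number(n) => Bson::Double(*n),
        Value::String(s) => Bson::String(s.clone()),
        Value::Array(items) => Bson::Array(items.iter().map(to_bson).collect()),
    }
}
```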
