# Export Strategy
Once Synth has generated some data, the manner in which that data is output is determined by the 'export strategy'. Synth has several different export strategies, each facilitating the writing of data either to a different database system or to the filesystem in a text-based format.
Internally, an export strategy is represented using the trait `ExportStrategy`, which is implemented by structures including `PostgresExportStrategy`, `JsonFileExportStrategy`, `MongoExportStrategy`, and so on. Which of these is used is determined by the `--to` command-line parameter (the default is to output to STDOUT in JSON format).
As described in the user documentation here, the `--to` parameter is a URI. It is therefore the URI scheme that determines the strategy, with the full URI passed to the strategy for it to use (e.g. the `postgres://` in `--to postgres://user@localhost:5678` indicates that a `PostgresExportStrategy` instance should be used, with the full URI then passed to that instance so it knows which database to connect to).
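In rough outline, the dispatch works along the following lines. This is a sketch only: the actual trait definition, method signatures, strategy constructors, and error types in the Synth codebase differ, and the name of the STDOUT strategy here is assumed.

```rust
/// Illustrative stand-in for the data produced by a generation run.
struct GeneratedData;

type ExportResult = Result<(), Box<dyn std::error::Error>>;

/// A trait in the spirit of Synth's `ExportStrategy`.
trait ExportStrategy {
    fn export(&self, data: GeneratedData) -> ExportResult;
}

struct PostgresExportStrategy {
    uri: String,
}

struct JsonStdoutExportStrategy;

impl ExportStrategy for PostgresExportStrategy {
    fn export(&self, _data: GeneratedData) -> ExportResult {
        // Connect to the database named by `self.uri` and insert the data...
        Ok(())
    }
}

impl ExportStrategy for JsonStdoutExportStrategy {
    fn export(&self, _data: GeneratedData) -> ExportResult {
        // Serialize the data to JSON and print it to STDOUT...
        Ok(())
    }
}

/// Choose a strategy based on the scheme of the `--to` URI.
fn strategy_for(uri: &str) -> Box<dyn ExportStrategy> {
    match uri.split("://").next() {
        Some("postgres") => Box::new(PostgresExportStrategy { uri: uri.to_owned() }),
        // ...other schemes (MySQL, MongoDB, the file-based formats) elided...
        _ => Box::new(JsonStdoutExportStrategy), // default: JSON to STDOUT
    }
}
```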
For each text-based format there are two strategies: one for outputting to STDOUT (i.e. the terminal), and one for writing to a file. The former uses Rust's `println!` macro to output to the console, while the latter uses the Rust standard library's filesystem API.
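For instance, sketched against the trait above, the file-based half of the JSON pair might look like the following, with a hypothetical `to_json_string` helper standing in for the real serialization code:

```rust
use std::path::PathBuf;

/// Hypothetical helper standing in for the real JSON serialization code.
fn to_json_string(_data: &GeneratedData) -> String {
    String::new()
}

struct JsonFileExportStrategy {
    file: PathBuf,
}

impl ExportStrategy for JsonFileExportStrategy {
    fn export(&self, data: GeneratedData) -> ExportResult {
        // File variant: write via the standard library's filesystem API.
        std::fs::write(&self.file, to_json_string(&data))?;
        Ok(())
    }
}
```

The STDOUT counterpart is identical except that it does `println!("{}", to_json_string(&data))` instead of writing to a file.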
The implementation of the JSON export strategy is quite trivial: Synth's representation of data is already quite similar to JSON, so conversion is as simple as performing a depth-first traversal of the data and recursively converting it to `serde_json::Value` instances. The `serde_json::Value` type implements the standard library's `Display` trait, meaning it can be converted to text automatically.
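As a sketch, using a simplified stand-in for Synth's internal `Value` type (the real type has more variants, for instance for numbers of different widths and for date/time values):

```rust
use serde_json::Value as JsonValue;
use std::collections::BTreeMap;

/// A simplified stand-in for Synth's internal `Value` type.
enum Value {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

/// Depth-first, recursive conversion into `serde_json::Value`.
fn to_json(value: Value) -> JsonValue {
    match value {
        Value::Null => JsonValue::Null,
        Value::Bool(b) => JsonValue::Bool(b),
        Value::Number(n) => JsonValue::from(n),
        Value::String(s) => JsonValue::String(s),
        Value::Array(elements) => {
            JsonValue::Array(elements.into_iter().map(to_json).collect())
        }
        Value::Object(fields) => {
            JsonValue::Object(fields.into_iter().map(|(k, v)| (k, to_json(v))).collect())
        }
    }
}
```

Printing the result via `Display` (or `serde_json::to_string_pretty`) then yields the textual output.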
Outputting JSON Lines is again quite simple - as the values generated by Synth are contained within an array, all that needs to be done is have each individual value in said array be converted to JSON by itself, rather than convert the array as a whole. When generating JSON object values from multiple collections, each object has an additional field inserted into it which indicates which collection it was generated from.
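A sketch of that loop, assuming the per-collection data has already been converted to `serde_json::Value` arrays (the name of the tagging field is an assumption here):

```rust
use serde_json::{json, Value as JsonValue};

/// Emit one JSON value per line rather than one enclosing array.
fn write_json_lines(collections: Vec<(String, Vec<JsonValue>)>) {
    for (collection, values) in collections {
        for mut value in values {
            if let Some(object) = value.as_object_mut() {
                // Record which collection the object was generated from.
                object.insert("collection".to_owned(), json!(collection.clone()));
            }
            println!("{value}"); // `Display` prints compact, single-line JSON
        }
    }
}
```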
Compared to JSON and JSON Lines, CSV encoding is surprisingly complex. This is because Synth is capable of generating large, nested tree structures which are naturally quite difficult to properly represent in the rigid, flat structure of CSV records. While the current implementation is capable of flattening most structures into a set of scalar values as required by CSV, it is only able to handle `one_of` generators when all variants are scalar (with 'scalar' meaning not an object or an array).
The first task performed by the CSV export strategy after data generation is producing a set of headers. These headers are vital for correctly identifying the structure of data exported as CSV should it later be imported. Headers are represented internally using a nested enum type `CsvHeader`, which has two variants: `ArrayElement` and `ObjectProperty`. Say, for example, the following generator was being converted into a header:
```json
{
    "type": "object",
    "a": {
        "type": "object",
        "b": {
            "type": "array",
            "length": 1,
            "content": {
                "type": "string",
                "pattern": "hello world"
            }
        }
    }
}
```
The produced headers look like how one would access the generated value in JavaScript were the data JSON-encoded, so in this example the header for the string value `"hello world"` would be `a.b[0]` (the 0th element of array `b`, which is a property of object `a`). This would be represented using the `CsvHeader` enum like so:
```rust
CsvHeader::ArrayElement {
    index: 0,
    parent: Some(Box::new(CsvHeader::ObjectProperty {
        key: "b".to_string(),
        parent: Some(Box::new(CsvHeader::ObjectProperty {
            key: "a".to_string(),
            parent: None
        }))
    }))
}
```
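The enum itself might be defined along the following lines (a sketch; the definition in the codebase may differ), with a `Display` implementation producing the `a.b[0]` style of path:

```rust
use std::fmt;

/// A sketch of the nested `CsvHeader` type described above.
enum CsvHeader {
    ArrayElement { index: usize, parent: Option<Box<CsvHeader>> },
    ObjectProperty { key: String, parent: Option<Box<CsvHeader>> },
}

impl fmt::Display for CsvHeader {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CsvHeader::ArrayElement { index, parent } => {
                // Render the parent path first, then the index: `parent[0]`.
                if let Some(p) = parent {
                    write!(f, "{}", p)?;
                }
                write!(f, "[{}]", index)
            }
            CsvHeader::ObjectProperty { key, parent } => {
                // Render the parent path first, then the key: `parent.key`.
                if let Some(p) = parent {
                    write!(f, "{}.", p)?;
                }
                write!(f, "{}", key)
            }
        }
    }
}
```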
A set of `CsvHeader` instances like the above is produced by recursively traversing a collection schema (not the generated data, as that would potentially cause headers for optional values to be missed). Next, the CSV records for the actual generated data are produced by traversing each value and its collection schema simultaneously. The schema is traversed at the same time as the value in order to ensure the appropriate number of null values is inserted for any optional values and for any arrays of variable length.
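The record-production step can be pictured roughly as follows, reusing the `Value` and `CsvHeader` stand-ins from the sketches above; absent optional values simply become empty fields:

```rust
/// Follow a header's path down into a generated value, if present.
fn lookup<'a>(value: &'a Value, header: &CsvHeader) -> Option<&'a Value> {
    match header {
        CsvHeader::ObjectProperty { key, parent } => {
            let base = parent.as_ref().map_or(Some(value), |p| lookup(value, p))?;
            if let Value::Object(fields) = base { fields.get(key) } else { None }
        }
        CsvHeader::ArrayElement { index, parent } => {
            let base = parent.as_ref().map_or(Some(value), |p| lookup(value, p))?;
            if let Value::Array(elements) = base { elements.get(*index) } else { None }
        }
    }
}

/// Flatten one generated value into a CSV record, one field per header.
fn to_record(headers: &[CsvHeader], value: &Value) -> Vec<String> {
    headers
        .iter()
        .map(|header| match lookup(value, header) {
            Some(Value::String(s)) => s.clone(),
            Some(Value::Number(n)) => n.to_string(),
            Some(Value::Bool(b)) => b.to_string(),
            // Absent optional values (and explicit nulls) become empty fields.
            _ => String::new(),
        })
        .collect()
}
```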
To limit code duplication, Synth has a `RelationalDataSource` trait which, provided a set of methods is implemented, allows data to be inserted easily into any SQL database. Implementors of the trait need to implement a few different methods for things like executing a SQL query, fetching the names of tables in the database, and so on. Thanks to this system, export strategies for relational databases simply need to call the `create_and_insert_values` function, passing in the generated data and an instance of a type implementing `RelationalDataSource` for their particular database.
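In sketch form (the trait in the Synth codebase is async and has a larger set of required methods; the signatures here are illustrative):

```rust
/// Illustrative stand-in for the data produced by a generation run.
struct GeneratedData;

type SqlResult<T> = Result<T, Box<dyn std::error::Error>>;

trait RelationalDataSource {
    /// Execute a single SQL statement against the target database.
    fn execute(&self, query: &str) -> SqlResult<()>;

    /// Fetch the names of the tables present in the target database.
    fn table_names(&self) -> SqlResult<Vec<String>>;

    // ...further required methods: column information, value conversion, etc.
}

/// Shared insertion logic, usable with any type implementing the trait.
/// The real function builds batched INSERT statements; only the shape of
/// the API is shown here.
fn create_and_insert_values<D: RelationalDataSource>(
    _data: GeneratedData, // rows to insert are derived from the generated data
    datasource: &D,
) -> SqlResult<()> {
    for table in datasource.table_names()? {
        // ...build and execute batched INSERT statements for `table`...
        datasource.execute(&format!("-- INSERT INTO {} ...", table))?;
    }
    Ok(())
}
```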
For Postgres, the type that implements `RelationalDataSource` is the structure `PostgresDataSource`. In `PostgresExportStrategy`, an instance of `PostgresDataSource` is created and the `create_and_insert_values` function handles all the logic of inserting values into the database appropriately.
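Reusing the sketch above, with a hypothetical `connect` constructor, the Postgres strategy then reduces to very little:

```rust
/// Hypothetical skeleton; the real type wraps a database connection.
struct PostgresDataSource;

impl PostgresDataSource {
    /// Hypothetical constructor: open a connection using the `--to` URI.
    fn connect(_uri: &str) -> SqlResult<Self> {
        Ok(PostgresDataSource)
    }
}

impl RelationalDataSource for PostgresDataSource {
    fn execute(&self, _query: &str) -> SqlResult<()> {
        Ok(()) // ...run the statement against Postgres via the driver...
    }

    fn table_names(&self) -> SqlResult<Vec<String>> {
        Ok(Vec::new()) // ...query `information_schema.tables`...
    }
}

struct PostgresExportStrategy {
    uri: String,
}

impl PostgresExportStrategy {
    fn export(&self, data: GeneratedData) -> SqlResult<()> {
        let datasource = PostgresDataSource::connect(&self.uri)?;
        // All of the insertion logic lives in the shared function.
        create_and_insert_values(data, &datasource)
    }
}
```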
Similar to Postgres, for MySQL there is a type `MySqlDataSource` which implements `RelationalDataSource` and is passed to `create_and_insert_values` in `MySqlExportStrategy`.
MongoDB is a NoSQL database, so the `RelationalDataSource` trait and the `create_and_insert_values` function cannot be relied upon to implement `MongoExportStrategy`. Fortunately, the similarity between the values generated by Synth and BSON (the data storage format used by MongoDB) means the MongoDB export is still not particularly complex. The generated Synth `Value` instances are recursively traversed and mapped to `Bson` instances (provided by the `bson` crate). The BSON values are then inserted into a Mongo database using the `mongodb` crate in the implementation of `MongoExportStrategy`.
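A sketch of that mapping, again using a simplified stand-in for Synth's internal `Value` type:

```rust
use bson::{Bson, Document};
use std::collections::BTreeMap;

/// A simplified stand-in for Synth's internal `Value` type.
enum Value {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

/// Recursively map a generated value onto the `bson` crate's `Bson` type.
fn to_bson(value: Value) -> Bson {
    match value {
        Value::Null => Bson::Null,
        Value::Bool(b) => Bson::Boolean(b),
        Value::Number(n) => Bson::Double(n),
        Value::String(s) => Bson::String(s),
        Value::Array(elements) => Bson::Array(elements.into_iter().map(to_bson).collect()),
        Value::Object(fields) => {
            let mut document = Document::new();
            for (key, field) in fields {
                document.insert(key, to_bson(field));
            }
            Bson::Document(document)
        }
    }
}
```

The resulting documents can then be written in bulk with the `mongodb` crate's collection API (e.g. `insert_many`).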