Gemini is designed around the concept of a job. There are a number of defined jobs, each of which performs a limited function; for example, MutationJob applies mutations to the database clusters (a sketch of the common job shape follows the list below).
- **MutationJob**: Applies mutations to the clusters. The mutations can be of several types: basic INSERT and DELETE statements with various conditions, or DDL statements such as ALTERing the structure of the table. These mutations happen with different frequencies, a plain INSERT being the most common and ALTER the most infrequent.
- **ValidationJob**: Reads one or more rows from both clusters and compares them. If they differ, an error is raised and the program either terminates or continues, according to the user's preference.
- **WarmupJob**: Much like a regular MutationJob, but it never issues DELETE or ALTER operations. Its purpose is to allow a proper buildup of data in the clusters. It has a separate timeout so the user can decide how long it should run before the ordinary jobs start.
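The jobs above all follow roughly the same pattern. A minimal Go sketch of that shape, with hypothetical names (the actual signatures in Gemini differ):

```go
package jobs

import "context"

// Job is a hypothetical abstraction over MutationJob, ValidationJob and
// WarmupJob: each performs one narrow function against the clusters and
// runs until its context is cancelled (or the pump channel closes).
type Job interface {
	Name() string
	Run(ctx context.Context) error
}
```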
Gemini has three modes of operation to allow for various types of workloads.
- **READ**: Intended to be used with a known schema on an existing set of clusters. In this mode only validation jobs are run.
- **WRITE**: Applies mutations for the entire program execution.
- **MIXED**: The most common mode; it applies both mutations and validations (see the sketch after this list).
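As an illustration, mode selection boils down to choosing which job kinds to schedule. A hypothetical sketch (the function and names are illustrative, not Gemini's actual API):

```go
package jobs

// jobsForMode is an illustrative sketch of how the mode could select
// which jobs to run; the function and its names are hypothetical.
func jobsForMode(mode string) []string {
	switch mode {
	case "read":
		return []string{"ValidationJob"} // validation only, against existing data
	case "write":
		return []string{"MutationJob"} // mutations only
	default: // "mixed"
		return []string{"MutationJob", "ValidationJob"}
	}
}
```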
The application allows the user to decide the level of concurrency that Gemini operates at.
The flag `--concurrency` currently means that the application will create that number of
READ and WRITE jobs when running in mixed mode. When running in WRITE or READ
mode it corresponds to the exact number of job-executing goroutines. Each goroutine works
only on a subset of the data (henceforth called a *bucket*) when mutating and validating, to avoid
concurrent-modification races when validating the system under test.
Such races can still happen when executing read queries that perform an index scan.
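A minimal sketch of this concurrency model, with hypothetical names: one goroutine per unit of concurrency, each bound to its own bucket of partition keys.

```go
package main

import "sync"

// runWorkers is a sketch (names are illustrative) of the concurrency model:
// one goroutine per unit of concurrency, each bound to its own bucket of
// partition keys, so no two goroutines touch the same partitions.
func runWorkers(concurrency int, buckets []<-chan []any, work func(<-chan []any)) {
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func(bucket <-chan []any) {
			defer wg.Done()
			work(bucket) // mutate/validate using keys from this bucket only
		}(buckets[i])
	}
	wg.Wait()
}
```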
The pump has an almost trivial purpose. Its job is to generate signals for the goroutines that
are executing jobs. When the pump channel is closed, the goroutines know that it is time to stop.
Each heartbeat that the pump emits also carries a `time.Duration` indicating that the goroutine that
receives this heartbeat should wait a little while before executing. This feature is not currently
in use, but the idea is to introduce some jitter into the execution flow.
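In Go terms, this maps onto a channel of `time.Duration` values. A sketch of how a job goroutine could consume the pump's heartbeats (names are illustrative):

```go
package main

import "time"

// worker consumes heartbeats from the pump; the range loop ends, and the
// worker stops, when the pump channel is closed.
func worker(heartbeat <-chan time.Duration, doWork func()) {
	for delay := range heartbeat {
		time.Sleep(delay) // jitter carried by the heartbeat (currently always zero)
		doWork()
	}
}
```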
The application generates partition ids through a `Generator` that creates a steady flow of partition
key components for the desired concurrency. Each goroutine is connected to a partition
that the generator controls. This partition continuously emits
new partition ids in the form of a `[]any`. These keys are created in the same way as the
driver creates them, to ensure that each goroutine only processes partition keys from its designated bucket.
The emitted values are also copied into another list that keeps the old partition ids for
later reuse. The idea behind reusing partition keys is that the probability of hitting the same partition
key can be so small that, if we simply generated a new random key whenever we attempted a validation,
we might never actually read any data at all in the validation jobs. Instead we reuse previously
issued partition keys, so we can be sure that at some point we operated on that partition key. We may
have deleted the key since, but at least the resulting "empty set" makes sense in that case.

NB: There are probably issues with this approach and we may want to refine it further.
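To make the reuse scheme concrete, here is a sketch under the assumptions above; the type and its methods are hypothetical, not Gemini's actual API:

```go
package main

import "math/rand"

// partitionSource sketches the key-reuse scheme described above: fresh keys
// come from the generator, and previously issued keys are kept for reuse.
type partitionSource struct {
	fresh <-chan []any // new keys emitted by the generator for this bucket
	old   [][]any      // previously issued keys, retained for reuse
}

// forMutation hands out a fresh key and remembers it for later validation.
func (p *partitionSource) forMutation() []any {
	pk := <-p.fresh
	p.old = append(p.old, pk)
	return pk
}

// forValidation reuses a previously issued key, so the read targets a
// partition that Gemini operated on at some point (possibly since deleted).
func (p *partitionSource) forValidation() []any {
	if len(p.old) == 0 {
		return p.forMutation()
	}
	return p.old[rand.Intn(len(p.old))]
}
```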
There are a number of core data structures that have a central place in Gemini's design; a condensed sketch of how they fit together follows the list below.
- **Schema**: Gemini has a top-level data structure named `Schema`. This structure is a loose wrapper around a keyspace and a list of tables. It furthermore contains exported methods for generating a schema and its corresponding CQL DDL statements, allowing the tables to be created in the database. It also holds the methods for creating queries of all kinds, which are used in the main Gemini program.
- **Table**: Tables are conceptually very similar to regular CQL tables. Their base elements are partition keys, clustering keys and columns. They may also contain materialized views and indexes, depending on user preferences.
- **Columns**: `Columns` is a list of `ColumnDef` and represents a set of columns such as partition keys or clustering keys.
- **ColumnDef**: A `ColumnDef` is essentially a `Type` with a name and defines a column in the table.
- **Type**: There are two kinds of types (pun intended). There are simple types (`SimpleType`) such as `int`, `decimal` etc., and there are complex types, each of which is a new `Type`, such as `MapType`, composed of simple types. Each type is responsible for generating the actual data that is inserted into the database. For example, the [generator](architecture.md#Partition Keys) delegates the actual data construction to the instantiated types of the table it is working on.