A tool to compare two tables of data. Currently the tool supports tables from Postgres and CSV files.
The primary use case is to compare the output of a query executed at different points in time. For example, in a batch ETL process that runs every night, you can compare the previous batch with the new batch.
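For instance, a nightly job might export the new batch to CSV and compare it against the previous night's export. A minimal sketch of that flow (the database name, query, and file paths here are hypothetical, not part of diff-table):

# Export tonight's batch, sorted by the key column.
psql mydb -c "\copy (select id, name, city from events order by id) to 'batch_new.csv' csv header"

# Compare against last night's export.
diff-table \
  -csv1 batch_prev.csv \
  -csv2 batch_new.csv \
  -key id

# Rotate the file for the next run.
mv batch_new.csv batch_prev.csv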
Download a release or install from source:
go get -u github.com/chop-dbhi/diff-table/cmd/diff-table
The minimum requirement is to specify two tables (CSV or relational) and the key columns. For example, using the example data provided in the repository, the command would look like this:
diff-table \
-csv1 example/file1.csv \
-csv2 example/file2.csv \
-key id
The default output is a JSON-encoded object with a summary of the differences, including column and row changes.
{
"total_rows": 4,
"columns_added": [
"city"
],
"columns_dropped": [],
"type_changes": {},
"rows_added": 1,
"rows_deleted": 1,
"rows_changed": 1
}
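Because the summary is plain JSON, it can be consumed directly in scripts. For example, a sketch that fails a step when any row-level difference is found, using jq (jq is not part of diff-table; this check is one assumption about how the output might be used, and it only inspects the row counts, not column changes):

diff-table \
  -csv1 example/file1.csv \
  -csv2 example/file2.csv \
  -key id \
  | jq -e '.rows_added + .rows_deleted + .rows_changed == 0' > /dev/null \
  || echo "tables differ"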
Adding the -diff option to the command will include the row-level changes in the output.
diff-table \
-csv1 example/file1.csv \
-csv2 example/file2.csv \
-key id \
-diff
{
"total_rows": 4,
"columns_added": [
"city"
],
"columns_dropped": [],
"type_changes": {},
"rows_added": 1,
"rows_deleted": 1,
"rows_changed": 2,
"row_diffs": [
{
"key": {
"id": "1"
},
"changes": {
"city": {
"old": null,
"new": "Trenton"
},
"color": {
"old": "Blue",
"new": "Teal"
}
}
},
{
"key": {
"id": "3"
},
"changes": {
"city": {
"old": null,
"new": "Philadelphia"
}
}
}
],
"new_rows": [
{
"city": "Allentown",
"color": "Black",
"gender": "Male",
"id": "4",
"name": "Neal"
}
],
"deleted_rows": [
{
"id": "2"
}
]
}
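The row_diffs array can be mined directly as well. For example, a sketch that lists the keys of rows whose city value changed, again using jq:

diff-table \
  -csv1 example/file1.csv \
  -csv2 example/file2.csv \
  -key id \
  -diff \
  | jq '.row_diffs[] | select(.changes.city) | .key'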
The above type of output is convenient as a summary; however, in some use cases a set of events may be more useful. Using the -events option results in the changes being streamed as they are discovered rather than aggregated into a single output object.
diff-table \
-csv1 example/file1.csv \
-csv2 example/file2.csv \
-key id \
-events
The output is a newline-delimited set of JSON-encoded events. Column-based changes will always come first. The type field denotes the type of event, which can be switched on during consumption. The base event structure looks as follows:
{
// Event name/type.
"type": "row-added",
// Unix epoch timestamp in seconds.
"time": 1520114848,
// Offset of the row in the batch, if applicable. This can also be used as an
// ordering mechanism.
"offset": 1,
// Key of the row this event pertains to, if applicable.
"key": {...},
// Full snapshot of the row. This will be present in row-added events, but
// can be optionally included for row-changed and row-removed events using
// the -data option.
"data": {...},
// Set of changes found relative to the old row. Each key is the column
// name and the value is an object with `old` and `new` keys with the
// respective values.
"changes": {...},
// If a column-added, column-changed, or column-removed event, this is the
// name of the affected column.
"column": "city",
// If a column-changed event, this is the old type.
"old_type": "int32",
// If a column-changed event, this is the new type.
"new_type": "int64"
}
{
"type": "column-added",
"column": "city"
}
{
"type": "row-changed",
"offset": 1,
"key": {
"id": "1"
},
"changes": {
"city": {
"old": null,
"new": "Trenton"
},
"color": {
"old": "Blue",
"new": "Teal"
}
}
}
{
"type": "row-removed",
"offset": 2,
"key": {
"id": "2"
}
}
{
"type": "row-changed",
"offset": 3,
"key": {
"id": "3"
},
"changes": {
"city": {
"old": null,
"new": "Philadelphia"
}
}
}
{
"type": "row-added",
"offset": 4,
"key": {
"id": "4"
},
"data": {
"city": "Allentown",
"color": "Black",
"gender": "Male",
"id": "4",
"name": "Neal"
}
}
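Since each event is a single line of JSON, the stream can be filtered or routed line by line. For example, a sketch that keeps only the row-level events, again using jq:

diff-table \
  -csv1 example/file1.csv \
  -csv2 example/file2.csv \
  -key id \
  -events \
  | jq -c 'select(.type | startswith("row-"))'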
In addition to change events, snapshots are supported. A snapshot scans a table and emits a row-stored event containing the current state of each row. These events are structurally equivalent to row-added events that include the full data.
The primary use case for them is to initialize a stream from a stateful starting point, but they can also be used over time to create new starting points for consumers. This prevents consumers from needing to know the starting state when the stream was initiated, and from needing to read the stream from the very beginning to build up the current state. With snapshots, a consumer can start from the last known snapshot and consume from there.
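On the consumer side, the current state can be folded from a snapshot followed by later change events. Below is a sketch using jq, assuming a single key column named id and hypothetical file names:

cat snapshot.ndjson changes.ndjson \
  | jq -s 'reduce .[] as $e ({};
      if $e.type == "row-stored" or $e.type == "row-added" then
        .[$e.key.id] = $e.data
      elif $e.type == "row-changed" then
        .[$e.key.id] += ($e.changes | map_values(.new))
      elif $e.type == "row-removed" then
        del(.[$e.key.id])
      else . end)'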
To create a snapshot, use the -snapshot option with the newest table or CSV file specified.
diff-table \
-csv1 example/file1.csv \
-key id \
-snapshot
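Based on the description above, each emitted event should look like a row-added event carrying the full row data, but with the row-stored type. An illustrative example (the exact fields follow the row-added example earlier):

{
  "type": "row-stored",
  "offset": 1,
  "key": {
    "id": "1"
  },
  "data": {...}
}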
Below are examples of how tables can be specified, including SQL-based tables and CSV files (with sorted or unsorted rows), and how columns can be renamed (not in the data source, just in the diff-table runtime) before the tables are compared.
diff-table \
-db postgres://localhost:5432/postgres \
-table1 data_v1 \
-table2 data_v2 \
-key id
diff-table \
-db postgres://localhost:5432/postgres \
-db2 postgres://localhost:5435/other \
-table1 data_v1 \
-table2 data_v2 \
-key id
Note: this assumes the CSV files are pre-sorted by the specified key columns.
diff-table \
-csv1 data_v1.csv \
-csv2 data_v2.csv \
-key id
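If a CSV file is not already sorted, it can be pre-sorted outside the tool; alternatively, the -csv1.sort and -csv2.sort options shown next sort the input in-process. A sketch of pre-sorting by the first (id) column while keeping the header row, using standard shell tools (the file names are hypothetical):

(head -n 1 data_v1.csv && tail -n +2 data_v1.csv | sort -t, -k1,1) > data_v1.sorted.csv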
diff-table \
-csv1 data_v1.csv \
-csv1.sort \
-csv2 data_v2.csv \
-csv2.sort \
-key id
Note: this assumes the CSV file is pre-sorted by the specified key columns.
diff-table \
-csv1 data_v1.csv \
-db2 postgres://localhost:5432/postgres \
-table2 data_v2 \
-key id
The data_v1.foo column will be renamed to bar (just in the diff-table runtime, not in the source) and compared to the data_v2.bar column. The same will happen for the baz:buz column rename.
diff-table \
-db postgres://localhost:5432/postgres \
-table1 data_v1 \
-rename1 "foo:bar,baz:buz" \
-table2 data_v2 \
-key id
In addition to tables, arbitrary SQL statements are supported as well. The basic requirement is that the columns specified in -key must exist in the statement and the data must be ordered by the key columns.
diff-table \
-db postgres://localhost:5432/postgres \
-key id \
-sql1 "select id, col1, col2 from table1 order by id" \
-sql1 "select id, col1, col2 from table2 order by id"