I've been working more with the citibike data and noticed that the column names of the CSVs change over the years. This seems to me like it might be a pretty common thing to happen, and so I wanted to share what I did in order to get it working, and suggest an idea that might make it easier to deal with this kind of thing in the future.
Here are the headers from one of the CSV files from 2020 (`202004-citibike-tripdata.csv.zip`):

```
"tripduration","starttime","stoptime","start station id","start station name","start station latitude","start station longitude","end station id","end station name","end station latitude","end station longitude","bikeid","usertype","birth year","gender"
```
And here are the headers from a 2021 file (`202102-citibike-tripdata.csv.zip`):

```
"ride_id","rideable_type","started_at","ended_at","start_station_name","start_station_id","end_station_name","end_station_id","start_lat","start_lng","end_lat","end_lng","member_casual"
```

Let's put ourselves in the shoes of a hypothetical Flow user in 2020, before the format changed. They might have started out with a catalog like this:
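(A sketch; the schema and the materialization's connection details are elided, and the `acmeCo/postgres` materialization is just for illustration.)

```yaml
collections:
  acmeCo/tripdata:
    # schema: ...
    projections:
      # The names we want for database columns:
      start_station_id: /start/station_id
      start_station_name: /start/station_name
      # The names as they appear in the 2020 CSV headers:
      start station id: /start/station_id
      start station name: /start/station_name
      # ... and so on

captures:
  acmeCo/source-s3:
    endpoint:
      airbyteSource:
        image: ghcr.io/estuary/source-s3:c13a754
        config:
          bucket: "tripdata"
          matchKeys: "\\d{6}-citibike-tripdata.csv.zip"
          region: "us-east-1"
    bindings:
      - resource:
          stream: tripdata/
          syncMode: incremental
        target: acmeCo/tripdata

materializations:
  acmeCo/postgres:
    endpoint:
      # connection details elided
    bindings:
      - source: acmeCo/tripdata
        resource:
          table: tripdata
        fields:
          # The CSV-header projections exist only for parsing, so the
          # recommended field set can't be used as-is; every desired
          # column must be enumerated explicitly.
          recommended: false
          include:
            start_station_id: {}
            start_station_name: {}
            # ... and so on
```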
The CSV starts out with one set of column names, but they want something without whitespace for their database columns, so the collection has two sets of projections. This is already a little less than ideal because the presence of the projections for CSV parsing requires them to use `recommended: false` and to enumerate all of the desired projections to materialize.
When they learn about the change to the column names starting in 2021, they then need to add a third set of projections to the collection so that the 2021 CSV files can be parsed into JSON matching the schema, along with a new capture for the 2021 data (let's pretend that the 2020 data might still get corrected or just arrive late). So they add to their catalog:
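(Again a sketch; the JSON pointers are illustrative.)

```yaml
collections:
  acmeCo/tripdata:
    projections:
      # ... the two existing sets of projections, plus the 2021 column names:
      started_at: /start/time
      start_lat: /start/lat
      start_lng: /start/lon
      # ... and so on

captures:
  acmeCo/2021-source-s3:
    endpoint:
      airbyteSource:
        image: ghcr.io/estuary/source-s3:c13a754
        config:
          bucket: "tripdata"
          matchKeys: "2021\\d\\d-citibike-tripdata.csv.zip"
          region: "us-east-1"
    bindings:
      - resource:
          stream: tripdata/
          syncMode: incremental
        target: acmeCo/tripdata
```

Presumably they'd also narrow the existing capture's `matchKeys` to match only the 2020 files.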
This actually works pretty well, but I think we can do better. I think it's less than ideal to put all of the projections under the collection in this case, because they're truly details about how the data is stored in an external system. A better factoring would put the relevant CSV projections pretty close to the regex that's filtering the filenames. So here's how I think that might look:
```yaml
collections:
  acmeCo/tripdata:
    # schema: ...
    projections:
      # These projections define how we _want_ to convert to/from tabular representations
      start_station_id: /start/station_id
      start_station_name: /start/station_name
      # ... and so on

captures:
  acmeCo/2021-source-s3:
    endpoint:
      airbyteSource:
        image: ghcr.io/estuary/source-s3:c13a754
        config:
          bucket: "tripdata"
          matchKeys: "2021\\d\\d-citibike-tripdata.csv.zip"
          region: "us-east-1"
    bindings:
      - resource:
          stream: tripdata/
          syncMode: incremental
        target: acmeCo/tripdata
        fields:
          started_at: /start/time
          start_lat: /start/lat
          start_lng: /start/lon
          # ... and so on

  # Projections attached to captures are used to deal with ugly data that's beyond our control.
  acmeCo/source-s3:
    endpoint:
      airbyteSource:
        image: ghcr.io/estuary/source-s3:c13a754
        config:
          bucket: "tripdata"
          matchKeys: "2020\\d\\d-citibike-tripdata.csv.zip"
          region: "us-east-1"
    bindings:
      - resource:
          stream: tripdata/
          syncMode: incremental
        target: acmeCo/tripdata
        fields:
          start station id: /start/station_id
          start station name: /start/station_name
          start station latitude: /start/lat
          start station longitude: /start/lon
          # ... and so on
```
What that catalog would mean, more concretely, is that the `fields` in each binding would be merged into the `projections` from the collection, and the result is what would be passed as part of the `ConfiguredStream`. This makes capture bindings more symmetrical with materialization bindings. It also means that the materialization gets less verbose, because it's able to just use the `recommended` projections. Since the CSV-related projections aren't part of the collection anymore, they'll no longer be considered "recommended", and you'll no longer need to list all the projections explicitly in each materialization.
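For example, the effective projections handed to the 2020 capture's connector would be the union of the two maps (illustrative only; this isn't the exact wire format of the `ConfiguredStream`):

```yaml
# Effective projections for the acmeCo/source-s3 binding:
start_station_id: /start/station_id        # from the collection
start_station_name: /start/station_name    # from the collection
start station id: /start/station_id        # from the binding's fields
start station name: /start/station_name    # from the binding's fields
# ... and so on
```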
Now, I'm a little hesitant to say that this one example is necessarily representative or broadly applicable to common use cases. That's open to debate. But it does feel like a better factoring for this case.
## Possible Alternatives
### Using the parser config directly
One alternative implementation might be to make use of `parser.projections` within the endpoint config for filesource connectors. That approach would only work for filesource connectors, though, which is a bummer since you might also want database connectors to make use of these projections. So it's not a very generalizable solution. Of course, existing third-party connectors won't make use of projections anyway, but that's at least a more tractable problem, since projections are at least theoretically part of the spec. I'll note that using `parser.projections` would *almost* work today: we do allow parser options to be set on filesource connectors, but the `projections` property currently gets overwritten by the projections from the `ConfiguredStream`, rather than merged.
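For reference, that approach might look something like this (a sketch; the parser config shape is abbreviated):

```yaml
captures:
  acmeCo/source-s3:
    endpoint:
      airbyteSource:
        image: ghcr.io/estuary/source-s3:c13a754
        config:
          bucket: "tripdata"
          matchKeys: "2020\\d\\d-citibike-tripdata.csv.zip"
          region: "us-east-1"
          parser:
            projections:
              start station id: /start/station_id
              start station name: /start/station_name
              # ... and so on
```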
### Making `recommended` a property of collection projections
Another alternative would be to scrap the whole idea of projections being a part of captures, and just make it more ergonomic to work with projections on the collection by adding a property to projections to explicitly mark them as either "recommended" or not. This might look something like:
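(The `recommended` property and the extended projection syntax here are hypothetical:)

```yaml
collections:
  acmeCo/tripdata:
    # schema: ...
    projections:
      start_station_id:
        location: /start/station_id
        recommended: true
      start station id:
        location: /start/station_id
        recommended: false
      # ... and so on
```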
This would improve the situation for materializations, since they would then be able to just use the default fields. But it still couples the collection spec to a concern that seems like an implementation detail of how the data is stored.