I've been working more with the citibike data and noticed that the column names of the CSVs change over the years. This seems to me like it might be a pretty common thing to happen, and so I wanted to share what I did in order to get it working, and suggest an idea that might make it easier to deal with this kind of thing in the future.
Here are the headers from one of the CSV files from 2020 (`202004-citibike-tripdata.csv.zip`):

```
"tripduration","starttime","stoptime","start station id","start station name","start station latitude","start station longitude","end station id","end station name","end station latitude","end station longitude","bikeid","usertype","birth year","gender"
```
And here are the headers from a 2021 file (`202102-citibike-tripdata.csv.zip`):

```
"ride_id","rideable_type","started_at","ended_at","start_station_name","start_station_id","end_station_name","end_station_id","start_lat","start_lng","end_lat","end_lng","member_casual"
```

Let's put ourselves in the shoes of a hypothetical Flow user in 2020, before the format changed. They might have started out with a catalog like this:
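(A sketch; the schema and the materialization's connection details are elided, and the `acmeCo/postgres` materialization is just for illustration.)

```yaml
collections:
  acmeCo/tripdata:
    # schema: ...
    projections:
      # The names we want for database columns:
      start_station_id: /start/station_id
      start_station_name: /start/station_name
      # The names as they appear in the 2020 CSV headers:
      start station id: /start/station_id
      start station name: /start/station_name
      # ... and so on

captures:
  acmeCo/source-s3:
    endpoint:
      airbyteSource:
        image: ghcr.io/estuary/source-s3:c13a754
        config:
          bucket: "tripdata"
          matchKeys: "\\d{6}-citibike-tripdata.csv.zip"
          region: "us-east-1"
    bindings:
      - resource:
          stream: tripdata/
          syncMode: incremental
        target: acmeCo/tripdata

materializations:
  acmeCo/postgres:
    endpoint:
      # connection details elided
    bindings:
      - source: acmeCo/tripdata
        resource:
          table: tripdata
        fields:
          # The CSV-header projections exist only for parsing, so the
          # recommended field set can't be used as-is; every desired
          # column must be enumerated explicitly.
          recommended: false
          include:
            start_station_id: {}
            start_station_name: {}
            # ... and so on
```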
The CSV starts out with one set of column names, but they want something without whitespace for their database columns, so the collection has two sets of projections. This is already a little less than ideal because the presence of the projections for CSV parsing requires them to use `recommended: false` and to enumerate all of the desired projections to materialize.
When they learn about the change to the column names starting in 2021, they then need to add a third set of projections to the collection so that the 2021 CSV files can be parsed into JSON matching the schema, along with a new capture for the 2021 data (let's pretend that the 2020 data might still get corrected or just arrive late). So they add to their catalog:
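(Again a sketch; the JSON pointers are illustrative.)

```yaml
collections:
  acmeCo/tripdata:
    projections:
      # ... the two existing sets of projections, plus the 2021 column names:
      started_at: /start/time
      start_lat: /start/lat
      start_lng: /start/lon
      # ... and so on

captures:
  acmeCo/2021-source-s3:
    endpoint:
      airbyteSource:
        image: ghcr.io/estuary/source-s3:c13a754
        config:
          bucket: "tripdata"
          matchKeys: "2021\\d\\d-citibike-tripdata.csv.zip"
          region: "us-east-1"
    bindings:
      - resource:
          stream: tripdata/
          syncMode: incremental
        target: acmeCo/tripdata
```

Presumably they'd also narrow the existing capture's `matchKeys` to match only the 2020 files.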
This actually works pretty well, but I think we can do better. I think it's less than ideal to put all of the projections under the collection in this case, because they're truly details about how the data is stored in an external system. A better factoring would put the relevant CSV projections pretty close to the regex that's filtering the filenames. So here's how I think that might look:
```yaml
collections:
  acmeCo/tripdata:
    # schema: ...
    projections:
      # These projections define how we _want_ to convert to/from tabular representations
      start_station_id: /start/station_id
      start_station_name: /start/station_name
      # ... and so on

captures:
  acmeCo/2021-source-s3:
    endpoint:
      airbyteSource:
        image: ghcr.io/estuary/source-s3:c13a754
        config:
          bucket: "tripdata"
          matchKeys: "2021\\d\\d-citibike-tripdata.csv.zip"
          region: "us-east-1"
    bindings:
      - resource:
          stream: tripdata/
          syncMode: incremental
        target: acmeCo/tripdata
        fields:
          started_at: /start/time
          start_lat: /start/lat
          start_lng: /start/lon
          # ... and so on

  # Projections attached to captures are used to deal with ugly data that's beyond our control.
  acmeCo/source-s3:
    endpoint:
      airbyteSource:
        image: ghcr.io/estuary/source-s3:c13a754
        config:
          bucket: "tripdata"
          matchKeys: "2020\\d\\d-citibike-tripdata.csv.zip"
          region: "us-east-1"
    bindings:
      - resource:
          stream: tripdata/
          syncMode: incremental
        target: acmeCo/tripdata
        fields:
          start station id: /start/station_id
          start station name: /start/station_name
          start station latitude: /start/lat
          start station longitude: /start/lon
          # ... and so on
```
What that catalog would mean, more concretely, is that the `fields` in each binding would be merged into the `projections` from the collection, and the result is what would be passed as part of the `ConfiguredStream`. This makes capture bindings more symmetrical with materialization bindings. It also means that the materialization gets less verbose, because it's able to just use the `recommended` projections. Since the CSV-related projections aren't part of the collection anymore, they'll no longer be considered "recommended", and you'll no longer need to list all the projections explicitly in each materialization.
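For example, the effective projections handed to the 2020 capture's connector would be the union of the two maps (illustrative only; this isn't the exact wire format of the `ConfiguredStream`):

```yaml
# Effective projections for the acmeCo/source-s3 binding:
start_station_id: /start/station_id        # from the collection
start_station_name: /start/station_name    # from the collection
start station id: /start/station_id        # from the binding's fields
start station name: /start/station_name    # from the binding's fields
# ... and so on
```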
Now, I'm a little hesitant to say that this one example is necessarily representative or broadly applicable to common use cases. That's open to debate. But it does feel like a better factoring for this case.
## Possible Alternatives
### Using the parser config directly
One alternative implementation might be to make use of `parser.projections` within the endpoint config for filesource connectors. That approach would only work for filesource connectors, though, which is a bummer since you might also want database connectors to make use of these projections. So it's not a very generalizable solution. Of course, existing third-party connectors won't make use of projections anyway, but that's at least a more tractable problem, since projections are at least theoretically part of the spec. I'll note that using `parser.projections` would *almost* work today: we do allow parser options to be set on filesource connectors, but the `projections` property currently gets overwritten by the projections from the `ConfiguredStream`, rather than merged.
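For reference, that approach might look something like this (a sketch; the parser config shape is abbreviated):

```yaml
captures:
  acmeCo/source-s3:
    endpoint:
      airbyteSource:
        image: ghcr.io/estuary/source-s3:c13a754
        config:
          bucket: "tripdata"
          matchKeys: "2020\\d\\d-citibike-tripdata.csv.zip"
          region: "us-east-1"
          parser:
            projections:
              start station id: /start/station_id
              start station name: /start/station_name
              # ... and so on
```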
### Making `recommended` a property of collection projections
Another alternative would be to scrap the whole idea of projections being a part of captures, and just make it more ergonomic to work with projections on the collection by adding a property to projections to explicitly mark them as either "recommended" or not. This might look something like:
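(The `recommended` property and the extended projection syntax here are hypothetical:)

```yaml
collections:
  acmeCo/tripdata:
    # schema: ...
    projections:
      start_station_id:
        location: /start/station_id
        recommended: true
      start station id:
        location: /start/station_id
        recommended: false
      # ... and so on
```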
This would improve the situation for materializations, since they would then be able to just use the default fields. But it still couples the collection spec to a concern that seems like an implementation detail of how the data is stored.