Automatic propagation of source tables #1042

psFried · 2023-05-17T13:56:23Z

psFried
May 17, 2023
Maintainer

The desire is to end up with some sort of mode where changes to a source system are propagated through to a destination. This post serves as an initial design document, which we can discuss and iterate on as we work toward consensus on the design. Note that I use the term "tables" here, but really this all applies to any resource of a capture connector.

What changes are we talking about here, and what would it mean to propagate them?

Tables (or whatever) are added to the source
- Add the new tables as bindings of the capture
- Add the new tables as bindings of an existing materialization
Existing tables are deleted from the source(?)
- Remove the bindings from the capture
- See note about removing materialization bindings below
Existing tables have had their schema changed
- If materializations accept the change, then just go with it.
  - Added columns may be added to the materialization automatically if the field selection allows recommended fields.
- If materializations reject the change (unsatisfiable constraint in the validation response) then we may need to re-create the collection and update the bindings

The high level idea is to:

Allow users to opt-in to periodic background "re-discover" operations, which can automatically publish new and/or changed capture bindings.
Introduce the concept of "linked" captures and materializations, which models the desire to keep a materialization in sync with changes to a given source system.

While I'm reasonably confident that continuous schema inference can dovetail nicely with what's described here, this document has gotten really long. So I've removed the bits about continuous schema inference for now, and will send a subsequent post where we can discuss how that fits into the picture. This means that, for now, we'll be focusing more on "Type A" systems that provide authoritative schemas from discovery.

Information we need as inputs to these processes

Auto-Discover capture

Do we want to automatically and periodically re-run discovers and add/update/delete capture bindings/collections as needed? This has 3 possible states:

disabled: don't automatically update my capture or its collections
update existing bindings only: periodically re-run discover operations, but only use them to update bindings that are already a part of the capture.
update existing bindings and add new bindings: periodically re-run discover operations, and use them to both update existing collection schemas, as well as to add any newly discovered bindings.

In theory, there would be a 4th option where we add new bindings only without updating existing ones. In reality, this would be kind of a pain to implement and it's not clear that anyone actually wants it.

Automatically re-create collections?

There's an additional dimension to "Auto-Discover", which is whether to automatically re-create collections when they're deemed to have "incompatible" changes (i.e. there's a publish failure with incompatible_collections). There's really only two reasonable options. One is to stop and try to notify the user of the issue. The other is to automatically re-create the collection and have both the capture and any materializations start over for those bindings.

One thing that muddies the waters is that we might try to publish an incompatible schema as part of either an automated Discover or an automated schema inference. But the enablement of automated re-discovers is part of the Capture, whereas enablement of automated schema inference is attached to the individual collections.

If we were to only consider the case of automated discovers and ignore schema inference, then it would seem pretty clear that this property ought to live right alongside "Auto-Discover", attached to the Capture. For now, I propose that we do exactly that, and just attach this to the rest of the "Auto-Discover" configuration. In a subsequent post, I'll outline how I think continuous schema inference can fit in with the rest of this, but the TLDR is that I think it can rely on the same configuration for this.

Unwanted capture bindings

If a user wants us to automatically add newley discovered bindings, we need to know about any bindings that they've explicitly removed from their capture. Otherwise, we'll end up adding them again on the first automatic re-discovery.

Automatically add new fields?

This basically indicates the desired behavior of the materialization when the collection schema is changed. Regardless of whether or how we present this option to a user, we already have a nice home for it on the backend: field selections. The field selections in a materialization spec already have a recommended boolean that indicates whether to materialize new fields automatically. So the question really is just how we want to expose this when users are creating/editing the materialization in the UI. My proposal here is to consider this a part of the field selection editor, and not part of the schema evolution design. Field selection applies to all materializations, not just those that are linked to a capture.

Automated schema inference? (x-infer-schema)

Indicates that schema inference should be run. Currently, this is only done manually via the UI, but in the future the goal is to run it automatically and continuously.

This is already functionally a property of Collection specs, since it's part of the json schema. It's not expected to be user-configurable (with the possible exception of )

Source Capture

This would be attached to materializations, and it identifies a particular capture that the materialization is "linked" to. If present, it indicates the desire to automatically add new materialization bindings as they are added to the capture. Because this is attached to the materialization instead of the capture, it allows us to have multiple materializations that are all "linked" to the same capture.

Representation

(PLEASE feel free to suggest better names for this stuff.)

I think for all of these fields, the most sensible thing is to add them as additional fields on their respective specs. For example, you could write the following flow.yaml:

captures:
  acmeCo/foo:
    autoDiscover:
      addNewBindings: true
      reCreateIncompatibleCollections: true
    # ...

This might strike some as a little weird, given that automated discovery is a control-plane concern rather than a data-plane one. But putting it on the spec allows it to be set from either the UI or the CLI. And given that this really is an attribute of a capture, it seems hard to justify doing more work to put it in some separate location. Also, the collection spec containing the x-infer-schema attribute already sets a precident for inclusion of these things in the spec itself.

For unwanted bindings, I think my current favorite is to allow capture bindings to directly represent being disabled by setting the target to null.

    bindings:
      - resource: { namespace: my_schema, stream: my_table }
        target: null
    # ...

For materializations, you could write:

materializations:
  acmeCo/bar:
    sourceCapture: 'acmeCo/foo'
    # ...

which would have the materialization automatically "follow" the autoDiscover operations of the capture.

We don't really have a need to represent materialization bindings being disabled right now, but I also don't think it'd hurt to make a similar change there where we intead use a null resource to indicate that it's disabled. In the future, we could add the ability for materializations to exclude certain collections when linking it to a source capture. Definitely not needed for now, though.

Automated re-discovery

When you create or edit a capture, you should be presented with the option to opt in to automated discover operations. These would correspond to edits on the spec itself, so should fit nicely alongside the existing endpoint and bindings editors in the UI. If you opt in, then we'll periodically run a re-discover operation and publish any changes from it.

How does that periodic rediscovery actually work?
There would be a new rediscovers async job, styled after the existing discover, publish, etc jobs. Each rediscovers job would apply to a single capture. There's a separate scheduling process that creates rediscovers jobs on a periodic basis.

Here's a proposed table schema to make this more concrete:

create table rediscovers (
  like internal._model_async including all,

  capture_name catalog_name not null,
  update_only  boolean not null
);

The update_only column would have exactly the same semantics as it does in the existing discovers table. It would be populated based on the autoDiscover.addNewBindings configuration.
The scope of a rediscovers job would be:

Create a draft
Perform a discover operation on that draft
Check if anything has actually changed, and return success early if not.
If there are any new bindings and "Automatically add new tables" is enabled, add the bindings to all linked materializations
Publish the draft
Publish Success? All done then.
If the publication fails with incompatible_collections and "Automatically re-create collections" is enabled, then update the draft using an evolutions job and try publishing again
If "Automatically re-create collections" is disabled, or if the second publish fails, then bail out and tell an adult

Note that the check to see if the discover output has changed isn't technically required for correctness because there's no harm in re-publishing something that hasn't changed. But the status of the rediscovers job should clearly indicate what actions were taken, and doing this check as part of a rediscovers handler seems like the easiest way to make that happen. I want to emphasize the importance of leaving a legible audit trail from these operations. It's likely that only a small minority of rediscovers will actually result in spec changes, and it will be important that we can easily identify those operations.

You might be wondering just how we compose these various other async jobs into the larger rediscovers job. My current thinking on this is that we do so by invoking the discover, publish, and evolution handlers as rust function calls. I'm not entirely certain whether that will play nicely with the publish handler's row locking and rollbacks, so this theory may not survive contact with reality.

Scheduling `rediscovers`

Scheduling of rediscovers jobs would be handled by the pg_cron postgres extension. The idea would be to periodically run a SQL function that inserts rediscovers rows based on querying for Captures that have "Auto-Discover" enabled.

At first, we can probably get away with just queuing all rediscovers at the same time. For example, we could run this every n hours and each run would queue up a rediscover for every capture that has "Auto-Discover" enabled. Note that each agent instance will still just process one rediscovers at a time, and they will interleave them with other queued jobs, so with a little luck users shouldn't notice any "lag" in agent responsiveness.

AuthZ considerations

This feature is the first time we're introducing an automated process that updates specs without a user in the loop. Our existing role_grants were designed to support this sort of thing, and I don't see any problems there. We might want to introduce an authZ check that the materialization specs must have read capability to the prefix of any capture named by sourceCapture. For example, say we have a capture aliceCo/myCapture/source-postgres and a materialization bobCo/alice-data that has sourceCapture: aliceCo/myCapture/source-postgres. The authorization check would ensure that bobCo/alice-data has read capability to aliceCo/myCapture. This would ensure that the materialization is read authorized to both current and future collections that may be discovered by the capture.

Of course technically there's not really a problem with omitting that check, since the publish handler will already be performing authorization checks. But the nature of these operations means that publications may not be attempted until quite some time after a user creates a materialization. It would make for a far better UX if we could validate the permissions while the user is still editing their materialization.

Another aspect of this is that we don't currently perform any consistency checks when role_grants are removed. So you could have a grant when you first create the materialization, which is subsequently revoked, while your materialization is left in place until the next publication that touches it (which may not be your publication) fails due to a permission error. This seems worth addressing at some point, but I think for the immediate term we can safely ignore these concerns, since cross-tenant sharing isn't something any of our existing users do at this point.

Finally, end-users will not be permitted to create rediscovers jobs at this point. This restriction probably isn't necessary from a security perspective, but it would allow us to at least temporarily side-step the need for discussing exactly who is permitted to create rediscovers. For now, we can just tell RLS to disallow them all.

Additional questions on disabling bindings

Regardless of the exact representation, we still need some way to actually represent bindings that users don't want. Otherwise the binding could be automatically re-added on the next automatic rediscovers operation. The most obvious approach would be to update the UI to always null out the target instead of removing the binding altogether.

We'll also need to update our existing (manual) discovery handler to work with disabled bindings. But it's not clear exactly what the behavior should be in all scenarios. Some scenarios are:

Do we need to update our UI allow adding a previously "disabled" binding?
User creates a capture with autoDiscover.addNewBindings: false, then adds new tables to their source. They then click the manual "Discover" in the UI. Should we show them bindings that autoDiscover has ignored?
User creates a capture and removes a subset of discovered bindings. Do we want our manual "Discover" to still (always) return the bindings that they'd removed?
Should the UI always just disable bindings instead of removing them? Or should this be conditional based on the autoDiscover options?

There's certainly other aspects to the UI impact here, and I'll need some help from Travis and/or Kiahna to help identify those.

Note: What to do when source tables are deleted

If we're running periodic discovery, we may notice that certain tables have been deleted from the source system. I see no reason why we shouldn't automatically remove those bindings from the capture, and I imagine that will align best with user expectations. It also seems reasonable to remove the affected collections from materialization bindings, but there's a couple of reasons why I think we should hold off on that for now.

One reason is that it's possible there are still other captures that are writing into the collection, and we'd need to detect that scenario and leave the binding in place if that's the case. This would not only increase the scope of work, but it could potentially be more confusing to users. The other reason is that we can't really know whether the materializations have fully caught up yet. My (rough) sense is that users would expect their materialized data to include all the data from the source collection. That's not something we can guarantee.

So I think we should hold off on deleting any materialization bindings for the time being. We can always figure that out later if we want.

psFried · 2023-05-23T17:24:08Z

psFried
May 23, 2023
Maintainer Author

This is a high-level summary of an overall plan to implement this functionality, because I know that the previous posts were quite long.

Add the following options to Capture create/edit:

Automatically discover schema changes: if true, then we'll periodically run a background process that updates their collection schemas
Automatically discover new bindings: if true, then we'll also add newly discovered bindings to the capture
Re-create collections on incompatible changes: if a publication is rejected due to constraint violations, then re-publish it with a new name and update all "linked" materializations (see below for what linked means)

Add the following options to Materialization create/edit:

Linked capture: Automatically add new tables that are discovered from this capture.
As part of implementing field selection in the UI, add a toggle for recommended fields. This controls whether newly added fields will be materialized automatically.

Implementation work:

Add a disabled flag to bindings so we can represent bindings that the user has deselected
- update UI to set disabled: true instead of removing the binding
add auto discovery fields to capture specs
add linked capture field to materialization specs
Add a rediscovers job that acts on a single capture to update the captured collections and also update any linked materializations as appropriate
Add a cron job to create rediscovers jobs on a periodic basis

Open questions:

If a user opts out of automatically re-creating collections, then we need to somehow notify them when automatic discovery fails due to incompatible schema changes. Ideally, this would also be shown in the UI, as the capture is pretty much "broken" in this scenario.

Continuous schema inference implementation:

I'm thinking this would be added after the initial implementation work outlined above.

Add ability for schema inference service to be incremental. The api would accept a previously inferred schema and a checkpoint, and would return new copies of both
Update rediscovers job to fetch and publish inferred schemas
Trigger a rediscovers job when there's a schema validation error
- One way of doing this would be to add a materialization of schema validation errors and a database trigger that queues a rediscovers job if one doesn't exist already.

Additional work we can follow on with

Update the publications handler to reject (and recommend re-creating) collections based (only) on schema changes that seem obviously incompatible. For example, fields changing to non-overlapping types.
Notify users when a materialization or derivation reads from a collection that was re-created and isn't updated automatically

0 replies

dyaffe · 2023-05-23T17:57:50Z

dyaffe
May 23, 2023
Maintainer

This all seems great to me -- the only thing I would add is that our "ideal" state would be "all the things automatic" and as soon as we feel ready, our goal should be making that the default.

0 replies

jgraettinger · 2023-05-24T15:55:49Z

jgraettinger
May 24, 2023
Maintainer

Broadly, I'm very on board with this design 👍. I have a tactical comment and hot takes for your open questions.

Tactically: rather that introducing new rediscovers table, it might be far easier to update existing jobs tables with additional intentions of actions to take on success / failure. Specifically:  

rather than a separate rediscovers table, consider augmenting the current discovers table to flag that we should queue an immediate publication after discovery is done, and if there’s a change. We can record additional metadata in the job_status about what action was taken, if any, and why.  
the publications table can also be augmented with an intention that says “queue an evolution if I fail with incompatible collections which are allowed to evolve”.

basically, rather than one mega-job table that coordinates the whole flow, coordinate through message passing of the existing jobs tables.

This also can provide an answer for "how do users find out about publications gone wrong? " The publications table could additionally represent an intention of how failure is to be handled, for example (extremely hand-wavily) add the publication ID to an alerts table that the user is presented with on login. 

• User creates a capture with autoDiscover.addNewBindings: false, then adds new tables to their source. They then click the manual "Discover" in the UI. Should we show them bindings that autoDiscover has ignored?

 I would think so — that we would keep our current behavior. That flag isn’t a statement that they never want to add bindings, just that they don’t want it done automatically.  
 
 > • User creates a capture and removes a subset of discovered bindings. Do we want our manual "Discover" to still (always) return the bindings that they'd removed?
 
 I would think automated and manual discovery would behave the same: if there’s an explicit target: null binding, then don’t suggest that it be updated. A user / UI would first need to remove the target: null binding altogether and then refresh in order for it to be suggested as an added binding.
 
   > Should the UI always just disable bindings instead of removing them?
   
   I think “yes”. I cannot think of a reason otherwise.   
   
   I see in your condensed post that you reference disabled: true for bindings rather than target: null. Which are you thinking? An oddity with disabled: true is I’m unsure of what the value of target is in this case, and the validation logic is a bit more awkward, so am generally supportive of target: null and resource: null rather than a new field.

1 reply

psFried May 24, 2023
Maintainer Author

rather that introducing new rediscovers table, it might be far easier to update existing jobs tables with additional intentions of actions to take on success / failure

I responded more thoroughly in another comment below, but I love this idea.

   I see in your condensed post that you reference disabled: true for bindings rather than target: null. Which are you thinking?

Whoops, disabled: true was a relic from a previous iteration. I'm in favor of target|resource: null
for these.

The rest of your answers here all seem reasonable to me.

jgraettinger · 2023-05-24T16:13:57Z

jgraettinger
May 24, 2023
Maintainer

One other pattern to consider, as an alternative to fine-grained intention columns added to discovers / publications, would be to have a workflows table that's not a job table, but is added as a null-able foreign reference to discovers and publications.

The workflows table could have arbitrary, high-level descriptions of multi-step workflows the control plane should automate (akin to a state machine). Individual job handlers retrieve it, find their step / state within the workflow, and attempt to execute the next transition.

This pattern might be overkill -- I'm not specifically recommending we do this, just holding it up for discussion.

3 replies

psFried May 24, 2023
Maintainer Author

High level, I'm interested in this topic because I have a sense that we ought to have some easy way to make our existing jobs more composable instead of always having to add new ones. I don't know whether this is overkill or not, but my gut says it would at least be a bit premature. One consideration for me is that it's still unclear where in these processes we might want to stop and ask for user input, if at all.

I really like your other idea of adding some sort of job input that determines what the next step in the process should be, and I'm still thinking though how that might work in practice. Just thinking out loud here, one consideration there is that there's important distinctions between plain discovers and re-discovers. The main one being that re-discovers need to respect the autoDiscover options, and regular discovers should not. But this certainly seems like it could be represented by a simple boolean input, so I don't see any problem there.

I think maybe a good next step would be for me to just try working it into discovers and see how it plays out. Seem reasonable to you?

jgraettinger May 24, 2023
Maintainer

I think maybe a good next step would be for me to just try working it into discovers and see how it plays out. Seem reasonable to you?

Yep, sounds good. Trying out SQL changes would be informative

One other callout: we'd also like to tighten the schemas coming out of SQL captures with additionalProperties: false, so that we can guarantee that we're reflecting added columns in materializations prior to actually capturing any of those columns. Thoughts on how that fits with the design? Seems we could additionally create a discovers job upon a schema failure (in addition to pg_cron)?

psFried May 30, 2023
Maintainer Author

Yeah, I think that makes sense to always create a discovers job whenever there's a validation error. I think that fits in well with the rest of this. I don't see any issues there.

psFried · 2023-05-24T16:20:46Z

psFried
May 24, 2023
Maintainer Author

Continuous schema inference

I wanted to follow on with some more thoughts about how continuous schema inference fits into this.

For background, continuous schema inference applies only to what I've called "Type B" source systems. These are systems that cannot provide authoritative schemas for the data to be captured. We use the x-infer-schema annotation to indicate that the collection schema ought to be inferred from the actual source data. Put another way, a schema bearing the x-infer-schema attribute is typically only _de_scriptive of the data in the original source.

Continuous schema inference is the process by which we determine the json schema from the source data (after it has been written). This is a part of the overall process that publishes updates to collection schemas, but they are two separate processes that happen to interact. The other process is Auto Discovery.

Continuous schema inference does not need to be exposed as a separate concept for users to understand. From a user's perspective, they are deciding whether or not to have Flow keep their collection schemas up-to-date. If they opt-in to auto discovery, then their collections are kept up to date with the latest schemas, regardless of whether those come from an actual discover operation or schema inference.

So the design that seems (at least to me) to follow from that schema inference is driven by auto discovery. In somewhat more detail:

As part of auto-discovery, we identify any collections that should have schemas infered by looking for the presence of x-infer-schema in the read schema of the live spec. This detail is important if we want to allow users to opt-out of schema inference by removing that annotation. For each such collection, we call the schema inference service in the data-plane to fetch the latest schema. We then publish the results.

How to do?

For the schema-inference itself, it need not be continuous per se, as long as it can provide an up-to-date answer within a reasonable period of time. This requires that schema inference be done incrementally, though it doesn't necessarily require that it be done in a realtime or streaming fashion.

Lazy service

The lazy service approach would be to add an API to schema-inference that accepts a gazette consumer checkpoint and a starting JSON schema representing the results of the previous inference. It then reads whatever additional data is available and returns the new schema and checkpoint. The previous checkpoint and inferred schema would live in an inferred_schemas table, which would be updated by the auto discovery handler. This schema inference service would be entirely stateless.

It's worth noting one annoying detail about this: our rust client doesn't yet support committed reads, so we'd need to either implement that or else use go for reading the collection data (maybe by shelling out to flowctl-go?).

Nevertheless, this is my preferred approach at this point because it seems the least likely to suddenly explode in scope. But I'll discuss the alternative because I suspect others may also have thought of it.

The ops catalog

The high-level idea would be to have a special derivation and/or materialization that does the inference and materializes the inferred schemas into a postgres table. If that sounds vague, it's because it's really not clear to me precicely how we should seek to do this. Doing it in a derivation seems the most intuitive, except the derivation would have to be stateful and able to run Rust code, which we cannot do today. Then there's also the issue of needing to add or remove bindings whenever a collection with x-infer-schema is added or deleted. And finally, any derivation that's reading from a huge number of collections would require additional shard-splitting optimizations, which we've discussed previously in the context of ops materializations.

Re-creating collections considered harmful

Imagine we have a capture from cloud storage into a collection acmeCo/foo, along with a database materialization of that collection. It's humming along, and continuous schema inference identifies a breaking change to the schema of the collection. Based on the previous discussions, you might expect that we'd re-create the collection, and update the materialization to point to the new one. While this behavior is probably fine in some circumstances, it can become problematic if continuous schema inference results in multiple breaking changes. Each one might result in a new collection, and the capture would need to backfill each one with data that had already been read into the previous collection. If we did this multiple times, then each successive backfill would be larger and larger, and the whole process could be quite inneficient and costly.

An alternative would be to keep the collection itself, and to only update the materialization bindings to point to a new table. In other words, if you'd started with a binding for acmeCo/foo => table foo, after the breaking schema change were published, you'd get acmeCo/foo => table foo_v2. This seems like it would be best handled by updating the evolutions handler to do this by default for collections bearing the x-infer-schema annotation.

But there are still cases where users would still want to re-create collections having x-infer-schema rather than just materializing to a new table. For example, say you've been testing your cloud storage capture with test data, and now you want to clear your bucket and start over. The evolutions handler should still allow that if you specify a new_name in the input for that collection.

A remaining question is just how to expose this distinction to users. For example, one option would be to introduce a new reMaterializeOnBreakingSchemaChanges option as part of the automatic discovery options. Technically, we might find a use for that even with Type A captures.

Another option would be to overload (and perhaps rename) the reCreateIncompatibleCollections option, and have it either re-materialize to a new table or re-create the collection based on the presence of x-infer-schema.

Summary

Supposing we go with the lazy service, the summary of this would be:

update schema-inference to accept a checkpoint and schema from a prior invocation for the same collection
update the auto-discover job to call schema-inference for each collection having the x-infer-schema annotation
respect the behavior of reCreateIncompatibleCollections and/or reMaterializeOnBreakingSchemaChanges when the new infered schema is rejected by materializations.

This framing seems like it would work out fine in complicated scenarios involving multiple captures writing into overlapping sets of collections. The auto-discovery options for each capture will apply to only the collections bound to that capture.

11 replies

psFried Jul 11, 2023
Maintainer Author

From the proposal, it sounds like each capture / materialization is required to keep state that tracks the inferred schema over its entire history.

That's not quite what I was thinking. Here's what I was trying to convey: The (effective) readSchema is itself that "state". So when the capture or derivation (I'm assuming that's what you'd meant) starts up, it uses the collection's effective read schema as the initial value of inferredSchema. Then each time a document fails validation against it, it's expanded and we log the expanded version. After materializing those, we'd end up with a row in inferred_schemas for each capture shard, and the discovers handler would have to reduce them all into one.

That said, I actually like the reduce annotation better, since it would allow us to have a single inferred schema that clients can just fetch and use without needing to perform further reductions. And now that I think of it, this plan really doesn't work without that, unless we did introduce some additional state tracking the current inferred schema.

I still think we want to start with the effective read schema as the starting point, though. The reason is that it allows us to only emit updates when documents actually violate the read schema. That means that we can trigger a discovers job whenever any related inferred_schemas rows are updated for any bound collections of a capture.

Putting it all together for the sake of clarity:

Captures and derivations would use the readSchema (or schema) as the initial value of inferredSchema
Each time a document fails validation against the inferredSchema, it's expanded and the new inferredSchema is logged
The ops/inferred-schemas collection would use the collection name as the key, and we'd introduce a custom reduce annotation that does a merge of the schemas
The ops/inferred-schemas collection would be materialized to the inferred_schemas table
The discovers handler would pull in the latest inferred schemas for all bound collections that use schema inference

One other thought is that we could potentially skip the ops/logs middle-man, and have the runtime output directly to the ops/inferred-schemas collection. There's a benefit (academically, at least) to using a transactional append for these, anyway, since it would guarantee that we only expand the inferred schema in response to committed documents. IMO the risk of inferring schemas based on uncommitted documents is pretty low, so it'd be fine to still use ops/logs. 🤷

jgraettinger Jul 11, 2023
Maintainer

Captures and derivations would use the readSchema (or schema) as the initial value of inferredSchema

Each time a document fails validation against the inferredSchema, it's expanded and the new inferredSchema is logged

This is a key detail to resolve.

I was proposing mapping every document into a merged-in doc::Shape without doing schema validation.
You're proposing validating the document against schema/readSchema and expanding the doc::Shape only if it fails.

At the limit, if additionalProperties: false, these do the same thing. However, we (and users!) are very unlikely to set this universally for all schemas, and without it it means that documents may have novel properties which aren't reflected in inferred_schemas. That cuts against the overarching utility I see in inferred_schemas: a comprehensive description of what's actually in your collection, regardless of what your schema(s) may be. This is also hand-wavy, but it strikes me as smelly that one can affect the content of inferred_schemas by manipulating the schema/readSchema of a collection -- why should these depend on one another?

But, what to do then about the requirement that schema violations are propagated to trigger discovery? Perhaps the ops/inferred-schemas derivation is also looking for schema violation logs, and marks up its materialized table accordingly?

psFried Jul 11, 2023
Maintainer Author

That makes sense, and I definitely like how just always inferring a Shape avoids the weird circular dependency of having the inferred schema be based on the readSchema. So I think I'm convinced that your approach is the way to go.

But, what to do then about the requirement that schema violations are propagated to trigger discovery?

I don't think that's actually the requirement. We'd originally wanted to trigger discovers in response to a schema validation error just for lack of a better way of knowing that the actual shape of the data has changed. This was why we'd originally discussed adding additionalProperties: false everywhere, since without it our trigger wouldn't work.

But if we instead just always emit an inferred Shape whenever it changes, then it seems preferable to try to use the changing of that shape itself as the trigger. Otherwise, we'd still be relying on the presence of additionalProperties: false. Perhaps we could add a hash column to inferred_schemas, similar to what we do with live_specs? Another thought we be to add some special behavior to flow-inferred-schema-reduction-kludge that updates some special x-schema-updated-at field with a timestamp value whenever the reduction actually modifies the LHS? Something else?

jgraettinger Jul 11, 2023
Maintainer

But if we instead just always emit an inferred Shape whenever it changes, then it seems preferable to try to use the changing of that shape itself as the trigger.

True! What about two additional columns on the inferred_schemas table? One is a generated column which is the MD5 of the schema json. The other is the MD5 of the last auto-discovery run. Or we could add this last-inferred-schema hash to live_specs? Either way.

One pedantic note: additionalProperties: false still has a role, which is to ensure that the very first row that has a new field is held-back until the schema change is propagated to the target table. But agreed, if we have this then we don't need to further trigger it on explicit read schema violations.

psFried Jul 12, 2023
Maintainer Author

👍 Yeah, I like the hash idea.

How should schema inference deal with dynamic objects? It's a surprisingly tricky question!

We've seen a number of examples of objects having an unbounded number of fields. Things like custom_field_12345, custom_field_12346, etc. This poses a problem for schema inference because the inferred schema is liable to grow so large as to break something.

We can limit the size of schemas by:

Defining a limit on the maximum number of explicitly named properties in any single object shape.
If an object shape has more than that number of properties, replace all of them with an additionalProperties shape that's derived from merging the individual property shapes.

Say we do that, and someone starts a materialization while the schema still has few enough observed fields that they're all explicitly named. At this point, the materialization may use a field selection for all those explicitly named fields. So then while the materialization is running the inference observes enough fields to replace them all with an additionalProperties. The publication for this build fails because the materialization requires the field selection to include fields which are no longer explicitly named by the schema. An evolutions job then updates the materialization binding to materialize into a new table that no longer has all those projections.

I think that's probably the desired behavior, but LMK if you think otherwise, because we can work around it if we're willing to accept certain tradeoffs.

One other nuance is that we might want to consider slightly different behavior for top-level document properties. Top-level properties seem more likely to include a mix of dynamic and static properties. Also, database tables with 500+ columns are surprisingly common, and users would probably expect that they get individually materialized. One thing we could do for top-level properties is to just have a higher limit on the number of properties. Or we could special-case the top-level properties by retaining the individual named properties and only putting the fields that are over the threshold into additionalProperties.

Another thought is that we might be able to retain some information from the writeSchema by either:

Using the writeSchema to create the starting inferred Shape (but modified to default additionalProperties: false)
Composing the writeSchema with the inferred schema to use as the readSchema

Either one of those would give us a few benefits:

Any fields that are explicitly included in the writeSchema would be guaranteed to exist in the readSchema, and thus be available for materialization. This could be particularly important if key locations exist within dynamic objects.
Connectors could explicitly annotation dynamic objects with something like x-dynamic-object, which schema inference could leverage in order to skip enumerating individual field names for a given location. This would avoid a materialization that starts out materializing individual fields and then changes to a dynamic object

Personally, I'm leaning toward just composing the writeSchema with the inferred schema for use as the readSchema. Automatic discovers would generate the readSchema as:

type: object
allOf:
  - { <writeSchema> }
  - { <inferredSchema> }

This is nice because it perfectly preserves any reduce annotations and other information from the writeSchema, which guarantees that reductions performed on read will behave identically to reductions performed on write. It also ensures that key fields will still have explicit projections, even if they're part of an object with dynamic properties where the inferred schema would just use additionalProperties.

We could (though I'm less conviced we should) also use the writeSchema as the initial starting point for inferred schemas. The downside of this is that it makes inferred schemas less accurate. The upside is that the writeSchema could carry annotations that identify locations that are known to be dynamic objects. Those annotations could be used by our inferrence code to avoid creating individual fields in those locations until we cross the threshold of the max number of properties. Another, less important, upshot is that starting out with the writeSchema could make the inferred schema data less noisy, particularly for database captures, since we'd be starting out with a more comprehensive schema. Ultimately, I'm leaning toward not doing this right now, since we can always add it later if we need. But I wanted to mention it in case the aforementioned scenario (where we start materializing specific fields and then switch to a dynamic object) seems motivating to others.

psFried · 2023-07-05T20:34:41Z

psFried
Jul 5, 2023
Maintainer Author

I've merged the PR for the initial implementation of automatic dicsocvers, so this is a good time to check in about what's changed and what's remaining to be done.

What's changed:

We've changed our heuristics around when to re-create collections vs just re-creating materialization bindings. We now only re-create materialization bindings in response to incompatible schema changes. We don't re-create collections unless the key or partitions have changed.

What's remaining:

Publish docs on schema evolution (in progress)
Handle sourceCapture property of Materializations, which allows newly discovered collections to be materialized automatically (also in progress).
Allow editing autoDiscover properties in the UI (Allow editing Capture autoDiscover properties ui#676)
Triggering automatic discoveries in response to schema validation errors
Continuous schema inference, which is very closely related to the first point
Surfacing errors that happen during automatic discovers and subsequent publications and evolutions.
Adding support for evolutions in flowctl

I'll send out another update with some more thoughts on continuous schema evolution and responding to validation failures.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automatic propagation of source tables #1042

{{title}}

Replies: 6 comments 15 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Automatic propagation of source tables #1042

psFried May 17, 2023 Maintainer

Information we need as inputs to these processes

Representation

Automated re-discovery

Scheduling rediscovers

AuthZ considerations

Additional questions on disabling bindings

Note: What to do when source tables are deleted

Replies: 6 comments · 15 replies

psFried May 23, 2023 Maintainer Author

Add the following options to Capture create/edit:

Add the following options to Materialization create/edit:

Implementation work:

Open questions:

Continuous schema inference implementation:

Additional work we can follow on with

dyaffe May 23, 2023 Maintainer

jgraettinger May 24, 2023 Maintainer

psFried May 24, 2023 Maintainer Author

jgraettinger May 24, 2023 Maintainer

psFried May 24, 2023 Maintainer Author

jgraettinger May 24, 2023 Maintainer

psFried May 30, 2023 Maintainer Author

psFried May 24, 2023 Maintainer Author

Continuous schema inference

How to do?

Re-creating collections considered harmful

Summary

psFried Jul 11, 2023 Maintainer Author

jgraettinger Jul 11, 2023 Maintainer

psFried Jul 11, 2023 Maintainer Author

jgraettinger Jul 11, 2023 Maintainer

psFried Jul 12, 2023 Maintainer Author

psFried Jul 5, 2023 Maintainer Author

psFried
May 17, 2023
Maintainer

Scheduling `rediscovers`

Replies: 6 comments 15 replies

psFried
May 23, 2023
Maintainer Author

dyaffe
May 23, 2023
Maintainer

jgraettinger
May 24, 2023
Maintainer

psFried May 24, 2023
Maintainer Author

jgraettinger
May 24, 2023
Maintainer

psFried May 24, 2023
Maintainer Author

jgraettinger May 24, 2023
Maintainer

psFried May 30, 2023
Maintainer Author

psFried
May 24, 2023
Maintainer Author

psFried Jul 11, 2023
Maintainer Author

jgraettinger Jul 11, 2023
Maintainer

psFried Jul 11, 2023
Maintainer Author

jgraettinger Jul 11, 2023
Maintainer

psFried Jul 12, 2023
Maintainer Author

psFried
Jul 5, 2023
Maintainer Author