This is a high-level summary of an overall plan to implement this functionality, because I know that the previous posts were quite long.

**Add the following options to Capture create/edit:** …

**Add the following options to Materialization create/edit:** …

**Implementation work:** …

**Open questions:** …

**Continuous schema inference implementation:** I'm thinking this would be added after the initial implementation work outlined above.

**Additional work we can follow on with:** …
---
This all seems great to me -- the only thing I would add is that our "ideal" state would be "all the things automatic", and as soon as we feel ready, our goal should be making that the default.
---
Broadly, I'm very on board with this design 👍. I have a tactical comment and hot takes for your open questions. Tactically: rather than introducing new … Basically, rather than one mega-job table that coordinates the whole flow, coordinate through message passing of the existing jobs tables. This also can provide an answer for "how do users find out about publications gone wrong?" The publications table could additionally represent an intention of how failure is to be handled; for example (extremely hand-wavily), add the publication ID to an …
I would think so — that we would keep our current behavior. That flag isn’t a statement that they never want to add bindings, just that they don’t want it done automatically.
---
One other pattern to consider, as an alternative to fine-grained intention columns added to … The … This pattern might be overkill -- I'm not specifically recommending we do this, just holding it up for discussion.
---
### Continuous schema inference

I wanted to follow on with some more thoughts about how continuous schema inference fits into this. For background, continuous schema inference applies only to what I've called "Type B" source systems. These are systems that cannot provide authoritative schemas for the data to be captured; we use the `x-infer-schema` annotation for these. Continuous schema inference is the process by which we determine the JSON schema from the source data (after it has been written). This is a part of the overall process that publishes updates to collection schemas, but they are two separate processes that happen to interact. The other process is auto-discovery.

Continuous schema inference does not need to be exposed as a separate concept for users to understand. From a user's perspective, they are deciding whether or not to have Flow keep their collection schemas up-to-date. If they opt in to auto-discovery, then their collections are kept up to date with the latest schemas, regardless of whether those come from an actual discover operation or from schema inference. So the design that seems (at least to me) to follow from that is that schema inference is driven by auto-discovery. In somewhat more detail: as part of auto-discovery, we identify any collections that should have schemas inferred by looking for the presence of `x-infer-schema`.

### How to do?

For the schema inference itself, it need not be continuous per se, as long as it can provide an up-to-date answer within a reasonable period of time. This requires that schema inference be done incrementally, though it doesn't necessarily require that it be done in a realtime or streaming fashion.

**Lazy service:** The lazy service approach would be to add an API to schema-inference that accepts a gazette consumer checkpoint and a starting JSON schema representing the results of the previous inference. It then reads whatever additional data is available and returns the new schema and checkpoint. The previous checkpoint and inferred schema would live in an …

It's worth noting one annoying detail about this: our Rust client doesn't yet support committed reads, so we'd need to either implement that or else use Go for reading the collection data (maybe by shelling out to …). Nevertheless, this is my preferred approach at this point because it seems the least likely to suddenly explode in scope. But I'll discuss the alternative because I suspect others may also have thought of it.

**The ops catalog:** The high-level idea would be to have a special derivation and/or materialization that does the inference and materializes the inferred schemas into a postgres table. If that sounds vague, it's because it's really not clear to me precisely how we should seek to do this. Doing it in a derivation seems the most intuitive, except the derivation would have to be stateful and able to run Rust code, which we cannot do today. Then there's also the issue of needing to add or remove bindings whenever a collection with `x-infer-schema` is added or removed.

### Re-creating collections considered harmful

Imagine we have a capture from cloud storage into a collection … An alternative would be to keep the collection itself, and to only update the materialization bindings to point to a new table. In other words, if you'd started with a binding for …

But there are still cases where users would still want to re-create collections having … A remaining question is just how to expose this distinction to users. For example, one option would be to introduce a new … Another option would be to overload (and perhaps rename) the …

### Summary

Supposing we go with the lazy service, the summary of this would be: …
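To make the lazy-service API described above a bit more concrete, here's a rough sketch of what its request and response could look like. None of these names exist today; this only illustrates the checkpoint-in, schema-plus-checkpoint-out contract described in this post.

```rust
use serde_json::Value;

// Hypothetical request for the lazy schema-inference API. It mirrors the
// description above: take a checkpoint plus the prior schema, read whatever
// new collection data is available, and return the updated pair.
pub struct InferenceRequest {
    /// The collection whose data should be read.
    pub collection: String,
    /// Serialized gazette consumer checkpoint from the previous run, if any.
    pub checkpoint: Option<Vec<u8>>,
    /// JSON schema produced by the previous inference, if any.
    pub prior_schema: Option<Value>,
}

pub struct InferenceResponse {
    /// Inferred schema covering all of the data read so far.
    pub schema: Value,
    /// New checkpoint to persist alongside the inferred schema.
    pub checkpoint: Vec<u8>,
}
```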
This framing seems like it would work out fine in complicated scenarios involving multiple captures writing into overlapping sets of collections. The auto-discovery options for each capture will apply to only the collections bound to that capture.
---
I've merged the PR for the initial implementation of automatic discovers, so this is a good time to check in about what's changed and what's remaining to be done.

**What's changed:** We've changed our heuristics around when to re-create collections vs. just re-creating materialization bindings. We now only re-create materialization bindings in response to incompatible schema changes, and we don't re-create collections unless the key or partitions have changed.

**What's remaining:** …
I'll send out another update with some more thoughts on continuous schema evolution and responding to validation failures.
---
The desire is to end up with some sort of mode where changes to a source system are propagated through to a destination. This post serves as an initial design document, which we can discuss and iterate on as we work toward consensus on the design. Note that I use the term "tables" here, but really this all applies to any `resource` of a capture connector.

### What changes are we talking about here, and what would it mean to propagate them?

… `recommended` fields.

The high-level idea is to: …
While I'm reasonably confident that continuous schema inference can dovetail nicely with what's described here, this document has gotten really long. So I've removed the bits about continuous schema inference for now, and will send a subsequent post where we can discuss how that fits into the picture. This means that, for now, we'll be focusing more on "Type A" systems that provide authoritative schemas from discovery.
### Information we need as inputs to these processes
### Auto-Discover capture

Do we want to automatically and periodically re-run discovers and add/update/delete capture bindings/collections as needed? This has 3 possible states: …
In theory, there would be a 4th option where we add new bindings only without updating existing ones. In reality, this would be kind of a pain to implement and it's not clear that anyone actually wants it.
### Automatically re-create collections?
There's an additional dimension to "Auto-Discover", which is whether to automatically re-create collections when they're deemed to have "incompatible" changes (i.e. there's a publish failure with `incompatible_collections`). There are really only two reasonable options. One is to stop and try to notify the user of the issue. The other is to automatically re-create the collection and have both the capture and any materializations start over for those bindings.

One thing that muddies the waters is that we might try to publish an incompatible schema as part of either an automated discover or an automated schema inference. But the enablement of automated re-discovers is part of the capture, whereas enablement of automated schema inference is attached to the individual collections.
If we were to only consider the case of automated discovers and ignore schema inference, then it would seem pretty clear that this property ought to live right alongside "Auto-Discover", attached to the Capture. For now, I propose that we do exactly that, and just attach this to the rest of the "Auto-Discover" configuration. In a subsequent post, I'll outline how I think continuous schema inference can fit in with the rest of this, but the TLDR is that I think it can rely on the same configuration for this.
### Unwanted capture bindings
If a user wants us to automatically add newly discovered bindings, we need to know about any bindings that they've explicitly removed from their capture. Otherwise, we'll end up adding them again on the first automatic re-discovery.
### Automatically add new fields?
This basically indicates the desired behavior of the materialization when the collection schema is changed. Regardless of whether or how we present this option to a user, we already have a nice home for it on the backend: field selections. The field selections in a materialization spec already have a `recommended` boolean that indicates whether to materialize new fields automatically. So the question really is just how we want to expose this when users are creating/editing the materialization in the UI. My proposal here is to consider this a part of the field selection editor, and not part of the schema evolution design. Field selection applies to all materializations, not just those that are linked to a capture.
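For reference, that flag lives in the field-selection portion of a materialization binding, roughly like so (the collection and table names here are made up):

```yaml
bindings:
  - source: acmeCo/anvils/anvils
    resource: { table: anvils }
    fields:
      # When true, newly-added "recommended" fields are materialized
      # automatically as the collection schema evolves.
      recommended: true
```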
### Automated schema inference? (`x-infer-schema`)

Indicates that schema inference should be run. Currently, this is only done manually via the UI, but in the future the goal is to run it automatically and continuously. This is already functionally a property of collection specs, since it's part of the JSON schema. It's not expected to be user-configurable (with the possible exception of …).
### Source Capture
This would be attached to materializations, and it identifies a particular capture that the materialization is "linked" to. If present, it indicates the desire to automatically add new materialization bindings as they are added to the capture. Because this is attached to the materialization instead of the capture, it allows us to have multiple materializations that are all "linked" to the same capture.
### Representation
(PLEASE feel free to suggest better names for this stuff.)
I think for all of these fields, the most sensible thing is to add them as additional fields on their respective specs. For example, you could write the following flow.yaml:
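The spec example itself was elided above, but here's a sketch of what it could look like, using the option names from this post (`autoDiscover`, `addNewBindings`); `evolveIncompatibleCollections` is a placeholder name for the "automatically re-create collections" option, and the capture name and connector config are illustrative:

```yaml
captures:
  acmeCo/anvils/source-postgres:
    # Hypothetical placement: auto-discovery options live directly on the spec.
    autoDiscover:
      addNewBindings: true
      # Placeholder name for the "automatically re-create collections" option.
      evolveIncompatibleCollections: true
    endpoint:
      connector:
        image: ghcr.io/estuary/source-postgres:dev
        config: encrypted_config.sops.yaml
    bindings:
      - resource: { namespace: public, stream: anvils }
        target: acmeCo/anvils/anvils
```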
This might strike some as a little weird, given that automated discovery is a control-plane concern rather than a data-plane one. But putting it on the spec allows it to be set from either the UI or the CLI. And given that this really is an attribute of a capture, it seems hard to justify doing more work to put it in some separate location. Also, the collection spec containing the `x-infer-schema` attribute already sets a precedent for inclusion of these things in the spec itself.

For unwanted bindings, I think my current favorite is to allow capture bindings to directly represent being disabled by setting the `target` to `null`.
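An explicitly-removed binding might then look like this (illustrative only):

```yaml
bindings:
  # Disabled binding: the null target records that the user removed it,
  # so auto-discovery won't re-add it.
  - resource: { namespace: public, stream: unwanted_table }
    target: null
```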
For materializations, you could write:
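Again, the original snippet was elided; a sketch under the same assumptions as above:

```yaml
materializations:
  acmeCo/anvils/materialize-postgres:
    # Hypothetical: links this materialization to a capture, so bindings
    # discovered by that capture are added here automatically.
    sourceCapture: acmeCo/anvils/source-postgres
    endpoint:
      connector:
        image: ghcr.io/estuary/materialize-postgres:dev
        config: encrypted_config.sops.yaml
    bindings:
      - source: acmeCo/anvils/anvils
        resource: { table: anvils }
```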
which would have the materialization automatically "follow" the `autoDiscover` operations of the capture.

We don't really have a need to represent materialization bindings being disabled right now, but I also don't think it'd hurt to make a similar change there, where we instead use a null `resource` to indicate that a binding is disabled. In the future, we could add the ability for materializations to exclude certain collections when linking to a source capture. Definitely not needed for now, though.

### Automated re-discovery
When you create or edit a capture, you should be presented with the option to opt in to automated discover operations. These would correspond to edits on the spec itself, so they should fit nicely alongside the existing `endpoint` and `bindings` editors in the UI. If you opt in, then we'll periodically run a re-discover operation and publish any changes from it.

### How does that periodic rediscovery actually work?
There would be a new `rediscovers` async job, styled after the existing discover, publish, etc. jobs. Each `rediscovers` job would apply to a single capture. There's a separate scheduling process that creates `rediscovers` jobs on a periodic basis.

Here's a proposed table schema to make this more concrete:
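The proposed schema itself was elided from this post; here's a rough sketch of what it might contain. Only `update_only` is actually named below, so every other column, type, and default here is a guess styled loosely after the other jobs tables:

```sql
-- Hypothetical sketch; types and defaults are illustrative only.
create table rediscovers (
    id           bigserial   primary key,
    created_at   timestamptz not null default now(),
    updated_at   timestamptz not null default now(),
    -- The capture this job applies to.
    capture_name text        not null,
    -- Same semantics as discovers.update_only; populated from
    -- the autoDiscover.addNewBindings configuration.
    update_only  boolean     not null,
    job_status   json        not null default '{"type":"queued"}',
    logs_token   uuid        not null default gen_random_uuid()
);
```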
The `update_only` column would have exactly the same semantics as it does in the existing `discovers` table. It would be populated based on the `autoDiscover.addNewBindings` configuration.

The scope of a `rediscovers` job would be:

- …
- if the publish fails with `incompatible_collections` and "Automatically re-create collections" is enabled, then update the draft using an `evolutions` job and try publishing again

Note that the check to see if the discover output has changed isn't technically required for correctness, because there's no harm in re-publishing something that hasn't changed. But the status of the `rediscovers` job should clearly indicate what actions were taken, and doing this check as part of a `rediscovers` handler seems like the easiest way to make that happen. I want to emphasize the importance of leaving a legible audit trail from these operations. It's likely that only a small minority of `rediscovers` will actually result in spec changes, and it will be important that we can easily identify those operations.

You might be wondering just how we compose these various other async jobs into the larger `rediscovers` job. My current thinking on this is that we do so by invoking the `discover`, `publish`, and `evolution` handlers as Rust function calls. I'm not entirely certain whether that will play nicely with the publish handler's row locking and rollbacks, so this theory may not survive contact with reality.

### Scheduling `rediscovers`

Scheduling of `rediscovers` jobs would be handled by the pg_cron postgres extension. The idea would be to periodically run a SQL function that inserts `rediscovers` rows based on querying for captures that have "Auto-Discover" enabled.

At first, we can probably get away with just queuing all rediscovers at the same time. For example, we could run this every `n` hours, and each run would queue up a rediscover for every capture that has "Auto-Discover" enabled. Note that each agent instance will still just process one `rediscovers` at a time, and they will interleave them with other queued jobs, so with a little luck users shouldn't notice any "lag" in agent responsiveness.
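For illustration, the pg_cron wiring could be as simple as the following; the queueing function and the six-hour interval are placeholders, not anything that exists today:

```sql
-- Hypothetical: run a queueing function every 6 hours via pg_cron.
-- internal.queue_rediscovers() would insert one rediscovers row per
-- capture that has "Auto-Discover" enabled.
select cron.schedule(
  'queue-rediscovers',  -- job name
  '0 */6 * * *',        -- every 6 hours
  $$ select internal.queue_rediscovers(); $$
);
```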
### AuthZ considerations

This feature is the first time we're introducing an automated process that updates specs without a user in the loop. Our existing `role_grants` were designed to support this sort of thing, and I don't see any problems there. We might want to introduce an authZ check that the materialization spec must have `read` capability to the prefix of any capture named by `sourceCapture`. For example, say we have a capture `aliceCo/myCapture/source-postgres` and a materialization `bobCo/alice-data` that has `sourceCapture: aliceCo/myCapture/source-postgres`. The authorization check would ensure that `bobCo/alice-data` has `read` capability to `aliceCo/myCapture`. This would ensure that the materialization is read-authorized to both current and future collections that may be discovered by the capture.

Of course, technically there's not really a problem with omitting that check, since the publish handler will already be performing authorization checks. But the nature of these operations means that publications may not be attempted until quite some time after a user creates a materialization. It would make for a far better UX if we could validate the permissions while the user is still editing their materialization.
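A sketch of what that check could look like, assuming a `role_grants` table that relates subject-name prefixes to object-name prefixes with a capability; the column names and prefix semantics here are assumptions, not a confirmed schema:

```sql
-- Hypothetical check: is bobCo/alice-data read-authorized to everything
-- under aliceCo/myCapture/? Assumes role_grants(subject_role, object_role,
-- capability), where each role column is a name prefix.
select exists (
    select 1
      from role_grants
     where 'bobCo/alice-data'   like subject_role || '%'
       and 'aliceCo/myCapture/' like object_role  || '%'
       and capability in ('read', 'write', 'admin')
) as materialization_is_authorized;
```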
Another aspect of this is that we don't currently perform any consistency checks when `role_grants` are removed. So you could have a grant when you first create the materialization, which is subsequently revoked, while your materialization is left in place until the next publication that touches it (which may not be your publication) fails due to a permission error. This seems worth addressing at some point, but I think for the immediate term we can safely ignore these concerns, since cross-tenant sharing isn't something any of our existing users do at this point.

Finally, end-users will not be permitted to create `rediscovers` jobs at this point. This restriction probably isn't necessary from a security perspective, but it would allow us to at least temporarily side-step the need for discussing exactly who is permitted to create `rediscovers`. For now, we can just tell RLS to disallow them all.

### Additional questions on disabling bindings
Regardless of the exact representation, we still need some way to actually represent bindings that users don't want. Otherwise the binding could be automatically re-added on the next automatic `rediscovers` operation. The most obvious approach would be to update the UI to always null out the `target` instead of removing the binding altogether.

We'll also need to update our existing (manual) discovery handler to work with disabled bindings. But it's not clear exactly what the behavior should be in all scenarios. Some scenarios are:

- A user has `autoDiscover.addNewBindings: false`, then adds new tables to their source. They then click the manual "Discover" in the UI. Should we show them bindings that `autoDiscover` has ignored?
- … `autoDiscover` options?

There are certainly other aspects to the UI impact here, and I'll need some help from Travis and/or Kiahna to help identify those.
### Note: What to do when source tables are deleted
If we're running periodic discovery, we may notice that certain tables have been deleted from the source system. I see no reason why we shouldn't automatically remove those bindings from the capture, and I imagine that will align best with user expectations. It also seems reasonable to remove the affected collections from materialization bindings, but there are a couple of reasons why I think we should hold off on that for now.
One reason is that it's possible there are still other captures that are writing into the collection, and we'd need to detect that scenario and leave the binding in place if that's the case. This would not only increase the scope of work, but it could potentially be more confusing to users. The other reason is that we can't really know whether the materializations have fully caught up yet. My (rough) sense is that users would expect their materialized data to include all the data from the source collection. That's not something we can guarantee.
So I think we should hold off on deleting any materialization bindings for the time being. We can always figure that out later if we want.