Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(substrait): add support for insert roundtrip in append mode #13118

Closed
wants to merge 1 commit into from

Conversation

tokoko
Copy link
Contributor

@tokoko tokoko commented Oct 25, 2024

Rationale for this change

Adds support for append-mode inserts for substrait producer and consumer.

What changes are included in this PR?

The trickiest part is to get output schema right as substrait and datafusion don't fit very well there. Datafusion returns a single field named count after an insert. substrait has no explicit OutputMode that does the same. For the sake of this PR, I assumed that OutputMode.Unspecified replicates this behavior during an insert. I'll open an issue on the substrait side for this. We might need a new OutputMode named OperationStats or something similar.

Are these changes tested?

yes

Are there any user-facing changes?

yes

@tokoko
Copy link
Contributor Author

tokoko commented Oct 27, 2024

@vbarua

Copy link
Contributor

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the sake of this PR, I assumed that OutputMode.Unspecified replicates this behaviour during an insert.

It looks like the spec indicates that if the Output Mode is not specified it behaves as NO_OUTPUT so this assumption doesn't match the spec. See https://substrait.io/relations/logical_relations/#write-operator

I'll open an issue on the substrait side for this. We might need a new OutputMode named OperationStats or something similar.

I think this is something we would be open to in the upstream if you wanted to propose it.

@@ -255,6 +257,27 @@ async fn except_rels(
Ok(rel)
}

fn from_substrait_names(names: &[String]) -> Result<TableReference> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: I think it would be good to give this more specific name like table_reference_from_substrait_names

_ => {
return not_impl_err!("Unsupported WriteOp: {:?}", write.op());
}
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor/nit: if we move this up above table_schema we can avoid doing redundant work in the event that his validation fails.

@tokoko
Copy link
Contributor Author

tokoko commented Oct 28, 2024

@vbarua thanks for the tip. What sentence are you referring to? What would be the point of having both UNSPECIFIED and NO_OUTPUT in that case? I sort of thought Unspecified meant "we don't care, plan doesn't depend on it".

I think this is something we would be open to in the upstream if you wanted to propose it.

Let me see how common this behavior is in other DBs first and I'll come up with something.

@vbarua
Copy link
Contributor

vbarua commented Oct 28, 2024

What would be the point of having both UNSPECIFIED and NO_OUTPUT in that case?

My understanding is that the use of UNSPECIFIED as the first value in the Substrait enums comes from this Protobuf developer recommendation https://protobuf.dev/programming-guides/dos-donts/#unspecified-enum

It's not intended to mean

"we don't care, plan doesn't depend on it".

though I can understand why it would seem like that. In Substrait, UNSPECIFIED effectively means unset by the producer and the spec tries to define default behaviours in those cases.

I'm interpreting the line

Common default is NO_OUTPUT where we return nothing.

from substrait.io/relations/logical_relations#write-operator as indicated that the default behaviour is to not return any output, but tbh that could/should be made clearer if that's the case.

@alamb
Copy link
Contributor

alamb commented Nov 1, 2024

Is this PR ready to merge / good to go @vbarua and @tokoko ?

@tokoko tokoko marked this pull request as draft November 1, 2024 19:18
@tokoko
Copy link
Contributor Author

tokoko commented Nov 1, 2024

@alamb no, converted to draft. @vbarua I'll dig into this, but I don't think we can have a strictly correct implementation unless we make some sort of changes in substrait first wrt the output. Do you think it better to shelve the PR until then?

@vbarua
Copy link
Contributor

vbarua commented Nov 1, 2024

I think we should add this to the upstream first, otherwise we'll end up with non-compliant behaviour in DataFusion which folks might try and use.

Copy link

github-actions bot commented Jan 1, 2025

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Jan 1, 2025
@github-actions github-actions bot closed this Jan 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Stale PR has not had any activity for some time substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants