Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Postgres COPY FROM/TO support #3951

Merged
merged 15 commits into from
Apr 10, 2024

Conversation

weiznich
Copy link
Member

@weiznich weiznich commented Mar 1, 2024

This commit adds support for PostgreSQL COPY commands.

For COPY FROM we expose a variant that allows users to configure the stream manually and write directly to the stream. We also support a variant that takes essentially a Vec<Insertable> (or equivalent batch insert containers) and uses the binary format to perform a streamed batch insert.

For COPY TO we expose again a variant that allows the user to configure the stream manually and read directly from the stream. We also support loading the results directly via the binary protocol into a iterator of Queryable structs.

This is not finished yet, I open the PR nevertheless to gather some feedback on the API.

Things that need to be addressed before merging:

  • Documentation
  • Write some more tests (especially compile tests)
  • Go over the naming of everything again
  • Have a look if we could use Selectable for the loading part to avoid running into nasty mismatch errors there

@weiznich weiznich requested a review from a team March 1, 2024 18:09
Copy link

@JosueMolinaMorales JosueMolinaMorales left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is some cool code. Just some questions (sorry if there are any dumb questions)

diesel/src/pg/connection/copy.rs Outdated Show resolved Hide resolved
diesel/src/pg/query_builder/copy/copy_in.rs Outdated Show resolved Hide resolved
diesel/src/pg/query_builder/copy/copy_in.rs Outdated Show resolved Hide resolved
}

#[derive(Default, Debug)]
struct CommonOptions {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason for not including HEADER, FORCE_QUOTE, ENCODING, etc?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ENCODING is currently always assumed to be the same as for the connection (so UTF-8). It might be worth to expose that at some point.
HEADER is currently only exposed for copy_in, but after reading the documentation again it seems to be useful to expose it also for copy_out without allowing MATCH
FORCE_QUOTE support requires figuring out passing down a list of columns to that function. That's not impossible, it's just something that I did not want to do for the first version.

diesel_tests/tests/copy.rs Outdated Show resolved Hide resolved
diesel/src/pg/connection/copy.rs Outdated Show resolved Hide resolved
.as_ref()
.map(|l| l.as_ptr())
.unwrap_or(std::ptr::null());
let ret = unsafe { pq_sys::PQputCopyEnd(self.internal_connection.as_ptr(), error) };
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weiznich what would happen in the event this line does not get called as a result of the '?' on line 164 ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, addressed in dbc92f2

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good 👍


fn get<'b, I>(&'b self, idx: I) -> Option<Self::Field<'b>>
where
'a: 'b,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weiznich does this mean that 'a must be a subtype of 'b or is it the other way around ?

Copy link
Member Author

@weiznich weiznich Mar 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yep makes sense, thanks for the link 👍

Copy link

@peasee peasee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple questions from me to help get myself up to speed 😄


impl<'conn> Write for CopyFromSink<'conn> {
fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
self.conn.put_copy_data(buf).unwrap();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Should we .unwrap() here if we're already returning a Result<_>?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, that shouldn't use unwrap. We don't want to use unwrap anywhere other than in tests.

#[allow(unsafe_code)] // ffi code
fn drop(&mut self) {
if !self.ptr.is_null() {
unsafe { pq_sys::PQfreemem(self.ptr as *mut ffi::c_void) };
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: This is my first time seeing CFFI code before. Let me see if I understand this right:

  • We have a function in pq_sys, PQfreemem which is a binding to the libpq library and drops (something) from memory?
  • The PQfreemem is equivalent to doing a drop() in Rust, and because we're using the C library we need to implement Drop to call PQfreemem so that we can drop(impl CopyToBuffer).

I have a gap in my knowledge here though, around the purpose of self.ptr as *mut ffi::c_void? Does this equate to C like void* self.ptr, so the library knows it's receiving a pointer but doesn't know what data type it is? The library then knows the data at that pointer is what it has to clear, but doesn't care what that data is?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PQfreemem is the postgresql specific variant of the C dealloc/free function. This drop impl essentially ensures that we always free the buffer this pointer points to. The cast is required as the c function takes a void pointer as argument.
Not calling that function would potentially leak memory, which would be a bug.

Comment on lines +96 to +103
if !self.ptr.is_null() {
pq_sys::PQfreemem(self.ptr as *mut ffi::c_void);
self.ptr = std::ptr::null_mut();
}
let len =
pq_sys::PQgetCopyData(self.conn.internal_connection.as_ptr(), &mut self.ptr, 0);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Okay, I think it's starting to click more for me. We clear the pointer from any previous library call, then do another FFI call supplying our (now empty) pointer.

If I'm following correctly, does this mean the library call will mutate our underlying pointer to make it point to the results of this library call, and this function returns the length of data that exists at the pointer location?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is correct.

Ok(res)
}

pub(crate) fn copy_to<T>(&mut self, command: CopyToCommand<T>) -> QueryResult<CopyToBuffer<'_>>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: This might be a styling thing I need to get familiar with, but why do we do command: CopyToCommand<T> vs the copy_from function above which does target: S [...] where S: CopyInExpression<T>?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are functionally different: command: CopyToCommand<T> is a struct with a generic type argument. target: S [...] where S: CopyInExpression <T> is a complete generic type that implements a certain trait (CopyInExpression in that case). The struct essentially holds a set of common values that is the same for all possible values of the generic type, while the trait variant allows fundamentally different types there. I've just used what seemed to fit the needs in each location.

Copy link

@peasee peasee Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep totally right, tired brain was thinking there should've been a type T/S somewhere in the trait which isn't always true 🤦

@weiznich weiznich force-pushed the feature/copy_in_out branch from dbc92f2 to a740c1a Compare April 5, 2024 06:30
@weiznich weiznich marked this pull request as ready for review April 5, 2024 06:30
@weiznich weiznich force-pushed the feature/copy_in_out branch from 32b47f1 to fd49b3f Compare April 5, 2024 07:49
Copy link

@peasee peasee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple more comments from me here. I also noticed there's quite a few .expect() in the copy_to.rs changes, among other places. I'm not sure if that's a pattern we accept here, but for those that are in a Result<_> it might be suitable to change to errors instead of panicking.

Also, please let me know if my comments are too nitpick-y. I'm testing the waters to see how deep I go with my reviews 😄

let mut copy_in = CopyFromSink::new(&mut conn.raw_connection);
let r = source.target.callback(&mut copy_in);
copy_in.finish(r.as_ref().err().map(|e| e.to_string()))?;
let next_res = conn.raw_connection.get_next_result()?.expect("exists");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Should we give these .expect()'s a better error message? Also, curious why we wouldn't use something like .ok_or_else(|| SomeError())? here, as this function already returns a Result<_>?

Comment on lines +485 to +490
) -> Result<R, E>,
) -> Result<R, E>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Why the type change from QueryResult to a Result?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copy_from method allows to configure a callback that is then used to pull data into the database. That callback can return custom error, that we want to propagate back to the user. This callback is called somewhere in the callback passed to with_prepared_query, therefore we need to change the return type there to allow that.

Comment on lines +166 to +167
CString::new("Error message contains a \\0 byte")
.expect("Does not contain a null byte")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Should we make this .expect() more true to the error, in that it failed to initialize our new CString error? I would've expected to see the other .expect() in .map(CString::new) if it were to be somewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually write expect() messages as factual statement. In this case we know at compile time (by looking at the string above) that is does not contain a null byte. This cannot ever be an error.

self.target.walk_target(pass.reborrow())?;
pass.push_sql(" FROM STDIN");
self.target.options().walk_ast(pass.reborrow())?;
// todo: where?
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Should this TODO be addressed before merge?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's for now not implemented. I do not plan to support that for now, so I've removed the todo.

type Error = E;

fn callback(&mut self, copy: &mut CopyFromSink<'_>) -> Result<(), Self::Error> {
(self.copy_callback)(copy)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Why do we wrap self.copy_callback in brackets if we already know it's a function? Couldn't we just self.copy_callback(copy)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's required by rust to call a function stored in a field. See this playground for a minimal example: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=f5a4af872196441cb75487ee05af5aca

let values = self
.0
.take()
.expect("We only call this callback once")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: .expect() in a Result<_>. Ditto for above, could we map this to an error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's one of the cases where I personally prefer using expect to get a panic if that's ever hit as that would be clearly a bug in diesel. My reasoning there is to panic for cases that are considered to be a problem with our (== diesel's) code to get reports on that. On the other hand anything that could be triggered and handled by a user should rather be a normal error.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay that's fair enough, I'll keep that in mind for future reviews! 😁

&'b self,
mut pass: crate::query_builder::AstPass<'_, 'b, Pg>,
comma: &mut &'static str,
) -> crate::QueryResult<()> {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Why is this QueryResult<_> if we don't ? anything here?

Copy link

@peasee peasee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your hard work! 🚢

@weiznich weiznich enabled auto-merge April 10, 2024 05:01
@weiznich weiznich force-pushed the feature/copy_in_out branch from 57ba5c0 to 3f32a02 Compare April 10, 2024 05:32
weiznich added 15 commits April 10, 2024 09:41
This commit adds support for PostgreSQL `COPY` commands.

For `COPY FROM` we expose a variant that allows users to configure the
stream manually and write directly to the stream. We also support a
variant that takes essentially a `Vec<Insertable>` (or equivalent batch
insert containers) and uses the binary format to perform a streamed
batch insert.

For `COPY TO` we expose again a variant that allows the user to
configure the stream manually and read directly from the stream. We also
support loading the results directly via the binary protocol into a
iterator of `Queryable` structs.
* Renaming everything to follow `COPY FROM`/`COPY TO` nameing
* Require `Selectable` for loading (to prevent errors)
* Add instrumentation support + tests
@weiznich weiznich force-pushed the feature/copy_in_out branch from 3f32a02 to c3b6dd1 Compare April 10, 2024 09:41
@weiznich weiznich added this pull request to the merge queue Apr 10, 2024
Merged via the queue into diesel-rs:master with commit 8afe715 Apr 10, 2024
48 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants