Allow customizing how SQL rows are unpacked #1187
Conversation
4f0324e to 9917bfc:
This commit enables a `client.withCodecs({ sql_row: {...} })` call. Caveat: `toDatabase()` must be specified even though it's not currently possible to send rows as query arguments. Here's an example:

```ts
const res = await client
  .withCodecs({
    sql_row: {
      fromDatabase(data, desc) {
        return Object.fromEntries(
          desc.names.map((key, index) => [key, data[index]]),
        );
      },
      toDatabase() {
        throw "cannot encode SQL record as a query argument";
      },
    },
  })
  .querySQL( ... );
```

Here we decode SQL rows into JS objects `{colName: colValue}`. Note that `toDatabase()` is specified, but it's effectively a no-op that exists only to satisfy the type checker.

Most likely we'll end up adding a higher-level API for this, e.g. `client.withSQLRowsAsObjects().querySQL(...)` or `client.withCodecs(gel.SQLRowsAsObjects).querySQL(...)`. Note that either of these APIs would allow specializing the unpacking of SQL rows into objects inside the actual codec, for better performance. In any case, the underlying machinery for those future APIs will likely be the one that this commit implements.
packages/driver/src/codecs/record.ts
Outdated
```ts
const overload = ctx.getContainerOverload("sql_row");
if (overload != null) {
  return overload.fromDatabase(result, { names: this.names });
}
```
My suggestion in the meeting was to intercept building the result and have the `fromDatabase` build the data structure directly. So pass it an iterable that returns an "entries-like" `[key, value]` where the value has already been decoded from the subcodec. That means the `fromDatabase` looks more like:

```ts
const objRecord = client.withCodec({
  sql_row: {
    fromDatabase: (elements: Iterable<[string, any]>): Record<string, any> => {
      return Object.fromEntries(elements);
    },
  },
});
```

I think what that looks like is that if you don't specify a container overload, we use this same iterator ourselves to fill the `result: any[]` array here, like:

```ts
const results: any[] = Array.from(elements, (e) => e[1]);
```
Mostly this is a memory/space thing, but I can imagine people requesting pretty large rowsets where you don't want a bunch of allocations. We can probably do a more memory-optimized version of the implementation you already have below though. Without a bunch of actual benchmarking, I'm not sure I can make my case much better than "vibes" here. 😂
Another idea would be to return a data structure that has `.entries`, `.values`, `.keys` methods that return different iterables, to avoid the extra allocations for the key name and tuple if they're not needed, and make the Array case more like `Array.from(elements.values())` instead of needing to define a mapping function.
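To make the idea concrete, here is a hypothetical sketch of such a row view (the `RowView` class and its construction are illustrative, not actual driver code): consumers that only want positional values iterate `.values()` and never pay for `[key, value]` tuple allocations, while object-mode consumers use `.entries()`.

```typescript
// Hypothetical sketch of the ".entries/.values/.keys" idea above,
// not actual driver code: a view over already-decoded row values.
class RowView {
  private readonly names: string[];
  private readonly data: unknown[];

  constructor(names: string[], data: unknown[]) {
    this.names = names;
    this.data = data;
  }

  *keys(): IterableIterator<string> {
    yield* this.names;
  }

  // Positional path: no [key, value] tuples allocated here.
  *values(): IterableIterator<unknown> {
    yield* this.data;
  }

  *entries(): IterableIterator<[string, unknown]> {
    for (let i = 0; i < this.data.length; i++) {
      yield [this.names[i], this.data[i]];
    }
  }
}

// Array mode needs no mapping function; object mode pairs naturally
// with Object.fromEntries.
const view = new RowView(["id", "name"], [1, "Alice"]);
const asArray = Array.from(view.values());
const asObject = Object.fromEntries(view.entries());
```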
My suggestion in the meeting was to intercept building the result and have the fromDatabase build the data structure directly. So pass it an iterable that returns an "entries-like" [key, value] where the value has already been decoded from the subcodec.
and
Mostly this is a memory/space thing, but I can imagine people requesting pretty large rowsets where you don't want a bunch of allocations.
Sadly this won't work. In order to build a classic synchronous iterator/generator to pass to `fromDatabase()`, we'd need to first parse the entire incoming data into an array and then return an iterator over it.
Another idea would be to return a data structure that has .entries, .values, .keys methods that returned different iterables to avoid the extra allocations for the key name and tuple if they're not needed and make the Array case more like Array.from(elements.values()) instead of needing to define a mapping function.
This doesn't change much: the data will still be fully allocated for this data structure before `fromDatabase` is even called.
If we want to progressively push values into `fromDatabase()` as we parse them, then the only way is to use an async iterator. Sadly, the overhead of that will be so large that we'll just kill performance (compare `array[i] = el` on a pre-sized array vs. a chain of callbacks and function calls).
If we really want to make it possible for users to have super fast decoding for rows then we have to have a special codec API for collections. `fromDatabase(values, names)` needs to be replaced with a pair of `makeFromDatabaseContainer(size)` and `setFromDatabaseContainerIndex(index, name, value)`. This way we can efficiently pre-allocate the container (if need be) and progressively populate it with values. This is the fastest way possible.
But this is very complex, so let's ask ourselves: do we actually need any of this?
Let's step back and analyze what's going on here (doing this also for the sake of finally documenting the decision tree somewhere).
In an ideal world we'd just have ONE data type for `querySQL`. JavaScript is dynamic enough to let us build a hybrid type, where a field can be accessed either by an index or by its name. We could do that by extending the `array` type (creating types dynamically per data descriptor and caching them) or by creating objects with a custom prototype with nice name getters.
Either approach would work, and we even implemented this before for named tuples! Everything worked super fast and was quite ergonomic. But we ended up removing all of that, opting for using pure JS objects or arrays.
Why? Because next-js (and a few other frameworks) for some reason enforce runtime validation on user data before allowing it to be serialized to JSON. I thought this could be a thing of the past (see the last comment in this thread), but I decided to check the latest nextjs source and found this and that. So nextjs still enforces serialization checks somewhere, and given that nextjs isn't the only offender, we probably should just not attempt doing anything smart ever again.
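For concreteness, the second (custom-prototype) approach might have looked roughly like this sketch (the `makeRowFactory` helper is hypothetical, not the removed driver code): one prototype is built per row descriptor, with a getter per column name reading from positional storage, so each row costs a single object allocation that merely wraps the decoded array.

```typescript
// Hypothetical sketch of rows as objects with a custom prototype of
// per-column name getters over positional storage.
function makeRowFactory(names: string[]) {
  const proto: Record<string, unknown> = {};
  names.forEach((name, index) => {
    Object.defineProperty(proto, name, {
      get(this: { _data: unknown[] }) {
        return this._data[index]; // name access resolves to an index
      },
      enumerable: true,
    });
  });
  // In the real scheme the factory would be cached per data
  // descriptor, so per-row cost is one object allocation.
  return (data: unknown[]) => {
    const row = Object.create(proto);
    row._data = data; // wrap the decoded array, don't copy it
    return row;
  };
}

const makeUserRow = makeRowFactory(["id", "email"]);
const row = makeUserRow([1, "alice@example.com"]);
// row.id and row._data[0] both read the same slot.
```

Note the nextjs problem described below applies exactly to values like these: their useful properties live on a custom prototype, which serialization checks reject.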
Back to square one: what do users want? They want to access values in a row by index and/or name. With JavaScript we don't have a way to allow both at the same time. So it's one or another and the user has to choose ahead of time.
This is what pglite is offering: `rowMode: "object" | "array"`.
Can we do this? Yes, that's what the current PR almost is. I'll push an update to it now to streamline it for the user further and make it close to pglite.
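The two shapes in question, modeled on pglite's `rowMode` option (the function below is purely illustrative; it is not the driver's API surface):

```typescript
// Illustrative only: the two row shapes a pglite-style rowMode
// option chooses between.
function unpackRow(
  mode: "array" | "object",
  names: string[],
  values: unknown[],
): unknown[] | Record<string, unknown> {
  if (mode === "array") {
    return values; // positional access: row[0], row[1], ...
  }
  // name access: row.id, row.email, ...
  const row: Record<string, unknown> = {};
  for (let i = 0; i < names.length; i++) {
    row[names[i]] = values[i];
  }
  return row;
}
```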
I still think it's useful to keep the API this PR introduces. Why? Because with the current API you can just wrap the passed values array and implement getters/setters on top of it (no need to re-allocate or copy anything). But I do know that this would be a super low-level API that few people would ever use.
I'm going to do a little bit of experimenting and benchmarking to validate our conflicting vibes here to ensure we don't end up performing poorly versus other drivers when returning SQL results. This may end up "breaking" the interface, but since this isn't yet released, and the actual public interface doesn't involve setting up the codec manually, I think that's fine.
Alright! Did some benchmarking and I was not able to give a nicer interface to consumers that didn't measurably impact the performance of the default case, so I'm going to withdraw my suggestion here. The closest I got was giving the consumer an entries-like array (`[key, value][]`), which I feel is a more natural API than `(values, { names })`, but getting the same performance involves just as much duplication of logic, and if the downstream consumer wants to ignore the names, they suffer a small penalty for doing so.
Yeah, and I also realized I was wrong suggesting that a generator-based API can't be implemented. It sure can be. But it's still going to be much slower than what we have now, so let's not bother.
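For the record, a synchronous generator-based version is indeed possible; here's a rough sketch (the `objectBuilder`/`decodeWith` names are hypothetical, and as noted above this pattern would be slower than direct index writes): the consumer's codec is a generator that receives each decoded value via `next()` and returns the finished row.

```typescript
// Hypothetical generator-based decoding: possible, but each value
// crosses a generator suspension point, which is the slowness.
function* objectBuilder(
  names: string[],
): Generator<void, Record<string, unknown>, unknown> {
  const row: Record<string, unknown> = {};
  for (const name of names) {
    row[name] = yield; // suspend until the decoder pushes a value
  }
  return row;
}

// The decoder's side: drive the generator with parsed values.
function decodeWith(
  names: string[],
  values: unknown[],
): Record<string, unknown> | undefined {
  const gen = objectBuilder(names);
  let step = gen.next(); // prime: run to the first yield
  for (const v of values) {
    step = gen.next(v);
  }
  return step.done ? step.value : undefined;
}
```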
packages/driver/test/client.test.ts
Outdated
```ts
// toDatabase is required to be specified (limitation
// of TypeScript type system). This isn't super elegant
// and most likely we'll need a higher-level nicer API
// on top of this.
```
I assume you already tried just leaving off the `toDatabase` in the type def, and ran into some errors?
Believe it or not, I spent a considerable amount of time debugging typing here, ending up quite frustrated that I can't solve this puzzle.
Out of complete desperation I asked chatgpt and here's the sacred knowledge it shared with me.
This stems from this:
See the highlighted part. `CodecSpec` is used to type `client.withCodecs`. I literally tried everything, so I don't think I can solve this. You're welcome to try.
Well, with the new commit just pushed we don't care about this (for now).
packages/driver/test/client.test.ts
Outdated
```ts
// and most likely we'll need a higher-level nicer API
// on top of this.
//
// Maybe `client.withCodecs(gel.SQLRowAsObject)`?
```
Yes, absolutely, 100%. Could even short-cut this with `.withSQLObjectCodec()`. I almost think that this should be the default and have the array of tuples be the opt-in option.
OK, let's do this. I'll push some updates in a few.
Use it to switch the return type of client.querySQL() from JS array (indexable by position) to JS object (indexable by field name).

"Hide" `client.withCodecs({sql_row: ...})` by renaming "sql_row" to "_private_sql_row". If there's ever a use-case to make it possible for users to overload how SQL rows are decoded beyond choosing between 'array' and 'object' we can rename it back to `sql_row`.

But I do want to keep the door open to us potentially implementing a different codec API for "collection" types -- an API that would have better performance but look more low level. See [1] for details.

[1] #1187 (comment)
@scotttrinh @jaclarke alright, I've pushed another commit to "hide" `sql_row`. If either of you decide to merge, please don't "squash" but instead "rebase". I'd like both commits of this PR to survive in the history.
Alright, turns out that both of you agree. I've just pushed another commit to flip the default.
```ts
// and it's set to "object mode", the we want the codec mapping to be
// empty instead. Why? Empty codec mapping short circuits a lot of
```
Suggested change:

```diff
-// and it's set to "object mode", the we want the codec mapping to be
-// empty instead. Why? Empty codec mapping short circuits a lot of
+// and it's set to "object mode", we want the codec mapping to be
+// empty instead. Why? Empty codec mapping short circuits a lot of
```
Was 4 minutes too late 😅
ha! PRs welcome 😂