
Allow customizing how SQL rows are unpacked #1187

Merged: 4 commits merged into master on Jan 28, 2025

Conversation

@1st1 (Member) commented Jan 27, 2025:

No description provided.

@1st1 force-pushed the sqlobj branch 2 times, most recently from 4f0324e to 9917bfc on January 28, 2025 00:04
This commit enables a `client.withCodecs({ sql_row: {...} })` call.
Caveat: `toDatabase()` must be specified even though it's not currently
possible to send rows as query arguments. Here's an example:

  res = await client
    .withCodecs({
      sql_row: {
        fromDatabase(data, desc) {
          return Object.fromEntries(
            desc.names.map((key, index) => [key, data[index]]),
          );
        },
        toDatabase() {
          throw "cannot encode SQL record as a query argument";
        },
      },
    })
    .querySQL( ... );

Here we decode SQL rows into JS objects {colName: colValue}. Note
that toDatabase() is specified even though it's effectively a stub;
it's there just to satisfy the type checker.

Most likely we'll end up adding a more high level API for this, e.g.

  client.withSQLRowsAsObjects().querySQL(...)

or

  client.withCodecs(gel.SQLRowsAsObjects).querySQL(...)

Note that either of these APIs would allow us to specialize unpacking
SQL rows into objects inside the actual codec, for better performance.

In any case, the underlying machinery for those future new APIs
will likely be the one that this commit implements.
packages/driver/src/codecs/context.ts (review thread resolved)
Comment on lines 64 to 67
const overload = ctx.getContainerOverload("sql_row");
if (overload != null) {
return overload.fromDatabase(result, { names: this.names });
}
@scotttrinh (Collaborator) commented:

My suggestion in the meeting was to intercept building the result and have the fromDatabase build the data structure directly. So pass it an iterable that returns an "entries-like" [key, value] where the value has already been decoded from the subcodec. That means the fromDatabase looks more like:

const objRecord = client.withCodecs({
  sql_row: {
    fromDatabase: (elements: Iterable<[string, any]>): Record<string, any> => {
      return Object.fromEntries(elements);
    },
  },
});

I think what that looks like is: if you don't specify a container overload, we use this same iterator ourselves to fill the `result: any[]` array here, like:

const results: any[] = Array.from(elements, (e) => e[1]);

Mostly this is a memory/space thing, but I can imagine people requesting pretty large rowsets where you don't want a bunch of allocations. We can probably do a more memory-optimized version of the implementation you already have below, though. Without a bunch of actual benchmarking, I'm not sure I can make my case much better than "vibes" here. 😂

@scotttrinh (Collaborator) commented Jan 28, 2025:


Another idea would be to return a data structure that has .entries, .values, .keys methods that return different iterables, to avoid the extra allocations for the key name and tuple when they're not needed, and to make the Array case more like Array.from(elements.values()) instead of needing to define a mapping function.
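
A minimal sketch of that idea (purely illustrative; RowView and its methods are not driver API), where each accessor is a lazy generator over the already-decoded values:

// Hypothetical lazy row view over decoded values and column names.
class RowView {
  constructor(
    private data: any[],
    private names: string[],
  ) {}
  *keys(): IterableIterator<string> {
    yield* this.names;
  }
  *values(): IterableIterator<any> {
    yield* this.data;
  }
  *entries(): IterableIterator<[string, any]> {
    for (let i = 0; i < this.data.length; i++) {
      yield [this.names[i], this.data[i]];
    }
  }
}

// Array mode then skips the per-field [key, value] tuple allocations:
const view = new RowView([1, "x"], ["id", "label"]);
const asArray = Array.from(view.values()); // [1, "x"]
const asObject = Object.fromEntries(view.entries()); // { id: 1, label: "x" }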

@1st1 (Member, author) commented Jan 28, 2025:


> My suggestion in the meeting was to intercept building the result and have the fromDatabase build the data structure directly. So pass it an iterable that returns an "entries-like" [key, value] where the value has already been decoded from the subcodec.

and

> Mostly this is a memory/space thing, but I can imagine people requesting pretty large rowsets where you don't want a bunch of allocations.

Sadly this won't work. In order to build a classic synchronous iterator/generator to pass it to fromDatabase() we'd need to first parse the entire incoming data into an array and then return an iterator over it.

> Another idea would be to return a data structure that has .entries, .values, .keys methods that return different iterables, to avoid the extra allocations for the key name and tuple when they're not needed, and to make the Array case more like Array.from(elements.values()) instead of needing to define a mapping function.

This doesn't change much: the data will still all be allocated for this data structure before fromDatabase() is even called.


If we want to progressively push values into fromDatabase() as we parse them, then the only way is to use an async iterator. Sadly, the overhead of that will be so large that we'll just kill performance (compare array[i] = el on a pre-sized array vs. a chain of callbacks and function calls).


If we really want to make it possible for users to have super fast decoding for rows, then we have to have a special codec API for collections.

fromDatabase(values, names) needs to be replaced with a pair of makeFromDatabaseContainer(size) and setFromDatabaseContainerIndex(index, name, value).

This way we can efficiently pre-allocate the container (if need be) and progressively populate it with values. This is the fastest way possible.
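
As a rough sketch (hypothetical; none of these names exist in the driver, and the container is threaded explicitly here rather than held as codec state):

// Hypothetical two-phase codec API for collection types.
interface ContainerOverload<C> {
  // Pre-allocate the output container for a row with `size` fields.
  makeFromDatabaseContainer(size: number): C;
  // Called once per field, in decode order, with the already-decoded value.
  setFromDatabaseContainerIndex(
    container: C,
    index: number,
    name: string,
    value: unknown,
  ): void;
}

// "Object mode": fill a plain record keyed by column name.
const objectRows: ContainerOverload<Record<string, unknown>> = {
  makeFromDatabaseContainer: () => ({}),
  setFromDatabaseContainerIndex(container, _index, name, value) {
    container[name] = value;
  },
};

// "Array mode": pre-size the array once, then write by position.
const arrayRows: ContainerOverload<unknown[]> = {
  makeFromDatabaseContainer: (size) => new Array(size),
  setFromDatabaseContainerIndex(container, index, _name, value) {
    container[index] = value;
  },
};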

But this is very complex, so let's ask ourselves: do we actually need any of this?


Let's step back and analyze what's going on here (doing this also for the sake of finally documenting the decision tree somewhere).

In an ideal world we'd just have ONE data type for querySQL. JavaScript is dynamic enough to let us build a hybrid type, where a field can be accessed either by an index or by its name.

We could do that by extending the array type (creating types dynamically per data descriptor and caching them) or by creating objects with a custom prototype with nice name getters.

Either approach would work, and we even implemented this before for named tuples! Everything worked super fast and was quite ergonomic. But we ended up removing all of that, opting for using pure JS objects or arrays.
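
For illustration only (not actual driver code; note that a column name colliding with an Array method like "map" would clash), the array-subclass variant could have looked roughly like:

// Build an Array subclass with a named getter per column, created once
// per row descriptor and cached.
function makeRowClass(names: string[]) {
  class Row extends Array<any> {}
  names.forEach((name, index) => {
    Object.defineProperty(Row.prototype, name, {
      get(this: any[]) {
        return this[index];
      },
    });
  });
  return Row;
}

const UserRow = makeRowClass(["id", "email"]);
const row: any = UserRow.from([1, "ada@example.com"]);
console.log(row[1] === row.email); // true: indexable by position and by name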

Why? Because next-js (and a few other frameworks) for some reason enforces runtime validation on user data before allowing it to be serialized to JSON. I thought this could be a thing of the past (see the last comment in this thread), but I decided to check the latest nextjs source and found this and that. So nextjs still enforces serialization checks somewhere, and given that nextjs isn't the only offender, we probably should just not attempt doing anything smart ever again.


Back to square one: what do users want? They want to access values in a row by index and/or by name. With JavaScript we don't have a way to allow both at the same time. So it's one or the other, and the user has to choose ahead of time.

This is what pglite offers: rowMode: "object" | "array".

Can we do this? Yes, that's almost what the current PR is. I'll push an update to it now to streamline it further for the user and make it close to pglite.

I still think it's useful to keep the API this PR introduces. Why? Because with the current API you can just wrap the passed values array and implement getters/setters on top of it (no need to re-allocate or copy anything). But I do know that this would be a super low-level API that few people would ever use.
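
For illustration, wrapping without copying could look like this (hypothetical sketch; it reuses the fromDatabase(data, desc) shape from the commit message above and assumes an existing client):

const lazyClient = client.withCodecs({
  sql_row: {
    fromDatabase(data: any[], desc: { names: string[] }) {
      const row: Record<string, unknown> = Object.create(null);
      desc.names.forEach((name, index) => {
        Object.defineProperty(row, name, {
          get: () => data[index], // reads through to the original array
          enumerable: true,
        });
      });
      return row;
    },
    toDatabase() {
      throw new Error("cannot encode SQL record as a query argument");
    },
  },
});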

@scotttrinh (Collaborator) commented:


I'm going to do a little bit of experimenting and benchmarking to validate our conflicting vibes here to ensure we don't end up performing poorly versus other drivers when returning SQL results. This may end up "breaking" the interface, but since this isn't yet released, and the actual public interface doesn't involve setting up the codec manually, I think that's fine.

@scotttrinh (Collaborator) commented:


Alright! Did some benchmarking and I was not able to give a nicer interface for consumers that didn't measurably impact the performance of the default case, so I'm going to withdraw my suggestion here. The closest I got was giving the consumer an entries-like array ([key, value][]) which I feel like is a more natural API than (values, { names }), but to get the same performance involves just as much duplication of logic, and if the downstream consumer wants to ignore the names, they suffer a small penalty for doing so.

@1st1 (Member, author) commented:


Yeah, and I also realized I was wrong suggesting that a generator-based API can't be implemented. It sure can be. But it's still going to be much slower than what we have now, so let's not bother.

Comment on lines 2621 to 2624
// toDatabase is required to be specified (limitation
// of TypeScript type system). This isn't super elegant
// and most likely we'll need a higher-level nicer API
// on top of this.
@scotttrinh (Collaborator) commented:


I assume you already tried just leaving off the toDatabase in the type def, and ran into some errors?

@1st1 (Member, author) commented Jan 28, 2025:


Believe it or not, I spent a considerable amount of time debugging the typing here, ending up quite frustrated that I couldn't solve this puzzle.

Out of complete desperation I asked chatgpt and here's the sacred knowledge it shared with me.
[screenshot: ChatGPT's explanation]

This stems from this:

[screenshot: the CodecSpec type definition, with the relevant part highlighted]

See the highlighted part. CodecSpec is used to type client.withCodecs. I literally tried everything, so I don't think I can solve this. You're welcome to try.

@1st1 (Member, author) commented:


Well, with the new commit just pushed we don't care about this (for now).

// and most likely we'll need a higher-level nicer API
// on top of this.
//
// Maybe `client.withCodecs(gel.SQLRowAsObject)`?
@scotttrinh (Collaborator) commented:


Yes, absolutely, 100%. Could even short-cut this with .withSQLObjectCodec(). I almost think that this should be the default and have the array of tuples be the opt-in option.

@1st1 (Member, author) commented:


OK, let's do this. I'll push some updates in a few.

Use it to switch the return type of client.querySQL() from
JS array (indexable by position) to JS object (indexable by
field name).

"Hide" `client.withCodecs({sql_row: ...})` by renaming
"sql_row" to "_private_sql_row".

If there's ever a use-case to make it possible for users
to overload how SQL rows are decoded beyond choosing between
'array' and 'object' we can rename it back to `sql_row`.

But I do want to keep the door open to us potentially implementing
a different codec API for "collection" types -- an API that would
have better performance but look more low level. See [1] for
details.

[1] #1187 (comment)
@1st1 (Member, author) commented Jan 28, 2025:


@scotttrinh @jaclarke alright, I've pushed another commit to "hide" the withCodecs setup for the "SQL row" type and added a new high-level API: withSQLRowMode(mode: 'array' | 'object'): Client.

If either of you decide to merge, please don't "squash" but instead "rebase". I'd like both commits of this PR to survive in the history.

@1st1 (Member, author) commented Jan 28, 2025:


Alright, turns out that both node-postgres and pglite by default return arrays of objects, not arrays of arrays. And so should we.

I've just pushed another commit to flip the default.

client.querySQL() -> array<record<string, any>>
client.withSQLRowMode('array').querySQL() -> array<array<any>>
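
A minimal usage sketch (assuming an existing client; the SQL literal is illustrative):

// Default: each row is an object keyed by column name.
const objRows = await client.querySQL("SELECT 1 AS a, 'x' AS b");
// objRows: [{ a: 1, b: "x" }]

// Opting back in to positional rows:
const arrRows = await client
  .withSQLRowMode("array")
  .querySQL("SELECT 1 AS a, 'x' AS b");
// arrRows: [[1, "x"]]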

@1st1 requested a review from scotttrinh on January 28, 2025 06:46
@scotttrinh merged commit 62a2199 into master on Jan 28, 2025
9 checks passed
scotttrinh pushed a commit that referenced this pull request Jan 28, 2025
@scotttrinh deleted the sqlobj branch on January 28, 2025 15:59
Comment on lines +283 to +284
// and it's set to "object mode", the we want the codec mapping to be
// empty instead. Why? Empty codec mapping short circuits a lot of
A Member commented:


Suggested change:
- // and it's set to "object mode", the we want the codec mapping to be
- // empty instead. Why? Empty codec mapping short circuits a lot of
+ // and it's set to "object mode", we want the codec mapping to be
+ // empty instead. Why? Empty codec mapping short circuits a lot of

The same Member commented:


Was 4 minutes too late 😅

@scotttrinh (Collaborator) commented:


ha! PRs welcome 😂
