-
-
Notifications
You must be signed in to change notification settings - Fork 688
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend DocumentDeserialize with a stateful variant DocumentDeserializeSeed #2362
base: main
Are you sure you want to change the base?
Conversation
I had a read through the discussion in #2359 but I am not entirely sure how this solves the issue of fields being deserialized in a non-guaranteed order without still needing to go to some intermediate value first (i.e. BTreeMap or similar) do you have a small example to demo the utility? |
Consider a struct like struct Foobar {
foo: String,
bar: f64,
} which we want to deserialize field by field by implementing impl DocumentDeserialize for Foobar {
fn deserialize<'de, D>(mut deserializer: D) -> Result<Self, DeserializeError>
where
D: DocumentDeserializer<'de>,
{
let (_, foo) = deserializer
.next_field()?
.ok_or_else(|| DeserializeError::custom("missing foo"))?;
let (_, bar) = deserializer
.next_field()?
.ok_or_else(|| DeserializeError::custom("missing bar"))?;
Ok(Foobar { foo, bar })
}
} relies on the unspecified order of the deserialized fields. However, if we have the runtime field ID available from the schema, stored for example as struct FoobarFields {
foo: Field,
bar: Field,
} we can implement impl DocumentDeserializeSeed for FoobarFields {
type Value = Foobar;
fn deserialize<'de, D>(self, mut deserializer: D) -> Result<Self::Value, DeserializeError>
where
D: DocumentDeserializer<'de>,
{
let mut foo = None;
let mut bar = None;
while let Some((field, value)) = deserializer.next_field()? {
if field == self.foo {
match value {
OwnedValue::Str(value) => foo = Some(value),
_ => return Err(DeserializeError::custom("unexpected type for foo"),
}
} else if field == self.bar {
match value {
OwnedValue::F64(value) => bar = Some(value),
_ => return Err(DeserializeError::custom("unexpected type for bar"),
}
} else {
return Err(DeserializeError::custom("unexpected field"));
}
}
let foo = foo.ok_or_else(|| DeserializeError::custom("missing foo"))?;
let bar = bar.ok_or_else(|| DeserializeError::custom("missing bar"))?;
Ok(Foobar { foo, bar })
}
} Of course, the positional guarantee would still be nicer because I could actually expect a certain type instead of matching on |
Right, that makes sense thanks, I agree a guaranteed order would be nicer, although maybe it is also worth looking into adding custom serialization formats, since that could effectively kill two birds with one stone. |
I am not sure, because if I really want the document store to take full advantage of all the byte-and-bit-packing tricks Tantivy has to offer. If I want a custom serialization format, I can already put |
Perhaps yes, although I think maybe it is application-specific where being able to get a borrowed buffer, I'm thinking of something like bitcode that does some great specialization for floating points in order to improve compression ratios and performance. |
I think borrowing byte slices conflicts with document store compression?
But you should be able to put bytes serialized using bitcode instead of bincode into the document store can you not? (We are also currently evaluating whether to replace bincode by bitcode, but our Tantivy schema does not care.) |
Absolutely yes
I am not sure, the compression, and decompressing will create a new owned buffer, but last time I went through the code it should be possible I believe to allow for borrowing of that decompressed buffer. Saying that though, I'm not sure whether that leads to any gains or losses in performance. It is probably quite niche. |
I would like to ping this as making the default document type more opinionated increases the need for full flexibility when writing custom deserializers. I think the main design question here is whether to add this as additional API or wholesale replace the deserialization traits with their stateful variants. |
…eSeed This is modelled on Serde's `DeserializeSeed` include how the relevant API entry points gain a `_seed` variant. It can be used for example to obtain runtime field ID values when deserializing a struct field by field without relying on the order of the fields as written to/read from the document store.
9e73734
to
e9c16a4
Compare
I think the question is if the added complexity justifies the potential performance gain. The de-serialization step is usually dwarfed by other parts of the search execution. Maybe you can share some numbers or your use case. |
I do not want to frame this as a performance question. Without the state, some kinds of documents just cannot be deserialized from the stored representation. One would always have to build an intermediate representation to later on extract these documents. And if the limits chosen for So from my point of view, this is mainly a question of reducing complexity in downstream code. As proposed, the API does not become any more complex for users how do not want the state and there should be no runtime overhead at all. I admit there is additional implementation complexity, but I think it was relatively obvious that this would become a Serde reimplementation when Finally, note that when Tantivy's document store becomes are primary store, search is not the only use case. For example, we have endpoints which just stream documents and therefore do nothing but deserialize and serialize documents. (Without having done a CPU profile, I suspect that decompression would dominate in that case, at least if the workload does not fit into the block cache.) |
One relevant alternative I see, at least for the case of Rust structure with statically known fields, is driving the deserialization by field ID and names. This would allow matching fields without having access to the runtime field ID but just to the also statically known names. I fear this would make the general case more complicated though and also likely requiring looking that information up during deserialization even when it isn't needed. |
If you want to reduce complexity, you can convert your document from the JSON value instead of implementing the deserialize trait. The whole document as trait stuff is about indexing performance, not ergonomics. If someone has more than u16:MAX fields I would be interested in their use case. I think it's quite unlikely (and unnecessary with JSON fields now). The limit could be removed easily. |
I really don't understand the abrasiveness. This is relatively simple change that closely follows Serde's precedent and would enable a straight-forward conversion from and to Rust structures as documents which is impossible to do otherwise. Personally, I don't think And yes, my use case does fit into |
This is modelled on Serde's
DeserializeSeed
include how the relevant API entry points gain a_seed
variant. It can be used for example to obtain runtime field ID values when deserializing a struct field by field without relying on the order of the fields as written to/read from the document store.