Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend DocumentDeserialize with a stateful variant DocumentDeserializeSeed #2362

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

adamreichold
Copy link
Collaborator

This is modelled on Serde's DeserializeSeed include how the relevant API entry points gain a _seed variant. It can be used for example to obtain runtime field ID values when deserializing a struct field by field without relying on the order of the fields as written to/read from the document store.

@ChillFish8
Copy link
Collaborator

I had a read through the discussion in #2359 but I am not entirely sure how this solves the issue of fields being deserialized in a non-guaranteed order without still needing to go to some intermediate value first (i.e. BTreeMap or similar) do you have a small example to demo the utility?

@adamreichold
Copy link
Collaborator Author

Consider a struct like

struct Foobar {
   foo: String,
   bar: f64,
}

which we want to deserialize field by field by implementing DocumentDeserialize, doing that directly fails because

impl DocumentDeserialize for Foobar {
    fn deserialize<'de, D>(mut deserializer: D) -> Result<Self, DeserializeError>
    where
        D: DocumentDeserializer<'de>,
    {
        let (_, foo) = deserializer
            .next_field()?
            .ok_or_else(|| DeserializeError::custom("missing foo"))?;

        let (_, bar) = deserializer
            .next_field()?
            .ok_or_else(|| DeserializeError::custom("missing bar"))?;
            
        Ok(Foobar { foo, bar })
    }
}

relies on the unspecified order of the deserialized fields.

However, if we have the runtime field ID available from the schema, stored for example as

struct FoobarFields {
    foo: Field,
    bar: Field,
}

we can implement DocumentDeserializeSeed on this type to get a robust implementation, e.g.

impl DocumentDeserializeSeed for FoobarFields {
    type Value = Foobar;
    
    fn deserialize<'de, D>(self, mut deserializer: D) -> Result<Self::Value, DeserializeError>
    where
        D: DocumentDeserializer<'de>,
    {
        let mut foo = None;
        let mut bar = None;
        
        while let Some((field, value)) = deserializer.next_field()? {
            if field == self.foo {
                match value {
                    OwnedValue::Str(value) => foo = Some(value),
                    _ => return Err(DeserializeError::custom("unexpected type for foo"),
                }
            } else if field == self.bar {
                match value {
                    OwnedValue::F64(value) => bar = Some(value),
                    _ => return Err(DeserializeError::custom("unexpected type for bar"),
                }
            } else {
                return Err(DeserializeError::custom("unexpected field"));
            }
        }
        
        let foo = foo.ok_or_else(|| DeserializeError::custom("missing foo"))?;
        let bar = bar.ok_or_else(|| DeserializeError::custom("missing bar"))?;
            
        Ok(Foobar { foo, bar })
    }
}

Of course, the positional guarantee would still be nicer because I could actually expect a certain type instead of matching on OwnedValue. But even this could be mitigated if DocumentDeserializer would get a method to peek the next Field without committing to read its value.

@ChillFish8
Copy link
Collaborator

Right, that makes sense thanks, I agree a guaranteed order would be nicer, although maybe it is also worth looking into adding custom serialization formats, since that could effectively kill two birds with one stone.

@adamreichold
Copy link
Collaborator Author

adamreichold commented Apr 17, 2024

although maybe it is also worth looking into adding custom serialization formats, since that could effectively kill two birds with one stone.

I am not sure, because if I really want the document store to take full advantage of all the byte-and-bit-packing tricks Tantivy has to offer. If I want a custom serialization format, I can already put Vec<u8> into the document store and deserialize it as I see fit. (For example, in our application we have exactly three stored fields, two identifiers and BLOB which contains a complex bincode-based format.)

@ChillFish8
Copy link
Collaborator

I am not sure, because if I really want the document store to take full advantage of all the byte-and-bit-packing tricks Tantivy has to offer. If I want a custom serialization format, I can already put Vec into the document store and deserialize it as I see fit. (For example, in our application we have exactly three stored fields, two identifiers and BLOB which contains a complex bincode-based format.)

Perhaps yes, although I think maybe it is application-specific where being able to get a borrowed buffer, I'm thinking of something like bitcode that does some great specialization for floating points in order to improve compression ratios and performance.

@adamreichold
Copy link
Collaborator Author

adamreichold commented Apr 17, 2024

although I think maybe it is application-specific where being able to get a borrowed buffer

I think borrowing byte slices conflicts with document store compression?

I'm thinking of something like bitcode that does some great specialization for floating points in order to improve compression ratios and performance.

But you should be able to put bytes serialized using bitcode instead of bincode into the document store can you not? (We are also currently evaluating whether to replace bincode by bitcode, but our Tantivy schema does not care.)

@ChillFish8
Copy link
Collaborator

But you should be able to put bytes serialized using bitcode instead of bincode into the document store can you not? (We are also currently evaluating whether to replace bincode by bitcode, but our schema Tantivy does not care.)

Absolutely yes

I think borrowing byte slices conflicts with document store compression?

I am not sure, the compression, and decompressing will create a new owned buffer, but last time I went through the code it should be possible I believe to allow for borrowing of that decompressed buffer. Saying that though, I'm not sure whether that leads to any gains or losses in performance. It is probably quite niche.

@adamreichold
Copy link
Collaborator Author

I would like to ping this as making the default document type more opinionated increases the need for full flexibility when writing custom deserializers.

I think the main design question here is whether to add this as additional API or wholesale replace the deserialization traits with their stateful variants.

…eSeed

This is modelled on Serde's `DeserializeSeed` include how the relevant API entry
points gain a `_seed` variant. It can be used for example to obtain runtime
field ID values when deserializing a struct field by field without relying on
the order of the fields as written to/read from the document store.
@adamreichold adamreichold force-pushed the document-deserialize-seed branch from 9e73734 to e9c16a4 Compare May 23, 2024 06:30
@PSeitz
Copy link
Contributor

PSeitz commented May 27, 2024

I would like to ping this as making the default document type more opinionated increases the need for full flexibility when writing custom deserializers.

I think the main design question here is whether to add this as additional API or wholesale replace the deserialization traits with their stateful variants.

I think the question is if the added complexity justifies the potential performance gain. The de-serialization step is usually dwarfed by other parts of the search execution. Maybe you can share some numbers or your use case.

@adamreichold
Copy link
Collaborator Author

I think the question is if the added complexity justifies the potential performance gain.

I do not want to frame this as a performance question. Without the state, some kinds of documents just cannot be deserialized from the stored representation.

One would always have to build an intermediate representation to later on extract these documents. And if the limits chosen for CompactDoc are insufficient, it means having a custom intermediate representation and a custom final representation.

So from my point of view, this is mainly a question of reducing complexity in downstream code.

As proposed, the API does not become any more complex for users how do not want the state and there should be no runtime overhead at all.

I admit there is additional implementation complexity, but I think it was relatively obvious that this would become a Serde reimplementation when DocumentDeserialize was first introduced.

Finally, note that when Tantivy's document store becomes are primary store, search is not the only use case. For example, we have endpoints which just stream documents and therefore do nothing but deserialize and serialize documents. (Without having done a CPU profile, I suspect that decompression would dominate in that case, at least if the workload does not fit into the block cache.)

@adamreichold
Copy link
Collaborator Author

One relevant alternative I see, at least for the case of Rust structure with statically known fields, is driving the deserialization by field ID and names. This would allow matching fields without having access to the runtime field ID but just to the also statically known names. I fear this would make the general case more complicated though and also likely requiring looking that information up during deserialization even when it isn't needed.

@PSeitz
Copy link
Contributor

PSeitz commented May 27, 2024

If you want to reduce complexity, you can convert your document from the JSON value instead of implementing the deserialize trait. The whole document as trait stuff is about indexing performance, not ergonomics.

If someone has more than u16:MAX fields I would be interested in their use case. I think it's quite unlikely (and unnecessary with JSON fields now). The limit could be removed easily.

@adamreichold
Copy link
Collaborator Author

I really don't understand the abrasiveness.

This is relatively simple change that closely follows Serde's precedent and would enable a straight-forward conversion from and to Rust structures as documents which is impossible to do otherwise. Personally, I don't think serde_json::Value is reasonable approach to that use case. Especially compared to the cost of supporting it directly.

And yes, my use case does fit into CompactDoc. But I just don't think we should force downstream projects into unnecessary API detours if the solution is a +92-12 with no stated technical defects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants