
[Feature Request] [stdlib] [proposal] Define BinarySerializable, BinaryDeserializable, and BinaryStreamable traits #3747

Closed
martinvuyk opened this issue Nov 5, 2024 · 10 comments
Labels: enhancement, mojo, mojo-repo

Comments

@martinvuyk
Contributor

martinvuyk commented Nov 5, 2024

Review Mojo's priorities

What is your request?

Define Serializable, Deserializable, and Streamable traits

What is your motivation for this change?

#3744 and #3745 are reference implementations of Writer for those collection types. As they stand they make little sense, because the concept of Writer is effectively constrained to String streaming. This needs to scale out further.

Any other details?

Proposals

I will bunch up many thoughts inside this proposal, they can be implemented or not individually.

1. Rename Span to Buffer

Buffer is a well-known word and concept; it is the preferred term in C. Span is less clear for non-native English speakers and is not one of the first words one thinks of when talking about data stored in memory. Buffer is also what best describes this structure, since it can also mutate (and in the future maybe also own) the data.

2. Prioritize and strengthen generics for working with Span

Parametrized origins are very clunky. See #3744 for an example of a method to enable reads or writes to a Span.
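
For illustration, a rough sketch (hypothetical code, not the actual #3744 change) of the ceremony currently needed for a generic function that writes through a Span:

# Hypothetical sketch only: the origin parameters have to be spelled out by
# every function that accepts a Span, and mutability is checked separately.
fn set_first[
    is_mutable: Bool, origin: Origin[is_mutable].type
](span: Span[Byte, origin], value: Byte):
    constrained[is_mutable, "span must point to mutable data"]()
    # in practice one often falls back to the unsafe pointer for the write,
    # which is part of the clunkiness being described here
    span.unsafe_ptr()[0] = value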

3. Define a generic trait for binary serialization

Proposed pseudocode (we don't have parametrized traits yet, so we can constrain it to Byte for now).
(A type could potentially serialize to more than one kind of Buffer; e.g. String could be encoded to different UTF standards.)

trait BinarySerializable[D: DType]:
    fn __encode__[
        is_mutable: Bool, origin: Origin[is_mutable].type
    ](self) -> Buffer[Scalar[D], origin]:
        ...
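
As a rough illustration of conformance (the Header type is hypothetical, and the body is elided just like the pseudocode above):

# Hypothetical conforming type: a small fixed-layout packet header.
struct Header(BinarySerializable[DType.uint8]):
    var kind: Byte
    var length: Byte

    fn __encode__[
        is_mutable: Bool, origin: Origin[is_mutable].type
    ](self) -> Buffer[Byte, origin]:
        # return a buffer viewing (or holding) the header's bytes; how the
        # origin is tied to `self` is left open, exactly as in the trait above
        ...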

4. Define a generic trait for binary deserialization

(A type could potentially deserialize from more than one kind of Buffer; e.g. String could be decoded from different UTF standards.)

trait BinaryDeserializableByCopy[D: DType]:
    fn __init__(out self, *, buffer: Buffer[Scalar[D]]):
        ...

# if we do make Buffer able to own the data
trait BinaryDeserializableZeroCopy[D: DType]:
    fn __init__[O: MutableOrigin](
        out self,
        *,
        owned buffer: Buffer[Scalar[D], O],
        self_is_owner: Bool,
    ):
        ...
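
And a rough usage sketch, assuming the hypothetical Header type from the previous section also conforms to BinaryDeserializableByCopy[DType.uint8]:

fn parse_header(data: Buffer[Byte]) -> Header:
    # copying deserialization: Header reads what it needs out of `data`
    return Header(buffer=data)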

5. Define generic traits for binary streaming

Libc's abstraction of streams that get flushed has stood the test of 50 years of use across very different domain-specific logic. It is not perfect, but it is a great starting point.

  • The error handling model needs improvement, but we can iterate on that. The current Writer trait has no error handling at all, which is not good for interfacing with anything that does I/O.
  • The lazy nature of streams has the potential to allow lazy serialization, which we currently don't have with the Writer trait.

trait BinaryStreamable[D: DType]:
    fn __send__[S: BinaryStream[D]](self, *, mut stream: S):
        ...

trait BinaryStream[D: DType]:
    fn __recv__(mut self, *, buffer: Buffer[Scalar[D]]): # append to queue, write directly, etc.
        ...
    fn __flush__(mut self) -> Int: # Amount of items transmitted. -1 on error.
        ...
    fn __error__(self) -> Int: # error code
        ...
    fn __error_msg__(self) -> StaticString: # error message
        ...
what this model enables

current writer trait

fn main():
    stream = String(capacity=1)
    a = Span(List[Byte](97))
    # every call reserves stream.byte_length() + len(a)
    stream.write_bytes(a) # no-op reserve since it's big enough
    stream.write_bytes(a) # does a realloc and copies everything
    stream.write_bytes(a) # does a realloc and copies everything

what the streamable trait does

fn main():
    stream = String(capacity=1)
    a = Span(List[Byte](97))
    # every call reserves stream.byte_length() + len(a)
    send(a, stream) # no-op reserve since it's big enough
    send(a, stream)  # does a realloc and copies everything
    send(a, stream)  # does a realloc and copies everything

where does the difference come into play?

fn concat_a(some_message: ConcatString):
    a = Span(List[Byte](97))
    send(a, some_message) # we can even provide a 'write' API that does this internally


fn main() raises:
    # ConcatString is a fictitious struct which has a queue for later concatenating `Buffer[Byte]`
    stream = ConcatString(capacity=2)
    # every call appends the item's buffer to a queue
    concat_a(stream) # no-op append since it's big enough
    concat_a(stream) # no-op append since it's big enough
    concat_a(stream) # append does a realloc to 2x size and copies the buffers (not the data)
    # ConcatString.__str__() internally calls flush on the stream and raises on error
    # it also can count the total amount of bytes the string will need and do 1 syscall to alloc
    final_string = str(stream)
    # if instead of building a final_string this were to be sent to print(stream), it would write things
    # directly to the underlying file stream by doing something along the lines of
    # fn __send__[S: BinaryStream[Byte]](self: ConcatString, *, stream: S)
    #    for b in self.queue:
    #        recv(stream, b)

This same abstraction will work all the way down to sockets and file streams, because it's practically the same. The potential for zero-copy data structures and direct file memory mapping deserialization is huge. Most importantly, this is simple and easy to use; everyone can understand that you send things to a pipe/stream and then you have to flush it. It might also be interesting to make this a linear type where the user must call flush at some point or the code won't compile.
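
To make the fictitious ConcatString above a bit more concrete, here is a rough sketch of how such a stream could conform to BinaryStream (names and details are illustrative only; error reporting is stubbed out):

# Illustrative only: a stream that queues (pointer, length) pairs and
# concatenates them in a single pass when flushed.
struct ConcatString(BinaryStream[DType.uint8]):
    var queue: List[Buffer[Byte]]
    var total_bytes: Int

    fn __init__(out self, *, capacity: Int):
        self.queue = List[Buffer[Byte]](capacity=capacity)
        self.total_bytes = 0

    fn __recv__(mut self, *, buffer: Buffer[Byte]):
        # no copy here, only bookkeeping
        self.queue.append(buffer)
        self.total_bytes += len(buffer)

    fn __flush__(mut self) -> Int:
        # allocate `total_bytes` once, copy each queued buffer exactly once
        ...

    fn __error__(self) -> Int:
        return 0

    fn __error_msg__(self) -> StaticString:
        return ""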

6. Define convenience builtin methods

fn decode[D: DType, S: BinaryDeserializable[D]](buffer: Buffer[Scalar[D]]) -> S:
    return S(buffer=buffer)

fn decode[S: BinaryDeserializable[Byte] = String](buffer: Buffer[Byte]) -> S:
    return S(buffer=buffer)

fn decode(buffer: Buffer[Byte], encoding: StringLiteral) -> String:
    return String(buffer=buffer, encoding=encoding)

fn encode[D: DType, S: BinarySerializable[D]](v: S) -> Buffer[Scalar[D]]:
    return v.__encode__()

fn encode[S: BinarySerializable[Byte]](v: S) -> Buffer[Byte]:
    return v.__encode__()

fn encode(buffer: Buffer[Byte], encoding: StringLiteral) -> String:
    return String.__encode__(buffer=buffer, encoding=encoding)

fn bytes[S: BinarySerializable[Byte]](v: S) -> Buffer[Byte]:
    return v.__encode__()

fn bytes(buffer: Buffer[Byte], encoding: StringLiteral) -> String:
    return encode(buffer, encoding)

fn send[D: DType, S: BinaryStreamable[D], St: BinaryStream[D]](v: S, mut stream: St):
    return v.__send__(stream=stream)

fn recv[D: DType, St: BinaryStream[D]](mut stream: St, buffer: Buffer[Scalar[D]]):
    return stream.__recv__(buffer=buffer)
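
A rough end-to-end sketch of how these helpers would read at a call site, using the hypothetical Header and ConcatString types sketched earlier:

fn roundtrip(header: Header, mut stream: ConcatString) -> Header:
    # eager path: serialize to a standalone buffer
    buf = encode(header)
    # lazy path: hand the value to a stream and let it decide when to flush
    send(header, stream)
    # reconstruct a value from the buffer on the receiving side
    return decode[DType.uint8, Header](buf)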

Effects this will have

This will allow passing fat pointers all the way down for many APIs. We will stop caring so much about collection types copying their items, since we will be dealing with pointer + length combinations for everything, with the added safety of Origin.

This will positively influence the handling of any future networking and data-interchange implementations and build scaffolding for some neat code. I'm currently developing the Socket package (waiting for async Mojo :( ) and using Span[Byte] purely as the data interchange format. But if we can define these traits generically, many improvements will follow, since users would be able to implement their own streaming, serialization, and deserialization logic.

@martinvuyk added the enhancement and mojo-repo labels on Nov 5, 2024
@lsh
Contributor

lsh commented Nov 8, 2024

In general I think the Span parts of this proposal are mostly unrelated to its real meat and would be better served as separate proposals.

The name Span was picked because it aligns with industry conventions in C++. If you look at many C or C++ libraries, that's what this data structure is called. That being said, I'm fine with another name like View since it's kind of like memoryview if you squint at it. I'm not necessarily opposed to Buffer, but it's a longer name and goes against conventions.

I also go back and forth on whether a serialization/deserialization trait belongs in the standard library, because it makes improving that data model difficult. For example, in Rust serde is the standard, but libraries like rkyv are able to achieve some great benchmark times by using a different data model.

@martinvuyk
Contributor Author

Hi @lsh, thanks for mentioning those examples. I went and looked at the code, it's not so dissimilar to what I'm proposing here.

SerDe's serializer implements the serialization for every type basically, which is not ideal. Mojo's goal is for each type to implement their own thing.

rkyv is very similar to this proposal:

/// Converts a type to its archived form.
///
/// Objects perform any supportive serialization during
/// [`serialize`](Serialize::serialize). For types that reference nonlocal
/// (pointed-to) data, this is when that data must be serialized to the output.
/// These types will need to bound `S` to implement
/// [`Writer`](crate::ser::Writer) and any other required traits (e.g.
/// [`Sharing`](crate::ser::Sharing)). They should then serialize their
/// dependencies during `serialize`.
///
/// See [`Archive`] for examples of implementing `Serialize`.
pub trait Serialize<S: Fallible + ?Sized>: Archive {
    /// Writes the dependencies for the object and returns a resolver that can
    /// create the archived type.
    fn serialize(&self, serializer: &mut S)
        -> Result<Self::Resolver, S::Error>;
}

/// Converts a type back from its archived form.
///
/// Some types may require specific deserializer capabilities, such as `Rc` and
/// `Arc`. In these cases, the deserializer type `D` should be bound so that it
/// implements traits that provide those capabilities (e.g.
/// [`Pooling`](crate::de::Pooling)).
///
/// This can be derived with [`Deserialize`](macro@crate::Deserialize).
pub trait Deserialize<T, D: Fallible + ?Sized> {
    /// Deserializes using the given deserializer
    fn deserialize(&self, deserializer: &mut D) -> Result<T, D::Error>;
}

The API in C is really just passing a stream pointer (a pointer to a file), and people build stuff around that (mostly safety nets). There are only so many ways you can represent a pointer and a length. This is the same thing, but with the added safety of knowing the length and the mutability of the origin; and if proposal #3728 gets accepted, we will also have the added benefit that a Buffer's data can be owned, so the library author knows they don't need to copy the data. The simple model of passing pointers around and letting everyone implement their own serialization and deserialization logic has stood the test of time; this is just a prettier interface for it.

Proposal #3728 and the fact that mutable origins exist are also the reason why I'm rooting for Buffer and not View: if you can own and mutate the data, then it's not really a view.

@melodyogonna

Why are we doing streaming before the language gets support for generators?

@martinvuyk
Contributor Author

@melodyogonna They aren't needed, though they may be a nice abstraction. I'll update the Streaming section of the proposal with an idea to go full on C mode that I think will answer that question.

FYI, the current writer abstraction also doesn't make use of generators or flushing: every time you call writer.write(), it checks the underlying pointer's length and resizes to fit each new item to be added, moving the pointed-to data if the new allocation is bigger than the current one.

@lsh
Contributor

lsh commented Nov 16, 2024

@martinvuyk

I went and looked at the code, it's not so dissimilar to what I'm proposing here.

This isn't exactly right. If I understand correctly, you're proposing to not have a data model at all; instead it seems you're conflating the Read/Write traits with Deserialize/Serialize, which can be considered distinct concepts. serde and rkyv are both built around data models to make multiple targets easier. What you describe as

SerDe's serializer implements the serialization for every type basically

is the data model. Because your methods don't have a data model or a concept of Serializer/Deserializer, the API's abilities and limitations are pretty different. With serde, the idea is that one doesn't have to think as often "how does my data type map to the target?" so much as "how does my data type map to the data model?", and then the relevant (de)serializer answers the question "how does the data model map to the target?" In this way, once your struct implements Serialize or Deserialize in serde, you can swap in the relevant Deserializer/Serializer for your type and get whatever target you want. You can even serialize from one type to another type.
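
To make that distinction concrete, a rough Mojo-flavored sketch of the shape of such a design (hypothetical names that only mirror serde's structure, not a proposed API):

# Hypothetical sketch of the data-model approach, for comparison only.
trait Serializer:
    # The data model: a small, closed set of primitives every format must handle.
    fn serialize_int(mut self, value: Int):
        ...
    fn serialize_str(mut self, value: String):
        ...
    fn serialize_list_begin(mut self, length: Int):
        ...
    fn serialize_list_end(mut self):
        ...

trait Serialize:
    # Types describe themselves in data-model terms; the Serializer decides how
    # that maps onto JSON, a binary format, or any other target.
    fn serialize[S: Serializer](self, mut serializer: S):
        ...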

Here's where the design challenges kick in that make it tough to add this to the standard library. If you do define a data model, it's possible that a better data model exists but we're stuck because of backwards compatibility. If you don't define a data model, then it becomes harder to map structs automatically to new serialization/deserialization targets (and automatically generate them via reflection and decorators).

@owenhilyard
Contributor

I'm not a fan of the Serializable trait as written. First, it ignores that endianness exists. Second, it ignores vectored io.

Consider the following struct.

struct Foo[N: UInt]:
    var a: UInt32
    var b: InlineArray[TrivialType, N]
    var c: String
    var d: List[TrivialType]

The most efficient way to write this to a file (assuming native endian) is this:

fn write_to_file[N: UInt](fd: FileDescriptor, foo: Foo[N]):
  var vecs = InlineArray[IoVec, 4]()
  # a and b live inline in the struct, so the iovecs can point straight at them
  vecs[0].iov_len = sizeof[UInt32]()
  vecs[0].iov_base = UnsafePointer.address_of(foo.a)
  vecs[1].iov_len = sizeof[InlineArray[TrivialType, N]]()
  vecs[1].iov_base = UnsafePointer.address_of(foo.b)
  # c and d own heap buffers; point at those buffers instead of copying them
  vecs[2].iov_len = len(foo.c._buffer)
  vecs[2].iov_base = foo.c._buffer.data
  vecs[3].iov_len = len(foo.d) * sizeof[TrivialType]()
  vecs[3].iov_base = foo.d.data
  # a single vectored write submits all four segments in one syscall
  fd.writev(UnsafePointer.address_of(vecs), 4)

This is totally incompatible with the proposed API. You may not be thinking about the cost of the copies, but if we take N=1 million, and sizeof[TrivialType]() == 256, this quickly turns into unnecessarily copying a quarter GB of data. Additionally, it will cause buffer bloat issues because it forces you to make your buffers as large as the largest object you intend to serialize.

We also need to assume that this trait will be used for both JSON and binary serialization, which means that you can't blindly serialize; you need to know the target format. This forces us into the serializer model that Rust libraries tend to use, which lets you hide a buffer pool behind the serializer if you need it, as well as determine what the byte format for a particular Mojo type in a particular format should be. However, if you only have local information about a type, some types of serialization are impossible to implement, like ASN.1 (used for things like TLS certificates, LDAP (MS Active Directory), SNMP, a variety of standard scientific formats, and Kerberos). This forces a serializer to buffer the entire serialization until it has complete type information and can make a decision about how to pack the data (for example, ASN.1 packed encodings typically require bit-packing booleans from disparate parts of the message).

We also need to be aware of alignment requirements if we want zero-copy, so you need to know both the alignment of the buffer the data will eventually be serialized into and the current offset into the write buffer. This gets more complicated in the presence of scatter/gather, but is similar in principle.

Given that large copies exist, we also want async variants of everything, since Intel and AMD both package DMA accelerators on their CPUs now, and because we may want to stream data out, which means having async inside of the serialization as well as for IO.

I'm also not sure if requiring byte granularity is desirable. Many networked systems are heavily bottlenecked on bandwidth (likely anything with less than 1 Gbps of bandwidth per CPU core) and prefer to bit-pack items.

As presently written, I think this would result in bottlenecks even for APIs as minimal as POSIX sockets, and would make serialization costs orders of magnitude higher than IO costs for better APIs. I am SURE that this would cause havoc above 100 Gbps due to the amount of unnecessary copies and/or extra syscall overhead, and, as I mentioned earlier, local information is not sufficient to determine the proper encoding for many formats. I strongly suggest that we wait until we have reflection, at which point an implementation of a format can inspect both the output type (buffer, stream socket, message socket, file, device memory, etc) and the input data types to determine a good way to serialize the data. rkyv isn't much better, since it ignores scatter/gather io.

My design for networking in std is blocked on the ability to have custom allocators (because the current one has no way to allocate DMA-safe memory), as well as trait objects, custom MLIR dialects (for P4), parametric traits, conditional trait impls, and negative trait bounds. The target performance is in the multiple Gbps per CPU core range (hopefully 10+ Gbps), which is why it will see a magnified version of any overheads added to serialization. Ideally, you, the user, should never copy more than a few cache lines, and the network card or NVMe drive should do all of the copying since they can do that without taxing the CPU. Depending on the available memory bandwidth, it may even be preferable to task a GPU with the copies if you can do other useful work while it finishes up the serialization.

@martinvuyk
Contributor Author

I went and looked at the code, it's not so dissimilar to what I'm proposing here.

This isn't exactly right. If I understand correctly, you're proposing to not have a data model at all; instead it seems you're conflating the Read/Write traits with Deserialize/Serialize, which can be considered distinct concepts. serde and rkyv are both built around data models to make multiple targets easier. What you describe as

SerDe's serializer implements the serialization for every type basically

is the data model. Because your methods don't have a data model or a concept of Serializer/Deserializer, the API's abilities and limitations are pretty different.

@lsh ok wow, that's cool. I definitely took too quick a look. And yes, I do mean binary serialization; for me, serialization is when data is changed to a bitwise format to be encoded into voltages and sent over a literal wire (UART, I2C, SPI, CAN, RS-232, etc.).

Here's where the design challenges kick in that make it tough to add this to the standard library. If you do define a data model, it's possible that a better data model exists but we're stuck because of backwards compatibility. If you don't define a data model, then it becomes harder to map structs automatically to new serialization/deserialization targets (and automatically generate them via reflection and decorators).

And no I don't think we should define a data model beyond that, since other people can come up with better ideas down the road. And fancy things like reflecting over struct fields are better left for external libraries as well IMO.

This is totally incompatible with the proposed API. You may not be thinking about the cost of the copies, but if we take N=1 million, and sizeof[TrivialType]() == 256, this quickly turns into unnecessarily copying a quarter GB of data. Additionally, it will cause buffer bloat issues because it forces you to make your buffers as large as the largest object you intend to serialize.

@owenhilyard I think the implementation could just have a List[Buffer] and lazily build the vectored io structure you showed once __flush__ is called on it (?). You could also have a struct with the same API that is specifically designed to vectorize another struct X; the API can remain mostly the same, I think.
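
Something along these lines (a rough sketch only; it borrows the IoVec/writev shape from your example, FileDescriptor.writev and the IoVec constructor are assumed rather than real APIs, and error reporting is omitted):

# Illustrative only: queue buffers on __recv__, build the iovec array lazily
# on __flush__, and issue a single vectored write with no payload copies.
struct VectoredStream(BinaryStream[DType.uint8]):
    var pending: List[Buffer[Byte]]
    var fd: FileDescriptor

    fn __recv__(mut self, *, buffer: Buffer[Byte]):
        self.pending.append(buffer)  # (pointer, length) bookkeeping only

    fn __flush__(mut self) -> Int:
        var vecs = List[IoVec](capacity=len(self.pending))
        for buf in self.pending:
            vecs.append(IoVec(iov_base=buf.unsafe_ptr(), iov_len=len(buf)))
        # one writev call for everything queued so far
        return self.fd.writev(vecs.unsafe_ptr(), len(vecs))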

However, if you only have local information about a type, some types of serialization are impossible to implement, like ASN.1 (used for things like TLS certificates, LDAP (MS Active Directory), SNMP, a variety of standard scientific formats, and Kerberos). This forces a serializer to buffer the entire serialization until it has complete type information and can make a decision about how to pack the data (for example, ASN.1 packed encodings typically require bit-packing booleans from disparate parts of the message).

Same thing, can't you implement a serializer with the same API that is constrained[_type_is_eq[T, SomeSpecificStruct]()]() ?

We also need to be aware of alignment requirements if we want zero-copy, so you need to know both the alignment of the buffer the data will eventually be serialized inside of as well as the current offset into the write buffer. This gets more complicated in the presence of scatter/gather, but is similar in principle.

Wouldn't that be the responsibility of the implementation of the stream? You could again have a List[SomeSpecificStruct] and the underlying function constrained, and when __flush__ is called, you add all the logic for alignment; or if your stream struct is sent to another stream you align everything and send it properly.

Given that large copies exist, we also want async variants of everything since Intel and AMD both package DMA accelerators on their CPUs now, and because we many want to stream data out which means having async inside of the serialization as well as for IO.

Yeah, we could also make the __flush__, __send__ and __recv__ functions always be async and in sync contexts users would just add an await in front 🤷‍♂️ . I'm also thinking of making socket's API async and not offering a sync alternative.

As presently written, I think this would result in bottlenecks even for APIs as minimal as POSIX sockets, and would make serialization costs orders of magnitude higher than IO costs for better APIs. I am SURE that this would cause havoc above 100 Gbps due to the amount of unnecessary copies and/or extra syscall overhead

I'm not sure if you observed something I'm not getting or if you think the stream will do a syscall on each __recv__, that would depend on the implementation of the stream. Span is just a pointer and a length, storing lists of them has trivial costs in memory and no costs in I/O if you do it lazily. It's just an intermediary between the sender (streamable struct) and the receiver (OS networking/storage/serial interface), or it can itself be the abstraction over an OS interface. The current Writer implementations do copy everything onto a buffer and then send things to a stream, I'm not proposing that.

I strongly suggest that we wait until we have reflection, at which point an implementation of a format can inspect both the output type (buffer, stream socket, message socket, file, device memory, etc) and the input data types to determine a good way to serialize the data. rkyv isn't much better, since it ignores scatter/gather io.

I'm not proposing the most generic of APIs where a stream can serialize anything, the idea is you add constraints inside the methods for what your specific stream supports. I would just like to define an API to which all kinds of streaming can be adapted to in a way that doesn't limit their capabilities. So if for example someone figures out a clever way to implement all you're describing, they can just plug it into stdlib methods in an organic way because they conform to the traits.

My design for networking in std is blocked on the ability to have custom allocators (because the current one has no way to allocate DMA-safe memory)

I'd love the inputs of both of you on #3728 btw. (and also #3766 which is a bit unrelated, but I know you'll probably be some of the heavy users)

as well as trait objects, custom MLIR dialects (for P4), parametric traits, conditional trait impls, and negative trait bounds.

That is a lot of requirements and I'm worried it will take too long. I'm already seeing people doing gymnastics to adapt to the Writer trait, which I think is still lacking; they forget that it still does syscalls every time, and they forget to go down low and allocate space for the String beforehand. If we define this kind of lazy serialization mechanism as early as possible, it will do the thinking for them and allocate the buffer for the whole serialization at once (without all the unnecessary copies and reallocs that the Writer trait does), or send the buffers to a stream without ever copying them.

The target performance is in the multiple Gbps per CPU core range (hopefully 10+ Gbps), which is why it will see a magnified version of any overheads added to serialization. Ideally, you, the user, should never copy more than a few cache lines, and the network card or NVMe drive should do all of the copying since they can do that without taxing the CPU. Depending on the available memory bandwidth, it may even be preferable to task a GPU with the copies if you can do other useful work while it finishes up the serialization.

That would be amazing. I totally agree that that should be the target 🔥 , I just want us to define an interface where all of that is possible without going into fancy things like struct specific reflection etc. That would be the responsibility of the implementation, the API should be generic enough for all use cases.

@martinvuyk changed the title from "[Feature Request] [stdlib] [proposal] Define Serializable, Deserializable, and Streamable traits" to "[Feature Request] [stdlib] [proposal] Define BinarySerializable, BinaryDeserializable, and BinaryStreamable traits" on Nov 17, 2024
@owenhilyard
Contributor

Here's where the design challenges kick in that make it tough to add this to the standard library. If you do define a data model, it's possible that a better data model exists but we're stuck because of backwards compatibility. If you don't define a data model, then it becomes harder to map structs automatically to new serialization/deserialization targets (and automatically generate them via reflection and decorators).

And no I don't think we should define a data model beyond that, since other people can come up with better ideas down the road. And fancy things like reflecting over struct fields are better left for external libraries as well IMO.

I think that there is good reason to make serialization reflection-based. Doing more work at compile time means that hot-loop serialization can be more heavily optimized. My thought is that the API would leverage type equality to allow for "overrides" by the user, where the user provides a compile-time mapping of types to handler functions that is overlaid on top of the one provided by the format (which would special-case List, Dict, String, etc.). Some type of annotations would also be useful (for instance, a request to treat a var member as if it had a different name, or to format a timestamp in a particular way).

I don't think it's possible to make an API which can properly handle the wide variety of formats well if it can't see both the input and output of the operation clearly. We could, of course, design a very detailed API for asking questions of types on both sides (ex: Does the type efficiently do vectored serialization? How many iovecs does it need to do vectored serialization? Is there an upper bound on the size of the copy if you flatten the type? Is the type trivial? Is there any arbitrary user information attached to the serializer, such as a format version?), but I think that way is going to be a very manual way of reimplementing the information that could be handled via reflection.

If we look at the Zig JSON implementation, it's not that verbose but still has room to implement a bunch of JSON-specific optimizations. A reflection-based approach which is also able to make use of traits/annotations to figure out the intent (ex: serialize me like a list) should be much easier for users. This is especially true since without reflection we run into the problem of every single struct needing a serialize implementation, which is the bane of my existence in Rust when some library forgets to add it and then I need to use the remote-derive hacks. A reflection-based approach can work without annotations for the general case, and only require effort from the user for data structures which don't implement a "how to serialize me" trait.

This is totally incompatible with the proposed API. You may not be thinking about the cost of the copies, but if we take N=1 million, and sizeof[TrivialType]() == 256, this quickly turns into unnecessarily copying a quarter GB of data. Additionally, it will cause buffer bloat issues because it forces you to make your buffers as large as the largest object you intend to serialize.

@owenhilyard I think the implementation could just have a List[Buffer] and lazily build the vectored io structure you showed once __flush__ is called on it (?). You could also have a struct that has the same API that is specifically designed to vectorize X other struct, the API can remain mostly the same I think.

The point of this was to show that you can more efficiently serialize things that have collections of trivial types (like Strings), using writev, especially if the data is large. If I want to serialize a single struct, then making a dedicated protocol for it is fine. The issue comes when you have 20 structs like this spread across 3 libraries. Ideally, a program should use arena allocators for arrays of iovec so that you can support async io well, which leads to a desire to, if possible, know the bounds on the number of iovecs needed for an operation. If you use List, even if you know exactly how many items you need, you then have double indirection, which means touching a cache line unnecessarily, since you will likely be getting the List from an object pool (since resizable arrays tend to dislike having their backing allocation come from an arena). There are also times when it would be better to slowly fill an array of iovecs and then do a "flush" to clear the buffer in the middle of serializing to avoid allocations (TCP or non-atomic file IO).

If we need to make a serializer for each unique combination of things we want to serialize efficiently, then why have the API at all? Just make a function for each type. I think that making special serializers is going to quickly produce a process that is something better accomplished via reflection.

However, if you only have local information about a type, some types of serialization are impossible to implement, like ASN.1 (used for things like TLS certificates, LDAP (MS Active Directory), SNMP, a variety of standard scientific formats, and Kerberos). This forces a serializer to buffer the entire serialization until it has complete type information and can make a decision about how to pack the data (for example, ASN.1 packed encodings typically require bit-packing booleans from disparate parts of the message).

Same thing, can't you implement a serializer with the same API that is constrained[_type_is_eq[T, SomeSpecificStruct]()]() ?

I can, but then I end up only being able to serialize that struct, so what happens is I end up having to serialize a concrete type or a known set of parameters, which gets me back to hand-writing an implementation for every encode and decode call in my program.

We also need to be aware of alignment requirements if we want zero-copy, so you need to know both the alignment of the buffer the data will eventually be serialized inside of as well as the current offset into the write buffer. This gets more complicated in the presence of scatter/gather, but is similar in principle.

Wouldn't that be the responsibility of the implementation of the stream? You could again have a List[SomeSpecificStruct] and the underlying function constrained, and when __flush__ is called, you add all the logic for alignment; or if your stream struct is sent to another stream you align everything and send it properly.

There would need to be 2-way communication with the stream. For serialization formats which are closer to SoA, you can't cleanly say "this whole struct needs to be at alignment X", you need it to be per-member. Then, you have preferences, like preferring to align to 512 bits to use better instructions but tolerating lower alignment, or 128 bit instructions tolerating 64 bit alignment but strongly preferring to not cross cache lines, which means the user needs a lever to control that serialized size vs alignment tradeoff. For ISAs with variable-width SIMD, you can only make the determination of what alignment is needed at runtime.

Given that large copies exist, we also want async variants of everything since Intel and AMD both package DMA accelerators on their CPUs now, and because we many want to stream data out which means having async inside of the serialization as well as for IO.

Yeah, we could also make the __flush__, __send__ and __recv__ functions always be async and in sync contexts users would just add an await in front 🤷‍♂️ . I'm also thinking of making socket's API async and not offering a sync alternative.

I think we need to see more of Mojo's async machinery first. It seems to be decent at unrolling functions that are actually sync into sync code, but I'm not sure how well that will hold up in non-toy examples. However, for things where the program can't advance until the call returns, blocking syscalls are generally more efficient. Many scripting use cases run into this, and in that case it's better to have the core that submitted the work go do the network processing, because then the data is in L1 cache when you look at the result.

As presently written, I think this would result in bottlenecks even for APIs as minimal as POSIX sockets, and would make serialization costs orders of magnitude higher than IO costs for better APIs. I am SURE that this would cause havoc above 100 Gbps due to the amount of unnecessary copies and/or extra syscall overhead

I'm not sure if you observed something I'm not getting or if you think the stream will do a syscall on each __recv__, that would depend on the implementation of the stream. Span is just a pointer and a length, storing lists of them has trivial costs in memory and no costs in I/O if you do it lazily. It's just an intermediary between the sender (streamable struct) and the receiver (OS networking/storage/serial interface), or it can itself be the abstraction over an OS interface. The current Writer implementations do copy everything onto a buffer and then send things to a stream, I'm not proposing that.

Lists have trivial costs IF you know what size to make them up front. If you have to guess you either end up with buffer bloat or hot loop allocations. I think that having an intermediary is helpful, but I don't think that you can get a sufficient amount of information on what you are doing as an intermediary to be efficient without either a massive inspection API or reflection. This is especially relevant for things like "Are the Lists/Strings in this struct in DMA-safe memory?" which can force copies for some types of IO and is very expensive to query.

I strongly suggest that we wait until we have reflection, at which point an implementation of a format can inspect both the output type (buffer, stream socket, message socket, file, device memory, etc) and the input data types to determine a good way to serialize the data. rkyv isn't much better, since it ignores scatter/gather io.

I'm not proposing the most generic of APIs where a stream can serialize anything, the idea is you add constraints inside the methods for what your specific stream supports. I would just like to define an API to which all kinds of streaming can be adapted to in a way that doesn't limit their capabilities. So if for example someone figures out a clever way to implement all you're describing, they can just plug it into stdlib methods in an organic way because they conform to the traits.

While this could work, I think it will lead to a continually expanding inspection API to provide sufficient information to the stream. For people working in high performance areas like myself, "Can I DMA this?" is an important question, but if someone who doesn't know what they're doing fills out the introspection API incorrectly then the kernel will kill your program or (if using older DMA drivers), you could just overwrite arbitrary physical memory. I think that having users fill it out by hand will lead to APIs either having to be very conservative in what they accept or requiring users to know about things like DCOM, Slingshot, Shared-memory computing and disaggregated memory to properly fill out a list of flags and enums in the API.

My design for networking in std is blocked on the ability to have custom allocators (because the current one has no way to allocate DMA-safe memory)

I'd love the inputs of both of you on #3728 btw. (and also #3766 which is a bit unrelated, but I know you'll probably be some of the heavy users)

I'll take a look.

as well as trait objects, custom MLIR dialects (for P4), parametric traits, conditional trait impls, and negative trait bounds.

That is a lot of requirements and I'm worried it will take too long. I'm already seeing people doing gymnastics to adapt to the Writer trait, which I think is still lacking; they forget that it still does syscalls every time, and they forget to go down low and allocate space for the String beforehand. If we define this kind of lazy serialization mechanism as early as possible, it will do the thinking for them and allocate the buffer for the whole serialization at once (without all the unnecessary copies and reallocs that the Writer trait does), or send the buffers to a stream without ever copying them.

While I agree the Writer trait is not good, I want to make sure we get this right. We've been promised reflection at some point, and I think it makes more sense to stack this feature on top of that. Zig started in that place and it's worked well for them, C++ is working on doing the same, and removing the serde dependency from half of crates.io while improving performance was the motivating example for reflection in Rust. We have a substantial amount of prior art indicating that even the ability to have full inspection of the local struct is not enough to properly do serialization, and I want to avoid repeating those mistakes by shipping something just good enough that it's hard to move people off of it. Once we have reflection, whatever exists pre-reflection should be fairly easy to discard for the new API since it won't require much library support. As such, I'm happy to let half-solutions exist for a bit while the language catches up; then we can standardize a state-of-the-art solution.

The target performance is in the multiple Gbps per CPU core range (hopefully 10+ Gbps), which is why it will see a magnified version of any overheads added to serialization. Ideally, you, the user, should never copy more than a few cache lines, and the network card or NVMe drive should do all of the copying since they can do that without taxing the CPU. Depending on the available memory bandwidth, it may even be preferable to task a GPU with the copies if you can do other useful work while it finishes up the serialization.

That would be amazing. I totally agree that that should be the target 🔥 , I just want us to define an interface where all of that is possible without going into fancy things like struct specific reflection etc. That would be the responsibility of the implementation, the API should be generic enough for all use cases.

I don't think it's possible to do that without the "fancy things". 10 Gbps of 256-byte messages (a reasonable size for an RPC call that isn't going to a storage service) is ~5 million decodes per second, and then responses would likely be another ~5 million per second. On a 4 GHz CPU (a normal-ish server), this means that if you have a core dedicated to doing serialization and network IO, and nothing else, your receive, RPC decode, response encode, and transmit need to average ~820 clock cycles. You can buy more cycles with more cores, but that starts to hurt your application performance. By the way, UDP sendmsg and recvmsg will each eat ~200 cycles, and TCP is worse. Even if we're charitable and use the DPDK l2fwd numbers (which will never happen in a real application), 780 cycles still isn't great for serialization costs. If you want to do time-based retransmits, rdtsc will eat another ~80 cycles. 200 Gbps means that I can present this problem to a 32-core server and have the same problem on 20 cores while having to do 100 million RPS on the remaining 12 cores.

This is for 10 Gbps; if you think that is unrealistic, then 100 Gbps with 1500-byte messages (the largest you can reliably send over the internet) is worse, giving you 440 cycles per message with DPDK rx/tx times. At that point, I don't think there's really room for anything except the "fancy things" that squeeze as much performance as possible out of serialization. In my opinion, it's irresponsible to design for anything less than 100 Gbps if we expect Mojo to have a long lifespan, since we can expect further consolidation of servers for the sake of power efficiency. The reason I'm pushing for this approach is that if we make something that will tolerate being put under this much pressure, then everyone else benefits immensely, and there isn't a cutoff above which you have to ignore the standard library. A solution designed to be able to perform at 10x or 100x what most people use, if it can be made ergonomic, is a tremendous asset to a language.
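
For reference, the back-of-the-envelope arithmetic behind those budgets (approximate, using the 4 GHz figure above):

10 Gbps / (256 B * 8 bit/B)   ~ 4.9M requests/s in, plus ~4.9M responses/s out
4e9 cycles/s / 4.9M pairs/s   ~ 820 cycles per rx + decode + encode + tx
100 Gbps / (1500 B * 8 bit/B) ~ 8.3M messages/s
4e9 cycles/s / 8.3M msg/s     ~ 480 cycles, ~440 once DPDK rx/tx costs are subtracted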

@martinvuyk
Contributor Author

@owenhilyard thanks for all the explanations, you sent me down a couple of rabbit holes there 😆

We have a substantial amount of prior art indicating that even the ability to have full inspection of the local struct is not enough to properly do serialization, and I want to avoid repeating those mistakes by shipping something just good enough that it's hard to move people off of it. Once we have reflection, whatever exists pre-reflection should be fairly easy to discard for the new API since it won't require much library support. As such, I'm happy to let half-solutions exist for a bit while the language catches up; then we can standardize a state-of-the-art solution.
[...]
This is for 10 Gbps; if you think that is unrealistic, then 100 Gbps with 1500-byte messages (the largest you can reliably send over the internet) is worse, giving you 440 cycles per message with DPDK rx/tx times. At that point, I don't think there's really room for anything except the "fancy things" that squeeze as much performance as possible out of serialization. In my opinion, it's irresponsible to design for anything less than 100 Gbps if we expect Mojo to have a long lifespan, since we can expect further consolidation of servers for the sake of power efficiency. The reason I'm pushing for this approach is that if we make something that will tolerate being put under this much pressure, then everyone else benefits immensely, and there isn't a cutoff above which you have to ignore the standard library. A solution designed to be able to perform at 10x or 100x what most people use, if it can be made ergonomic, is a tremendous asset to a language.

Ok now you're just blowing my mind 🤯. If you seriously think an API that works for such a high-throughput system is achievable in a generic way that we can apply to most things and that has a low entry barrier (or at least that we can implement for most common use cases), I'm sold. Once we have the features, just ping me, I'll be your code monkey 🤣

@martinvuyk closed this as not planned on Nov 17, 2024
@owenhilyard
Contributor

Ok now you're just blowing my mind 🤯. If you seriously think an API that works for such a high-throughput system is achievable in a generic way that we can apply to most things and that has a low entry barrier (or at least that we can implement for most common use cases), I'm sold. Once we have the features, just ping me, I'll be your code monkey 🤣

I'm glad I convinced you. My approach to a lot of things like this is to design for scenarios which are pushing the limits of the hardware and have that as the base API, then build another API which trades a bit of performance for ergonomics on top of the first one. Everyone uses compatible things, and taking 20% off the top of a design that can handle 100-200 Gbps to make the API nicer leads to it being both fast and ergonomic. In my opinion going the other way is very hard, since you may not provide enough information, which is why I will always advocate for building the hard to use but fast version first. This is a very weird way to build things for most people because it requires having specialists on hand at the start, which usually doesn't happen.

My hope is that people who have different specialties than I do are looking out for their areas in similar ways, since I want Mojo to go directly to state of the art and bypass all of the "good enough" solutions that aren't. If Mojo can actually pull it off, it will end up with a much better ecosystem than other languages because even the "easy mode" libraries will be built on top of the libraries used by domain specialists, meaning that everyone benefits from domain specialists being off in their own corners working on their needs instead of them having separate libraries.

@ematejska added the mojo label on Feb 26, 2025