[Feature Request] [stdlib] [proposal] Define `BinarySerializable`, `BinaryDeserializable`, and `BinaryStreamable` traits #3747
Comments
In general, I go back and forth on whether a serialization/deserialization trait belongs in the standard library, because it makes improving that data model difficult. For example, in Rust …
Hi @lsh, thanks for mentioning those examples. I went and looked at the code; it's not so dissimilar to what I'm proposing here. serde's serializer basically implements the serialization for every type, which is not ideal; Mojo's goal is for each type to implement its own thing. rkyv is very similar to this proposal:

```rust
/// Converts a type to its archived form.
///
/// Objects perform any supportive serialization during
/// [`serialize`](Serialize::serialize). For types that reference nonlocal
/// (pointed-to) data, this is when that data must be serialized to the output.
/// These types will need to bound `S` to implement
/// [`Writer`](crate::ser::Writer) and any other required traits (e.g.
/// [`Sharing`](crate::ser::Sharing)). They should then serialize their
/// dependencies during `serialize`.
///
/// See [`Archive`] for examples of implementing `Serialize`.
pub trait Serialize<S: Fallible + ?Sized>: Archive {
    /// Writes the dependencies for the object and returns a resolver that can
    /// create the archived type.
    fn serialize(&self, serializer: &mut S) -> Result<Self::Resolver, S::Error>;
}

/// Converts a type back from its archived form.
///
/// Some types may require specific deserializer capabilities, such as `Rc` and
/// `Arc`. In these cases, the deserializer type `D` should be bound so that it
/// implements traits that provide those capabilities (e.g.
/// [`Pooling`](crate::de::Pooling)).
///
/// This can be derived with [`Deserialize`](macro@crate::Deserialize).
pub trait Deserialize<T, D: Fallible + ?Sized> {
    /// Deserializes using the given deserializer
    fn deserialize(&self, deserializer: &mut D) -> Result<T, D::Error>;
}
```

The API in C is really just passing a stream pointer (a pointer to a file), and people build stuff around that (mostly safety nets). There are only so many ways you can represent a pointer and a length. This is the same thing, but with the added safety of knowing the length and mutability of the origin; and if proposal #3728 gets accepted, we will also have the added benefit of a …. Proposal #3728 and the fact that mutable origins exist are also the reasons why I'm rooting for ….
Why are we doing streaming before the language gets support for generators?
@melodyogonna They aren't needed, though they may be a nice abstraction. I'll update the Streaming section of the proposal with an idea to go full-on C mode that I think will answer that question. FYI, the current writer abstraction also doesn't make use of generators or flushing: every time you call ….
This isn't exactly right. If I understand correctly, you're proposing to not have a data model at all; instead it seems you're conflating the …, which is the data model. Because your methods don't have a data model or a concept of …, here's where the design challenges kick in that make it tough to add this to the standard library. If you do define a data model, it's possible a better data model exists, but we're stuck because of backwards compatibility. If you don't define a data model, then it becomes harder to map structs automatically to new serialization/deserialization targets (and to generate them automatically via reflection and decorators).
I'm not a fan of the … approach. Consider the following struct.

```mojo
struct Foo[N: UInt]:
    var a: UInt32
    var b: InlineArray[TrivialType, N]
    var c: String
    var d: List[TrivialType]
```

The most efficient way to write this to a file (assuming native endian) is this:

```mojo
fn write_to_file[N: UInt](fd: FileDescriptor, foo: Foo[N]):
    # Build one iovec per field so the kernel can gather all four regions
    # in a single writev call, with no intermediate copies.
    var vecs = InlineArray[IoVec, 4]()
    vecs[0].iov_len = sizeof[UInt32]()
    vecs[0].iov_base = UnsafePointer.address_of(foo.a)
    vecs[1].iov_len = sizeof[InlineArray[TrivialType, N]]()
    vecs[1].iov_base = UnsafePointer.address_of(foo.b)
    vecs[2].iov_len = len(foo.c._buffer)
    vecs[2].iov_base = foo.c._buffer.data
    vecs[3].iov_len = len(foo.d) * sizeof[TrivialType]()
    vecs[3].iov_base = foo.d.data
    fd.writev(UnsafePointer.address_of(vecs), 4)
```

This is totally incompatible with the proposed API. You may not be thinking about the cost of the copies, but if we take N = 1 million, ….

Given that large copies exist, we also want async variants of everything, since Intel and AMD both package DMA accelerators on their CPUs now, and because we may want to stream data out, which means having async inside the serialization as well as for the IO.

I'm also not sure if requiring byte granularity is desirable. Many networked systems are heavily bottlenecked on bandwidth (likely anything with less than 1 Gbps of bandwidth per CPU core) and prefer to bit-pack items. As presently written, I think this would result in bottlenecks even for APIs as minimal as POSIX sockets, and would make serialization costs orders of magnitude higher than IO costs for better APIs. I am SURE that this would cause havoc above 100 Gbps due to the amount of unnecessary copies and/or extra syscall overhead, and, as I mentioned earlier, local information is not sufficient to determine the proper encoding for many formats.

I strongly suggest that we wait until we have reflection, at which point an implementation of a format can inspect both the output type (buffer, stream socket, message socket, file, device memory, etc.) and the input data types to determine a good way to serialize the data. rkyv isn't much better, since it ignores scatter/gather IO.

My design for networking in std is blocked on the ability to have custom allocators (because the current one has no way to allocate DMA-safe memory), as well as trait objects, custom MLIR dialects (for P4), parametric traits, conditional trait impls, and negative trait bounds. The target performance is in the multiple-Gbps-per-CPU-core range (hopefully 10+ Gbps), which is why it will see a magnified version of any overheads added to serialization. Ideally, you the user should never copy more than a few cache lines; the network card or NVMe drive should do all of the copying, since they can do that without taxing the CPU. Depending on the available memory bandwidth, it may even be preferable to task a GPU with the copies if you can do other useful work while it finishes up the serialization.
@lsh OK wow, that's cool. I definitely took too quick a look. Yes, I mean binary serialization; for me, serialization is when data is changed to a bitwise format to be encoded into voltages and sent over a literal wire (UART, I2C, SPI, CAN, RS-232, etc.).

And no, I don't think we should define a data model beyond that, since other people can come up with better ideas down the road. Fancy things like reflecting over struct fields are better left for external libraries as well, IMO.
@owenhilyard I think the implementation could just have a ….

Same thing: can't you implement a serializer with the same API that is …?

Wouldn't that be the responsibility of the implementation of the stream? You could again have a ….

Yeah, we could also make the ….

I'm not sure if you observed something I'm not getting, or if you think the stream will do a syscall on each ….
I'm not proposing the most generic of APIs where a stream can serialize anything; the idea is that you add constraints inside the methods for what your specific stream supports. I would just like to define an API to which all kinds of streaming can be adapted, in a way that doesn't limit their capabilities. So if, for example, someone figures out a clever way to implement all you're describing, they can just plug it into stdlib methods in an organic way, because they conform to the traits.
I'd love the inputs of both of you on #3728 btw. (and also #3766 which is a bit unrelated, but I know you'll probably be some of the heavy users)
That is a lot of requirements, and I'm worried it will take too long. I'm already seeing people doing gymnastics to adapt to the ….
That would be amazing. I totally agree that that should be the target 🔥. I just want us to define an interface where all of that is possible, without going into fancy things like struct-specific reflection etc. That would be the responsibility of the implementation; the API should be generic enough for all use cases.
(The issue title was changed from "Define `Serializable`, `Deserializable`, and `Streamable` traits" to "Define `BinarySerializable`, `BinaryDeserializable`, and `BinaryStreamable` traits".)
I think that there is good reason to make serialization reflection-based. Doing more work at compile time means that hot-loop serialization can be more heavily optimized. My thought is that the API would leverage type equality to allow for "overrides" by the user, where the user provides a compile-time mapping of types to handler functions that is overlaid on top of the one provided by the format (which would special-case List, Dict, String, etc.). Some form of annotations would also be useful (for instance, a request to treat a var member as if it had a different name, or to format a timestamp in a particular way).

I don't think it's possible to make an API which can properly handle the wide variety of formats well if it can't see both the input and output of the operation clearly. We could, of course, design a very detailed API for asking questions of types on both sides (e.g.: Does the type efficiently do vectored serialization? How many iovecs does it need to do vectored serialization? Is there an upper bound on the size of the copy if you flatten the type? Is the type trivial? Is there any arbitrary user information attached to the serializer, such as a format version?), but I think that is a very manual way of reimplementing the information that could be handled via reflection.

If we look at the Zig JSON implementation, it's not that verbose, but it still has room to implement a bunch of JSON-specific optimizations. A reflection-based approach which is also able to make use of traits/annotations to figure out the intent (e.g. "serialize me like a list") should be much easier for users. This is especially true since, without reflection, we run into the problem of every single struct needing a serialize implementation, which is the bane of my existence in Rust when some library forgets to add it and I need to use the remote-derive hacks. A reflection-based approach can work without annotations in the general case, and only requires effort from the user for data structures which don't implement a "how to serialize me" trait.
The point of this was to show that you can more efficiently serialize things that have collections of trivial types (like Strings) using writev, especially if the data is large. If I want to serialize a single struct, then making a dedicated protocol for it is fine. The issue comes when you have 20 structs like this spread across 3 libraries.

Ideally, a program should use arena allocators for arrays of iovec so that you can support async IO well, which leads to a desire to know, if possible, the bounds on the number of iovecs needed for an operation. If you use List, even if you know exactly how many items you need, you then have double indirection, which means touching a cache line unnecessarily, since you will likely be getting the List from an object pool (resizable arrays tend to dislike having their backing allocation come from an arena). There are also times when it would be better to slowly fill an array of iovecs and then do a "flush" to clear the buffer in the middle of serializing to avoid allocations (TCP or non-atomic file IO).

If we need to make a serializer for each unique combination of things we want to serialize efficiently, then why have the API at all? Just make a function for each type. I think that making special serializers will quickly turn into a process that is better accomplished via reflection.
I can, but then I end up only being able to serialize that struct, so what happens is I end up having to serialize a concrete type or a known set of parameters, which gets me back to hand-writing an implementation for every encode and decode call in my program.
There would need to be 2-way communication with the stream. For serialization formats which are closer to SoA, you can't cleanly say "this whole struct needs to be at alignment X", you need it to be per-member. Then, you have preferences, like preferring to align to 512 bits to use better instructions but tolerating lower alignment, or 128 bit instructions tolerating 64 bit alignment but strongly preferring to not cross cache lines, which means the user needs a lever to control that serialized size vs alignment tradeoff. For ISAs with variable-width SIMD, you can only make the determination of what alignment is needed at runtime.
I think we need to see more of Mojo's async machinery first. It seems to be decent at unrolling functions that are actually sync into sync code, but I'm not sure how well that will hold up in non-toy examples. However, for things where the program can't advance until the call returns, blocking syscalls are generally more efficient. Many scripting use-cases run into this, and in that case it's better to have the core that submitted the work go do the network processing, because then the data is in L1 cache when you look at the result.
Lists have trivial costs IF you know what size to make them up front. If you have to guess you either end up with buffer bloat or hot loop allocations. I think that having an intermediary is helpful, but I don't think that you can get a sufficient amount of information on what you are doing as an intermediary to be efficient without either a massive inspection API or reflection. This is especially relevant for things like "Are the Lists/Strings in this struct in DMA-safe memory?" which can force copies for some types of IO and is very expensive to query.
While this could work, I think it will lead to a continually expanding inspection API to provide sufficient information to the stream. For people working in high performance areas like myself, "Can I DMA this?" is an important question, but if someone who doesn't know what they're doing fills out the introspection API incorrectly then the kernel will kill your program or (if using older DMA drivers), you could just overwrite arbitrary physical memory. I think that having users fill it out by hand will lead to APIs either having to be very conservative in what they accept or requiring users to know about things like DCOM, Slingshot, Shared-memory computing and disaggregated memory to properly fill out a list of flags and enums in the API.
I'll take a look.
While I agree the Writer trait is not good, I want to make sure we get this right. We've been promised reflection at some point, and I think it makes more sense to stack this feature on top of that. Zig started in that place and it's worked well for them, C++ is working on doing the same, and removing the serde dependency from half of crates.io while improving performance was the motivating example for reflection in Rust. We have a substantial amount of prior art indicating that even the ability to fully inspect the local struct is not enough to properly do serialization, and I want to avoid repeating those mistakes by shipping something merely good enough that it's hard to move people off of it. Once we have reflection, whatever exists pre-reflection should be fairly easy to discard for the new API, since it won't require much library support. As such, I'm happy to let half-solutions exist for a bit while the language catches up; then we can standardize a state-of-the-art solution.
I don't think it's possible to do that without the "fancy things". 10 Gbps of 256-byte messages (a reasonable size for an RPC call that isn't going to a storage service) is ~5 million decodes per second, and responses would likely be another ~5 million per second. On a 4 GHz CPU (a normal-ish server), this means that if you have a core dedicated to doing serialization and network IO, and nothing else, your receive, RPC decode, response encode, and transmit need to average ~820 clock cycles. You can buy more cycles with more cores, but that starts to hurt your application performance. By the way, UDP sendmsg and recvmsg will each eat ~200 cycles, and TCP is worse. Even if we're charitable and use the DPDK l2fwd numbers (which will never happen in a real application), 780 cycles still isn't great for serialization costs. If you want to do time-based retransmits, rdtsc will eat another ~80 cycles. 200 Gbps means that I can present this problem to a 32-core server and have the same problem on 20 cores, while having to do 100 million RPS on the remaining 12 cores.

This is for 10 Gbps; if you think that is unrealistic, then 100 Gbps with 1500-byte messages (the largest you can reliably send over the internet) is worse, giving you 440 cycles per message with DPDK rx/tx times. At that point, I don't think there's really room for anything except the "fancy things" that squeeze as much performance as possible out of serialization. In my opinion, it's irresponsible to design for anything less than 100 Gbps if we expect Mojo to have a long lifespan, since we can expect further consolidation of servers for the sake of power efficiency.

The reason I'm pushing for this approach is that if we make something that will tolerate being put under this much pressure, then everyone else benefits immensely, and there isn't a cutoff above which you have to ignore the standard library. A solution designed to perform at 10x or 100x what most people need, if it can be made ergonomic, is a tremendous asset to a language.
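For reference, the ~820-cycle budget follows directly from the numbers above (10 Gbps of 256-byte messages on a 4 GHz core):

$$
\frac{10 \times 10^{9}\ \text{bit/s}}{256\ \text{B} \times 8\ \text{bit/B}} \approx 4.9 \times 10^{6}\ \text{msg/s},
\qquad
\frac{4 \times 10^{9}\ \text{cycles/s}}{4.9 \times 10^{6}\ \text{msg/s}} \approx 820\ \text{cycles/msg}
$$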
@owenhilyard thanks for all the explanations, you sent me on a couple of rabbit holes there 😆
OK, now you're just blowing my mind 🤯. If you seriously think an API that applies to such a high-throughput system is achievable in a generic way that we can apply to most things with a low entry barrier (or at least implement for the most common use cases), I'm sold. Once we have the features, just ping me; I'll be your code monkey 🤣
I'm glad I convinced you. My approach to a lot of things like this is to design for scenarios which are pushing the limits of the hardware and have that as the base API, then build another API which trades a bit of performance for ergonomics on top of the first one. Everyone uses compatible things, and taking 20% off the top of a design that can handle 100-200 Gbps to make the API nicer leads to it being both fast and ergonomic. In my opinion, going the other way is very hard, since you may not provide enough information, which is why I will always advocate for building the hard-to-use but fast version first. This is a very weird way to build things for most people because it requires having specialists on hand at the start, which usually doesn't happen. My hope is that people who have different specialties than I do are looking out for their areas in similar ways, since I want Mojo to go directly to state of the art and bypass all of the "good enough" solutions that aren't. If Mojo can actually pull it off, it will end up with a much better ecosystem than other languages, because even the "easy mode" libraries will be built on top of the libraries used by domain specialists, meaning that everyone benefits from domain specialists being off in their own corners working on their needs instead of having separate libraries.
What is your request?

Define `Serializable`, `Deserializable`, and `Streamable` traits.

What is your motivation for this change?

#3744 and #3745 are reference implementations for `Writer` applied to those collection types. They make no sense as they are, because the concept of `Writer` is conceptually constrained to `String` streaming. This needs to scale out further.

Any other details?
Proposals
I will bunch up many thoughts inside this proposal; they can each be implemented or not individually.
1. Rename `Span` to `Buffer`

`Buffer` is a known word and concept; it is the preferred word in C. `Span` is not so clear for non-native English speakers, and it is not one of the first words one thinks of when talking about data stored in memory. `Buffer` is also what best describes this structure, since it can also mutate (and in the future maybe also own) the data.

2. Prioritize and strengthen generics for working with `Span`
Parametrized origins are very clunky. See #3744 for an example of a method to enable reads or writes to a `Span`; a sketch of the kind of signature involved is shown below.
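As an illustration of the clunkiness, here is a hypothetical pseudocode sketch (not taken from #3744; the syntax approximates current Mojo and may not compile on any given nightly):

```mojo
struct Packet:
    var data: List[Byte]

    # Pseudocode: a method that hands out a read-or-write view must be
    # parametrized on both mutability and origin, and thread them through
    # the return type, which quickly gets verbose.
    fn as_bytes[mut: Bool, //, origin: Origin[mut]](ref [origin] self) -> Span[Byte, origin]:
        return Span[Byte, origin](ptr=self.data.unsafe_ptr(), length=len(self.data))
```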
3. Define a generic trait for serialization
Proposed pseudocode below (we don't have parametrized traits yet, so we can constrain it to `Byte` for now). A type could potentially serialize to more than one type of buffer; e.g. `String` could be encoded to different UTF standards.
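A minimal sketch of the shape such a trait could take (hypothetical names; `serialized_size` is illustrative and not part of the proposal text):

```mojo
trait BinarySerializable:
    # Hypothetical: lets callers pre-allocate an exactly-sized buffer.
    fn serialized_size(self) -> Int:
        ...

    # Writes self into the caller-provided mutable byte buffer.
    fn serialize(self, buffer: Span[Byte, MutableAnyOrigin]) raises:
        ...
```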
4. Define a generic trait for binary deserialization
(A type could potentially deserialize from more than one type of buffer; e.g. `String` could be decoded from different UTF standards.)
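A matching pseudocode sketch (again, hypothetical naming):

```mojo
trait BinaryDeserializable:
    # Reconstructs an instance from a read-only byte buffer.
    @staticmethod
    fn deserialize(buffer: Span[Byte]) raises -> Self:
        ...
```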
5. Define generic traits for binary streaming
Libc's abstraction of streams that get flushed has stood the test of 50 years of use by very different domain-specific logic. It is not perfect, but it is a great starting point.
The current `Writer` trait has no error handling at all, which is not good for interfacing with anything that does I/O. These traits would replace the `Writer` trait.

what this model enables
current writer trait
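For reference, today's stdlib `Writer` looks roughly like this (paraphrased; exact signatures may differ between releases):

```mojo
trait Writer:
    # Writes bytes through immediately; no raises, so I/O errors have nowhere to go.
    fn write_bytes(mut self, bytes: Span[Byte, _]):
        ...

    # Variadic convenience entry point over Writable arguments.
    fn write[*Ts: Writable](mut self, *args: *Ts):
        ...
```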
what the streamable trait does
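A pseudocode sketch of the libc-style alternative (hypothetical names):

```mojo
trait BinaryStream:
    # May buffer internally; errors surface through raises instead of being swallowed.
    fn write(mut self, buffer: Span[Byte]) raises:
        ...

    # Pushes any buffered bytes to the underlying sink (socket, file, pipe).
    fn flush(mut self) raises:
        ...


trait BinaryStreamable:
    # Types decide how to decompose themselves into writes on the stream.
    fn stream_to[S: BinaryStream](self, mut stream: S) raises:
        ...
```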
where does the difference come into play?
This same abstraction will work all the way down to sockets and file streams, because it's practically the same thing. The potential for zero-copy data structures and direct file-memory-mapping deserialization is huge. Most importantly, this is simple and easy to use: everyone can understand that you send things to a pipe/stream and then you have to flush it. It might also be interesting to make this a linear type, where the user must call `flush` at some point or the code won't compile.

6. Define convenience builtin methods
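For example (hypothetical helpers, reusing the hypothetical trait names sketched in the earlier points):

```mojo
# Serialize any conforming value into a freshly allocated byte list.
fn to_bytes[T: BinarySerializable](value: T) raises -> List[Byte]:
    var buffer = List[Byte](capacity=value.serialized_size())
    # (fill `buffer` by handing value.serialize() a mutable Span over it)
    return buffer


# Rebuild a value from raw bytes.
fn from_bytes[T: BinaryDeserializable](buffer: Span[Byte]) raises -> T:
    return T.deserialize(buffer)
```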
Effects this will have
This will allow passing fat pointers all the way down through many APIs. We will stop caring so much about collection types copying their items, as we will be dealing with pointer + length combinations for everything, with the added safety of `Origin`.
This will positively influence the handling of any future networking and data-interchange implementations and build scaffolding for some neat code. I'm currently developing the Socket package (waiting for async Mojo :( ) and purely using `Span[Byte]` as the data interchange format. But if we can define these generically, many improvements will follow, as users will be able to implement their own streaming, serialization, and deserialization logic.