Preamble
What are the mappers and why do we need them
When speaking about a language connector for a protocol or API (IPROTO and any of the higher-level abstractions like the "box" or Cartridge APIs), the question of how to represent the specific domain model of that protocol or API is very important. There are several possible approaches to designing the connector models, including, for example:
1-1 representation of the protocol/API server domain model. This approach has both its advantages and disadvantages:
adv: it's easier to visually tie the connector objects with the corresponding protocol/API structures
adv: the connector code may remain simpler, especially if it follows the API structures literally, fully reconstructing the same object structure in the connector
dis: the connector object model may become convoluted and unnatural for the target language users, with design errors and peculiarities of the protocol implementation's source language leaking into the connector design
dis: the connector users will still have to implement their own domain models and adapters to use the connector in real applications
A completely different domain model, resembling existing target language patterns or some intermediate level of the protocol. This may include working directly with MsgPack or other containers, adapting existing ORMs or other frameworks, or reusing already existing standards like JSR 221 (JDBC). It also has its pros and cons:
adv: the domain model is very familiar to the connector users
adv: many existing libraries on top of the existing domain models allow building business solutions faster
adv/dis: the users expect the connector behavior to follow the behavior of other products that use the same domain model, which may not be achievable by the nature of the protocol/API server
dis: the target domain model will be narrower than the protocol/API domain model, with custom features sometimes not representable
dis: some existing approaches were created long ago and are out of date compared to modern design and programming patterns
Any kind of intermediate product between the above approaches. This includes the approach taken for the current driver implementation: keeping the outlines of the Tarantool server and cluster APIs for the connector API, while staying as close as possible to the target Java standard library objects to reduce the users' headache and the amount of glue code needed for integrating the connector into business applications and higher-level convenience libraries like Spring Data or Apache Spark.
If we talk about the Tarantool protocol (IPROTO) and APIs, we have a lot of different domain models on different levels:
the "IPROTO" level: IPROTO serves both as a transport layer for communicating with a Tarantool instance and as a representation of parts of the "box" API (but not all of it, see further). On the IPROTO level, all objects are wrapped into MsgPack, so the API parts of the protocol require first parsing the MsgPack objects and then constructing the API model objects from them;
the "Lua" level: when we go beyond the IPROTO capabilities, we have to work with Lua objects wrapped into MsgPack and then into the IPROTO packets. That includes any APIs available on Tarantool instances, from some "box" API parts to the high-level cluster functions. The Lua objects differ from the Tarantool database objects: one example is the infamous box.NULL kludge and the consequences of using it in places other than the database tuples;
the "library" level: we call some library functions on a Tarantool instance, wrapped into Lua objects and then into the IPROTO protocol. Any library API, for example, may or may not follow the Go error reporting convention (return nil, err). Some of the libraries use a higher-level abstraction, tarantool/errors, but some do not. Therefore we have several different error reporting patterns that we need to handle in the connector;
and finally, the "cluster" level: basically, some library APIs (tarantool/crud, for example) that are supposed to follow the Tarantool database server API but implement it for the Tarantool cluster.
The purpose of a language connector is to hide all that burden of complexity and bad design decisions from the end user and provide a uniform, stable, unambiguous, and unsurprising API, well adapted to the target language reality and the possible applications and library usage. Therefore, we have to implement a connector layer that resolves all this and has the following characteristics:
allows creating a uniform API for both single-instance and cluster Tarantool database deployments;
correctly represents internally the stack of layers wrapping the data, but hides it from the user (as much as possible);
resolves ambiguity automatically as much as possible, but relies on the user's decision about which API they call and what data they expect;
provides compile-time type checks as the Java language requires, reducing the number of runtime errors and the amount of unnecessary ugly code for object type resolution in user applications;
is easy to extend or customize (by the connector user);
is easy to maintain, clean from the programming practice perspective, and complies with Java programming patterns.
In the current driver architecture, this layer is represented by mappers.
Terms
A mapper is an abstraction that represents the rules of mapping between Tarantool protocol responses and Java language objects, from the standard library or custom ones, and vice versa.
Each mapper contains a set of converters, each of which represents a rule for converting a single MsgPack type to a single standard or custom Java class and vice versa. The set of converters is organized as a stack in the order they were added to the mapper: the first converter that fits the value will be used. This is necessary for correctly parsing the MsgPack objects, because this format always chooses the smallest possible representation for a value, while on the user application level it should always map back to an object of the same type.
Converters are divided into "value" and "object" converters: the first group represents rules for converting MsgPack values to Java objects, and the second one is for converting Java objects to MsgPack values.
On the other hand, converters are divided into "simple" and "complex" ones. The simple converters are for simple types (like numbers or strings), and the complex converters are for aggregate classes or container types like List or Map. By their nature, the complex converters are in fact recursive, because they require an instance of a mapper for converting the contents.
All converters needed for working with the standard TarantoolTuple type (which represents a Tarantool database tuple) are "built-in" converters and are defined in the DefaultMessagePackMapper class. That includes a converter for the special Packable interface, intended for classes that have their own custom MsgPack representation. All the TarantoolTuple implementations implement Packable.
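To make the first-fit stack behavior concrete, here is a minimal sketch of how such a converter lookup could work. The interface and class names below are simplified illustrations for this discussion, not the actual driver API:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

// Hypothetical simplified converter: checks applicability, then converts.
interface Converter<S, T> {
    boolean canConvert(S source);
    T convert(S source);
}

// Hypothetical mapper holding converters in LIFO order:
// the most recently registered converter is tried first.
class SimpleMapper<S, T> {
    private final Deque<Converter<S, T>> converters = new ArrayDeque<>();

    void register(Converter<S, T> converter) {
        converters.addFirst(converter); // newest lands on top of the stack
    }

    Optional<T> map(S source) {
        for (Converter<S, T> c : converters) {
            if (c.canConvert(source)) {
                return Optional.of(c.convert(source)); // first fit wins
            }
        }
        return Optional.empty();
    }
}
```

Registering converters in LIFO order lets user-supplied converters override the built-in ones without removing them from the stack.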
Mapper stacks
A tuple is the atom of data in the Tarantool database. But for working with the different Tarantool and library APIs, tuples alone are not enough. Tuples can be wrapped into a plain IPROTO response (the "box" API) or a library response (the "tarantool/crud" API). We also have some special situations, like fetching the Tarantool space and index metadata, or API methods that do not use tuples in requests or responses. The number of layers wrapping the actual data differs for each case. And on the transport level, each data container is represented by the same MsgPack standard container type, either an array or a map, so the meaning of a structure in the packet depends on what level of wrapping we are at. Therefore we have the concept of mapper stacks.
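To illustrate the different numbers of wrapping layers, the sketch below builds the same logical tuple in two container shapes. The layouts are illustrative assumptions about "box"-style and "tarantool/crud"-style responses, not exact protocol dumps:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WrappingDemo {
    // One logical tuple as a plain list of fields.
    static final List<Object> TUPLE = Arrays.asList(1, "Alice");

    // "box"-style result: simply an array of tuples.
    static List<Object> boxResult() {
        return Arrays.asList((Object) TUPLE);
    }

    // "tarantool/crud"-style result: the same rows nested one level
    // deeper inside a map, next to the space metadata (assumed layout).
    static Map<String, Object> crudResult() {
        Map<String, Object> result = new LinkedHashMap<>();
        result.put("metadata", Arrays.asList("id", "name"));
        result.put("rows", Arrays.asList((Object) TUPLE));
        return result;
    }
}
```

On the wire both shapes are just MsgPack arrays and maps, so only the mapper stack for the chosen API level knows at which depth the tuples actually sit.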
Mapper stacks are represented by mappers that take converters for one layer and include nested mappers. The highest-level mapper types currently include TarantoolTupleResultMapper, SingleValueCallResultMapper, and MultiValueCallResultMapper. The mapper stack implementations for each type will differ depending on which API level is being used: "box", "library" or "cluster". Each of these mapper types represents a common API response pattern:
TarantoolTupleResultMapper: represents an API method response that contains tuples as the data. "box" and "tarantool/crud" API responses for the CRUD methods will contain tuples, but in different formats. The users may also implement custom Lua APIs that return tuples.
SingleValueCallResultMapper: represents a common pattern for all Lua library API methods that return a single object or a "Go-style" tuple "nil, err". Only the first two objects in the response MsgPack array are processed.
MultiValueCallResultMapper: represents all other Lua library API methods that do not follow the previous patterns. The users may use some custom code to parse the error objects from such responses or rely on the IPROTO errors.
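As an illustration of the single-value pattern, the "Go-style" pair could be unwrapped before the payload is delegated to a nested mapper. This is a simplified sketch with hypothetical names, not the driver's actual SingleValueCallResultMapper:

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical exception type for errors reported in the response body.
class CallException extends RuntimeException {
    CallException(String message) { super(message); }
}

// Simplified single-value call result unwrapping: the response body is
// an array where index 0 is the payload and index 1 is an optional
// error; only the first two elements are inspected.
class SingleValueResult {
    static <T> T unwrap(List<Object> responseBody, Function<Object, T> payloadMapper) {
        if (responseBody.size() > 1 && responseBody.get(1) != null) {
            throw new CallException(String.valueOf(responseBody.get(1)));
        }
        // Delegate the useful payload to the nested (inner) mapper.
        return payloadMapper.apply(responseBody.get(0));
    }
}
```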
The biggest challenge in parsing the layered answers from API calls to the Tarantool server is that the representation of the answer differs based on the selected mapper stack. We cannot apply them completely automatically; user input is necessary for selecting the right one. However, we provide the user with very simple controls for that: the type of the expected API in the configuration (currently either "box" or "proxy"; for the latter, in fact, only the "tarantool/crud" library response format is supported, but that may change in the future), and different client API methods with different result types. The user specifies only the useful payload type; all the mapper stack machinery is hidden behind the method name.
A good description of the driver classes used for constructing the stacks was made by @ArtDu in #300. Although something could have already changed, the structure in general remains the same.
The problem
When parsing tuple data, it is possible to use the space schema for attributing the tuple fields to their names. This approach has important applications in higher-level libraries such as Spring Data and the Apache Spark connector, where tuples are automatically mapped to Java classes. For the user, it is also more convenient to populate tuples using names rather than field positions. It is also important that with names we can hide some auxiliary tuple fields, like bucket_id for sharding, from the user.
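A minimal sketch of the name-to-position attribution based on the space format; the class and field names here are illustrative assumptions, not the driver's actual metadata API:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical tuple view: resolves fields by name through an ordered
// list of field names, as could be obtained from the space metadata.
class TupleView {
    private final Map<String, Integer> positionsByName = new LinkedHashMap<>();
    private final List<Object> fields;

    TupleView(List<String> format, List<Object> fields) {
        for (int i = 0; i < format.size(); i++) {
            positionsByName.put(format.get(i), i);
        }
        this.fields = fields;
    }

    // Look up a field by name instead of by position.
    Object get(String name) {
        Integer pos = positionsByName.get(name);
        return pos == null ? null : fields.get(pos);
    }
}
```

With such a view, auxiliary fields like bucket_id stay addressable by position internally while the user-facing API can simply omit them.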
Using the tuple field names requires using the space metadata information in the tuple mapper. Therefore we need to have an isolated mapper stack for each Tarantool space.
All the mapper stacks use the same mapper API and converter structure as DefaultMessagePackMapper. So we have to use an isolated mapper stack for each driver API method call and copy the base default mapper with its built-in converters, adding, at some deep level, a tuple mapper for a particular space or some other mapper for custom user data (a list of strings, for example). We are not able to have a single stack with a tuple mapper for all calls, and the different payload types of the custom library calls make things even worse.
The current mappers implementation constructs a mapper stack for each call from scratch, copying the DefaultMessagePackMapper instance several times (we cannot reuse the same instance because we mutate it when adding new converters to the stack, like the different tuple converters). This is the simplest bug-free approach, but it has its drawbacks: high pressure on the garbage collector, more frequent GC pauses affecting performance, and a very high memory footprint. The extra CPU cycles spent on creating new mapper stacks are not great for the environment either.
We need to reduce the unnecessary copying of the mapper stacks and the DefaultMessagePackMapper instance. Ideally, each distinct mapper stack should be constructed only once for the lifetime of the application (not taking into account that the space schema on the server may change; the customer application has to either tolerate such changes or be modified and relaunched anyway).
Possible solutions
Here goes the point of discussion.
The first partial solution idea, the one that lies on the surface, is binding the TarantoolTupleResultMapper stacks to the space metadata. That is fairly easy to implement: we just need to construct a mapper stack once, when we create a proxy space object (an instance of a TarantoolSpace implementation). We will have no more stacks than the number of spaces used in the user application. The number of internal DefaultMessagePackMapper copies will be a multiple of that, but still constant. However, this idea does not solve the problem for all other types of calls.
When each mapper stack is built, it has some input parameters: the call type (single/multi, single-instance/cluster), the method, and the set of method parameter values (and their types). The second idea is to use that information for uniquely attributing a mapper stack. By computing some fingerprint (like a normal object hash code) over all the input parameters and binding a mapper stack to that value in a "cache", we guarantee that at least we will not create the same stack twice for the same set of input parameters. Taking the idea further and using only the method parameter types (like the class names), it looks possible to never construct a mapper stack twice for library method calls with the same signature.
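The fingerprint-and-cache idea could be sketched like this; the key composition and all names here are assumptions for illustration, not a proposed final design:

```java
import java.util.List;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Stand-in for a constructed mapper stack.
class MapperStack {
    final String description;
    MapperStack(String description) { this.description = description; }
}

// Cache of mapper stacks keyed by a fingerprint of the call parameters.
class MapperStackCache {
    private final ConcurrentHashMap<Integer, MapperStack> cache = new ConcurrentHashMap<>();

    // The fingerprint combines the call type, the method name, and the
    // parameter classes; parameter values are deliberately excluded so
    // that calls with the same signature share one stack.
    static int fingerprint(String callType, String method, List<Class<?>> paramTypes) {
        return Objects.hash(callType, method, paramTypes);
    }

    MapperStack getOrCreate(String callType, String method, List<Class<?>> paramTypes,
                            Supplier<MapperStack> factory) {
        int key = fingerprint(callType, method, paramTypes);
        // computeIfAbsent builds the stack at most once per fingerprint.
        return cache.computeIfAbsent(key, k -> factory.get());
    }
}
```

A real implementation would likely use a composite key object rather than a bare hash code, to rule out collisions between different fingerprints.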
@ArtDu has some ideas for separating the stacks of mappers inside DefaultMessagePackMapper (there is a task Separate default mappers from custom ones to reduce memory allocations #361, but unfortunately without any details). This idea could lead to reducing the mapper stack cache size when using the above approaches. It would be great to discuss this idea here too.
I'd like to hear your opinions and concerns on these two approaches, as well as any fresh ideas. Discussion is greatly welcome!
cc @ArtDu @Elishtar @bitgorbovsky @iDneprov @Totktonada