The current Lakehouse is a false proposition
From all-in-one machines and hyper-convergence to cloud computing and HTAP, we keep trying to combine multiple application scenarios and solve them with a single technology, in pursuit of simple and efficient use. The Lakehouse, which is very popular nowadays, is exactly such a technology: its goal is to integrate the data lake with the data warehouse so that each can deliver its respective value.
The data lake and the data warehouse have always been closely related, yet they differ significantly. The data lake focuses on retaining original information; its primary goal is to store the raw data "as is". But raw data contains a great deal of junk. Does storing raw data "as is" mean all that junk ends up in the lake? Yes: the data lake is like a junkyard where everything is kept, useful or not. The first problem the data lake faces is therefore the storage of massive amounts of (junk) data.
Thanks to the considerable progress of modern storage technology, the cost of storing massive data has dropped dramatically; a distributed file system, for example, fully meets the storage needs of a data lake. But storage alone is not enough; computing ability is required to bring the data's value into play. A data lake stores various types of data, each processed differently, and structured data processing matters most: whether for historical data or newly generated business data, processing focuses mainly on structured data, and computations on semi-structured and unstructured data are often eventually transformed into structured data computations. Unfortunately, since the data lake's storage layer (a file system) has no computing ability of its own, the data cannot be processed directly on the lake; other technologies (such as a data warehouse) must be brought in. The main problem the data lake faces is that it is "capable of storing, but incapable of computing".
For the data warehouse it is just the opposite. A data warehouse is built on the SQL system and usually has powerful structured data computing ability. However, raw data can be loaded into the warehouse only after being cleansed, transformed, and deeply organized to meet the database's constraints. In this process a large amount of original information is lost, and the data granularity may become coarser, so the value hidden in finer-grained data can no longer be obtained. Moreover, the data warehouse is highly subject-oriented and serves only one or a few subjects; data outside those subjects is not its target. This narrows the range of usable data, so the warehouse cannot explore the value of full, unknown data as the data lake does, let alone store massive raw data. Compared with the data lake, the data warehouse is "capable of computing, but incapable of storing".
From the point of view of data flow, the warehouse's data can be organized on top of the data lake, so a natural idea is to integrate the two and achieve the goal of being "capable of both storing and computing". This is the so-called "Lakehouse".
So, how is it actually implemented today?
The current method is oversimplified and crude: open data access rights on the data lake so that the data warehouse can access its data in "real time" ("real time" here is relative to the original ETL process that periodically moves data from lake to warehouse; in practice there is still some delay). Physically, the data is still stored in two places, with interaction over a high-speed network. Because this arrangement has some ability to process the lake's data in "real time", the result (mostly at the architecture level) is now called a Lakehouse.
That’s it? Is that a Lakehouse in the true sense?
Well, I have to say: as long as the one who calls it a Lakehouse doesn't feel embarrassed, the embarrassment is left to those who know what a Lakehouse should really be.
Then how does the data warehouse read the lake's data? A common practice is to create an external table/schema in the warehouse that maps an RDB's tables or schema, or Hive's metastore, in the same way a traditional RDB accesses external data through external tables. Although metadata is retained, the disadvantages are obvious. It requires that the lake's data can be mapped to tables and schemas under the corresponding relational model, and the data must still be organized before it can be computed. The range of usable data sources also shrinks (for example, NoSQL, text, and Web services cannot be mapped directly). Furthermore, even when other data sources (such as an RDB) in the lake are available for computation, the warehouse usually has to move the data to its local storage for computations such as grouping and aggregation, resulting in high transmission cost, degraded performance, and many other problems.
For the current Lakehouse, in addition to "real-time" data interaction, the original channel that periodically organizes data in batches is still retained, so that organized lake data can be loaded into the warehouse for local computing. Of course, this has little to do with the Lakehouse; it was done the same way before the "integration".
Either way, whether data is moved from lake to warehouse through traditional ETL or mapped externally in "real time", both sides change little (only the transmission frequency improves, and even that only under many conditions). Physically the data still lives in two places; the lake is still the original lake and the warehouse is still the original warehouse, and they are not essentially integrated. Consequently, the problems of data diversity and efficiency are not fundamentally solved (lack of flexibility), and the lake's "junk" data still has to be organized and loaded into the warehouse before it can be computed (poor real-time performance). Expecting a "Lakehouse" implemented this way to deliver real-time, efficient data processing on the data lake is, I'm afraid, a joke.
Why?
A little thought reveals that the problem lies in the data warehouse. The database system is too closed: data must be loaded into the database (even external data must be mapped) before it can be computed, and the database's constraints force the data to be deeply organized to conform to its norms, while the lake's raw data contains a lot of "junk". Organizing that data is reasonable, but it makes responding to the lake's real-time computing needs difficult. If the database were open enough to directly compute the lake's unorganized data, even to perform mixed computations across diverse data sources, while providing a high-performance mechanism to guarantee computing efficiency, then a real Lakehouse would be easy to implement. Unfortunately, the database cannot achieve this.
Fortunately, esProc SPL can.
SPL - an open computing engine - helps implement a real Lakehouse
The open-source SPL is a structured data computing engine that provides open computing power for the data lake. It can compute the lake's raw data directly, with no constraints and no need for a database to hold the data. SPL also offers mixed computing over diverse data sources: whether the lake is built on a unified file system or on multiple sources (RDB, NoSQL, local files, Web services), SPL can compute across them directly, so the lake's value can be produced quickly. Furthermore, SPL provides high-performance file storage (the storage function of a data warehouse); data can be organized gradually while computations are already running, and loading raw data into SPL's storage yields higher performance. Note in particular that data organized into SPL storage still lives in the file system and, in theory, can be stored in the same place as the data lake. In this way a real Lakehouse can be implemented.
In the overall architecture, SPL can store and compute directly on the data lake, connect to the lake's diverse data sources, and even read external production sources directly. With these abilities, real-time computation on the data lake becomes possible; in scenarios demanding especially fresh data (where the data must be used before it lands in the lake), SPL can connect to the real-time source for even higher timeliness.
The original path of moving data from lake to warehouse can still be retained: ETLing raw data into SPL's high-performance storage achieves higher computing performance. Meanwhile, since files are used for storage, the data can be distributed on SPL servers or kept in the lake's unified file system; that is, the work of the original data warehouse is completely taken over by SPL, and the Lakehouse is implemented within one system.
Let's take a look at these abilities of SPL.
SPL supports various data sources, including RDB, NoSQL, JSON/XML, CSV, Webservice, etc., and can perform mixed computations across them. This makes it possible to use any type of raw data in the lake directly and extract its value without transforming it; the "loading into the database" step is simply omitted. Data use becomes flexible and efficient, and a wider range of business requirements is covered.
With this ability, the data lake can serve applications as soon as it is established, rather than after a prolonged cycle of data preparation, loading, and modeling. An SPL-based data lake is also more flexible and can respond in real time to business needs.
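As a rough illustration of such a mixed computation (a minimal sketch: the data source name "mysql", the file path, and the field names are assumptions for illustration, not taken from the original text), SPL can join a production RDB table with a raw CSV file kept in the lake in a few lines:

|   | A | |
|---|---|---|
| 1 | =connect("mysql").query@x("select OrderID,CustomerID,Amount from orders") | Query the RDB directly; @x closes the connection afterwards |
| 2 | =file("/lake/customers.csv").import@tc() | Read a raw lake CSV; @t takes the first row as field names, @c means comma-separated |
| 3 | =join(A1:o,CustomerID;A2:c,CustomerID) | Join the two sources in memory, with neither loaded into a database |
| 4 | =A3.new(o.OrderID,c.Name,o.Amount) | Flatten the join result; Name is an assumed customer field |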
In particular, SPL's strong file support gives files powerful computing ability: lake data stored in a file system gains computing power close to, sometimes exceeding, that of a database. Besides text files, SPL handles hierarchical formats such as JSON, so data from NoSQL stores and RESTful services can be used directly, without transformation:
|   | A | |
|---|---|---|
| 1 | =json(file("/data/EO.json").read()) | Parse the JSON file into a hierarchical table sequence |
| 2 | =A1.conj(Orders) | Concatenate the Orders details of all records |
| 3 | =A2.select(Amount>1000 && Amount<=3000 && like@c(Client,"*s*")) | Conditional filtering |
| 4 | =A2.groups(year(OrderDate);sum(Amount)) | Grouping and aggregating |
| 5 | =A1.new(Name,Gender,Dept,Orders.OrderID,Orders.Client,Orders.SellerId,Orders.Amount,Orders.OrderDate) | Join |
SPL provides all-around computational capability. The discrete dataset model it is based on (instead of relational algebra) gives it a set of computing abilities as complete as SQL's. Moreover, with agile syntax and procedural programming, data processing in SPL is simpler and more convenient than in SQL.
Rich computing library of SPL
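As one small example of that agility (a sketch: the file path and field names are assumed), finding the longest streak of trading days on which a stock's price did not fall takes two natural steps in SPL, where SQL would need nested window functions:

|   | A | |
|---|---|---|
| 1 | =T("/lake/stock.csv").sort(Date) | Load the records and order them by date |
| 2 | =A1.group@i(Price<Price[-1]).max(~.len()) | @i starts a new group at each price drop, so the longest group is the longest run without a fall |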
All this gives the data lake the full computing ability of a data warehouse, achieving the first step of integrating the two.
SPL's open computing power extends beyond the lake. Ordinarily, if the target data has not yet been synchronized from its source into the lake but is needed right now, there is no choice but to wait for synchronization to finish. With SPL, we can compute directly against the data source, or perform mixed computations between the source and the data already in the lake; logically, the source becomes part of the lake, which brings higher flexibility.
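A hedged sketch of that idea (the data source name "oracle", the file path, and the table and field names are all assumptions): today's orders, not yet synchronized into the lake, are read from the production database and aggregated together with the history already organized in the lake:

|   | A | |
|---|---|---|
| 1 | =file("/lake/orders.btx").cursor@b(Area,Amount) | Historical orders already in the lake as a bin file |
| 2 | =connect("oracle").cursor@x("select Area,Amount from orders where OrderDate=?",date(now())) | Today's orders straight from the production source; assumed to have the same structure as A1 |
| 3 | =[A1,A2].conjx().groups(Area;sum(Amount)) | Mixed computation over history plus real-time data |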
In addition to its all-around computing ability, SPL provides file-based high-performance storage. ETLing raw data into SPL storage yields higher performance, and the file system brings a series of advantages, such as flexible use and easy parallel processing. Gaining this storage ability completes the second step of integrating lake and warehouse: a new, open, flexible data warehouse takes shape.
Currently, SPL provides two high-performance file storage formats: the bin file and the composite table. The bin file is compressed (smaller footprint, faster reading), stores data types (no type parsing on read), and supports a double-increment segmentation mechanism that allows appending data; since segmentation makes parallel computing easy, computing performance is assured. The composite table supports columnar storage, a great advantage in scenarios where only a small number of columns (fields) are involved; it also implements a min-max index and supports double-increment segmentation, so it enjoys the advantages of columnar storage while making parallel computing easier. Both formats can be produced from raw lake data in a few lines, as sketched below.
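For instance (the paths and field names are illustrative, not from the original):

|   | A | |
|---|---|---|
| 1 | =file("/lake/orders.csv").cursor@tc() | Stream the raw CSV as a cursor, so it need not fit in memory |
| 2 | =file("/lake/orders.btx").export@b(A1) | Write a bin file: compressed, typed, and appendable |
| 3 | =file("/lake/orders.ctx").create(#OrderID,Client,Amount,OrderDate) | Create a columnar composite table; # marks the dimension used for ordering and segmentation |
| 4 | =A3.append(file("/lake/orders.csv").cursor@tc()) | Load the raw data into the composite table (a fresh cursor, since A1 was consumed by A2) |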
Furthermore, parallel computing is easy to implement in SPL, fully exploiting multiple CPUs. Many SPL functions, such as file retrieval, filtering, and sorting, support parallel processing: simply adding the @m option triggers automatic multithreading, and explicitly parallel programs can also be written for even higher performance.
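A minimal sketch (the composite table and field names are assumed, following the composite table API in current esProc documentation): adding @m is enough to parallelize a scan-and-aggregate:

|   | A | |
|---|---|---|
| 1 | =file("/lake/orders.ctx").open().cursor@m(Client,Amount) | @m returns a multicursor, so the table's segments are scanned by parallel threads |
| 2 | =A1.groups(Client;sum(Amount)) | The grouping and aggregation run in parallel over the segments |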
In particular, SPL supports a variety of high-performance algorithms unavailable in SQL. For example, SPL treats the common TopN operation as a form of aggregation, turning a high-complexity full sort into a low-complexity aggregation while extending the range of application:
|   | A | |
|---|---|---|
| 1 | =file("data.ctx").create().cursor() | |
| 2 | =A1.groups(;top(10,amount)) | Get orders whose amount ranks in the top 10 |
| 3 | =A1.groups(area;top(10,amount)) | Get orders whose amount ranks in the top 10 in each area |
These statements contain no sort-related keywords and trigger no full sort. The statements for getting the top N from a whole set and from grouped subsets are basically the same, and both achieve high performance. SPL has many more high-performance algorithms like this.
Relying on these mechanisms, SPL can outperform traditional data warehouses, often by orders of magnitude, so the Lakehouse is implemented on the data lake not in words but through effective mechanisms.
Furthermore, SPL can perform mixed computations over organized data and raw data, giving full play to the value of every kind of data without preparing it in advance. This not only expands the lake's flexibility but also provides the function of a real-time data warehouse, achieving the third step of integration: flexibility and high performance at once.
Through these three steps, the path for building the data lake is improved (originally, data had to be loaded and transformed before computing): data preparation and computation can proceed simultaneously, and the lake is built step by step. In the process, the data warehouse is perfected as well, giving the lake powerful computing ability. This is how a real Lakehouse is implemented.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code