With lightweight SPL available, how necessary is MPP?
To obtain better computing performance, MPP databases such as Greenplum, Vertica, IQ, TD and AsterData are often adopted. Although MPP can deliver better performance, the cost is high. MPP consumes a large amount of hardware resources, which drives up hardware cost, and commercial products also carry expensive license fees. Moreover, MPP is complicated to operate and maintain: each node has to be maintained separately, and keeping data evenly distributed and consistent under a distributed framework adds to the O&M complexity. In short, MPP is heavy and expensive.
Then, is there any other solution?
The main purpose of using MPP is to obtain better computing performance. If the performance can be improved in a lightweight and cost-effective way, then we can give up MPP. So, is there a way like this?
After carefully analyzing current computing scenarios for structured data (databases), we found that in most of them the amount of data involved in a task is not particularly large. Take businesses whose data is usually considered large: a bank with tens of millions of accounts generates only hundreds of millions of transaction records a year, which is not considered large; an e-commerce system with millions of accounts accumulates data on roughly the same scale as the bank. Except for a few top companies, the computing scenarios of the vast majority of users do not involve particularly large amounts of data. A single task typically involves only tens of gigabytes, and tasks involving as much as 100 gigabytes are rare, let alone the petabyte-level tasks claimed by many big data vendors.
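The "tens of gigabytes" figure can be checked with back-of-the-envelope arithmetic. The Python sketch below is only an illustration: the 300 million rows per year and the 100 bytes per row are assumed values chosen to match "hundreds of millions" of records, not numbers from any real bank.

```python
def estimated_size_gb(rows: int, bytes_per_row: int) -> float:
    """Rough uncompressed size in gigabytes (1 GB = 10**9 bytes)."""
    return rows * bytes_per_row / 10**9


# Assumed figures for illustration only.
yearly_rows = 300_000_000   # "hundreds of millions" of transactions a year
row_bytes = 100             # hypothetical average record width

print(f"{estimated_size_gb(yearly_rows, row_bytes):.0f} GB per year")
```

Under these assumptions a full year of transactions comes to about 30 GB before compression, which fits comfortably on a single machine.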
Normally, a conventional database should be able to handle tasks of this scale easily, but that is not the case in reality. In the real world, it is very common for a batch job to take several hours, leaving no time to re-run it if something goes wrong. It is also very common for a report query to take tens of seconds or even minutes, and longer still when queries run concurrently (even if the number of concurrent queries is small).
To cope with this, users consider using MPP to speed things up.
Why do these situations still occur even though the amount of data is not large?
The main reason is that current databases do not fully utilize the hardware resources. In other words, the performance of the database itself is too low.
There are two deeper reasons.
One reason is that MPP adopts many engineering optimization methods such as data compression, columnar storage, index, and vector-based computing to serve the AP scenarios. By means of these methods, the computing efficiency is significantly improved. However, these methods are rarely found in traditional databases, so the performance is naturally low. If traditional databases adopted such technologies, the performance would also be improved. Unfortunately, only MPP is now using these technologies.
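To make two of these techniques concrete, here is a minimal Python sketch of columnar layout and run-length compression. It is a toy illustration, not the storage format of any MPP product or of SPL; the sample rows and both function names are hypothetical.

```python
from itertools import groupby

# Hypothetical detail rows: (trade_date, city, amount)
rows = [
    ("2024-01-02", "NY", 120.0),
    ("2024-01-02", "NY", 80.5),
    ("2024-01-02", "LA", 42.0),
    ("2024-01-03", "LA", 10.0),
]


def to_columns(records):
    """Transpose row tuples into one list per column."""
    return [list(column) for column in zip(*records)]


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


dates, cities, amounts = to_columns(rows)

# An aggregate over one column reads only that column, not whole rows.
total = sum(amounts)

# Ordered or low-cardinality columns compress well.
encoded_dates = run_length_encode(dates)
print(total, encoded_dates)
```

Because an analytical query usually touches a few columns out of many, storing each column separately and compressing it reduces the amount of data that has to be read from disk.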
The other reason is that although these slow-running operations do not involve large amounts of data, they are usually very complex. Due to the limitations of SQL itself, some complex operations are very difficult to implement, and even when they can be coded in SQL, the amount of computation is particularly large. A multi-step operation that depends on the order of the data, for example, is both difficult to code in SQL and slow to run. SQL lacks features such as record types, ordered operations, and procedural computation, which makes it impossible to code many high-performance algorithms. As a result, programmers can only resort to slow algorithms, so performance is poor.
Slow computation needs more hardware to speed it up. Therefore, even if the data scale is not large, a single database cannot cope, and users have to resort to distributed MPP.
Of course, we would like both the speed of a high-speed train (MPP) and the compactness of a car (a light solution). However, within the scope of current knowledge, a car that runs as fast as a high-speed train does not seem to exist, so the heavy high-speed train is the solution that has to be taken.
Fortunately, we now have the lightweight esProc SPL to fill the gap. SPL is a car that can reach the speed of a high-speed train! Here are some of its advantages:
As an open-source computing engine, SPL is specifically designed for processing structured data. The high-performance mechanism provided in SPL can fully utilize hardware resources, allowing a single machine to exert the computing ability of a cluster, thus making it possible to handle most computing scenarios that previously required MPP without employing a distributed framework.
In terms of engineering, SPL also adopts the common mechanisms of MPP, such as compression, columnar storage, index, and vector-based calculation to ensure excellent performance. In addition, SPL provides high-performance file storage that supports these mechanisms, eliminating the need for a closed database management system. The file storage can be directly distributed on any file system, making it more open. Not only is SPL high in computing performance, it’s also an out-of-the-box tool and lighter.
More importantly, because of the inherent defects of SQL, SPL does not continue with the SQL system but adopts an independent programming language, Structured Process Language. SPL provides more data types and operations, and innovates at the fundamental level (it is difficult to address theoretical defects with engineering methods alone). Software cannot change the speed of hardware, but if we use lower-complexity algorithms, the hardware has less computation to execute and performance improves naturally. SPL offers many such high-performance algorithms. The complicated multi-step ordered operation mentioned above, for example, is simple to code in SPL and runs fast.
High performance requires less hardware, a relationship we've talked about many times. In practice, SPL can handle most of the scenarios that seem to require MPP on a single machine, which not only saves hardware cost but also simplifies O&M.
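What "a lower-complexity algorithm executes less computation" means can be shown with a generic example. The Python sketch below is not SPL code, and top-N selection is my own choice of illustration, not a case taken from this article. It finds the 10 largest values in two ways: by sorting the whole data set, and by keeping only a small heap of candidates during a single scan.

```python
import heapq
import random


def top_n_by_sorting(values, n):
    """Sort everything, then keep the first n: roughly O(m log m) work."""
    return sorted(values, reverse=True)[:n]


def top_n_by_heap(values, n):
    """Scan once while keeping only n candidates: roughly O(m log n) work."""
    return heapq.nlargest(n, values)


random.seed(0)  # fixed seed so the run is repeatable
amounts = [random.randint(1, 1_000_000) for _ in range(100_000)]

# Both approaches return the same answer; only the amount of work differs.
assert top_n_by_sorting(amounts, 10) == top_n_by_heap(amounts, 10)
```

The heap version never orders the full data set and needs memory for only n values, so the gap widens as the data grows while the hardware stays the same.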
Here are some cases for reference:
- Open-source SPL turns pre-association of query on bank mobile account into real-time association
- Open-source SPL Speeds up Query on Detail Table of Group Insurance by 2000+ Times
- Open-source SPL improves bank’s self-service analysis from 5-concurrency to 100-concurrency
- Open-source SPL speeds up intersection calculation of customer groups in bank user profile by 200+ times
Besides improving computing performance, distributed technology is sometimes used to handle highly concurrent queries, which a single machine can indeed struggle to process. In this case, do we have to use MPP?
Not necessarily.
SPL provides a cloud mode that allows computing nodes to be started and stopped dynamically according to the level of concurrency, thereby implementing elastic computing. The cloud mode of SPL is completely different from the relatively fixed cluster mode of MPP: it handles concurrent requests flexibly and consumes as few hardware resources as possible.
The high-performance file storage of SPL mentioned earlier ensures the computing performance. Yet, unlike a database, which requires the data to be stored inside it, SPL is not bound to any storage, and its data files can be stored locally or remotely.
We know that the database has metadata, which takes up a lot of resources. As the data accumulate, the metadata will become larger and larger, and the whole system will become slower and slower, making it difficult to implement some methods such as data redundancy and trading space for time. In contrast, SPL has no metadata and is light to use.
Because SPL does not bind storage to computation and is not burdened by metadata, it naturally supports the separation of storage and computation. Even SPL's high-performance files behave within the system just like text files: they can be stored in a local or network file system, or directly in cloud object storage such as S3. With storage and computation separated, SPL can scale flexibly, which makes it easy to cope with high-concurrency scenarios and makes it more flexible and scalable than MPP.
Let's start by comparing SPL and SQL. They are similar in that both are computing languages for structured data. They differ in that SQL is not a complete language, and it often cannot implement even a simple data task on its own. For example, calculating the maximum number of consecutive days that a stock keeps rising, or the more complex e-commerce funnel calculation (such calculations are not rare and often appear in practice), is extremely difficult in SQL and often requires resorting to Python or Java. This makes the technology stack complex and complicates operation and maintenance.
Compared to SQL (many scenarios are difficult or even impossible to implement in SQL), SPL provides more concise syntax and more complete ability. For example, to calculate the maximum number of days that a stock keeps rising, coding in SPL needs just one statement:
stock.sort(trade_date).group@i(close_price<close_price[-1]).max(~.len())
In contrast, when this calculation is done in SQL, it needs to nest multiple layers of code and implement in a very roundabout way.
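For readers unfamiliar with SPL, the same ordered, single-pass logic can be sketched in Python. This is a reading of the expression above, not a translation from the SPL documentation: it assumes that a new group starts whenever the close falls below the previous day's close and that the answer is the size of the largest group. The function name and sample data are hypothetical.

```python
def longest_rising_run(records):
    """Size of the longest run of days in which the close never falls.

    `records` is an iterable of (trade_date, close_price) pairs.
    """
    ordered = sorted(records, key=lambda record: record[0])  # sort by trade date
    longest = current = 0
    previous = None
    for _, close in ordered:
        if previous is not None and close < previous:
            current = 1      # a fall starts a new run
        else:
            current += 1     # otherwise extend the current run
        longest = max(longest, current)
        previous = close
    return longest


# Hypothetical sample data, deliberately out of date order.
sample = [
    ("2024-01-05", 9.0), ("2024-01-02", 10.0), ("2024-01-03", 11.0),
    ("2024-01-04", 12.0), ("2024-01-08", 10.0), ("2024-01-09", 11.0),
    ("2024-01-10", 12.0), ("2024-01-11", 13.0), ("2024-01-12", 8.0),
]
print(longest_rising_run(sample))
```

The point is that the calculation needs the data in order and a reference to the previous record, which is exactly what SQL's unordered sets make awkward.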
In addition to conventional structured data computing library, SPL provides the ordered operation that SQL is not good at, the grouping operation that retains the grouped sets, as well as various association methods, and so on. In addition, SPL syntax provides many unique features, such as the option syntax, cascaded parameter and advanced Lambda syntax, making complex calculations easier to implement.
For more information about SPL syntax, visit: A programming language coding in a grid
Concise syntax together with complete language ability makes development very efficient and eliminates the need to resort to other technologies. The technology stack becomes simpler, everything can be done in one system, and operation and maintenance naturally become simpler and more convenient.
We often see the adoption of MPP improve performance but bring inconvenience. Data from diverse sources can be processed only after it is loaded into the database, which makes the data less timely; loading and persisting such data also increases cost, takes up more space, and complicates operation and maintenance. Moreover, MPP cannot take the place of the original TP database, so real-time computing on hot data requires a cross-database operation that is very difficult to implement.
SPL not only does not bind storage, but also supports connecting to diverse data sources and performing mixed calculations over them. This openness makes SPL totally different from a closed database, which can process data only after it has been loaded.
The data that SPL is good at processing includes conventional structured data, multi-layer structured data (JSON, XML, etc.), strings, text, and mathematical data such as matrices and vectors. Its support for multi-layer structured data such as JSON and XML is especially powerful, far surpassing that of traditional databases. Therefore, SPL works well with JSON-like data sources such as MongoDB and Kafka, and can easily exchange data with HTTP/RESTful services and microservices and provide computing services to them. It is also easy for SPL to implement mixed operations with a TP database, making it highly suitable for real-time queries and statistics.
The benefits of openness are self-evident. It not only avoids the database capacity and performance problems caused by ETL, but also ensures that data and calculations are up to date. Therefore, the openness of SPL is very friendly to real-time computing scenarios.
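The idea of a mixed calculation over a multi-layer data source and a TP database can be sketched in Python. The sketch is only an analogy for the pattern and does not show SPL's own API: an in-memory SQLite table stands in for the TP database, a JSON string stands in for documents arriving from a source such as MongoDB or Kafka, and the schema, field names, and `revenue_by_tier` function are all hypothetical.

```python
import json
import sqlite3
from collections import defaultdict

# Stand-in for a TP database table (hypothetical schema).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, tier TEXT)")
db.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "gold"), (2, "basic")])

# Stand-in for multi-layer JSON from a document store or message queue.
orders_json = """
[
  {"customer_id": 1, "items": [{"sku": "A", "price": 30.0, "qty": 2},
                               {"sku": "B", "price": 5.0, "qty": 1}]},
  {"customer_id": 2, "items": [{"sku": "A", "price": 30.0, "qty": 1}]},
  {"customer_id": 1, "items": [{"sku": "C", "price": 12.5, "qty": 4}]}
]
"""


def revenue_by_tier(connection, orders_text):
    """Join nested JSON orders with live customer rows, without loading either into the other."""
    tiers = dict(connection.execute("SELECT id, tier FROM customer"))
    totals = defaultdict(float)
    for order in json.loads(orders_text):
        amount = sum(item["price"] * item["qty"] for item in order["items"])
        totals[tiers[order["customer_id"]]] += amount
    return dict(totals)


result = revenue_by_tier(db, orders_json)
print(result)
```

Neither source is copied into the other first; the calculation reads both where they live, which is what keeps the result current.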
Besides not binding storage and having no metadata, the lightweight nature of SPL shows in its simple operating environment. SPL runs on any operating system where JDK 1.8 or higher is available, including common VMs and containers, and takes up less than 1 GB of space after installation.
What’s even more special is that SPL can not only be deployed independently but can also be integrated into applications, providing powerful computing ability within application. In this way, applications do not have to rely on central MPP to obtain powerful computing ability, and the coupling of data processing between applications is eliminated, making it flexible to use and easy to manage, and avoiding conflicts caused by multiple applications competing for central computing resources. All these are impossible for MPP.
Overall, most scenarios do not involve large amounts of data and do not need MPP, because SPL can handle them on a single machine. Even when a single machine is not enough (high-concurrency queries), the good scalability of SPL gives it an advantage over MPP, because SPL is lighter, has a simpler technology stack, and is more convenient to operate and maintain.
In the past, when we wanted a speed of 300 km/h, we could only place hope on a high-speed train (MPP). Now, everyone (application) can achieve this speed through a car (SPL). Although both tools have their own advantages and disadvantages, it is obvious that the car is easier and more convenient.
Maybe a simple and light car that can reach a speed of 300 km/h would fully meet your requirement!
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code