How Much Is One Terabyte of Data?
A mile doesn't seem very long, and a cubic mile doesn't seem that big compared with the size of the earth. Yet you may be surprised to learn that the entire world's population could fit into a cubic mile of space. That claim is not mine: Hendrik Willem van Loon, a Dutch-American writer, once made it in one of his books.
Teradata is a famous data warehouse product. When the brand was coined over 30 years ago, the name was meant to impress people with the product's ability to handle massive amounts of data. Today, TB is the smallest unit many database vendors use when talking about the amount of data they can handle, and PB, even ZB, comes up routinely. TB no longer sounds like a big unit, and hundreds of terabytes, or even a petabyte, of data does not seem intimidating at all.
In fact, one TB, like one cubic mile, is rather large. Since many people have little intuitive grasp of its size, let's take a different angle and examine what 1TB of data means to a database.
Databases mainly process structured data, among which ever-growing transaction data takes up the most space. Each piece of transaction data is not big, from dozens of bytes to about one hundred bytes when only the key information is stored. For example, a banking transaction record includes only the account, date and amount; a telecom call record contains only the phone number, time and duration. Suppose each record occupies 100 bytes, or 0.1 KB; a terabyte of storage space can then hold about ten billion records.
What does this mean? There are about 30 million seconds in a year. To accumulate 1 terabyte of data in a year, roughly 300 records per second must be generated around the clock!
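These figures are easy to check with a quick back-of-envelope calculation. The minimal Python sketch below reproduces them, assuming a decimal terabyte and the 100-byte record size used above.

```python
# Back-of-envelope arithmetic for the claims above (decimal TB assumed).
RECORD_SIZE_BYTES = 100          # one transaction record, key fields only
TERABYTE = 10 ** 12              # 1 TB as storage vendors count it

records_per_tb = TERABYTE // RECORD_SIZE_BYTES
print(f"records in 1 TB: {records_per_tb:,}")                  # 10,000,000,000

seconds_per_year = 365 * 24 * 3600                              # ~31.5 million
rate = records_per_tb / seconds_per_year
print(f"records per second to reach 1 TB/year: {rate:.0f}")     # ~317
```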
That is not a ridiculously large number. In a large country like the U.S., the businesses of national telecom operators, national banks and internet giants easily reach that scale. But for a city-wide or even a state-wide institution, it is a genuinely big number. The tax information collected by a local tax bureau, the purchase data of a local chain store, or the transaction data of a city commercial bank is unlikely to grow by 300 records per second. Moreover, many organizations generate data only during business hours or on weekdays. To accumulate dozens, or even a hundred, terabytes of data, the business volume would have to be one or two orders of magnitude larger.
Talking about TB-scale data in the abstract makes it hard to grasp, but translating it into the corresponding business volume per second gives a clear picture.
On the other hand, some organizations that are not especially large do hold data volumes ranging from hundreds of terabytes up to a petabyte. How does that happen?
A single piece of unstructured audio or video data can be several, or even dozens of, megabytes in size, so reaching the PB level is easy; but such data is not computed inside the database.
The various information systems of an organization together accumulate a huge volume of data over N years, with each system contributing, say, 200 GB or 50 GB per year. On top of that there are redundant intermediate computing results. Added up, the total may well reach hundreds of terabytes, or even a petabyte. The data may all be stored in the database, but it is generally not used all at once in the same computing task.
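To make the accumulation concrete, here is a purely illustrative sketch; the number of systems, growth rates, years and redundancy factor are assumptions, not figures from this article.

```python
# Hypothetical accumulation sketch: all parameters are illustrative assumptions.
systems = 40                      # assumed number of info-systems
avg_growth_gb_per_year = 150      # assumed average growth per system
years = 12
redundancy_factor = 3             # assumed intermediate results and extra copies

total_tb = systems * avg_growth_gb_per_year * years * redundancy_factor / 1000
print(f"accumulated volume: ~{total_tb:.0f} TB")   # ~216 TB
```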
It is also normal to generate hundreds, or even ten thousand, records per second when data is collected automatically by machines or consists of user behavior logs. The total volume may reach hundreds of terabytes, or even the PB level, and in that case the database does need to handle TB-level data or more. Yet this kind of trivial data is of little use and involves very simple computing logic; basically, we just need to find and retrieve records.
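A rough sketch of how such collection rates turn into volume, assuming the same 100-byte record size as before: at ten thousand records per second, a single source already produces tens of terabytes per year, so several sources or several years of accumulation easily reach hundreds of terabytes.

```python
# Hypothetical rate-to-volume sketch; the record size and rate are assumptions.
records_per_second = 10_000        # upper end mentioned in the text
record_size_bytes = 100            # assumed, as for transaction records
seconds_per_year = 365 * 24 * 3600

tb_per_year = records_per_second * record_size_bytes * seconds_per_year / 10 ** 12
print(f"volume per year: ~{tb_per_year:.0f} TB")   # ~32 TB/year at this rate
```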
Now let's look at what a database capable of processing TB-level data looks like. Some database vendors claim that their products can handle TB-level, even PB-level, data in seconds, and that is what users often expect. But is it true?
To process data, we need to read it through at least once. A high-speed SSD reads about 300 megabytes per second (the figures hard disk manufacturers quote are not fully achievable under an operating system). Merely retrieving one terabyte of data therefore takes over 3000 seconds, nearly an hour, before any other operation is performed. How, then, can one TB be processed in seconds? Only by putting about 1000 hard disks in place, so that one TB can be read in roughly 3 seconds.
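The arithmetic behind these numbers, as a small sketch assuming 300 MB/s of effective sequential throughput per SSD and decimal units:

```python
# Scan-time arithmetic from the paragraph above.
DATA_BYTES = 10 ** 12              # 1 TB
THROUGHPUT = 300 * 10 ** 6         # ~300 MB/s effective sequential read per SSD

single_disk_seconds = DATA_BYTES / THROUGHPUT
print(f"one SSD: {single_disk_seconds:.0f} s (~{single_disk_seconds / 3600:.1f} h)")

disks = 1000                       # disks needed for "seconds-level" retrieval
print(f"{disks} SSDs in parallel: {single_disk_seconds / disks:.1f} s")
```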
And that is the ideal estimate. In reality, data is rarely stored in neat, contiguous order, and performance becomes terrible when it is read from disk discontinuously. Obviously 1000 hard disks will not fit into one machine, and a cluster adds network latency; some computations, such as sorting and joins, involve writing intermediate results back to disk; and instant query access is often accompanied by concurrent requests. Considering all these factors, it is not surprising that processing becomes several times slower.
So one terabyte of data means either several hours or a thousand hard disks. And that is just one terabyte; imagine what dozens, or a hundred, terabytes would bring.
If you have ever transferred files over a network, you know how difficult it is to move one TB of data; often the quickest way is simply to carry the hard disks away. This, too, gives a sense of how large one TB really is.
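For a rough idea, the sketch below estimates the transfer time at two assumed link speeds; the speeds are illustrative, not figures from this article.

```python
# Hypothetical transfer-time sketch; link speeds are assumptions.
DATA_BITS = 10 ** 12 * 8           # 1 TB expressed in bits

for name, bits_per_second in [("100 Mbps", 100e6), ("1 Gbps", 1e9)]:
    hours = DATA_BITS / bits_per_second / 3600
    print(f"{name:>8}: ~{hours:.0f} h to move 1 TB at full, sustained speed")
```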
In practice, most computing tasks of most users involve data volumes ranging from dozens to hundreds of gigabytes at most; it is rare for a task to reach the TB level. Yet a distributed database can still take several hours to process even that amount of data, as a review of the slow tasks you have handled will likely confirm. The computing logic may be complex, and repeated traversals and writebacks are not uncommon. A typical distributed database deployment today has only a handful to a dozen nodes; it is almost impossible to build an environment with thousands of hard disks. In such a setup it is not surprising at all that a computation takes several hours, which is practically the norm for batch processing tasks in the finance industry.
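To see how a few hundred gigabytes can turn into hours, here is a purely hypothetical estimate; every parameter (data size, number of passes, writeback factor, node count, per-node throughput, efficiency) is an assumption chosen only for illustration.

```python
# Hypothetical batch-job estimate; all parameters below are assumptions made
# only to show how "several hours" can arise from a few hundred GB.
data_gb = 300                      # working data set of one batch task
passes = 15                        # repeated traversals in a complex job
writeback_factor = 2               # intermediate results written and re-read
nodes = 10
per_node_mb_s = 300                # sequential throughput of one node's disk
efficiency = 0.3                   # discontinuous access, network, concurrency

bytes_moved = data_gb * 1e9 * passes * writeback_factor
effective_throughput = nodes * per_node_mb_s * 1e6 * efficiency
print(f"estimated time: ~{bytes_moved / effective_throughput / 3600:.1f} h")   # ~2.8 h
```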
Even in the large, top-tier organizations whose total data volume reaches N petabytes and whose computing centers have thousands, or even ten thousand, nodes, most computing tasks still involve only dozens to hundreds of gigabytes, and perhaps about ten virtual machines are allocated to a given task. A large organization has too many things to take care of; it cannot devote all of its resources to one task.
PB-level data does exist in many organizations, but it is mainly a storage concept rather than a computing requirement. Demanding that the database be able to process PB-level data simply because the total volume reaches the PB level is a consequence of databases' closedness. Existence does not imply rationality; in fact, that is a terrible solution, which we will discuss later.
One TB is a huge volume for a database used for data analysis and computing; the name Teradata is not outdated even today. A tool that can process TB-level data smoothly is significant, whether it improves the user experience by cutting processing time from several hours to several minutes, or replaces a small-scale distributed environment with a single machine so that operation and maintenance costs drop sharply. esProc SPL is such a tool.
For most user scenarios, pursuing the ability to process PB-level data is both unnecessary and impracticable.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code