-
Notifications
You must be signed in to change notification settings - Fork 335
Samples of Merging and Splitting Files
This article explains how to merge multiple text files or Excel files into one file, and split a text file or Excel file into multiple smaller ones. It illustrates several types of scenarios and offers sample programs in esProc SPL.${article}!
To merge text files or Excel files of same or similar structures into a single file and to split such a file into smaller files are common for data analysts. Here we look at different types of scenarios and offers sample programs in esProc SPL. It’s convenient to do file merge and split with esProc because, as a professional data computing engine, it has a complete set of functions for handling data import/export and operations over a directory.
A directory contains a number of text files of same column headers and structure but with different number of rows and detailed data. We want to merge these files into a single file under one set of column headers.
Example: there are same-structure text files that record daily orders under e:/orders. Each file has column headers on the first row and detailed data starting from the second row, as shown below. We want to merge them into one file named orders.txt.
esProc SPL script:
A | Comment | |
---|---|---|
1 | =directory@p("e:/orders/*.txt") | Get full paths of all text files under e:/orders |
2 | =A1.conj(file(~).import@t()) | Merge data of all text files |
3 | =file("e:/orders.txt").export@t(A2) | Export merged data to orders.txt |
If e:/orders has subdirectories, the text files under each subdirectory need to be merged too. In such case, A1 can be rewritten as =directory@ps("e:/orders/*.txt"). @s option enables getting files under all subdirectories recursively.
The above script is written on the condition that the merged data can fit into the memory. If it cannot, we can use the following script:
A | B | C | |
---|---|---|---|
1 | =directory@p("e:/orders/*.txt") | =file("e:/orders.txt") | |
2 | for A1 | =file(A2).cursor@t() | |
3 | If #A2==1 | =B1.export@t(B2) | |
4 | else | =B1.export@a(B2) |
A1 List full paths of all text files under the directory; use @s option to get text files under subdirectories if there are any.
A2 Loop through each of the text files.
B2 Read in data from each file with the cursor; @t option enables reading the first row as column headers.
B3-C4 Export data from B2’s cursor; column headers are exported first and once and then use @a option to append data to the existing file.
ometimes the text files don’t have completely same structure. They may contain different number of columns or have columns in different order, though all share a number of common columns. To merge the common columns in all files into one file, we need to retrieve them from each file in a specified order.
Example: In the previous e:/orders, all orders files have ID, Company, Area, OrderDate and Amount columns, with different orders though. Some files contain other columns. The task is to merge these five columns of each file and write the result to orders.txt.
esProc SPL script:
A | Comment | |
---|---|---|
1 | =directory@p("e:/orders/*.txt") | Get full paths of all text files under e:/orders |
2 | =A1.conj(file(~).import@t(ID,Company,Area,OrderDate,Amount)) | Retrieve the five columns from every file in the specified order and merge them |
3 | =file("e:/orders.txt").export@t(A2) | Export the merge result to orders.txt |
Use the following script if the merge result cannot fit into the memory:
A | B | C | |
---|---|---|---|
1 | =directory@p("e:/orders/*.txt") | =file("e:/orders.txt") | |
2 | for A1 | =file(A2).cursor@t(ID,Company,Area,OrderDate,Amount) | |
3 | If #A2==1 | =B1.export@t(B2) | |
4 | else | =B1.export@a(B2) |
In some cases, we need to add a column to the merged data and assign the original file names to the new column to mark the category or source of the data.
Example: Directory e:/orders contains a number of Excel files that record orders of various types of parts. Each file is named after a type of part, such as tyre.xlsx and engine.xlsx. These Excel files have same structure. In each file, column headers occupy the first row and detailed data starts from the second row. The task is to merge them into one file named orders.xlsx and add a new column PartName to store part names.
esProc SPL script:
A | Comment | |
---|---|---|
1 | =directory@p("e:/orders/*.xlsx") | Get full paths of all Excel files under e:/orders* * |
2 | =A1.conj((fn=filename@n( |
Define the file names as a temporary variable; after the files are imported and a new column PartName is added, assign values the new column through fn; finally merge data of these files |
3 | =file("e:/orders.xlsx").xlsexport@t(A2) | Export merged data to orders.xlsx |
Use the following script if the merge result is too big to be wholly loaded into the memory:
A | B | C | |
---|---|---|---|
1 | =directory@p("e:/orders/*.xlsx") | =file("e:/orders.xlsx") | |
2 | for A1 | =file(A2).xlsimport@tc().derive(filename@n(A2):PartName) | |
3 | If #A2==1 | =B1.xlsexport@ts(B2) | |
4 | else | =B1.xlsexport@as(B2) |
Below is part of the merge result set orders.xlsx:
To split a file, we divide data in the file into a number of groups, write each group into a new file and name the new file after the group name.
Example: An Excel file records orders of various parts. We want to split it and store orders of each type of parts into a new Excel file to send them to the parts manufacturer.
esProc SPL script:
A | B | |
---|---|---|
1 | =file(“e:/orders/orders.xlsx”).xlsimport@t() | =A1.group(partName) |
2 | for B1 | =file("e:/parts/"+A2(1).partName+”.xlsx”).xlsexport@t(A2) |
A1 Import all original Excel files.
B1 Group imported data by partName.
A2 Loop through each group of parts orders.
B2 Export orders of each type of parts to a new Excel file named after the parts name.
If the size of the original files is too big to be loaded into the memory at a time, we can import it with the cursor using the following script:
A | B | C | D | |
---|---|---|---|---|
1 | =file(“e:/orders/orders.xlsx”).xlsimport@tc() | |||
2 | for A1,50000 | =A2.group(partName) | ||
3 | for B2 | =file("e:/parts/"+B3(1).partName+”.xlsx”) | ||
4 | if C3.exists() | =C3.xlsexport@a(B3) | ||
5 | else | =C3.xlsexport@t(B3) |
A1 Create the cursor to import the original Excel files.
A2 Loop through the cursor to retrieve 50,000 rows each time (the number is determined according to the memory space).
B2 Group each batch of retrieved rows by partName.
B3 Loop through each group of parts.
C3 Create a file object named after the parts name.
C4-D5 If a file object already exists, use @a option to append the parts orders to it; if there isn’t such a file object, use @t to import column headers first and then the detailed data.
In some text files, a record consists of multiple rows. To split such a file, we need to identify the rows for forming one record and make sure that one record won’t be split into two files.
Example 1: log.txt records web operations logs. Each piece of log occupies 5 lines. The task is to split the file into smaller files, and each will contain 1000 pieces of logs.
esProc SPL script:
A | B | |
---|---|---|
1 | =file(“e:/log.txt”:”UTF-8”).cursor@s() | |
2 | for A1,5000 | =file(“e:/log/log_”/#A2/”.txt”).export(A2) |
A1 Import the log file with the cursor; @s option enables reading the whole line as a string.
A2 Loop through the cursor to retrieve 5000 rows each time, which forms 1000 piece of logs.
B2 Generate logs file names according to the loop number, such as log_1.txt, log_2.txt… and export all retrieved rows in the current time to A2.
Example 2: log.txt stores application execution logs, as shown below. Each piece of log is made up of an indefinite number of lines that starts with a left square bracket and ends before another left square bracket. The task is to split the file into multiple files and each will contain 1000 logs.
esProc SPL script:
A | B | |
---|---|---|
1 | =file(“e:/log.txt”:”UTF-8”).read@n() | =A1.select(~!="") |
2 | =B1.group@i(pos(~,"[")==1).cursor() | |
3 | for A2,1000 | =file(“e:/log/log_”/#A3/”.txt”) |
4 | =A3.(~.(B3.write@a(~))) |
A1 Open the logs file to read in data; @n option enables reading each line as a string and returns a sequence of strings.
B1 Get non-empty rows.
A2 Group these rows according to whether a row starts with a left square bracket. Create a new group if a row begins with a left square bracket and put a row into the current group if it doesn’t. Then read in all groups of sequences as a cursor.
A3 Loop through the cursor to retrieve 1000 groups each time, that is, 1000 pieces of logs.
B3 Generate logs file names according to the loop number, such as log_1.txt, log_2.txt….
B4 Two levels of loops. The outer loop is 1000 groups and the inner loop is the rows in each group. Write each row to the end of the target file.
Find more examples in SPL CookBook.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code