Replies: 7 comments
-
What is the shape of the data? How many columns? What data types? How many records does the predicate filter out? Does the file contain any records that span multiple lines, i.e. quoted fields that contain newline characters? Technically, it isn't possible to start parsing a CSV file from the middle: because of the rules around quoted fields, you can't know the parse state, so a newline character might or might not indicate the end of a record. However, while it isn't possible for arbitrary data sets, it might be possible for your dataset, assuming it doesn't have any quoted fields containing newlines. If that's the case, then you should be able to seek the stream to the middle of the file, locate the next newline, and start parsing from there. Having said that, if the data never changes, I would probably apply the filter and sorting to the data once and store it in a format that doesn't require parsing.
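A minimal sketch of that seek-and-resync idea, assuming the dataset has no quoted fields containing newlines (the in-memory CSV here is a stand-in for a real file stream):

```csharp
using System;
using System.IO;
using System.Text;

class MidFileScan
{
    // Counts the complete records found after seeking to the midpoint.
    static int CountRecordsFromMiddle(Stream stream)
    {
        // Seek to roughly the middle of the file.
        stream.Seek(stream.Length / 2, SeekOrigin.Begin);
        using var reader = new StreamReader(stream);
        // Discard the (likely partial) line we landed in; the next
        // ReadLine() then starts at a real record boundary.
        reader.ReadLine();
        int count = 0;
        while (reader.ReadLine() != null)
            count++;
        return count;
    }

    static void Main()
    {
        // In-memory stand-in for the real CSV file.
        var csv = "1,2,3\n4,5,6\n7,8,9\n10,11,12\n";
        using var stream = new MemoryStream(Encoding.ASCII.GetBytes(csv));
        Console.WriteLine(CountRecordsFromMiddle(stream)); // prints 1
    }
}
```

This only works because, without embedded newlines, every `\n` is guaranteed to be a record boundary; with quoted multi-line fields the first `ReadLine()` could resynchronize in the wrong place.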
-
The CSV has 3 columns, input, output, and finalValue; all columns are of integer data type. The condition finalValue > 0 removes around 10k records. Regarding "if the data never changes, I would probably apply the filter and sorting to the data and store it in a format that won't require parsing" — can you please give an example of how that can be achieved?
-
You can use System.IO.BinaryReader/Writer:

```csharp
// pre-process: filter and sort once, then write the packed binary file
using var csv = CsvDataReader.Create(fileName, options);
var result = csv.GetRecords<Record>().Where(x => x.finalValue > 0).ToArray();
result = result.OrderBy(x => x.input).ToArray();

using var oStream = File.Create("myData.bin");
using var bw = new BinaryWriter(oStream);
// write the number of records first
bw.Write(result.Length);
foreach (var record in result)
{
    bw.Write(record.input);
    bw.Write(record.output);
    bw.Write(record.finalValue);
}
```

That will give you a file, "myData.bin", that contains the filtered/sorted data as packed binary integers. Reading it back is then just:

```csharp
using var iStream = File.OpenRead("myData.bin");
using var br = new BinaryReader(iStream);
var count = br.ReadInt32();
var records = new Record[count];
for (int i = 0; i < count; i++)
{
    var record = new Record();
    record.input = br.ReadInt32();
    record.output = br.ReadInt32();
    record.finalValue = br.ReadInt32();
    records[i] = record; // store it in the array
}
```

This will almost certainly be quite a bit faster than parsing the CSV.
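If the per-field BinaryReader loop itself ever becomes the bottleneck, a variant of the same idea reads the whole file in one call and slices the integers out of the byte array. This is a sketch against the same layout (a count, then input/output/finalValue triples); the small in-memory buffer stands in for `File.ReadAllBytes("myData.bin")`:

```csharp
using System;
using System.IO;

struct Record { public int input, output, finalValue; }

class BulkLoad
{
    static Record[] Load(byte[] bytes)
    {
        int count = BitConverter.ToInt32(bytes, 0);
        var records = new Record[count];
        for (int i = 0; i < count; i++)
        {
            int off = 4 + i * 12; // 4-byte header + 3 x 4-byte ints per record
            records[i].input = BitConverter.ToInt32(bytes, off);
            records[i].output = BitConverter.ToInt32(bytes, off + 4);
            records[i].finalValue = BitConverter.ToInt32(bytes, off + 8);
        }
        return records;
    }

    static void Main()
    {
        // Build a tiny buffer in the same format for demonstration.
        using var ms = new MemoryStream();
        using var bw = new BinaryWriter(ms);
        bw.Write(2);                         // record count
        bw.Write(1); bw.Write(2); bw.Write(3);
        bw.Write(4); bw.Write(5); bw.Write(6);
        bw.Flush();

        var records = Load(ms.ToArray());
        Console.WriteLine($"{records.Length} {records[1].finalValue}"); // prints "2 6"
    }
}
```

Note BitConverter assumes the file was written on a machine with the same endianness, which holds when the same program writes and reads it.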
-
Thanks for the help here. When I run the above code, the first 3 lines (the CSV load, filter, and sort) take 15 seconds on a Raspberry Pi. Is there a way just this load can be optimized? The post-load processing is working quite fast for me.
-
The idea is that you only have to do that once, as an external preprocessing step, since the data doesn't change. From then on your main process only deals with the .bin file, and the CSV is never used again. The optimization is that you eschew CSV parsing in your main process altogether. Maybe I misunderstood your requirements.
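One way to wire that up, as a sketch ("myData.bin" from the earlier comment; `RebuildCache` and `LoadBinary` are hypothetical placeholders for the pre-processing and BinaryReader code):

```csharp
using System;
using System.IO;

class Program
{
    static void Main()
    {
        const string cache = "myData.bin";
        if (!File.Exists(cache))
        {
            // One-time pre-processing: parse the CSV, filter, sort,
            // and write the packed binary file.
            RebuildCache(cache);
        }
        // Every normal run takes this fast path and never touches the CSV.
        LoadBinary(cache);
        Console.WriteLine("cache ready");
    }

    // Hypothetical placeholders for the code shown in the earlier comment.
    static void RebuildCache(string path) { /* CSV -> bin */ }
    static void LoadBinary(string path) { /* BinaryReader loop */ }
}
```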
-
My intention is to reduce the processing time while reading the CSV and converting it to strongly typed data. The series of steps in the // pre-process block above takes around 20 seconds to execute; is there a possibility I could reduce this time? I noticed that if I don't call .ToArray() and instead process one record at a time, that speeds things up, but I also have a requirement to sort the data before processing.
-
If the records never change, there is no reason to optimize that step, as you only need to run it once.
-
Hi, I have a scenario wherein I am fetching more than 50 million records, and the fetching takes a good amount of time, 15+ seconds. The records always remain the same, so I was wondering if there is a feasible way to use the data reader for this scenario. If not, can anyone suggest how the processing time can be reduced? I am using the below code to fetch the data:

```csharp
using var csv = CsvDataReader.Create(fileName, options);
result = csv.GetRecords<Record>().Where(x => x.finalValue > 0).ToArray();
result = result.OrderBy(x => x.input).ToArray();
```