multiple headers #128

rychlym · 2022-05-31T09:28:59Z

rychlym
May 31, 2022

Hello,
is it possible to programmatically set the starting (header) row and ending line?

The input file consists of two different tables and I need to read them both.
The first row is planned to contain a comment describing where the tables start, e.g.
# tbl1, 2, tbl2, 40
It indicates that the tbl1 header is on the 2nd row and the tbl2 header is on the 40th row.

(This does not adhere to the CSV standard, so I assume it can't be solved using your API and the file will need to be split first. But please let me know if I am mistaken.)

Answered by MarkPflug

MarkPflug · 2022-05-31T14:15:15Z

MarkPflug
May 31, 2022
Maintainer

Without knowing more about the shape of your file, here are a couple things you can try.

You can set CsvDataReaderOptions.ResultSetMode = ResultSetMode.MultiResult. This will cause the reader to consider all changes in the number of columns to be the start of a new dataset. You would call Read in a loop, like normal, but then you would need to call NextResult to start reading the next table. This only works if the number of columns in the tables is different though, or there is an empty line that separates them.

Another option would be to read the header comment and parse the row info. You can do this manually on the TextReader before passing it to CsvDataReader.Create, or you can use CsvDataReaderOptions.CommentHandler to pass a function that will be called when a comment is encountered. The first option is probably simpler in your case. Once you have the row info, you can use it together with a count of how many times you have called Read to tell where each table starts and ends.

Hopefully one of these options works for you.
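
The MultiResult option described above can be sketched roughly as follows. This is a minimal illustration, not the maintainer's code: the inline sample data is made up, and it assumes the two tables differ in column count, which is the condition stated above.

```csharp
using System;
using System.IO;
using Sylvan.Data.Csv;

// Made-up sample: two tables with different column counts,
// so the reader can tell where the second one starts.
var data =
    "id,name\n" +
    "1,alpha\n" +
    "2,beta\n" +
    "code,description,price\n" +
    "A1,widget,9.99\n";

var options = new CsvDataReaderOptions
{
    ResultSetMode = ResultSetMode.MultiResult
};

using var csv = CsvDataReader.Create(new StringReader(data), options);

var tableCount = 0;
do
{
    tableCount++;
    Console.WriteLine($"table {tableCount}: {csv.FieldCount} columns");
    while (csv.Read())
    {
        // First field of each row the reader yields for the current table.
        Console.WriteLine("  " + csv.GetString(0));
    }
} while (csv.NextResult()); // move on to the next table, if any
```

If both tables happen to have the same number of columns and no empty line between them, this approach cannot separate them and the header comment has to be used instead.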

0 replies

timotheuspreisinger · 2022-05-31T14:16:41Z

timotheuspreisinger
May 31, 2022

You could write a Reader implementation that, on the first ReadLine, reads the line with the file format spec (a list of table name + line count tuples) and then returns null once all lines for one table have been read. The first read loop will start with the first table's header line and return null once all of that table's lines are read. The next read loop will return all rows of the second table, and so on.
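
A simplified variant of this idea, using only base-library types, might look like the sketch below. Instead of a full TextReader subclass it copies the lines of one table into a StringReader, so each table is buffered in memory. TakeLines is a hypothetical helper name, and the sample data is made up.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: copies up to lineCount lines from source into a new
// reader, leaving source positioned at the first line after the segment.
static TextReader TakeLines(TextReader source, int lineCount)
{
    var sb = new StringBuilder();
    for (var i = 0; i < lineCount; i++)
    {
        var line = source.ReadLine();
        if (line == null) break; // input ended early
        sb.Append(line).Append('\n');
    }
    return new StringReader(sb.ToString());
}

var input = new StringReader(
    "# tbl1, 2, tbl2, 5\n" +
    "a,b\n1,2\n3,4\n" +
    "x,y,z\n5,6,7\n");

var spec = input.ReadLine();                  // the format-spec line (row 1)
var first = TakeLines(input, 3);              // rows 2..4: tbl1 header + 2 records
var second = TakeLines(input, int.MaxValue);  // the rest: tbl2

var firstText = first.ReadToEnd();
var secondText = second.ReadToEnd();
Console.Write(firstText);
Console.Write(secondText);
```

Each returned reader could then be handed to CsvDataReader.Create on its own. Splitting by physical lines assumes no quoted field contains an embedded line break.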

3 replies

MarkPflug May 31, 2022
Maintainer

@rychlym This suggestion is good, I just didn't want to explain how to implement it. I've actually been meaning to provide an implementation myself, so I finally got around to it this morning. If you include Sylvan.Data v0.2.1 (just published to NuGet this morning), you can use the DbDataReader Skip/Take extension methods to control which rows are read. This allows you to hand the DbDataReader to other APIs that don't know anything about your special requirements.

var textReader = File.OpenText("data.csv");
var comment = textReader.ReadLine();

var (skip, take) = ParseComment(comment); // This is your business...

var reader = CsvDataReader.Create(textReader);

reader = reader.Skip(skip).Take(take);

// You can pass the reader to DataTable.Load which will load just the range you specified via skip/take.
var dt = new DataTable();
dt.Load(reader);
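
ParseComment is left to the reader in the snippet above. One possible shape for it, assuming the "# tbl1, 2, tbl2, 40" layout from the original question (alternating table names and 1-based header row numbers), is sketched here. ParseTableSpec and its tuple shape are illustrative names, not part of any library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical parser for a spec line such as "# tbl1, 2, tbl2, 40".
static List<(string Name, int HeaderRow, int? DataRows)> ParseTableSpec(string comment)
{
    var parts = comment.TrimStart('#').Split(',', StringSplitOptions.TrimEntries);
    if (parts.Length % 2 != 0)
        throw new FormatException("Expected name/row pairs: " + comment);

    var result = new List<(string Name, int HeaderRow, int? DataRows)>();
    for (var i = 0; i < parts.Length; i += 2)
    {
        result.Add((parts[i], int.Parse(parts[i + 1]), (int?)null));
    }

    // Rows strictly between two consecutive headers are the data rows of
    // the earlier table; the last table runs to the end of the file.
    for (var i = 0; i + 1 < result.Count; i++)
    {
        var current = result[i];
        var dataRows = result[i + 1].HeaderRow - current.HeaderRow - 1;
        result[i] = (current.Name, current.HeaderRow, dataRows);
    }
    return result;
}

var tables = ParseTableSpec("# tbl1, 2, tbl2, 40");
foreach (var t in tables)
{
    var rows = t.DataRows?.ToString() ?? "until end of file";
    Console.WriteLine($"{t.Name}: header row {t.HeaderRow}, data rows: {rows}");
}
```

For the first table, the computed DataRows value (37 in this example) is what would be passed to Take, with nothing to skip because the header immediately follows the comment line.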

rychlym May 31, 2022
Author

All of the above is quite inspiring, as I just came across this great lib today. Thanks a lot for your effort and quick reactions, which I hadn't expected!

MarkPflug May 31, 2022
Maintainer

Thanks! If you've found them helpful, drop a star on the github repos. Helps other people find them.

plsft · 2023-03-28T15:50:10Z

plsft
Mar 28, 2023

reader = reader.Skip(skip).Take(take);

This bit doesn't work for me. It can't convert from CsvDataReader to DbDataReader

I also tried

csv = (csv.Skip(skip).Take(take).AsDbDataReader() as CsvDataReader)!;

but it throws a null exception. Any ideas, @MarkPflug?

1 reply

MarkPflug Mar 28, 2023
Maintainer

If you can share a minimal repro I can take a look.

plsft · 2023-03-28T16:29:56Z

plsft
Mar 28, 2023

      var dataTable = new DataTable("test_table");
        var workingPath = "C:\\Work\\2023\\Data\\MESA";
        var file = "EI09 Individual U65 OEP Bonus _ 41980784_January.xls";
        var ext = Path.GetExtension(file);
        var fileOnly = Path.GetFileNameWithoutExtension(file);
        var excel = ext.StartsWith(".xl");
        var ix = 0;
        var recordCount = 0;
        var fieldCount = 0;

        DbDataReader dbr = null;
        CsvDataReader csv = null; 

        if (excel)
        {
            System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);

            var sheetName = "";
            var edr = ExcelDataReader.Create(Path.Combine(workingPath, file), new ExcelDataReaderOptions() {  });

            do
            {
                sheetName = edr.WorksheetName;
                using var cdw = CsvDataWriter.Create(Path.Combine(workingPath, fileOnly + "-" + sheetName + ".csv"), new CsvDataWriterOptions() {  });
                cdw.Write(edr);
                recordCount = edr.RowCount;
                fieldCount = edr.FieldCount;
            } while (edr.NextResult());

            file = fileOnly + "-" + sheetName + ".csv";
        }
        else
        {
            file = fileOnly + ".csv";
        }

        var text = File.OpenText(Path.Combine(workingPath, file)).ReadToEnd().Split('\n', '\r');
        

        var skip = 3; // header starts on 3
        var take = recordCount-4; // footer is 4 lines 

        //var reader = CsvDataReader.Create(textReader);

        csv = CsvDataReader.Create(File.OpenText(Path.Combine(workingPath, file)), new CsvDataReaderOptions
        {
            Delimiter = ',',
            Comment = '#',
            Escape = '\"'
        });

       csv = (csv.Skip(skip).Take(take).AsDbDataReader() as CsvDataReader)!; // throw null exception

This is a basic test I'm writing to strip out headers in XLS or CSV files for pre-processing and bulk import.

  var result = analyzer.Analyze(csv);
        var schema = result.GetSchema();
        var schemaBuilder = result.GetSchemaBuilder();
        var sch = schemaBuilder.Build();
        var columns = schema.GetColumnSchema();
        var dataSchema = new CsvSchema(schema);
        var validateData = true;
        var errorList = new ConcurrentBag<string>();

        dataTable.Load(csv);

1 reply

MarkPflug Mar 28, 2023
Maintainer

Skip and Take don't return a CsvDataReader; they return a DbDataReader (abstract), and the concrete type is internal to Sylvan.Data. You should be able to change that line to:

DbDataReader dr = csv.Skip(skip).Take(take);

Then use the dr for further processing/data loading.
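
Why the earlier "as CsvDataReader" cast produced a null can be shown with an analogy that uses only base-library types. Here TextReader stands in for the abstract DbDataReader and StreamReader for CsvDataReader; none of this is Sylvan code.

```csharp
using System;
using System.IO;

// The object is a StringReader, referenced through its abstract base type.
TextReader wrapped = new StringReader("a,b\n1,2\n");

// "as" does not throw on a type mismatch; it evaluates to null,
// and the failure only shows up later when the variable is used.
var asStream = wrapped as StreamReader;
Console.WriteLine(asStream is null); // prints "True"

// Keeping the variable typed as the base class avoids the problem.
TextReader usable = wrapped;
Console.WriteLine(usable.ReadLine()); // prints "a,b"
```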

plsft · 2023-03-28T16:58:42Z

plsft
Mar 28, 2023

It's still not skipping rows from the reader.

csv = CsvDataReader.Create(Path.Combine(workingPath, file), new CsvDataReaderOptions
        {
            Delimiter = ',',
            Comment = '#',
            Escape = '\"'
        });

        var reader = (csv.Skip(skip).Take(take));

        var header = string.Join(",", reader.GetColumnSchema().Select(c => string.IsNullOrEmpty(c.ColumnName) ? $"col_{++ix}": c.ColumnName.Replace('\n', ' ').Replace('\r', ' ').Trim() ).ToArray()); // header-string for MD5_HASH

The reader isn't returning the first column from where the actual header starts in the file. The first 3 lines are report heading info that I need to skip.

1 reply

MarkPflug Mar 28, 2023
Maintainer

Does your csv file have comment lines (#)?

plsft · 2023-03-28T17:24:54Z

plsft
Mar 28, 2023

Hi @MarkPflug -- thanks for the support! In some cases, yes they will. The main issue I'm having is that I need to be able to skip x lines at start and trim X lines from the end of files to process them correctly. I am also requiring the header be the first line so I can determine the file type by signature. Skip/Take is likely the solution here but when I call reader.GetColumnSchema(), it's returning the first record -- not skipping X lines.

1 reply

MarkPflug Mar 28, 2023
Maintainer

Trimming X lines from the end is going to be difficult. The reader is a forward-only reader, so there is no way of knowing how close you are to the end until you get there. A better approach might be to use TakeWhile, which would allow you to read until you identify some criterion that indicates the end of the data.

Here are some things that might not be clear:

The CsvDataReader will skip comment lines (#) automatically; you don't need to use Skip to skip those lines. Skip would only be needed if there is a line of text that would otherwise be indistinguishable from a data record.
If you don't provide an explicit schema, the first non-comment line will be used to determine the "schema", which will simply be the number of fields (FieldCount), and by default every column will be typed as a non-nullable string.

Without seeing an example csv file I can only make guesses at what is going on. If you don't want to post the file here you can email it to me.
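
The TakeWhile suggestion could look roughly like the sketch below. It assumes TakeWhile takes a predicate over the DbDataReader, in the same way the Where examples in this thread do; the exact signature is an assumption. The sample data and the "Total" footer marker are made up.

```csharp
using System;
using System.Data.Common;
using System.IO;
using Sylvan.Data;
using Sylvan.Data.Csv;

// Made-up sample: two data rows followed by two footer lines.
var data =
    "id,name\n" +
    "1,alpha\n" +
    "2,beta\n" +
    "Total,2\n" +
    "Generated,2023-03-28\n";

var csv = CsvDataReader.Create(new StringReader(data));

// Stop at the first row that looks like a footer.
// Assumption: TakeWhile accepts a Func<DbDataReader, bool> predicate.
DbDataReader dr = csv.TakeWhile(r => !r.GetString(0).StartsWith("Total"));

var rowCount = 0;
while (dr.Read())
{
    rowCount++;
    Console.WriteLine(dr.GetString(0) + " " + dr.GetString(1));
}
```

Because the reader stops at the first footer row, the number of footer lines does not need to be known in advance.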

plsft · 2023-03-28T17:48:55Z

plsft
Mar 28, 2023

Thanks @MarkPflug - I sent you the sample file.

3 replies

MarkPflug Mar 28, 2023
Maintainer

Got your email. It looks like there is a bug in Sylvan.Data.Excel when processing that .xls file. I'll open a ticket to fix that, but it might take a while to resolve.

After saving as an ".xlsx" file, this is how I would write the code to process this file:

using Sylvan.Data;
using Sylvan.Data.Excel;
using System.Data;
using System.Text;

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

var edr = ExcelDataReader.Create("sample-file.xlsx");

bool headersFound = false;
// read rows until the "header" row is found
while(edr.Read())
{
    // you could also do edr.RowFieldCount == 22, or some other
    // approach to identify that it is the headers.
    if (edr.GetString(0).StartsWith("Agency"))
    {
        // found the header row, initialize the reader using
        // the current row as the "headers".
        edr.Initialize();
        headersFound = true;
        break;
    }
}

// read the whole file without finding the header row.
if (!headersFound)
{
    throw new Exception("Could not find header row in data file");
}

// skip any rows that don't have the expected number of data fields.
var dr = edr.Where(r => edr.RowFieldCount > 6);


// you can now do whatever you need with "dr" which should only 
// return the "data rows"
while (dr.Read())
{
    // only outputs the "data" rows.
    Console.WriteLine(dr.GetString(0) + " " + dr.GetString(1));
}

plsft Mar 28, 2023

Isn't there a line count or record count? I could use that to skip, couldn't I?

MarkPflug Mar 28, 2023
Maintainer

CsvDataReader and ExcelDataReader both have a RowNumber, but the base DbDataReader does not.

plsft · 2023-03-29T01:27:01Z

plsft
Mar 29, 2023

@MarkPflug I was testing this and I noticed that given the sample file I sent you and this code:

System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);

            var sheetName = "";
            var edr = ExcelDataReader.Create(Path.Combine(workingPath, file), new ExcelDataReaderOptions {  });
            var dr = edr.Skip(3).Take(edr.RowCount - 3);
            do
            {
                sheetName = edr.WorksheetName;
                using var cdw = CsvDataWriter.Create(Path.Combine(workingPath, fileOnly + "-" + sheetName + ".csv"), new CsvDataWriterOptions()
                {
                    
                });
                cdw.Write(dr);
                recordCount = edr.RowCount;
                fieldCount = edr.FieldCount;
                
            } while (dr.NextResult());

the resulting file contains only 1 column -- I think the reader is thrown off by the fact that there is a header that I can't seem to skip here.

2 replies

MarkPflug Mar 29, 2023
Maintainer

ExcelDataReader.Create implicitly calls Initialize internally. So, when creating a new ExcelDataReader (or moving to the next sheet) the FIRST row in the sheet will be used to determine the expected number of columns. In the sample file you sent, the first row has 1 column, that's why you're seeing this.

Skip/Take operate in much the same way as the LINQ operators: they evaluate lazily. Unfortunately, this means there is no place for you to insert the appropriate Initialize call after the first three rows are skipped.

The RowCount property depends on the "dimension" of the sheet being recorded properly in the spreadsheet, and unfortunately this isn't always going to be reliable, for two reasons. First, not all spreadsheet software (or libraries that write spreadsheets) write the dimension record. Without a dimension record, RowCount will return -1, as documented. Second, sometimes the dimension will include rows at the end of the spreadsheet that appear empty (maybe they contained data that was subsequently deleted), but the dimension (and thus RowCount) includes them in the recorded range.

For these reasons, I'd recommend avoiding Skip/Take/RowCount for your use-case, and instead use an approach similar to what I provided above. You have failed to provide an explanation for why my solution didn't work, so I'm going to assume that you want the "Agent Sub Total" rows to be included in the csv output? If that's the case, you would only need to provide a more complex predicate to the Where method:

...

// see the RowFilter method below
var dr = edr.Where(RowFilter);

// you can now do whatever you need with "dr" which should only 
// return the "data rows"
while (dr.Read())
{
    // only outputs the "data" rows.
    Console.WriteLine(dr.GetString(0) + " " + dr.GetString(1));
}

// yield any rows that start with "Agent Sub Total" or appear to be data records based
// on the number of columns of data
bool RowFilter(DbDataReader r)
{
    return r.GetString(0) == "Agent Sub Total" ||
        edr.RowFieldCount >= 6;
}

If this doesn't do what you want then you'll need to provide a very clear explanation of what you're looking for.

plsft Mar 29, 2023

Hi @MarkPflug -- thanks again for the great explanation and code sample. The row filter idea might work for this file, but it's not a longer-term solution. The headers may change. I need a way to read the raw rows and skip only X rows, where X can be configured per file. The content might change (e.g. dates etc.) but the header line count will not. I already have a way to filter out the sub-total and footer lines. It's skipping the header that I'm stuck on, for both CSV and Excel. I'm thinking this might work, however. In the header lines only the first 2 columns have data; the rest are blank. I know there are always 3 header lines and 23 columns. If I skipped 69 fields and then started reading, would that work?

var dr = edr.Where(RowFilter);
// write dr as csv and process ....

bool RowFilter(DbDataReader r)
{
    // RowFieldCount lives on the ExcelDataReader, not on the base DbDataReader
    return edr.RowFieldCount == 69;
}
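
For the CSV half of this question, the approach mentioned earlier in the thread (consuming lines from the TextReader before passing it to CsvDataReader.Create) can be sketched as below. SkipLines is a hypothetical helper, and the report text is made up apart from the "Agency" header name used earlier in the thread.

```csharp
using System;
using System.IO;
using Sylvan.Data.Csv;

// Hypothetical helper: discards a fixed number of physical lines.
static void SkipLines(TextReader reader, int count)
{
    for (var i = 0; i < count && reader.ReadLine() != null; i++) { }
}

var text = new StringReader(
    "Monthly Bonus Report\n" +
    "Run date,2023-03-28\n" +
    "\n" +
    "Agency,Agent,Amount\n" +
    "A1,Smith,10\n");

SkipLines(text, 3); // per-file setting: number of report-heading lines

// The next line the CSV reader sees is the real header row.
var csv = CsvDataReader.Create(text);
var firstColumn = csv.GetName(0);
Console.WriteLine(firstColumn);
```

This skips lines rather than fields, so it does not depend on how many columns the heading rows contain. It assumes none of the heading lines contains a quoted field with an embedded line break.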
