CSV dialect detection: implementation without third party libraries #2247

ws-garcia · 2024-10-25T19:56:17Z

Discussed in #2246

Originally posted by ws-garcia October 25, 2024

Problem overview

Currently, this project does not have a stable way to detect a CSV file's configuration (its dialect). An example is raised in #1719, where the utility fails to detect the configuration for the given files.

Details

At the moment, @jqnatividad has begun digging into the problem, suggesting:

Perhaps, we can tag-team on qsv-sniffer to make its CSV schema inferencing more reliable?

He also pointed out:

Aligning qsv-sniffer's behavior with python's csv sniffer is the way to go!

The work plan so far is outlined in jqnatividad/qsv-sniffer#14. Currently, all of its tasks are under study, but none has been completed.

New path

Here I will discuss a new approach to implementing dialect detection in qsv using simple building blocks:

Regexes: determine fields data types.
Current implemented parser: load data.
Table Uniformity measure: detect the table with the best structure.
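As a rough illustration of the first element, the sketch below classifies a single field with regular expressions. The pattern set and the `detect_type` name are hypothetical, chosen only for this example; they are not the patterns used by CSVsniffer.

```python
import re

# Hypothetical patterns for illustration only; a real detector would
# likely cover more types (currencies, times, URLs, and so on).
TYPE_PATTERNS = {
    "integer": re.compile(r"[+-]?\d+"),
    "float": re.compile(r"[+-]?(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "boolean": re.compile(r"true|false", re.IGNORECASE),
}


def detect_type(field: str) -> str:
    """Return the name of the first pattern that matches the whole field."""
    value = field.strip()
    if not value:
        return "empty"
    for name, pattern in TYPE_PATTERNS.items():
        if pattern.fullmatch(value):
            return name
    return "text"
```

A field that matches none of the patterns falls back to `"text"`, so a table parsed with the wrong dialect tends to show many `"text"` cells where the right dialect would expose numbers or dates.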

With this approach, dialect detection is as reliable as CleverCSV's and is able to obtain results with greater certainty. The process is as follows:

In the first phase, potential dialects are built from combinations of field/column separator, quotation mark, and record delimiter characters. At this stage, the user can provide a custom delimiter list, giving the tool a level of flexibility.
With each potential dialect, we attempt to parse the CSV file and use the data to construct a temporary table.
The table is scored using the Table Uniformity measure. Each score is saved in a collection, using the dialect as the key.
The dialect that produces the table with the highest score is then selected as the detected one.
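The steps above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the CSVsniffer implementation: `score_table` is a simplified stand-in for the Table Uniformity measure (the actual formula is defined in the paper), record delimiter candidates are omitted because Python's `csv` module handles line endings itself, and the function names and default candidate lists are hypothetical.

```python
import csv
import io
import itertools
import re
from collections import Counter

# Assumption: a tiny pattern for "typed" cells (numbers and ISO dates).
KNOWN_TYPE = re.compile(r"[+-]?\d+(?:\.\d+)?|\d{4}-\d{2}-\d{2}")


def score_table(rows):
    """Stand-in score, NOT the paper's Table Uniformity formula.

    Rewards tables whose rows share one width, whose cells match a known
    type, and whose cells carry no leftover quote characters.
    """
    rows = [row for row in rows if row]
    if not rows:
        return 0.0
    modal_width, modal_count = Counter(len(r) for r in rows).most_common(1)[0]
    if modal_width < 2:  # a single column means the delimiter never split
        return 0.0
    cells = [cell for row in rows for cell in row]
    consistency = modal_count / len(rows)
    typed = sum(1 for c in cells if KNOWN_TYPE.fullmatch(c.strip())) / len(cells)
    clean = sum(1 for c in cells if '"' not in c and "'" not in c) / len(cells)
    return consistency * (typed + clean) / 2


def detect_dialect(text, delimiters=(",", ";", "\t", "|"), quotes=('"', "'")):
    """Try every candidate dialect and return the best-scoring pair."""
    scores = {}
    for delimiter, quote in itertools.product(delimiters, quotes):
        try:
            reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quote)
            rows = list(reader)
        except csv.Error:
            continue  # this candidate cannot parse the data at all
        scores[(delimiter, quote)] = score_table(rows)  # dialect is the key
    if not scores:
        raise ValueError("no candidate dialect could parse the input")
    return max(scores, key=scores.get)


sample = 'id;name;price\n1;"Widget, small";3.50\n2;"Gadget, large";12.00\n'
best_delimiter, best_quote = detect_dialect(sample)
```

Passing a custom `delimiters` tuple corresponds to the user-supplied delimiter list mentioned in the first step.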

A Python implementation of this exact approach is described in a GitHub repository. Evaluating this method gives:

Tool	F1 score
`CSVsniffer`	0.9260
`CleverCSV`	0.8425
`csv.Sniffer`	0.8049

This sheds light on one point: the presented approach clearly outperforms both csv.Sniffer and CleverCSV on the research datasets.

Hoping this can help this wonderful project!

Edit:

A code snippet will be presented in the discussion.


jqnatividad · 2024-10-25T20:03:40Z

Thanks @ws-garcia !

This is very timely as I was dreading taking on the csv-sniffer python port, thus the lack of activity.

Your step-by-step "new path" breakdown is certainly easier to digest than the paper :)

Will be sure to loop you in as we mark progress...

ws-garcia · 2024-10-25T20:07:27Z

You only need the paper to implement some of the logic if porting the Python code gets confusing. So, treat the research as a backup reference while diving into the implementation.

jqnatividad · 2024-11-07T15:15:12Z

Hi @ws-garcia , just wanted to let you know that I'm thinking of implementing your paper as a Rust library given the utility of CSV dialect detection, as other developers may want to use your CSV dialect detection algorithm, and qsv is a command-line utility.

As the name csv-sniffer is already used by the apparently unmaintained crate, I'm thinking of naming it
csv-garciasniffer. 😄

I will deprecate the existing qsv-sniffer csv-sniffer fork and use the new csv-garciasniffer crate once it's implemented.

Thoughts?

ws-garcia · 2024-11-07T15:22:54Z

Hey @jqnatividad, I am honored that you have the idea of adding my name to the library. But there is a name that would sound great and promote the amazing product that is qsv: csv-qsniffer.

I continue to think that adding a high-precision dialect detector to qsv would be a great milestone for the project. So, go ahead with the library and its implementation!

jqnatividad · 2024-11-07T15:32:40Z

Great! 🎉 csv-qsniffer it is then! 🥳

Will keep you posted as we mark progress on implementing the library and integrating it into qsv and qsv pro.

ws-garcia · 2024-11-07T15:45:27Z

The research paper's methodology will soon be published as Open Access under the Creative Commons Attribution License (CC BY 4.0). You only need to give attribution ©️. Let's make qsv as infallible as possible!

jqnatividad added the enhancement label Oct 25, 2024

jqnatividad self-assigned this Oct 25, 2024

jqnatividad added the WIP work in progress label Oct 30, 2024

jqnatividad pinned this issue Nov 7, 2024
