CSV dialect detection: implementation without third party libraries #2247

Open
ws-garcia opened this issue Oct 25, 2024 Discussed in #2246 · 6 comments
Labels: enhancement (New feature or request. Once marked with this label, it's in the backlog.), WIP (work in progress)

Comments

@ws-garcia

ws-garcia commented Oct 25, 2024

Discussed in #2246

Originally posted by ws-garcia October 25, 2024

Problem overview

Currently, this project has no reliable way to detect a CSV file's configuration (its dialect). An example is raised in #1719, where the utility fails to detect the configuration of the given files.

Details

@jqnatividad has already begun digging into the problem, suggesting:

Perhaps, we can tag-team on qsv-sniffer to make its CSV schema inferencing more reliable?

He also pointed out:

Aligning qsv-sniffer's behavior with python's csv sniffer is the way to go!

The work plan so far is outlined in jqnatividad/qsv-sniffer#14. Currently, all of its tasks are under study, but none has been completed.

New path

Here I discuss a new approach to implementing dialect detection in qsv using simple building blocks:

  • Regexes: determine field data types.
  • The currently implemented parser: load data.
  • Table Uniformity measure: detect the table with the best structure.
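
To illustrate the first building block, here is a minimal Python sketch of regex-based field typing. The pattern set and the `detect_field_type` name are assumptions made for this example; the patterns used by the actual implementation may well differ.

```python
import re

# Hypothetical pattern set for illustration only; checked in insertion order.
TYPE_PATTERNS = {
    "integer": re.compile(r"[+-]?\d+"),
    "float": re.compile(r"[+-]?(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "boolean": re.compile(r"true|false", re.IGNORECASE),
}


def detect_field_type(field: str) -> str:
    """Return the name of the first pattern that matches the whole field."""
    value = field.strip()
    if value == "":
        return "empty"
    for name, pattern in TYPE_PATTERNS.items():
        # fullmatch ensures the entire field has the type, not just a prefix.
        if pattern.fullmatch(value):
            return name
    return "text"
```

A field that was split with the wrong delimiter, such as `1;2`, falls through to `text`, which is what lets a type-aware score tell a good parse from a bad one.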

With this approach, dialect detection is as reliable as CleverCSV's, and it can produce results with greater certainty. The process is as follows:

  • In the first phase, potential dialects are built from combinations of field/column separator, quotation mark, and record delimiter characters. At this stage, the user can provide a custom delimiter list, which gives the tool some flexibility.
  • With each potential dialect, we attempt to parse the CSV file and use the data to construct a temporary table.
  • The table is scored using the Table Uniformity measurement. Each score is saved in a collection using the dialect as a key.
  • The dialect that produces the table with the highest score is then selected as the desired one.
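
The four steps above can be sketched in Python with only the standard library. This is an illustration under stated assumptions, not the reference implementation: `toy_score` is a simplified stand-in and is NOT the Table Uniformity measure from the paper, every function name here is hypothetical, and the record delimiter is left to `csv.reader` rather than enumerated as a dialect dimension.

```python
import csv
import io
import itertools
import re

# Hypothetical "known type" pattern: integers, decimals, ISO dates.
KNOWN_TYPE = re.compile(r"[+-]?\d+(?:\.\d+)?|\d{4}-\d{2}-\d{2}")
DEFAULT_DELIMITERS = (",", ";", "\t", "|")
DEFAULT_QUOTECHARS = ('"', "'")

SAMPLE = (
    "id;name;price\n"
    '1;"Widget, large";9.99\n'
    '2;"Gadget";19.50\n'
    '3;"Gizmo, small";4.25\n'
)


def parse_table(text, delimiter, quotechar):
    """Step 2: parse the sample with one candidate dialect into a list of rows."""
    try:
        reader = csv.reader(io.StringIO(text, newline=""),
                            delimiter=delimiter, quotechar=quotechar)
        return [row for row in reader if row]
    except csv.Error:
        return []


def field_score(field):
    """1.0 for a typed value, 0.5 for clean text, 0.0 if stray quotes remain."""
    value = field.strip()
    if KNOWN_TYPE.fullmatch(value):
        return 1.0
    if '"' in value or "'" in value:
        return 0.0
    return 0.5


def toy_score(table):
    """Step 3 stand-in (assumption): row-width consistency times mean field score."""
    if not table:
        return 0.0
    widths = {}
    for row in table:
        widths[len(row)] = widths.get(len(row), 0) + 1
    # Most frequent row width; ties go to the wider table.
    modal_width, modal_count = max(widths.items(), key=lambda kv: (kv[1], kv[0]))
    if modal_width < 2:
        return 0.0  # a single column suggests the delimiter never matched
    consistency = modal_count / len(table)
    fields = [f for row in table for f in row]
    return consistency * sum(field_score(f) for f in fields) / len(fields)


def detect_dialect(text, delimiters=DEFAULT_DELIMITERS, quotechars=DEFAULT_QUOTECHARS):
    """Steps 1 and 4: enumerate candidates, score each, pick the best."""
    scores = {}  # keyed by the (delimiter, quotechar) dialect
    for delimiter, quotechar in itertools.product(delimiters, quotechars):
        scores[(delimiter, quotechar)] = toy_score(parse_table(text, delimiter, quotechar))
    best = max(scores, key=scores.get)
    return best, scores
```

Calling `detect_dialect(SAMPLE)` selects `;` as the delimiter and `"` as the quote character for the sample shown, and a custom list can be passed through the `delimiters` argument, mirroring the flexibility described in the first step.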

A Python implementation of this exact approach is described in a GitHub repository. The evaluation of this method gives:

Tool          F1 score
CSVsniffer    0.9260
CleverCSV     0.8425
csv.Sniffer   0.8049

This sheds light on one point: on the research datasets, the presented approach outperforms both csv.Sniffer and CleverCSV.

Hoping this can help this wonderful project!

Edit:

A code snippet will be presented in the discussion.

@jqnatividad jqnatividad added the enhancement New feature or request. Once marked with this label, it's in the backlog. label Oct 25, 2024
@jqnatividad
Owner

Thanks @ws-garcia !

This is very timely, as I was dreading taking on the csv-sniffer Python port, hence the lack of activity.

Your step-by-step "new path" breakdown is certainly easier to digest than the paper :)

Will be sure to loop you in as we make progress...

@jqnatividad jqnatividad self-assigned this Oct 25, 2024
@ws-garcia
Author

ws-garcia commented Oct 25, 2024

You only need the paper to work out some of the logic if porting the Python code gets confusing. So treat the research as a backup reference when diving into the implementation.

@jqnatividad jqnatividad added the WIP work in progress label Oct 30, 2024
@jqnatividad
Owner

Hi @ws-garcia, just wanted to let you know that I'm thinking of implementing your paper as a standalone Rust library. CSV dialect detection is broadly useful, and other developers may want to use your algorithm, whereas qsv is a command-line utility.

As the name csv-sniffer is already taken by an apparently unmaintained crate, I'm thinking of naming it csv-garciasniffer. 😄

I will deprecate qsv-sniffer, the existing csv-sniffer fork, and use the new csv-garciasniffer crate once it's implemented.

Thoughts?

@ws-garcia
Author

ws-garcia commented Nov 7, 2024

Hey @jqnatividad, I am honored that you thought of adding my name to the library. But there is a name that would sound great and also promote the amazing product that is qsv: csv-qsniffer.

I still think that adding a high-precision dialect detector to qsv would be a great milestone for the project. So, go ahead with the library and its implementation!

@jqnatividad
Owner

Great! 🎉 csv-qsniffer it is then! 🥳

Will keep you posted as we make progress on implementing the library and integrating it into qsv and qsv pro.

@ws-garcia
Author

The research paper's methodology will soon be published as Open Access under the Creative Commons Attribution License (CC BY 4.0). You only need to give attribution ©️. Let's make qsv as infallible as possible!

@jqnatividad jqnatividad pinned this issue Nov 7, 2024