Problem with quotes in tsv files #40

Open
boydkelly opened this issue Sep 1, 2024 · 4 comments

Comments

@boydkelly

When linting tsv files, I get:

$ csvlint -delimiter='\t' build/neo_ex.tsv 
Warning: not using defaults, may not validate CSV to RFC 4180
Record #1035 has error: bare " in non-quoted-field

unable to parse any further

Record 1035 is shown below. Since this is a TSV file, and tabs are used precisely to avoid quoting problems, shouldn't quotes be ignored entirely rather than reported as errors?

9010c36f-6958-48d9-ba2d-c50f65c8825d	dondon ko "ken ken kileri kɛ".	dyu	exm	dyuEx
@kmatt

kmatt commented Sep 3, 2024

Parsing and error detection in this utility are handled by https://pkg.go.dev/encoding/csv#Reader, which seems to complain if the quotes are not the first or last character in the field.

  1. In your sample text is the double quoted field delimited by tabs as in dondon ko\t"ken ken kileri kɛ".\tdyu ?

  2. Or is there whitespace before the leading quote as in dondon ko\t "ken ken kileri kɛ".\tdyu ?

Only the second case throws the error for me.

@boydkelly
Author

It certainly could be the second case. Since this is foreign-language prose and not 'clean' text, the expectation is that once a file is declared tab-delimited, it should not matter whether or where a quote occurs. So in your second example the text should lint 'properly', as if each \t were replaced by a line feed:

dondon
"ken ken kileri kɛ".
dyu

So it looks like the bug is with csv#Reader?

I'm really just checking that the number of columns is accurate, and for now awk will do the job. But it would be great to see TSV handled correctly here.
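
A quote-agnostic column check of the kind awk does here can be sketched in Go as well. This is only an illustration, not part of csvlint; checkColumns is a made-up name, and it assumes every line is a complete record (no embedded newlines inside fields):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// checkColumns (hypothetical helper) returns the 1-based numbers of the lines
// whose tab-separated field count differs from want. Quotes get no special
// treatment: only the tab character separates fields.
func checkColumns(input string, want int) ([]int, error) {
	var bad []int
	sc := bufio.NewScanner(strings.NewReader(input))
	lineNo := 0
	for sc.Scan() {
		lineNo++
		// A line with k tabs has k+1 fields.
		if n := strings.Count(sc.Text(), "\t") + 1; n != want {
			bad = append(bad, lineNo)
		}
	}
	// bufio.Scanner fails on very long lines (over 64 KiB by default).
	return bad, sc.Err()
}

func main() {
	data := "a\tb \"c\".\td\n" + // 3 fields, quotes ignored
		"e\tf\n" // 2 fields, reported
	bad, err := checkColumns(data, 3)
	fmt.Println(bad, err)
}
```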

@kmatt

kmatt commented Sep 3, 2024

So it looks like the bug is with csv#Reader?

I'm not certain whether it's a bug, because the Reader docs are not explicit about tab-delimited data.

-lazyquotes may be an option in this case.
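
A minimal sketch of what that looks like at the library level, assuming the -lazyquotes flag maps to the Reader's LazyQuotes field (the helper name parseLazyTSV is made up for illustration):

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseLazyTSV (hypothetical helper) reads tab-delimited text with LazyQuotes
// enabled, so a quote inside a non-quoted field is kept as a literal character.
func parseLazyTSV(input string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.Comma = '\t'
	r.LazyQuotes = true
	return r.ReadAll()
}

func main() {
	// Record 1035 as it appears in the report, with the quote mid-field.
	line := "9010c36f-6958-48d9-ba2d-c50f65c8825d\tdondon ko \"ken ken kileri kɛ\".\tdyu\texm\tdyuEx\n"

	records, err := parseLazyTSV(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("fields:", len(records[0]))
	fmt.Printf("%q\n", records[0][1])
}
```

One caveat: even with LazyQuotes, a field that begins with a quote is still parsed as a quoted field, so this does not make the Reader fully quote-agnostic.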

@boydkelly
Author

I'll just use awk. The whole point of tab delimiters is to avoid the numerous problems of quote delimiters. In a tab-delimited file, quotes should not be treated as anything but ordinary characters. I guess csv#Reader is true to its name: comma-separated. It does not handle tabs correctly.
