Make encoder-marc21 more forgiving? #528

Open
dr0i opened this issue Apr 29, 2024 · 6 comments

dr0i commented Apr 29, 2024

Came up in #527 :

If we parse (presumably) malformed binary MARC, the encoding fails.
(The first broken record seems to be the one starting with 02589nas a2200601 c 4500 in https://raw.githubusercontent.com/gbv/Catmandu-Tutorial/master/data/marc.mrc. This should be double-checked with a MARC validator other than MF:
the encoding breaks because the record's binary directory entry for field 787 points to Iso646Constants.INFORMATION_SEPARATOR_2 = 0x1e.)

If the encoding breaks, it is not just the field or the record that is dropped, but the whole stream. Dropping the record and, more importantly, the whole stream can be avoided by piping decode-marc21 to catch-stream-exception before piping to encode-marc21.
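To illustrate the difference between losing one record and losing the whole stream, here is a minimal Python sketch. It is not Metafacture code: `encode_strict`, `encode_all`, and the record layout are hypothetical stand-ins that only show the general idea of catching an encoding failure per record so that later records are still processed.

```python
FIELD_TERMINATOR = "\x1e"


def encode_strict(record):
    """Hypothetical strict encoder: raise on malformed indicators."""
    parts = []
    for tag, indicators, value in record:
        # Indicators must be exactly two characters and must not contain
        # the field terminator (the symptom seen in the broken records).
        if len(indicators) != 2 or FIELD_TERMINATOR in indicators:
            raise ValueError(f"field {tag}: invalid indicators {indicators!r}")
        parts.append(f"{tag}{indicators}{value}")
    return "|".join(parts)


def encode_all(records, on_error):
    """Encode each record; report failures instead of aborting the stream."""
    for index, record in enumerate(records):
        try:
            yield encode_strict(record)
        except ValueError as error:
            on_error(index, error)


records = [
    [("245", "10", "Good title")],
    [("650", "0\x1e", "Steroide")],  # broken indicator
    [("245", "00", "Another title")],
]
errors = []
encoded = list(encode_all(records, lambda i, e: errors.append((i, str(e)))))
```

With this arrangement the second record is reported through `on_error`, while the first and third records are still encoded.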

a) if the record is indeed invalid:
aa) shall we make encode-marc21 more forgiving?
ab) or is it enough to bail out (as it does at the moment), that is, to expect the user to either use catch-stream-exception or fix the invalid MARC?
b) if the record is valid: fix encode-marc21

@dr0i dr0i added this to Metafacture Apr 30, 2024
@dr0i dr0i moved this to Ready in Metafacture Apr 30, 2024
@dr0i dr0i moved this from Ready to Selected in Metafacture Apr 30, 2024
@dr0i dr0i moved this from Selected to Ready in Metafacture Apr 30, 2024
@dr0i dr0i removed their assignment Jun 20, 2024

TobiasNx commented Sep 3, 2024

If I am not mistaken, MF in general takes a "make or break" approach to transformations; in particular, the encode-marc21 module has an integrated validator that is quite strict. I would assume that this is okay. But it would be good if the error message were more explanatory and hinted at the cause of the error.


TobiasNx commented Sep 3, 2024

I separated the broken records from the valid ones.

e.g.

"6500\x1e":
  "7": ""
  "0": "(DE-588)4057379-5"
  "0": "(DE-101)040573796"
  a: "Steroide"
  "2": "gnd"
"650d\x1e":
  "7": ""
  "0": "(DE-588)4039983-7"
  "0": "(DE-101)040399834"
  a: "Molekularbiologie"
  "2": "gnd"
"650d\x1e":
  "7": ""
  "0": "(DE-588)4067488-5"
  "0": "(DE-101)040674886"
  a: "Zeitschrift"
  "2": "gnd"
"650d\x1e":
  "7": ""
  "0": "(DE-588)4057379-5"
  "0": "(DE-101)040573796"
  a: "Steroide"
  "2": "gnd"
"650d\x1e":
  "7": ""
  "0": "(DE-588)4006777-4"
  "0": "(DE-101)040067777"
  a: "Biochemie"
  "2": "gnd"

See here in the playground. You can spot the broken indicators in the YAML result.

I also checked the broken records with yaz-marcdump:

$ yaz-marcdump -np '/home/tobias/Downloads/broken.mrc' 
<!-- Record 1 offset 0 (0x0) -->
No separator at end of field length=75
No separator at end of field length=19
No separator at end of field length=24
No separator at end of field length=88
<!-- Skipping bad byte 10 (0x0A) at offset 882 (0x372) -->
<!-- Record 2 offset 883 (0x373) -->
No separator at end of field length=124
No separator at end of field length=89
No separator at end of field length=21
No separator at end of field length=30
No separator at end of field length=88
Separator but not at end of field length=22
<!-- Skipping bad byte 10 (0x0A) at offset 1805 (0x70d) -->
<!-- Record 3 offset 1806 (0x70e) -->
No separator at end of field length=121
No separator at end of field length=17
No separator at end of field length=40
No separator at end of field length=14
No separator at end of field length=119
Separator but not at end of field length=91
<!-- Skipping bad byte 10 (0x0A) at offset 2733 (0xaad) -->
<!-- Record 4 offset 2734 (0xaae) -->
No separator at end of field length=117
No separator at end of field length=110
<!-- Skipping bad byte 10 (0x0A) at offset 3848 (0xf08) -->
<!-- Record 5 offset 3849 (0xf09) -->
No separator at end of field length=104
No separator at end of field length=80
<!-- Skipping bad byte 10 (0x0A) at offset 4975 (0x136f) -->
<!-- Record 6 offset 4976 (0x1370) -->
No separator at end of field length=176
No separator at end of field length=17
No separator at end of field length=16
No separator at end of field length=21
No separator at end of field length=27
No separator at end of field length=115
Separator but not at end of field length=45
Separator but not at end of field length=64
Separator but not at end of field length=45
<!-- Skipping bad byte 10 (0x0A) at offset 6185 (0x1829) -->
<!-- Record 7 offset 6186 (0x182a) -->
No separator at end of field length=132
No separator at end of field length=40
No separator at end of field length=12
No separator at end of field length=28
No separator at end of field length=245
No separator at end of field length=57
No separator at end of field length=59
No separator at end of field length=55
No separator at end of field length=57
No separator at end of field length=19
No separator at end of field length=97
No separator at end of field length=91
Separator but not at end of field length=96
<!-- Skipping bad byte 10 (0x0A) at offset 7805 (0x1e7d) -->
<!-- Record 8 offset 7806 (0x1e7e) -->
No separator at end of field length=65
No separator at end of field length=32
No separator at end of field length=19
No separator at end of field length=24
No separator at end of field length=43
No separator at end of field length=14
No separator at end of field length=57
No separator at end of field length=59
No separator at end of field length=57
No separator at end of field length=59
No separator at end of field length=55
No separator at end of field length=57
No separator at end of field length=19
No separator at end of field length=55
No separator at end of field length=57
No separator at end of field length=19
No separator at end of field length=109
No separator at end of field length=95
<!-- Skipping bad byte 10 (0x0A) at offset 9536 (0x2540) -->
<!-- Record 9 offset 9537 (0x2541) -->
No separator at end of field length=66
No separator at end of field length=31
No separator at end of field length=58
No separator at end of field length=23
No separator at end of field length=16
No separator at end of field length=56
No separator at end of field length=65
No separator at end of field length=59
No separator at end of field length=56
No separator at end of field length=57
No separator at end of field length=59
No separator at end of field length=54
No separator at end of field length=63
No separator at end of field length=57
No separator at end of field length=19
No separator at end of field length=54
No separator at end of field length=55
No separator at end of field length=57
No separator at end of field length=19
No separator at end of field length=118
Separator but not at end of field length=88
Separator but not at end of field length=206

Longer report with $ yaz-marcdump -npv '/home/tobias/Downloads/broken.mrc' here: https://gist.github.com/TobiasNx/9711cc680acdeb55ebb1b69700cb2477

The separators in these examples seem to be broken. Let me see how Catmandu is handling it.
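The kind of check behind the "No separator at end of field" messages can be sketched as follows. This is a simplified illustration, not the yaz-marcdump implementation: it walks the directory (12-byte entries with tag, field length, and start offset) and tests whether each field ends with the field terminator 0x1e. The names `check_field_terminators` and `build_record` are hypothetical helpers, and `length_skew` exists only to produce a deliberately broken directory for the demonstration.

```python
FIELD_TERMINATOR = 0x1E
RECORD_TERMINATOR = 0x1D
LEADER_LENGTH = 24
ENTRY_LENGTH = 12


def check_field_terminators(record):
    """Walk the directory and report fields that do not end with 0x1e."""
    base = int(record[12:17].decode("ascii"))  # base address of data
    directory = record[LEADER_LENGTH:base - 1]  # without its terminator
    problems = []
    for start in range(0, len(directory), ENTRY_LENGTH):
        entry = directory[start:start + ENTRY_LENGTH]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7].decode("ascii"))
        offset = int(entry[7:12].decode("ascii"))
        field = record[base + offset:base + offset + length]
        if not field or field[-1] != FIELD_TERMINATOR:
            problems.append(f"{tag}: no separator at end of field length={length}")
        elif FIELD_TERMINATOR in field[:-1]:
            problems.append(f"{tag}: separator but not at end of field length={length}")
    return problems


def build_record(fields, length_skew=0):
    """Build a tiny MARC-like record; length_skew corrupts directory lengths."""
    directory = b""
    data = b""
    for tag, content in fields:
        body = content + bytes([FIELD_TERMINATOR])
        directory += b"%s%04d%05d" % (tag, len(body) + length_skew, len(data))
        data += body
    base = LEADER_LENGTH + len(directory) + 1
    total = base + len(data) + 1
    leader = b"%05dnas a22%05d c 4500" % (total, base)
    return (leader + directory + bytes([FIELD_TERMINATOR])
            + data + bytes([RECORD_TERMINATOR]))


fields = [(b"245", b"10\x1faTitle"), (b"650", b" 7\x1faSteroide")]
good_report = check_field_terminators(build_record(fields))
bad_report = check_field_terminators(build_record(fields, length_skew=-1))
```

A record whose directory lengths match the data yields an empty report, while the skewed directory yields one "no separator" entry per field.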


TobiasNx commented Sep 3, 2024

I also tested the broken records with Catmandu. It seems that their MARC decoder, and not the encoder, handles the incoming data differently. It does not skip the broken separators but the broken elements as a whole. Here it replaces the broken indicators with whitespace:

MF result of transforming MARC into MARCXML; have a look at the indicators and the first subfield:

		<marc:datafield tag="775" ind1="0" ind2="">
			<marc:subfield code="8"></marc:subfield>
			<marc:subfield code="i">Online-Ausg.</marc:subfield>
			<marc:subfield code="t">�The� journal of steroid biochemistry and molecular biology</marc:subfield>
			<marc:subfield code="w">(DE-600)1482780-3</marc:subfield>
			<marc:subfield code="w">(DE-101)019756801</marc:subfield>
		</marc:datafield>
		<marc:datafield tag="780" ind1="8" ind2="0">
			<marc:subfield code="">00</marc:subfield>
			<marc:subfield code="i">Vorg.:</marc:subfield>
			<marc:subfield code="t">�The� journal of steroid biochemistry</marc:subfield>
			<marc:subfield code="w">(DE-600)80169-0</marc:subfield>
			<marc:subfield code="w">(DE-101)010545514</marc:subfield>
		</marc:datafield>

CATMANDU result of transforming MARC into MARCXML with: $ catmandu convert MARC to MARC --type XML < '/home/tobias/Downloads/broken.mrc' > broken.xml. Here the broken first indicator and the first subfield do not exist.

		<marc:datafield tag="775" ind1=" " ind2=" ">
			<marc:subfield code="i">Online-Ausg.</marc:subfield>
			<marc:subfield code="t">�The� journal of steroid biochemistry and molecular biology</marc:subfield>
			<marc:subfield code="w">(DE-600)1482780-3</marc:subfield>
			<marc:subfield code="w">(DE-101)019756</marc:subfield>
		</marc:datafield>
		<marc:datafield tag="780" ind1=" " ind2=" ">
			<marc:subfield code="i">Vorg.:</marc:subfield>
			<marc:subfield code="t">�The� journal of steroid biochemistry</marc:subfield>
			<marc:subfield code="w">(DE-600)80169-0</marc:subfield>
			<marc:subfield code="w">(DE-101)0105</marc:subfield>
		</marc:datafield>

I would be in favour of adding an option to the decoder so that it does not create broken values from a broken separator.
Perhaps the CATMANDU MARC decoder, even though it handles MARC very differently, could hint at a solution:

https://metacpan.org/release/HOCHSTEN/Catmandu-MARC-1.32/source/lib/Catmandu/Importer/MARC/Decoder.pm#PCatmandu::Importer::MARC::Decoder
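As a rough idea of what such an optional lenient mode could do, here is a small Python sketch. It is an assumption-laden illustration, not the Catmandu algorithm and not existing Metafacture behaviour: the hypothetical `sanitize_field` replaces each invalid indicator with a blank and drops subfields whose code is invalid or whose value is empty. Unlike the Catmandu output above, it leaves valid indicators untouched and does not truncate values.

```python
import string

# MARC 21 indicators: one lowercase letter, digit, or blank.
VALID_INDICATORS = frozenset(string.ascii_lowercase + string.digits + " ")
# MARC 21 subfield codes: one lowercase letter or digit.
VALID_CODES = frozenset(string.ascii_lowercase + string.digits)


def sanitize_field(ind1, ind2, subfields):
    """Hypothetical lenient clean-up of one decoded data field."""
    clean_ind1 = ind1 if ind1 in VALID_INDICATORS else " "
    clean_ind2 = ind2 if ind2 in VALID_INDICATORS else " "
    # Design choice of this sketch: drop subfields that cannot be
    # represented properly instead of passing broken values downstream.
    kept = [(code, value) for code, value in subfields
            if code in VALID_CODES and value]
    return clean_ind1, clean_ind2, kept


# Values taken from the MF result for fields 775 and 780 shown above.
field_775 = sanitize_field("0", "", [("8", ""), ("i", "Online-Ausg."),
                                     ("w", "(DE-600)1482780-3")])
field_780 = sanitize_field("8", "0", [("", "00"), ("i", "Vorg.:"),
                                      ("w", "(DE-600)80169-0")])
```

Whether dropping or repairing is the better default is an open design question; making it an option, as suggested above, keeps the current strict behaviour available.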

@TobiasNx TobiasNx assigned dr0i and unassigned TobiasNx and maipet Sep 3, 2024

dr0i commented Sep 3, 2024

I try to follow. But the playground example in #528 (comment) results in "Request-URI Too Long".

@dr0i dr0i assigned TobiasNx and unassigned dr0i Sep 3, 2024
@dr0i dr0i moved this from Ready to Review in Metafacture Sep 3, 2024

TobiasNx commented Sep 3, 2024

I try to follow. But the playground example in #528 (comment) results in "Request-URI Too Long".

Thanks for the hint. MF Playground does not complain anymore if the URL is too long. Should open a ticket there.

I fixed the example and added some more info to my comments: #528 (comment)

@TobiasNx TobiasNx removed their assignment Sep 23, 2024

TobiasNx commented Sep 24, 2024

As I revised my comments: @dr0i in short: we should not change the behaviour of encode-marc21 but that of decode-marc21, so that the decoder optionally does not create broken values from invalid separators, similar to how Catmandu handles them.

Perhaps the CATMANDU MARC decoder, even though it handles MARC very differently, could hint at a solution:

metacpan.org/release/HOCHSTEN/Catmandu-MARC-1.32/source/lib/Catmandu/Importer/MARC/Decoder.pm#PCatmandu::Importer::MARC::Decoder
