Phase II? #8

mimno · 2020-08-13T01:55:57Z

According to https://textcreationpartnership.org/tcp-texts/eebo-tcp-early-english-books-online/, all the Phase II texts should now be freely available as of 1 Aug 2020. Will this repo be updated with the 28k that are currently listed as "Restricted"? It's a really convenient way to distribute raw files!

PFSchaffner · 2020-08-13T02:37:16Z

If I can figure out how, yes. It is a bit complicated. The files on GitHub are the TEI P5 files created by Sebastian Rahtz at Oxford, and they have fallen a bit behind the release schedule: there are 35,000 EEBO phase 2 files released, not all of which were available to be converted by Sebastian. So we need either to reconstruct his process or find another that achieves a similar end. Moreover, the underlying metadata found in the file headers needs to be re-generated for ALL the files, based on new MARC records released by ProQuest, and incorporating links to the new EEBO platform. In the meantime, all released EEBO TCP files can be downloaded (albeit not as comfortably for some as from gitHub) from Box, in their original more-or-less TEI P4 form (as well as their raw SGML form, that in which they were created). All of those (as you probably already know) are freely available here: https://umich.app.box.com/s/f3mphvepm20akwloqna2

The P5 version of the Phase 2 files, so far as it goes, is also (I think) available from the same folder on Box (see /EEBO_phase2/Oxford_P5). But since I received these as a lump from Oxford, and have not checked them, I cannot vouch for either their encoding or their completeness.

pfs.

jamescummings · 2020-08-13T10:19:37Z

Hi @PFSchaffner
I feel duty-bound to try to assist you in this -- we could probably recreate Sebastian's workflow.
I know that @tuurma was also around and I think helped on the script that produces the readme.md files, so mention her in case she has anything to contribute. I've got a dropbox folder with the TEI P5 files as I had them before I left Oxford. I could share that with you if it is helpful.

mimno · 2020-08-13T13:25:17Z

If it's not too much trouble to recreate the workflow, that would be great! P5 + direct download links would make a big difference. I'm happy to contribute some compute time if it's straightforward.


PFSchaffner · 2020-08-25T12:50:25Z

@jamescummings Am I right in thinking that all of Seb's stuff is here: https://github.com/textcreationpartnership/TCPTools ? The feeder files, the ones he got from me, such as lists of character entities, file lists with pubdates (not extractable from the texts themselves), ID numbers, and the like, would have to be refreshed, since they are quite stale.

tuurma · 2020-08-25T13:15:17Z

@PFSchaffner Correct, this is where the scripts to create individual repositories for each text are (it basically generates the README and copies both XML source and readme into the new github repo). Looks like this process could be easily run again on the Phase II bunch.

I don't see how the actual P5 XML files were created, though. You mention conversion from P4/SGML to P5 and external MARC metadata, if I understood correctly?

PFSchaffner · 2020-08-25T13:20:16Z

@tuurma @jamescummings I think those are here: https://github.com/textcreationpartnership/TCPTools/tree/master/SGML with the heart being the three files named ant*.xml. The file headers themselves reference "tcp2tei.xsl", which I haven't spotted yet.

lb42 · 2020-08-28T09:15:36Z

I have been working on the EEBO bibliography as published by ProQuest, putting it into TEI format, and enriching it with TCP identifiers from Paul's "eebodat.sgm" file. It occurs to me that this might also provide a useful way of checking which files are available in which format(s). See further https://foxglove.hypotheses.org

PFSchaffner · 2020-08-28T12:37:23Z

Just a slight update on this, in the interests of tidiness and accuracy (on our part). Driven by Lou's interest, I have downloaded, identified, counted, and am now re-uploading in more logical and consistent form, all the files that we host on Box. (Haven't touched gitHub yet.) I've done the Phase 1 files so far, and am slowly uploading them at very slow upload speeds from home, distinguishing the versions (inaccurately but conveniently) as P3 P4 and P5. All three versions, as hosted, now contain the same documents, and the IDs of same are supplied in a simple "IDs_in_phase1.txt" file. When this is done, I'll move on to Phase 2. One P5 document needed to be changed, and that change needs to be moved upstream back to gitHub. I'm also doing away with the tarballs and am using straight zip (7zip) without any intervening tar.
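For anyone wanting to repeat the "same documents in all three versions" check, here is a minimal sketch. It assumes (hypothetically) that each file is named after its TCP ID, with everything after the first dot being a format suffix; the filenames below are invented for illustration and are not the actual Box listing.

```python
from pathlib import PurePath


def ids_from_filenames(names):
    """Treat the part of each filename before the first dot as the TCP ID (an assumption)."""
    return {PurePath(n).name.split(".")[0] for n in names}


def compare_versions(listings):
    """listings maps a version label to an iterable of filenames.

    Returns, for each version, the sorted IDs that occur in some other
    version but are missing from this one.
    """
    id_sets = {v: ids_from_filenames(names) for v, names in listings.items()}
    all_ids = set().union(*id_sets.values())
    return {v: sorted(all_ids - ids) for v, ids in id_sets.items()}


# Invented filenames, for illustration only.
listings = {
    "P3": ["A00001.sgm", "A00002.sgm"],
    "P4": ["A00001.xml", "A00002.xml"],
    "P5": ["A00001.P5.xml"],
}
missing = compare_versions(listings)
print(missing)
```

An empty list for every version would mean the three sets hold the same documents.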

PFSchaffner · 2020-08-28T12:42:22Z

Also a caveat to Lou's project of reconciling our documentation with that of ProQuest. It may in some cases be impossible. I will see for myself when I get around to this reconciliation myself in a month or two, but in the past there have been several -- many -- cases where (for example) we have discovered an EEBO image set to contain more than one work, in which case we split the text into two files, attaching to each file the information appropriate to the individual work, whereas ProQuest tends to ignore such 'bound-with' situations and treats each image set as indivisible. There are other possible disagreements, notably those arising from changes at ProQuest (re-scanning, re-identifying, de-duping records), but 'bound-withs' are the most common.

lb42 · 2020-08-28T12:44:45Z

Could you give a couple of examples of this distressing phenomenon?

PFSchaffner · 2020-08-28T12:53:12Z

In general, I'm afraid that the first thing I learned about running this show was that it was going to be like transferring people in rough seas from one ship to another: neither platform is stable and the best you can do is connect via a rope pulley once in a while and hope you don't crash into each other. As for bound-withs in particular, they are mostly noted as such in the comments in eebodat (search for "bound[- ]?with" ); or definitively identifiable (I think) by the existence of two entries with the same VID. Nor do I know whether ProQuest might not have responded to some or all of them. But here is one example, as noted:
<E><IDG S="MARC" R="UM" ID="A97377"><STC T="S">4068</STC><STC T="C">S113331</STC><BIBNO T="umi">99848567</BIBNO><VID>13656</VID></IDG><AUTHOR>Bullinger, Heinrich,$d1504-1575.</AUTHOR><ADDAUTHOR>Véron, John,$dd. 1563.</ADDAUTHOR><STIT>A most necessary & frutefull dialogue, betwene [the] seditious libertin or rebel Anabaptist, & the true obedient christia[n] : wherin, as in a mirrour or glasse ye shal se [the] excellencte and worthynesse of a christia[n] magistrate: & again what obedience is due vnto publique rulers of all th[os]e [that] professe Christ yea, though [the] rulers, in externe & outward thinges, to their vtter dampnatyon, do otherwyse then well: translated out of Latyn into Englishe, by Iho[n] Veron Senonoys.</STIT><UTIT>Von dem unverschampten Fräfel der Widertöuffer. English. Selections</UTIT><YR>1551</YR><IMGS>45</IMGS><ADAT V="PDCC">2013-01</ADAT><KB>119</KB><V>PDCC</V><VDAT>2013-01</VDAT><PDAT MURP="lorand">2013-03</PDAT><D>3A</D><RDAT MURP="lorand">2013-03</RDAT><DDAT TCP="E2">2014-03</DDAT><ODAT>2014-10</ODAT></E>

PFSchaffner · 2020-08-28T12:55:08Z

Or this: <E><IDG S="MARC" R="UM" ID="A89915"><STC T="S">4217</STC><STC T="C">S107140</STC><BIBNO T="umi">99842842</BIBNO><VID>4894</VID></IDG><NOTIS>AUB6065</NOTIS><ALEPH>003723689</ALEPH><STIT>An exposition vpon the Epistle to the Colossians Wherein, not onely the text is methodically analysed, and the sence of the words, by the help of writers, both ancient and moderne is explayned: but also, by doctrine and vse, the intent of the holy Ghost is in euery place more fully vnfolded and vrged. ... Being, the substance of neare seuen yeeres weeke-dayes sermons, of N. Byfield, late one of the preachers for the citie of Chester.</STIT><YR>1617</YR><IMGS>256</IMGS><ADAT V="PDCC">2013-08</ADAT><KB>2298</KB><V>PDCC</V><VDAT>2013-09</VDAT><PDAT MURP="pasj">2013-10</PDAT><D>3A</D><RDAT MURP="pasj">2013-10</RDAT><DDAT TCP="E2">2014-03</DDAT><ODAT>2014-10</ODAT></E>

PFSchaffner · 2020-08-28T12:59:25Z

(The term 'bound-with' is cataloguer jargon and refers of course to bound volumes in which more than one work has been bound up together. Some of the EEBO bound-withs may be literal ones -- i.e. the works in question may be in fact bound together in a physical volume, some may be virtual ones -- the break between works not noticed during filming or scanning; and some may be one of those complicated early-print situations in which works are issued together, but with separate title pages, and may also have been issued separately. Most of those are treated bibliographically as single items and we have accepted that unity, but it can get confusing.)
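The "two entries with the same VID" test mentioned above could be sketched roughly as follows. The records are abbreviated stand-ins that keep only the ID, BIBNO and VID values quoted in this thread; the regular expressions assume the element layout of the two examples above and may well not cover every eebodat entry.

```python
import re
from collections import defaultdict

# Abbreviated stand-ins for eebodat <E> records (only the fields used here).
SAMPLE = """
<E><IDG ID="A16900"><BIBNO T="umi">99840393</BIBNO><VID>4894</VID></IDG></E>
<E><IDG ID="A89915"><BIBNO T="umi">99842842</BIBNO><VID>4894</VID></IDG></E>
<E><IDG ID="A97377"><BIBNO T="umi">99848567</BIBNO><VID>13656</VID></IDG></E>
"""

RECORD = re.compile(r"<E>.*?</E>", re.S)
TCP_ID = re.compile(r'<IDG\b[^>]*\bID="([^"]+)"')
VID = re.compile(r"<VID>(\d+)</VID>")


def shared_vids(text):
    """Group TCP IDs by VID and keep only the VIDs used by more than one entry."""
    by_vid = defaultdict(list)
    for rec in RECORD.findall(text):
        id_match = TCP_ID.search(rec)
        vid_match = VID.search(rec)
        if id_match and vid_match:
            by_vid[vid_match.group(1)].append(id_match.group(1))
    return {vid: ids for vid, ids in by_vid.items() if len(ids) > 1}


candidates = shared_vids(SAMPLE)
print(candidates)
```

Any VID that comes back with more than one TCP ID is a bound-with candidate, to be confirmed against the comments in eebodat.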

lb42 · 2020-08-28T13:56:07Z

My merge workflow looks through a file of records like this

<bibl xml:id="A16900" n="99840393" vid="4894" pp="23"/>
<bibl xml:id="A89915" n="99842842" vid="4894" pp="256"/>

extracted from your eebodat file, seeking out items which have the same @n AND the same @vid as the corresponding ProQuest record, and then enriching the latter with the @xml:id as a TCP identifier.

This seems to give the right result for your second case, but not the first. Not sure why.
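A minimal sketch of that matching step, not the actual merge script: the ProQuest side is represented as a list of dicts with hypothetical keys "n" and "vid", the A97377 line is constructed here from the eebodat record quoted earlier, and it assumes each (n, vid) pair maps to at most one TCP ID.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

TCP_BIBLS = """<list>
<bibl xml:id="A89915" n="99842842" vid="4894" pp="256"/>
<bibl xml:id="A97377" n="99848567" vid="13656"/>
</list>"""

# Illustrative ProQuest-side records; the dict layout is an assumption.
PROQUEST = [
    {"n": "99842842", "vid": "4894"},
    {"n": "99848567", "vid": "13670"},
    {"n": "99848567", "vid": "179308"},
]


def merge(tcp_xml, proquest_records):
    """Attach a TCP ID to each ProQuest record whose (n, vid) pair matches.

    Returns the enriched records and the TCP IDs that matched nothing.
    """
    index = {
        (b.get("n"), b.get("vid")): b.get(XML_ID)
        for b in ET.fromstring(tcp_xml).iter("bibl")
    }
    enriched, matched = [], set()
    for rec in proquest_records:
        tcp_id = index.get((rec["n"], rec["vid"]))
        enriched.append({**rec, "tcp": tcp_id})
        if tcp_id:
            matched.add(tcp_id)
    unmatched = sorted(set(index.values()) - matched)
    return enriched, unmatched


enriched, unmatched = merge(TCP_BIBLS, PROQUEST)
print(unmatched)
```

Reporting the unmatched TCP IDs explicitly, rather than silently dropping them, makes it easier to see which entries fail to find a partner.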

lb42 · 2020-08-28T14:20:34Z

I don't find any Proquest record for tcp:A97377. This is because the Proquest catalogue only has two records for eebo:99848567 (one for vid 13670 and one for vid 179308). There's no record for the same eebo id associated with vid 13656. Nor do I see how I could reconstruct one from your data...

lb42 · 2020-08-28T14:33:46Z

The merged catalogue currently has 143734 entries; there are 144528 entries in the file I extract from eebodat. So either I am failing to find 794 entries, or your data has lots of ghosts. Bother.

PFSchaffner · 2020-08-28T15:59:57Z

In reply to "There's no record for the same eebo id
associated with vid 13656."

Of course there's not. Imagine if you will that
you're faced with sorting out a cluttered attic.
ProQuest, confronted by 150,000 boxes, approaches
them with a bunch of pre-printed labels, "RIBBONS"
"HATS" "PENCILS" etc., but can only put one label
on each box. You might have two boxes of pencils,
(and each one can get a 'pencils' label), but the
only way to label a box containing both pencils
and hats is to concoct a hybrid label "pencils/hats".

(There are analogies here to the @calendar issue!)

In the original (Chadwyck) EEBO architecture, this
could be represented as labels with dependent boxes:

PENCILS

Box 1
Box 4

HATS

Box 2
Box 3
Box 5

In the new architecture, they have gotten rid
of the hierarchy and created label-box pairs, thus

PENCILS-Box 1
PENCILS-Box 4
HATS-Box 2
HATS-Box 3
HATS-Box 5

But then we come along, and actually start rummaging
through the boxes and find that Box 5 doesn't contain
just hats, but pencils too. So we do this:

PENCILS

Box 1
Box 4
*Box 5

HATS

Box 2
Box 3
Box 5

Which will turn into this

PENCILS-Box 1
PENCILS-Box 4
*PENCILS-Box 5
HATS-Box 2
HATS-Box 3
HATS-Box 5

Those * combinations are possible to us, because we
are oriented around the contents (pencils, hats),
and we are free to say that this batch of pencils is
found in box 5, and this batch of hats is also found
in box 5 (same vid, different bibnos).

But ProQuest doesn't have that ability.
They allow only one label per box, one bibno per vid. Most of the time
they probably don't even know what's in the boxes (they
haven't rummaged through them as we have), but if they do
discover the truth they have only three options: split the box
into two boxes (Box 5a contains pencils, Box 5b contains
hats -- in which case the VIDs will change), or create a
new composite label ("PENCILS_HATS", in which case the bibno will
change), or ignore the pencils in the box, and leave well
enough alone (in which case our *-marked combinations will
be correct and valid but will match nothing in their system). I don't
know which option they've chosen, but in the past it has mostly
been option 3.
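The attic analogy can be put as a small sketch, treating each label-box pair as a (bibno, vid) tuple. The labels and box names are the made-up ones from the analogy, not real identifiers, and the "one label per box" rule is simply the constraint as described above.

```python
# label ~ bibno, box ~ vid (names taken from the analogy, not real data)
proquest_pairs = {
    ("PENCILS", "Box 1"),
    ("PENCILS", "Box 4"),
    ("HATS", "Box 2"),
    ("HATS", "Box 3"),
    ("HATS", "Box 5"),
}

# TCP adds the pair discovered by rummaging: pencils are also in Box 5.
tcp_pairs = proquest_pairs | {("PENCILS", "Box 5")}


def one_label_per_box(pairs):
    """True if no box (vid) carries more than one label (bibno)."""
    boxes = [box for _, box in pairs]
    return len(boxes) == len(set(boxes))


# Pairs that are valid for TCP but match nothing on the ProQuest side.
orphans = tcp_pairs - proquest_pairs

print(one_label_per_box(proquest_pairs), one_label_per_box(tcp_pairs), orphans)
```

Under option 3, the orphan pairs are exactly the ones that a match on bibno plus vid will never find.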

PFSchaffner · 2020-08-28T16:06:48Z

Now edited to supply context.
