Phase II? #8

Open
mimno opened this issue Aug 13, 2020 · 18 comments

Comments

@mimno

mimno commented Aug 13, 2020

According to https://textcreationpartnership.org/tcp-texts/eebo-tcp-early-english-books-online/, all the Phase II texts should now be freely available as of 1 Aug 2020. Will this repo be updated with the 28k that are currently listed as "Restricted"? It's a really convenient way to distribute raw files!

@PFSchaffner

PFSchaffner commented Aug 13, 2020

If I can figure out how, yes. It is a bit complicated. The files on GitHub are the TEI P5 files created by Sebastian Rahtz at Oxford, and they have fallen a bit behind the release schedule: there are 35,000 EEBO phase 2 files released, not all of which were available to be converted by Sebastian. So we need either to reconstruct his process or find another that achieves a similar end. Moreover, the underlying metadata found in the file headers needs to be re-generated for ALL the files, based on new MARC records released by ProQuest, and incorporating links to the new EEBO platform. In the meantime, all released EEBO TCP files can be downloaded (albeit not as comfortably for some as from GitHub) from Box, in their original more-or-less TEI P4 form (as well as in their raw SGML form, the form in which they were created). All of those (as you probably already know) are freely available here: https://umich.app.box.com/s/f3mphvepm20akwloqna2

The P5 version of the Phase 2 files, so far as it goes, is also (I think) available from the same folder on Box (see /EEBO_phase2/Oxford_P5). But since I received these as a lump from Oxford, and have not checked them, I cannot vouch for either their encoding or their completeness.

pfs.

@jamescummings
Member

Hi @PFSchaffner
I feel duty-bound to try to assist you in this -- we could probably recreate Sebastian's workflow.
I know that @tuurma was also around and I think helped on the script that produces the readme.md files, so I mention her in case she has anything to contribute. I've got a Dropbox folder with the TEI P5 files as I had them before I left Oxford. I could share that with you if it is helpful.

@mimno
Author

mimno commented Aug 13, 2020 via email

@PFSchaffner

@jamescummings Am I right in thinking that all of Seb's stuff is here: https://github.com/textcreationpartnership/TCPTools ? The feeder files, the ones he got from me, such as lists of character entities, file lists with pubdates (not extractable from the texts themselves), ID numbers, and the like, would have to be refreshed, since they are quite stale.

@tuurma

tuurma commented Aug 25, 2020

@PFSchaffner Correct, this is where the scripts to create individual repositories for each text are (it basically generates the README and copies both XML source and readme into the new github repo). Looks like this process could be easily run again on the Phase II bunch.

I don't see how the actual P5 XML files have been created though? You do mention conversion from P4/SGML to P5 and external MARC metadata, if I understood correctly?

@PFSchaffner

PFSchaffner commented Aug 25, 2020

@tuurma @jamescummings I think those are here: https://github.com/textcreationpartnership/TCPTools/tree/master/SGML with the heart being the three files named ant*.xml. The file headers themselves reference "tcp2tei.xsl", which I haven't spotted yet.

@lb42

lb42 commented Aug 28, 2020

I have been working on the EEBO bibliography as published by ProQuest, putting it into TEI format, and enriching it with TCP identifiers from Paul's "eebodat.sgm" file. This might also provide a useful way of checking which files are available in which format(s), it occurs to me. See further https://foxglove.hypotheses.org

@PFSchaffner

PFSchaffner commented Aug 28, 2020

Just a slight update on this, in the interests of tidiness and accuracy (on our part). Driven by Lou's interest, I have downloaded, identified, counted, and am now re-uploading in more logical and consistent form, all the files that we host on Box. (Haven't touched GitHub yet.) I've done the Phase 1 files so far, and am uploading them, slowly, over a home connection, distinguishing the versions (inaccurately but conveniently) as P3, P4, and P5. All three versions, as hosted, now contain the same documents, and their IDs are supplied in a simple "IDs_in_phase1.txt" file. When this is done, I'll move on to Phase 2. One P5 document needed to be changed, and that change needs to be moved upstream back to GitHub. I'm also doing away with the tarballs and am using straight zip (7zip) without any intervening tar.

@PFSchaffner

PFSchaffner commented Aug 28, 2020

Also a caveat to Lou's project of reconciling our documentation with that of ProQuest. It may in some cases be impossible. I will see for myself when I get around to this reconciliation in a month or two, but in the past there have been several -- many -- cases where (for example) we have discovered an EEBO image set to contain more than one work, in which case we split the text into two files, attaching to each file the information appropriate to the individual work, whereas ProQuest tends to ignore such 'bound-with' situations and treats each image set as indivisible. There are other possible disagreements, notably those arising from changes at ProQuest (re-scanning, re-identifying, de-duping records), but 'bound-withs' are the most common.

@lb42

lb42 commented Aug 28, 2020

Could you give a couple of examples of this distressing phenomenon?

@PFSchaffner

PFSchaffner commented Aug 28, 2020

In general, I'm afraid that the first thing I learned about running this show was that it was going to be like transferring people in rough seas from one ship to another: neither platform is stable and the best you can do is connect via a rope pulley once in a while and hope you don't crash into each other. As for bound-withs in particular, they are mostly noted as such in the comments in eebodat (search for "bound[- ]?with" ); or definitively identifiable (I think) by the existence of two entries with the same VID. Nor do I know whether ProQuest might not have responded to some or all of them. But here is one example, as noted:
<!-- SORT{S004068} --><E><IDG S="MARC" R="UM" ID="A97377"><STC T="S">4068</STC><STC T="C">S113331</STC><BIBNO T="umi">99848567</BIBNO><VID>13656</VID></IDG><AUTHOR>Bullinger, Heinrich,$d1504-1575.</AUTHOR><ADDAUTHOR>V&eacute;ron, John,$dd. 1563.</ADDAUTHOR><STIT>A most necessary &amp; frutefull dialogue, betwene [the] seditious libertin or rebel Anabaptist, &amp; the true obedient christia[n] : wherin, as in a mirrour or glasse ye shal se [the] excellencte and worthynesse of a christia[n] magistrate: &amp; again what obedience is due vnto publique rulers of all th[os]e [that] professe Christ yea, though [the] rulers, in externe &amp; outward thinges, to their vtter dampnatyon, do otherwyse then well: translated out of Latyn into Englishe, by Iho[n] Veron Senonoys.</STIT><UTIT>Von dem unverschampten Fr&auml;fel der Widert&ouml;uffer. English. Selections</UTIT><YR>1551</YR><IMGS>45</IMGS><ADAT V="PDCC">2013-01</ADAT><KB>119</KB><V>PDCC</V><VDAT>2013-01</VDAT><PDAT MURP="lorand">2013-03</PDAT><D>3A</D><RDAT MURP="lorand">2013-03</RDAT><DDAT TCP="E2">2014-03</DDAT><ODAT>2014-10</ODAT></E><!-- bound-with. Images 13-57 of this image set contain a copy of S4068 identical to that found in VID 179308 and 13670, from the same Folger copy. File arrived as S3552.7; split into two files containing S3552.7 and S4068 respectively. 2013-03 pfs -->

@PFSchaffner

Or this: <!-- SORT{S004217} --><E><IDG S="MARC" R="UM" ID="A89915"><STC T="S">4217</STC><STC T="C">S107140</STC><BIBNO T="umi">99842842</BIBNO><VID>4894</VID></IDG><NOTIS>AUB6065</NOTIS><ALEPH>003723689</ALEPH><STIT>An exposition vpon the Epistle to the Colossians Wherein, not onely the text is methodically analysed, and the sence of the words, by the help of writers, both ancient and moderne is explayned: but also, by doctrine and vse, the intent of the holy Ghost is in euery place more fully vnfolded and vrged. ... Being, the substance of neare seuen yeeres weeke-dayes sermons, of N. Byfield, late one of the preachers for the citie of Chester.</STIT><YR>1617</YR><IMGS>256</IMGS><ADAT V="PDCC">2013-08</ADAT><KB>2298</KB><V>PDCC</V><VDAT>2013-09</VDAT><PDAT MURP="pasj">2013-10</PDAT><D>3A</D><RDAT MURP="pasj">2013-10</RDAT><DDAT TCP="E2">2014-03</DDAT><ODAT>2014-10</ODAT></E><!-- bound-with. File split off from A16900 (S3794), which appears on the first 23 images of VID 4894, and this work on the remaining 256 images. 2013-10 pfs --><!--I manually added all the fields following the image count, -pasj, 2013-11-->

@PFSchaffner

(The term 'bound-with' is cataloguer jargon and refers of course to bound volumes in which more than one work has been bound up together. Some of the EEBO bound-withs may be literal ones -- i.e. the works in question may be in fact bound together in a physical volume, some may be virtual ones -- the break between works not noticed during filming or scanning; and some may be one of those complicated early-print situations in which works are issued together, but with separate title pages, and may also have been issued separately. Most of those are treated bibliographically as single items and we have accepted that unity, but it can get confusing.)
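
The two ways of spotting bound-withs mentioned earlier in this thread (a comment matching `bound[- ]?with`, or two entries sharing a VID) could be sketched roughly as below. This is only an illustration, not the way eebodat is actually processed: it assumes each record is an `<E>…</E>` element optionally followed by one or more `<!-- … -->` comments, as in the examples above, and the function name `find_bound_withs` and the heavily abbreviated sample data are made up for the sketch.

```python
import re
from collections import defaultdict

# One <E>...</E> record plus any comments that immediately follow it.
# Comments starting with SORT{ are skipped: they introduce the *next* record.
RECORD_RE = re.compile(
    r"<E>(?P<body>.*?)</E>(?P<comments>(?:\s*<!--(?!\s*SORT\{).*?-->)*)", re.S
)
ID_RE = re.compile(r'<IDG\b[^>]*\bID="([^"]+)"')
VID_RE = re.compile(r"<VID>(\d+)</VID>")
BOUND_WITH_RE = re.compile(r"bound[- ]?with", re.I)


def find_bound_withs(sgml_text):
    """Return (ids flagged by comment, {vid: [ids]} for VIDs shared by >1 entry)."""
    flagged = []
    by_vid = defaultdict(list)
    for m in RECORD_RE.finditer(sgml_text):
        id_match = ID_RE.search(m.group("body"))
        vid_match = VID_RE.search(m.group("body"))
        if not id_match:
            continue  # record without a TCP id: nothing to report
        tcp_id = id_match.group(1)
        if vid_match:
            by_vid[vid_match.group(1)].append(tcp_id)
        if BOUND_WITH_RE.search(m.group("comments")):
            flagged.append(tcp_id)
    shared = {vid: ids for vid, ids in by_vid.items() if len(ids) > 1}
    return flagged, shared


# Abbreviated sample records: almost all fields are omitted for the sketch.
SAMPLE = """
<E><IDG S="MARC" R="UM" ID="A16900"><VID>4894</VID></IDG></E>
<!-- SORT{S004217} --><E><IDG S="MARC" R="UM" ID="A89915"><VID>4894</VID></IDG></E><!-- bound-with. File split off from A16900 -->
<!-- SORT{S004068} --><E><IDG S="MARC" R="UM" ID="A97377"><VID>13656</VID></IDG></E><!-- bound-with. Images 13-57 of this image set -->
"""

flagged, shared = find_bound_withs(SAMPLE)
print(flagged)
print(shared)
```

Note that the two heuristics need not agree: in this sample A97377 is flagged by its comment but shares its VID with no other entry, which is the kind of case that is harder to reconcile.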

@lb42

lb42 commented Aug 28, 2020

My merge workflow looks through a file of records like this

<bibl xml:id="A16900" n="99840393" vid="4894" pp="23"/>
<bibl xml:id="A89915" n="99842842" vid="4894" pp="256"/>

extracted from your eebodat file, seeking out items which have the same @n AND the same @vid as the corresponding ProQuest record, and then enriching the latter with the @xml:id as a TCP identifier.

This seems to give the right result for your second case, but not the first. Not sure why.
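
For readers following along, the matching step described above might look something like this in Python. It is a minimal sketch under stated assumptions, not the actual merge workflow: the `<listBibl>` wrapper, the representation of ProQuest records as plain dicts with `n` and `vid` keys, and the function names `index_tcp_records` and `enrich` are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def index_tcp_records(xml_text):
    """Map (n, vid) -> TCP id for every <bibl/> in the extracted file.

    If two <bibl/> elements share the same (n, vid), the later one wins.
    """
    root = ET.fromstring(xml_text)
    return {(b.get("n"), b.get("vid")): b.get(XML_ID) for b in root.iter("bibl")}


def enrich(proquest_records, tcp_index):
    """Add a 'tcp' key to each record whose (n, vid) pair is in the index.

    Also returns the TCP ids that matched no ProQuest record at all.
    """
    matched_keys = set()
    for rec in proquest_records:
        key = (rec["n"], rec["vid"])
        if key in tcp_index:
            rec["tcp"] = tcp_index[key]
            matched_keys.add(key)
    unmatched = sorted(tcp_index[k] for k in tcp_index.keys() - matched_keys)
    return proquest_records, unmatched


# The wrapper element is an assumption; the two <bibl/> lines are those quoted above.
TCP_XML = """
<listBibl>
  <bibl xml:id="A16900" n="99840393" vid="4894" pp="23"/>
  <bibl xml:id="A89915" n="99842842" vid="4894" pp="256"/>
</listBibl>
"""

# Hypothetical stand-ins for ProQuest catalogue records.
proquest = [
    {"n": "99840393", "vid": "4894"},
    {"n": "99842842", "vid": "4894"},
]

records, unmatched = enrich(proquest, index_tcp_records(TCP_XML))
print(records)
print(unmatched)
```

Keying on the pair rather than on @vid alone is what lets two TCP files that share an image set both find their own record, provided the catalogue has one for each.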

@lb42

lb42 commented Aug 28, 2020

I don't find any ProQuest record for tcp:A97377. This is because the ProQuest catalogue only has two records for eebo:99848567 (one for vid 13670 and one for vid 179308). There's no record for the same eebo id associated with vid 13656. Nor do I see how I could reconstruct one from your data...

@lb42

lb42 commented Aug 28, 2020

The merged catalogue currently has 143734 entries; there are 144528 entries in the file I extract from eebodat. So either I am failing to find 794 entries, or your data has lots of ghosts. Bother.
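
One way to narrow down the 794 (144528 minus 143734) might be to sort the unmatched TCP entries by whether the catalogue knows their eebo id at all. The following is a rough sketch, assuming both sides have been reduced to `(n, vid)` pairs with numeric vid strings; the function name `classify_unmatched` and the wording of the two categories are invented here, and the sample values are only those already quoted in this thread.

```python
from collections import defaultdict


def classify_unmatched(tcp_entries, proquest_pairs):
    """tcp_entries: {tcp_id: (n, vid)}; proquest_pairs: iterable of (n, vid).

    Returns {tcp_id: reason} for every TCP entry with no exact (n, vid) match.
    """
    pq_pairs = set(proquest_pairs)
    vids_by_n = defaultdict(set)
    for n, vid in pq_pairs:
        vids_by_n[n].add(vid)

    report = {}
    for tcp_id, (n, vid) in tcp_entries.items():
        if (n, vid) in pq_pairs:
            continue  # matched, nothing to report
        if n in vids_by_n:
            # The eebo id exists, but only paired with other image sets.
            others = ", ".join(sorted(vids_by_n[n], key=int))
            report[tcp_id] = f"eebo id known, but only with vid(s) {others}"
        else:
            report[tcp_id] = "eebo id not found in ProQuest catalogue"
    return report


tcp_entries = {
    "A89915": ("99842842", "4894"),
    "A97377": ("99848567", "13656"),
}
proquest_pairs = [
    ("99842842", "4894"),
    ("99848567", "13670"),
    ("99848567", "179308"),
]

report = classify_unmatched(tcp_entries, proquest_pairs)
print(report)
```

Entries in the first category are candidates for bound-with splits or re-scanned image sets; entries in the second would need a different explanation.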

@PFSchaffner

PFSchaffner commented Aug 28, 2020

In reply to "There's no record for the same eebo id
associated with vid 13656."

Of course there's not. Imagine if you will that
you're faced with sorting out a cluttered attic.
ProQuest, confronted by 150,000 boxes, approaches
them with a bunch of pre-printed labels, "RIBBONS"
"HATS" "PENCILS" etc., but can only put one label
on each box. You might have two boxes of pencils,
(and each one can get a 'pencils' label), but the
only way to label a box containing both pencils
and hats is to concoct a hybrid label "pencils/hats".

(There are analogies here to the @calendar issue!)

In the original (Chadwyck) EEBO architecture, this
could be represented as labels with dependent boxes:

PENCILS

  • Box 1
  • Box 4

HATS

  • Box 2
  • Box 3
  • Box 5

In the new architecture, they have gotten rid
of the hierarchy and created label-box pairs, thus

  • PENCILS-Box 1
  • PENCILS-Box 4
  • HATS-Box 2
  • HATS-Box 3
  • HATS-Box 5

But then we come along, and actually start rummaging
through the boxes and find that Box 5 doesn't contain
just hats, but pencils too. So we do this:

PENCILS

  • Box 1
  • Box 4
  • *Box 5

HATS

  • Box 2
  • Box 3
  • Box 5

Which will turn into this

  • PENCILS-Box 1
  • PENCILS-Box 4
  • *PENCILS-Box 5
  • HATS-Box 2
  • HATS-Box 3
  • HATS-Box 5

Those * combinations are possible to us, because we
are oriented around the contents (pencils, hats),
and we are free to say that this batch of pencils is
found in box 5, and this batch of hats is also found
in box 5 (same vid, different bibnos).

But ProQuest doesn't have that ability.
They allow only one label per box, one bibno per vid. Most of the time
they probably don't even know what's in the boxes (they
haven't rummaged through them as we have), but if they do
discover the truth they have only three options: split the box
into two boxes (Box 5a contains pencils, Box 5b contains
hats -- in which case the VIDs will change), or create a
new composite label ("PENCILS_HATS", in which case the bibno will
change), or ignore the pencils in the box, and leave well
enough alone (in which case our *-marked combinations will
be correct and valid but will match nothing in their system). I don't
know which option they've chosen, but in the past it has mostly
been option 3.
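
The analogy can be put into a few lines of Python, reading labels as bibnos and boxes as vids. This is a toy sketch of the data shapes only, not a description of either system's actual storage; the function names `flatten` and `multi_label_boxes` are invented for the illustration.

```python
from collections import defaultdict


def flatten(hierarchy):
    """Turn {label: [boxes]} into a list of (label, box) pairs."""
    return [(label, box) for label, boxes in hierarchy.items() for box in boxes]


def multi_label_boxes(pairs):
    """Return {box: [labels]} for boxes carrying more than one label.

    These are the *-marked combinations: valid when organised by contents,
    but not representable under a one-label-per-box rule.
    """
    labels_by_box = defaultdict(list)
    for label, box in pairs:
        labels_by_box[box].append(label)
    return {box: labels for box, labels in labels_by_box.items() if len(labels) > 1}


# The contents-oriented view after rummaging: Box 5 holds pencils as well as hats.
tcp_view = {"PENCILS": [1, 4, 5], "HATS": [2, 3, 5]}

pairs = flatten(tcp_view)
conflicts = multi_label_boxes(pairs)
print(pairs)
print(conflicts)
```

Any box that shows up in `conflicts` is one where a lookup keyed on the box alone can return at most one of the two labels, which is why the extra pair finds no counterpart on the other side.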

@PFSchaffner

Now edited to supply context.
