Phase II? #8
Comments
If I can figure out how, yes. It is a bit complicated. The files on GitHub are the TEI P5 files created by Sebastian Rahtz at Oxford, and they have fallen a bit behind the release schedule: there are 35,000 EEBO phase 2 files released, not all of which were available to be converted by Sebastian. So we need either to reconstruct his process or find another that achieves a similar end. Moreover, the underlying metadata found in the file headers needs to be re-generated for ALL the files, based on new MARC records released by ProQuest, and incorporating links to the new EEBO platform.
In the meantime, all released EEBO TCP files can be downloaded (albeit not as comfortably for some as from gitHub) from Box, in their original more-or-less TEI P4 form (as well as their raw SGML form, that in which they were created). All of those (as you probably already know) are freely available here: https://umich.app.box.com/s/f3mphvepm20akwloqna2
The P5 version of the Phase 2 files, so far as it goes, is also (I think) available from the same folder on Box (see /EEBO_phase2/Oxford_P5). But since I received these as a lump from Oxford, and have not checked them, I cannot vouch for either their encoding or their completeness. pfs. |
Hi @PFSchaffner |
If it's not too much trouble to recreate the workflow, that would be great!
P5 + direct download links would make a big difference.
I'm happy to contribute some compute time if it's straightforward.
…On Thu, Aug 13, 2020 at 6:19 AM James Cummings ***@***.***> wrote:
Hi @PFSchaffner <https://github.com/PFSchaffner>
I feel duty-bound to try to assist you in this -- we could probably
recreate Sebastian's workflow.
I know that @tuurma <https://github.com/tuurma> was also around and I
think helped on the script that produces the readme.md files, so mention
her in case she has anything to contribute. I've got a dropbox folder with
the TEI P5 files as I had them before I left Oxford. I could share that
with you if it is helpful.
|
@jamescummings Am I right in thinking that all of Seb's stuff is here: https://github.com/textcreationpartnership/TCPTools ? The feeder files, the ones he got from me, such as lists of character entities, file lists with pubdates (not extractable from the texts themselves), ID numbers, and the like, would have to be refreshed, since they are quite stale. |
@PFSchaffner Correct, this is where the scripts to create individual repositories for each text are (it basically generates the README and copies both XML source and readme into the new github repo). Looks like this process could be easily run again on the Phase II bunch. I don't see how the actual P5 XML files have been created though? You do mention conversion from P4/SGML to P5 and external MARC metadata if I understood correctly? |
@tuurma @jamescummings I think those are here: https://github.com/textcreationpartnership/TCPTools/tree/master/SGML with the heart being the three files named ant*.xml . The file headers themselves reference "tcp2tei.xsl" which I haven't spotted yet. |
I have been working on the EEBO bibliography as published by ProQuest, putting it into TEI format, and enriching it with TCP identifiers from Paul's "eebodat.sgm" file. This might also provide a useful way of checking which files are available in which format/s, it occurs to me. See further https://foxglove.hypotheses.org |
Just a slight update on this, in the interests of tidiness and accuracy (on our part). Driven by Lou's interest, I have downloaded, identified, counted, and am now re-uploading in more logical and consistent form, all the files that we host on Box. (Haven't touched gitHub yet.) I've done the Phase 1 files so far, and am slowly uploading them at very slow upload speeds from home, distinguishing the versions (inaccurately but conveniently) as P3 P4 and P5. All three versions, as hosted, now contain the same documents, and the IDs of same are supplied in a simple "IDs_in_phase1.txt" file. When this is done, I'll move on to Phase 2. One P5 document needed to be changed, and that change needs to be moved upstream back to gitHub. I'm also doing away with the tarballs and am using straight zip (7zip) without any intervening tar. |
Also a caveat to Lou's project of reconciling our documentation with that of ProQuest. It may in some cases be impossible. I will see for myself when I get around to this reconciliation in a month or two, but in the past there have been several -- many -- cases where (for example) we have discovered an EEBO image set to contain more than one work, in which case we split the text into two files, attaching to each file the information appropriate to the individual work, whereas ProQuest tends to ignore such 'bound-with' situations and treats each image set as indivisible. There are other possible disagreements, notably those arising from changes at ProQuest (re-scanning, re-identifying, de-duping records), but 'bound-withs' are the most common. |
Could you give a couple of examples of this distressing phenomenon? |
In general, I'm afraid that the first thing I learned about running this show was that it was going to be like transferring people in rough seas from one ship to another: neither platform is stable and the best you can do is connect via a rope pulley once in a while and hope you don't crash into each other. As for bound-withs in particular, they are mostly noted as such in the comments in eebodat (search for "bound[- ]?with" ); or definitively identifiable (I think) by the existence of two entries with the same VID. Nor do I know whether ProQuest might not have responded to some or all of them. But here is one example, as noted: |
Or this: |
(The term 'bound-with' is cataloguer jargon and refers of course to bound volumes in which more than one work has been bound up together. Some of the EEBO bound-withs may be literal ones -- i.e. the works in question may be in fact bound together in a physical volume, some may be virtual ones -- the break between works not noticed during filming or scanning; and some may be one of those complicated early-print situations in which works are issued together, but with separate title pages, and may also have been issued separately. Most of those are treated bibliographically as single items and we have accepted that unity, but it can get confusing.) |
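For anyone wanting to try the two checks mentioned above (comments matching "bound[- ]?with", and two entries sharing a VID), here is a minimal Python sketch. It assumes the eebodat records have already been parsed into dictionaries; the keys `id`, `vid` and `comment` are hypothetical names, not the real eebodat field names, and the case-insensitive matching is also an assumption. The sample records are invented.

```python
import re
from collections import defaultdict

# Pattern quoted in the thread; ignoring case is an assumption of this sketch.
BOUND_WITH = re.compile(r"bound[- ]?with", re.IGNORECASE)

def bound_with_candidates(records):
    """Return (ids whose comment mentions bound-with, {vid: ids} for shared VIDs)."""
    flagged = [r["id"] for r in records if BOUND_WITH.search(r.get("comment", ""))]
    by_vid = defaultdict(list)
    for r in records:
        by_vid[r["vid"]].append(r["id"])
    # A VID carried by two or more records suggests one image set split into several texts.
    shared = {vid: ids for vid, ids in by_vid.items() if len(ids) > 1}
    return flagged, shared

# Invented records for illustration only; not real eebodat entries.
sample = [
    {"id": "X1", "vid": "100", "comment": "bound with X2"},
    {"id": "X2", "vid": "100", "comment": ""},
    {"id": "X3", "vid": "200", "comment": "Bound-with item"},
    {"id": "X4", "vid": "300", "comment": ""},
]
flagged, shared = bound_with_candidates(sample)
print(flagged)
print(shared)
```

The two lists need not agree: a comment may flag a bound-with whose partner was never keyed, and a shared VID may have some other cause, so each hit probably still needs checking by hand.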
My merge workflow looks through a file of records like this
extracted from your eebodat file, seeking out items which have the same @n AND the same @vid as the corresponding protext record, and then enriching the latter with the @xml:id as a TCP identifier. This seems to give the right result for your second case, but not the first. Not sure why. |
I don't find any ProQuest record for tcp:A97377. This is because the ProQuest catalogue only has two records for eebo:99848567 (one for vid 13670 and one for vid 179308). There's no record for the same eebo id associated with vid 13656. Nor do I see how I could reconstruct one from your data... |
The merged catalogue currently has 143734 entries; there are 144528 entries in the file I extract from eebodat. So either I am failing to find 794 entries, or your data has lots of ghosts. Bother. |
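To make the matching rule concrete (same @n AND same @vid, then copy the @xml:id across), here is a rough Python sketch, not the actual merge workflow. It assumes both sides have been read into dictionaries; the keys `n`, `vid`, `xml_id` and `tcp_ids` are hypothetical stand-ins for the attributes, and the sample data is invented. It also returns the TCP records that matched nothing, which is the group behind the 144528 - 143734 = 794 gap.

```python
def merge_tcp_ids(catalogue, tcp_records):
    """Add TCP ids to catalogue records sharing (n, vid); return unmatched TCP ids.

    Note: this enriches the catalogue dictionaries in place.
    """
    by_key = {}
    for rec in catalogue:
        by_key.setdefault((rec["n"], rec["vid"]), []).append(rec)

    unmatched = []
    for tcp in tcp_records:
        hits = by_key.get((tcp["n"], tcp["vid"]))
        if not hits:
            # No catalogue record with this (n, vid) pair: report it, do not guess.
            unmatched.append(tcp["xml_id"])
            continue
        for rec in hits:
            rec.setdefault("tcp_ids", []).append(tcp["xml_id"])
    return unmatched

# Invented data for illustration only.
catalogue = [
    {"n": "111", "vid": "9001"},
    {"n": "111", "vid": "9002"},
]
tcp_records = [
    {"xml_id": "T1", "n": "111", "vid": "9001"},
    {"xml_id": "T2", "n": "111", "vid": "9003"},  # same n, but a vid the catalogue lacks
]
unmatched = merge_tcp_ids(catalogue, tcp_records)
print(unmatched)
print(catalogue)
```

Listing the unmatched identifiers, rather than only counting them, would show whether the missing entries are mostly of the same-id-different-vid kind described above or something else.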
In reply to "There's no record for the same eebo id
Of course there's not. Imagine if you will that
(There are analogies here to the
In the original (Chadwyck) EEBO architecture, this
PENCILS
HATS
In the new architecture, they have gotten rid
But then we come along, and actually start rummaging
PENCILS
HATS
Which will turn into this
Those * combinations are possible to us, because we
But ProQuest doesn't have that ability. |
Now edited to supply context. |
According to https://textcreationpartnership.org/tcp-texts/eebo-tcp-early-english-books-online/, all the Phase II texts should now be freely available as of 1 Aug 2020. Will this repo be updated with the 28k that are currently listed as "Restricted"? It's a really convenient way to distribute raw files!