Commit
Showing 1 changed file with 87 additions and 10 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,23 +1,100 @@ | ||
# readux-ingest-ecds | ||
# Readux Ingest ECDS | ||
|
||
Django app for Readux ingest specific to ECDS' infrastructure | ||
Django app for Readux ingest specific to ECDS' infrastructure. | ||
|
||
1. [Install](#install) | ||
2. [Settings](#settings) | ||
3. [Process](#process) | ||
1. [Local Ingest](#local-ingest) | ||
2. [Bulk Ingest](#bulk-ingest) | ||
3. [Remote Ingest](#remote-ingest) | ||
|
||
## Install | ||
|
||
bash~~~ | ||
pip ... | ||
~~~bash | ||
pip install git+https://github.com/ecds/readux-ingest-ecds@develop | ||
~~~ | ||
|
||
Add `readux_ingest_ecds` to `INSTALLED_APPS` in `config/settings/local.py`: | ||
|
||
~~~python | ||
INSTALLED_APPS += ['readux_ingest_ecds'] | ||
~~~ | ||
|
||
### Make and run migrations | ||
Create and run the migrations. | ||
|
||
bash~~~ | ||
~~~bash | ||
python manage.py makemigrations readux_ingest_ecds | ||
python manage.py migrate | ||
~~~ | ||
|
||
## Settings | ||
|
||
- IIIF_APPS | ||
- INGEST_TMP_DIR | ||
- INGEST_PROCESSING_DIR | ||
- INGEST_OCR_DIR | ||
**NOTE:** All values are simple strings. | ||
| Setting | Value| | ||
|---------|-------| | ||
| IIIF_MANIFEST_MODEL | Model reference, e.g. 'iiif.Manifest' | | ||
| IIIF_IMAGE_SERVER_MODEL | Model reference, e.g. 'iiif.ImageServer' | | ||
| IIIF_RELATED_LINK_MODEL | Model reference, e.g. 'iiif.RelatedLink' | | ||
| IIIF_CANVAS_MODEL | Model reference, e.g. 'iiif.Canvas' | | ||
| IIIF_COLLECTION_MODEL | Model reference, e.g. 'iiif.Collection' | | ||
| INGEST_TMP_DIR | Absolute path where files will be temporarily stored. | | ||
| INGEST_PROCESSING_DIR | Absolute path where Lambda will look for images. | | ||
| INGEST_OCR_DIR | Absolute path where OCR files will be preserved. | | ||
|
||
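As an illustration, the settings above might be filled in as follows. The model references are the examples from the table; the three directory paths are hypothetical placeholders, not values taken from the ECDS deployment.

```python
# config/settings/local.py (excerpt)
# Model references use the 'app_label.ModelName' form shown in the table.
IIIF_MANIFEST_MODEL = 'iiif.Manifest'
IIIF_IMAGE_SERVER_MODEL = 'iiif.ImageServer'
IIIF_RELATED_LINK_MODEL = 'iiif.RelatedLink'
IIIF_CANVAS_MODEL = 'iiif.Canvas'
IIIF_COLLECTION_MODEL = 'iiif.Collection'

# Hypothetical absolute paths; replace them with directories on your server.
INGEST_TMP_DIR = '/data/ingest/tmp'
INGEST_PROCESSING_DIR = '/data/ingest/processing'
INGEST_OCR_DIR = '/data/ingest/ocr'
```
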
## Process | ||
|
||
### Local Ingest | ||
|
||
A person uploads a zip file with the following internal structure. | ||
|
||
~~~bash | ||
. | ||
├── | ||
│ └── metadata.(csv|tsv|xlsx) | ||
│ └── images | ||
│ │ └── **.(tiff|jpg|png|gif|webp) | ||
│ └── ocr | ||
│ │ └── **.(txt|tsv|xml|hocr) | ||
~~~ | ||
|
||
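Before uploading, it can help to check that a zip file follows the layout above. The following is a minimal sketch using only the standard library; `classify_members` is a hypothetical helper and not part of this app, and the extension sets simply mirror the tree shown above.

```python
import zipfile
from pathlib import PurePosixPath

# Extensions and metadata names copied from the directory tree above.
IMAGE_EXTS = {'.tiff', '.jpg', '.png', '.gif', '.webp'}
OCR_EXTS = {'.txt', '.tsv', '.xml', '.hocr'}
METADATA_NAMES = {'metadata.csv', 'metadata.tsv', 'metadata.xlsx'}


def classify_members(zip_path):
    """Group the members of an ingest zip by their role (hypothetical helper)."""
    groups = {'metadata': [], 'images': [], 'ocr': [], 'other': []}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith('/'):
                continue  # skip directory entries
            path = PurePosixPath(name)
            suffix = path.suffix.lower()
            if path.name.lower() in METADATA_NAMES:
                groups['metadata'].append(name)
            elif 'images' in path.parts[:-1] and suffix in IMAGE_EXTS:
                groups['images'].append(name)
            elif 'ocr' in path.parts[:-1] and suffix in OCR_EXTS:
                groups['ocr'].append(name)
            else:
                groups['other'].append(name)
    return groups
```

Anything that lands in `other` is worth a second look before uploading, since it does not fit the expected structure.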
#### Image Files | ||
|
||
The `images` directory should contain all images, named sequentially with numbers. Images can be in any format other than PDF. Non-pyramidal TIFFs will be converted during the ingest process. | ||
|
||
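This README does not say how file names are ordered during ingest. If ordering is lexical, unpadded names such as `1.jpg`, `2.jpg`, `10.jpg` would sort as `1.jpg`, `10.jpg`, `2.jpg`, so zero-padding is a safe choice. The sketch below generates padded names; `padded_image_names` is a hypothetical helper.

```python
def padded_image_names(count, extension='.jpg', width=4):
    """Return zero-padded, sequential file names such as 0001.jpg."""
    return [f'{index:0{width}d}{extension}' for index in range(1, count + 1)]


names = padded_image_names(12)
# Zero-padded names sort the same way lexically and numerically.
assert sorted(names) == names
```
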
#### OCR Files | ||
|
||
Each OCR file's name should match the name of its corresponding image. Readux currently supports hOCR, ALTO, and tab-delimited (TSV) files. | ||
|
||
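A quick way to catch mismatched names is to compare base names. This sketch assumes that "match" means the same file name without its extension, which this README does not spell out; `unmatched_ocr` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def unmatched_ocr(image_names, ocr_names):
    """Return OCR file names whose stem has no matching image stem."""
    image_stems = {PurePosixPath(name).stem for name in image_names}
    return sorted(
        name for name in ocr_names
        if PurePosixPath(name).stem not in image_stems
    )
```

For example, `unmatched_ocr(['images/0001.jpg'], ['ocr/0001.hocr', 'ocr/0003.xml'])` reports `ocr/0003.xml` as having no image.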
#### Metadata File | ||
|
||
The optional metadata file should be a spreadsheet. CSV is best, but TSV and Excel files are supported. The table below lists the supported column headers. | ||
|
||
| Header | Description | | ||
|--------|-------------| | ||
| PID | **UNIQUE** identifier. If it is missing, Readux will assign one. | | ||
| Label | Volume title. If the title is extremely long, you can abbreviate it and put the rest into the Summary. | | ||
| Summary | All descriptive information. You can use the HTML `<br/>` tag to add line breaks to the text. | | ||
| Author | Last name, First name, dates. Separate multiple authors with semicolons. | | ||
| Published city | City from publisher information. | | ||
| Published date | Date of publication. | | ||
| Published date edtf | Date of publication in [extended date time format](https://www.loc.gov/standards/datetime/) for search. A single year is unchanged (1688 stays 1688), but a range is written with a slash (1688-1690 becomes 1688/1690). | | ||
| Publisher | Publisher from publisher information. | | ||
| PDF | Link to a file if available (optional). | | ||
| Scanned by | Usually "Emory Libraries". | | ||
| Identifier | The Library Call Number. | | ||
| Identifier uri | Link to the item in the Library database. | | ||
|
||
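A metadata file with these headers can be produced with Python's `csv` module. This is a minimal sketch: the volume values are invented for illustration, `build_metadata_csv` is a hypothetical helper, and the header spelling is copied from the table above.

```python
import csv
import io

# Column headers as listed in the table above.
HEADERS = [
    'PID', 'Label', 'Summary', 'Author', 'Published city', 'Published date',
    'Published date edtf', 'Publisher', 'PDF', 'Scanned by', 'Identifier',
    'Identifier uri',
]


def build_metadata_csv(rows):
    """Serialize a list of dicts to CSV text using the supported headers."""
    buffer = io.StringIO()
    # Columns missing from a row are written as empty strings.
    writer = csv.DictWriter(buffer, fieldnames=HEADERS, restval='')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical volume; only some columns are filled in.
example = build_metadata_csv([{
    'Label': 'An Example Volume',
    'Author': 'Doe, Jane, 1650-1700; Roe, Richard',
    'Published date': '1688-1690',
    'Published date edtf': '1688/1690',
    'Scanned by': 'Emory Libraries',
}])
```

Saving `example` as `metadata.csv` at the top level of the zip gives a file in the expected shape. The PID column is left empty here, in which case Readux assigns one.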
#### How It Works | ||
|
||
When the zip file is uploaded, the metadata file is read and a new manifest/volume is created. A background job then starts unpacking all the image and OCR files, and the person is redirected to the edit form for the new manifest. | ||
|
||
The background job saves the OCR files and saves all the image files in a staging directory. While the image files are being unpacked, each file name is added to a text file. That text file is then uploaded to a specific S3 bucket. When the file is saved to the S3 bucket, an AWS Lambda function converts each file in the list to a pyramidal TIFF (ptiff) and saves it in the image directory for the IIP server. | ||
|
||
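The staging step described above can be pictured roughly as follows. This is a simplified sketch and not the app's actual implementation: `stage_images`, the `images.txt` file name, and the extension filter are assumptions, and the S3 upload and Lambda conversion are left out.

```python
import zipfile
from pathlib import Path, PurePosixPath

# Image extensions taken from the Local Ingest directory tree.
IMAGE_EXTS = {'.tiff', '.jpg', '.png', '.gif', '.webp'}


def stage_images(zip_path, staging_dir, list_name='images.txt'):
    """Unpack images into staging_dir and record each file name in a list file."""
    staging = Path(staging_dir)
    staging.mkdir(parents=True, exist_ok=True)
    staged = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in sorted(archive.namelist()):
            path = PurePosixPath(member)
            if member.endswith('/') or path.suffix.lower() not in IMAGE_EXTS:
                continue  # skip directories and non-image files
            # Write by base name only, so archive paths cannot escape staging.
            (staging / path.name).write_bytes(archive.read(member))
            staged.append(path.name)
    # The list file is what would then be uploaded to the S3 bucket.
    list_file = staging / list_name
    list_file.write_text('\n'.join(staged) + '\n')
    return list_file
```

In the flow described above, uploading the returned list file to S3 is what triggers the Lambda conversion.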
### Bulk Ingest | ||
|
||
Coming soon... | ||
|
||
### Remote Ingest | ||
|
||
Coming soon... |