Skip to content

Commit

Permalink
Update README
Browse files Browse the repository at this point in the history
  • Loading branch information
jayvarner committed Jan 11, 2024
1 parent b7c7a6a commit 961ca14
Showing 1 changed file with 87 additions and 10 deletions.
97 changes: 87 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,23 +1,100 @@
# readux-ingest-ecds
# Readux Ingest ECDS

Django app for Readux ingest specific to ECDS' infrastructure
Django app for Readux ingest specific to ECDS' infrastructure.

1. [Install](#install)
2. [Settings](#settings)
3. [Process](#process)
1. [Local Ingest](#local-ingest)
2. [Bulk Ingest](#bulk-ingest)
3. [Remote Ingest](#remote-ingest)

## Install

bash~~~
pip ...
~~~bash
pip install git+https://github.com/ecds/readux-ingest-ecds@develop
~~~

Add readux_ingest_ecds to the INSTALLED_APPS in config/settings/local.py

~~~python
INSTALLED_APPS += ['readux_ingest_ecds']
~~~

### Make and run migrations
Create and run the migrations.

bash~~~
~~~bash
python manage.py makemigrations readux_ingest_ecds
python manage.py migrate
~~~

## Settings

- IIIF_APPS
- INGEST_TMP_DIR
- INGEST_PROCESSING_DIR
- INGEST_OCR_DIR
**NOTE:** All values are simple strings.
| Setting | Value|
|---------|-------|
| IIIF_MANIFEST_MODEL | Model reference, eg. 'iiif.Manifest' |
| IIIF_IMAGE_SERVER_MODEL | Model reference, eg. 'iiif.ImageServer' |
| IIIF_RELATED_LINK_MODEL | Model reference, eg. 'iiif.RelatedLink' |
| IIIF_CANVAS_MODEL | Model reference, eg. 'iiif.Canvas' |
| IIIF_COLLECTION_MODEL | Model reference, eg. 'iiif.Collection' |
| INGEST_TMP_DIR | Absolute path where files will be temporarily stored. |
| INGEST_PROCESSING_DIR | Absolute path where Lambda will look for images. |
| INGEST_OCR_DIR | Absolute path where OCR files will be preserved. |

## Process

### Local Ingest

A person uploads a zip file with the following internal structure.

~~~bash
.
├──
│ └── metadata.(csv|tsv|xlsx)
│ └── images
│ │ └── **.(tiff|jpg|png|gif|webp)
│ └── ocr
│ │ └── **.(txt|tsv|xml|hocr)
~~~

#### Image Files

The images directory should contain all images sequentially named with numbers. Images can be in any format (other than PDF). Non-pyramidal tiffs will be converted during the ingest process.

#### OCR Files

OCR files file names should match its corresponding image. Readux currently supports hocr, Alto, and tab delimited (tsv).

#### Metadata File

The optional metadata file should be a spreadsheet. CSV is best, but TSV and Excel files are supported. The table below lists the supported column headers.

| Header | Description |
|--------|-------------|
| PID | **UNIQUE** identifier. If it is missing, Readux will assign one. |
| Label | Volume Title, if the title is extremely long, you can abbreviate it and put the rest into the Summary. |
| Summary | All descriptive information, you can use html <br/> to automatically add line breaks into the text. |
| Author | Last name, First name, dates; separate multiple authors by semi-colon. |
| Published city | City from publisher information. |
| Published date | Date of publication. |
| Published date edtf | Date of publication in [extended date time format](https://www.loc.gov/standards/datetime/) for search. Year can be the same (1688 = 1688) but a range changes (1688-1690 = 1688/1690). |
| Publisher | Publisher from publisher information. |
| PDF | Link to a file if available (optional). |
| Scanned by | Usually "Emory Libraries" |
| Identifier | The Library Call Number. |
| Identifier uri | Link to the item in the Library database. |

#### How It Works

When the zip file is uploaded, the metadata file will be read, a new manifest/volume will be created. A background job will start unpacking all the image and OCR files and the person will be redirected to the edit form for the new manifest.

The background job will save teh OCR files and save all the image files in a staging directory. While the image files are being unpacked, each file name is added to a text file. That text file is uploaded to a specific S3 bucket. When the file is saved to the S3 bucket, an AWS Lambda function will convert each file in the list to a ptiff and save it in the image directory for the IIP server.

### Bulk Ingest

Coming soon...

### Remote Ingest

Coming soon...

0 comments on commit 961ca14

Please sign in to comment.