This service provides full-document context over a set of PDFs stored in a document store. It responds to needs of the CriticalMAAS program for designing workflows linking machine-reading capabilities with human evaluation and verification, over large sets of internal documents.
The system serves a key role as a PDF document store component of the "CriticalMAAS Data Repository". It also fills a more general capability gap for linking xDD to external workflows that can be relevant outside of CriticalMAAS.
The APIs atop this store provide:
- A standardized data service for accessing PDF document sets for ML pipeline development, and storing document extractions produced by ML processing pipelines.
- Contextual page snippets of PDFs for use in "human-in-the-loop" applications
The system is integrated with xDD capabilities for search, discovery, and filtering (e.g., the xDD snippets API) but can be managed independently (e.g., by USGS as part of the CriticalMAAS program).
Work is ongoing to integrate the system into the CriticalMAAS CDR, beginning by porting the model into the central CriticalMAAS CDR schemas maintained by Jataware.
xDD will not provide or integrate capabilities for accessing full-text documents as part of its core services.
> [!warning] Copyright safety
> This application provides full-text access to PDFs, a right that is often reserved to the original publisher. It is designed for internal use in testing and validation, except in unusual cases where all documents can be verified to be in the public domain.
Attestation that document access follows legal requirements is the responsibility of the maintainer of the PDF document store, not xDD. When legal obligations preclude such access, or are ambiguous, xDD core services should be used without this application.
Access to PDFs is controlled at the document store level. Each document store must contain only PDFs with copyright status appropriate to its level of public access. Examples:
- A public S3 bucket for documents that we know are in the public domain
- A secure S3 bucket with "USGS internal" access for work products associated with CriticalMAAS with more ambiguous copyright (e.g., the geoarchive set, or some subset of that)
API access controls must mirror the access limitations on the underlying document PDFs (except for metadata routes, which expose only basic information). At present, API key authentication is used to limit access to the system.
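As a rough illustration of API key authentication from the client side, the sketch below builds (but does not send) an authenticated request. The base URL, the `Authorization: Bearer` header scheme, and the function name are assumptions made for illustration; the actual header expected by the service is not specified here.

```python
import urllib.request
from typing import Optional

# Hypothetical base URL; substitute the real deployment address.
BASE_URL = "https://example.org/document-store"


def build_request(path: str, api_key: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request, attaching the API key when one is supplied.

    The bearer-token header used here is an assumption, not the
    documented scheme of the service.
    """
    request = urllib.request.Request(BASE_URL + path, method="GET")
    if api_key is not None:
        request.add_header("Authorization", f"Bearer {api_key}")
    return request
```

Passing the result to `urllib.request.urlopen` would send the request. Routes that do not need a key can be called with `api_key=None`.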
- Document store: An unstructured set of PDF documents stored in an S3 bucket
- Metadata database: PostgreSQL database containing PDF links, xDD ID, source document link, page counts and sizes, document extractions, and other metadata
- Metadata API: API providing read and write access to document metadata
- Document API: API providing authenticated access to full document content and page-level document content
- Extractions API: API providing read and write access to the results of external metadata extraction workflows. This API is designed around storing COSMOS figure extractions but is extensible for other extraction types.
A stable ID is assigned to each document upon ingest. This is distinct from the xDD ID, to allow for the upload of documents not contained in xDD.
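The separation between the stable ID and the xDD ID can be sketched as a small record type. This is a minimal sketch, not the actual schema: the field names are hypothetical, and the use of a UUID for the stable ID is an assumption, since the ID format is not stated above.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DocumentRecord:
    """Illustrative metadata record; field names are hypothetical."""

    pdf_link: str                      # location of the PDF in the document store
    page_count: int
    xdd_id: Optional[str] = None       # absent for documents not contained in xDD
    source_link: Optional[str] = None  # link to the source document, if known
    # Assigned once at ingest; a UUID is assumed here for illustration only.
    stable_id: str = field(default_factory=lambda: str(uuid.uuid4()))


# A document uploaded directly, with no xDD counterpart, still gets a stable ID.
local_doc = DocumentRecord(pdf_link="s3://example-bucket/report.pdf", page_count=12)
```

Because `xdd_id` is optional, lookups that need to work for every document should key on `stable_id`.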
> [!todo] OCR text, ElasticSearch, and other services
> Providing access and search over the OCR text layer of PDFs will require significant backend infrastructure beyond a simple collection of PDFs, but it might be worthwhile to drive certain extraction tasks.
An OpenAPI specification for the Document Store API is maintained as part of the development process. A summary of the endpoints provided is as follows:
GET /documents
: Paginated list of all documents in the store
GET /document/<stable-id>
: Document metadata for an individual document
GET /query
: Search for documents by xDD ID or DOI
POST /documents
: Upload a new document and its metadata
GET /document/<stable-id>/content
: The full-text PDF for the document with the given stable ID
GET /document/<stable-id>/page/<n>?content_type={pdf|webp|svg}
: Page PDF thumbnail
GET /document/<stable-id>/page/<n>/snippet/<x1>,<y1>,<x2>,<y2>?content_type={pdf|webp|svg}
: Page PDF thumbnail with the given snippet highlighted
GET /document/<stable-id>/extractions
: List all extractions associated with a document
POST /document/<stable-id>/extractions
: Add a new extraction with location metadata to a document
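The page and snippet routes above can be addressed with small URL helpers such as the following. This is a sketch under stated assumptions: the helper names, the `webp` default, and the example base URL are illustrative, and the coordinate units of the snippet box are not defined here, so the values are passed through unchanged.

```python
from urllib.parse import quote, urlencode

# Content types named in the endpoint summary.
CONTENT_TYPES = ("pdf", "webp", "svg")


def _build(base_url, path, content_type):
    """Join base URL, path, and the content_type query parameter."""
    if content_type not in CONTENT_TYPES:
        raise ValueError(f"unsupported content_type: {content_type!r}")
    query = urlencode({"content_type": content_type})
    return f"{base_url.rstrip('/')}{path}?{query}"


def page_url(base_url, stable_id, page, content_type="webp"):
    """URL for a single page of a document."""
    doc = quote(str(stable_id), safe="")
    return _build(base_url, f"/document/{doc}/page/{int(page)}", content_type)


def snippet_url(base_url, stable_id, page, bbox, content_type="webp"):
    """URL for a page with the box (x1, y1, x2, y2) highlighted."""
    x1, y1, x2, y2 = bbox
    doc = quote(str(stable_id), safe="")
    path = f"/document/{doc}/page/{int(page)}/snippet/{x1},{y1},{x2},{y2}"
    return _build(base_url, path, content_type)
```

For example, `snippet_url("https://example.org", "abc", 3, (10, 20, 110, 220), "svg")` yields `https://example.org/document/abc/page/3/snippet/10,20,110,220?content_type=svg`.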
- Documents with full-text PDFs available can be listed
- Clients should integrate with xDD's API for search and discovery capabilities
- No document in an xDD set is guaranteed to be part of a full-text document store
- Applications should fail gracefully for documents where full-text PDFs cannot be found.
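One way a client might fail gracefully is to treat a missing PDF as an expected outcome rather than an error. The sketch below assumes the service answers with HTTP 404 when no full-text PDF exists for a document, which is not confirmed above; the function names are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional


def _http_get(url: str) -> bytes:
    """Default fetcher: a plain GET returning the response body."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def fetch_pdf_or_none(
    url: str, fetch: Callable[[str], bytes] = _http_get
) -> Optional[bytes]:
    """Return the PDF bytes, or None when the document has no full text.

    Assumes a missing PDF is signalled by HTTP 404; any other HTTP
    error is re-raised so real failures are not silently hidden.
    """
    try:
        return fetch(url)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise
```

A caller can then fall back to metadata-only display when the result is `None`, for instance showing the xDD search snippet without page context.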
If PDF context is desired for documents not in xDD, the ideal approach will be to integrate with the "bring your own documents" flow currently being tested by xDD (November 2023). New documents would then be added to xDD while their full-text contents are retained in this system, in tandem.
Documents uploaded to the document store but not xDD are currently excluded from other potentially useful services provided by xDD (e.g., OCR processing, ElasticSearch). It remains an open question how many services provided by xDD should be replicated independently in this framework.
> [!todo] A staging area for xDD ingestion?
> If this store contained documents not in xDD, this application could serve as a "staging area" for bringing documents into the system, if the metadata database contained the requisite information (e.g., citations, links to the source) for successful ingestion into xDD.