ARXIVNG-1495 Data architecture for the canonical record #25

Merged
merged 59 commits into from
Oct 30, 2019
a661c51
ARXIVNG-2062 initial implementation of serialization
erickpeirson Jul 10, 2019
acb1335
ARXIVNG-2062 some simplification of key generation
erickpeirson Jul 11, 2019
3741d31
ARXIVNG-2062 added deserialization methods
erickpeirson Jul 11, 2019
63f1722
proof-of-concept implementation of lazily loaded content
erickpeirson Jul 11, 2019
1413092
testing config/
erickpeirson Jul 11, 2019
78b823f
fixed a bunch of type annotations
erickpeirson Jul 11, 2019
769c71a
Merge branch 'develop' into task/ARXIVNG-2062
erickpeirson Jul 16, 2019
b4c3a16
minor
erickpeirson Jul 17, 2019
32f6d34
working prototype for integrity checks on listings
erickpeirson Aug 27, 2019
25745c7
swapped base classes for register
erickpeirson Aug 27, 2019
468421b
paring down repeated code, shoring up typing
erickpeirson Aug 28, 2019
f6ad2d5
more cleanup
erickpeirson Aug 28, 2019
1ba0864
shoring up eprint methods
erickpeirson Aug 29, 2019
df0ea35
fleshing out roles, more coherent resource resolution
erickpeirson Sep 16, 2019
cde0025
some reorganization, tests
erickpeirson Oct 3, 2019
bfdf372
more tests
erickpeirson Oct 3, 2019
51f2cbc
updated jsonschema
erickpeirson Oct 4, 2019
1bd4450
added API tests
erickpeirson Oct 4, 2019
dce976c
fixed some tests
erickpeirson Oct 4, 2019
4333ef8
disabled some currently unused tests
erickpeirson Oct 4, 2019
59b4f81
minor
erickpeirson Oct 4, 2019
ab0d4ab
updated abs parsing
erickpeirson Oct 5, 2019
5fa290c
restored tests
erickpeirson Oct 5, 2019
3a39841
cleanup
erickpeirson Oct 5, 2019
cf85a2c
more tests
erickpeirson Oct 5, 2019
d4657f5
updated repository app to use arxiv.canonical api
erickpeirson Oct 8, 2019
fa57a43
cleanup
erickpeirson Oct 13, 2019
6277b9b
cleanup
erickpeirson Oct 13, 2019
c1b5878
add date filter in daily.log parser
erickpeirson Oct 14, 2019
03359b3
initial work on backfill
erickpeirson Oct 15, 2019
6821ff8
more work on backfill
erickpeirson Oct 16, 2019
2bde553
added test file
erickpeirson Oct 16, 2019
68fee0e
added .coveragerc
erickpeirson Oct 16, 2019
5797976
more tests
erickpeirson Oct 16, 2019
e2ef703
added another test; minor fix for backfill
erickpeirson Oct 16, 2019
05e3214
tests, bugfixes, for backfill procedure
erickpeirson Oct 17, 2019
251c524
minor
erickpeirson Oct 17, 2019
ff3c2f7
removed callbacks, since they are not needed
erickpeirson Oct 18, 2019
2f6cbea
initial support for additional formats
erickpeirson Oct 18, 2019
4a4edeb
improved some tests
erickpeirson Oct 18, 2019
ff7def4
added source types
erickpeirson Oct 18, 2019
fde2d05
render is optional
erickpeirson Oct 18, 2019
33a2315
added some tests for source format
erickpeirson Oct 18, 2019
c76eb15
more dissemination type logic
erickpeirson Oct 18, 2019
d3c6ef0
ported over some tests for dissemination format by filename from browse
erickpeirson Oct 18, 2019
8acf406
minor
erickpeirson Oct 18, 2019
47f0e3e
minor
erickpeirson Oct 18, 2019
a3279d6
minor
erickpeirson Oct 18, 2019
ff591e2
minor
erickpeirson Oct 18, 2019
15161a6
bugfixes in backfill
erickpeirson Oct 24, 2019
c2d2a88
work on cli, more bugfixes
erickpeirson Oct 24, 2019
a0456f0
handle gzipped classic files, bugfixes, metadata consistency
erickpeirson Oct 25, 2019
f968adc
working on tests
erickpeirson Oct 28, 2019
27b5a8b
cleanup, docstrings
erickpeirson Oct 28, 2019
5fe2dd4
removed unused lock module
erickpeirson Oct 28, 2019
dc0fa37
cleanup, docstrings
erickpeirson Oct 28, 2019
33871f9
updated docstrings
erickpeirson Oct 28, 2019
c1904c7
updated some docs
erickpeirson Oct 28, 2019
b92f654
added api docs
erickpeirson Oct 30, 2019
6 changes: 6 additions & 0 deletions .coveragerc
@@ -0,0 +1,6 @@
[run]
omit =
    app.py
    *setup.py
    docs/*
    *test*
4 changes: 4 additions & 0 deletions .gitignore
@@ -104,3 +104,7 @@ venv.bak/
.mypy_cache/

.DS_Store
.pytest_cache/
.vscode/
.history/
.backfill/
2 changes: 1 addition & 1 deletion .pylintrc
@@ -50,7 +50,7 @@ confidence=
# --enable=similarities". If you want to run only the classes checker, but have
# no Warning level messages displayed, use"--disable=all --enable=classes
# --disable=W"
disable=blacklisted-name,invalid-name,import-error,print-statement,parameter-unpacking,unpacking-in-except,old-raise-syntax,backtick,long-suffix,old-ne-operator,old-octal-literal,import-star-module-level,parse-error,raw-checker-failed,bad-inline-option,locally-disabled,locally-enabled,file-ignored,suppressed-message,useless-suppression,deprecated-pragma,too-many-return-statements,too-many-arguments,too-many-locals,arguments-differ,signature-differs,unused-import,redefined-builtin,broad-except,apply-builtin,basestring-builtin,buffer-builtin,cmp-builtin,coerce-builtin,execfile-builtin,file-builtin,long-builtin,raw_input-builtin,reduce-builtin,standarderror-builtin,unicode-builtin,xrange-builtin,coerce-method,delslice-method,getslice-method,setslice-method,no-absolute-import,old-division,dict-iter-method,dict-view-method,next-method-called,metaclass-assignment,indexing-exception,raising-string,reload-builtin,oct-method,hex-method,nonzero-method,cmp-method,input-builtin,round-builtin,intern-builtin,unichr-builtin,map-builtin-not-iterating,zip-builtin-not-iterating,range-builtin-not-iterating,filter-builtin-not-iterating,using-cmp-argument,eq-without-hash,div-method,idiv-method,rdiv-method,exception-message-attribute,invalid-str-codec,sys-max-int,bad-python3-import,deprecated-string-function,deprecated-str-translate-call,too-few-public-methods
disable=blacklisted-name,invalid-name,import-error,print-statement,parameter-unpacking,unpacking-in-except,old-raise-syntax,backtick,long-suffix,old-ne-operator,old-octal-literal,import-star-module-level,parse-error,raw-checker-failed,bad-inline-option,locally-disabled,locally-enabled,file-ignored,suppressed-message,useless-suppression,deprecated-pragma,too-many-return-statements,too-many-arguments,too-many-locals,arguments-differ,signature-differs,unused-import,redefined-builtin,broad-except,apply-builtin,basestring-builtin,buffer-builtin,cmp-builtin,coerce-builtin,execfile-builtin,file-builtin,long-builtin,raw_input-builtin,reduce-builtin,standarderror-builtin,unicode-builtin,xrange-builtin,coerce-method,delslice-method,getslice-method,setslice-method,no-absolute-import,old-division,dict-iter-method,dict-view-method,next-method-called,metaclass-assignment,indexing-exception,raising-string,reload-builtin,oct-method,hex-method,nonzero-method,cmp-method,input-builtin,round-builtin,intern-builtin,unichr-builtin,map-builtin-not-iterating,zip-builtin-not-iterating,range-builtin-not-iterating,filter-builtin-not-iterating,using-cmp-argument,eq-without-hash,div-method,idiv-method,rdiv-method,exception-message-attribute,invalid-str-codec,sys-max-int,bad-python3-import,deprecated-string-function,deprecated-str-translate-call,too-few-public-methods,pointless-string-statement

# Enable the message, report, category or checker with the given id(s). You can
# either give multiple identifier separated by comma (,) or put this option
4 changes: 3 additions & 1 deletion .travis.yml
@@ -5,14 +5,16 @@ services:
- docker
os:
- linux
env:
- BOTO_CONFIG=/dev/null
python:
- "3.6"
script:
- pip install -U pip pipenv
- pipenv install --dev
- pipenv run pytest --cov=arxiv --cov=announcement/announcement --cov=repository/repository --cov-report=term-missing arxiv announcement/announcement repository/repository
after_success:
- coveralls
- pipenv run -m coveralls
- "./tests/lint.sh arxiv"
- "./tests/lint.sh announcement/announcement"
- "./tests/lint.sh repository/repository"
8 changes: 7 additions & 1 deletion Pipfile
@@ -13,16 +13,22 @@ pydocstyle = "*"
mypy = "*"
pytest-cov = "*"
arxiv-canonical = {path = "."}
moto = "*"
sphinx = "*"
sphinx-autodoc-typehints = "*"

[packages]
backports-datetime-fromisoformat = "*"
jsonschema = "*"
python-dateutil = "*"
pytz = "*"
typing-extensions = "*"
arxiv-base = "==0.15.8.post1"
arxiv-base = "==0.16.2"
arxiv-auth = "*"
arxiv-canonical = {path = "."}
mypy = "==0.720"
moto = "==1.3.13"
retry = "*"

[requires]
python_version = "3.6"
1,222 changes: 1,053 additions & 169 deletions Pipfile.lock

Large diffs are not rendered by default.

88 changes: 64 additions & 24 deletions README.md
@@ -1,22 +1,22 @@
# arXiv NG Canonical Record

This repository contains a library and applications for working with the core
arXiv canonical record. The canonical record is the authoritative history and
arXiv canonical record. The canonical record is the authoritative history and
state for announced e-prints on the arXiv platform.

Work on this project will proceed in two phases, each corresponding to a major
Work on this project will proceed in two phases, each corresponding to a major
version:

## Version 0: Replication of the Legacy Record to the Canonical Record

The first major objective of this project is to replicate all of the core
announcement events that occur in the legacy system to the cloud-native
The first major objective of this project is to replicate all of the core
announcement events that occur in the legacy system to the cloud-native
canonical record.

- The legacy system emits event notifications via a Kinesis stream for new
- The legacy system emits event notifications via a Kinesis stream for new
e-prints, replacements, cross-listing, withdrawals, and updates.
- An announcement agent (``announcement/`` in this repo)...

- consumes legacy events,
- retrieves metadata, source package, and first-compiled PDF from legacy,
- formats and stores content as part of the canonical record. The canonical
@@ -27,17 +27,57 @@ canonical record.
content, and events available via a RESTful JSON API. This is a Flask
application that will be deployed as a Docker container.

Both the ``announcement/`` and ``repository/`` applications use the
Both the ``announcement/`` and ``repository/`` applications use the
``arxiv.canonical`` package (``arxiv/canonical/`` in this repo) to interact
with the canonical record.

### Implementation notes

- ``arxiv.canonical.classic`` provides a CLI with ``backfill`` and
``backfill_today`` commands.

- ``backfill`` can be used to backfill the NG canonical record from the legacy
record.
- ``backfill_today`` can be used to update the NG canonical record from the
legacy record on a daily basis. This should be run after the announcement
process has completed.
- These commands should be extended with an option to also propagate events
once they are successfully backfilled. See that module docstring for
details.

- ``repository/`` has a minimal integration with the updated
``arxiv.canonical`` package. This could be a guide for a similar service
module in the ``browse`` application.

### TODO

- [ ] Consider removing entirely the ``render`` property throughout
``arxiv.canonical``. In the initial implementation, only a source package
and a single rendered output (e.g. PDF) were stored. In the current
implementation, the source file plus all classic dissemination formats are
preserved. Thus ``render`` is more or less obviated.
- [ ] Some attention to the semantics of exceptions throughout
``arxiv.canonical``. In many places we are still using native Python
exceptions that may not provide the most meaningful information or be used
consistently.
- [ ] Implementation of the daily preservation package. This will involve
implementing supporting structs in ``domain``, ``record``, and ``integrity``
modules, and probably something analogous to the ``register`` API for
constructing and storing the daily preservation package.
- [ ] Integration in arxiv-browse. The recommended approach is to treat browse
as a flavor of the canonical repository (see
https://arxiv.github.io/arxiv-arxitecture/subsystems/announcement.html#primary-repository)
with an HTML interface. See ``repository/`` in this repo for how this
integration could look. The ``RegisterAPI`` may need to be extended to
support some of browse's requirements, such as listing events by week.

## Version 1: Orchestration of the Announcement Process

Once several other dependencies are resolved in the legacy system, this project
will assume primary responsibility for announcing submitted e-prints on a
will assume primary responsibility for announcing submitted e-prints on a
daily basis. This is a bit further down the road.

# Contributing
# Contributing

For a list of things that need doing, please see the issues tracker for this
repository.
@@ -62,7 +102,7 @@ and the corresponding ``wsgi_[xxx].py`` entrypoints.
## AWS services, mocking

It's helpful to use a live API when developing components against AWS services.
We use [Localstack](https://github.com/localstack/localstack) for this
We use [Localstack](https://github.com/localstack/localstack) for this
purpose.

## Contributor guidelines
@@ -101,16 +141,16 @@ The canonical record can be stored on any system that supports a key-binary
data structure, such as a filesystem or an object store. The two core data
structures in the record are:

1. E-prints, comprised of...
1. E-prints, comprised of...

- metadata,
- submitted content,
- metadata,
- submitted content,
- and the first rendering of the PDF.

2. Announcement records, representing a single announcement-related event, such
as a new version, a withdrawal, or a cross-list; these records are:

- organized into daily announcement listing files and
- organized into daily announcement listing files and
- emitted via a notification broker in real time, to trigger updates to
downstream services and data stores.

@@ -119,7 +159,7 @@ An e-print is comprised of (1) a metadata record, (2) a source package,
containing the original content provided by the submitter, and (3) a canonical
rendering of the e-print in PDF format. A manifest is also stored for each
e-print, containing the keys for the resources above and a base-64 encoded MD5
hash of their binary content.
hash of their binary content.
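
As a rough illustration of the checksum described above, the sketch below computes a base-64 encoded MD5 digest and builds a manifest-like mapping from resource keys to checksums. The names ``checksum`` and ``make_manifest`` are hypothetical, and the use of the standard (not URL-safe) base-64 alphabet is an assumption; the actual manifest format in ``arxiv.canonical`` may differ.

```python
import base64
import hashlib
from typing import Dict


def checksum(content: bytes) -> str:
    """Return the base-64 encoded MD5 digest of ``content``.

    Assumption: standard base-64 alphabet, digest taken over the raw bytes.
    """
    return base64.b64encode(hashlib.md5(content).digest()).decode('ascii')


def make_manifest(resources: Dict[str, bytes]) -> Dict[str, str]:
    """Map each resource key to the checksum of its binary content.

    Hypothetical helper for illustration only.
    """
    return {key: checksum(content) for key, content in resources.items()}
```

A consumer could recompute ``checksum`` over a retrieved resource and compare it to the manifest entry to detect corruption.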

The key prefix structure for an e-print record is:

@@ -128,12 +168,12 @@ e-prints/<YYYY>/<MM>/<arXiv ID>/v<version>/
```

Where ``YYYY`` is the year and ``MM`` the month during which the first version
of the e-print was announced.
of the e-print was announced.

Sub-keys are:

- Metadata record: ``<arXiv ID>v<version>.json``
- Source package: ``<arXiv ID>v<version>.tar.gz``
- Source package: ``<arXiv ID>v<version>.tar``
- PDF: ``<arXiv ID>v<version>.pdf``
- Manifest: ``<arXiv ID>v<version>.manifest.json``
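
The key layout above can be sketched as a small helper. This is a minimal illustration, not the package's key-generation code: ``eprint_prefix`` and ``eprint_keys`` are hypothetical names, the ``.tar`` source extension follows the updated sub-key list, and old-style identifiers that contain a slash are not handled.

```python
from typing import Dict


def eprint_prefix(arxiv_id: str, version: int, year: int, month: int) -> str:
    """Build the key prefix for one version of an e-print.

    ``year`` and ``month`` are those in which the first version of the
    e-print was announced.
    """
    return f'e-prints/{year:04d}/{month:02d}/{arxiv_id}/v{version}/'


def eprint_keys(arxiv_id: str, version: int, year: int,
                month: int) -> Dict[str, str]:
    """Return the full keys for the resources stored under the prefix."""
    prefix = eprint_prefix(arxiv_id, version, year, month)
    stem = f'{arxiv_id}v{version}'
    return {
        'metadata': f'{prefix}{stem}.json',
        'source': f'{prefix}{stem}.tar',
        'pdf': f'{prefix}{stem}.pdf',
        'manifest': f'{prefix}{stem}.manifest.json',
    }
```

For example, ``eprint_keys('1901.00123', 2, 2019, 1)['pdf']`` yields ``e-prints/2019/01/1901.00123/v2/1901.00123v2.pdf``.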

@@ -144,7 +184,7 @@ versions of an e-print.
## Announcement listings
The announcement listings commemorate the announcement-related events that
occur on a given day. This includes new e-prints/versions, withdrawals,
cross-lists, etc.
cross-lists, etc.

The key prefix structure for an announcement listing file is:

@@ -155,11 +195,11 @@ announcement/<YYYY>/<MM>/<DD>/
Each daily key prefix may contain one or more sub-keys. Each sub-key ending in
.json is treated as a listing file. This allows for the possibility of
sharded/multi-threaded announcement processes that write separate listing
files, e.g. for specific classification domains.
files, e.g. for specific classification domains.

``YYYY`` is the year, ``MM`` the month, and ``DD`` the day on which the
announcement events encoded therein occurred and on which the subordinate
listing files were generated.
listing files were generated.
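
To make the listing layout concrete, the following sketch builds the daily prefix and selects the sub-keys that count as listing files. ``listing_prefix`` and ``listing_keys`` are hypothetical names, and the example shard names are invented for illustration; the real storage adapter may enumerate keys differently.

```python
from datetime import date
from typing import Iterable, List


def listing_prefix(day: date) -> str:
    """Build the key prefix under which a day's listing files are stored."""
    return f'announcement/{day.year:04d}/{day.month:02d}/{day.day:02d}/'


def listing_keys(day: date, keys: Iterable[str]) -> List[str]:
    """Select the keys under the daily prefix that end in ``.json``.

    Any such key is treated as a listing file, so several writers can each
    contribute their own file for the same day.
    """
    prefix = listing_prefix(day)
    return sorted(key for key in keys
                  if key.startswith(prefix) and key.endswith('.json'))
```

Given keys such as ``announcement/2019/10/30/cs.json`` and ``announcement/2019/10/30/physics.json`` (invented shard names), both would be returned for 30 October 2019, while a non-JSON sub-key under the same prefix would be ignored.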

## Preservation record
The preservation record is a daily digest containing e-print content,
@@ -170,10 +210,10 @@ corresponding tombstones).
```
announcement/<listing>.json
e-prints/<arXiv ID>v<version>/
<arXiv ID>v<version>.json # Metadata record
<arXiv ID>v<version>.tar.gz # Source package
<arXiv ID>v<version>.pdf # First PDF
<arXiv ID>v<version>.manifest.json # Manifest.
<arXiv ID>v<version>.json # Metadata record
<arXiv ID>v<version>.tar # Source package
<arXiv ID>v<version>.pdf # First PDF
<arXiv ID>v<version>.manifest.json # Manifest.
suppress/<arXiv ID>v<version>/tombstone
preservation.manifest.json
```
50 changes: 1 addition & 49 deletions announcement/README.md
@@ -3,55 +3,7 @@
The announcement agent is responsible for adding new e-prints to the canonical
record.

## Version 0 : Clone legacy announcement record to canonical format

In v0 of the announcement agent, the e-print event consumer...

- Processes events from a Kinesis stream (see below).
- Retrieves metadata, content for e-prints from the legacy system.
- Uses ``arxiv.canonical`` to update the canonical record in the cloud.

## Events

The legacy system produces e-print events on a Kinesis stream called
``Announce``. Each message has the structure:

```json
{
"event_type": "...",
"identifier": "...",
"version": "...",
"timestamp": "..."
}
```

``event_type`` may be one of:

| Event type | Description |
|------------|----------------------------------------------------------------|
| ``new`` | An e-print is announced for the first time. |
| ``updated`` | An e-print is updated without producing a new version. |
| ``replaced`` | A new version of an e-print is announced. |
| ``cross-list`` | Cross-list classifications are added for an e-print. |
| ``withdrawn`` | An e-print is withdrawn. This generates a new version. |

``identifier`` is an arXiv identifier without a version affix.

``version`` is a positive integer.

``timestamp`` is an ISO-8601 datetime, localized to UTC.
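
A message with the structure described above could be validated along the following lines. This is only a sketch: ``AnnouncementEvent`` and ``parse_event`` are hypothetical names, not part of the announcement agent. It assumes Python 3.7+ for ``datetime.fromisoformat`` (the project targets 3.6 and lists ``backports-datetime-fromisoformat`` for this purpose), and it treats a timestamp without an offset as UTC.

```python
import json
from datetime import datetime, timezone
from typing import NamedTuple

# Event types listed in the table above.
EVENT_TYPES = {'new', 'updated', 'replaced', 'cross-list', 'withdrawn'}


class AnnouncementEvent(NamedTuple):
    """Hypothetical in-memory form of one message."""

    event_type: str
    identifier: str
    version: int
    timestamp: datetime


def parse_event(raw: str) -> AnnouncementEvent:
    """Parse and validate a JSON-encoded event message."""
    data = json.loads(raw)
    if data['event_type'] not in EVENT_TYPES:
        raise ValueError(f"Unknown event type: {data['event_type']}")
    version = int(data['version'])
    if version < 1:
        raise ValueError('Version must be a positive integer')
    timestamp = datetime.fromisoformat(data['timestamp'])
    if timestamp.tzinfo is None:
        # Assumption: a timestamp without an offset is already in UTC.
        timestamp = timestamp.replace(tzinfo=timezone.utc)
    return AnnouncementEvent(data['event_type'], data['identifier'],
                             version, timestamp)
```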

## Legacy integration

Metadata, PDFs, and source are retrieved from the legacy system via HTTP
request.

- Metadata: ``arxiv.org/docmeta/{IDENTIFIER}v{VERSION}``
- PDF: ``arxiv.org/pdf/{IDENTIFIER}v{VERSION}``
- Source: ``arxiv.org/src/{IDENTIFIER}v{VERSION}``


# Contributing
# Contributing

For a list of things that need doing, please see the issues tracker for this
repository.
9 changes: 5 additions & 4 deletions announcement/announcement/agent/consumer.py
@@ -1,7 +1,7 @@
"""
E-print event consumer.

In v0 of the announcement agent, the e-print event consumer processes
In v0 of the announcement agent, the e-print event consumer processes
notifications about announcement events generated by the legacy system, and
updates its version of the canonical record.

@@ -15,7 +15,7 @@
"event_type": "...",
"identifier": "...",
"version": "...",
"timestamp": "..."
"timestamp": "..."
}
```

@@ -29,7 +29,7 @@
| cross-list | Cross-list classifications are added for an e-print. |
| withdrawn | An e-print is withdrawn. This generates a new version. |

``identifier`` is an arXiv identifier; see :class:`.Identifier`.
``identifier`` is an arXiv identifier; see :class:`.Identifier`.

``version`` is a positive integer.

@@ -42,6 +42,7 @@
(https://github.com/arXiv/arxiv-base/blob/master/arxiv/integration/kinesis/consumer/__init__.py).

"""
from typing import Any

from arxiv.integration.kinesis.consumer import BaseConsumer

@@ -54,7 +55,7 @@
class AnnouncementConsumer(BaseConsumer):
"""Consumes announcement events, and updates the canonical record."""

def __init__(self, *args, **kwargs) -> None:
def __init__(self, *args: Any, **kwargs: Any) -> None:
super(AnnouncementConsumer, self).__init__(*args, **kwargs)
self._metadata_service = LegacyMetadataService.current_session()
self._pdf_service = LegacyPDFService.current_session()
15 changes: 13 additions & 2 deletions announcement/announcement/config.py
@@ -4,10 +4,21 @@
Docstrings are from the `Flask configuration documentation
<http://flask.pocoo.org/docs/0.12/config/>`_.
"""
from typing import Optional
from typing import Any, Optional, Type
import warnings
from os import environ


def _showwarning(message: str,
                 *args: Any,
                 category: Type[Exception] = UserWarning,
                 filename: str = '',
                 lineno: int = -1,
                 **kwargs: Any) -> None:
    print(message)

warnings.showwarning = _showwarning

NAMESPACE = environ.get('NAMESPACE')
"""Namespace in which this service is deployed; to qualify keys for secrets."""

@@ -245,7 +256,7 @@
BASE_SERVER = environ.get('BASE_SERVER', 'arxiv.org')

URLS = [

]
"""
URLs for external services, for use with :func:`flask.url_for`.