docs: backup & restore ES #109

Open · wants to merge 2 commits into master
6 changes: 2 additions & 4 deletions .travis.yml
@@ -12,8 +12,6 @@ notifications:

dist: xenial

sudo: false

language: python

matrix:
@@ -26,8 +24,8 @@ cache:
- pip

python:
- "2.7"
- "3.6"
- "3.7"

services:
- redis-server
@@ -84,6 +82,6 @@ deploy:
distributions: "sdist bdist_wheel"
on:
tags: true
python: "2.7"
python: "3.6"
repo: inveniosoftware/invenio-stats
condition: $DEPLOY = true
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -57,7 +57,7 @@

# General information about the project.
project = u'Invenio-Stats'
copyright = u'2017, CERN'
copyright = u'2020, CERN'
author = u'CERN'

# The version info for the project you're documenting, acts as replacement for
1 change: 1 addition & 0 deletions docs/index.rst
@@ -22,6 +22,7 @@ Invenio-Stats.
overview
configuration
usage
operations
examplesapp


121 changes: 121 additions & 0 deletions docs/operations.rst
@@ -0,0 +1,121 @@
..
This file is part of Invenio.
Copyright (C) 2016-2020 CERN.

Invenio is free software; you can redistribute it and/or modify it
under the terms of the MIT License; see LICENSE file for more details.

Operations
==========

Since our only copy of the stats is stored in Elasticsearch indices, a
cluster error or failure would mean losing our stats data. It is therefore
advised to set up a backup/restore mechanism for projects in production.

We have several options when it comes to tooling and methods for preserving
our data in Elasticsearch.

- `elasticdump <https://github.com/taskrabbit/elasticsearch-dump#readme>`_
  is a simple and straightforward tool for moving and saving indices.
- `Elasticsearch Snapshots <https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-restore.html>`_
  is a feature that takes snapshots of our cluster. Snapshots are built
  incrementally, so each new snapshot only stores data that is not already
  contained in a previous one. We can take snapshots of individual indices or
  of the whole cluster.
- `Curator <https://github.com/elastic/curator>`_
  is an advanced Python library from Elastic. You can read more about
  Curator, and how to configure and use it, in the official `Elasticsearch
  documentation <https://www.elastic.co/guide/en/elasticsearch/client/curator/current/index.html>`_.
- Not recommended, but if you want, you can even keep raw filesystem backups
  for each of your Elasticsearch nodes.

Demonstrating all of the aforementioned tools falls outside the scope of this
guide, so we will provide examples only for elasticdump.

.. note::

    To give you an idea of the magnitude of the stats data produced:
    `Zenodo <https://zenodo.org>`_ got approximately **3M** visits in January
    2020 (harvesters and users combined), which produced approximately
    **10GB** of stats data.


Backup with elasticdump
~~~~~~~~~~~~~~~~~~~~~~~

.. note::

    Apart from the data, you will also have to back up the mappings, so that
    you are able to restore the data properly. The following example backs up
    only the stats for record views (not the events); you can go through your
    indices and select which ones make sense to back up.
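
elasticdump is distributed as an npm package; assuming Node.js and npm are
available on the machine, it can typically be installed globally with:

.. code-block:: console

    $ npm install elasticdump -g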


Save our mappings and our index data to the ``record_view_mapping_backup.json``
and ``record_view_index_backup.json`` files respectively:

.. code-block:: console

$ elasticdump \
> --input=http://localhost:9200/stats-record-view-2020-03 \
> --output=record_view_mapping_backup.json \
> --type=mapping

Fri, 13 Mar 2020 13:13:01 GMT | starting dump
Fri, 13 Mar 2020 13:13:01 GMT | got 1 objects from source elasticsearch (offset: 0)
Fri, 13 Mar 2020 13:13:01 GMT | sent 1 objects to destination file, wrote 1
Fri, 13 Mar 2020 13:13:01 GMT | got 0 objects from source elasticsearch (offset: 1)
Fri, 13 Mar 2020 13:13:01 GMT | Total Writes: 1
Fri, 13 Mar 2020 13:13:01 GMT | dump complete

$ elasticdump \
> --input=http://localhost:9200/stats-record-view-2020-03 \
> --output=record_view_index_backup.json \
> --type=data

Fri, 13 Mar 2020 13:13:13 GMT | starting dump
Fri, 13 Mar 2020 13:13:13 GMT | got 5 objects from source elasticsearch (offset: 0)
Fri, 13 Mar 2020 13:13:13 GMT | sent 5 objects to destination file, wrote 5
Fri, 13 Mar 2020 13:13:13 GMT | got 0 objects from source elasticsearch (offset: 5)
Fri, 13 Mar 2020 13:13:13 GMT | Total Writes: 5
Fri, 13 Mar 2020 13:13:13 GMT | dump complete
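
In production you would typically run such dumps on a schedule rather than by
hand. A minimal sketch of a crontab entry (the backup path is hypothetical;
adapt it and the index name to your own instance):

.. code-block:: console

    # illustrative entry: dump the record-view stats index every night at 03:00
    0 3 * * * elasticdump --input=http://localhost:9200/stats-record-view-2020-03 --output=/backups/record_view_index_backup.json --type=data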

In order to test the restore functionality below, we will purposely delete
the index we just backed up from our instance.

.. code-block:: console

$ curl -XDELETE http://localhost:9200/stats-record-view-2020-03
{"acknowledged":true}


Restore with elasticdump
~~~~~~~~~~~~~~~~~~~~~~~~

As we are all aware, a backup has not worked until it has been restored. Note
that before importing our data, we need to import the mappings in order to
re-create the index. The process is identical to the backup, just with the
``--input`` and ``--output`` sources reversed.


.. code-block:: console

$ elasticdump \
> --input=record_view_mapping_backup.json \
> --output=http://localhost:9200/stats-record-view-2020-03 \
> --type=mapping

Fri, 13 Mar 2020 15:22:17 GMT | starting dump
Fri, 13 Mar 2020 15:22:17 GMT | got 1 objects from source file (offset: 0)
Fri, 13 Mar 2020 15:22:17 GMT | sent 1 objects to destination elasticsearch, wrote 4
Fri, 13 Mar 2020 15:22:17 GMT | got 0 objects from source file (offset: 1)
Fri, 13 Mar 2020 15:22:17 GMT | Total Writes: 4
Fri, 13 Mar 2020 15:22:17 GMT | dump complete

$ elasticdump \
> --input=record_view_index_backup.json \
> --output=http://localhost:9200/stats-record-view-2020-03 \
> --type=data

Fri, 13 Mar 2020 15:23:01 GMT | starting dump
Fri, 13 Mar 2020 15:23:01 GMT | got 5 objects from source file (offset: 0)
Fri, 13 Mar 2020 15:23:01 GMT | sent 5 objects to destination elasticsearch, wrote 5
Fri, 13 Mar 2020 15:23:01 GMT | got 0 objects from source file (offset: 5)
Fri, 13 Mar 2020 15:23:01 GMT | Total Writes: 5
Fri, 13 Mar 2020 15:23:01 GMT | dump complete
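
To verify that the restore worked, we can count the documents in the
recreated index; the count should match the number of objects we dumped
(shard counts in the response will vary with your index settings):

.. code-block:: console

    $ curl http://localhost:9200/stats-record-view-2020-03/_count
    {"count":5,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}
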
4 changes: 3 additions & 1 deletion examples/app.py
@@ -127,6 +127,7 @@ def fixtures():

def publish_filedownload(nb_events, user_id, file_key,
file_id, bucket_id, date):
"""Publish file download event."""
current_stats.publish('file-download', [dict(
# When:
timestamp=(
@@ -143,7 +144,7 @@ def publish_filedownload(nb_events, user_id, file_key,

@fixtures.command()
def events():
# Create events
"""Create events."""
nb_days = 20
day = datetime(2016, 12, 1, 0, 0, 0)
max_events = 10
@@ -162,6 +163,7 @@ def events():

@fixtures.command()
def aggregations():
"""Aggregate events."""
aggregate_events(['file-download-agg'])
# flush elasticsearch indices so that the aggregations become searchable
current_search_client.indices.flush(index='*')
8 changes: 4 additions & 4 deletions invenio_stats/__init__.py
@@ -223,14 +223,14 @@ def register_events():
delete or archive old indices.

2. Aggregating
^^^^^^^^^^^^^^
~~~~~~~~~~~~~~

The :py:class:`~invenio_stats.processors.EventsIndexer` processor indexes raw
events. Querying those events can put a big strain on the Elasticsearch
cluster. Thus Invenio-Stats provides a way to *compress* those events by
pre-aggregating them into meaningful statistics.

*Example: individual file downoalds events can be aggregated into the number of
*Example: individual file downloads events can be aggregated into the number of
file download per day and per file.*

Aggregations are registered in the same way as events, under the entrypoint
@@ -270,7 +270,7 @@ def register_aggregations():
]

An aggregator class must be specified. The dictionary ``params``
contains all the arguments given to its construtor. An Aggregator class is
contains all the arguments given to its constructor. An Aggregator class is
just required to have a ``run()`` method.

The default one is :py:class:`~invenio_stats.aggregations.StatAggregator`
@@ -300,7 +300,7 @@ def register_aggregations():
]
}

Again the registering function returns the configuraton for the query:
Again the registering function returns the configuration for the query:

.. code-block:: python

2 changes: 1 addition & 1 deletion requirements-devel.txt
@@ -13,4 +13,4 @@
# -e git+git://github.com/mitsuhiko/jinja2.git#egg=Jinja2

-e git+https://github.com/inveniosoftware/invenio-queues.git#egg=invenio-queues
-e git+https://github.com/inveniosoftware/invenio-search.git#egg=invenio-search
-e git+https://github.com/inveniosoftware/invenio-search.git#egg=invenio-search
3 changes: 1 addition & 2 deletions setup.py
@@ -25,7 +25,6 @@
'invenio-records>=1.0.0',
'invenio-records-ui>=1.0.1',
'isort>=4.2.15',
'mock>=1.3.0',
'pydocstyle>=1.0.0',
'pytest-cov>=1.8.0',
'pytest-pep8>=1.0.6',
@@ -77,7 +76,7 @@
'maxminddb-geolite2>=2017.0404',
'python-dateutil>=2.6.1',
'python-geoip>=1.2',
'Werkzeug>=0.15.0,<1.0.0'
'Werkzeug>=0.15.0, <1.0.0',
]

packages = find_packages()
2 changes: 1 addition & 1 deletion tests/conftest.py
@@ -17,6 +17,7 @@
import uuid
from contextlib import contextmanager
from copy import deepcopy
from unittest.mock import Mock, patch

# imported to make sure that
# login_oauth2_user(valid, oauth) is included
@@ -42,7 +43,6 @@
from invenio_records.api import Record
from invenio_search import InvenioSearch, current_search, current_search_client
from kombu import Exchange
from mock import Mock, patch
from six import BytesIO
from sqlalchemy_utils.functions import create_database, database_exists

3 changes: 1 addition & 2 deletions tests/contrib/test_event_builders.py
@@ -9,8 +9,7 @@
"""Test event builders."""

import datetime

from mock import patch
from unittest.mock import patch

from invenio_stats.contrib.event_builders import file_download_event_builder, \
record_view_event_builder
2 changes: 1 addition & 1 deletion tests/test_aggregations.py
@@ -9,12 +9,12 @@
"""Aggregation tests."""

import datetime
from unittest.mock import patch

import pytest
from conftest import _create_file_download_event
from elasticsearch_dsl import Index, Search
from invenio_search import current_search
from mock import patch

from invenio_stats import current_stats
from invenio_stats.aggregations import StatAggregator, filter_robots
2 changes: 1 addition & 1 deletion tests/test_processors.py
@@ -10,6 +10,7 @@

import logging
from datetime import datetime
from unittest.mock import patch

import pytest
from conftest import _create_file_download_event
@@ -18,7 +19,6 @@
from helpers import get_queue_size
from invenio_queues.proxies import current_queues
from invenio_search import current_search
from mock import patch

from invenio_stats.contrib.event_builders import build_file_unique_id, \
file_download_event_builder
2 changes: 1 addition & 1 deletion tests/test_utils.py
@@ -8,7 +8,7 @@

"""Test utility functions."""

from mock import patch
from unittest.mock import patch

from invenio_stats.utils import get_geoip, get_user, obj_or_import_string
