edx-analytics-pipeline

The Hadoop-based data pipeline.

Installation instructions

We have some draft docs:

https://openedx.atlassian.net/wiki/display/OpenOPS/edX+Analytics+Installation describes how to install a small scale analytics stack on a single machine. It'll get merged into the next doc below at some point. Help welcome!
http://edx.readthedocs.org/projects/edx-installing-configuring-and-running/en/latest/analytics/install_analytics.html has an overview of a larger scale AWS install.

Requirements

Your machine will need the following in order to run the code in this repository:

Python 2.7.x
GCC (to compile numpy)
MySQL
GnuPG 1.4.x

All of the components above can be installed with your preferred package manager (e.g. apt, yum, brew.

The requirements in requirements/default.txt and requirements/test.txt can be installed with pip (via make):

make requirements

Known Issues on Mac OS X

If you are running the code on Mac OS X, you may encounter a couple issues when installing numpy. If pip complains about being unable to compile Fortran, ensure that you have GCC installed. The easiest way to install GCC is using Homebrew: brew install gcc. If after installing GCC you see an error along the lines of cannot link a simple C program, execute the following command to trigger the compiler to not throw an error when it encounters unused command arguments:

export ARCHFLAGS=-Wno-error=unused-command-line-argument-hard-error-in-future

Note: If you need to frequently re-install/upgrade requirements, you may find it convenient to add the export statements above to your .bashrc or .bash_profile file so that the statement is run whenever you open a new shell.

Wheel

Wheel can help cut down the time to install requirements. The Makefile is setup to use the environment variables WHEEL_URL and WHEEL_PYVER to find the Wheel server. You can set these variables using the commands below. If you want to set these variables every time you open a shell, add them to your .bashrc or .bash_profile files.

export WHEEL_PYVER=2.7
export WHEEL_URL=http://edx-wheelhouse.s3-website-us-east-1.amazonaws.com/<OPERATING SYSTEM>/<OS VARIANT>

Values for <OPERATING SYSTEM>/<OS VARIANT>:

Ubuntu/precise
MacOSX/lion

Running the Tests

Unit tests

Run make test to install the Python requirements and run the unit tests.

Some of the tests rely on AWS. If you encounter errors such as NoAuthHandlerFound: No handler was ready to authenticate. 1 handlers were checked. ['HmacAuthV1Handler'] Check your credentials, you need to set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. The values do not need to be valid credentials for the tests to pass, so the commands below should fix the failures.

export AWS_ACCESS_KEY_ID='AK123'
export AWS_SECRET_ACCESS_KEY='abc123'

Acceptance tests

Preconditions

To be able to run acceptance tests on an instance of the Analytics Devstack, the following conditions must hold:

/var/tmp/acceptance.json is present on your instance of Analytics Devstack. This file contains the default configuration for running acceptance tests.

The pipeline001 user has access to tables storing acceptance test data. You can verify this by doing the following (as vagrant user):

mysql -u root
SHOW GRANTS FOR 'pipeline001'@'localhost';

If your instance is configured correctly, you should get the following output:

+----------------------------------------------------------------------+
| Grants for pipeline001@localhost                                     |
+----------------------------------------------------------------------+
| ...                                                                  |
| GRANT ALL PRIVILEGES ON `acceptance%`.* TO 'pipeline001'@'localhost' |
+----------------------------------------------------------------------+

The pipeline has installed itself on your instance of Analytics Devstack. This happens when you run a pipeline task on the instance as described in this section of the official installation instructions.

If you are running a recent version of the Analytics Devstack (and followed the official installation instructions to set it up), the conditions above should already be satisfied.

Steps

If you haven't already, check out a local copy of this repository into the local directory of your Analytics Devstack installation. Restart the instance and make sure the repository is mounted at /edx/app/analytics_pipeline/analytics_pipeline.
If the code you would like to test lives on a different branch than master, navigate to /edx/app/analytics_pipeline/analytics_pipeline and check out that branch.
Navigate to /var/lib/analytics-tasks/devstack/repo/scripts (this is where the pipeline installs itself by default) and check out the branch that contains the test code you would like to use. This step is only necessary if you modified existing tests or created new tests that you'd like to run. You can skip it if you want to run acceptance tests to verify changes to existing pipeline tasks.
From /var/lib/analytics-tasks/devstack/repo/scripts, run the following command to launch the entire test suite:
```
./run_acceptance.sh --update
```
The --update/-u option ensures that the venv in which the tests will be executed is up-to-date; you can leave it off when running another test.

You can also limit test execution to a certain module, class, or method. To do this, use the --test/-t option:
```
./run_acceptance.sh --test test_course_catalog
./run_acceptance.sh --test test_course_catalog:CourseSubjectsAcceptanceTest
./run_acceptance.sh --test test_course_catalog:CourseSubjectsAcceptanceTest.test_course_subjects
```
The --test/t option allows you to specify multiple test modules/classes/methods at the same time:
```
./run_acceptance.sh -t test_course_catalog -t test_video
```
Note that when running multiple tests, execution will stop after the first error or failure. (This is helpful because acceptance tests can take a long time to finish.)

Note also that some tests depend on the availability of specific services to run successfully. When running these tests on an instance of the Analytics Devstack, they will be skipped with output similar to this:
```
Tests the workflow for the course subjects, end to end. ... SKIP: S3 is not available

----------------------------------------------------------------------
XML: nosetests.xml
----------------------------------------------------------------------
Ran 1 test in 0.001s

OK (SKIP=1)
```

How to Contribute

Contributions are very welcome, but for legal reasons, you must submit a signed individual contributor's agreement before we can accept your contribution. See our CONTRIBUTING file for more information -- it also contains guidelines for how to maintain high code quality, which will make your contribution more likely to be accepted.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

edx-analytics-pipeline

Installation instructions

Requirements

Running the Tests

Unit tests

Acceptance tests

How to Contribute

Files

README.md

Latest commit

History

README.md

File metadata and controls

edx-analytics-pipeline

Installation instructions

Requirements

Running the Tests

Unit tests

Acceptance tests

How to Contribute