Tempus Data Engineer Challenge Project
Two Apache Airflow data pipelines were developed which fetch news data from a News REST API, store the data on the local filesystem, and perform a series of ETL tasks that extract the top headlines; transform them into a CSV tabular structure; and upload the transformations to a given Amazon S3 bucket.
- Python and Virtualenv
  - author's Python and virtualenv versions are `3.6` and `16.0.0` respectively.
- PyPI
  - author's PIP version is `18.1`
- Docker
  - docker versions are `docker 18.06.1-ce` and `docker-compose 1.22.0`
- Register for a free News API key in order to use their News RESTful API service.
- Register for a free Amazon Web Services account. This is required for authenticating to S3 using the Boto Python SDK library.
- Before beginning to use the Boto library, you should set up authentication credentials. Credentials for your AWS account can be found in the IAM Console. You can create a new user or use an existing one. Go to manage access keys and generate a new set of keys. These are required for the project to make S3 bucket-upload requests. For more details on this see the documentation here
- Create two AWS S3 buckets with your account. Name them `tempus-challenge-csv-headlines` and `tempus-bonus-challenge-csv-headlines`. These buckets will hold the final CSV transformations. The project code expects these two buckets to already exist: it does not create them programmatically before uploading, and it will throw errors if they are not found in S3.
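Because the uploads fail when either bucket is missing, it can help to check for them before running the pipelines. The following is a minimal sketch, not part of the project code: the function name `missing_buckets` is hypothetical, and it assumes a client that behaves like `boto3.client("s3")`, of which only the `list_buckets()` call is used.

```python
REQUIRED_BUCKETS = (
    "tempus-challenge-csv-headlines",
    "tempus-bonus-challenge-csv-headlines",
)


def missing_buckets(s3_client, required=REQUIRED_BUCKETS):
    """Return the required bucket names that the account does not own.

    `s3_client` is assumed to behave like `boto3.client("s3")`; only its
    `list_buckets()` method is called.
    """
    response = s3_client.list_buckets()
    existing = {bucket["Name"] for bucket in response.get("Buckets", [])}
    return [name for name in required if name not in existing]


# Typical use (requires boto3 and configured AWS credentials):
#   import boto3
#   print(missing_buckets(boto3.client("s3")))
```

An empty list means both buckets are present in the account whose credentials are configured.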
Optional Prerequisites - for running Integration test with a Fake Amazon S3 server
- RubyGems installation
- Register for a free FakeS3 server license
- Install FakeS3
- Clone a copy of the GitHub repo into your working directory, or download it as a zip file and extract its contents into your working directory.
- Open a command line terminal and navigate to the root of the repo directory.
- Set up a Python virtual environment with the command `virtualenv ENV`, where ENV is the directory path where the virtual environment will be created.
- The application uses environment variables to access the API keys needed for the News API and Amazon S3. These keys are read from an `.env` file and an `.aws` directory respectively, both in the root directory of the repo, which you must create (and place in that directory) before proceeding to the next step. During Docker build-time, these files are copied into the container and made available to the application.
  - In the terminal run the command `export AIRFLOW_GPL_UNIDECODE=yes`; this resolves a dependency issue with the Airflow version used in this project (version 1.10.0). This command needs to be run before `make init` in the next step, so that the environment variable is available to Airflow prior to its installation.
  - An example of an `.env` is shown below; the News API key you obtained after registration is given the environment variable name `NEWS_API_KEY` and its value should be set to the key you obtained.
  - An example of the `.aws` directory, with its `config` and `credentials` files, is shown below.
- Run the command `make init`; this downloads all of the project's dependencies.
  - `make init` installs the Amazon Python (Boto) SDK library. After this step, ensure your AWS account credentials are set up so that the SDK can be used. See here for more details.
- Run the commands `make test` and `make integration-test`; these run all the unit and integration tests, respectively, for the project and ensure they are passing.
- Run the command `make run`; this starts up Docker, reads in the Dockerfile, and configures the container with Airflow to begin running.
  - It takes from a few seconds to about three minutes for the container images to be downloaded and set up. Thereafter, Airflow's scheduler and webserver start up and the user interface and Admin Console become accessible. Open a web browser and navigate to http://localhost:9090 to access the Console.
  - The two data pipelines "tempus_challenge_dag" and "tempus_bonus_challenge_dag" will have been loaded and are visible.
  - In the Console UI (shown below) click on the toggle next to each pipeline name to activate them, and click on the play button icon on the right to start each. The steps are numbered in order.
  - By default, Airflow loads DAGs paused; clicking on the toggle described previously unpauses them. For convenience the pipelines are preconfigured to be unpaused when Airflow starts, and thereafter to run at their prescheduled times of 12AM and 1AM each day (i.e. 1 hour apart). They can be run immediately, however, by clicking on the "Trigger Dag" icon, described previously and shown above in the Airflow UI. Their respective logs can be viewed from their Task Instance Context Menus.
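The setup steps above have the application read the News API key from the `NEWS_API_KEY` environment variable. A minimal sketch of such a lookup is given here; the function name `get_news_api_key` and the error message are assumptions for illustration, not taken from the project code.

```python
import os


def get_news_api_key(environ=None):
    """Return the News API key, or raise if it has not been provided.

    `environ` defaults to the process environment; passing a plain dict
    makes the function easy to exercise in tests.
    """
    environ = os.environ if environ is None else environ
    key = environ.get("NEWS_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "NEWS_API_KEY is not set; add it to the .env file first"
        )
    return key
```

Failing early with a clear message is a design choice of this sketch: a missing key is then reported at startup rather than as an authentication error from the first HTTP request.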
Discusses the breakdown of the project goals into the two pipelines.
The first pipeline, named 'tempus_challenge_dag', is scheduled to run once a day at 12AM and consists of eight tasks (six of which are core tasks). Its structure is shown below:
The pipeline tasks are as follows:
- The first task is an Airflow DummyOperator which does nothing and is used merely to visually indicate the beginning of the pipeline.
- Next, a predefined Airflow PythonOperator calls a Python function to create three datastore folders for the intermediary data of the 'tempus_challenge_dag' that is later downloaded and transformed. The 'news', 'headlines', and 'csv' folders are created under the parent 'tempdata' directory, which is made relative to the Airflow home directory.
- The third task uses an Airflow SimpleHttpOperator to make an HTTP GET request to the News API's 'sources' endpoint with the assigned API key, to fetch all English news sources. A Python callback function defined with this operator handles the returned Response object, storing the JSON news data as a file in the pipeline's 'news' datastore folder.
- The fourth task uses an Airflow FileSensor to detect when the JSON news files have landed in the appropriate directory; this kicks off the subsequent ETL stages of the pipeline.
- The fifth task, Extraction, uses an Airflow PythonOperator which reads from the news sources directory and, for each source in the JSON file, makes a remote API call to get the latest headlines; then, using the JSON and Pandas libraries, it extracts the top headlines, storing the result in the 'headlines' folder.
- In the sixth task, extraction and transformation of the headlines take place: a separate predefined Airflow PythonOperator uses a Python function that reads the top-headlines JSON data from the 'headlines' folder and, using Pandas, converts it into an intermediary DataFrame object which is flattened into CSV. The flattened CSV files are stored in the 'csv' folder. If no news articles were found in the data then no CSV file is created, and the application logs this absence to the Airflow logs.
- The seventh task, the Upload task, uses a custom Airflow PythonOperator, as Airflow does not have an existing Operator for transferring data directly from the local filesystem to Amazon S3. The Operator is built on top of the Amazon Python Boto library, using preexisting credentials, and moves the transformed data from the 'csv' folder to an S3 bucket already set up by the author. Two Amazon S3 buckets were set up by the author to store the flattened CSV files from the 'tempus_challenge_dag' and 'tempus_bonus_challenge_dag' pipelines respectively. Navigating to these bucket links from any web browser returns an XML list of all their contents. By default a `dummy.txt` file is all that exists in the buckets. To view or download any of the files in a bucket, append the name of that document to the end of the aforementioned links.
- The eighth and final task is an Airflow DummyOperator which does nothing and is used merely to signify the end of the pipeline.
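The flattening step described in the sixth task can be sketched as follows. This is an illustration only: the project uses Pandas, whereas this sketch swaps in the standard-library `json` and `csv` modules to stay dependency-free. The function name, the chosen columns, and the assumed response shape (an `articles` list whose items carry a nested `source` object) are assumptions rather than the project's actual code.

```python
import csv
import json
import logging

log = logging.getLogger(__name__)

# Columns kept for each article (an assumed selection). The nested
# "source" object is flattened into the first two columns.
FIELDS = ["source_id", "source_name", "author", "title",
          "description", "url", "publishedAt"]


def flatten_headlines_to_csv(json_path, csv_path):
    """Flatten one top-headlines JSON file into a CSV file.

    Returns True if a CSV was written, False if there were no articles.
    """
    with open(json_path, encoding="utf-8") as handle:
        payload = json.load(handle)

    articles = payload.get("articles") or []
    if not articles:
        # As described above: no CSV is created, only a log entry.
        log.info("No articles in %s; skipping CSV creation", json_path)
        return False

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        for article in articles:
            source = article.get("source") or {}
            row = {
                "source_id": source.get("id"),
                "source_name": source.get("name"),
            }
            # Copy the remaining top-level fields as they are.
            for field in FIELDS[2:]:
                row[field] = article.get(field)
            writer.writerow(row)
    return True
```

Returning a boolean lets the calling task decide whether there is anything to hand to the upload step.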
The second pipeline, named 'tempus_bonus_challenge_dag', is similar to the first but consists of seven tasks. It is scheduled to run once a day at 1AM. Its structure is shown below:
The pipeline tasks are identical to those of the first. The only main difference is in the third task, which calls the News API:
- Four Airflow SimpleHttpOperators are defined which make separate, but parallel, HTTP GET requests to the News API's 'top-headlines' endpoint directly, with the assigned API key and a query for specific keywords: 'Tempus Labs', 'Eric Lefkofsky', 'Cancer', and 'Immunotherapy'. This fetches data on each of these keywords. The Python callback function which handles the returned Response objects stores them as four JSON files in the 'headlines' folder, created in an earlier step, for the 'tempus_bonus_challenge_dag'.
- In its fifth task, named `flatten_to_csv_kw_task`, the extraction and transformation sub-operations take place; this is similar to Pipeline 1's sixth task.
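The four per-keyword requests described above can be illustrated by building one request URL per keyword. This is a sketch under assumptions: it targets the News API v2 `top-headlines` endpoint with its `q` and `apiKey` query parameters, and the helper name `keyword_request_urls` is hypothetical. In the pipeline itself the requests are made by Airflow operators rather than by hand-built URLs.

```python
from urllib.parse import urlencode

# Assumed endpoint: News API v2 top-headlines.
TOP_HEADLINES_URL = "https://newsapi.org/v2/top-headlines"
KEYWORDS = ["Tempus Labs", "Eric Lefkofsky", "Cancer", "Immunotherapy"]


def keyword_request_urls(api_key, keywords=KEYWORDS):
    """Build one top-headlines request URL per keyword."""
    return {
        keyword: TOP_HEADLINES_URL + "?" + urlencode(
            {"q": keyword, "apiKey": api_key}
        )
        for keyword in keywords
    }
```

Keeping one URL per keyword mirrors the one-request-per-keyword design, so each response can be stored as its own JSON file.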
The end transformations are stored in the `csv` datastore folders of the respective pipelines.
For the 'tempus_challenge_dag' pipeline, all the news headlines from all the English sources are flattened and transformed into one single CSV file, and the pipeline execution date is added to the name of the final transformed CSV. It is of the form: `pipeline-execution-date_headlines.csv`
The four keyword queries of the 'tempus_bonus_challenge_dag' ('Tempus Labs', 'Eric Lefkofsky', 'Cancer', 'Immunotherapy') produce four separate CSV files, each representing all the headlines about that particular keyword. The pipeline execution date is added to the names of the final transformed CSVs. The keyword headline files are of the form: `pipeline-execution-date_keyword_headlines.csv`
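The two naming patterns above can be sketched with a small helper. The exact date format and the way keywords with spaces are written into the file name are not stated here, so the ISO date and the lower-case, underscore-separated keyword used below are assumptions, as is the function name.

```python
from datetime import date


def headlines_csv_name(execution_date, keyword=None):
    """Return '<date>_headlines.csv' or '<date>_<keyword>_headlines.csv'."""
    stamp = execution_date.isoformat()  # assumed format: YYYY-MM-DD
    if keyword is None:
        return f"{stamp}_headlines.csv"
    # Assumption: spaces are replaced so the keyword is filename-friendly.
    slug = keyword.strip().lower().replace(" ", "_")
    return f"{stamp}_{slug}_headlines.csv"


print(headlines_csv_name(date(2018, 10, 1)))
# 2018-10-01_headlines.csv
print(headlines_csv_name(date(2018, 10, 1), "Tempus Labs"))
# 2018-10-01_tempus_labs_headlines.csv
```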
- This project's unit and integration tests can be found in the `tests` folder in the root directory.
  - Running `make test` and `make integration-test` from the command line runs all the tests for the associated Python functions used in the project.
- The project uses Flake8 as its Python linter, ensuring code conformance to the Python PEP-8 standards. It is also set up with Travis CI to remotely run all the tests, and with Codecov to report test coverage for the project; these can be further integrated into a Continuous Build/Integration/Delivery pipeline later on if needed.
The unit tests consist of six test suites corresponding to the core tasks in the two data pipelines. They are split into Python files with the prefix `test_xxxxx`, where xxxxx is the name of the kind of functionality being tested.
The tests make use of Pytest for unit testing and test coverage checks, as well as the Python Mocking library and PyFakeFS for simulating I/O dependencies such as functions interacting with the filesystem or making external network calls. The core test suites are:
- TestFileStorage, which runs tests on the task involving creation of the datastore folders and actions on them.
- TestNetworkOperations, which runs tests on the task involving HTTP calls to the News API.
- TestExtractOperations, which runs tests on the task involving extracting headlines from the news data.
- TestTransformOperations, which runs tests on the task involving conversion of the news headlines JSON data into CSV.
- TestUploadOperations, which runs tests on the task involving data transfer of the flattened CSVs to a predefined Amazon S3 bucket.
- TestNewsInfoDTO, which runs tests on NewsInfoDTO, a Data Transfer Object Python class used by many of the other Python classes and module functions for moving information about news data between processing functions.
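To illustrate the mocking approach used for I/O-dependent functions, here is a minimal sketch of a test for a datastore-folder creation function. Both `create_data_stores` and the test are hypothetical stand-ins, not the project's actual code, and the sketch uses `unittest.mock` from the standard library rather than PyFakeFS.

```python
import os
from unittest import mock

DATASTORE_FOLDERS = ("news", "headlines", "csv")


def create_data_stores(base_dir):
    """Create the 'news', 'headlines' and 'csv' folders under tempdata."""
    created = []
    for name in DATASTORE_FOLDERS:
        path = os.path.join(base_dir, "tempdata", name)
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


def test_create_data_stores_makes_three_folders():
    # Patch os.makedirs so the test never touches the real filesystem.
    with mock.patch("os.makedirs") as fake_makedirs:
        paths = create_data_stores("/airflow")

    assert fake_makedirs.call_count == 3
    fake_makedirs.assert_any_call(
        os.path.join("/airflow", "tempdata", "csv"), exist_ok=True
    )
    assert [os.path.basename(p) for p in paths] == list(DATASTORE_FOLDERS)
```

A test written this way is collected by Pytest like any other `test_` function and runs without creating directories on disk.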
Integration tests exercise the combination of components interacting with each other and with external services. For the tasks in the pipelines, this means testing their interaction with the two main external services used: the News API and Amazon S3. Integration tests were written only for the UploadOperation interaction with an external Amazon S3 server, using Moto. Moto, an (embedded) Amazon S3 mock server, is used to simulate the project's CSV-file upload operations (the last main task in each pipeline) interacting with the external Amazon S3 storage service.
The Amazon S3 integration mock tests were done with the Moto library standalone, as well as with a live running FakeS3 server. The test with the FakeS3 server is skipped by default in the test suite. Details on how to run the integration test with the FakeS3 server are described below:
- With FakeS3 already installed and the license key obtained, in the terminal navigate to a directory of your choice and run the following command: `fakes3 -r . -p 4567 --license YOUR_LICENSE_KEY`, replacing `YOUR_LICENSE_KEY` with the obtained key. This command starts the Fake Amazon S3 server.
- In the `tests` directory of the project, open the `test_upload_operations.py` file and remove the `@pytest.mark.skip` line on the `test_upload_csv_to_s3_fakes3_integration_succeeds` test.
- Run `make integration-test` to execute the test case, which invokes live calls to the fake Amazon S3 server.
- To stop the Fake Amazon S3 server, return to the previous terminal and press `Ctrl+C`.
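For a test to talk to the FakeS3 server started above, the S3 client has to be pointed at the local port instead of the real AWS endpoint. The sketch below shows one way to do this with the `endpoint_url` argument of `boto3.client`. The helper names and the placeholder credentials are hypothetical, and whether the local server accepts arbitrary credentials is an assumption, so this may differ from how the project's test is actually configured.

```python
def fakes3_client_kwargs(host="localhost", port=4567):
    """Keyword arguments that point an S3 client at a local FakeS3 server."""
    return {
        "endpoint_url": f"http://{host}:{port}",
        # Placeholder values; assumed to be accepted by the local server.
        "aws_access_key_id": "fake-access-key",
        "aws_secret_access_key": "fake-secret-key",
        "region_name": "us-east-1",
    }


def make_fakes3_client():
    """Create a boto3 S3 client bound to the local server (needs boto3)."""
    import boto3  # imported lazily so the helper above has no dependency

    return boto3.client("s3", **fakes3_client_kwargs())
```

The port `4567` matches the `-p 4567` flag passed to `fakes3` in the steps above.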
- Amazon Python SDK (boto 3) library
- Apache Airflow CLI
- Codecov
- FakeS3
- Flake8 - Python PEP-8 Style Guide Enforcement
- Moto Amazon S3 Mock Server
- News API
- PostgreSQL Python library
- Pyfakefs
- Pytest
- Python Data Analysis library (Pandas)
- Python JSON library
- Python Requests library
- Where to store the data at rest: locally as files or using a database? There are a number of tradeoffs using either, but given the scope of the project I decided to use the local filesystem.
  - Although Airflow has an inter-task mechanism (called XCom) for passing data between tasks, the Airflow documentation and other research on the topic generally recommend against using XComs to transfer large data between tasks (though the news JSON data in this project is small, it could become large if the data returned from the News API grows). Hence the data-at-rest decision was ultimately narrowed down to the filesystem or a database.
  - In a production environment I would rather configure a real database, like PostgreSQL, to serve as a data warehouse for the news data retrieved, as well as set up temporary data-staging areas for the intermediate data created during the ETL tasks.
- For the bonus challenge, experimenting with the News API showed that using all four keywords in the same API request returned 0 hits. Hence, I decided to make four separate API requests, one for each keyword.
- To reduce the number of calls to the News API in the task of DAG pipeline 1 (`tempus_challenge_dag`) that retrieves the source headlines, the list of sources from the previous upstream task can be batched up and fed as a comma-separated string of identifiers to the `sources` parameter of the `top-headlines` endpoint.
  - However, the returned Response objects would be very large and could consist of a mix of headlines from all these news sources, which can be very confusing to parse programmatically (without ample patience for writing more unit tests to extensively validate the behaviors and edge cases).
  - The alternative, which I chose, was to make a separate HTTP call to the News API `top-headlines` endpoint for each news source. While this amounts to more HTTP calls to the endpoint, the returned response objects are easier to understand and parse programmatically.
- Note the security concern of hardcoding the News API key in functions used for the HTTP requests.
  - After doing some research on the topic of 'api key storage and security', I decided, based on reading some discussions online (for example here, here, here and here), to store the key in an environment variable that is injected into the Docker container and then accessed by the Airflow instance and Python at runtime.
  - Airflow has an option of storing keys in a Variable but, based on the Airflow documentation, it doesn't seem to be a very secure approach. It might be worth looking into better ways of API key management and encryption, perhaps using something like Vault or AWS Secrets Manager.
- No S3 bucket link was given in the project requirements, thus I created my own S3 buckets. The project implementation was designed such that anyone could use their own preexisting S3 buckets when running the code locally, as long as their bucket names correspond to the two used for this project: `tempus-challenge-csv-headlines` and `tempus-bonus-challenge-csv-headlines`.
- I added the `pip install --upgrade pip` and `pip install --upgrade setuptools` commands to the Makefile, under `init`, to ensure an up-to-date version of pip is always used when the code is run. In hindsight, though, this could cause build-breaking issues if a new pip release introduces changes that the Python packages used in the project do not support.
- It was observed that in some instances the Amazon Boto library doesn't detect the AWS API keys when they are set from within the Docker container's environment variables. The workaround was to create an `.aws` directory inside the container during Docker build-time and inject the `config` and `credentials` files with the keys. The Dockerfile was modified for this purpose. Due to the obvious security concerns around this approach these two files are never kept in git.
- The Airflow `scheduler` needs to be running alongside the Airflow `webserver` for the scheduled DAG pipelines to run at their predefined times. This was very tricky to set up correctly in the Docker container, as the current command in the webserver section of the docker-compose file starts only the webserver.
  - Attempts were made to add an extra command in the docker-compose file to start the scheduler, by replacing the command 'webserver' with a bash compound command to run 'scheduler and webserver'; however, this broke the container at build-time. One of the issues is that the scheduler and webserver can't run one after the other in the same Docker shell (using the typical bash '&&' or '&' commands), as both run in continuous loops which would prevent the other from starting if they were run in the same shell. Rather, the commands need to be called such that the services run concurrently (in parallel) and in separate shells, within the same container.
  - Various approaches were tried to achieve this, but they all continued to break the container during build-time. The Dockerfile for Airflow itself has an option to run `scheduler` and `webserver` together, but only if Airflow is running in `LocalExecutor` mode. However, even with that option set in the airflow.cfg config file, no visible changes were noticed in the container's Airflow startup process.
  - The remaining options include placing a `docker run` command somewhere in the Dockerfiles, injecting some other command script into the container at build-time, or using a service supervisor and manager such as upstart or systemd that would run within the Docker container.
- The Apache Airflow version in the `requirements.txt` file was changed to `1.10.0` (from the original `1.9.0`) because support for the FileSensor operator, used in one of the pipeline tasks, was only added in `1.10.0`.
  - Installing `1.10.0` throws a RuntimeError:
    `RuntimeError: By default one of Airflow's dependencies installs a GPL dependency (unidecode). To avoid this dependency set SLUGIFY_USES_TEXT_UNIDECODE=yes in your environment when you install or upgrade Airflow. To force installing the GPL version set AIRFLOW_GPL_UNIDECODE.`
    - More details are discussed here, here, here, and here
    - The solution to this error involved setting either `AIRFLOW_GPL_UNIDECODE=yes` or `SLUGIFY_USES_TEXT_UNIDECODE=yes` as one of the environment variables in the Docker config file, creating the variable at build-time so that it would be available to Airflow during its installation.
  - Running version `1.10.0` gives empty logs in the UI (when you click on a task node log in its Task Instance Context Menu). The solution to this (found here) is to change this line in the airflow.cfg file: `task_log_reader = file.task` to `task_log_reader = task`
- The Airflow community-contributed `airflow.contrib.sensors.file_sensor` and `airflow.contrib.hooks.fs_hook` modules were found to be very buggy to use, especially when trying to configure and test them in a DAG task pipeline.
- There are known issues with using the Moto Server in stand-alone server mode when testing a locally created URL endpoint. See here for more details.
- Moto breaks the CI builds (e.g. Travis-CI, Circle-CI) when those CI tools build pull requests for this project within their environments. This is due to a Moto dependency on a google-compute-engine module that is non-existent in the environment and fails to download. More details on this issue with Travis-CI are here and its fix here and here.
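Returning to the scheduler and webserver note earlier in this list: one of the remaining options mentioned there is to inject a command script that starts both services concurrently inside the container. The following is a hypothetical sketch of such a script in Python. It has not been tried against this project's container, and the function names are assumptions. Each command is started as its own child process, so neither long-running loop blocks the other.

```python
import subprocess


def start_all(commands):
    """Start every command as its own child process without waiting."""
    return [subprocess.Popen(command) for command in commands]


def wait_all(processes):
    """Block until every child exits and return their exit codes."""
    return [process.wait() for process in processes]


# Hypothetical container entrypoint usage:
#   procs = start_all([["airflow", "scheduler"], ["airflow", "webserver"]])
#   wait_all(procs)
```

Because `Popen` returns immediately, both services are launched before the script starts waiting, which is the concurrent behaviour that chaining with '&&' cannot provide.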
- A week of learning and experimentation with completely new topics: working with Airflow, RESTful APIs, Docker, the AWS Python Boto library, Python integration testing, using Python test doubles, and applying Python Test-Driven Development in practice.
- A week of coding, testing, and developing the solution.