Reenable advanced observatory features like search #16

Open · wants to merge 38 commits into master

Commits (38)
a314033
Update .gitignore
maxious Jun 5, 2019
251ea52
Merge pull request #2 from govau/alex_branch
maxious Jun 6, 2019
7a3df29
Merge pull request #4 from govau/brendan-branch2
maxious Jun 7, 2019
931edb9
update DAG to pull observatory from git
maxious Jun 11, 2019
83176ed
add blue green deploy and healthcheck for observatory
maxious Jun 11, 2019
8c7f28f
Added functional pie chart
patdbro Jun 6, 2019
ffb5103
Starting to add userjourneys
patdbro Jun 7, 2019
04c23a8
updates on userjourney and pie graphs
patdbro Jun 11, 2019
a551487
pie chart fix
patdbro Jun 11, 2019
59c67c3
Fix up agency data
patdbro Jun 11, 2019
112b51f
rename files
maxious Jun 11, 2019
c69f579
Merge branch 'master' into pdb_branch
maxious Jun 11, 2019
e9d2f71
Merge pull request #3 from govau/pdb_branch
maxious Jun 11, 2019
a7f893c
Add steps to user journeys
maxious Jun 7, 2019
1a82647
center text labels
maxious Jun 8, 2019
a3fc10e
increase size of text area
maxious Jun 11, 2019
212e908
cfignore
maxious Jun 11, 2019
142bc4e
formatting
maxious Jun 11, 2019
ead3ed3
compact view to 10 across
maxious Jun 11, 2019
03ba254
copy data before deploy
maxious Jun 11, 2019
5c3c2fd
.gitignore
maxious Jun 11, 2019
20e4d13
Merge pull request #5 from govau/alex_branch
maxious Jun 11, 2019
74f6cc2
Chart fixes
patdbro Jun 11, 2019
b5d8943
Merge pull request #6 from govau/pie-chart-fix
maxious Jun 11, 2019
22b7732
Changes for Adrians
patdbro Jun 12, 2019
e5ab228
fix html line endings
maxious Jun 12, 2019
009ed71
dagnamit typo
patdbro Jun 12, 2019
24fbc95
Add beginnings of HTML Observatory
maxious Jun 18, 2019
99c57db
iterate sidebar
maxious Jun 19, 2019
15387c2
group by domain
maxious Jun 20, 2019
398172c
highlight all website entries
maxious Jun 20, 2019
edb6a1b
Merge pull request #7 from govau/Adrian-Feedback
maxious Jun 20, 2019
cf1bde4
Merge pull request #8 from govau/html-observatory
maxious Jun 20, 2019
d29d892
Focus on domain level view (#9)
maxious Jun 26, 2019
99420b0
google search console downloader (#11)
maxious Jul 1, 2019
89b1f9a
IE11 fixes (#12)
maxious Jul 4, 2019
130de33
Internal site search downloader (#13)
maxious Jul 5, 2019
8109050
Reenable advanced observatory features like search
maxious Jul 9, 2019
2 changes: 2 additions & 0 deletions .gitattributes
@@ -0,0 +1,2 @@
# R wants HTML files to have UNIX line endings ie. just \n or LF
*.html text eol=lf
17 changes: 14 additions & 3 deletions .gitignore
@@ -144,11 +144,22 @@ vignettes/*.pdf
rsconnect/

/dags/r_scripts/credentials.json
/dags/galileo/*.csv
*.csv
/data
/.idea/
airflow.cfg
*.avro
/shiny/observatory/htpasswd
/tools/warcraider/*.warc
/shiny/observatory/*.csv
htpasswd
.htpasswd
*.warc
*.csv
*.rdata
*.xls
*.xlsx
*.gexf
*.graphml
Staticfile.auth
*.iml
*.ipr
*.iws
15 changes: 15 additions & 0 deletions .vscode/launch.json
@@ -0,0 +1,15 @@
{
// Use IntelliSense to learn about possible attributes.
// Hover to view descriptions of existing attributes.
// For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
"version": "0.2.0",
"configurations": [
{
"name": "Python: Current File",
"type": "python",
"request": "launch",
"program": "${file}",
"console": "integratedTerminal"
}
]
}
6 changes: 6 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,6 @@
{
"python.linting.pylintEnabled": false,
"python.linting.pep8Enabled": false,
"python.linting.enabled": true,
"python.linting.flake8Enabled": true
}
130 changes: 130 additions & 0 deletions AIRFLOW101.md
@@ -0,0 +1,130 @@
# Airflow 101

## Our Environment

Our analytics pipeline runs on open-source [Apache Airflow](http://airflow.apache.org/tutorial.html), which is written in Python. This means we can deploy it to other clouds or in-house if we need to.

We have some special configuration:
- the latest beta version, composer-1.6.1-airflow-1.10.1
- Python 3
- the google-api-python-client, tablib, python-igraph and plotly PyPI packages preinstalled
- slack_default and bigquery_default connections
- a custom Docker image that includes the Google Cloud SDK and R with tidyverse/ggplot2
- SendGrid enabled so warning and other messages can be sent.

## Getting Started
To access "Cloud Composer" (Google's managed Airflow service), visit https://console.cloud.google.com/composer/environments. From this page you can reach the Airflow webserver and the DAGs folder.

Read https://cloud.google.com/composer/docs/ for more information.

## How to write a new workflow

### DAGs
Each pipeline is defined in a DAG file. (A Directed Acyclic Graph is a graph describing a process that only moves forward, step by step, with no infinite recursion or "cycles".)
DAG files are technically Python code but use some special keywords and operators to describe data processes. Each pipeline can have a schedule and an SLA (maximum expected run time).

DAG files are made up of Tasks that run Operators and can draw from Connections (via Hooks) and Variables. Definitions @ http://airflow.apache.org/concepts.html
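
As a rough sketch of those pieces together, a Task whose Operator draws on the preconfigured bigquery_default Connection via a Hook might look like this (the query, table and task name are assumptions, not from this repo):

```
from airflow.contrib.hooks.bigquery_hook import BigQueryHook
from airflow.operators.python_operator import PythonOperator

def count_rows(**kwargs):
    # The Hook reads its credentials from the bigquery_default Connection.
    hook = BigQueryHook(bigquery_conn_id='bigquery_default', use_legacy_sql=False)
    df = hook.get_pandas_df('SELECT COUNT(*) AS n FROM `dta-ga-bigquery.tmp.example`')  # hypothetical table
    return int(df['n'][0])

# Defined inside a "with models.DAG(...) as dag:" block.
count_task = PythonOperator(task_id='count_rows', python_callable=count_rows, provide_context=True)
```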

Tutorials: http://airflow.apache.org/tutorial.html, https://cloud.google.com/composer/docs/how-to/using/writing-dags and https://cloud.google.com/blog/products/gcp/how-to-aggregate-data-for-bigquery-using-apache-airflow

Tips for designing a workflow: https://en.wikipedia.org/wiki/SOLID

### Header


Set email_on_failure to True to send an email notification when an operator in the DAG fails.
```
import datetime

from airflow import models

default_dag_args = {
    # Email whenever an Operator in the DAG fails.
    'email': models.Variable.get('email'),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5)
}
```

Schedule, start date and SLA

**WARNING: if the start date is in the past, Airflow will try to catch up by running a job for every schedule period (e.g. daily) since then, the first time the DAG is loaded.**

If the task takes longer than the SLA, an alert email is triggered.

```
with models.DAG(
        'ga_quarterly_reporter',
        schedule_interval=datetime.timedelta(days=90),
        sla=datetime.timedelta(hours=1),
        default_args=default_dag_args) as dag:
```

### Variables

Variables are configured via the webserver under Admin -> Variables. A variable can be a string, a Python list or a Python dict.
The second parameter of the .get() function is the default value, used if the variable isn't found.
You can use variables with Python's string formatting functions: https://docs.python.org/3/library/string.html#formatexamples
```
from airflow import models

# e.g. as part of default_dag_args (or passed directly to a Dataflow operator)
default_dag_args = {
    'dataflow_default_options': {
        'project': models.Variable.get('GCP_PROJECT', 'dta-ga-bigquery'),
        'tempLocation': 'gs://staging.%s.appspot.com/'
                        % models.Variable.get('GCP_PROJECT', 'dta-ga-bigquery'),
    },
}
```
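
For list or dict values, Variable.get can deserialize JSON directly. A brief sketch (the agency_domains Variable name and the default email address are assumptions):

```
from airflow import models

# A JSON Variable (e.g. a list of agency domains) deserialized into a Python list.
domains = models.Variable.get('agency_domains', deserialize_json=True)

# The second parameter is the default used when the Variable isn't set.
email = models.Variable.get('email', 'analytics@example.gov.au')
```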

### Operators

Full listings at http://airflow.apache.org/_api/airflow/operators/index.html and http://airflow.apache.org/_api/airflow/contrib/operators/index.html include operators for Bash scripts, JIRA, S3, SQL databases, etc.

**Our favourite operators:**

- PythonOperator
http://airflow.apache.org/howto/operator/python.html

- BigQueryOperator and BigQueryToCloudStorageOperator

Our environment automatically has a connection to BigQuery, so no credentials are needed.

http://airflow.apache.org/_api/airflow/contrib/operators/bigquery_operator/index.html

- KubernetesPodOperator
Perfect for running an R script, or a Python script that needs system packages (e.g. for chart/graph rendering).

We run a custom Docker image with extra R packages, described in docker/Dockerfile (see the sketch below this list).

https://airflow.apache.org/_api/airflow/contrib/operators/kubernetes_pod_operator/index.html#airflow.contrib.operators.kubernetes_pod_operator.KubernetesPodOperator
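
A minimal sketch of such a task, assuming a hypothetical image name and script path (it would sit inside the "with models.DAG(...) as dag:" block):

```
from airflow.contrib.operators import kubernetes_pod_operator

render_charts = kubernetes_pod_operator.KubernetesPodOperator(
    task_id='render_charts',
    name='render-charts',
    namespace='default',
    # Hypothetical image built from docker/Dockerfile (R plus tidyverse/ggplot2).
    image='gcr.io/dta-ga-bigquery/galileo-r:latest',
    cmds=['Rscript', '/scripts/render_charts.R'],  # hypothetical script path
    get_logs=True)
```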

**Honorable Mentions:**

- DataFlowOperator

uses Cloud Dataflow, the Google Cloud branded implementation of Apache Beam.
- SlackWebHookOperator

- EmailOperator
Gmail seems to take five or six minutes to virus-scan attachments before they appear (see the sketch below).
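
A brief sketch of an EmailOperator task with an attachment (the subject line and file path are assumptions):

```
from airflow import models
from airflow.operators.email_operator import EmailOperator

send_report = EmailOperator(
    task_id='send_report',
    to=models.Variable.get('email', 'analytics@example.gov.au'),
    subject='Quarterly GA benchmark',  # hypothetical subject line
    html_content='Report attached.',
    files=['/home/airflow/gcs/data/benchmark.csv'])  # hypothetical attachment path
```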


## Dependencies and Deployment

At the end of the file, in the indented "with DAG:" section, you can define dependencies between operators (otherwise they will all run concurrently):
```
A >> B >> C

# or
A >> B
B >> C

# or
A >> B
C << B
```
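
Putting it together, a minimal end-to-end sketch of a DAG with two dependent tasks (the DAG id, task names and callables are illustrative only, not from this repo):

```
import datetime

from airflow import models
from airflow.operators.python_operator import PythonOperator

default_dag_args = {
    'email': models.Variable.get('email', 'analytics@example.gov.au'),
    'email_on_failure': True,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': datetime.datetime(2019, 7, 1),
}

with models.DAG(
        'example_pipeline',
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:

    extract = PythonOperator(task_id='extract', python_callable=lambda: print('extracting'))
    report = PythonOperator(task_id='report', python_callable=lambda: print('reporting'))

    # report only runs after extract succeeds.
    extract >> report
```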

Once you have a DAG, you can drag and drop it into the DAGs folder via the web browser and it will soon be visible in the webserver. When updating a DAG, there is also a Refresh (recycling icon) button.
You can either trigger the whole DAG or "clear" a task to make that task and all dependent tasks run again.

Once it is good, check it into Git!
132 changes: 2 additions & 130 deletions README.md
@@ -1,130 +1,2 @@
# Observatory
To run the HTML version, download augov.gexf from the /data folder on Google Cloud Storage, put it in html/observatory/data, and run html/observatory/run.sh.
8 changes: 3 additions & 5 deletions dags/ga_benchmark.py
@@ -14,15 +14,12 @@
'start_date': yesterday,
}


with models.DAG(
'ga_benchmark',
schedule_interval=datetime.timedelta(days=1),
default_args=default_dag_args) as dag:
project_id = models.Variable.get('GCP_PROJECT','dta-ga-bigquery')
project_id = models.Variable.get('GCP_PROJECT', 'dta-ga-bigquery')

view_id = '69211100'
timestamp = '20190425'
temp_table = 'benchmark_%s_%s' % (view_id, timestamp)
query = """
CREATE TABLE `{{params.project_id}}.tmp.{{ params.temp_table }}`
@@ -58,6 +55,7 @@
export_benchmark_to_gcs = bigquery_to_gcs.BigQueryToCloudStorageOperator(
task_id='export_benchmark_to_gcs',
source_project_dataset_table="%s.tmp.%s" % (project_id, temp_table),
destination_cloud_storage_uris=["gs://us-central1-maxious-airflow-64b78389-bucket/data/%s.csv" % (temp_table,)],
destination_cloud_storage_uris=["gs://%s/data/%s.csv" % (
models.Variable.get('AIRFLOW_BUCKET', 'us-east1-dta-airflow-b3415db4-bucket'), temp_table)],
export_format='CSV')
query_benchmark >> export_benchmark_to_gcs