Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incremental time series #115

Merged
merged 3 commits into from
Jul 7, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion crontab
Original file line number Diff line number Diff line change
Expand Up @@ -4,4 +4,4 @@
0 10 * * * /bin/bash -l -c 'cd /home/igrigorik/code && ./sync_har.sh chrome' >> /var/log/HA-import-har-chrome.log 2>&1
0 11 * * * /bin/bash -l -c 'cd /home/igrigorik/code && ./sync_har.sh android' >> /var/log/HA-import-har-android.log 2>&1

0 7 15 * * /bin/bash -l -c 'cd /home/igrigorik/code && sql/generate_reports.sh -fth `date -d "-1 month" "+\%Y_\%m_01"` -r "*crux*"' >> /var/log/crux_reruns.log 2>&1
0 7 15 * * /bin/bash -l -c 'cd /home/igrigorik/code && sql/generate_reports.sh -th `date -d "-1 month" "+\%Y_\%m_01"` -r "*crux*"' >> /var/log/crux_reruns.log 2>&1
5 changes: 4 additions & 1 deletion docs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,7 +38,10 @@ Sometimes it's necessary to manually run this process, for example if a new metr
# -h: Whether to generate histograms. Must be accompanied by the date to query.
#
# -f: Whether to force histogram querying and updating even if the data exists.
# Timeseries are always overwritten.
# Timeseries are usually appended to from last date, but this flag forces a complete rerun
#
# -r: Optional pattern match for reports to be run. Use quotes to avoid the shell expanding names
# (e.g. "*crux*")
```

You can omit one of the `-t` or -h` flags to focus only on histogram or timeseries generation. The `-f` flag ensures that histogram data gets overwritten. Omit this flag to skip queries for dates that already exist (much faster for batch jobs, see below).
Expand Down
80 changes: 68 additions & 12 deletions sql/generate_reports.sh
Original file line number Diff line number Diff line change
Expand Up @@ -146,34 +146,84 @@ else
# Extract the metric name from the file path.
metric=$(echo $(basename $query) | cut -d"." -f1)

date_join=""
max_date=""
current_contents=""
gs_url="gs://httparchive/reports/$gs_lens_dir${metric}.json"
gsutil ls $gs_url &> /dev/null
if [ $? -eq 0 ] && [ $FORCE -eq 0 ]; then
# The file already exists, so skip the query.
echo -e "Skipping $metric timeseries"
continue
if [ $? -eq 0 ]; then
# The file already exists, so check max date
if [ $FORCE -eq 0 ]; then
current_contents=$(gsutil cat $gs_url)
max_date=$(echo $current_contents | jq -r '[ .[] | .date ] | max')

if [[ "${max_date}" == "${YYYY_MM_DD}" || "${max_date}" > "${YYYY_MM_DD}" ]]; then
echo -e "Skipping $metric timeseries"
continue
elif [[ $metric != crux* ]]; then # CrUX is quick and join is more compilicated so just do a full run of that
date_join="SUBSTR(_TABLE_SUFFIX, 0, 10) > \"$max_date\""
if [[ -n "$YYYY_MM_DD" ]]; then
date_join="${date_join} AND SUBSTR(_TABLE_SUFFIX, 0, 10) <= \"$YYYY_MM_DD\""
fi
fi
fi
fi

echo -e "Generating $metric timeseries"
if [[ -n "${date_join}" && -n "${max_date}" ]]; then
echo -e "Generating $metric timeseries in incremental mode from ${max_date} to ${YYYY_MM_DD}"
else
echo -e "Generating $metric timeseries from start"
fi

# Run the query on BigQuery.
START_TIME=$SECONDS
if [[ $LENS != "" ]]; then

if [[ $(grep "httparchive.blink_features.usage" $query) ]]; then
echo "blink_features.usage queries do not support lens's so skipping lens"
continue
fi

lens_join="JOIN ($(cat sql/lens/$LENS/timeseries.sql | tr '\n' ' ')) USING (url, _TABLE_SUFFIX)"
if [[ $metric == crux* ]]; then
echo "CrUX query so using alternative lens join"
lens_join="JOIN ($(cat sql/lens/$LENS/timeseries.sql | tr '\n' ' ')) ON (origin || '\/' = url AND REGEXP_REPLACE(CAST(yyyymm AS STRING), '(\\\\\\\\d{4})(\\\\\\\\d{2})', '\\\\\\\\1_\\\\\\\\2_01') || '_' || IF(device = 'phone', 'mobile', device) = _TABLE_SUFFIX)"
fi
if [[ $(grep "httparchive.blink_features.usage" $query) ]]; then
echo "blink_features.usage queries do not support lens's so skipping lens"
continue

if [[ -n "${date_join}" ]]; then
if [[ $(grep -i "WHERE" $query) ]]; then
# If WHERE clause already exists then add to it, before GROUP BY
result=$(sed -e "s/\(GROUP BY\)/AND $date_join \1/" $query \
| sed -e "s/\(\`[^\`]*\`)*\)/\1 $lens_join/" \
| $BQ_CMD)
else
# If WHERE clause doesn't exists then add it, before GROUP BY
result=$(sed -e "s/\(GROUP BY\)/WHERE $date_join \1/" $query \
| sed -e "s/\(\`[^\`]*\`)*\)/\1 $lens_join/" \
| $BQ_CMD)
fi
else
result=$(sed -e "s/\(\`[^\`]*\`)*\)/\1 $lens_join/" $query \
| $BQ_CMD)
fi
result=$(sed -e "s/\(\`[^\`]*\`)*\)/\1 $lens_join/" $query \
| $BQ_CMD)

else
result=$(cat $query \
| $BQ_CMD)
# blink_features do not support date_join so do full run for them
if [[ -z "${date_join}" || $(grep "httparchive.blink_features.usage" $query) ]]; then
date_join=""
result=$(cat $query \
| $BQ_CMD)
elif [[ $(grep -i "WHERE" $query) ]]; then
# If WHERE clause already exists then add to it, before GROUP BY
result=$(sed -e "s/\(GROUP BY\)/AND $date_join \1/" $query \
| $BQ_CMD)
else
# If WHERE clause doesn't exists then add it, before GROUP BY
result=$(sed -e "s/\(GROUP BY\)/WHERE $date_join \1/" $query \
| $BQ_CMD)
fi
fi

# Make sure the query succeeded.
if [ $? -eq 0 ]; then
ELAPSED_TIME=$(($SECONDS - $START_TIME))
Expand All @@ -182,6 +232,12 @@ else
else
echo "$metric took $ELAPSED_TIME seconds"
fi

# If it's a partial run, then combine with the current results.
if [[ -n "${date_join}" ]]; then
result=$(echo ${result} ${current_contents} | jq '.+= input')
fi

# Upload the response to Google Storage.
echo $result \
| gsutil -h "Content-Type:application/json" cp - $gs_url
Expand Down
4 changes: 2 additions & 2 deletions sync_csv.sh
Original file line number Diff line number Diff line change
Expand Up @@ -135,8 +135,8 @@ cd $HOME/code

gsutil -q stat gs://httparchive/reports/$table/*
if [ $? -eq 1 ]; then
. sql/generate_reports.sh -fth $table
ls -1 sql/lens | xargs -I lens sql/generate_reports.sh -fth $table -l lens
. sql/generate_reports.sh -th $table
ls -1 sql/lens | xargs -I lens sql/generate_reports.sh -th $table -l lens
else
echo -e "Reports for ${table} already exist, skipping."
fi
Expand Down
4 changes: 2 additions & 2 deletions sync_har.sh
Original file line number Diff line number Diff line change
Expand Up @@ -87,8 +87,8 @@ cd $HOME/code

gsutil -q stat gs://httparchive/reports/$table/*
if [ $? -eq 1 ]; then
. sql/generate_reports.sh -fth $table
ls -1 sql/lens | xargs -I lens sql/generate_reports.sh -fth $table -l lens
. sql/generate_reports.sh -th $table
ls -1 sql/lens | xargs -I lens sql/generate_reports.sh -th $table -l lens
else
echo -e "Reports for ${table} already exist, skipping."
fi
Expand Down