changelog.md

v0.2.55

  • Refactoring: Extract RawRefineDataReader from RefineHelper, used in Refine-to-Hive
  • Refactoring: Extract SparkEventSchemaLoader from RefineHelper, used in Refine-to-Hive
  • Refine: Add an option to ignore missing input folders

v0.2.54

  • Fix HdfsXMLFsImageConverter block reading
  • Fix mediawiki-jdbc issue that caused slower pulls
  • build: add sdkman configuration

v0.2.53

  • Fix on TransformFunction deterministic behavior

v0.2.52

  • Update the smtp server settings for email from refine

v0.2.51

  • Add a new mediawiki-jdbc spark datasource to refinery-spark.

v0.2.50

  • Refactor Refine to make it work atomically from Airflow

v0.2.49

  • Make deduplication TransformFunction deterministic.

v0.2.48

  • Add Special:AllEvents to the PageviewDefinition

v0.2.47

  • No changes recorded for this release

v0.2.46

  • Enable pivoting with varied casing in DataPivoter

v0.2.45

  • refinery-job: add webrequest instrumentation

v0.2.44

  • Update eventutilities version to 1.3.6

v0.2.43

  • Add unit tests for Refinery*DatabaseResponse
  • Fix MediawikiHistory Checker Null Exceptions
  • Update column definition for uniqueness check.

v0.2.42

  • Fix NPE in creation of RefineryISPDatabaseResponse

v0.2.41

  • Update clickstream job - better joins
  • chore: remove leftover from refinery-cassandra
  • style(maxmind): fix checkstyle violations for MaxMind package.
  • fix(*DatabaseReader): avoid null pointer exception when reading MaxMind
  • Include subdivision ISO code in the geo response
  • MediawikiHistory: typesafe access to compliance value.
  • Update MediawikiXMLDumpsConverter
  • Refine DeequColumnAnalysis code

v0.2.40

  • Upgrade MediawikiHistory Checker to use AWS Deequ

v0.2.39

  • Update the ProduceCanaryEvents job

v0.2.38

v0.2.37

  • Correctly apply distanceToPrimary in CommonsCategoryGraphBuilder
  • Move version configuration of dependencies to main pom
  • Sort the dependencyManagement section according to sortPom
  • Remove duplication from parent pom
  • Start using wmf-jvm-parent-pom
  • Sort pom.xml according to standard sortpom order

v0.2.36

  • Add CommonsCategoryGraphBuilder for Commons Impact Metrics

v0.2.35

  • Extract RefineDataset

v0.2.34

v0.2.33

  • Update ProduceCanaryEvents job
  • Add DataPivoter job

v0.2.32

  • data-quality: rename source table column
  • Cleanup dependencies in refinery-tools module.
  • Cleanup of ISPDatabaseReader.

v0.2.31

  • IcebergWriter: don't create missing tables
  • Simplify GeocodeDatabaseReader.
  • Simplify CountryDatabaseReader.

v0.2.30

v0.2.29

v0.2.28

  • Switch to jdk17 for sonar.
  • refinery-job: log data quality alert severity level.
  • refinery-job: add WebrequestMetrics AWS Deequ data quality job.
  • refinery-spark: add APIs to export AWS Deequ analysis and verification suite results to a Wikimedia metrics and alerting data model.

v0.2.27

  • Fix recursion for Maps with Structs on SanitizeTransformation

v0.2.26

  • Bump eventutilities version to 1.3.2
  • ProduceCanaryEvents now will retry failed HTTP POSTs to event intake services.

v0.2.25

v0.2.24

  • Update project namespace map view
  • Improve fidelity of dumps import
  • Add siteinfo information to output XML
  • Create a job to dump XML/SQL MW history files to HDFS

v0.2.23

  • Remove unused cassandra module

v0.2.22

  • Make refine SchemaLoader main function thread safe
  • Remove special KaiOS App checks from pageview def

v0.2.21

v0.2.20

  • Add special=ViewObject to allowed pageviews

v0.2.19

  • Turn on pageviews for wikifunctions

v0.2.18

  • JsonSchemaConverter - log full JSONSchema when converting to Spark fails
  • Remove deprecated code for AppSessionMetrics

v0.2.17

  • Add explicit snapshot to HiveToDruid

v0.2.16

  • Use the eventutilities v1.2.9 shaded jar to fix conflicts between the guava included in the hadoop classpath and the one present in the refinery-job fat jar.
  • Update refine job to Spark 3 by fixing the "Factory already defined" issue

v0.2.15

v0.2.14

  • Fix HiveToDruid to allow for non-partitioned source tables

v0.2.13

  • Replace Guava with Caffeine (see the sketch below)
  • Prepare refine_webrequest UDFs for Spark (multi-thread environment)
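
A rough illustration of what the Guava-to-Caffeine switch looks like at a call site follows; the cache name, key/value types, and size are illustrative assumptions, not the refinery's actual configuration:

```scala
import com.github.benmanes.caffeine.cache.{Cache, Caffeine}

// Illustrative sketch only: a bounded, thread-safe in-process cache built
// with Caffeine, roughly where a Guava cache would have been used before.
object CaffeineCacheSketch {
  val normalizedHosts: Cache[String, String] = Caffeine.newBuilder()
    .maximumSize(10000L)
    .build[String, String]()

  def main(args: Array[String]): Unit = {
    normalizedHosts.put("EN.Wikipedia.ORG", "en.wikipedia.org")
    // Safe to read from multiple Spark task threads.
    println(normalizedHosts.getIfPresent("EN.Wikipedia.ORG"))
  }
}
```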

v0.2.12

  • Update MediasitesDefinition to remove github.io and thereby update GetRefererDataUDF

v0.2.11

  • Support snapshot partitioning in HiveToDruid and DataFrameToDruid
  • Refactor and Expand External referer classification

v0.2.10

  • Put wikihadoop into refinery/source
  • Add HdfsXMLFsImageConverter to refinery-job

v0.2.9

  • Update mediawiki-history page and user computation
  • Add Custom Authentication Configuration Class for Cassandra

v0.2.8

  • Fix mediawiki-history-denormalize for spark 3.
  • Add unit test for MediaWikiEvent.
  • Fix empty path bug of MediawikiHistoryDumper.

v0.2.7

  • Remove spark-cassandra-connector dependency in refinery-job

v0.2.6

  • Performance fixes for Array UDFs

v0.2.5

  • Bump eventutilities to 1.2.0 and remove duplicate dependency
  • Repurpose refinery-tools to contain code reused across other modules
  • Update search engine detection
  • Add ArrayAvgUDF (a rough sketch follows this list)
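
A rough sketch of what an array-average Hive UDF can look like is shown below; this is hypothetical code, and the real ArrayAvgUDF may be implemented differently (for instance as a GenericUDF with broader type support):

```scala
import java.util.{List => JList}

import org.apache.hadoop.hive.ql.exec.UDF
import scala.collection.JavaConverters._

// Hypothetical sketch, not the real ArrayAvgUDF. Once registered, usage in
// HiveQL would be along the lines of:
//   SELECT array_avg(array(1.0, 2.0, 3.0));
class ArrayAvgSketchUDF extends UDF {
  def evaluate(values: JList[java.lang.Double]): java.lang.Double = {
    if (values == null || values.isEmpty) return null
    val defined = values.asScala.filter(_ != null).map(_.doubleValue)
    if (defined.isEmpty) null else java.lang.Double.valueOf(defined.sum / defined.size)
  }
}
```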

v0.2.4

  • Update UA-Parser to 1.5.3

v0.2.3

  • Spark JsonSchemaConverter - log when schema does not contain type field
  • Fix HDFSArchiver doneFilePath parameter

v0.2.2

  • UDF for testing uri_query for duplicate query parameters (see the sketch below)
  • Suppress useless GeocodeDatabaseReader log warn messages
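
A minimal sketch of the duplicate-parameter check is below; the function name and signature are hypothetical, not the actual UDF:

```scala
// Hypothetical sketch: return true when a uri_query string repeats a key,
// e.g. "?action=query&action=parse". Not the actual UDF implementation.
def hasDuplicateQueryParams(uriQuery: String): Boolean = {
  val keys = uriQuery.stripPrefix("?")
    .split("&")
    .filter(_.nonEmpty)
    .map(_.split("=", 2)(0))
  keys.length != keys.distinct.length
}

// hasDuplicateQueryParams("?action=query&action=parse")  // true
// hasDuplicateQueryParams("?action=query&format=json")   // false
```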

v0.2.1 [post-release update]

  • Update SparkSQLNoCLIDriver to error correctly

v0.2.0

  • Update to spark-3 and scala-2.12
  • Fix returned error code in HDFSArchiver
  • Fix typo in MediaFileUrlParser

v0.1.27

  • Make caching mechanisms thread ready
  • Migrate wikidata/item_page_link/weekly

v0.1.26

  • Create a Hive to Graphite job
  • Add archiving job for Airflow

v0.1.25

  • Integrate SparkSQLNoCLIDriver and HiveToCassandra

v0.1.24

  • Update refine netflow_augment transform function
  • Update structured_data dumps parsing job

v0.1.23

  • Add SparkSQLNoCLIDriver job

v0.1.22

  • Simplify RSVD anomaly detection job for Airflow POC
  • Save commons json dumps as a table and add fields for wikidata

v0.1.21

  • Refine - don't remove records during deduplication if ids are null

v0.1.20

  • Fix bug in HDFSCleaner where directories containing only subdirectories would always be deleted.
  • Spark JsonSchemaConverter now always treats an object with additionalProperties as a MapType (illustrated below).
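
For illustration, in terms of Spark's type API (this is not the converter's code, only the target types it now produces, e.g. in a spark-shell):

```scala
import org.apache.spark.sql.types._

// A JSONSchema object such as
//   {"type": "object", "additionalProperties": {"type": "string"}}
// is now always converted to a Spark MapType ...
val asMap: DataType = MapType(StringType, StringType, valueContainsNull = true)

// ... while an object with fixed "properties" still becomes a StructType:
val asStruct: DataType = StructType(Seq(StructField("page_id", LongType, nullable = true)))
```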

v0.1.19

  • Remove /wmf/gobblin from HDFSCleaner disallow list

v0.1.18

  • Add num-partitions parameter to mediawiki-history checkers
  • Standard artifacts are no longer shaded. Shaded versions are suffixed with -shaded. Production deployments that reference the shaded jars will have to be updated.

v0.1.17

  • Load cassandra3 from spark

v0.1.16

  • Remove refinery-camus module - T271232
  • Refine - replace default formatters with gobblin convention
  • Refine - default event transform functions now add normalized_host info

v0.1.15

  • Refine - explicitly uncache DataFrame when done
  • Fix UAParser initialization to re-use static CachingParser instance and synchronize its usage

v0.1.14

  • RefineTarget - support gzipped json input format

v0.1.13

  • ProduceCanaryEvents - fix exit val

v0.1.12

  • ProduceCanaryEvents - produce events one at a time for better error handling

v0.1.11

  • Add scala job for reliability metrics of Wikidata

v0.1.10

  • Fix com.criteo:rsvd dependency issue
  • Update refinery-cassandra to cassandra 3.11
  • Report on test coverage
  • Ensure that maven site generation works.

v0.1.9

  • Revert addition of maven doc site generation (somehow this is causing release to fail).

v0.1.8

  • Fix bug in RefineSanitizeMonitor when using keep_all_enabled

v0.1.7

  • Bump to eventutilities 1.0.6
  • (Six related commits on style checking and linting)

v0.1.6

  • SanitizeTransformation - Just some simple logging improvements

v0.1.5

  • Fix bug in Refine where table regexes were not matching properly
  • Factor out HiveExtensions.normalizeName to HivePartition.normalize.
  • ProduceCanaryEvents - include httpRequest body in failure message

v0.1.4

  • Include RefineFailuresChecker functionality into RefineMonitor and fix bug in Refine.Config
  • Switch to eventutilities 1.0.5

v0.1.3

  • Improve Refine failure report email
  • Add support for finding RefineTarget inputs from Hive
  • Refactor EventLoggingSanitization to a generic job: RefineSanitize
  • Rename whitelist to allowlist for Refine sanitization
  • Update WMF domain list with Cloud and toolforge
  • Fix failing sonar analysis due to JDK11 removing tools.jar

v0.1.2

  • Update UA-Parser to 1.5.2
  • Minimal configuration of Sonar maven plugin
  • Standardize CI builds on Maven Wrapper
  • Make null result same shape as normal result
  • Fix wikitext history job

v0.1.1

  • Update hadoop and hive dependencies versions (BigTop upgrade)

v0.1.0

  • Exclude debug requests from pageviews

v0.0.146

  • Make HiveToDruid return exit code when deployMode=client

v0.0.145

  • Fix DataFrameToHive repartition-to-empty failure

v0.0.144

  • Fix DataFrameExtension.convertToSchema repartition
  • Change DataFrameToDruid base temporary path
  • refinery-core: iputils: refresh cloud addresses

v0.0.143

  • Update junit and netty versions for github security alert
  • Refine - Add TransformFunction is_wmf_domain
  • Refine - Add TransformFunction to remove canary events
  • Refine - use PERMISSIVE mode and log more info about corrupt records

v0.0.142

  • Upgrade maven configuration and plugins
  • Move pageview filters to PageviewDefinition; add Webrequest.isWMFHostname

v0.0.141

  • Update pageview title extraction for trailing EOL
  • Expand EZ project conversion to adapt to raw format

v0.0.140

  • Add datasource argument to HiveToDruid

v0.0.139

  • Add caching to maxmind readers in core package
  • Add Refine transform function for Netflow data set

v0.0.138

  • Fix maxmind UDFs for hive 2.3.3 (bigtop)
  • Update MediawikiXMLDumpsConverter repartitioning

v0.0.137

  • Use camus + EventStreamConfig integration in CamusPartitionChecker
  • Remove lat/long/postal code from geocoding

v0.0.136

  • Chopping timeseries for noise detection

v0.0.135

  • Add ProduceCanaryEvents job
  • Add dependency on wikimedia event-utilities and use schema loader classes from it

v0.0.134

  • Mediawiki History Dumps ordering fix

v0.0.133

  • Refine - Add legacy useragent column if field exists in event schema or in Hive
  • Pageview definition - Exclude requests with app user agents from web pageviews

v0.0.132

  • Refine - Quote SQL columns used in selectExpr in TransformFunctions
  • Remove outdated IOS pageview code
  • For Android and iOS we only count pageviews with x-Analytics marker

v0.0.131

  • Refine - Don't merge Hive schema by default when reading input data
  • Overloaded methods to make working with Refine easier
  • Remove unused custom avro camus classes
  • Fix mediawiki-history skewed join bug
  • Remove sysop domains from accepted pageviews

v0.0.130

  • Rename pageview_actor_hourly to pageview_actor in clickstream job

v0.0.129

  • Make mediawiki_history skewed join deterministic
  • Remove filter_allowed_domains from common event_transforms
  • Label mobile-html endpoint requests as app pageviews

v0.0.128

  • Add UDF that transforms Pagecounts-EZ projects into standard
  • Correct bug in webrequest host normalization
  • Add a corrected bzip2 codec for spark
  • Update clickstream to read from pageview_actor_hourly instead of webrequest
  • Make ActorSignatureGenerator non-singleton
  • Add special explode UDTF that turns EZ-style hourly strings into rows

v0.0.127

  • Refine geocode_ip transform sets legacy EventLogging IP field

v0.0.126

  • Sort mediawiki history dumps by timestamp
  • DataFrameToHive - drop partition before writing output data
  • Make event transform functions smarter about choosing which possible column to use
  • RefineTarget - fix off by one bug in hoursInBetween used to find RefineTargets
  • Refactor JsonSchemaLoader and add JsonLoader
  • Make anomaly detection correctly handle holes in time-series
  • Add EvolveHiveTable tool

v0.0.125

  • Use page move events to improve joining to wikidata entity

v0.0.124

  • Fix snakeyaml upgrade issue in EL sanitization

v0.0.123

  • Fix RSVDAnomalyDetection using parameters for data-length validation
  • Unify Refine transform functions and add user agent parser transform
  • RefineTarget.shouldRefine now considers both table whitelist and blacklist

v0.0.122

  • Update hive geocoded-data udf
  • Allow pageview titles that include Unicode character values above 0xFFFF like emoji
  • Make RSVDAnomalyDetection ignore too short timeseries
  • Add check for corrupted (empty) flag files
  • Add MeetingRoomApp to the bot regex

v0.0.121

  • Add ActorSignatureGenerator and GetActorSignatureUDF
  • Add documentation to maven developerConnection parameter
  • Add RefineFailuresChecker in refinery-spark and fix documentation
  • Support multiple possible schema base URIs in EventSchemaLoader

v0.0.120

  • Add maven developerConnection parameter to allow CLI override

v0.0.119

  • Count pageviews to wikimania.wikimedia.org
  • Detect pageviews as requested by KaiOS

v0.0.118

  • Fix wikidata article-placeholder job

v0.0.117

  • Move wikidata jobs in the wikidata package
  • Fix WikidataArticlePlaceholderMetrics
  • Add wikidata item_page_link spark job

v0.0.116

  • Revert GetGeoDataUDF Fix from 114, hotfix

v0.0.115

  • Fix webrequest host normalization
  • Refine - Warn when merging incompatible types; FAILFAST when reading JSON data with a schema

v0.0.114

  • Fix GetGeoDataUDF and underlying function
  • Remove BannerImpressions streaming job and deps
  • Add spark code for wikidata json dumps parsing

v0.0.113

  • Change format of data_quality_stats to parquet
  • Update mediawiki-history dumper
  • Enforce distinct revision in xml-dumps converter

v0.0.112

  • Add Spark/Scala module for time series anomaly detection

v0.0.111

  • Modify external webrequest search engine classification

v0.0.110

  • Correct MW XML dumps converter parameter parsing
  • Fix WikidataArticlePlaceholderMetrics query

v0.0.109

  • Document JDK version requirement
  • Add Spark job to update data quality table with incoming data

v0.0.108

  • Fix user agent for WDQS updater counter

v0.0.107

  • Update UA parser to add kaiOS
  • Add query to track WDQS updater hitting Special:EntityData
  • Make HDFSCleaner robust to external file deletions

v0.0.106

  • HDFSCleaner Improvements

v0.0.105

  • Upgrade Spark to 2.4.4
  • Update HDFSCleaner logging

v0.0.104

  • Add HDFSCleaner to aid in cleaning HDFS tmp directories

v0.0.103

  • Update mediawiki-history-dumper (file names and future date events)

v0.0.102

  • Fix refine wikipedia.org eventlogging data

v0.0.101

  • Update subnet lists for IpUtil

v0.0.100

  • Update ua-parser dependency and related functions and tests
  • Add mediawiki-history-dumper spark job

v0.0.99

  • Third party data should not get refined, fixing typo

v0.0.98

  • media info UDF now provides literal transcoding field

v0.0.97

  • Refine now infers hiveServerUrl from config; --hive_server_url is no longer necessary.

v0.0.96

  • Making RefineMonitor error message more clear
  • Adding UDF to get wiki project from referrer string, not used
  • Add new mediatypes to media classification refinery code - T225911

v0.0.95

Version skipped due to deployment problems

v0.0.94

  • Pageview Definition: most special pages should not be counted - T226730
  • EventSchemaLoader uses JsonParser for event data rather than YAMLParser - T227484
  • EventSparkSchemaLoader now merges input JSONSchema with Hive schema before loading - T227088
  • Added whitelist to eventlogging filtering of webhost domains so data from google translate apps is accepted - T227150

v0.0.93

  • Refactor mediawiki-page-history computation + fix
  • Mediawiki-history: Handle dropping of user fields in labs views
  • Update mediawiki_history checker to historical values
  • Update pageview definition to exclude non wiki sites
  • Add entropy UDAF to refinery-hive

v0.0.92

  • Fix wrongly getting the yarn user name in DataFrameToHive
  • Fix transform function for NULL values and for dataframes without the webHost column

v0.0.91

  • Update CirrusRequestDeser.java to use new schema of mediawiki/cirrussearch/request event
  • Add refine transform function to filter out non-wiki hostnames
  • Allow for plus signs in the article titles in the PageviewDefinition
  • Reduce the size limit of user agent strings in the UAParser

v0.0.90

  • Fix javax.mail dependency conflict introduced by including json-schema-validator
  • Improve CamusPartitionChecker error output

v0.0.89

  • Fix wikidata-coeditor job after MWH-refactor
  • ClickstreamBuilder: Decode referer url to utf-8
  • Fix EventLoggingSchemaLoader to properly set useragent is_bot and is_mediawiki fields as booleans
  • Fix EventLoggingSchemaLoader to not include deprecated timestamp in capsule schema
  • RefineTarget - allow missing required fields when reading textual (e.g. JSON) data using JSONSchemas.
  • Filter out 15.wikipedia.org and query.wikidata.org from pageview definition

v0.0.88

  • Fix mediawiki_page_history userId and anonymous
  • Fix mediawiki_history_reduced checker
  • Fix mediawiki-history user event join

v0.0.87

  • Add EventSparkSchemaLoader support to Refine
  • Add jsonschema loader and spark converter classes
  • Adapt EventLogging/WhiteListSanitization to new way of storing
  • Add change_tags and revision_deleted_parts to mediawiki history
  • Fix EventLogging schema URI to include format=json
  • Reject invalid page titles from pageview dumps
  • Correct names in mediawiki-history sql package
  • Update mw user-history timestamps
  • Fix mediawiki-history-checker after field renamed
  • Fix null-timestamps in checker
  • Fix mediawiki-user-history writing filter
  • Update mediawiki-history user bot fields

v0.0.86

-- skipped due to deployment complications https://phabricator.wikimedia.org/T221466 --

v0.0.85

v0.0.84

  • Update mediawiki-history comment and actor joins
  • Update mediawiki-history joining to new actor and comment tables

v0.0.83

  • Add --ignore_done_flag option to Refine
  • Add wikitech to pageview definition
  • HiveExtensions field name normalize now replaces bad SQL characters with "_", not just hyphens.
  • Add new Cloud VPS ip addresses to network origin UDF
  • Correct typo in refinery-core for Maxmind, getNetworkOrigin and IpUtil
  • Allow for custom transforms in DataFrameToDruid

v0.0.82

v0.0.81

  • Use "SORT BY" instead of "ORDER BY" in mediawiki_history_checker job
  • Correctly pass input_path_regex to Refine from EventLoggingSanitization
  • HiveExtensions schema merge now better supports schema changes of complex Array element and Map value types (see the illustration after this list). https://phabricator.wikimedia.org/T210465
  • HiveExtensions findIncompatibleFields was unused and has been removed.
  • Upgrade profig lib to 2.3.3 after bug fix upstream
  • Upgrade spark-avro to 4.0.0 to match new spark versions
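
As an illustration of the kind of evolution the merge now handles, expressed with Spark's type API only (this is not the HiveExtensions merge code):

```scala
import org.apache.spark.sql.types._

// An array whose element struct gained a field:
val oldArray = ArrayType(StructType(Seq(StructField("id", LongType))))
val newArray = ArrayType(StructType(Seq(
  StructField("id", LongType),
  StructField("title", StringType))))

// A map whose value struct gained a field:
val oldMap = MapType(StringType, StructType(Seq(StructField("count", LongType))))
val newMap = MapType(StringType, StructType(Seq(
  StructField("count", LongType),
  StructField("label", StringType))))

// A merged schema keeps the superset of element/value fields.
```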

v0.0.80

  • Update DataFrameToHive and PartitionedDataFrame to support dynamic partitioning and correct some bugs
  • Add WebrequestSubsetPartitioner spark job actually launching a job partitioning webrequest using DataFrameToHive and a transform function

v0.0.79

  • Upgrade camus-wmf dependency to camus-wmf9
  • Fix bug in EventLoggingToDruid, add time measures as dimensions

v0.0.78

  • Rename start_date and end_date to since and until in EventLoggingToDruid.scala

v0.0.77

  • Add spark job converting mediawiki XML-dumps to parquet
  • Default value of hive_server_url updated in Refine.scala job
  • Refactor EventLoggingToDruid to use whitelists and ConfigHelper

v0.0.76

  • Refine Config removes some potentially dangerous defaults, forcing users to set them
  • EventLoggingToDruid now can bucket time measures into ingestable dimensions

v0.0.75

  • Refine and EventloggingSanitization jobs now use ConfigHelper instead of scopt

v0.0.74

  • Add --table-whitelist flag to EventLoggingSanitization job
  • Add ConfigHelper to assist in configuring scala jobs with properties files and CLI overrides (a rough sketch of the pattern follows this list)
  • RefineMonitor now uses ConfigHelper instead of scopt
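
The configuration pattern can be sketched roughly as follows; this is generic, hypothetical code showing properties-file defaults overridden by --key=value CLI arguments, not the ConfigHelper API itself:

```scala
import java.io.FileInputStream
import java.util.Properties

import scala.collection.JavaConverters._

// Hypothetical sketch, not the ConfigHelper API: load defaults from a
// .properties file, then let --key=value CLI arguments override them.
def loadConfig(propertiesPath: String, args: Array[String]): Map[String, String] = {
  val props = new Properties()
  val in = new FileInputStream(propertiesPath)
  try props.load(in) finally in.close()

  val fromFile = props.asScala.toMap
  val fromCli = args.collect {
    case arg if arg.startsWith("--") && arg.contains("=") =>
      val Array(key, value) = arg.stripPrefix("--").split("=", 2)
      key -> value
  }.toMap

  fromFile ++ fromCli // CLI values win over the properties file
}
```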

v0.0.73

  • Add usability, advisory and strategy wikimedia sites to pageview definition

v0.0.72

  • Correct MediawikiHistoryChecker for reduced

v0.0.71

  • Update MediawikiHistoryChecker adding reduced
  • Add MediawikiHistoryChecker spark job
  • Update mediawiki-user-history empty-registration handling: drop user events for users having no registration date (i.e. no edit activity nor registration date in DB)
  • Correct mediawiki-history user registration date: use MIN(DB-registration-date, first-edit-date) instead of COALESCE(DB-registration-date, first-edit-date)

v0.0.70

  • Fix for WhitelistSanitization.scala, allowing null values for struct fields

v0.0.69

  • Fix for CamusPartitionChecker to only send email if errors are encountered

v0.0.68

  • Fix case-insensitivity for MapMaskNodes in WhitelistSanitization
  • Add ability to salt and hash to eventlogging sanitization
  • Add --hive-server-url flag to Refine job
  • CamusPartitionChecker can send error email reports and override Camus properties from System properties.

v0.0.67

  • Add foundation.wikimedia to pageviews
  • Track number of editors from Wikipedia who also edit on Wikidata over time
  • Update user-history job from username to userText
  • Add inline comments to WhitelistSanitization

v0.0.66

  • Add a length limit to webrequest user-agent parsing
  • Allow partial whitelisting of map fields in Whitelist sanitization

v0.0.65

  • Update mediawiki-history statistics for better names and more consistent probing
  • Fix RefineTarget.inferInputFormat filtering out files starting with _

v0.0.64

  • Update regular expressions used to parse User Agent Strings
  • Add PartitionedDataFrame to Spark refine job
  • Fix bug when merging partition fields in WhitelistSanitization.scala
  • Update pageview regex to accept more characters (previously restricted to 2)

v0.0.63

  • Make mediawiki-history statistics generation optional
  • Modify output defaults for EventLoggingSanitization
  • Correct default EL whitelist path in EventLoggingSanitization
  • Correct mediawiki-history job bugs and add unittest
  • Add defaults section to WhitelistSanitization
  • Identify new search engines and refactor Referer parsing code

v0.0.62

  • Fix MediawikiHistory OOM issue in driver
  • Update MediawikiHistory for another performance optimization
  • Rename SparkSQLHiveExtensions to just HiveExtensions
  • Include applicationId in Refine email failure report
  • DataFrameToHive - Use df.take(1).isEmpty rather than exception
  • RefineTarget - Use Hadoop FS to infer input format rather than Spark
  • DataFrameToHive - Use DataFrame .write.parquet instead of .insertInto
  • Correct wikidata-articleplaceholder job SQL RLIKE expression

v0.0.61

  • Fix sys.exit bug in Refine
  • Fix LZ4 version bug with maven exclusion in refinery-spark and refinery-job

v0.0.60

  • Big refactor of scala and spark code:
      • Add refinery-spark module for spark-oriented libs
      • Move non-spark-dependent code to refinery-core
  • Tweak Mediawiki-history job for performance (mostly partitioning)
  • Update Mediawiki-history job to use accumulator to gather stats
  • Add Hive JDBC connection to Refine for it to work with Spark 2
  • Update spark code to use Spark 2.3.0
  • Add new wikidata and pageview tags to webrequest
  • Update Refine to use SQL-casting instead of row-conversion

v0.0.59

  • JsonRefine has been made data source agnostic, and now lives in a refine module in refinery-job. The Spark job is now just called 'Refine'.
  • Add Whitelist Sanitization code and an EventLogging specific job using it
  • Add handling to Refine for cast-able types (e.g. String -> Long) where possible (see the sketch after this list).
  • Added RefineMonitor job to alert if Refine targets are not present.
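
A minimal sketch of the cast-able type idea, assuming a simple String-to-Long case and written for a spark-shell (this is not Refine's actual code):

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType
import spark.implicits._

// Incoming JSON often carries numbers as strings; cast to the Hive table's
// type instead of failing the refinement.
val raw = Seq(("enwiki", "42"), ("dewiki", "7")).toDF("wiki", "page_id")
val refined = raw.withColumn("page_id", col("page_id").cast(LongType))
refined.printSchema() // page_id: long (values that cannot be cast become null)
```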

v0.0.58

  • Refactor geo-coding function and add ISP
  • Update camus part checker topic name normalization
  • Update RefineTarget inputBasePath matches
  • Add GetMediawikiTimestampUDF to refinery-hive
  • Factor out RefineTarget from JsonRefine for use with other jobs
  • Add configurable transform function to JSONRefine
  • Fix JsonRefine so that it respects --until flag
  • Clean refinery-job from BannerImpressionStream job
  • Add core class and job to import EL hive tables to Druid

v0.0.57

  • Add new package refinery-job-spark-2.1
  • Add spark-streaming job for banner-activity

v0.0.56

  • JsonRefine improvements:
      • Use a _REFINE_FAILED flag to indicate previous failure, so we don't re-refine the same bad data over and over again
      • Don't fail the entire partition refinement if Spark can resolve (and possibly throw out) records with non-critical type changes, i.e. don't throw the entire hour away if just a couple of records have floats instead of ints. See: https://phabricator.wikimedia.org/T182000

v0.0.55

  • something/something_latest fields change to something_historical/something
  • UDF for extracting primary full-text search request
  • Fix Clickstream job
  • Change Cassandra loader to local quorum write
  • Add Mediawiki API to RestbaseMetrics
  • Fix mediawiki history reconstruction

v0.0.54

  • refinery-core now builds scala.
  • Add JsonRefine job

v0.0.53

  • Correct field names in mediawiki-history spark job (time since previous revision)
  • Add PhantomJS to the bot_flagging regex
  • Correct mobile-apps-sessions spark job (filter out ts is null)

v0.0.52

  • Add Clickstream builder spark job to refinery-job
  • Move GraphiteClient from refinery-core to refinery-job

v0.0.51

  • Correct bug in host normalization function - make new field be at the end of the struct

v0.0.50

  • Update host normalization function to return project_family in addition to project_class (with the same value), in preparation for removing the project_class field at some point.

v0.0.49

v0.0.48

  • Update mediawiki_history job with JDBC compliant timestamps and per-user and per-page new fields (revision-count and time-from-previous-revision)
  • Removed unused and deprecated ClientIpUDF. See also https://phabricator.wikimedia.org/T118557
  • Mark Legacy Pageview code as deprecated.

v0.0.47

  • Update tests and their dependencies to make them work on Mac and for any user.

v0.0.46

  • Add small cache to avoid repeating normalization in Webrequest.normalizeHost
  • Refactor PageviewDefinition to add RedirectToPageviewUDF
  • Add support for both cs and cz as Czech Wiki Abbreviations to StemmerUDF

v0.0.45

  • Remove is_productive and update time to revert from MediaWiki history denormalizer
  • Add revision_seconds_to_identity_revert to MediaWiki history denormalizer
  • Use hive query instead of parsing non-existent sampled TSV files for guard settings

v0.0.44

  • Update mediawiki history jobs to overwrite result folders

v0.0.43

  • Add mediawiki history spark jobs to refinery-job
  • Add spark job to aggregate historical projectviews
  • Do not filter test[2].wikipedia.org from pageviews

v0.0.42

  • Upgrade hadoop, hive and spark versions after the CDH upgrade. Hadoop and hive have very minor upgrades; spark has a more important one (from 1.5.0 to 1.6.0).
  • Change the three spark jobs in refinery-job to have them working with the new installation (this new installation has a bug preventing the use of HiveContext in oozie).

v0.0.41

v0.0.40

v0.0.39

v0.0.38

v0.0.37

v0.0.36

v0.0.35

  • Classify DuckDuckGo as a search engine
  • Make camus partition checker continue checking other topics if it encounters errors

v0.0.34

  • Update maven jar building in refinery (refinery-core is not uber anymore)
  • Create WikidataSpecialEntityDataMetrics
  • Fix WikidataArticlePlaceholderMetrics class doc

v0.0.33

  • Correct WikidataArticlePlaceholderMetrics

v0.0.32

  • Add WikidataArticlePlaceholderMetrics

v0.0.31

  • Fixes Prefix API request detection
  • Refactor pageview definition for mobile apps
  • Remove IsAppPageview UDF

v0.0.30

  • Add pageview definition special case for iOS App
  • Correct CqlRecordWriter in cassandra module
  • Evaluate Pageview tagging only for apps requests

v0.0.29 [SKIPPED]

v0.0.28

  • Update mediawiki/event-schemas submodule to include information about search results in CirrusSearchRequestSet
  • Drop support for messages without rev id in avro decoders and make latestRev mandatory
  • Upgrade to latest UA-Parser version
  • Update mediawiki/event-schemas submodule to include 3dd6ee3 "Rename ApiRequest to ApiAction".
  • Google Search Engine referer detection bug fix
  • Upgrade camus-wmf dependency to camus-wmf7
  • Requests that come tagged with pageview=1 in x-analytics header are considered pageviews

v0.0.27

  • Upgrade CDH dependencies to 5.5.2
  • Implement the Wikimedia User Agent policy in setting agent type.
  • Remove WikimediaBot tagging.
  • Add ApiAction avro schema.
  • Add functions for categorizing search queries.
  • Update CamusPartitionChecker to not hard-fail on errors.
  • Add to CamusPartitionChecker the possibility to rewind to the last N runs instead of just one.
  • Update AppSession Metrics with explicit typing and sorting improvement
  • Ensure that the schema_repo git submodules are available before packaging

v0.0.26

  • REALLY remove mobile partition use. This was reverted and never deployed in 0.0.25
  • Add split-by-os argument to AppSessionMetrics job

v0.0.25

  • Change/remove mobile partition use
  • Add Functions for identifying search engines as referers
  • Update avro schemas to use event-schema repo as submodule

v0.0.24

  • Implement ArraySum UDF
  • Clean refinery-camus from unnecessary avro files
  • Add UDF that turns a country code into a name

v0.0.23

  • Expand the prohibited uri_paths in the Pageview definition
  • Make maven include avro schema in refinery-camus jar
  • Add a Hive UDF for network origin updating existing IP code
  • Correct CamusPartitionChecker unit test
  • Add refinery-cassandra module, containing the necessary code for loading separated value data from hadoop to cassandra
  • Update refinery-camus adding support for avro messages
  • Add CirrusSearchRequestSet avro schema to refinery camus
  • Update webrequest with an LRUCache to prevent recomputing agentType for recurrent user agent values

v0.0.22

  • Correct CamusPartitionChecker bug.
  • Expand the prohibited URI paths in the pageview definition.

v0.0.21

  • Update CirrusSearchRequestSet avro schema.
  • Update CamusPartitionChecker with more parameters.

v0.0.20

  • Add refinery-camus module
  • Add Camus decoders and schema registry to import Mediawiki Avro Binary data into Hadoop
  • Add camus helper functions and job that reads camus offset files to check if an import is finished or not.

v0.0.19

  • Update regexp filtering bots and rename Webrequest.isCrawler to Webrequest.isSpider for consistency.
  • Update ua-parser dependency version to a more recent one.
  • Update PageviewDefinition so that if the x-analytics header includes the preview tag, the request is not counted as a pageview (see the sketch below).
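
For illustration, the preview check amounts to parsing the key-value X-Analytics header and rejecting tagged requests; the sketch below is hypothetical, and the exact tag format is assumed:

```scala
// Hypothetical sketch, not the actual PageviewDefinition code: parse an
// X-Analytics header of the form "k1=v1;k2=v2;..." and reject requests
// tagged as previews.
def isTaggedPreview(xAnalytics: String): Boolean =
  xAnalytics.split(";")
    .map(_.trim.split("=", 2))
    .collect { case Array(key, value) => key -> value }
    .toMap
    .get("preview")
    .contains("1")

// isTaggedPreview("ns=0;preview=1")  // true -> not counted as a pageview
```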

v0.0.18

  • Add scala GraphiteClient in core package.
  • Add spark job computing restbase metrics and sending them to graphite in job package.

v0.0.17

  • Correct bug in PageviewDefinition removing arbcom-*.wikipedia.org

v0.0.16

  • Correct bug in PageviewDefinition removing outreach.wikimedia.org and donate.wikipedia.org as pageview hosts.
  • Correct bug in page_title extraction, ensuring spaces are always converted into underscores.

v0.0.15

  • Correct bug in PageviewDefinition ensuring correct hosts only can be flagged as pageviews.

v0.0.14

  • Add Spark mobile_apps sessions statistics job.

v0.0.13

  • Fix bug in Webrequest.normalizeHost when uri_host is empty string.

v0.0.12

  • wmf_app_version field now in map returned by UAParserUDF.
  • Added GetPageviewInfoUDF that returns a map of information about Pageviews as defined by PageviewDefinition.
  • Added SearchRequestUDF for classifying search requests.
  • Added HostNormalizerUDF to normalize uri_host fields to regular WMF URI formats. This returns a nice map of normalized host info.

v0.0.11

  • Build against CDH 5.4.0 packages.

v0.0.10

  • Maven now builds non-uber jars by having hadoop and hive in provided scope. It also takes advantage of properties to propagate version numbers.
  • The Pageview class has a function to extract the project from the URI. Bugs in handling mobile URIs have been corrected.
  • Referer classification now outputs a string instead of a map.

v0.0.9

  • Generic functions used in multiple classes now live in a single "utilities" class.
  • Pageview and LegacyPageview have been renamed to PageviewDefinition and LegacyPageviewDefinition, respectively. These also should now use the singleton design pattern, rather than employing static methods everywhere.
  • Renames isAppRequest to isAppPageview (since that's what it does) and exposes it publicly in a new UDF.
  • UAParser usage is now wrapped in a class in refinery-core.

v0.0.8

  • Stop counting edit attempts as pageviews
  • Start counting www.wikidata.org hits
  • Start counting www.mediawiki.org hits
  • Consistently count search attempts
  • Make custom file ending optional for thumbnails in MediaFileUrlParser
  • Fail less hard for misrepresented urls in MediaFileUrlParser
  • Ban dash from hex digits in MediaFileUrlParser
  • Add basic guard framework
  • Add guard for MediaFileUrlParser

v0.0.7

  • Add Referer classifier
  • Add parser for media file urls
  • Fix some NPEs around GeocodeDataUDF

v0.0.6

  • Add custom percent en-/decoders to ease URL normalization.
  • Add IpUtil class and ClientIP UDF to extract request IP given IP address and X-Forwarded-For (see the sketch below).
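
The idea behind the ClientIP UDF can be sketched as follows; this is hypothetical code, and the real IpUtil validates addresses and uses the actual trusted-proxy ranges:

```scala
// Hypothetical sketch, not the real IpUtil: walk the X-Forwarded-For chain
// from the right-most hop and return the first address that is not a
// trusted proxy. The proxy list here is made up.
object ClientIpSketch {
  private val trustedProxies = Set("10.64.0.1", "10.64.0.2") // illustrative

  def clientIp(remoteAddr: String, xForwardedFor: Option[String]): String = {
    val hops = xForwardedFor.toSeq
      .flatMap(_.split(",").map(_.trim))
      .filter(_.nonEmpty) :+ remoteAddr
    hops.reverse.find(ip => !trustedProxies.contains(ip)).getOrElse(remoteAddr)
  }
}
```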

v0.0.5

  • For geocoding, allow to specify the MaxMind databases that should get used.

v0.0.4

  • Pageview definition counts 304s.
  • refinery-core now contains a LegacyPageview class with which to classify legacy pageviews from webrequest data
  • refinery-hive includes IsLegacyPageviewUDF to use legacy pageview classification logic in Hive. UDFs were also added to get the webrequest's access method, extract values from the X-Analytics header, determine whether or not the request came from a crawler, and geocode IP addresses.

v0.0.3

  • refinery-core now contains a Pageview class with which to classify pageviews from webrequest data
  • refinery-hive includes IsPageviewUDF to use Pageview classification logic in hive