- Refactoring: Extract RawRefineDataReader from RefineHelper used in refine to hive
- Refactoring: Extract SparkEventSchemaLoader from RefineHelper used in refine to hive
- Refine: Add an option to ignore missing input folders
- Fix HdfsXMLFsImageConverter block reading
- Fix mediawiki-jdbc issue causing slower pulls
- build: add sdkman configuration
- Fix TransformFunction deterministic behavior
- Update the smtp server settings for email from refine
- Add a new mediawiki-jdbc spark datasource to refinery-spark.
- Refactor Refine to make it work atomically from Airflow
- Make deduplication TransformFunction deterministic.
- Add Special:AllEvents to the PageviewDefinition
- No changes recorded for this release
- Enable pivoting with varied casing in DataPivoter
- refinery-job: add webrequest instrumentation
- Update eventutilities version to 1.3.6
- Add unit tests for Refinery*DatabaseResponse
- Fix MediawikiHistory Checker Null Exceptions
- Update column definition for uniqueness check.
- Fix NPE in creation of RefineryISPDatabaseResponse
- Update clickstream job - better joins
- chore: remove leftover from refinery-cassandra
- style(maxmind): fix checkstyle violations for MaxMind package.
- fix(*DatabaseReader): avoid null pointer exception when reading MaxMind
- Include subdivision ISO code in the geo response
- MediawikiHistory: typesafe access to compliance value.
- Update MediawikiXMLDumpsConverter
- Refine DeequColumnAnalysis code
- Upgrade MediawikiHistory Checker to use AWS Deequ
- Update the ProduceCanaryEvents job
- Modify ClickStreamBuilder pipeline to cope with pagelinks schema changes https://phabricator.wikimedia.org/T355588
- Correctly apply distanceToPrimary in CommonsCategoryGraphBuilder
- Move version configuration of dependencies to main pom
- Sort the dependencyManagement section according to sortPom
- Remove duplication from parent pom
- Start using wmf-jvm-parent-pom
- Sort pom.xml according to standard sortpom order
- Add CommonsCategoryGraphBuilder for Commons Impact Metrics
- Extract RefineDataset
- Update default from_email to [email protected]
- Mediawiki History Data Quality Metrics
- Update ProduceCanaryEvents job
- Add DataPivoter job
- data-quality: rename source table column
- Cleanup dependencies in refinery-tools module.
- Cleanup of ISPDatabaseReader.
- IcebergWriter: don't create tables if absent
- Simplify GeocodeDatabaseReader.
- Simplify CountryDatabaseReader.
- Second fix to webrequest x-analytics field parsing bug https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/992752
- Switch to jdk17 for sonar: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/991010
- Fix code serialization for MediawikiDumper.scala job. https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/991795
- Fix of webrequest x-analytics field parsing https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/992475
- Switch to jdk17 for sonar.
- refinery-job: log data quality alert severity level.
- refinery-job: add WebrequestMetrics AWS Deequ data quality job.
- refinery-spark: add APIs to export AWS Deequ analysis and verification suite results to a Wikimedia metrics and alerting data model.
- Fix recursion for Maps with Structs on SanitizeTransformation
- Bump eventutilities version to 1.3.2
- ProduceCanaryEvents now will retry failed HTTP POSTs to event intake services.
- Use eventutilities-spark JsonSchemaSparkConverter and remove our custom spark JsonSchemaConverter since it is no longer needed. https://phabricator.wikimedia.org/T321854
- Update project namespace map view
- Improve fidelity of dumps import
- Add siteinfo information to output XML
- Create a job to dump XML/SQL MW history files to HDFS
- Remove unused cassandra module
- Make refine SchemaLoader main function thread safe
- Remove special KaiOS App checks from pageview def
- Adapt to nulls in rev_actor and rev_comment on RevisionViewRegistrar.scala https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/944974
- Add special=ViewObject to allowed pageviews
- Turn on pageviews for wikifunctions
- JsonSchemaConverter - log full JSONSchema when converting to Spark fails
- Remove deprecated code for AppSessionMetrics
- Add explicit snapshot to HiveToDruid
- Use eventutilities v1.2.9 shaded jar to fix conflicts between the guava included in the hadoop classpath and the one present in the refinery-job fat jar.
- Update refine job to spark3 by fixing Factory already defined issue
- ProduceCanaryEvents now uses a default 10 second http request timeout. https://phabricator.wikimedia.org/T330236
- Fix HiveToDruid to allow for non-partitioned source tables
- Replace Guava with Caffeine
- Prepare refine_webrequest UDFs for Spark (multi-thread environment)
- Update MediasitesDefinition to remove github.io and thereby update GetRefererDataUDF
- Support snapshot partitioning in HiveToDruid and DataFrameToDruid
- Refactor and Expand External referer classification
- Put wikihadoop into refinery/source
- Add HdfsXMLFsImageConverter to refinery-job
- Update mediawiki-history page and user computation
- Add Custom Authentication Configuration Class for Cassandra
- Fix mediawiki-history-denormalize for spark 3.
- Add unit test for MediaWikiEvent.
- Fix empty path bug of MediawikiHistoryDumper.
- Remove spark-cassandra-connector dependency in refinery-job
- Performance fixes for Array UDFs
- Bump eventutilities to 1.2.0 and remove duplicate dependency
- Repurpose refinery-tools to contain code reused across other modules
- Update search engine detection
- Add ArrayAvgUDF
- Update UA-Parser to 1.5.3
- Spark JsonSchemaConverter - log when schema does not contain type field
- Fix HDFSArchiver doneFilePath parameter
- UDF for testing uri_query for duplicate query parameters
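The duplicate-parameter check can be sketched as below. This is a hypothetical helper, not the actual refinery-hive UDF; the real UDF's query-string parsing rules may differ.

```java
import java.util.HashSet;
import java.util.Set;

public class DuplicateQueryParams {
    // Returns true if any parameter name appears more than once in uri_query.
    // Accepts a query string with or without a leading '?'.
    public static boolean hasDuplicateParams(String uriQuery) {
        if (uriQuery == null || uriQuery.isEmpty()) return false;
        String q = uriQuery.startsWith("?") ? uriQuery.substring(1) : uriQuery;
        Set<String> seen = new HashSet<>();
        for (String pair : q.split("&")) {
            if (pair.isEmpty()) continue;
            String name = pair.split("=", 2)[0];
            if (!seen.add(name)) return true;  // name already seen: duplicate
        }
        return false;
    }
}
```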
- Suppress useless GeocodeDatabaseReader log warn messages
- Update SparkSQLNoCLIDriver to error correctly
- Update to spark-3 and scala-2.12
- Fix returned error code in HDFSArchiver
- Fix typo in MediaFileUrlParser
- Make caching mechanisms thread ready
- Migrate wikidata/item_page_link/weekly
- Create a Hive to Graphite job
- Add archiving job for Airflow
- Integrate SparkSQLNoCLIDriver and HiveToCassandra
- Update refine netflow_augment transform function
- Update structured_data dumps parsing job
- Add SparkSQLNoCLIDriver job
- Simplify RSVD anomaly detection job for Airflow POC
- Save commons json dumps as a table and add fields for wikidata
- Refine - don't remove records during deduplication if ids are null
- Fix bug in HDFSCleaner where directories with only directories would always be deleted.
- Spark JsonSchemaConverter now treats an object additionalProperties as a MapType always.
- Remove /wmf/gobblin from HDFSCleaner disallow list
- Add num-partitions parameter to mediawiki-history checkers
- Standard artifacts are no longer shaded; shaded versions are suffixed with -shaded. Production deployments that reference the shaded jars will have to be updated.
- Load cassandra3 from spark
- Remove refinery-camus module - T271232
- Refine - replace default formatters with gobblin convention
- Refine - default event transform functions now add normalized_host info
- Refine - explicitly uncache DataFrame when done
- Fix UAParser initialization to re-use static CachingParser instance and synchronize its usage
- RefineTarget - support gzipped json input format
- ProduceCanaryEvents - fix exit val
- ProduceCanaryEvents - produce events one at a time for better error handling
- Add scala job for reliability metrics of Wikidata
- Fix com.criteo:rsvd dependency issue
- Update refinery-cassandra to cassandra 3.11
- Report on test coverage
- Ensure that maven site generation works.
- Revert addition of maven doc site generation (somehow this is causing release to fail).
- Fix bug in RefineSanitizeMonitor when using keep_all_enabled
- Bump to eventutilities 1.0.6
- (Six related commits on style checking and linting)
- SanitizeTransformation - Just some simple logging improvements
- Fix bug in Refine where table regexes were not matching properly
- Factor out HiveExtensions.normalizeName to HivePartition.normalize.
- ProduceCanaryEvents - include httpRequest body in failure message
- Include RefineFailuresChecker functionality into RefineMonitor and fix bug in Refine.Config
- Switch to eventutilities 1.0.5
- Improve Refine failure report email
- Add support for finding RefineTarget inputs from Hive
- Refactor EventLoggingSanitization to a generic job: RefineSanitize
- Rename whitelist to allowlist for Refine sanitization
- Update WMF domain list with Cloud and toolforge
- Fix failing sonar analysis due to JDK11 removing tools.jar
- Update UA-Parser to 1.5.2
- Minimal configuration of Sonar maven plugin
- Standardize CI builds on Maven Wrapper
- Make null result same shape as normal result
- Fix wikitext history job
- Update hadoop and hive dependencies versions (BigTop upgrade)
- Exclude debug requests from pageviews
- Make HiveToDruid return exit code when deployMode=client
- Fix DataFrameToHive repartition-to-empty failure
- Fix DataFrameExtension.convertToSchema repartition
- Change DataFrameToDruid base temporary path
- refinery-core: iputils: refresh cloud addresses
- Update junit and netty versions for github security alert
- Refine - Add TransformFunction is_wmf_domain
- Refine - Add TransformFunction to remove canary events
- Refine - use PERMISSIVE mode and log more info about corrupt records
- Upgrade maven configuration and plugins
- Move pageview filters to PageviewDefinition; add Webrequest.isWMFHostname
- Update pageview title extraction for trailing EOL
- Expand EZ project conversion to adapt to raw format
- Add datasource argument to HiveToDruid
- Add caching to maxmind readers in core package
- Add Refine transform function for Netflow data set
- Fix maxmind UDFs for hive 2.3.3 (bigtop)
- Update MediawikiXMLDumpsConverter repartitioning
- Use camus + EventStreamConfig integration in CamusPartitionChecker
- Remove lat/long/postal code from geocoding
- Chopping timeseries for noise detection
- Add ProduceCanaryEvents job
- Add dependency on wikimedia event-utilities and use schema loader classes from it
- Mediawiki History Dumps ordering fix
- Refine - Add legacy useragent column if field exists in event schema or in Hive
- Pageview definition - Exclude requests with app user agents from web pageviews
- Refine - Quote SQL columns used in selectExpr in TransformFunctions
- Remove outdated IOS pageview code
- For Android and iOS we only count pageviews with x-Analytics marker
- Refine - Don't merge Hive schema by default when reading input data
- Overloaded methods to make working with Refine easier
- Remove unused custom avro camus classes
- Fix mediawiki-history skewed join bug
- Remove sysop domains from accepted pageviews
- Rename pageview_actor_hourly to pageview_actor in clickstream job
- Make mediawiki_history skewed join deterministic
- Remove filter_allowed_domains from common event_transforms
- Label mobile-html endpoint requests as app pageviews
- Add UDF that transforms Pagecounts-EZ projects into standard
- Correct bug in webrequest host normalization
- Add a corrected bzip2 codec for spark
- Update clickstream to read from pageview_actor_hourly instead of webrequest
- Make ActorSignatureGenerator non-singleton
- Add special explode UDTF that turns EZ-style hourly strings into rows
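The explode logic can be sketched as follows, assuming the Pagecounts-EZ convention in which each hour is encoded as a letter ('A' for hour 0 through 'X' for hour 23) followed by its count; the class and method names are illustrative, not the actual UDTF.

```java
import java.util.ArrayList;
import java.util.List;

public class EzHourlyExplode {
    // Parses an EZ-style hourly string such as "A12B7X3" into (hour, count)
    // pairs, assuming 'A' encodes hour 0 through 'X' hour 23.
    public static List<int[]> explode(String hourly) {
        List<int[]> rows = new ArrayList<>();
        int i = 0;
        while (i < hourly.length()) {
            int hour = hourly.charAt(i) - 'A';          // letter -> hour of day
            int j = i + 1;
            while (j < hourly.length() && Character.isDigit(hourly.charAt(j))) j++;
            rows.add(new int[]{hour, Integer.parseInt(hourly.substring(i + 1, j))});
            i = j;
        }
        return rows;
    }
}
```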
- Refine geocode_ip transform sets legacy EventLogging IP field
- Sort mediawiki history dumps by timestamp
- DataFrameToHive - drop partition before writing output data
- Make event transform functions smarter about choosing which possible column to use
- RefineTarget - fix off by one bug in hoursInBetween used to find RefineTargets
- Refactor JsonSchemaLoader and add JsonLoader
- Make anomaly detection correctly handle holes in time-series
- Add EvolveHiveTable tool
- Use page move events to improve joining to wikidata entity
- Fix snakeyaml upgrade issue in EL sanitization
- Fix RSVDAnomalyDetection using parameters for data-length validation
- Unify Refine transform functions and add user agent parser transform
- RefineTarget.shouldRefine now considers both table whitelist and blacklist
- Update hive geocoded-data udf
- Allow pageview titles that include Unicode character values above 0xFFFF like emoji
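Characters above 0xFFFF are stored as surrogate pairs in Java strings, so char-based validation mishandles emoji; iterating code points treats each emoji as a single character. The sketch below illustrates the idea with a hypothetical rule set, not the actual PageviewDefinition checks.

```java
public class TitleChars {
    // A char-based check would see an emoji as two surrogate chars;
    // codePoints() yields it as one value above 0xFFFF.
    // The disallowed characters here are illustrative only.
    public static boolean isValidTitle(String title) {
        return !title.isEmpty() && title.codePoints()
            .noneMatch(cp -> Character.isISOControl(cp) || cp == '|' || cp == '#');
    }
}
```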
- Make RSVDAnomalyDetection ignore too short timeseries
- Add check for corrupted (empty) flag files
- Add MeetingRoomApp to the bot regex
- Add ActorSignatureGenerator and GetActorSignatureUDF
- Add documentation to maven developerConnection parameter
- Add RefineFailuresChecker in refinery-spark and fix documentation
- Support multiple possible schema base URIs in EventSchemaLoader
- Add maven developerConnection parameter to allow CLI override
- Count pageviews to wikimania.wikimedia.org
- Detect pageviews as requested by KaiOS
- Fix wikidata article-placeholder job
- Move wikidata jobs in the wikidata package
- Fix WikidataArticlePlaceholderMetrics
- Add wikidata item_page_link spark job
- Revert GetGeoDataUDF Fix from 114, hotfix
- Fix webrequest host normalization
- Refine - Warn when merging incompatible types; FAILFAST when reading JSON data with a schema
- Fix GetGeoDataUDF and underlying function
- Remove BannerImpressions streaming job and deps
- Add spark code for wikidata json dumps parsing
- Change format of data_quality_stats to parquet
- Update mediawiki-history dumper
- Enforce distinct revision in xml-dumps converter
- Add Spark/Scala module for time series anomaly detection
- Modify external webrequest search engine classification
- Correct MW XML dumps converter parameter parsing
- Fix WikidataArticlePlaceholderMetrics query
- Document JDK version requirement
- Add Spark job to update data quality table with incoming data
- Fix user agent for WDQS updater counter
- Update UA parser to add kaiOS
- Add query to track WDQS updater hitting Special:EntityData
- Make HDFSCleaner robust to external file deletions
- HDFSCleaner Improvements
- Upgrade Spark to 2.4.4
- Update HDFSCleaner logging
- Add HDFSCleaner to aid in cleaning HDFS tmp directories
- Update mediawiki-history-dumper (file names and future date events)
- Fix refine wikipedia.org eventlogging data
- Update subnet lists for IpUtil
- Update ua-parser dependency and related functions and tests
- Add mediawiki-history-dumper spark job
- Third party data should not get refined, fixing typo
- media info UDF now provide literal transcoding field
- Now refine infers hiveServerUrl from config, no --hive_server_url necessary.
- Making RefineMonitor error message more clear
- Adding UDF to get wiki project from referrer string, not used
- Add new mediatypes to media classification refinery code - T225911
Version skipped due to deployment problems
- Pageview Definition. Most special pages should not be counted - T226730
- EventSchemaLoader uses JsonParser for event data rather than YAMLParser - T227484
- EventSparkSchemaLoader now merges input JSONSchema with Hive schema before loading - T227088
- Added whitelist to eventlogging filtering of webhost domains so data from google translate apps is accepted - T227150
- Refactor mediawiki-page-history computation + fix
- Mediawiki-history: Handle dropping of user fields in labs views
- Update mediawiki_history checker to historical values
- Update pageview definition to exclude non wiki sites
- Add entropy UDAF to refinery-hive
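Shannon entropy over a discrete distribution is the core of such a UDAF; a minimal sketch (illustrative names, not the actual refinery-hive aggregator, which also has to manage Hive's partial-aggregation buffers):

```java
public class Entropy {
    // Shannon entropy (base 2) of a distribution given as raw counts.
    public static double entropy(long[] counts) {
        double total = 0;
        for (long c : counts) total += c;
        if (total <= 0) return 0.0;
        double h = 0.0;
        for (long c : counts) {
            if (c <= 0) continue;
            double p = c / total;
            h -= p * (Math.log(p) / Math.log(2));  // sum of -p * log2(p)
        }
        return h;
    }
}
```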
- Fix wrongly getting the yarn user name in DataFrameToHive
- Fix transform function for NULL values and for dataframes without the webHost column
- Update CirrusRequestDeser.java to use new schema of mediawiki/cirrussearch/request event
- Add refine transform function to filter our non-wiki hostnames
- Allow for plus signs in the article titles in the PageviewDefinition
- Reduce the size limit of user agent strings in the UAParser
- Fix javax.mail dependency conflict introduced by including json-schema-validator
- Improve CamusPartitionChecker error output
- Fix wikidata-coeditor job after MWH-refactor
- ClickstreamBuilder: Decode refferer url to utf-8
- Fix EventLoggingSchemaLoader to properly set useragent is_bot and is_mediawiki fields as booleans
- Fix EventLoggingSchemaLoader to not include deprecated timestamp in capsule schema
- RefineTarget - allow missing required fields when reading textual (e.g. JSON) data using JSONSchemas.
- Filter out 15.wikipedia.org and query.wikidata.org from pageview definition
- Fix mediawiki_page_history userId and anonymous
- Fix mediawiki_history_reduced checker
- Fix mediawiki-history user event join
- Add EventSparkSchemaLoader support to Refine
- Add jsonschema loader and spark converter classes
- Adapt EventLogging/WhiteListSanitization to new way of storing
- Add change_tags and revision_deleted_parts to mediawiki history
- Fix EventLogging schema URI to include format=json
- Reject invalid page titles from pageview dumps
- Correct names in mediawiki-history sql package
- Update mw user-history timestamps
- Fix mediawiki-history-checker after field renamed
- Fix null-timestamps in checker
- Fix mediawiki-user-history writing filter
- Update mediawiki-history user bot fields
-- skipped due to deployment complications https://phabricator.wikimedia.org/T221466 --
- Update big spark job settings following advice from https://towardsdatascience.com/how-does-facebook-tune-apache-spark-for-large-scale-workloads-3238ddda0830
- Update graphframes to 0.7.0 in refinery-spark
- Update mediawiki-history comment and actor joins
- Update mediawiki-history joining to new actor and comment tables
- Add --ignore_done_flag option to Refine
- Add wikitech to pageview definition
- HiveExtensions field name normalize now replaces bad SQL characters with "_", not just hyphens.
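The widened normalization can be sketched as a single regex replacement; the exact character classes and casing rules of HiveExtensions may differ, so treat this as an assumption-laden illustration:

```java
public class NormalizeName {
    // Lowercases and replaces any character that is not a lowercase letter,
    // digit, or underscore with '_' (not just hyphens).
    public static String normalize(String fieldName) {
        return fieldName.toLowerCase().replaceAll("[^a-z0-9_]", "_");
    }
}
```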
- Add new Cloud VPS ip addresses to network origin UDF
- Correct typo in refinery-core for Maxmind, getNetworkOrigin and IpUtil
- Allow for custom transforms in DataFrameToDruid
- Update hadoop, hive and spark dependency versions
- Fix field name casing bug in DataFrame .convertToSchema https://phabricator.wikimedia.org/T211833
- Use "SORT BY" instead of "ORDER BY" in mediawiki_history_checker job
- Correctly pass input_path_regex to Refine from EventLoggingSanitization
- HiveExtensions schema merge now better supports schema changes of complex Array element and Map value types. https://phabricator.wikimedia.org/T210465
- HiveExtensions findIncompatibleFields was unused and is removed.
- Upgrade profig lib to 2.3.3 after bug fix upstream
- Upgrade spark-avro to 4.0.0 to match new spark versions
- Update DataFrameToHive and PartitionedDataFrame to support dynamic partitioning and correct some bugs
- Add WebrequestSubsetPartitioner spark job actually launching a job partitioning webrequest using DataFrameToHive and a transform function
- Upgrade camus-wmf dependency to camus-wmf9
- Fix bug in EventLoggingToDruid, add time measures as dimensions
- Rename start_date and end_date to since and until in EventLoggingToDruid.scala
- Add spark job converting mediawiki XML-dumps to parquet
- Default value of hive_server_url updated in Refine.scala job
- Refactor EventLoggingToDruid to use whitelists and ConfigHelper
- Refine Config removes some potential dangerous defaults, forcing users to set them
- EventLoggingToDruid now can bucket time measures into ingestable dimensions
- Refine and EventloggingSanitization jobs now use ConfigHelper instead of scopt
- Add --table-whitelist flag to EventLoggingSanitization job
- Add ConfigHelper to assist in configuring scala jobs with properties files and CLI overrides
- RefineMonitor now uses ConfigHelper instead of scopt
- Add usability, advisory and strategy wikimedia sites to pageview definition
- Correct MediawikiHistoryChecker for reduced
- Update MediawikiHistoryChecker adding reduced
- Add MediawikiHistoryChecker spark job
- Update mediawiki-user-history empty-registration handling: drop user-events for users having no registration date (i.e. no edit activity nor registration date in DB)
- Correct mediawiki-history user registration date: use MIN(DB-registration-date, first-edit-date) instead of COALESCE(DB-registration-date, first-edit-date)
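The difference matters when a user's first edit predates the recorded DB registration: COALESCE keeps the (later) DB date, while MIN picks the earliest known date. A minimal sketch with hypothetical names, using nullable epoch timestamps:

```java
public class UserRegistration {
    // Effective registration: the earliest of the DB registration date and
    // the first-edit date; null only if both are missing.
    // COALESCE semantics would instead return dbRegistration whenever set,
    // even if the first edit happened earlier.
    public static Long effectiveRegistration(Long dbRegistration, Long firstEdit) {
        if (dbRegistration == null) return firstEdit;
        if (firstEdit == null) return dbRegistration;
        return Math.min(dbRegistration, firstEdit);
    }
}
```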
- Fix for WhitelistSanitization.scala, allowing null values for struct fields
- Fix for CamusPartitionChecker to only send email if errors are encountered
- Fix case insensitivity for MapMaskNodes in WhitelistSanitization
- Add ability to salt and hash to eventlogging sanitization
- Add --hive-server-url flag to Refine job
- CamusPartitionChecker can send error email reports and override Camus properties from System properties.
- Add foundation.wikimedia to pageviews
- Track number of editors from Wikipedia who also edit on Wikidata over time
- Update user-history job from username to userText
- Add inline comments to WhitelistSanitization
- Add a length limit to webrequest user-agent parsing
- Allow partial whitelisting of map fields in Whitelist sanitization
- Update mediawiki-history statistics for better names and more consistent probing
- Fix RefineTarget.inferInputFormat filtering out files starting with _
- Update regular expressions used to parse User Agent Strings
- Add PartitionedDataFrame to Spark refine job
- Fix bug when merging partition fields in WhitelistSanitization.scala
- Update pageview regex to accept more characters (previously restricted to 2)
- Make mediawiki-history statistics generation optional
- Modify output defaults for EventLoggingSanitization
- Correct default EL whitelist path in EventLoggingSanitization
- Correct mediawiki-history job bugs and add unittest
- Add defaults section to WhitelistSanitization
- Identify new search engines and refactor Referer parsing code
- Fix MediawikiHistory OOM issue in driver
- Update MediawikiHistory for another performance optimization
- Rename SparkSQLHiveExtensions to just HiveExtensions
- Include applicationId in Refine email failure report
- DataFrameToHive - Use df.take(1).isEmpty rather than exception
- RefineTarget - Use Hadoop FS to infer input format rather than Spark
- DataFrameToHive - Use DataFrame .write.parquet instead of .insertInto
- Correct wikidata-articleplaceholder job SQL RLIKE expression
- Fix sys.exit bug in Refine
- Fix LZ4 version bug with maven exclusion in refinery-spark and refinery-job
- Big refactor of scala and spark code:
  - Add refinery-spark module for spark oriented libs
  - Move non-spark dependent code to refinery-core
- Tweak Mediawiki-history job for performance (mostly partitioning)
- Update Mediawiki-history job to use accumulator to gather stats
- Add Hive JDBC connection to Refine for it to work with Spark 2
- Update spark code to use Spark 2.3.0
- Add new wikidata and pageview tags to webrequest
- Update Refine to use SQL-casting instead of row-conversion
- JsonRefine has been made data source agnostic, and now lives in a refine module in refinery-job. The Spark job is now just called 'Refine'.
- Add Whitelist Sanitization code and an EventLogging specific job using it
- Add some handling to Refine for cast-able types, e.g. String -> Long, if possible.
- Added RefineMonitor job to alert if Refine targets are not present.
- Refactor geo-coding function and add ISP
- Update camus part checker topic name normalization
- Update RefineTarget inputBasePath matches
- Add GetMediawikiTimestampUDF to refinery-hive
- Factor out RefineTarget from JsonRefine for use with other jobs
- Add configurable transform function to JSONRefine
- Fix JsonRefine so that it respects --until flag
- Clean refinery-job from BannerImpressionStream job
- Add core class and job to import EL hive tables to Druid
- Add new package refinery-job-spark-2.1
- Add spark-streaming job for banner-activity
- JsonRefine improvements:
  - Use _REFINE_FAILED flag to indicate previous failure, so we don't re-refine the same bad data over and over again
  - Don't fail the entire partition refinement if Spark can resolve (and possibly throw out) records with non-critical type changes. I.e. don't throw the entire hour away if just a couple records have floats instead of ints. See: https://phabricator.wikimedia.org/T182000
- something/something_latest fields change to something_historical/something
- UDF for extracting primary full-text search request
- Fix Clickstream job
- Change Cassandra loader to local quorum write
- Add Mediawiki API to RestbaseMetrics
- Fix mediawiki history reconstruction
- refinery-core now builds scala.
- Add JsonRefine job
- Correct field names in mediawiki-history spark job (time since previous revision)
- Add PhantomJS to the bot_flagging regex
- Correct mobile-apps-sessions spark job (filter out ts is null)
- Add Clickstream builder spark job to refinery-job
- Move GraphiteClient from refinery-core to refinery-job
- Correct bug in host normalization function - make new field be at the end of the struct
- Update host normalization function to return project_family in addition to project_class (with same value) in preparation to remove (at some point) the project_class field.
- Add webrequest tagging (UDF to tag requests) https://phabricator.wikimedia.org/T164021
- Tagger can return several tags (same task as above)
- Correct null pointer exception (same task as above)
- Add webrequest tagger for Wikidata Query Service https://phabricator.wikimedia.org/T169798
- Update mediawiki_history job with JDBC compliant timestamps and per-user and per-page new fields (revision-count and time-from-previous-revision)
- Removed unused and deprecated ClientIpUDF. See also https://phabricator.wikimedia.org/T118557
- Mark Legacy Pageview code as deprecated.
- Update tests and their dependencies to make them work on Mac and for any user.
- Add small cache to avoid repeating normalization in Webrequest.normalizeHost
- Refactor PageviewDefinition to add RedirectToPageviewUDF
- Add support for both cs and cz as Czech Wiki Abbreviations to StemmerUDF
- Remove is_productive and update time to revert from MediaWiki history denormalizer
- Add revision_seconds_to_identity_revert to MediaWiki history denormalizer
- Use hive query instead of parsing non existent sampled TSV files for guard settings
- Update mediawiki history jobs to overwrite result folders
- Add mediawiki history spark jobs to refinery-job
- Add spark job to aggregate historical projectviews
- Do not filter test[2].wikipedia.org from pageviews
- Upgrade hadoop, hive and spark version after CDH upgrade. Hadoop and hive just have very minor upgrades; spark has a more important one (from 1.5.0 to 1.6.0).
- Change the three spark jobs in refinery-job to have them working with the new installation (this new installation has a bug preventing using HiveContext in oozie).
- Update pageview definition to remove previews https://phabricator.wikimedia.org/T156628
- Add spark streaming job for banner impressions https://phabricator.wikimedia.org/T155141
- Add DSXS (self-identified bot) to bot regex https://phabricator.wikimedia.org/T157528
- Add comment to action=edit filter in pageview definition: https://phabricator.wikimedia.org/T156629
- Standardize UDF Naming: https://phabricator.wikimedia.org/T120131
- Lucene Stemmer UDF https://phabricator.wikimedia.org/T148811
- WikidataArticlePlaceholderMetrics also send search referral data https://phabricator.wikimedia.org/T142955
- Adding self-identified bot to bot regex https://phabricator.wikimedia.org/T150990
- Modify user agent regexes to identify iOS pageviews on PageviewDefinition https://phabricator.wikimedia.org/T148663
- Count pageviews for more wikis, https://phabricator.wikimedia.org/T130249
- Classify DuckDuckGo as a search engine
- Make camus partition checker continue checking other topics if it encounters errors
- Update maven jar building in refinery (refinery-core is not uber anymore)
- Create WikidataSpecialEntityDataMetrics
- Fix WikidataArticlePlaceholderMetrics class doc
- Correct WikidataArticlePlaceholderMetrics
- Add WikidataArticlePlaceholderMetrics
- Fixes Prefix API request detection
- Refactor pageview definition for mobile apps
- Remove IsAppPageview UDF
- Add pageview definition special case for iOS App
- Correct CqlRecordWriter in cassandra module
- Evaluate Pageview tagging only for apps requests
- Update mediawiki/event-schemas submodule to include information about search results in CirrusSearchRequestSet
- Drop support for message without rev id in avro decoders and make latestRev mandatory
- Upgrade to latest UA-Parser version
- Update mediawiki/event-schemas submodule to include 3dd6ee3 "Rename ApiRequest to ApiAction".
- Google Search Engine referer detection bug fix
- Upgrade camus-wmf dependency to camus-wmf7
- Requests that come tagged with pageview=1 in x-analytics header are considered pageviews
- Upgrade CDH dependencies to 5.5.2
- Implement the Wikimedia User Agent policy in setting agent type.
- Remove WikimediaBot tagging.
- Add ApiAction avro schema.
- Add functions for categorizing search queries.
- Update CamusPartitionChecker not to hard-fail on errors.
- Give CamusPartitionChecker the ability to rewind to the last N runs instead of just one.
- Update AppSession Metrics with explicit typing and sorting improvement
- Ensure that the schema_repo git submodules are available before packaging
- REALLY remove mobile partition use. This was reverted and never deployed in 0.0.25
- Add split-by-os argument to AppSessionMetrics job
- Change/remove mobile partition use
- Add Functions for identifying search engines as referers
- Update avro schemas to use event-schema repo as submodule
- Implement ArraySum UDF
- Clean refinery-camus from unnecessary avro files
- Add UDF that turns a country code into a name
- Expand the prohibited uri_paths in the Pageview definition
- Make maven include avro schema in refinery-camus jar
- Add a Hive UDF for network origin updating existing IP code
- Correct CamusPartitionChecker unit test
- Add refinery-cassandra module, containing the necessary code for loading separated value data from hadoop to cassandra
- Update refinery-camus adding support for avro messages
- Add CirrusSearchRequestSet avro schema to refinery camus
- Update webrequest with an LRUCache to prevent recomputing agentType for recurrent user agents values
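A bounded LRU cache in Java can be built on LinkedHashMap's access-order mode; this is a generic sketch of the pattern, not the cache class actually used in Webrequest:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AgentTypeCache {
    // LinkedHashMap with accessOrder=true keeps entries in least-recently-
    // accessed-first order; removeEldestEntry bounds the size.
    private final Map<String, String> cache;

    public AgentTypeCache(final int capacity) {
        this.cache = new LinkedHashMap<String, String>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity;  // evict when over capacity
            }
        };
    }

    public String get(String userAgent) { return cache.get(userAgent); }
    public void put(String userAgent, String agentType) { cache.put(userAgent, agentType); }
    public int size() { return cache.size(); }
}
```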
- Correct CamusPartitionChecker bug.
- Expand the prohibited URI paths in the pageview definition.
- Update CirrusSearchRequestSet avro schema.
- Update CamusPartitionChecker with more parameters.
- Add refinery-camus module
- Add Camus decoders and schema registry to import Mediawiki Avro Binary data into Hadoop
- Add camus helper functions and job that reads camus offset files to check if an import is finished or not.
- Update regexp filtering bots and rename Webrequest.isCrawler to Webrequest.isSpider for consistency.
- Update ua-parser dependency version to a more recent one.
- Update PageviewDefinition so that if x-analytics header includes tag preview the request should not be counted as pageview.
- Add scala GraphiteClient in core package.
- Add spark job computing restbase metrics and sending them to graphite in job package.
- Correct bug in PageviewDefinition removing arbcom-*.wikipedia.org
- Correct bug in PageviewDefinition removing outreach.wikimedia.org and donate.wikipedia.org as pageview hosts.
- Correct bug in page_title extraction, ensuring spaces are always converted into underscores.
- Correct bug in PageviewDefinition ensuring correct hosts only can be flagged as pageviews.
- Add Spark mobile_apps sessions statistics job.
- Fix bug in Webrequest.normalizeHost when uri_host is empty string.
- wmf_app_version field now in map returned by UAParserUDF.
- Added GetPageviewInfoUDF that returns a map of information about Pageviews as defined by PageviewDefinition.
- Added SearchRequestUDF for classifying search requests.
- Added HostNormalizerUDF to normalize uri_host fields to regular WMF URI formats. This also returns a map of normalized host info.
- Build against CDH 5.4.0 packages.
- Maven now builds non-uber jars by having hadoop and hive in provided scope. It also takes advantage of properties to propagate version numbers.
- Pageview class has a function to extract project from uri; bugs in handling mobile uris have been corrected.
- Referer classification now outputs a string instead of a map.
- Generic functions used in multiple classes now live in a single "utilities" class.
- Pageview and LegacyPageview have been renamed to PageviewDefinition and LegacyPageviewDefinition, respectively. These also should now use the singleton design pattern, rather than employing static methods everywhere.
- Rename isAppRequest to isAppPageview (since that's what it does) and expose it publicly in a new UDF.
- UAParser usage is now wrapped in a class in refinery-core.
- Stop counting edit attempts as pageviews
- Start counting www.wikidata.org hits
- Start counting www.mediawiki.org hits
- Consistently count search attempts
- Make custom file ending optional for thumbnails in MediaFileUrlParser
- Fail less hard for misrepresented urls in MediaFileUrlParser
- Ban dash from hex digits in MediaFileUrlParser
- Add basic guard framework
- Add guard for MediaFileUrlParser
- Add Referer classifier
- Add parser for media file urls
- Fix some NPEs around GeocodeDataUDF
- Add custom percent en-/decoders to ease URL normalization.
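A lenient percent decoder leaves malformed escapes untouched instead of throwing, which helps when normalizing messy webrequest URLs. The sketch below is ASCII-only (a real decoder must assemble UTF-8 byte sequences) and its names are illustrative, not the actual refinery-core API:

```java
public class PercentDecoder {
    // Decodes %XX escapes; malformed escapes pass through unchanged.
    public static String decode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char ch = s.charAt(i);
            if (ch == '%' && i + 2 < s.length()
                    && isHex(s.charAt(i + 1)) && isHex(s.charAt(i + 2))) {
                out.append((char) Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                out.append(ch);  // not a valid escape: keep as-is
                i++;
            }
        }
        return out.toString();
    }

    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }
}
```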
- Add IpUtil class and ClientIP UDF to extract request IP given IP address and X-Forwarded-For.
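A much-simplified version of the X-Forwarded-For logic: take the left-most entry as the original client, falling back to the connecting address. The real IpUtil additionally validates entries and walks past trusted proxies, so this is only a sketch with hypothetical names:

```java
public class ClientIp {
    // If X-Forwarded-For is present, the left-most entry is (by convention)
    // the original client; otherwise use the connecting IP.
    public static String clientIp(String remoteAddr, String xForwardedFor) {
        if (xForwardedFor == null || xForwardedFor.trim().isEmpty()) {
            return remoteAddr;
        }
        return xForwardedFor.split(",")[0].trim();
    }
}
```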
- For geocoding, allow to specify the MaxMind databases that should get used.
- Pageview definition counts 304s.
- refinery-core now contains a LegacyPageview class with which to classify legacy pageviews from webrequest data
- refinery-hive includes IsLegacyPageviewUDF to use legacy Pageview classification logic in hive. Also UDFs to get webrequest's access method, extract values from the X-Analytics header and determine whether or not the request came from a crawler, and geocoding UDFs got added.
- refinery-core now contains a Pageview class with which to classify pageviews from webrequest data
- refinery-hive includes IsPageviewUDF to use Pageview classification logic in hive