prinam/wrangler

 
 

Data Prep


A collection of libraries, a pipeline plugin, and a CDAP service for performing data cleansing, transformation, and filtering using a set of data manipulation instructions (directives). These instructions are either generated using an interactive visual tool or created manually.
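To make the idea of a set of directives concrete, the sketch below models a recipe as an ordered list of steps, each of which takes one record and returns a transformed record. It is a hypothetical illustration only: `RecipeSketch`, `Step`, `uppercase`, and `drop` are names made up for this example and are not part of the Data Prep API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical model of a recipe of directives; NOT the Data Prep API. */
final class RecipeSketch {

  /** A step takes one row (column name to value) and returns the transformed row. */
  interface Step extends UnaryOperator<Map<String, Object>> {}

  /** Illustrative step: upper-cases the string value of one column. */
  static Step uppercase(String column) {
    return row -> {
      Map<String, Object> out = new LinkedHashMap<>(row); // copy, so the input row is untouched
      Object value = out.get(column);
      if (value instanceof String) {
        out.put(column, ((String) value).toUpperCase(Locale.ROOT));
      }
      return out;
    };
  }

  /** Illustrative step: removes one column from the row. */
  static Step drop(String column) {
    return row -> {
      Map<String, Object> out = new LinkedHashMap<>(row);
      out.remove(column);
      return out;
    };
  }

  /** Applies the steps in order, feeding each step's output into the next. */
  static Map<String, Object> apply(List<Step> recipe, Map<String, Object> row) {
    Map<String, Object> current = row;
    for (Step step : recipe) {
      current = step.apply(current);
    }
    return current;
  }

  public static void main(String[] args) {
    Map<String, Object> row = new LinkedHashMap<>();
    row.put("name", "ada");
    row.put("tmp", 1);

    List<Step> recipe = new ArrayList<>();
    recipe.add(uppercase("name"));
    recipe.add(drop("tmp"));

    System.out.println(apply(recipe, row)); // prints {name=ADA}
  }
}
```

The order of the steps matters: each step sees the record as the previous step left it, which is why a recipe is modeled here as a list rather than a set.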

Data Prep defines a few concepts that are useful to know when you are just getting started with it. Learn about them here

The Data Prep Transform is separately documented.

Demo Videos and Recipes

More Videos here

Available Directives

These directives are currently available:

| Directive | Description |
| --- | --- |
| **Parsers** | |
| JSON Path | Uses a DSL (a JSON path expression) for parsing JSON records |
| Parse as AVRO | Parses an AVRO encoded message, either as binary or JSON |
| Parse as AVRO File | Parses an AVRO data file |
| Parse as CSV | Parses an input record as comma-separated values |
| Parse as Date | Parses dates using natural language processing |
| Parse as Excel | Parses an Excel file |
| Parse as Fixed Length | Parses as a fixed length record with specified widths |
| Parse as HL7 | Parses Health Level 7 Version 2 (HL7 V2) messages |
| Parse as JSON | Parses a JSON object |
| Parse as Log | Parses access log files such as those from Apache HTTPD and nginx servers |
| Parse as Protobuf | Parses a Protobuf encoded in-memory message using a descriptor |
| Parse as Simple Date | Parses date strings |
| Parse as XML | Parses an XML document |
| Parse XML To JSON | Parses an XML document into a JSON structure |
| XPath | Navigates the XML elements and attributes of an XML document |
| **Output Formatters** | |
| Write as CSV | Converts a record into CSV format |
| Write as JSON | Converts the record into a JSON map |
| Write JSON Object | Composes a JSON object based on the fields specified |
| **Transformations** | |
| Changing Case | Changes the case of column values |
| Cut Character | Selects parts of a string value |
| Set Column | Sets the column value to the result of an expression execution |
| Find and Replace | Transforms string column values using a "sed"-like expression |
| Index Split | (Deprecated) |
| Invoke HTTP | Invokes an HTTP Service (Experimental, potentially slow) |
| Quantization | Quantizes a column based on specified ranges |
| Regex Group Extractor | Extracts the data from a regex group into its own column |
| Setting Character Set | Sets the encoding and then converts the data to a UTF-8 String |
| Setting Record Delimiter | Sets the record delimiter |
| Split by Separator | Splits a column based on a separator into two columns |
| Split Email Address | Splits an email ID into an account and its domain |
| Split URL | Splits a URL into its constituents |
| Text Distance (Fuzzy String Match) | Measures the difference between two sequences of characters |
| Text Metric (Fuzzy String Match) | Measures the difference between two sequences of characters |
| URL Decode | Decodes from the application/x-www-form-urlencoded MIME format |
| URL Encode | Encodes to the application/x-www-form-urlencoded MIME format |
| Trim | Functions for trimming white spaces around string data |
| **Encoders and Decoders** | |
| Decode | Decodes a column value as one of base32, base64, or hex |
| Encode | Encodes a column value as one of base32, base64, or hex |
| **Unique ID** | |
| UUID Generation | Generates a universally unique identifier (UUID) |
| **Date Transformations** | |
| Diff Date | Calculates the difference between two dates |
| Format Date | Custom patterns for date-time formatting |
| Format Unix Timestamp | Formats a UNIX timestamp as a date |
| **Lookups** | |
| Catalog Lookup | Static catalog lookup of ICD-9, ICD-10-2016, ICD-10-2017 codes |
| Table Lookup | Performs lookups into Table datasets |
| **Hashing & Masking** | |
| Message Digest or Hash | Generates a message digest |
| Mask Number | Applies substitution masking on the column values |
| Mask Shuffle | Applies shuffle masking on the column values |
| **Row Operations** | |
| Filter Row if Matched | (Deprecated) |
| Filter Row if True | (Deprecated) |
| Filter Rows On | Filters records based on a condition |
| Flatten | Separates the elements in a repeated field |
| Fail on condition | Fails processing when the condition evaluates to true |
| Send to Error | Filters records to an error collector |
| Split to Rows | Splits based on a separator into multiple records |
| **Column Operations** | |
| Change Column Case | Changes column names to either lowercase or uppercase |
| Changing Case | Changes the case of column values |
| Cleanse Column Names | Sanitizes column names, following specific rules |
| Columns Replace | Alters column names in bulk |
| Copy | Copies values from a source column into a destination column |
| Drop Column | Drops a column in a record |
| Fill Null or Empty Columns | Fills column value with a fixed value if null or empty |
| Keep Columns | Keeps specified columns from the record |
| Merge Columns | Merges two columns by inserting a third column |
| Rename Column | Renames an existing column in the record |
| Set Column Names | Sets the names of columns, in the order they are specified |
| Split to Columns | Splits a column based on a separator into multiple columns |
| Swap Columns | Swaps column names of two columns |
| Set Column Data Type | Converts the data type of a column |
| **NLP** | |
| Stemming Tokenized Words | Applies the Porter stemmer algorithm for English words |
| **Transient Aggregators & Setters** | |
| Increment Variable | Increments a transient variable with a record of processing |
| Set Variable | Sets a transient variable with a record of processing |
| **Functions** | |
| Data Quality | Data quality check functions. Checks for date, time, etc. |
| Date Manipulations | Functions that can manipulate date |
| DDL | Functions that can manipulate definition of data |
| JSON | Functions that can be useful in transforming your data |
| Types | Functions for detecting the type of data |
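As a rough guide to what a few of the listed directives do to a single value, the sketch below reproduces the described behavior of Encode/Decode (base64 only), URL Encode/Decode, Message Digest, and Split Email Address using only standard JDK classes. It is not the Data Prep implementation. The class name, method names, the choice of SHA-256, and the handling of an address without `@` are assumptions made for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

/** JDK-only illustration of a few directive semantics; NOT the Data Prep code. */
final class DirectiveSemanticsSketch {

  /** Encode (base64 variant): text to base64, using UTF-8 bytes. */
  static String base64Encode(String value) {
    return Base64.getEncoder().encodeToString(value.getBytes(StandardCharsets.UTF_8));
  }

  /** Decode (base64 variant): base64 back to text. */
  static String base64Decode(String encoded) {
    return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
  }

  /** URL Encode: encodes to the application/x-www-form-urlencoded format. */
  static String urlEncode(String value) {
    try {
      return URLEncoder.encode(value, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 is always supported", e);
    }
  }

  /** URL Decode: decodes from the application/x-www-form-urlencoded format. */
  static String urlDecode(String value) {
    try {
      return URLDecoder.decode(value, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 is always supported", e);
    }
  }

  /** Message Digest: SHA-256 is this sketch's choice of algorithm; result is lowercase hex. */
  static String sha256Hex(String value) {
    try {
      byte[] digest = MessageDigest.getInstance("SHA-256")
          .digest(value.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-256 is required on every Java platform", e);
    }
  }

  /** Split Email Address: returns {account, domain}; rejecting input without '@' is an assumption. */
  static String[] splitEmail(String email) {
    int at = email.lastIndexOf('@');
    if (at < 0) {
      throw new IllegalArgumentException("not an email address: " + email);
    }
    return new String[] {email.substring(0, at), email.substring(at + 1)};
  }
}
```

For example, `base64Encode("hello")` returns `aGVsbG8=`, `urlEncode("a b&c")` returns `a+b%26c`, and `splitEmail("root@example.com")` returns `root` and `example.com`.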

Restricting and Aliasing

CDAP administrators can restrict the directives that are accessible to their users and define aliases for directives. More information on configuring this capability can be found here
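One way to picture restricting and aliasing is as a lookup that first maps an alias to the directive it stands for and then rejects the request if that directive has been excluded. The sketch below is a hypothetical model, not the actual CDAP configuration mechanism: the class `DirectiveAccessSketch`, its methods, the directive names used in the usage note, and the rule that an excluded directive is rejected under any name are all assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of directive restriction and aliasing; NOT the CDAP implementation. */
final class DirectiveAccessSketch {
  private final Set<String> excluded = new HashSet<>();
  private final Map<String, String> aliases = new HashMap<>();

  /** Marks a directive as not accessible to users. */
  DirectiveAccessSketch exclude(String directive) {
    excluded.add(directive);
    return this;
  }

  /** Registers an alternative name for an existing directive. */
  DirectiveAccessSketch alias(String alias, String directive) {
    aliases.put(alias, directive);
    return this;
  }

  /**
   * Resolves a user-supplied name to the underlying directive name.
   * Assumption of this sketch: exclusion is checked after alias resolution,
   * so an excluded directive cannot be reached through an alias.
   */
  String resolve(String name) {
    String target = aliases.getOrDefault(name, name);
    if (excluded.contains(target)) {
      throw new IllegalArgumentException("directive '" + target + "' is not available");
    }
    return target;
  }
}
```

With `new DirectiveAccessSketch().exclude("parse-as-csv").alias("json-parser", "parse-as-json")`, calling `resolve("json-parser")` returns `parse-as-json`, while `resolve("parse-as-csv")` throws an `IllegalArgumentException`.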

Performance

Initial performance tests show that, with a set of directives of medium complexity for transforming data, Data Prep processes about 60K records per second. The rates below are specified as records per second. Additional details and test results are available.

| Directive Complexity | Column Count | Records | Size | Mean Rate | 1 Minute Rate | 5 Minute Rate | 15 Minute Rate |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Medium | 18 | 13,499,973 | 4,499,534,313 | 64,998.50 | 64,921.29 | 46,866.70 | 36,149.86 |
| Medium | 18 | 80,999,838 | 26,997,205,878 | 62,465.93 | 62,706.39 | 60,755.41 | 56,673.32 |
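The figures above can be turned into a per-record size and an approximate data rate with simple arithmetic. The sketch below does this under one loud assumption: the source does not state the unit of the Size column, and the code treats it as bytes. The class and method names are made up for this example.

```java
/** Back-of-the-envelope arithmetic on the performance table; assumes Size is in bytes. */
final class ThroughputSketch {

  /** Average record size: total size divided by the number of records. */
  static double bytesPerRecord(long records, long sizeBytes) {
    return (double) sizeBytes / records;
  }

  /** Approximate data rate: average record size times records per second. */
  static double bytesPerSecond(long records, long sizeBytes, double recordsPerSecond) {
    return bytesPerRecord(records, sizeBytes) * recordsPerSecond;
  }

  public static void main(String[] args) {
    // First row of the table: 13,499,973 records, size 4,499,534,313, mean rate 64,998.50.
    long records = 13_499_973L;
    long size = 4_499_534_313L;
    double meanRate = 64_998.50;

    System.out.printf("bytes/record: %.1f%n", bytesPerRecord(records, size));
    System.out.printf("MB/s (10^6 bytes): %.2f%n", bytesPerSecond(records, size, meanRate) / 1e6);
  }
}
```

Under that assumption, both rows work out to roughly 333 bytes per record, and the first row's mean rate corresponds to a little under 22 MB per second.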

Contact

Mailing Lists

CDAP User Group and Development Discussions:

The cdap-user mailing list is primarily for users who use the product to develop applications or build plugins for applications. You can expect questions from users, release announcements, and any other discussions that we think will be helpful to the users.

IRC Channel

CDAP IRC Channel: #cdap on irc.freenode.net

Slack Team

CDAP Users on Slack: cdap-users team

License and Trademarks

Copyright © 2016-2017 Cask Data, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Cask is a trademark of Cask Data, Inc. All rights reserved.

Apache, Apache HBase, and HBase are trademarks of The Apache Software Foundation. Used with permission. No endorsement by The Apache Software Foundation is implied by the use of these marks.

About

Wrangler Transform: A DMD system for transforming Big Data
