A collection of libraries, a pipeline plugin, and a CDAP service for performing data cleansing, transformation, and filtering using a set of data manipulation instructions (directives). These instructions are either generated using an interactive visual tool or are manually created.
Data Prep defines a few concepts that are useful to know if you are just getting started with it. Learn about them here.
The Data Prep Transform is separately documented.
More Videos here
Videos
- [SCREENCAST] Creating Lookup Dataset and Joining
- [SCREENCAST] Restricted Directives
- [SCREENCAST] Parse Excel files in CDAP
- [SCREENCAST] Parse File As AVRO File
- [SCREENCAST] Parsing Binary Coded AVRO Messages
- [SCREENCAST] Parsing Binary Coded AVRO Messages & Protobuf messages using schema registry
- [SCREENCAST] Quantize a column - Digitize
- [SCREENCAST] Data Cleansing capability with send-to-error directive
- [SCREENCAST] Building Data Prep from the GitHub source
- [VOICE-OVER] End-to-End Demo Video
- [SCREENCAST] Ingesting into Kudu
- [SCREENCAST] Realtime HL7 CCDA XML from Kafka into Time Partitioned Parquet
- [SCREENCAST] Parsing JSON file
- [SCREENCAST] Flattening arrays
- [SCREENCAST] Data cleansing with send-to-error directive
- [SCREENCAST] Publishing to Kafka
- [SCREENCAST] Fixed length to JSON
Recipes
These directives are currently available:
Directive | Description |
---|---|
Parsers | |
JSON Path | Uses a DSL (a JSON path expression) for parsing JSON records |
Parse as AVRO | Parses an AVRO-encoded message, either as binary or as JSON |
Parse as AVRO File | Parses an AVRO data file |
Parse as CSV | Parsing an input record as comma-separated values |
Parse as Date | Parsing dates using natural language processing |
Parse as Excel | Parses an Excel file |
Parse as Fixed Length | Parses as a fixed length record with specified widths |
Parse as HL7 | Parsing Health Level 7 Version 2 (HL7 V2) messages |
Parse as JSON | Parsing a JSON object |
Parse as Log | Parses access log files as from Apache HTTPD and nginx servers |
Parse as Protobuf | Parses a Protobuf-encoded in-memory message using a descriptor |
Parse as Simple Date | Parses date strings |
Parse as XML | Parses an XML document |
Parse XML To JSON | Parses an XML document into a JSON structure |
XPath | Navigate the XML elements and attributes of an XML document |
Output Formatters | |
Write as CSV | Converts a record into CSV format |
Write as JSON | Converts the record into a JSON map |
Write JSON Object | Composes a JSON object based on the fields specified. |
Transformations | |
Changing Case | Changes the case of column values |
Cut Character | Selects parts of a string value |
Set Column | Sets the column value to the result of an expression execution |
Find and Replace | Transforms string column values using a "sed"-like expression |
Index Split | (Deprecated) |
Invoke HTTP | Invokes an HTTP Service (Experimental, potentially slow) |
Quantization | Quantizes a column based on specified ranges |
Regex Group Extractor | Extracts the data from a regex group into its own column |
Setting Character Set | Sets the encoding and then converts the data to a UTF-8 String |
Setting Record Delimiter | Sets the record delimiter |
Split by Separator | Splits a column based on a separator into two columns |
Split Email Address | Splits an email ID into an account and its domain |
Split URL | Splits a URL into its constituents |
Text Distance (Fuzzy String Match) | Measures the difference between two sequences of characters |
Text Metric (Fuzzy String Match) | Measures the difference between two sequences of characters |
URL Decode | Decodes from the application/x-www-form-urlencoded MIME format |
URL Encode | Encodes to the application/x-www-form-urlencoded MIME format |
Trim | Functions for trimming white spaces around string data |
Encoders and Decoders | |
Decode | Decodes a column value as one of base32 , base64 , or hex |
Encode | Encodes a column value as one of base32 , base64 , or hex |
Unique ID | |
UUID Generation | Generates a universally unique identifier (UUID) |
Date Transformations | |
Diff Date | Calculates the difference between two dates |
Format Date | Custom patterns for date-time formatting |
Format Unix Timestamp | Formats a UNIX timestamp as a date |
Lookups | |
Catalog Lookup | Static catalog lookup of ICD-9, ICD-10-2016, ICD-10-2017 codes |
Table Lookup | Performs lookups into Table datasets |
Hashing & Masking | |
Message Digest or Hash | Generates a message digest |
Mask Number | Applies substitution masking on the column values |
Mask Shuffle | Applies shuffle masking on the column values |
Row Operations | |
Filter Row if Matched | (Deprecated) |
Filter Row if True | (Deprecated) |
Filter Rows On | Filters records based on a condition |
Flatten | Separates the elements in a repeated field |
Fail on condition | Fails processing when the condition is evaluated to true. |
Send to Error | Filtering of records to an error collector |
Split to Rows | Splits based on a separator into multiple records |
Column Operations | |
Change Column Case | Changes column names to either lowercase or uppercase |
Changing Case | Changes the case of column values |
Cleanse Column Names | Sanitizes column names, following specific rules |
Columns Replace | Alters column names in bulk |
Copy | Copies values from a source column into a destination column |
Drop Column | Drops a column in a record |
Fill Null or Empty Columns | Fills column value with a fixed value if null or empty |
Keep Columns | Keeps specified columns from the record |
Merge Columns | Merges two columns by inserting a third column |
Rename Column | Renames an existing column in the record |
Set Column Names | Sets the names of columns, in the order they are specified |
Split to Columns | Splits a column based on a separator into multiple columns |
Swap Columns | Swaps column names of two columns |
Set Column Data Type | Converts the data type of a column |
NLP | |
Stemming Tokenized Words | Applies the Porter stemmer algorithm for English words |
Transient Aggregators & Setters | |
Increment Variable | Increments a transient variable with a record of processing. |
Set Variable | Sets a transient variable with a record of processing. |
Functions | |
Data Quality | Data quality check functions. Checks for date, time, etc. |
Date Manipulations | Functions that can manipulate date |
DDL | Functions that can manipulate definition of data |
JSON | Functions that can be useful in transforming your data |
Types | Functions for detecting the type of data |
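To make the idea of a recipe concrete, the following is a minimal Python sketch of how an ordered list of steps could transform a single record. It is only an illustration of the behavior described for Split Email Address, Fill Null or Empty Columns, and Drop Column in the table above. Every name in it (`split_email`, `fill_null_or_empty`, `drop_column`, `apply_recipe`, and the `_account` / `_domain` column suffixes) is hypothetical; it is not the Data Prep API or its directive syntax.

```python
# Illustrative only: plain-Python stand-ins for three entries in the table.
# None of these names are part of Data Prep; the output column suffixes
# are assumptions made for this sketch.

def split_email(record, column):
    """Split an email address into an account part and a domain part."""
    out = dict(record)
    account, _, domain = str(record[column]).partition("@")
    out[column + "_account"] = account
    out[column + "_domain"] = domain
    return out


def fill_null_or_empty(record, column, value):
    """Replace a missing, None, or empty-string value with a fixed value."""
    out = dict(record)
    if out.get(column) is None or out.get(column) == "":
        out[column] = value
    return out


def drop_column(record, column):
    """Return the record without the named column."""
    return {k: v for k, v in record.items() if k != column}


def apply_recipe(record, steps):
    """Apply each (function, arguments) step to the record, in order."""
    for step, args in steps:
        record = step(record, *args)
    return record


recipe = [
    (split_email, ("email",)),
    (fill_null_or_empty, ("city", "N/A")),
    (drop_column, ("email",)),
]

result = apply_recipe({"email": "joe@example.com", "city": ""}, recipe)
print(result)
```

The steps run strictly in order, so a later step can rely on columns that an earlier step produced or cleaned up.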
A new capability that allows CDAP Administrators to restrict the directives that are accessible to their users. More information on configuring can be found here
Initial performance tests show that, with a set of directives of medium complexity for transforming data, Data Prep is able to process about 60K records per second. The rates below are specified as records/second. Additional details and test results are available.
Directive Complexity | Column Count | Records | Size | Mean Rate | 1 Minute Rate | 5 Minute Rate | 15 Minute Rate |
---|---|---|---|---|---|---|---|
Medium | 18 | 13,499,973 | 4,499,534,313 | 64,998.50 | 64,921.29 | 46,866.70 | 36,149.86 |
Medium | 18 | 80,999,838 | 26,997,205,878 | 62,465.93 | 62,706.39 | 60,755.41 | 56,673.32 |
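As a rough cross-check of the table, the sketch below derives two figures that the table does not state: an approximate elapsed time per run and the average size per record. It assumes that the Mean Rate column is an average over the whole run, and it reports size in whatever unit the Size column uses, which the table does not specify. The variable and function names are illustrative.

```python
# Values copied from the two rows of the table above.
runs = [
    {"records": 13_499_973, "size": 4_499_534_313, "mean_rate": 64_998.50},
    {"records": 80_999_838, "size": 26_997_205_878, "mean_rate": 62_465.93},
]


def derive(run):
    """Return (approximate elapsed seconds, average size per record)."""
    # Assumption: the mean rate is records/second averaged over the full run.
    elapsed_s = run["records"] / run["mean_rate"]
    size_per_record = run["size"] / run["records"]
    return elapsed_s, size_per_record


derived = [derive(run) for run in runs]

for run, (elapsed_s, size_per_record) in zip(runs, derived):
    print(
        f"{run['records']:,} records: "
        f"about {elapsed_s / 60:.1f} minutes, "
        f"about {size_per_record:.1f} size units per record"
    )
```

Under these assumptions the first run lasts roughly three and a half minutes, which is shorter than the 5 and 15 minute averaging windows.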
CDAP User Group and Development Discussions:
The cdap-user mailing list is primarily for users who use the product to develop applications or build plugins for applications. You can expect questions from users, release announcements, and any other discussions that we think will be helpful to the users.
CDAP IRC Channel: #cdap on irc.freenode.net
CDAP Users on Slack: cdap-users team
Copyright © 2016-2017 Cask Data, Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Cask is a trademark of Cask Data, Inc. All rights reserved.
Apache, Apache HBase, and HBase are trademarks of The Apache Software Foundation. Used with permission. No endorsement by The Apache Software Foundation is implied by the use of these marks.