Data Cleaning Operations Using SQL and ETL DataFlows
Intro
This topic lists a number of data clean-up operations available through SQL and ETL DataFlows and shows how data will look both before and after using an operation.
Changing Your Data Type
Example: Changing the data type of an ID number from a numeric to a text field
In MySQL... | In ETL... | Before | After |
---|---|---|---|
CAST(`id` AS CHAR) AS `id_cast_datatype` |
Use the "Set Column Type" action | 1 | 1 (looks the same but behaves a text dimension) |
Concatenating Columns to Create a Compound Field
Example: Concatenating "First Name" and "Last Name" columns to create a "Full Name" column
In MySQL... | In ETL... | Before | After |
---|---|---|---|
CONCAT(`first_name`, ' ', `last_name`) AS `full_name` |
Use the "Combine Columns" action | ‘John’ | ‘Smith’ (first and last name in two separate columns) | ‘John Smith’ (a single column containing the complete name) |
Extracting a Portion of a Text String
Example: Extracting the first part of an email address to use as a user ID
In MySQL... | In ETL... | Before | After |
---|---|---|---|
SUBSTRING_INDEX(`email`,'@', 1) AS `user_id` |
Use the "Replace Text" action on that column to specify which portion of the string should be replaced with an empty string: @.+ | userid@email.com | userid |
Reformatting a Date
Example: Formatting a non-standard date string into a date type format
In MySQL... | In ETL... | Before | After |
---|---|---|---|
STR_TO_DATE(`send_date`, '%d.%m.%Y') AS date_formatted |
Use the "Set Column Type" action. | 23.01.2017 (string data type) | 01/23/2017 (date data type) |
Deriving Date Attributes from a Date Column
Example: Extracting the day of the week from a date column
In MySQL... | In ETL... | Before | After |
---|---|---|---|
DAYNAME(`receive_date`) AS receive_date_name |
Use the "Date Operations" action. | 01/23/2017 | Monday |
Splitting a Column into Two Columns Based on a Character in the Column
Example: Dividing a "Status Code" column into status code parts based on the / delimiter found within the column
In MySQL... | In ETL... | Before | After |
---|---|---|---|
SUBSTRING_INDEX(`status_code`, '/', 1) AS status_code_p1 |
Use the "Replace Text" action on that column to specify which portion of the string should be replaced with an empty string: \/.+ for the first part and .+\/ for the second. | SHI/DELV | SHI | DELV |
Trimming Erroneous Spaces from a Column
Example: Trimming the leading and trailing spaces from the "Department" column
In MySQL... | In ETL... | Before | After |
---|---|---|---|
TRIM(`department`) AS department_trimmed |
Use a regular expression within the "Replace Text" action to pinpoint the leading and training spaces and replace them with nothing. | ' department ' | 'department' |
Changing the Case of an Entire Column
Example: Changing the "Category" column to uppercase letters
In MySQL... | In ETL... | Before | After |
---|---|---|---|
UPPER(`category`) AS category_change_case |
Use the All upper case option within the "Text Formatting" action. | health | HEALTH |
Capitalizing the First Letter of a Column
Example: Capitalizing the first letter of the first word in the "Category" column
In MySQL... | In ETL... | Before | After |
---|---|---|---|
CONCAT(UPPER(LEFT(`category`, 1)), SUBSTRING(`category`, 2, LENGTH(`category`))) AS category_cap_first |
Use the Capitalize first letter option within the "Text Formatting" action. | health | Health |
Categorizing Rows Based off the Value in a Specific Column
Example: Assigning a region to each row based on a store number
In MySQL... | In ETL... | Before | After |
---|---|---|---|
(CASE WHEN `store_number` IN ('100', '101', '104', '109') THEN 'region_1' |
Use the Filter Rows action to separate into distinct groups based on store number, use Add Constants to add a category to each group, then use Append Rows to stitch the rows back together. | store_number = 100 | store_number = 100 | region = ‘region_1’ |
Categorizing Rows Based off the Value in a Specific Column
Example: Assigning a category to the animal name based on the first letter of the name
In MySQL... | In ETL... | Before | After |
---|---|---|---|
(CASE WHEN `animal_names` LIKE 'H%' THEN 'Hs' |
Use a regular expression within the "Replace Text" action to identify the patterns and assign a category. For example ^H.+ finds names starting with H, ^B.+ finds names starting with B, ^A.+|^E.+ finds names starting with either A or E, and ^[^H,B,A,E].+ finds names that do not start with H, B, A, or E. | animal_names = ‘Horse’ | animal_names = ‘Horse’ | animal_names_categories = ‘Hs’ |
Categorizing Rows Based off a Numeric Value Threshold
Example: Categorizing rows based on salary amount
In MySQL... | In ETL... | Before | After |
---|---|---|---|
(CASE WHEN `salary` < 50000 THEN '<$50,000' |
Use the "Filter Rows" action to separate into distinct groups based on store number, use "Add Constants" to add a category to each group, then use "Append Rows" to stitch the rows back together. | salary = $45,000 | salary = $45,000 | salary_bucket = ‘<$50,000’ |