MAFS6010R - Portfolio Optimization with R
MSc in Financial Mathematics
The Hong Kong University of Science and Technology (HKUST)
Fall 2019-20

Outline

  • What Is Data Cleaning?

  • Data Wrangling (aka Data Munging)

  • Outlier Detection

    • Statistical and likelihood approaches
    • Depth-based, deviation-based, distance-based, density-based approaches
    • High-dimensional approaches
  • Imputation of Missing Values

  • Data Cleaning of Financial Time Series

    • Robust model fitting
    • Outlier detection
    • Imputation of missing values

What Is Data Cleaning?

What is data cleaning?

  • In the field known as “big data”, far too much handcrafted work —what data scientists call “data wrangling,” “data munging,” and “data janitor work”— is still required.

  • Data cleaning is one of those things that everyone does but no one really talks about. Sure, it’s not the “sexiest” part of machine learning.

  • Data scientists, according to interviews and expert estimates, spend 80% of their time cleaning and manipulating data and only 20% of their time actually analyzing it.

  • Proper data cleaning can make or break your project. Professional data scientists usually spend a very large portion of their time on this step. Why? Because of a simple truth in machine learning: “better data beats fancier algorithms.”

  • In other words: “garbage in, garbage out.”

  • Data experts try to automate as many steps in the process as possible. “But practically, because of the diversity of data, you spend a lot of your time being a data janitor, before you can get to the cool, sexy things that got you into the field in the first place.”

  • Several start-ups are trying to break through these big data bottlenecks by developing software to automate the gathering, cleaning and organizing of disparate data, which is plentiful but messy.

    📰 For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. The New York Times. 2014.

Data cleaning

Some steps in data cleaning:

  • Remove unwanted observations: duplicate observations (mostly arising during data collection combining datasets from multiple places/clients/departments, scraping data, etc.) and irrelevant observations (those that don’t actually fit the specific problem that you’re trying to solve.)

  • Fix structural errors: these arise during measurement, data transfer, or other types of “poor housekeeping.” For instance, you can check for typos or inconsistent capitalization. This is mostly a concern for categorical features, and you can look at bar plots to check. Also, check for mislabeled classes, e.g., “NA”, “N/A”, “Not Applicable”.

Data cleaning

  • Filter unwanted outliers: Outliers can cause problems with certain types of models. For example, linear regression models are less robust to outliers than decision tree models.
    • In general, if you have a legitimate reason to remove an outlier, it will help your model performance.
    • However, outliers are innocent until proven guilty. You should never remove an outlier just because it’s a “big number.” That big number could be very informative for your model.
  • Handle missing data: Missing data is a deceptively tricky issue in applied machine learning.
    • First, just to be clear, you cannot simply ignore missing values in your dataset. You must handle them in some way for the very practical reason that most algorithms do not accept missing values.
    • “Common sense” is not sensible here: some commonly recommended methods are actually dangerous, such as dropping observations that have missing values or imputing the missing values based on other observations.
    • When you drop observations, you drop information. The fact that the value was missing may be informative in itself. Plus, in the real world, you often need to make predictions on new data even if some of the features are missing!
    • If possible, design an algorithm that can accept missing values. Otherwise, you may use some imputation method to fill in the values.

Data Wrangling (aka Data Munging)

Data wrangling (aka data munging)

  • What are some common things you like to do with your data? Maybe remove rows or columns, do calculations and maybe add new columns? This is called data wrangling or data munging.

  • In R there are two alternatives:

    • The tidyverse is a collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

    • The package data.table is designed to conveniently manipulate huge amounts of data. It is much faster than the tidyverse in R and than pandas in Python (see benchmark).

  • Some good books include:

    📘 Garrett Grolemund and Hadley Wickham, R for Data Science. O’Reilly Media. 2017.

    📘 The Ocean Health Index Team. Introduction to Open Data Science. 2019.

Data wrangling: Mundane tasks

  • Let’s consider in more detail one of the most mundane aspects of data cleaning: unwanted observations and structural errors.

  • The following errors are common across many different types of datasets:

    • Duplications: where one row contains identical/similar information to another row.
    • ‘NA’ misclassifications: where empty values are misclassified as known values or vice-versa.
    • Erroneous answers: where incorrect values are entered, either accidentally or deliberately (a common effect of compulsory questions that cannot be answered).
    • White space and alphabetical case errors: where values appear to mean the same thing but are classified differently due to white space or alphabetical case.
    • Spelling mistakes, typing errors, or special character errors: where values appear to mean the same thing but are classified differently.
    • Inconsistent time formats: where dates/times have been inputted in several different time formats, making it impossible to make time-based calculations.
    • Data merging ambiguity: where information that could be recorded in different columns is merged into one column, making the information difficult to interpret.
    • Data recording ambiguity: where information is recorded in different ways (e.g. giving a range of values rather than a single value).

R session: Cleaning dirty data with the tidyverse

Let’s consider the kitchen data that contains info on people sharing a kitchen:

  • their name
  • the date when they last cleaned the kitchen
  • whether the dishwasher was full at the time
  • the number of mugs they found on the side
  • whether they wiped the sides or sink
# Import the libraries needed
library(tidyverse)
library(lubridate)  # to manipulate dates

# Read in the csv with the "read.csv" function from base R
kitchen_data <- read.csv("kitchen_data.csv")
We can see that the data in the table is quite inconsistent, messy, and full of issues:
date_cleaned | name | sides_and_sink | dishwasher_full | no_of_mugs | notes
16/7/2018 | Charlotte | sides and sink | no | 16 | new cleaning schedule begins today!
17/7/2018 | David E | | | 2-Jun |
18/7/2018 | Hannah | sides | yes | 15(ish) | all fine
19 07 18 | Dave | sink, side | Yes! | 18 | so many mugs
20/7/2018 | Charlie | didn’t need doing | YES | 12- 14 (I think!) | can’t remember very well
23/7/2018 | Davida | N/A | NA | 0 | All of us were at conference but I still had to put 0 in form!
24/7/2018 | David A | both | No | Apr-14 |
25 7 2018 | Hanson | just the side | NO | 3 | not v messy
260718 | Davie | sink only | yes | 6 |
260718 | Davie | sink only | yes | 6 |


Let’s fix the issues one by one.

Removing NA misclassifications and white space: First of all, it looks like some ‘NAs’ (blank spaces, ‘N/A’, ‘NA’) were not recognised when the data was read in. Also, we suspect there might be some leading and trailing white space because there were free text boxes in the survey.

# Read in the csv with the alternative "read_csv" function from tidyverse (or readr)
#   and the following options to remove NA misclassifications and white space
kitchen_data <- read_csv("kitchen_data.csv", na = c("", "N/A", "NA"), trim_ws = TRUE)
R>> Parsed with column specification:
R>> cols(
R>>   date_cleaned = col_character(),
R>>   name = col_character(),
R>>   sides_and_sink = col_character(),
R>>   dishwasher_full = col_character(),
R>>   no_of_mugs = col_character(),
R>>   notes = col_character()
R>> )
date_cleaned | name | sides_and_sink | dishwasher_full | no_of_mugs | notes
16/7/2018 | Charlotte | sides and sink | no | 16 | new cleaning schedule begins today!
17/7/2018 | David E | NA | NA | 2-Jun | NA
18/7/2018 | Hannah | sides | yes | 15(ish) | all fine
19 07 18 | Dave | sink, side | Yes! | 18 | so many mugs
20/7/2018 | Charlie | didn’t need doing | YES | 12- 14 (I think!) | can’t remember very well
23/7/2018 | Davida | NA | NA | 0 | All of us were at conference but I still had to put 0 in form!
24/7/2018 | David A | both | No | Apr-14 | NA
25 7 2018 | Hanson | just the side | NO | 3 | not v messy
260718 | Davie | sink only | yes | 6 | NA
260718 | Davie | sink only | yes | 6 | NA


Removing duplicate data entries: We can also see the final row is duplicated, where “Davie” accidentally copied and pasted a row!

# The %>% function pipes (transfers along) the data to the next function
# The "distinct" function removes duplicated rows
kitchen_data <- kitchen_data %>%
  distinct()
# this is equivalent to: kitchen_data <- distinct(kitchen_data)

Re-classifying dates in different time formats: The ‘date_cleaned’ column was recorded in a free text box so is classed as a character string rather than a date and has been inputted in lots of different time formats, making date-based calculations impossible.

# The "mutate" function is used to make new columns
# The "parse_date_time" function converts dates written in different formats as strings into a 
#   UTC format. The "dmy" option tells R that the day comes first, month second, and year last.
kitchen_data <- kitchen_data %>%
      mutate(date_cleaned = parse_date_time(date_cleaned, "dmy"))
date_cleaned | name | sides_and_sink | dishwasher_full | no_of_mugs | notes
2018-07-16 | Charlotte | sides and sink | no | 16 | new cleaning schedule begins today!
2018-07-17 | David E | NA | NA | 2-Jun | NA
2018-07-18 | Hannah | sides | yes | 15(ish) | all fine
2018-07-19 | Dave | sink, side | Yes! | 18 | so many mugs
2018-07-20 | Charlie | didn’t need doing | YES | 12- 14 (I think!) | can’t remember very well
2018-07-23 | Davida | NA | NA | 0 | All of us were at conference but I still had to put 0 in form!
2018-07-24 | David A | both | No | Apr-14 | NA
2018-07-25 | Hanson | just the side | NO | 3 | not v messy
2018-07-26 | Davie | sink only | yes | 6 | NA


Separating information that has been merged together in one column: The ‘sides_and_sink’ column is difficult to interpret because it contains information about whether both the sides and the sink were cleaned, which would be easier to analyze if it was in two separate columns.

# The "case_when" function allows to vectorize if-else
# The "str_detect" function allows chosen words to be detected within a string
# The "is.na" function identifies if a value is NA
kitchen_data <- kitchen_data %>%
      mutate(clean_sides = case_when(str_detect(sides_and_sink, c("side|both")) == TRUE ~ TRUE,
                                     is.na(sides_and_sink) == TRUE ~ NA,
                                     TRUE ~ FALSE),
             clean_sink = case_when(str_detect(sides_and_sink,   c("sink|both")) ~ TRUE,
                                    is.na(sides_and_sink) ~ NA,
                                    TRUE ~ FALSE))
date_cleaned | name | sides_and_sink | dishwasher_full | no_of_mugs | notes | clean_sides | clean_sink
2018-07-16 | Charlotte | sides and sink | no | 16 | new cleaning schedule begins today! | TRUE | TRUE
2018-07-17 | David E | NA | NA | 2-Jun | NA | NA | NA
2018-07-18 | Hannah | sides | yes | 15(ish) | all fine | TRUE | FALSE
2018-07-19 | Dave | sink, side | Yes! | 18 | so many mugs | TRUE | TRUE
2018-07-20 | Charlie | didn’t need doing | YES | 12- 14 (I think!) | can’t remember very well | FALSE | FALSE
2018-07-23 | Davida | NA | NA | 0 | All of us were at conference but I still had to put 0 in form! | NA | NA
2018-07-24 | David A | both | No | Apr-14 | NA | TRUE | TRUE
2018-07-25 | Hanson | just the side | NO | 3 | not v messy | TRUE | FALSE
2018-07-26 | Davie | sink only | yes | 6 | NA | FALSE | TRUE


Changing alphabetic case and removing special characters: The ‘dishwasher_full’ column is difficult to analyze because the character cases are inconsistent and the column contains a special character (‘!’).

# The "toupper" function converts a string to upper case
# The "str_replace_all" function with the option "[[:punct:]]" removes special characters including "!"
kitchen_data <- kitchen_data %>%
      mutate(dishwasher_full = str_replace_all(toupper(dishwasher_full), "[[:punct:]]", ""))
date_cleaned | name | sides_and_sink | dishwasher_full | no_of_mugs | notes | clean_sides | clean_sink
2018-07-16 | Charlotte | sides and sink | NO | 16 | new cleaning schedule begins today! | TRUE | TRUE
2018-07-17 | David E | NA | NA | 2-Jun | NA | NA | NA
2018-07-18 | Hannah | sides | YES | 15(ish) | all fine | TRUE | FALSE
2018-07-19 | Dave | sink, side | YES | 18 | so many mugs | TRUE | TRUE
2018-07-20 | Charlie | didn’t need doing | YES | 12- 14 (I think!) | can’t remember very well | FALSE | FALSE
2018-07-23 | Davida | NA | NA | 0 | All of us were at conference but I still had to put 0 in form! | NA | NA
2018-07-24 | David A | both | NO | Apr-14 | NA | TRUE | TRUE
2018-07-25 | Hanson | just the side | NO | 3 | not v messy | TRUE | FALSE
2018-07-26 | Davie | sink only | YES | 6 | NA | FALSE | TRUE


Re-coding erroneous answers and inconsistent data recording: In the ‘notes’ column, we can see that “Davida” records that she filled in the ‘no_of_mugs’ column with ‘0’ because the question was compulsory but the value should be ‘NA’ because everyone was at a company conference that day. We can also see that the ‘no_of_mugs’ column contains text in brackets, which we need to remove so we can analyse the numbers. Additionally, some of the numbers given are a range of numbers instead of an exact number. We ideally want the mean of the range of numbers to make analysis easier.

# The function "gsub" with the regular expression ""\\(.*","""
#   converts values to character strings and removes all of the string after the first bracket
# The function "as.Date" converts a character string to a date (so it can be recognised)
# The "separate" function with the "-" option splits a column into two separate columns
kitchen_data <- kitchen_data %>%
      mutate(no_of_mugs = case_when(date_cleaned == as.Date("2018-07-23") ~ NA_character_,
                                    TRUE ~ gsub("\\(.*","",no_of_mugs)))  %>%
      separate(no_of_mugs, c("lower", "upper"), "-") %>%
      mutate(no_of_mugs = case_when(!is.na(upper) ~ (as.numeric(lower) + as.numeric(upper))/2,
                                    TRUE ~ as.numeric(lower)))
date_cleaned | name | sides_and_sink | dishwasher_full | lower | upper | notes | clean_sides | clean_sink | no_of_mugs
2018-07-16 | Charlotte | sides and sink | NO | 16 | NA | new cleaning schedule begins today! | TRUE | TRUE | 16
2018-07-17 | David E | NA | NA | 2 | Jun | NA | NA | NA | NA
2018-07-18 | Hannah | sides | YES | 15 | NA | all fine | TRUE | FALSE | 15
2018-07-19 | Dave | sink, side | YES | 18 | NA | so many mugs | TRUE | TRUE | 18
2018-07-20 | Charlie | didn’t need doing | YES | 12 | 14 | can’t remember very well | FALSE | FALSE | 13
2018-07-23 | Davida | NA | NA | NA | NA | All of us were at conference but I still had to put 0 in form! | NA | NA | NA
2018-07-24 | David A | both | NO | Apr | 14 | NA | TRUE | TRUE | NA
2018-07-25 | Hanson | just the side | NO | 3 | NA | not v messy | TRUE | FALSE | 3
2018-07-26 | Davie | sink only | YES | 6 | NA | NA | FALSE | TRUE | 6


Removing and re-ordering columns: The final step is to remove and re-order any columns that we generated or rearranged during the cleaning process or that we no longer need.

# The "select" function chooses specific columns and put them in a given order
kitchen_data <- kitchen_data %>%
      select(date_cleaned, name, clean_sides, clean_sink, dishwasher_full, no_of_mugs, notes)
date_cleaned | name | clean_sides | clean_sink | dishwasher_full | no_of_mugs | notes
2018-07-16 | Charlotte | TRUE | TRUE | NO | 16 | new cleaning schedule begins today!
2018-07-17 | David E | NA | NA | NA | NA | NA
2018-07-18 | Hannah | TRUE | FALSE | YES | 15 | all fine
2018-07-19 | Dave | TRUE | TRUE | YES | 18 | so many mugs
2018-07-20 | Charlie | FALSE | FALSE | YES | 13 | can’t remember very well
2018-07-23 | Davida | NA | NA | NA | NA | All of us were at conference but I still had to put 0 in form!
2018-07-24 | David A | TRUE | TRUE | NO | NA | NA
2018-07-25 | Hanson | TRUE | FALSE | NO | 3 | not v messy
2018-07-26 | Davie | FALSE | TRUE | YES | 6 | NA

R session: Cleaning dirty data with data.table

First recall what we did using the tidyverse:

library(tidyverse)
library(lubridate)

read_csv("kitchen_data.csv", na = c("", "N/A", "NA"), trim_ws = TRUE) %>%
  distinct() %>%
  mutate(date_cleaned = parse_date_time(date_cleaned, "dmy")) %>%
  mutate(clean_sides = case_when(str_detect(sides_and_sink, c("side|both")) == TRUE ~ TRUE,
                                 is.na(sides_and_sink) == TRUE ~ NA,
                                 TRUE ~ FALSE),
         clean_sink = case_when(str_detect(sides_and_sink,   c("sink|both")) ~ TRUE,
                                is.na(sides_and_sink) ~ NA,
                                TRUE ~ FALSE)) %>%
  mutate(dishwasher_full = str_replace_all(toupper(dishwasher_full), "[[:punct:]]", "")) %>%
  mutate(no_of_mugs = case_when(date_cleaned == as.Date("2018-07-23") ~ NA_character_,
                                TRUE ~ gsub("\\(.*","",no_of_mugs)))  %>%
  separate(no_of_mugs, c("lower", "upper"), "-") %>%
  mutate(no_of_mugs = case_when(!is.na(upper) ~ (as.numeric(lower) + as.numeric(upper))/2,
                                TRUE ~ as.numeric(lower))) %>%
  select(date_cleaned, name, clean_sides, clean_sink, dishwasher_full, no_of_mugs, notes) %>%
  print()
R>> # A tibble: 9 x 7
R>>   date_cleaned        name     clean_sides clean_sink dishwasher_full no_of_mugs notes                                  
R>>   <dttm>              <chr>    <lgl>       <lgl>      <chr>                <dbl> <chr>                                  
R>> 1 2018-07-16 00:00:00 Charlot… TRUE        TRUE       NO                      16 new cleaning schedule begins today!    
R>> 2 2018-07-17 00:00:00 David E  NA          NA         <NA>                    NA <NA>                                   
R>> 3 2018-07-18 00:00:00 Hannah   TRUE        FALSE      YES                     15 all fine                               
R>> 4 2018-07-19 00:00:00 Dave     TRUE        TRUE       YES                     18 so many mugs                           
R>> 5 2018-07-20 00:00:00 Charlie  FALSE       FALSE      YES                     13 can't remember very well               
R>> 6 2018-07-23 00:00:00 Davida   NA          NA         <NA>                    NA All of us were at conference but I sti…
R>> 7 2018-07-24 00:00:00 David A  TRUE        TRUE       NO                      NA <NA>                                   
R>> 8 2018-07-25 00:00:00 Hanson   TRUE        FALSE      NO                       3 not v messy                            
R>> 9 2018-07-26 00:00:00 Davie    FALSE       TRUE       YES                      6 <NA>

We will now implement the very same operations using data.table:

library(data.table)

dt <- fread("kitchen_data.csv", na.strings = c("", "N/A", "NA"), header = TRUE) 
dt <- unique(dt)
dt[, date_cleaned := parse_date_time(date_cleaned, "dmy")]
dt[, clean_sides := ifelse(str_detect(sides_and_sink, c("side|both")), TRUE, FALSE)]
dt[, clean_sink := ifelse(str_detect(sides_and_sink, c("sink|both")), TRUE, FALSE)]
dt[, dishwasher_full := str_replace_all(toupper(dishwasher_full), "[[:punct:]]", "")]
dt[, no_of_mugs := ifelse(date_cleaned == as.Date("2018-07-23"), NA, gsub("\\(.*","",no_of_mugs))]
dt[, c("lower", "upper") := tstrsplit(no_of_mugs, "-")]
dt[, no_of_mugs := ifelse(!is.na(upper), (as.numeric(lower) + as.numeric(upper))/2, as.numeric(lower))]
dt <- dt[, c("date_cleaned", "name", "clean_sides", "clean_sink", "dishwasher_full", "no_of_mugs", "notes")]
print(dt)
R>>    date_cleaned      name clean_sides clean_sink dishwasher_full no_of_mugs
R>> 1:   2018-07-16 Charlotte        TRUE       TRUE              NO         16
R>> 2:   2018-07-17   David E          NA         NA            <NA>         NA
R>> 3:   2018-07-18    Hannah        TRUE      FALSE             YES         15
R>> 4:   2018-07-19      Dave        TRUE       TRUE             YES         18
R>> 5:   2018-07-20   Charlie       FALSE      FALSE             YES         13
R>> 6:   2018-07-23    Davida          NA         NA            <NA>         NA
R>> 7:   2018-07-24   David A        TRUE       TRUE              NO         NA
R>> 8:   2018-07-25    Hanson        TRUE      FALSE              NO          3
R>> 9:   2018-07-26     Davie       FALSE       TRUE             YES          6
R>>                                                             notes
R>> 1:                            new cleaning schedule begins today!
R>> 2:                                                           <NA>
R>> 3:                                                       all fine
R>> 4:                                                   so many mugs
R>> 5:                                       can't remember very well
R>> 6: All of us were at conference but I still had to put 0 in form!
R>> 7:                                                           <NA>
R>> 8:                                                    not v messy
R>> 9:                                                           <NA>

You decide which philosophy looks less messy!

Tidy data

  • There is a concept called tidy data that dictates how to organize your data.

  • In R, the tidyverse has been developed with this concept in mind.

  • For more details see chapter 2 of this book.

  • Getting your data into this format requires some upfront work, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.

  • There are three interrelated rules which make a dataset tidy:

    1. Each variable must have its own column.
    2. Each observation must have its own row.
    3. Each value must have its own cell.

From wide to long data format

  • A common problem is a dataset where some of the column names are not names of variables, but values of a variable.

  • To tidy a dataset like this, we need to gather those columns into a new pair of variables.

    👉 Gathering transforms data from wide to long format.

From long to wide data format

  • Spreading is the opposite of gathering. You use it when an observation is scattered across multiple rows.

    👉 Spreading transforms data from long to wide format.

R session: Wide and long data formats

We are going to explore four functions from the package tidyr, which belongs to the tidyverse:

  • gather() (to be replaced by pivot_longer()): takes multiple columns and collapses them into key-value pairs, duplicating all other columns as needed.
  • spread() (to be replaced by pivot_wider()): spreads a key-value pair across multiple columns, i.e., it takes two columns (key & value) and spreads them into multiple columns. It makes “long” data wider.
  • separate(): Turns a single character column into multiple columns.
  • unite(): Convenience function to paste multiple columns into one.

For more details see this chapter on tidy data. More functions are described here.

The following example is taken from financetrain.com.

Function gather(): It takes multiple selected columns as input and transforms them into key-value pairs. It makes “wide” data longer. Here’s an example of how you might use gather() on a sample dataset containing data for people who applied for a loan at a bank.

library(tidyverse)

personalDetails <- data.frame(
  name = c("Shankar", "Shashi"),
  age = c(30,45),
  job_status = c("employed","unemployed"),
  education = c("masters","tertiary")
)
personalDetails
R>>      name age job_status education
R>> 1 Shankar  30   employed   masters
R>> 2  Shashi  45 unemployed  tertiary
personalDetails %>%
  gather(c("job_status", "education"), key = "key", value = "value")
R>> Warning: attributes are not identical across measure variables;
R>> they will be dropped
R>>      name age        key      value
R>> 1 Shankar  30 job_status   employed
R>> 2  Shashi  45 job_status unemployed
R>> 3 Shankar  30  education    masters
R>> 4  Shashi  45  education   tertiary


Function gather() – Stock Data Example: Let us consider a sample dataframe of stocks. In the dataframe, we have 3 stocks A, B, and C with their date-wise stock closing prices.

stocks <- data.frame(
  Date = as.Date('2009-01-01') + 0:6,
  A = runif(7, 100, 145),
  B = runif(7, 350, 560),
  C = runif(7, 34, 89)
)
stocks
R>>         Date        A        B        C
R>> 1 2009-01-01 113.3674 514.9123 82.01581
R>> 2 2009-01-02 125.1541 443.9335 37.97838
R>> 3 2009-01-03 104.4765 476.1542 64.51968
R>> 4 2009-01-04 113.1934 530.6312 85.11715
R>> 5 2009-01-05 136.0120 354.9082 57.81039
R>> 6 2009-01-06 143.7152 477.2381 86.57311
R>> 7 2009-01-07 102.2434 378.0041 62.67538

In some situations, we may want to have one column called “stock” and another column called “prices” and then each row containing the observations. This can be done as follows:

stocks_long <- stocks %>%
  gather(-"Date", key = "stock", value = "price")
stocks_long
R>>          Date stock     price
R>> 1  2009-01-01     A 113.36737
R>> 2  2009-01-02     A 125.15412
R>> 3  2009-01-03     A 104.47649
R>> 4  2009-01-04     A 113.19337
R>> 5  2009-01-05     A 136.01199
R>> 6  2009-01-06     A 143.71522
R>> 7  2009-01-07     A 102.24340
R>> 8  2009-01-01     B 514.91229
R>> 9  2009-01-02     B 443.93352
R>> 10 2009-01-03     B 476.15420
R>> 11 2009-01-04     B 530.63124
R>> 12 2009-01-05     B 354.90823
R>> 13 2009-01-06     B 477.23814
R>> 14 2009-01-07     B 378.00411
R>> 15 2009-01-01     C  82.01581
R>> 16 2009-01-02     C  37.97838
R>> 17 2009-01-03     C  64.51968
R>> 18 2009-01-04     C  85.11715
R>> 19 2009-01-05     C  57.81039
R>> 20 2009-01-06     C  86.57311
R>> 21 2009-01-07     C  62.67538


Function spread(): It spreads a key-value pair across multiple columns. It takes two columns (key & value) and spreads into multiple columns. It makes “long” data wider. We can use our stock prices data to demonstrate the spread() function. Consider the long-form data that we got after applying the gather() function. We can bring it back to its original form using the spread() function.

stocks_long %>% 
  spread("stock", "price")
R>>         Date        A        B        C
R>> 1 2009-01-01 113.3674 514.9123 82.01581
R>> 2 2009-01-02 125.1541 443.9335 37.97838
R>> 3 2009-01-03 104.4765 476.1542 64.51968
R>> 4 2009-01-04 113.1934 530.6312 85.11715
R>> 5 2009-01-05 136.0120 354.9082 57.81039
R>> 6 2009-01-06 143.7152 477.2381 86.57311
R>> 7 2009-01-07 102.2434 378.0041 62.67538

Alternatively, we can also spread it with another column such as “Date”, which will create a column for each date.

stocks_long %>% 
  spread("Date", "price")
R>>   stock 2009-01-01 2009-01-02 2009-01-03 2009-01-04 2009-01-05 2009-01-06 2009-01-07
R>> 1     A  113.36737  125.15412  104.47649  113.19337  136.01199  143.71522  102.24340
R>> 2     B  514.91229  443.93352  476.15420  530.63124  354.90823  477.23814  378.00411
R>> 3     C   82.01581   37.97838   64.51968   85.11715   57.81039   86.57311   62.67538


Function separate(): It turns a single character column into multiple columns.

In some cases a single column contains multiple variables separated by a special character; if we want to split them into separate columns, we use the separate() function.

df <- data.frame(Names = c("koti.minnun", "AB.Diwillars", "ST.Joesph"))
df
R>>          Names
R>> 1  koti.minnun
R>> 2 AB.Diwillars
R>> 3    ST.Joesph
df_sep <- df %>% 
  separate("Names", c("firstName", "lastName"))
df_sep
R>>   firstName  lastName
R>> 1      koti    minnun
R>> 2        AB Diwillars
R>> 3        ST    Joesph


Function unite(): It is the opposite of separate() and combines multiple columns into a single column.

Consider our earlier example:

df_sep %>%
  unite("Name", c("firstName", "lastName"), sep = "_")
R>>           Name
R>> 1  koti_minnun
R>> 2 AB_Diwillars
R>> 3    ST_Joesph

Outlier Detection

Outliers

  • What is an outlier? The set of data points that are considerably different from the rest of the data…

  • When do they appear in data mining tasks?

    • Given a data matrix \(\mathbf{X}\), find all the cases \(\mathbf{x}_i \in \mathbf{X}\) with anomaly/outlier scores greater than some threshold \(t\) or the top \(n\) outlier scores.
    • Given a data matrix \(\mathbf{X}\), containing mostly normal (but unlabeled) data points, and a test case \(\mathbf{x}_{new}\), compute an anomaly/outlier score of \(\mathbf{x}_{new}\) with respect to \(\mathbf{X}\).
  • Applications

    • Credit card fraud detection: purchasing behavior of a credit card owner usually changes when the card is stolen;
    • Medicine: unusual symptoms or test results may indicate potential health problems of a patient;
    • Public health: the occurrence of a particular disease, e.g., tetanus, scattered across various hospitals of a city indicate problems with the corresponding vaccination program in that city;
    • Sports statistics;
    • Detecting measurement errors;
    • Network intrusion detection.

What is an outlier?

General application scenarios

  • Supervised scenario
    • In some applications, training data with normal and abnormal data objects are provided.
    • There may be multiple normal and/or abnormal classes.
    • Often, the classification problem is highly imbalanced.


  • Semi-supervised Scenario
    • In some applications, only training data for the normal class(es) (or only the abnormal class(es)) are provided.


  • Unsupervised Scenario
    • In most applications there are no training data available.

R session: Regression with outliers

Consider a simple linear regression:

set.seed(42)
x <- runif(40)
y <- 1 + 3*x + 0.1*rnorm(20)
res <- lm(y ~ x)
res
R>> 
R>> Call:
R>> lm(formula = y ~ x)
R>> 
R>> Coefficients:
R>> (Intercept)            x  
R>>      0.9818       2.9849
# plot
plot(x, y, main = "Regression without outliers")
abline(res, col = "red")

Now let’s see what happens with a couple of outliers:

# add two outliers
y[35] <- y[35] + 3
y[23] <- y[23] - 4
res <- lm(y ~ x)
res
R>> 
R>> Call:
R>> lm(formula = y ~ x)
R>> 
R>> Coefficients:
R>> (Intercept)            x  
R>>       1.506        2.054

We can see that the intercept is estimated as 1.5 instead of 1 and the slope is estimated as 2 instead of 3!

The plot shows how detrimental the effect is:

# plot
plot(x, y, main = "Regression with outliers")
abline(res, col = "red")

Outlier correction: Trimming and Winsorization

  • Extreme values or outliers sometimes have a big effect on statistical operations, which is generally detrimental

  • Trimming and Winsorization are two simple techniques to deal with outliers by altering the data.

  • Trimming consists of simply throwing away the extreme values. It removes a certain fraction of the data from each tail.

  • Winsorization refines the trimming by instead replacing the extreme values.

  • Extreme values:

    • one way to choose the extreme values is simply by taking some fraction of both tails of the distribution. In practice, one can compute the histogram and find the tails from there.
    • another way is to target only the points that are likely to be troublesome, i.e., only replace data that are too far from the rest. In practice, this can be done by focusing on the points that are outside the center of the distribution plus/minus some measure of dispersion (see the sketch below). Using the mean and the standard deviation as measures of location and dispersion is not recommended since their computation will be affected by the same outliers! Instead one can use the median and the median absolute deviation (MAD): \({\sf mad}(\{x_i\}) = {\sf median}(\{|x_i - x_{\sf med}|\})\).
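
As a quick illustration of this second approach, here is a minimal sketch (on hypothetical simulated data) that flags as extreme any point further than three robust standard deviations from the median, using the median and the MAD in place of the mean and the standard deviation:

# flag points outside median +/- 3*MAD as candidate extreme values
set.seed(42)
x <- c(rnorm(100), 10, -12)   # clean data plus two extreme values
center <- median(x)
dispersion <- mad(x)          # MAD (scaled by 1.4826 by default, to be consistent with the std dev for Gaussian data)
is_extreme <- abs(x - center) > 3 * dispersion
which(is_extreme)             # indices of the flagged points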

Winsorization

Example of the Winsorization effect on some normally distributed data:

R session: Trimming and Winsorization

Let’s first create simple data with one large outlier:

# create data with one large outlier
data_with_oulier <- c(1:10, 300)

# mean of clean data
mean(data_with_oulier[1:10])
R>> [1] 5.5
# mean of data with outlier
mean(data_with_oulier)
R>> [1] 32.27273

Trimming:

The mean() function in R has a trim argument so that you can easily get trimmed means. To be exact, trim denotes the fraction (0 to 0.5) of observations to be trimmed from each end of the data before the mean is computed.

# trimming 5% of the data is not enough to get rid of the outlier
mean(data_with_oulier, trim = .05)
R>> [1] 32.27273
# trimming 10% does the job
mean(data_with_oulier, trim = .10)
R>> [1] 6
# trimming removes both extremes, so it's like doing:
mean(data_with_oulier[2:10])
R>> [1] 6

Winsorization:

We will now define a function to Winsorize:

winsorize <- function (x, fraction = .05) {
  lim <- quantile(x, probs = c(fraction, 1-fraction))
  x[x < lim[1]] <- lim[1]
  x[x > lim[2]] <- lim[2]
  return(x)
}

winsorize(data_with_oulier, fraction = .10)
R>>  [1]  2  2  3  4  5  6  7  8  9 10 10
mean(winsorize(data_with_oulier, fraction = .10))
R>> [1] 6

Statistical approaches

  • General idea:
    • Given a certain kind of statistical distribution (e.g., Gaussian)
    • Compute the parameters assuming all data points have been generated by such a statistical distribution (e.g., mean and standard deviation)
    • Outliers are points that have a low probability to be generated by the overall distribution (e.g., deviate more than 3 times the standard deviation from the mean)
    • See later discussion of Hadlum vs. Hadlum
  • Basic assumption:
    • Normal data objects follow a (known) distribution and occur in a high probability region of this model
    • Outliers deviate strongly from this distribution
  • A huge number of different tests are available differing in
    • Type of data distribution (e.g. Gaussian)
    • Number of variables, i.e., dimensions of the data objects (univariate/multivariate)
    • Number of distributions (mixture models)
    • Parametric versus non-parametric (e.g. histogram-based)

Statistical approach: Grubbs’ test

  • Suppose we have a sample of \(n\) numbers \(\mathbf{X} = \{x_1, \ldots, x_n\}\), i.e., an \(n \times 1\) data matrix.

  • Assuming the data come from a normal distribution, Grubbs’ test uses the distribution of \[\max_{1\le i\le n} \frac{x_i - \bar{x}}{\sigma_x}\] to search for outlying large values.

  • Lower tail variant: \[\min_{1\le i\le n} \frac{x_i - \bar{x}}{\sigma_x}\]

  • Two-sided variant: \[\max_{1\le i\le n} \frac{|x_i - \bar{x}|}{\sigma_x}\]

Statistical approach: Grubbs’ test

  • Having chosen a test statistic, we must determine a threshold that defines our decision rule.

  • Often this is set via a hypothesis test to control Type I error.

  • For large positive outlier, threshold is based on choosing some acceptable Type I error \(\alpha\) and finding \(c_{\alpha}\) so that \[P_0\left(\max_{1\le i\le n} \frac{x_i - \bar{x}}{\sigma_x} \ge c_{\alpha}\right) \approx \alpha\] where \(P_0\) denotes the distribution of \(\mathbf{X}\) under the assumption there are no outliers.

  • For Gaussian data, the two-sided critical level has the form \[c_{\alpha} = \frac{n-1}{\sqrt{n}}\sqrt{\frac{t^2_{\alpha/(2n),n-2}}{n-2+t^2_{\alpha/(2n),n-2}}}\] where \(P(T_k \ge t_{\gamma,k}) = \gamma\) is the upper tail quantile of \(T_k\).

  • In R, you can use the functions pnorm, qnorm, pt, and qt to compute these quantities, as in the sketch below.
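
The following minimal sketch (on hypothetical simulated data) computes the two-sided Grubbs statistic and compares it with the critical value \(c_{\alpha}\) given above:

# two-sided Grubbs test on simulated data
set.seed(42)
x <- c(rnorm(50), 6)   # 50 clean points plus one outlier
n <- length(x)
G <- max(abs(x - mean(x))) / sd(x)                        # two-sided Grubbs statistic
alpha <- 0.05
t2 <- qt(alpha/(2*n), df = n - 2, lower.tail = FALSE)^2   # squared upper-tail t quantile
c_alpha <- (n - 1)/sqrt(n) * sqrt(t2/(n - 2 + t2))        # critical value
G > c_alpha                                               # TRUE suggests the most extreme point is an outlier

The R package outliers also implements this test via its function grubbs.test().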

Example: Hadlum vs. Hadlum (1949)

  • This case was analyzed in [Barnett 1978].

  • The birth of a child to Mrs. Hadlum happened 349 days after Mr. Hadlum left for military service.

  • Average human gestation period is 280 days (40 weeks).

  • Statistically, 349 days is an outlier.

  • There is a very low probability that the birth of Mrs. Hadlum’s child was generated by this nominal process. 🤔

R session: Outlier detection in pregnancy duration

For this study on detecting outliers in R, we will follow the example in RPubs. We will be using the csv file “PregnancyDuration.csv” available for download at www.wiley.com/go/datasmart.

pregnancy.df <- read.csv("PregnancyDuration.csv")
str(pregnancy.df)
R>> 'data.frame':   1000 obs. of  1 variable:
R>>  $ GestationDays: int  349 278 266 265 269 263 278 257 268 260 ...
head(pregnancy.df)
R>>   GestationDays
R>> 1           349
R>> 2           278
R>> 3           266
R>> 4           265
R>> 5           269
R>> 6           263
summary(pregnancy.df)
R>>  GestationDays  
R>>  Min.   :240.0  
R>>  1st Qu.:260.0  
R>>  Median :267.0  
R>>  Mean   :266.6  
R>>  3rd Qu.:272.0  
R>>  Max.   :349.0

From the output, we see the minimum gestation is 240 days (8 months), while the maximum is 349 days (about 11 months). The range is therefore 109 days, or about 3 months, which suggests a lot of variability in gestation days. The 25th percentile is 260 days (about 8.66 months) and the 75th percentile is 272 days (about 9.06 months). Finally, the median is 267 days and the “typical” average is also about 267 days, roughly 9 months.

Let’s plot the histogram to visualize the data better:

library(ggplot2)

# Histogram of gestation days
qplot(pregnancy.df$GestationDays, geom = "histogram", fill = I("cyan"), col = I("black"),
      main = "Histogram of Gestation Days", xlab = "Gestation Days")

The output has a peaked bar at about 272 gestation days. Most of the values are bunched up on the left side of the histogram, with few values spreading along the right tail. This tells us that the data points are not normally distributed and the distribution is asymmetric (note that the mean and the median occur at different points). In addition, the plot reveals the presence of an outlier at the far right of the plot, completely apart from the other observations.

Using the histogram is a good way to assess the shape and spread of the data and to identify any potential outliers. Similarly, the boxplot is also vital to evaluate the spread of the data. Let’s go back to the summary and get a good understanding of skewness.

Interquartile range: In the summary results, the interquartile range is equal to 272 minus 260 = 12 days (IQR = 3rd Qu. - 1st Qu.). Alternatively, you can call the built-in IQR() function on the GestationDays column to calculate the IQR:

summary(pregnancy.df)
R>>  GestationDays  
R>>  Min.   :240.0  
R>>  1st Qu.:260.0  
R>>  Median :267.0  
R>>  Mean   :266.6  
R>>  3rd Qu.:272.0  
R>>  Max.   :349.0
pregnancy_IQR <- IQR(pregnancy.df$GestationDays)
pregnancy_IQR
R>> [1] 12
# we can also find the quartiles directly
quantile(pregnancy.df$GestationDays, c(0.25, 0.75))
R>> 25% 75% 
R>> 260 272

We can then compute the lower and upper Tukey fences, named after John Wilder Tukey, an American mathematician best known for the development of the FFT algorithm and the boxplot. The Tukey range test, the Tukey lambda distribution, the Tukey test of additivity, and the Teichmüller-Tukey lemma all bear his name.

lower_inner_fence <- 260 - 1.5 * pregnancy_IQR
lower_inner_fence
R>> [1] 242
upper_inner_fence <- 272 + 1.5 * pregnancy_IQR
upper_inner_fence
R>> [1] 290

Let’s see the points that violate the Tukey fences:

# values above the upper Tukey fence
pregnancy.df$GestationDays[pregnancy.df$GestationDays > upper_inner_fence]
R>> [1] 349 292 295 291 297 303 293 296
# values below the lower Tukey fence
pregnancy.df$GestationDays[pregnancy.df$GestationDays < lower_inner_fence]
R>> [1] 241 240

Two points are below the lower Tukey fence and eight values are above the upper Tukey fence.

Boxplot:

We are now ready to plot a boxplot using the function boxplot(). The boxplot() function will graph the median, first quartile and third quartile, Tukey fences, and any outliers.

boxplot(pregnancy.df$GestationDays)

The Tukey fences can be modified to be “outlier” fences by changing the range argument in the boxplot() call (it defaults to 1.5 times the IQR). If you set range = 3, then the fences are drawn at the most extreme data point within three times the IQR instead.

boxplot(pregnancy.df$GestationDays, range = 3)

We can improve the boxplot by using ggplot2. You can draw a boxplot and show the colored outliers.

# boxplot using ggplot2 with colored outliers
ggplot(data = pregnancy.df, mapping = aes(x = "", y = GestationDays)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 1) +
  ggtitle("Boxplot of The Pregnancy Duration Data", subtitle = "With red outliers in Tukey fences using 1.5 times the IQR") +
  xlab("") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.subtitle = element_text(hjust = 0.5, face = "italic"))

In the output, the tiny red circles are the outliers: 8 above the box and 2 below the box. The plot could be further improved by also adding the (colored) mean to the boxplot.

Statistical approach in multivariate case: Mahalanobis distance

  • For the univariate Gaussian case, Grubbs’ statistic was defined as \[\max_{1\le i\le n} \frac{|x_i - \bar{x}|}{\sigma_x}\]

  • What’s the equivalent in the multivariate case?

  • Consider a multivariate Gaussian distribution (single parametric model) with pdf: \[ f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^d|\boldsymbol{\Sigma}|}}e^{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})} \] where \(\boldsymbol{\mu}\) is the mean vector and \(\boldsymbol{\Sigma}\) is the covariance matrix.

  • The term \((\mathbf{x} - \boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\) is called Mahalanobis distance and it can be used to detect outliers.

  • For Gaussian data, the Mahalanobis distance follows a \(\chi^2\)-distribution with \(d\) degrees of freedom, so we can declare \(\mathbf{x}\) an outlier if \((\mathbf{x} - \boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) > \chi^2_d(0.975)\), i.e., if it exceeds the 0.975 quantile of that distribution (the multivariate counterpart of a “3\(\sigma\)” rule); see the sketch below.
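
A minimal sketch of this rule on hypothetical simulated data, using the base R functions mahalanobis() and qchisq():

library(mvtnorm)

# Gaussian data with one planted outlier
set.seed(42)
X <- rbind(rmvnorm(100, sigma = diag(2)), c(6, -6))
d <- ncol(X)

# squared Mahalanobis distances and chi-squared threshold
md2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
which(md2 > qchisq(0.975, df = d))   # indices of the flagged outliers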

Statistical approach in multivariate case

Visualization in 2D:
[Tan et al. 2006]

Robustness issue:

  • Mean and standard deviation are very sensitive to outliers
  • These values are computed for the complete dataset (including potential outliers)
  • The Mahalanobis distance is used to determine outliers although the computation itself is influenced by these outliers.

Likelihood approach

  • Assume the data is a mixture \(F = (1-\lambda)M + \lambda A\), where \(M\) is the distribution of “most of the data” and \(A\) is an “outlier” distribution, which could be, e.g., uniform on a bounding box of the data.

  • Any points assigned to \(A\) are “outliers.”

  • Design an algorithm (we can estimate \(\lambda\) or fix it) that tries to maximize the equivalent classification likelihood of the samples \(\mathbf{x}_i, \forall i\in I\): \[ L(\theta_M,\theta_A;I) = \log\left( (1-\lambda)^{|I_M|}\prod_{i\in I_M}f_M(\mathbf{x}_i; \theta_M) \times \lambda^{|I_A|}\prod_{i\in I_A}f_A(\mathbf{x}_i; \theta_A)\right) \]

  • Algorithm: try to maximize this likelihood by forming iterative estimates \((M_t,A_t)\) of “normal” and “outlying” data points (a rough sketch follows after this list):

    1. At each stage, try to move individual points from \(M_t\) to \(A_t\).
    2. Find \((\theta_M,\theta_A)\) based on the new partition (if necessary).
    3. If the increase in likelihood is large enough, call these the new sets \((M_{t+1}, A_{t+1})\).
    4. Repeat until no further changes.
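
A rough one-dimensional sketch of this idea on simulated data, assuming \(M\) is Gaussian, \(A\) is uniform on the data range, and \(\lambda\) is fixed (the sample mean and standard deviation of the current \(M_t\) play the role of \(\theta_M\)):

# greedy likelihood-based outlier assignment (1D sketch)
set.seed(42)
x <- c(rnorm(100), 8, -9)              # clean data plus two outliers
lambda <- 0.05                         # fixed mixing weight of the outlier distribution
f_A <- 1 / diff(range(x))              # uniform density on the bounding interval of the data
loglik <- function(in_A) {             # classification log-likelihood of a given partition
  x_M <- x[!in_A]
  sum(!in_A) * log(1 - lambda) + sum(dnorm(x_M, mean(x_M), sd(x_M), log = TRUE)) +
    sum(in_A) * (log(lambda) + log(f_A))
}
in_A <- rep(FALSE, length(x))          # start with all points in M
repeat {
  gains <- sapply(which(!in_A), function(i) {
    trial <- in_A; trial[i] <- TRUE
    loglik(trial) - loglik(in_A)
  })
  if (all(gains <= 0)) break           # stop when no single move increases the likelihood
  in_A[which(!in_A)[which.max(gains)]] <- TRUE
}
which(in_A)                            # points declared outliers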

Depth-based approaches

  • General idea
    • Search for outliers at the border of the data space but independent of statistical distributions
    • Organize data objects in convex hull layers
    • Outliers are objects on outer layers
  • Basic assumption
    • Outliers are located at the border of the data space
    • Normal objects are in the center of the data space










[Johnson et al. 1998]

Depth-based approaches

  • Approach [Tukey 1977]:
    • Points on the convex hull of the full data space have depth = 1
    • Points on the convex hull of the data set after removing all points with depth = 1 have depth = 2
    • Points having a depth ≤ k are reported as outliers (see the sketch below)
[Preparata and Shamos 1988]

  • Convex hull computation is usually only efficient in 2D/3D spaces.

  • Sample algorithms:

    • ISODEPTH [Ruts and Rousseeuw 1996]
    • FDC [Johnson et al. 1998]
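
A minimal 2D sketch of the peeling procedure on simulated data, using the base R function chull(), which returns the indices of the points on the convex hull:

# convex hull peeling: the outermost hull gets depth 1, the next hull depth 2, and so on
set.seed(42)
X <- cbind(rnorm(100), rnorm(100))
depth <- rep(NA_integer_, nrow(X))
remaining <- seq_len(nrow(X))
layer <- 0L
while (length(remaining) > 2) {
  layer <- layer + 1L
  hull <- chull(X[remaining, , drop = FALSE])   # positions (within 'remaining') of the hull points
  depth[remaining[hull]] <- layer
  remaining <- remaining[-hull]
}
depth[remaining] <- layer + 1L   # any leftover inner points
which(depth <= 1)                # points reported as outliers for k = 1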

Deviation-based approaches

  • General idea
    • Outliers are points that do not fit to the general characteristics of the dataset, i.e., the variance of the set is minimized when removing the outliers.
  • Basic assumption
    • Outliers are the outermost points of the dataset
  • Approach [Arning et al. 1996]:
    • Given a smoothing factor \(SF({\cal I})\) that computes for each \({\cal I} \subseteq {\cal S}\) how much the variance of \({\cal S}\) is decreased when \({\cal I}\) is removed from it
    • The outliers are the elements of the exception set \({\cal E} \subseteq {\cal S}\) for which the following holds: \[SF({\cal E}) \geq SF({\cal I}) \text{ for all } {\cal I} \subseteq {\cal S}.\]

Distance-based approaches

  • General Idea
    • Judge a point based on the distance(s) to its neighbors
    • Several variants proposed
  • Basic Assumption
    • Normal data objects have a dense neighborhood
    • Outliers are far apart from their neighbors, i.e., have a less dense neighborhood
  • Methods:
    • \(DB(\epsilon,\pi)\)-Outliers
    • Outlier scoring based on \(k\)NN distances

Distance-based approach: \(DB(\epsilon,\pi)\)-Outliers

  • Approach [Knorr and Ng 1997]:
    • Given a radius \(\epsilon\) and a percentage \(\pi\)
    • A point \(p\) is considered an outlier if at most \(\pi\) percent of all other points have a distance to \(p\) less than \(\epsilon\).

\[OutlierSet(\epsilon,\pi) = \left\{p \;\middle|\; \frac{Card(\{q\in{\cal S} \mid dist(p,q)<\epsilon\})}{Card({\cal S})} \le \pi\right\}\]
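
A minimal sketch of this definition on simulated 2D data, using a full pairwise distance matrix (fine for small datasets):

# DB(eps, pi)-outliers from a distance matrix
set.seed(42)
X <- rbind(matrix(rnorm(200), ncol = 2), c(6, 6))   # clean 2D data plus one outlier
D <- as.matrix(dist(X))                             # pairwise Euclidean distances
eps <- 1
pi_frac <- 0.05                                     # maximum allowed fraction of close neighbors
n <- nrow(X)
frac_close <- (rowSums(D < eps) - 1) / (n - 1)      # fraction of *other* points closer than eps
which(frac_close <= pi_frac)                        # the DB(eps, pi)-outliers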

Density-based approaches

  • General idea
    • Compare the density around a point with the density around its local neighbors
    • The relative density of a point compared to its neighbors is computed as an outlier score
    • Approaches essentially differ in how to estimate density
  • Basic assumption
    • The density around a normal data object is similar to the density around its neighbors
    • The density around an outlier is considerably different to the density around its neighbors

Density-based approaches

Local Outlier Factor (LOF) [Breunig et al. 1999], [Breunig et al. 2000]:

  • Motivation:
    • Distance-based outlier detection models have problems with different densities
    • How to compare the neighborhood of points from areas of different densities?
    • Example
      • \(DB(\epsilon,\pi)\)-outlier model: Parameters \(\epsilon\) and \(\pi\) cannot be chosen so that \(o_2\) is an outlier but none of the points in cluster \(C_1\) (e.g. \(q\)) is an outlier
      • Outliers based on \(k\)NN-distance: \(k\)NN-distances of objects in \(C_1\) (e.g. \(q\)) are larger than the \(k\)NN-distance of \(o_2\)
  • Solution: consider relative density (see the sketch below)
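
A simplified sketch of the relative-density idea on simulated data (not the exact LOF definition, which uses reachability distances): estimate the density of each point as the inverse of its average \(k\)NN distance and compare it with the densities of its neighbors:

# simplified relative-density outlier score (LOF-like)
set.seed(42)
X <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),   # dense cluster
           matrix(rnorm(100, sd = 2), ncol = 2),     # sparse cluster
           c(6, 6))                                  # outlier
D <- as.matrix(dist(X))
n <- nrow(X)
k <- 5
knn <- t(apply(D, 1, function(d) order(d)[2:(k + 1)]))               # k nearest neighbors (excluding self)
density <- 1 / sapply(1:n, function(i) mean(D[i, knn[i, ]]))         # inverse average kNN distance
score <- sapply(1:n, function(i) mean(density[knn[i, ]])) / density  # relative density of neighbors vs own
head(order(score, decreasing = TRUE))                                # points with the highest scores

The R package dbscan provides an implementation of the actual LOF via its function lof().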


High-dimensional approaches

  • Why is that problem important?
    • Some (ten) years ago:
      • Data recording was expensive
      • Variables (attributes) were carefully evaluated for their relevance to the analysis task
      • Data sets usually contained only a small number of relevant dimensions
    • Nowadays:
      • Data recording is easy and cheap
      • “Everyone measures everything”: attributes are not evaluated, just measured
      • Data sets usually contain a large number of features (e.g., in molecular biology, gene expression data has more than 1,000 genes per patient)

High-dimensional approaches

  • Challenges: Curse of dimensionality
    • Relative contrast between distances decreases with increasing dimensionality
    • Data are very sparse, almost all points are outliers
    • Concept of neighborhood becomes meaningless
  • Solutions
    • Use more robust distance functions and find full-dimensional outliers
    • Find outliers in projections (subspaces) of the original feature space
  • Methods:
    • ABOD – angle-based outlier degree [Kriegel et al. 2008]
    • Grid-based subspace outlier detection [Aggarwal and Yu 2000]
    • SOD – subspace outlier degree [Kriegel et al. 2009]

High-dimensional approaches: Angle-based

  • Rationale:
    • Angles are more stable than distances in high dimensional spaces (e.g., the popularity of cosine-based similarity measures for text data)
    • Object \(o\) is an outlier if most other objects are located in similar directions
    • Object \(o\) is not an outlier if many other objects are located in varying directions

High-dimensional approaches: Angle-based

  • Basic assumption
    • Outliers are at the border of the data distribution
    • Normal points are in the center of the data distribution
  • Model
    • Consider for a given point \(p\) the angle between \(\vec{px}\) and \(\vec{py}\) for any two other points \(x,y\) in the database
    • Consider the spectrum of all these angles
    • The broadness of this spectrum is a score for the outlierness of a point (see the sketch below)
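
A brute-force sketch of this idea on simulated 2D data: for every point, compute the angles between the difference vectors to all pairs of other points and use the variance of these angles as an (inverse) outlier score:

# angle-based outlier score: small angle variance suggests a border point, i.e., a candidate outlier
set.seed(42)
X <- rbind(matrix(rnorm(60), ncol = 2), c(6, 6))   # 2D data plus one outlier
n <- nrow(X)
angle_var <- sapply(1:n, function(i) {
  pairs <- combn(setdiff(1:n, i), 2)               # all pairs of other points
  cosines <- apply(pairs, 2, function(jk) {
    a <- X[jk[1], ] - X[i, ]
    b <- X[jk[2], ] - X[i, ]
    sum(a * b) / sqrt(sum(a^2) * sum(b^2))
  })
  var(acos(pmin(pmax(cosines, -1), 1)))            # variance of the angle spectrum
})
which.min(angle_var)                               # candidate outlier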

R session: Multidimensional outlier detection

We will explore the detection of outliers in a multivariate setting based on Grubbs’ test using Mahalanobis distance.

Let’s generate multivariate data and add some outliers:

library(mvtnorm)  # for multivariate Gaussian and Student t generation
library(ggplot2)

# create a mean vector and covariance matrix
p <- 2   # dimension: we use 2 so that we can plot and visualize
n <- 30  # number of points
set.seed(42)
mu <- rep(0, p)
U <- t(rmvnorm(n = ceiling(0.1*p), sigma = 0.1*diag(p)))
Sigma <- U %*% t(U) + 0.1*diag(p)

# generate n x p Gaussian data matrix and outliers
X <- rmvnorm(n = n, mean = mu, sigma = Sigma)
# add four outliers (in the first four data points)
X[1, 2] <- 8
X[2, 2] <- -9
X[3, 2] <- 7
X[4, 2] <- -7
X_clean    <- X[-c(1:4), ]
X_outliers <- X[ c(1:4), ]

We can now plot the generated points together with the ellipsoid of the clean data (black) and the ellipsoid including the outliers (red):

# 2D plots from the first two components of data matrix
ggplot(data = as.data.frame(X_clean)) +
  geom_point(aes(V1, V2)) +
  stat_ellipse(aes(V1, V2)) +
  geom_point(data = as.data.frame(X_outliers), aes(V1, V2), col = "red") +
  stat_ellipse(data = as.data.frame(X), aes(V1, V2), col = "red") +
  ggtitle("Gaussian data with four outliers") + xlab(NULL) + ylab(NULL)

Indeed, the estimated covariance matrix with outliers (red ellipsoid) is severely affected by the outliers.

Let’s try to detect them based on the Mahalanobis distance:

Sigma <- cov(X)
inv_Sigma <- solve(Sigma)
Mahalanobis_distance <- colSums(t(X) * (inv_Sigma %*% t(X)))
qplot(1:n, Mahalanobis_distance, main = "Mahalanobis distances of datapoints", xlab = "", ylab = "")

We can see that the four outliers are clearly identified, but other points come close too. To be more accurate, after removing the identified outliers from the estimation of the covariance matrix, we should recompute the Mahalanobis distances:

# re-compute Sigma without the identified outliers
Sigma <- cov(X[-c(1:4), ])
inv_Sigma <- solve(Sigma)
Mahalanobis_distance <- colSums(t(X) * (inv_Sigma %*% t(X)))
qplot(1:n, Mahalanobis_distance, main = "Mahalanobis distances of datapoints", xlab = "", ylab = "")

Now the outliers are clearly identified. In practice, one can remove the outliers one by one.

R session: Estimation of mean vector and covariance matrix with outliers

In this session, we will illustrate the detrimental effects of outliers in the estimation of the mean vector \(\boldsymbol{\mu}\) and covariance matrix \(\boldsymbol{\Sigma}\).

The naive benchmark estimators are the sample mean and sample covariance matrix: \[\hat{\boldsymbol{\mu}} = \frac{1}{T}\sum_{t=1}^T \mathbf{x}_t\] and \[\hat{\boldsymbol{\Sigma}} = \frac{1}{T}\sum_{t=1}^T (\mathbf{x}_t - \hat{\boldsymbol{\mu}})(\mathbf{x}_t - \hat{\boldsymbol{\mu}})^T.\]

For the robust estimators, we will use the R package covHeavyTail that assumes the data follows a heavy-tailed distribution (which is a way to model naturally outliers).

library(covHeavyTail)  # for robust estimation
library(mvtnorm)  # for multivariate Gaussian and Student t generation
library(ggplot2)
library(reshape2)  # for melt()

# create a mean vector and covariance matrix
N <- 50  # dimension
set.seed(42)
mu <- runif(N)
U <- t(rmvnorm(n = ceiling(0.1*N), sigma = 0.1*diag(N)))
Sigma <- U %*% t(U) + 0.1*diag(N)
nu <- 4  # for heavy-tailed distribution
Sigma_scale <- (nu-2)/nu*Sigma   # Sigma = nu/(nu-2)*Sigma_scale

# loop over different values of the number of observations T
T_sweep <- round(seq(from = N+1, to = 5*N, length.out = 50))
MSE_mu_robust_T <- MSE_Sigma_robust_T <- MSE_sm_T <- MSE_scm_T <- NULL
for(T in T_sweep) {
  # generate data matrix
  #X <- rmvnorm(n = T, mean = mu, sigma = Sigma)  # Gaussian data
  X <- rmvt(n = T, delta = mu, sigma = Sigma_scale, df = nu)  # heavy-tailed data

  # naive estimation (sample estimates)
  mu_sm <- colMeans(X)
  Sigma_scm <- cov(X)

  # robust estimation against outliers and heavy tails
  moments_Studentt <- momentsStudentt(X)
  mu_robust <- moments_Studentt$mu
  Sigma_robust <- moments_Studentt$cov
  
  # compute MSEs
  MSE_sm_T <- c(MSE_sm_T,  sum((mu_sm - mu)^2))
  MSE_scm_T <- c(MSE_scm_T, norm(Sigma_scm - Sigma, "F")^2)
  MSE_mu_robust_T <- c(MSE_mu_robust_T,  sum((mu_robust - mu)^2))
  MSE_Sigma_robust_T <- c(MSE_Sigma_robust_T, norm(Sigma_robust - Sigma, "F")^2)
}

# MSE plots
MSE_Sigma_T <- cbind("SCM"     = MSE_scm_T, 
                     "robust"  = MSE_Sigma_robust_T)
MSE_mu_T <- cbind("sample mean" = MSE_sm_T, 
                  "robust"      = MSE_mu_robust_T)
rownames(MSE_mu_T) <- rownames(MSE_Sigma_T) <- T_sweep

ggplot(melt(MSE_mu_T), aes(x = Var1, y = value, col = Var2, shape = Var2)) + 
  geom_line() + geom_point() +
  theme(legend.title = element_blank()) +
  ggtitle(sprintf("MSE of the estimation of mu (N = %d)", N)) + xlab("T") + ylab("MSE")

ggplot(melt(MSE_Sigma_T), aes(x = Var1, y = value, col = Var2, shape = Var2)) + 
  geom_line() + geom_point() +
  theme(legend.title = element_blank()) +
  ggtitle(sprintf("MSE of the estimation of Sigma (N = %d)", N)) + xlab("T") + ylab("MSE")

R session: Cleaning of outliers in time series

Let’s start by loading the prices of the Hang Seng Index (HSI):

library(xts)
library(quantmod)
# load data
prices_HSI <- Ad(getSymbols("^HSI", from = "2017-01-01", to = "2017-08-01", auto.assign = FALSE))
y <- log(prices_HSI)

# introduce some artificial outliers
y["2017-04-03"] <- 1.05*y["2017-04-03"]
y["2017-07-03"] <- 0.97*y["2017-07-03"]
plot(y, main = "Log-prices of HSI")

Let’s try to clean the outliers with the R package tsoutliers:

library(tsoutliers)
R>> Registered S3 methods overwritten by 'forecast':
R>>   method             from    
R>>   fitted.fracdiff    fracdiff
R>>   residuals.fracdiff fracdiff
library(ggplot2)
library(broom)

# clean outliers
tso_output <- tsoutliers::tso(as.ts(y), types = "AO", maxit = 1,
                              tsmethod = "arima", args.tsmethod = list(order=c(0, 1, 0)) )
print(tso_output)
R>> 
R>> Call:
R>> list(method = NULL)
R>> 
R>> Coefficients:
R>>         AO63    AO122
R>>       0.5051  -0.2966
R>> s.e.  0.0046   0.0046
R>> 
R>> sigma^2 estimated as 4.239e-05:  log likelihood = 513.38,  aic = -1020.76
R>> 
R>> Outliers:
R>>   type ind time coefhat  tstat
R>> 1   AO  63   63  0.5051 109.69
R>> 2   AO 122  122 -0.2966 -64.42
# plot
y_tso <- xts(tso_output$yadj, index(y))
outliers <- y[tso_output$outliers$ind]
ggplot() +
  geom_line(data = tidy(y), aes(x = index, y = value), col = "grey") +
  geom_line(data = tidy(y_tso), aes(x = index, y = value), col = "black", size = 0.6) +
  geom_point(data = tidy(outliers), aes(x = index, y = value), col = "red") +
  ggtitle("Cleaning of outliers for SP500") + xlab(NULL) + ylab("log-price") +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  theme_bw()

It works very well.

We can also try the function Return.clean() from the package PerformanceAnalytics (although the results are not satisfactory):

library(PerformanceAnalytics)

# we need to compute the returns first (note that a single outlier or NA in the prices becomes two in the returns!)
logreturns <- Return.calculate(y)
plot(logreturns, main = "Log-returns of HSI")

# clean outliers
logreturns_boudt <- as.xts(Return.clean(logreturns, method = "boudt", alpha = 0.01))
logreturns_geltner <- as.xts(Return.clean(logreturns, method = "geltner", alpha = 0.01))  #geltner works better and leaves the NAs

# plots
outliers <- logreturns[c(tso_output$outliers$ind, 1 + tso_output$outliers$ind)]
ggplot() +
  geom_line(data = tidy(logreturns), aes(x = index, y = value), col = "grey") +
  geom_line(data = tidy(logreturns_boudt), aes(x = index, y = value), col = "black", size = 0.6) +
  geom_point(data = tidy(outliers), aes(x = index, y = value), col = "red") +
  ggtitle("Cleaning of outliers for SP500 (with Boudt method)") + xlab(NULL) + ylab("log-return") +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  theme_bw()

ggplot() +
  geom_line(data = tidy(logreturns), aes(x = index, y = value), col = "grey") +
  geom_line(data = tidy(logreturns_geltner), aes(x = index, y = value), col = "black", size = 0.6) +
  geom_point(data = tidy(outliers), aes(x = index, y = value), col = "red") +
  ggtitle("Cleaning of outliers for SP500 (with Gertner method)") + xlab(NULL) + ylab("log-return") +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  theme_bw()

Imputation of Missing Values

Data with missing values

  • In theory, data is typically assumed complete and algorithms are designed for complete data.
  • In practice, however, often data has missing values, due to a variety of reasons.
  • Then the algorithms designed for complete data can be disastrous!
  • Missing values typically happen during the data observation or recording process (Little and Rubin 2002):
    • values may not be measured,
    • values may be measured but get lost, or
    • values may be measured but are considered unusable.

Data with missing values

  • Some real-world cases where missing values occur:
    • some stocks may suffer a lack of liquidity resulting in no transaction and hence no price recorded
    • observation devices like sensors may break down during the measurement
    • weather or other conditions may disturb sample taking schemes
    • in industrial experiments some results may be missing because of mechanical breakdowns unrelated to the experimental process
    • in an opinion survey some individuals may be unable to express a preference for one candidate over another
    • respondents in a survey may not answer every question
    • countries may not collect statistics every year
    • archives may simply be incomplete
    • subjects may drop out of panels.

What is imputation?

  • How can we cope with data with missing values?

  • One option is to design processing algorithms that can accept missing values directly, but this has to be done on a case-by-case basis and is expensive.

  • Another option is imputation: filling in those missing values based on some properties of the data. After that, processing algorithms for complete data can be safely used.

  • However, imputation is not magic: one has to rely on some structural property of the data, such as a temporal structure.

  • There are many imputation techniques: many are heuristic (and can do more harm than good), while some have a sound statistical foundation.

  • Many works assume a Gaussian distribution, which doesn’t hold in many applications.

    👉 We will focus on statistically sound methods for time series imputation under heavy-tailed distributions.

Netflix problem: Low-rank matrix completion

  • In big data system analytics, it is often the case that the high-dimensional data matrix lies in a low-dimensional subspace.
  • A popular example is the Netflix problem where the data matrix contains movie ratings by users and is extremely sparse: \[\begin{array}{cc} & {\rm Movies}\\ \mathbf{X} = & \left[\begin{array}{cccccc} 2 & 3 & {\color{red}?} & {\color{red}?} & 5 & {\color{red}?}\\ 1 & {\color{red}?} & {\color{red}?} & 4 & {\color{red}?} & 3\\ {\color{red}?} & {\color{red}?} & 3 & 2 & {\color{red}?} & 5\\ 4 & {\color{red}?} & 3 & {\color{red}?} & 2 & 4 \end{array}\right]\,{\rm Users} \end{array}\]
  • In 2009, the Netflix prize of US$1M was awarded to a method based, among other techniques, on the low-rank property: \[\mathbf{X} = \mathbf{A}\mathbf{B}^T\] where both \(\mathbf{A}\) and \(\mathbf{B}\) are extremely thin matrices.
  • This low-rank property can be used to impute missing values.
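
To make this concrete, here is a minimal sketch of low-rank matrix completion using the CRAN package softImpute (not the method used in the actual Netflix prize); the toy ratings matrix, the maximum rank, and the regularization parameter are illustrative choices:

library(softImpute)

# toy ratings matrix with missing entries (NA = not rated)
X <- rbind(c( 2,  3, NA, NA,  5, NA),
           c( 1, NA, NA,  4, NA,  3),
           c(NA, NA,  3,  2, NA,  5),
           c( 4, NA,  3, NA,  2,  4))

# fit a low-rank model X ~ A B^T (rank at most 2, with a small amount of regularization)
fit <- softImpute(X, rank.max = 2, lambda = 0.1)

# fill in the missing entries from the low-rank fit
X_completed <- complete(X, fit)
round(X_completed, 1)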

Inpainting in image processing

  • In image processing, a popular problem is that of images with missing blocks of pixels:

  • In this case, one can use the natural structure of images, e.g., small gradient or a dictionary of small structures commonly appearing.
  • Total variation is a common technique that imputes the missing pixels by ensuring a small \(\ell_1\)-norm of the gradient.
  • Learning an overcomplete dictionary allows for imputing blocks of pixels based on the dictionary.

Time series with structure

  • In some applications, one of the dimensions of the data matrix is time.
  • The time dimension sometimes has some specific structure on the distribution of the missing values, like the monotone missing pattern.
  • The time dimension can also follow some structural model that can be effectively used to fill in the missing values.
  • One simple example of time structure is the random walk, which is pervasive in financial applications (e.g., log-returns of stocks): \[y_t = \phi_0 + y_{t-1} + \varepsilon_t.\]
  • Another example of time structure is the AR(\(p\)) model (e.g., traded log-volume of stocks): \[y_t = \phi_0 + \sum_{i=1}^p \phi_i y_{t-i} + \varepsilon_t.\]
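
As a quick illustration (a minimal sketch with arbitrarily chosen parameters), the following code simulates both structures:

set.seed(42)
T <- 200

# random walk: y_t = phi0 + y_{t-1} + eps_t (e.g., log-prices)
phi0 <- 0.01
y_rw <- phi0 * (1:T) + cumsum(rnorm(T, sd = 0.1))

# stationary AR(2) around a constant level (e.g., traded log-volume)
y_ar2 <- 1 + arima.sim(model = list(ar = c(0.5, 0.3)), n = T, sd = 0.1)

par(mfrow = c(2, 1))
plot(y_rw,  type = "l", main = "Random walk")
plot(y_ar2, type = "l", main = "AR(2)")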

Simple imputation methods

Some simple and naive approaches for imputation are:

  • hot deck imputation: recorded units in the sample are used to substitute values

  • mean imputation: means from sets of recorded values are substituted

    👉 Sounds like a good idea but it distorts the empirical distribution of the sampled values (biases in variances and covariances) 😱 \[y_1, y_2, \ldots, y_{k_1}, {\sf NA}_1, \ldots, {\sf NA}_{k_2} \rightarrow \hat{\sigma}^2=\frac{1}{k_1}\sum_{i=1}^{k_1} (y_i-\hat{\mu})^2\] \[y_1, y_2, \ldots, y_{k_1}, \hat{\mu}, \ldots, \hat{\mu} \rightarrow \hat{\sigma}^2=\frac{1}{k_1+k_2}\left(\sum_{i=1}^{k_1} (y_i-\hat{\mu})^2 + \sum_{i=1}^{k_2}0\right)\] (see the numerical illustration after this list)

  • regression imputation: the missing variables for a unit are estimated by predicted values from the regression on the known variables for that unit.
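
As a quick numerical check of the distortion caused by mean imputation (a minimal sketch with simulated Gaussian data), compare the variance computed on the observed values with the variance after mean imputation:

set.seed(42)
y <- rnorm(100)                     # complete data with variance 1
y_miss <- y
y_miss[81:100] <- NA                # 20% missing values

# variance using only the observed values (close to 1)
var(y_miss, na.rm = TRUE)

# variance after mean imputation: artificially shrunk towards zero
y_mean_imp <- y_miss
y_mean_imp[is.na(y_mean_imp)] <- mean(y_miss, na.rm = TRUE)
var(y_mean_imp)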

Machine learning imputation methods

  • Machine learning has been extensively used for imputation of any mixed type of variables, nonlinear relations, complex interactions and high dimensionality. (Hastie et al. 2001)
  • The basic idea (aka regression imputation) is, for each variable, to fit some black-box model on the observed entries and then predict the missing ones.
  • For example, the R package missForest uses random forests for this purpose (see the sketch after this list).
  • Another typical method is based on the \(k\)-Nearest Neighbor.
  • Observe that there is no explicit structure exploited (like low-rank or temporal structure); instead, one relies on the machine learning black box to infer whatever structure may be present.
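
For illustration, here is a minimal sketch using missForest on the sleep dataset from the package VIM (the same dataset used later in the multivariate R session); all settings are the package defaults:

library(missForest)

# dataset with missing values (from the VIM package)
data(sleep, package = "VIM")

# random-forest imputation of all variables
set.seed(42)
imp <- missForest(sleep)

head(imp$ximp)    # completed data
imp$OOBerror      # out-of-bag estimate of the imputation error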

Naive vs sound imputation methods

  • Ad-hoc methods of imputation can lead to serious biases in variances and covariances.

  • Examples are:

    • mean imputation
    • constant interpolation
    • linear interpolation
    • polynomial interpolation
    • spline interpolation
  • A sound imputation method should preserve the statistics of the observed values.

    👉 The statistical way is to first estimate the distribution of the missing values conditional on the observed values \(f\;(\mathbf{y}_{\sf miss} | \mathbf{y}_{\sf obs})\) and then impute based on that posterior distribution.
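
As a toy example of this statistical approach (a minimal sketch assuming a Gaussian random walk with known \(\sigma\)): for a random walk with i.i.d. \(\mathcal{N}(0,\sigma^2)\) innovations, a single missing value \(y_t\) conditional on its neighbors is Gaussian with mean \((y_{t-1}+y_{t+1})/2\) and variance \(\sigma^2/2\), so we can impute by drawing from that conditional distribution:

set.seed(42)
sigma <- 0.1
y <- cumsum(rnorm(100, sd = sigma))   # Gaussian random walk
t_miss <- 50
y_obs <- y
y_obs[t_miss] <- NA                   # one missing value

# conditional (posterior) distribution of the missing value given its neighbors
cond_mean <- (y_obs[t_miss - 1] + y_obs[t_miss + 1]) / 2
cond_sd   <- sigma / sqrt(2)

# impute by drawing from the conditional distribution
y_imp <- y_obs
y_imp[t_miss] <- rnorm(1, mean = cond_mean, sd = cond_sd)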

Naive vs sound time series imputation methods

Illustration of different naive imputation methods and a sound statistical method that preserves the statistics:

Single and multiple imputation

  • Suppose we have somehow estimated the conditional distribution \(f\;(\mathbf{y}_{\sf miss} | \mathbf{y}_{\sf obs})\).
  • At this point it is trivial to randomly generate the missing values from that distribution: \[\mathbf{y}_{\sf miss} \sim f\;(\mathbf{y}_{\sf miss}|\mathbf{y}_{\sf obs}).\]
  • This only gives you one realization of the missing values.
  • In some applications, one would like to have multiple realizations of the missing values to properly test the performance of some subsequent methods or algorithms.
  • Multiple imputation (MI) consists of generating multiple realizations of the missing values:\[\mathbf{y}_{\sf miss}^{(k)} \sim f\;(\mathbf{y}_{\sf miss}|\mathbf{y}_{\sf obs}) \qquad \forall k=1, \ldots, K.\]
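
A minimal sketch for the iid Gaussian case (where the conditional distribution of the missing values given the observed ones is just the marginal with parameters estimated from the observed samples):

set.seed(42)
y <- rnorm(100, mean = 1, sd = 2)
y[91:100] <- NA                      # 10 missing values
idx_miss <- which(is.na(y))

# estimate the distribution from the observed values
mu_hat    <- mean(y, na.rm = TRUE)
sigma_hat <- sd(y, na.rm = TRUE)

# multiple imputation: K independent realizations of the missing values
K <- 5
y_MI <- replicate(K, {
  y_k <- y
  y_k[idx_miss] <- rnorm(length(idx_miss), mean = mu_hat, sd = sigma_hat)
  y_k
})
dim(y_MI)   # 100 x K: one imputed series per column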

Univariate and multivariate imputation in R

  • R has several packages for the problem of imputation of missing values (CRAN Task View: Missing Data).
  • Most of them are for the multivariate data case:
    • AMELIA (Honaker et al., 2011),
    • mice (van Buuren and Groothuis-Oudshoorn, 2011),
    • VIM (Kowarik and Templ, 2016),
    • missForest (Stekhoven, 2011), and
    • missMDA (Josse and Husson, 2016).
  • Multivariate imputation methods typically cannot work for univariate data because they impute one variable based on the other variables.
  • For the univariate data case:
    • imputeTS (Moritz, 2017): it includes a variety of simple imputation schemes for univariate time series (but the imputation does not preserve the original statistics of the data). See the R Journal paper on the package.
    • imputeFin (Liu and Palomar, 2019): it is specifically designed for time series that fit an AR model (so it is specialized in financial time series) and it preserves the original statistics of the data.

R session: Univariate imputation methods

In this session, we will use the package imputeTS to illustrate some simple univariate imputation methods:

library(imputeTS)

For this we first need to create an example input series with missing data:

# Create a short example time series with missing values
x <- ts(c(1, 2, 3, 4, 5, 6, 7, 8, NA, NA, 11, 12))

On this time series we can apply different imputation algorithms. We start with using na.mean(), which substitutes the NAs with mean values:

# Impute the missing values with na.mean()
na.mean(x)
R>> Time Series:
R>> Start = 1 
R>> End = 12 
R>> Frequency = 1 
R>>  [1]  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0  5.9  5.9 11.0 12.0

Most of the functions also have additional options that provide further algorithms (of the same algorithm category). In the example below it can be seen that na.mean() can also be called with option = "median", which substitutes the NAs with median values:

# Impute the missing values with na.mean() using option median
na.mean(x, option = "median")
R>> Time Series:
R>> Start = 1 
R>> End = 12 
R>> Frequency = 1 
R>>  [1]  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0  5.5  5.5 11.0 12.0

While na.interpolation() and all other imputation functions are used the same way, the results produced may be different. As can be seen below, for this series linear interpolation gives more reasonable results:

# Impute the missing values with na.interpolation()
na.interpolation(x)
R>> Time Series:
R>> Start = 1 
R>> End = 12 
R>> Frequency = 1 
R>>  [1]  1  2  3  4  5  6  7  8  9 10 11 12

For longer and more complex time series (with trend and seasonality), it is always a good idea to try na.kalman() and na.seadec(), since these functions very often produce the best results. They are called in the same way as all the other imputation functions. Consider the time series dataset tsAirgap provided in the package. We can use the functions plotNA.distribution() and plotNA.distributionBar() to visualize the distribution of missing values within a time series:

# Visualize the missing values in this time series
plotNA.distribution(tsAirgap)

# Visualize the missing values in this time series
plotNA.distributionBar(tsAirgap, breaks = 20)

We can also use the function statsNA() to print summary stats about the distribution of missing values in a univariate time series:

statsNA(tsAirgap)
R>> [1] "Length of time series:"
R>> [1] 144
R>> [1] "-------------------------"
R>> [1] "Number of Missing Values:"
R>> [1] 13
R>> [1] "-------------------------"
R>> [1] "Percentage of Missing Values:"
R>> [1] "9.03%"
R>> [1] "-------------------------"
R>> [1] "Stats for Bins"
R>> [1] "  Bin 1 (36 values from 1 to 36) :      4 NAs (11.1%)"
R>> [1] "  Bin 2 (36 values from 37 to 72) :      1 NAs (2.78%)"
R>> [1] "  Bin 3 (36 values from 73 to 108) :      5 NAs (13.9%)"
R>> [1] "  Bin 4 (36 values from 109 to 144) :      3 NAs (8.33%)"
R>> [1] "-------------------------"
R>> [1] "Longest NA gap (series of consecutive NAs)"
R>> [1] "3 in a row"
R>> [1] "-------------------------"
R>> [1] "Most frequent gap size (series of consecutive NA series)"
R>> [1] "1 NA in a row (occuring 10 times)"
R>> [1] "-------------------------"
R>> [1] "Gap size accounting for most NAs"
R>> [1] "1 NA in a row (occuring 10 times, making up for overall 10 NAs)"
R>> [1] "-------------------------"
R>> [1] "Overview NA series"
R>> [1] "  1 NA in a row: 10 times"
R>> [1] "  3 NA in a row: 1 times"

Here is a usage example for the na.kalman() function:

# Impute the missing values with na.kalman
x_imp <- na.kalman(tsAirgap)

# Code for visualization
plotNA.imputations(tsAirgap, x_imp, tsAirgapComplete)

As can be seen, na.kalman() provides really good results for this series, which contains a strong seasonality and a strong trend.

However, for financial time series, the package imputeTS does not seem very appropriate, especially for consecutive NAs. Let’s download some financial data and artificially generate NAs:

library(quantmod)
# download data from YahooFinance
y <- log(Ad(getSymbols("^GSPC", from = "2012-01-01", to = "2015-07-08", auto.assign = FALSE)))
T <- nrow(y)

# create NA's
miss_pct <- 0.2
n_miss <- floor(miss_pct*T)
idx_miss <- round(T/2) + 1:n_miss
y[idx_miss] <- NA

# plot
plot(y, main = "Log-prices of SP500 with a block of NAs")

Let’s impute with package imputeTS:

y_locf <- na.locf(y)
y_linear <- na.interpolation(y, option = "linear")
y_spline <- na.interpolation(y, option = "spline")
y_stine <- na.interpolation(y, option = "stine")
y_kalman <- na.kalman(y)

# plots
idx_miss_bis <- (min(idx_miss)-1):(max(idx_miss)+1)

{ plot(y, main = "Imputation - LOCF")
  lines(y_locf[idx_miss_bis], col = "blue", lwd = 2) }

{ plot(y, main = "Imputation - linear")
  lines(y_linear[idx_miss_bis], col = "blue", lwd = 2) }

{ plot(y, main = "Imputation - spline")
  lines(y_spline[idx_miss_bis], col = "blue", lwd = 2) }

{ plot(y, main = "Imputation - stine")
  lines(y_stine[idx_miss_bis], col = "blue", lwd = 2) }

{ plot(y, main = "Imputation - Kalman")
  lines(y_kalman[idx_miss_bis], col = "blue", lwd = 2) }

For time series, it is better to use the package imputeFin:

library(imputeFin)

y_ar1 <- imputeAR1Gaussian(y)

# plot
{ plot(y, main = "Imputation - AR(1)")
  lines(y_ar1[idx_miss_bis], col = "blue", lwd = 2) }

R session: Multivariate imputation

In this session, we will use the package VIM since it provides many functions to explore and analyze the structure of missing or imputed values using graphical methods. Knowledge of this structure helps to identify the mechanism that generates the missing values, as well as errors that may have occurred in the imputation process. It implements three imputation methods:

  • \(k\)-Nearest Neighbor imputation
  • Hotdeck imputation
  • IRMI (iterative robust model-based imputation)

library(VIM)

# show the dataset with missing values
str(sleep)
R>> 'data.frame':   62 obs. of  10 variables:
R>>  $ BodyWgt : num  6654 1 3.38 0.92 2547 ...
R>>  $ BrainWgt: num  5712 6.6 44.5 5.7 4603 ...
R>>  $ NonD    : num  NA 6.3 NA NA 2.1 9.1 15.8 5.2 10.9 8.3 ...
R>>  $ Dream   : num  NA 2 NA NA 1.8 0.7 3.9 1 3.6 1.4 ...
R>>  $ Sleep   : num  3.3 8.3 12.5 16.5 3.9 9.8 19.7 6.2 14.5 9.7 ...
R>>  $ Span    : num  38.6 4.5 14 NA 69 27 19 30.4 28 50 ...
R>>  $ Gest    : num  645 42 60 25 624 180 35 392 63 230 ...
R>>  $ Pred    : int  3 3 1 5 3 4 1 4 1 1 ...
R>>  $ Exp     : int  5 1 1 2 5 4 1 5 2 1 ...
R>>  $ Danger  : int  3 3 1 3 4 4 1 4 1 1 ...
# plot missingness pattern (blue for observed and red for missing values)
aggr(sleep, numbers = TRUE, prop = c(TRUE, FALSE))

# impute via kNN
sleep_kNN <- kNN(sleep)
sleep_kNN_part <- kNN(sleep, variable = c("Sleep", "Dream", "NonD"), dist_var = colnames(sleep))

# plot missingness and imputation pattern (orange is for imputed values and red for missing values)
aggr(sleep_kNN_part, delimiter = "_imp", numbers = TRUE, prop = c(TRUE, FALSE))

We can also plot a modified boxplot with imputed missings:

sleep_vars <- c("Dream", "NonD", "Sleep", "Dream_imp", "NonD_imp", "Sleep_imp")
pbox(sleep_kNN_part[, sleep_vars], delimiter = "_imp", selection = "any", interactive = FALSE)

The marginplot is an enhancement to the normal scatterplot, where imputed values are highlighted for each variable:

str(tao)
R>> 'data.frame':   736 obs. of  8 variables:
R>>  $ Year            : int  1997 1997 1997 1997 1997 1997 1997 1997 1997 1997 ...
R>>  $ Latitude        : int  0 0 0 0 0 0 0 0 0 0 ...
R>>  $ Longitude       : int  -110 -110 -110 -110 -110 -110 -110 -110 -110 -110 ...
R>>  $ Sea.Surface.Temp: num  27.6 27.6 27.6 27.6 27.6 ...
R>>  $ Air.Temp        : num  27.1 27 27 26.9 26.8 ...
R>>  $ Humidity        : num  79.6 75.8 76.5 76.2 76.4 76.7 76.5 78.3 78.6 76.9 ...
R>>  $ UWind           : num  -6.4 -5.3 -5.1 -4.9 -3.5 -4.4 -2 -3.7 -4.2 -3.6 ...
R>>  $ VWind           : num  5.4 5.3 4.5 2.5 4.1 1.6 3.5 4.5 5 3.5 ...
aggr(tao, numbers = TRUE, prop = c(TRUE, FALSE))

tao_kNN <- kNN(tao)
tao_vars <- c("Air.Temp", "Humidity", "Air.Temp_imp", "Humidity_imp")
marginplot(tao_kNN[, tao_vars], delimiter = "imp", alpha = 0.6)

The variable Air.Temp is clearly separated into two clusters. Furthermore, the plot reveals that the imputed values of Humidity only occur in the first cluster (values with low Air.Temp and high Humidity) and that they are imputed with values between 84 and 91. On the other hand, most of the imputed values of the variable Air.Temp have been imputed with a value between 26 and 27 and mostly occur in the second cluster (values with high Air.Temp).

We can plot marginplots of data imputed with different imputation methods:

library(e1071)
tao_kNN <- kNN(tao, k = 5)
tao_mean <- as.data.frame(impute(tao, what = "mean"))
tao_mean <- cbind(tao_mean, tao_kNN[, c("Air.Temp_imp","Humidity_imp")])
vars <- c("Air.Temp", "Humidity", "Air.Temp_imp", "Humidity_imp")

par(mfrow = c(1, 2))
marginplot(tao_kNN[, vars], delimiter = "imp", alpha = 0.6, main = "kNN")
marginplot(tao_mean[, vars], delimiter = "imp", alpha = 0.6, main = "mean")

We can also plot a scatterplot with imputed missings:

vars <- c("Air.Temp", "Humidity", "Air.Temp_imp", "Humidity_imp")
scattMiss(tao_kNN[, vars], delimiter = "imp", alpha = 0.6, interactive = FALSE)

We now show a marginplot matrix:

vars <- c("Air.Temp", "Humidity", "Sea.Surface.Temp", 
          "Air.Temp_imp", "Humidity_imp", "Sea.Surface.Temp_imp")
marginmatrix(tao_kNN[, vars], delimiter = "_imp", alpha=0.6)

as well as a scatterplot matrix with imputed missings:

vars <- c("Air.Temp", "Humidity", "Sea.Surface.Temp", 
          "Air.Temp_imp", "Humidity_imp", "Sea.Surface.Temp_imp")
scattmatrixMiss(tao_kNN[, vars], highlight = "Air.Temp", delimiter = "_imp", alpha=0.6, interactive = FALSE)

Finally, the matrix plot is a very useful multivariate plot, it helps to detect multivariate dependencies and patterns:

matrixplot(sleep_kNN, delimiter = "_imp", sortby = "Span", interactive = FALSE)

The mosaic plot is a graphical representation of multi-way contingency tables; it is therefore mainly intended for categorical variables. The frequencies of the different cells are visualized by area-proportional rectangles (tiles). To construct a mosaic plot, a rectangle is first split vertically at positions corresponding to the relative frequencies of the categories of a corresponding variable. Then the resulting smaller rectangles are again subdivided according to the conditional probabilities of a second variable.

mosaicMiss(sleep_kNN, highlight=3, plotvars=8:10, delimiter="_imp", miss.labels=FALSE)

It may occur that the missingness structure is affected by spatial patterns. The map of imputed missings is designed to explore this:

vars <- c("As", "Bi", "Ca", "As_imp", "Bi_imp")
chorizon_kNN <- kNN(chorizonDL, variable = c("Ca", "As", "Bi"), dist_var = c("Hg", "K"))
coo <- chorizon_kNN[, c("XCOO", "YCOO")]
mapMiss(chorizon_kNN[, vars], coo, map = kola.background, delimiter = "_imp", alpha = 0.6, interactive = FALSE)

Like the map of imputed missings, this method is also designed for geographical data. However, the values of the variable of interest are represented by growing dots instead of normal points. Imputed values are highlighted again.

growdotMiss(chorizon_kNN[, vars], coo, kola.background, delimiter = "_imp", alpha = 0.6, interactive = FALSE)

Estimation of parameters for iid Gaussian

  • Suppose a univariate random variable \(y\) follows a Gaussian distribution: \[y\sim\mathcal{N}\left(\mu,\sigma^{2}\right).\]
  • Given \(T\) incomplete samples \(\left\{ y_t\right\}\) and assuming the missing mechanism is ignorable (aka MAR), the ML estimation problem is formulated as
    \[\begin{array}{cc} \underset{\mu,\sigma^{2}}{\sf{maximize}} & \log\left(\prod_{t\in C_{\sf obs}}f_G\left(y_{t};\mu,\sigma^{2}\right)\right) \end{array},\] where \(C_{\sf obs}\) is the set of indexes of observed samples, and \(f_G\left(\cdot\right)\) is the Gaussian pdf.
  • Closed-form solution: 😃 \[\hat{\mu}=\frac{1}{n_{obs}}\sum_{t\in C_{\sf obs}}y_{t}\] and \[\hat{\sigma}^{2}=\frac{1}{n_{\sf obs}}\sum_{t\in C_{\sf obs}}\left(y_{t}-\hat{\mu}\right)^{2}.\]
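
These closed-form estimators simply amount to computing the sample mean and the (biased, i.e., \(1/n_{\sf obs}\)) sample variance over the observed samples; a minimal sketch with simulated data:

set.seed(42)
y <- rnorm(200, mean = 3, sd = 2)
y[sample(200, 40)] <- NA              # 20% of the samples missing at random

y_obs <- y[!is.na(y)]
n_obs <- length(y_obs)

mu_hat     <- mean(y_obs)                        # ML estimate of mu
sigma2_hat <- sum((y_obs - mu_hat)^2) / n_obs    # ML estimate of sigma^2
c(mu_hat, sigma2_hat)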

Estimation of parameters for iid Student’s \(t\)

  • In many applications, the Gaussian distribution is not appropriate and a more realistic heavy-tailed distribution is necessary.
  • An example is in the financial returns of stocks:

Estimation of parameters for iid Student’s \(t\)

  • The Student’s \(t\)-distribution is a widely used heavy-tailed distribution.
  • Suppose \(y\) follows a Student’s \(t\)-distribution: \(y\sim t\left(\mu,\sigma^{2},\nu\right)\) with pdf \[f_{t}\left(y;\mu,\sigma^{2},\nu\right)=\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\sigma\Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{\left(y-\mu\right)^{2}}{\nu\sigma^{2}}\right)^{-\frac{\nu+1}{2}}.\]
  • Given the incomplete data set, the ML estimation problem for \(\boldsymbol{\theta}=\left(\mu,\sigma^{2},\nu\right)\) can be formulated as \[\begin{array}{cc} \underset{\mu,\sigma^{2},\nu}{\sf{maximize}} & \log\left(\prod_{t\in C_{obs}}f_{t}\left(y_{t};\mu,\sigma^{2},\nu\right)\right)\end{array}.\] No closed-form solution. 😞
  • Interestingly, the Student’s \(t\)-distribution can be represented as a Gaussian mixture: \[y_{t}|\tau_{t}\sim\mathcal{N}\left(\mu,\frac{\sigma^{2}}{\tau_{t}}\right),\quad \tau_{t}\sim{\sf{Gamma}}\left(\frac{\nu}{2},\frac{\nu}{2}\right),\] where \(\tau_t\) is the mixture weight.
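
We can verify this Gaussian-mixture representation numerically (a minimal sketch with arbitrary parameters): draw \(\tau_t\) from the Gamma distribution, then \(y_t\) conditionally Gaussian, and compare with direct Student’s \(t\) samples:

set.seed(42)
nu <- 5; mu <- 0; sigma2 <- 1
n <- 1e5

# Gaussian-mixture representation: tau ~ Gamma(nu/2, nu/2), y | tau ~ N(mu, sigma2/tau)
tau   <- rgamma(n, shape = nu/2, rate = nu/2)
y_mix <- rnorm(n, mean = mu, sd = sqrt(sigma2/tau))

# direct Student's t samples for comparison
y_t <- mu + sqrt(sigma2) * rt(n, df = nu)

# the two empirical distributions should coincide up to Monte Carlo error
qqplot(y_mix, y_t); abline(0, 1, col = "red")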

Estimation of parameters for iid Student’s \(t\)

We can use the expectation-maximization (EM) algorithm to solve this ML estimation problem by regarding \(\boldsymbol{\tau}_{\sf{obs}}=\left\{ \tau_{t}\right\} _{t\in C_{\sf{obs}}}\) as latent variables:

  • Expectation (E) - step: compute the expected complete data log-likelihood given the current estimates \[\begin{aligned} Q\left(\boldsymbol{\theta}|\boldsymbol{\theta}^{(k)}\right)= & {\sf{E}}_{f\;\left(\boldsymbol{\tau}_{\sf{obs}}\mid\mathbf{y}_{\sf{obs}},\boldsymbol{\theta}^{\left(k\right)}\right)}\left[\log\left(f\;\left(\mathbf{y}_{\sf{obs}},\boldsymbol{\tau}_{\sf{obs}}\mid\boldsymbol{\theta}\right)\right)\right]\\ = & -\sum_{t\in C_{\mathsf{\sf{obs}}}}\frac{w_{t}^{\left(k\right)}}{2\sigma^{2}}\left(y_{t}-\mu\right)^{2}-\frac{n_{\mathsf{\sf{obs}}}}{2}\log\left(\sigma^{2}\right)+n_{\mathsf{\sf{obs}}}\frac{\nu}{2}\log\left(\frac{\nu}{2}\right)\\ & -n_{\mathsf{\sf{obs}}}\log\left(\Gamma\left(\frac{\nu}{2}\right)\right)+\frac{\nu}{2}\sum_{t\in C_{\mathsf{\sf{obs}}}}\left(\delta_{t}^{\left(k\right)}-w_{t}^{\left(k\right)}\right)+const., \end{aligned}\] where \(w_{t}^{\left(k\right)}=\mathsf{E}\left[\tau_{t}\right]=\frac{\nu^{\left(k\right)}+1}{\nu^{\left(k\right)}+\left(y_{t}-\mu^{\left(k\right)}\right)^{2}/\left(\sigma^{\left(k\right)}\right)^{2}},\) \(\delta_{t}^{\left(k\right)}=\mathsf{E}\left[\log\left(\tau_{t}\right)\right]=\psi\left(\frac{\nu^{\left(k\right)}+1}{2}\right)-\log\left(\frac{\nu^{\left(k\right)}+\left(y_{t}-\mu^{\left(k\right)}\right)^{2}/\left(\sigma^{\left(k\right)}\right)^{2}}{2}\right).\)

Estimation of parameters for iid Student’s \(t\)

  • Maximization (M) - step: update the estimates as \[\boldsymbol{\theta}^{(k+1)}=\underset{\boldsymbol{\theta}}{\textrm{argmax}}\thinspace\thinspace Q\left(\boldsymbol{\theta}|\boldsymbol{\theta}^{(k)}\right)\] and has closed-form solution: 😃 \[\mu^{\left(k+1\right)}=\frac{\sum_{t\in C_{\sf{obs}}}w_{t}^{\left(k\right)}y_{t}}{\sum_{t\in C_{\sf{obs}}}w_{t}^{\left(k\right)}},\] \[\left(\sigma^{(k+1)}\right)^{2}=\frac{\sum_{t\in C_{\sf{obs}}}w_{t}^{\left(k\right)}\left(y_{t}-\mu^{\left(k+1\right)}\right)^{2}}{n_{\sf{obs}}},\]
    \[\begin{aligned} \nu^{(k+1)}=\underset{\nu>0}{\textrm{argmax}\thinspace\thinspace} & n_{\mathsf{\sf{obs}}}\left(\frac{\nu}{2}\log\left(\frac{\nu}{2}\right)-\log\left(\Gamma\left(\frac{\nu}{2}\right)\right)\right)\\ & +\frac{\nu}{2}\sum_{t\in C_{\mathsf{\sf{obs}}}}\left(\delta_{t}^{\left(k\right)}-w_{t}^{\left(k\right)}\right). \end{aligned}\]

Algorithm

EM algorithm for iid Student’s \(t\):

Initialize \(\mu^{\left(0\right)}\), \(\left(\sigma^{(0)}\right)^{2}\), and \(\nu^{(0)}\). Set \(k=0\).
repeat \[w_{t}^{\left(k\right)}=\frac{\nu^{\left(k\right)}+1}{\nu^{\left(k\right)}+\left(y_{t}-\mu^{\left(k\right)}\right)^{2}/\left(\sigma^{\left(k\right)}\right)^2},\] \[\mu^{\left(k+1\right)}=\frac{\sum_{t\in C_{\sf{obs}}}w_{t}^{\left(k\right)}y_{t}}{\sum_{t\in C_{\sf{obs}}}w_{t}^{\left(k\right)}},\] \[\left(\sigma^{(k+1)}\right)^{2}=\frac{\sum_{t\in C_{\sf{obs}}}w_{t}^{\left(k\right)}\left(y_{t}-\mu^{\left(k+1\right)}\right)^{2}}{n_{\sf{obs}}},\] \[\nu^{(k+1)}=\underset{\nu>0}{\textrm{argmax}}\thinspace\thinspace Q\left(\mu^{\left(k+1\right)},\left(\sigma^{(k+1)}\right)^{2},\nu\,\Big|\,\mu^{\left(k\right)},\left(\sigma^{\left(k\right)}\right)^{2},\nu^{\left(k\right)}\right)\] \(\qquad k \gets k+1\)
until convergence
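
For reference, here is a minimal R sketch of this EM algorithm (the initialization, the search interval for \(\nu\), and the convergence tolerance are arbitrary choices, not the implementation used in imputeFin):

# EM algorithm for iid Student's t with missing values (uses the observed samples only)
fit_iid_t <- function(y, max_iter = 100, tol = 1e-6) {
  y <- y[!is.na(y)]                            # keep the observed samples C_obs
  n <- length(y)
  mu <- median(y); sigma2 <- var(y); nu <- 4   # rough initialization
  for (k in 1:max_iter) {
    # E-step: expected weights E[tau_t] and E[log(tau_t)]
    d2    <- (y - mu)^2 / sigma2
    w     <- (nu + 1) / (nu + d2)
    delta <- digamma((nu + 1)/2) - log((nu + d2)/2)
    # M-step: closed-form updates for mu and sigma^2
    mu_new     <- sum(w * y) / sum(w)
    sigma2_new <- sum(w * (y - mu_new)^2) / n
    # M-step: 1-D maximization for nu
    Q_nu   <- function(nu) n * (nu/2 * log(nu/2) - lgamma(nu/2)) + nu/2 * sum(delta - w)
    nu_new <- optimize(Q_nu, interval = c(0.1, 100), maximum = TRUE)$maximum
    # stop when the parameter updates become negligible
    if (abs(mu_new - mu) + abs(sigma2_new - sigma2) + abs(nu_new - nu) < tol) break
    mu <- mu_new; sigma2 <- sigma2_new; nu <- nu_new
  }
  list(mu = mu_new, sigma2 = sigma2_new, nu = nu_new)
}

# quick sanity check on synthetic data (true values: mu = 1, sigma = 0.5, nu = 5)
set.seed(42)
y <- 1 + 0.5 * rt(500, df = 5)
y[sample(500, 100)] <- NA
fit_iid_t(y)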

Estimation of parameters for AR(1) Student’s \(t\)

  • Consider an AR(1) time series with innovations following a Student’s \(t\)-distribution: \[y_{t}=\varphi_{0}+\varphi_{1}y_{t-1}+\varepsilon_{t},\] where \(\varepsilon_t\sim t\left(0,\sigma^{2},\nu\right).\)

  • The ML estimation problem for \(\boldsymbol{\theta}=\left(\varphi_{0},\varphi_{1},\sigma^{2},\nu\right)\) is formulated as \[\begin{array}{cc} \underset{\varphi_{0},\varphi_{1},\sigma^{2},\nu}{\sf{maximize}} & \log\left(\int\prod_{t=2}^{T}f_{t}\left(y_{t};\varphi_{0}+\varphi_{1}y_{t-1},\sigma^{2},\nu\right){\sf d}\mathbf{y}_{\sf miss}\right)\end{array}\]

  • The objective function involves an integral, and we have no closed-form expression. 😱

  • Even if we consider the Gaussian mixture representation, we still cannot use the EM algorithm due to the complicated integrals appearing. 😱

  • The problem can be dealt with using a stochastic approximation of the expectation (stochastic EM) and then using a Markov chain Monte Carlo (MCMC) process (in particular, we can use the Gibbs sampling method to generate the Markov chain).

  • For details see the paper (Liu et al. 2019) which is implemented in the R package imputeFin.

Algorithm

Stochastic EM algorithm for AR(1) Student’s \(t\):

Initialize latent variables and set \(k=0\).
repeat

  • Simulation step: generate the samples \(\left(\boldsymbol{\tau}^{\left(k,l\right)},\mathbf{y}_{\sf{m}}^{\left(k,l\right)}\right) \left(l=1,2\ldots,L\right)\) from \(f\;\left(\mathbf{y}_{\sf miss},\boldsymbol{\tau}|\mathbf{y}_{\sf obs};\boldsymbol{\theta}^{\left(k\right)}\right)\) via Gibbs sampling (\(L\) parallel chains).

  • Approximation step: \[\hat{\mathbf{s}}^{\left(k\right)}=\hat{\mathbf{s}}^{\left(k-1\right)}+\gamma^{\left(k\right)}\left(\frac{1}{L}\sum_{l=1}^{L}\mathbf{s}\left(\mathbf{y}_{\sf obs},\mathbf{y}_{\mathsf{\sf miss}}^{\left(k,l\right)},\boldsymbol{\tau}^{\left(k,l\right)}\right)-\hat{\mathbf{s}}^{\left(k-1\right)}\right).\]

  • Maximization step: \(\boldsymbol{\theta}^{(k+1)}=\underset{\boldsymbol{\theta}}{\textrm{argmax}}\thinspace\thinspace\hat{Q}\left(\boldsymbol{\theta},\hat{\mathbf{s}}^{\left(k\right)}\right)\).

\(\qquad k \gets k+1\)
until convergence

Data Cleaning of Financial Time Series

Data cleaning of financial time series

  • Consider the S&P 500 between 2012-01-01 and 2015-07-08.
  • We will fit log-prices to an AR(1) model \(y_{t}=\phi_0+\phi_1 y_{t-1}+\varepsilon_{t}\) with the innovations following a Student \(t\) distribution.
  • In fact, we know that log-prices follow a random walk, so we expect \(\phi_1=1\).
  • Also, we know daily data is heavy-tailed (rather than Gaussian), so we expect a small value of the degrees of freedom \(\nu \approx 5\) (larger values mean that it becomes more Gaussian).
  • We will use the R package imputeFin.
  • Let’s see the previous stochastic EM algorithm in action…

Simple imputation methods

Recall the comparison between naive imputation methods and sound statistical methods:

Statistical parameter estimation for log-prices

Sound statistical method: estimation of parameters for AR(1) Student’s \(t\) with real data (S&P 500):

Numerical simulations: imputed or real?

R session: Data cleaning of financial time series

In this R session we will follow the complete procedure of data cleaning of a financial time series, namely, the S&P 500. We will first identify and remove the outliers and then impute the missing values (and removed outliers). We will use the R package imputeFin.

Let’s start by loading some market data and adding some outliers and missing values:

library(xts)
library(quantmod)

# download data from YahooFinance
y_orig <- log(Ad(getSymbols("^GSPC", from = "2012-01-01", to = "2015-07-08", auto.assign = FALSE)))
y <- y_orig
T <- nrow(y)

# create some outliers
idx_outliers <- round(c(T/4, 1.5*T/4, 3*T/4))
y[idx_outliers] <- 1.05*y[idx_outliers]

# create NA's
miss_pct <- 0.2
n_miss <- floor(miss_pct*T)
idx_miss <- round(T/2) + 1:n_miss
y[idx_miss] <- NA

# plot
plot(y, main = "Log-prices of SP500 with outliers and NAs")

The first step is to identify the outliers. We will fit the time series to some parametric model and then identify the samples that deviate from the model significantly. In particular, we will consider the AR(1) model: \[y_{t}=\phi_0+\phi_1 y_{t-1}+\varepsilon_{t},\] where the innovation \(\varepsilon_{t}\) follows a Student’s \(t\)-distribution \(\varepsilon_t\sim t\left(0,\sigma^{2},\nu\right),\) and then we will identify the innovation samples that deviate by more than \(3\sigma\)–\(5\sigma\) (essentially Grubbs’ test on the innovations of the parametric model).

It’s instructive to see the effect of the outliers if we don’t use a robust estimation method (i.e., assuming a Gaussian distribution for the innovation):

library(imputeFin)  # package to fit models with NAs and robust to outliers

# if we use a Gaussian (nonrobust)
res_y_orig_Gaussian <- estimateAR1Gaussian(y_orig, zero_mean = FALSE)
res_y_Gaussian <- estimateAR1Gaussian(y, zero_mean = FALSE)

cbind(res_y_orig_Gaussian, res_y_Gaussian)
R>>        res_y_orig_Gaussian res_y_Gaussian
R>> phi0   0.01486504          0.1612201     
R>> phi1   0.9980751           0.9783727     
R>> sigma2 5.489247e-05        0.001205647

The estimates of the parameters \(\sigma^2\) and \(\phi_0\) have been severely affected (\(\phi_1\) not so much).

Let’s try a robust estimation (i.e., assuming a Student’s \(t\)-distribution for the innovation):

res_y_orig_Studentt <- estimateAR1t(y_orig, zero_mean = FALSE)
res_y_Studentt <- estimateAR1t(y, zero_mean = FALSE)
cbind(res_y_orig_Studentt, res_y_Studentt)
R>>        res_y_orig_Studentt res_y_Studentt
R>> phi0   0.01090953          0.01049288    
R>> phi1   0.9986277           0.9986609     
R>> sigma2 3.703216e-05        2.741813e-05  
R>> nu     5.778217            2.2622

Much better!

Let’s now identify the outliers from the innovation:

# compute innovation and find outliers
innovation <- y - (res_y_Studentt$phi0 + res_y_Studentt$phi1 * lag(y))
idx_out_inno <- which(abs(innovation) > 5*sqrt(res_y_Studentt$sigma2))
{ plot(innovation, ylim = 1.2*c(min(innovation, na.rm=TRUE), max(innovation, na.rm=TRUE)))
  points(innovation[idx_out_inno], col = "red") }

Each outlier in the price affects two innovations, so let’s just keep the correct one and remove the outliers by setting them to NA:

idx_out_inno <- idx_out_inno[seq(from = 1, to = length(idx_out_inno), by = 2)]
y_nooutliers <- y
y_nooutliers[idx_out_inno] <- NA
plot(y_nooutliers, main = "Log-prices of SP500 after removing outliers but still with NAs")

Now let’s do a model refitting after removing the outliers:

res_y_Studentt <- estimateAR1t(y_nooutliers, zero_mean = FALSE)
cbind(res_y_orig_Studentt, res_y_Studentt)
R>>        res_y_orig_Studentt res_y_Studentt
R>> phi0   0.01090953          0.01307457    
R>> phi1   0.9986277           0.9983113     
R>> sigma2 3.703216e-05        3.937763e-05  
R>> nu     5.778217            5.95058

At this point, we are ready to do the imputation of the missing values:

y_imputed <- imputeAR1t(y_nooutliers)
plot(y_imputed, main = "Log-prices of SP500 after removing outliers and imputing NAs")

To summarize, let’s compare the original dirty time series with the clean one:

par(mfrow=c(2,1))
plot(y, main = "Log-prices of SP500 with outliers and NAs")
plot(y_imputed, main = "Log-prices of SP500 after removing outliers and imputing NAs")

Summary

  • Data scientists spend 80% of their time cleaning data and only 20% of their time actually analyzing it.

  • Three main components of data cleaning are

    • Data wrangling: mundane tasks of fixing the details of the data and putting data in the right structure and format.
    • Outlier detection: critical task of detecting outliers and removing them (otherwise the subsequent algorithms may totally fail).
    • Imputation of missing values: important task of imputing missing values (otherwise the subsequent algorithms may not even be executable).
  • Most textbooks on data analytics, big data, and machine learning totally ignore the data cleaning aspect (because it’s not sexy).

  • In fact, data cleaning can be very interesting if done properly in a systematic way (not if it has to be done by hand on a case by case basis).

  • In practice, it has to be done. Period.

Thanks

References

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning. Springer.

Little, R. J., & Rubin, D. B. (2002). Statistical Analysis with Missing Data (2nd Ed.). Hoboken, N.J.: John Wiley & Sons.

Liu, J., Kumar, S., & Palomar, D. P. (2019). Parameter estimation of heavy-tailed AR model with missing data via stochastic EM. IEEE Trans. Signal Processing, 67(8), 2159–2172. https://arxiv.org/pdf/1809.07203.pdf