Overview of simple outlier detection methods with their combination using dplyr and ruler packages.

Prologue

During data analysis, one of the most crucial steps is to identify and account for outliers: observations that are of an essentially different nature than most other observations. Their presence can lead to untrustworthy conclusions. The most complicated part of this task is to define the notion of "outlier". After that, it is straightforward to identify them based on the given data.

There are many techniques developed for outlier detection. The majority of them deal with numerical data. This post describes the most basic ones and their application using the dplyr and ruler packages.

After reading this post you will know:

- The most basic outlier detection techniques.
- A way to implement them using dplyr and ruler.
- A way to combine their results in order to obtain a new outlier detection method.
- A way to discover the notion of "diamond quality" without prior knowledge of this topic (as a happy consequence of the previous point).

Overview

We will perform an analysis with the goal of finding atypical diamonds listed in the diamonds dataset from the ggplot2 package. Here one observation represents one diamond and is stored as a row in a data frame.

We will do that by combining different outlier detection techniques to identify rows which are "strong outliers", i.e. which might be considered outliers based on several methods.

Packages required for this analysis:

    library(dplyr)
    library(tidyr)
    library(ggplot2)
    library(ruler)

Outlier detection methods

To do convenient outlier detection with ruler, it is better to define the notion of a non-outlier in the form of the rule "Observation is not an outlier if …". This way actual outliers are considered rule breakers, the objects of interest of the ruler package. Note that a definition of non-outlier is essentially a definition of outlier, because there are only two possibilities.
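To make the "rule breaker" idea concrete, here is a minimal base R sketch. It is not taken from the ruler package: the function name isnt_out_range, its bounds, and the toy values are all hypothetical and chosen only for illustration. The point is that a rule is a predicate returning TRUE for non-outliers, and outliers are obtained by negating it.

```r
# Hypothetical non-outlier rule (an assumption for illustration, not ruler API):
# a value is not an outlier if it lies inside the closed interval [lower, upper].
isnt_out_range <- function(x, lower, upper) {
  x >= lower & x <= upper
}

# Toy values, made up for this example only
carat <- c(0.3, 0.5, 0.7, 5.0)

# TRUE marks observations that follow the rule (non-outliers)
follows_rule <- isnt_out_range(carat, lower = 0.2, upper = 3)

# Outliers are the rule breakers, i.e. the negation of the rule
outliers <- carat[!follows_rule]
print(outliers)
```

Writing every method as a predicate of this shape means they all return logical vectors of the same length, which is what makes their results easy to combine later.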
Z-score

Z-score, also called a standard score, of an observation is (broadly speaking) its distance from the population center, measured in normalization units. The default choice for the center is the sample mean, and for the normalization unit it is the standard deviation.

Observation is not an outlier based on z-score if the absolute value of its default z-score is lower than some threshold (a popular choice is 3).

Here is the function for identifying non-outliers based on z-score:

    isnt_out_z [...] %>% compute_group_non_outliers()

    ## # A tibble: 276 x 22
    ##   group      carat_z depth_z table_z price_z x_z   y_z   z_z   carat_mad
    ## 1 Fair_D_I1  FALSE   TRUE    TRUE    TRUE    TRUE  TRUE  TRUE  FALSE
    ## 2 Fair_D_IF  TRUE    TRUE    TRUE    TRUE    TRUE  TRUE  TRUE  TRUE
    ## 3 Fair_D_SI1 TRUE    TRUE    TRUE    TRUE    TRUE  TRUE  TRUE  TRUE
    ## 4 Fair_D_SI2 TRUE    TRUE    TRUE    TRUE    TRUE  TRUE  TRUE  TRUE
    ## 5 Fair_D_VS1 TRUE    TRUE    TRUE    TRUE    TRUE  TRUE  TRUE  TRUE
    ## # ... with 271 more rows, and 13 more variables: depth_mad,
    ## #   table_mad, price_mad, x_mad, y_mad,
    ## #   z_mad, carat_tukey, depth_tukey, table_tukey,
    ## #   price_tukey, x_tukey, y_tukey, z_tukey

The result has outputs for 21 methods applied to the 276 groups. Their names are of the form <column name>_<method name>. So the name 'carat_z' is interpreted as the result of method 'z' for the summary function equal to the mean value of the 'carat' column. Column group defines the names of the groupings.

Exposure

Column-based and Mahalanobis-based definitions of non-outlier rows can be expressed with row packs, and group-based ones with group packs.

    row_packs_isnt_out [...] %>% transmute_if(is.numeric, isnt_out_funs),
      # Non-outliers based on Mahalanobis distance
      maha = . %>% transmute(maha = maha_dist(.)) %>%
        transmute_at(vars(maha = maha), isnt_out_funs)
    )
    group_packs_isnt_out