Combined outlier detection with dplyr and ruler

Overview of simple outlier detection methods and their combination using the dplyr and ruler packages.

Prologue

During data analysis, one of the most crucial steps is to identify and account for outliers: observations that differ essentially in nature from most other observations. Their presence can lead to untrustworthy conclusions. The most complicated part of this task is to define the notion of “outlier”. Once that is done, identifying outliers in given data is straightforward.

Many techniques have been developed for outlier detection. The majority of them deal with numerical data. This post describes the most basic ones and their application using the dplyr and ruler packages.

After reading this post you will know:

- The most basic outlier detection techniques.
- A way to implement them using dplyr and ruler.
- A way to combine their results to obtain a new outlier detection method.
- A way to discover the notion of “diamond quality” without prior knowledge of the topic (as a happy consequence of the previous point).

Overview

We will perform an analysis with the goal of finding atypical diamonds listed in the diamonds dataset from the ggplot2 package. Here one observation represents one diamond and is stored as a row in a data frame.

We will do that by combining different outlier detection techniques to identify rows which are “strong outliers”, i.e. which might be considered outliers based on several methods.

Packages required for this analysis:

    library(dplyr)
    library(tidyr)
    library(ggplot2)
    library(ruler)

Outlier detection methods

To do convenient outlier detection with ruler, it is better to define the notion of non-outlier in the form of a rule “Observation is not an outlier if …”. This way actual outliers are treated as rule breakers, which are the objects of interest of the ruler package. Note that a definition of non-outlier is essentially a definition of outlier, because there are only two possibilities.
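The “Observation is not an outlier if …” framing can be sketched as a function that returns TRUE for non-outliers, so that outliers are simply the elements where it returns FALSE. This is a minimal base R illustration; the rule `isnt_out_range` and the data are hypothetical and are not taken from the post.

```r
# Hypothetical rule (an assumption for illustration only):
# a value is NOT an outlier if it lies inside the interval [lower, upper].
isnt_out_range <- function(x, lower, upper) {
  x >= lower & x <= upper
}

# Made-up data with one value far from the rest
x <- c(1.2, 0.9, 1.1, 15, 1.0)

# TRUE marks rule followers (non-outliers)
isnt_out <- isnt_out_range(x, lower = 0, upper = 5)

# Outliers are exactly the rule breakers
is_out <- !isnt_out
x[is_out]
```

Writing the rule in its “non-outlier” form means that every method produces the same kind of output (a logical vector), which makes results from different methods easy to compare and combine.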
Z-score

Z-score, also called a standard score, of an observation is [broadly speaking] a distance from the population center measured in number of normalization units. The default choice for center is the sample mean and for normalization unit is the standard deviation.

Observation is not an outlier based on z-score if the absolute value of its default z-score is lower than some threshold (a popular choice is 3).

Here is the function for identifying non-outliers based on z-score:

    isnt_out_z […]

    […] %>% compute_group_non_outliers()
    ## # A tibble: 276 x 22
    ##   group      carat_z depth_z table_z price_z x_z   y_z   z_z   carat_mad
    ## 1 Fair_D_I1  FALSE   TRUE    TRUE    TRUE    TRUE  TRUE  TRUE  FALSE
    ## 2 Fair_D_IF  TRUE    TRUE    TRUE    TRUE    TRUE  TRUE  TRUE  TRUE
    ## 3 Fair_D_SI1 TRUE    TRUE    TRUE    TRUE    TRUE  TRUE  TRUE  TRUE
    ## 4 Fair_D_SI2 TRUE    TRUE    TRUE    TRUE    TRUE  TRUE  TRUE  TRUE
    ## 5 Fair_D_VS1 TRUE    TRUE    TRUE    TRUE    TRUE  TRUE  TRUE  TRUE
    ## # ... with 271 more rows, and 13 more variables: depth_mad,
    ## #   table_mad, price_mad, x_mad, y_mad, z_mad, carat_tukey,
    ## #   depth_tukey, table_tukey, price_tukey, x_tukey, y_tukey, z_tukey

The result has outputs for 21 methods applied to the 276 groups. Their names are of the form <column name>_<method name>. So the name ‘carat_z’ is interpreted as the result of method ‘z’ for the summary function equal to the mean value of the ‘carat’ column. Column group defines the names of the groupings.

Exposure

Column and Mahalanobis based definitions of non-outlier rows can be expressed with row packs, and group based definitions with group packs.

    row_packs_isnt_out […] %>% transmute_if(is.numeric, isnt_out_funs),
      # Non-outliers based on Mahalanobis distance
      maha = . %>% transmute(maha = maha_dist(.)) %>%
        transmute_at(vars(maha = maha), isnt_out_funs)
    )

    group_packs_isnt_out […]
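The idea behind a column based row pack, one logical non-outlier flag per row for each numeric column, can be sketched in base R without ruler. The body of `isnt_out_z` below is an assumption derived from the z-score rule as worded in this post (absolute z-score lower than a threshold of 3) and may differ from the post's own definition. The `toy` data frame and the `_z` naming are made up for illustration.

```r
# Assumed z-score rule: TRUE when |x - mean(x)| < thres * sd(x)
isnt_out_z <- function(x, thres = 3, na.rm = TRUE) {
  abs(x - mean(x, na.rm = na.rm)) < thres * sd(x, na.rm = na.rm)
}

# Toy data standing in for diamonds (values are made up)
toy <- data.frame(
  id = letters[1:12],
  a = c(rep(1, 11), 100),  # one extreme value in the last row
  b = seq(1, 12),          # no extreme values
  stringsAsFactors = FALSE
)

# Apply the rule to every numeric column, one logical column per input column
num_cols <- vapply(toy, is.numeric, logical(1))
flags <- as.data.frame(lapply(toy[num_cols], isnt_out_z))
names(flags) <- paste0(names(flags), "_z")

# Rows that break the rule for column 'a'
toy$id[!flags$a_z]
```

Each column of `flags` plays the role of one rule, and each FALSE is one rule breaker. Counting FALSE values across a row is one simple way to see how many methods agree that the row is an outlier.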