Outlier Treatment

Introduction

Data treatment is the process of altering indicators to improve their statistical properties, mainly for the purposes of aggregation. Data treatment is a delicate subject, because it essentially involves changing the values of certain observations, or transforming an entire distribution. Like any other step or assumption, any data treatment should be carefully recorded and its implications understood. Of course, data treatment does not have to be applied; it is simply another tool in your toolbox.

The Treat() function

The COINr function for treating data is called Treat(). This is a generic function with methods for coins, purses, data frames and numeric vectors. It is very flexible but this can add a layer of complexity. If you want to run mostly at default options, see the qTreat() function mentioned below in Simplified function.

The Treat() function operates a two-stage data treatment process, based on two data treatment functions (f1 and f2), and a pass/fail function f_pass which detects outliers. The arrangement of this function is inspired by a fairly standard data treatment process applied to indicators, which consists of checking skew and kurtosis, then, if the criteria are not met, applying Winsorisation up to a specified limit. Then, if Winsorisation still does not bring skew and kurtosis within limits, a nonlinear transformation such as log or Box-Cox is applied.

This function generalises this process by using the following general steps:

  1. Check if variable passes or fails using f_pass
  2. If f_pass returns FALSE, apply f1, else return x unmodified
  3. Check again using f_pass
  4. If f_pass still returns FALSE, apply f2
  5. Return the modified x as well as other information.

For the “typical” case described above, f1 is a Winsorisation function, f2 is a nonlinear transformation and f_pass is a skew and kurtosis check. However, any functions can be passed as f1, f2 and f_pass, which makes it a flexible tool that is also compatible with other packages.

Further details on how this works are given in the following sections.
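The general steps above can be sketched in a few lines of base R. This is only an illustration of the control flow, not COINr's implementation: treat_sketch(), cap_max() and range_ok() are hypothetical names invented here, and the pass/fail rule is a deliberately crude stand-in for a skew and kurtosis check.

```r
# Hypothetical sketch of a two-stage treatment (not COINr's actual code)
treat_sketch <- function(x, f1, f2, f_pass) {
  # step 1-2: if x already passes, return it unmodified
  if (f_pass(x)) {
    return(list(x = x, stage = "none"))
  }
  # step 2-3: apply the first treatment and check again
  x1 <- f1(x)
  if (f_pass(x1)) {
    return(list(x = x1, stage = "f1"))
  }
  # step 4-5: apply the second treatment to the original x and return
  list(x = f2(x), stage = "f2")
}

# toy stand-ins, assumptions for illustration only
cap_max <- function(x) {
  # reassign the largest point to the second-largest value
  x[which.max(x)] <- sort(x, decreasing = TRUE)[2]
  x
}
range_ok <- function(x) max(x) <= 3 * median(x)

res <- treat_sketch(c(1:10, 30, 100), f1 = cap_max, f2 = log, f_pass = range_ok)
res$stage
```

The returned list carries both the treated vector and a record of which stage was reached, mirroring the idea that the function returns the modified x together with other information.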

Numeric vectors

The clearest way to demonstrate the Treat() function is on a numeric vector. Let’s make a vector with a couple of outliers:

# numbers between 1 and 10
x <- 1:10

# two outliers
x <- c(x, 30, 100)

We can check the skew and kurtosis of this vector:

library(COINr)

skew(x)
#> [1] 3.063241

kurt(x)
#> [1] 9.741391

The skew and kurtosis are both high. If we follow the default limits in COINr (absolute skew capped at 2, and kurtosis capped at 3.5), this would be classed as a vector with outliers. Indeed, we can confirm this using the check_SkewKurt() function, which is the default pass/fail function used in Treat(). It also returns the skew and kurtosis:

check_SkewKurt(x)
#> $Pass
#> [1] FALSE
#>
#> $Skew
#> [1] 3.063241
#>
#> $Kurt
#> [1] 9.741391
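For readers who want to see what such a check involves, here is a base R sketch using the small-sample adjusted skewness and excess kurtosis formulas. The function names are hypothetical, and it is an assumption that COINr uses these exact estimators, although on the example vector they should land very close to the values printed above. The rule that a vector fails only when both thresholds are exceeded is inferred from the treatment tables in this document, where an indicator with skew below 2 but kurtosis above 3.5 is marked as passing.

```r
# Hypothetical sketch of a skew/kurtosis check (not COINr's actual code)
skew_sketch <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  d <- x - mean(x)
  g1 <- mean(d^3) / mean(d^2)^1.5          # moment-based skewness
  g1 * sqrt(n * (n - 1)) / (n - 2)         # small-sample adjustment
}

kurt_sketch <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  d <- x - mean(x)
  g2 <- mean(d^4) / mean(d^2)^2 - 3        # moment-based excess kurtosis
  ((n + 1) * g2 + 6) * (n - 1) / ((n - 2) * (n - 3))
}

# assumed rule: fail only if BOTH thresholds are exceeded
pass_sketch <- function(x, skew_thresh = 2, kurt_thresh = 3.5) {
  !(abs(skew_sketch(x)) > skew_thresh && kurt_sketch(x) > kurt_thresh)
}

x <- c(1:10, 30, 100)
c(skew = skew_sketch(x), kurt = kurt_sketch(x))
pass_sketch(x)
```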

Now that we know x has outliers, we can treat it (if we want). We use the Treat() function to specify that our function for checking for outliers is f_pass = "check_SkewKurt", and our first function for treating outliers is f1 = "winsorise". We also pass an additional parameter to winsorise(), which is winmax = 2. You can check the winsorise() function documentation to better understand how it works.

l_treat <- Treat(x, f1 = "winsorise", f1_para = list(winmax = 2),
                 f_pass = "check_SkewKurt")

plot(x, l_treat$x)

The result of this data treatment is shown in the scatter plot: one point from x has been Winsorised (reassigned the next highest value). We can check the skew and kurtosis of the treated vector:

check_SkewKurt(l_treat$x)
#> $Pass
#> [1] TRUE
#>
#> $Skew
#> [1] 1.712038
#>
#> $Kurt
#> [1] 1.815781

Clearly, Winsorising one point was enough in this case to bring the skew and kurtosis within the specified thresholds.
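The idea of Winsorising one point at a time, up to a limit, can be sketched as follows. This is a simplified illustration under stated assumptions: winsorise_sketch() and range_ok() are hypothetical names, only high values are treated, and the pass/fail rule is a toy check rather than the skew and kurtosis test used by winsorise().

```r
# Hypothetical sketch of iterative Winsorisation of high values
# (not COINr's winsorise(); assumes length(x) > winmax)
winsorise_sketch <- function(x, f_pass, winmax = 5) {
  ord <- order(x, decreasing = TRUE)   # indices from largest to smallest
  x_w <- x
  nwin <- 0
  while (!f_pass(x_w) && nwin < winmax) {
    nwin <- nwin + 1
    # set the nwin largest points equal to the next-largest original value
    x_w[ord[seq_len(nwin)]] <- x[ord[nwin + 1]]
  }
  list(x = x_w, nwin = nwin)
}

# toy pass/fail rule, an assumption for illustration only
range_ok <- function(x) max(x) <= 5 * median(x)

x <- c(1:10, 30, 100)
res <- winsorise_sketch(x, f_pass = range_ok, winmax = 2)
res$nwin
res$x
```

Stopping as soon as the check passes is what keeps the number of altered points as small as possible, which is why a limit of two points can still result in only one point being changed.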

Data frames

Treatment of a data frame with Treat() is effectively the same as treating a numeric vector, because the data frame method passes each column of the data frame to the numeric method. Here, we use some data from the COINr package to demonstrate.

# select three indicators
df1 <- ASEM_iData[c("Flights", "Goods", "Services")]

# treat the data frame using defaults
l_treat <- Treat(df1)

str(l_treat, max.level = 1)
#> List of 3
#>  $ x_treat       :'data.frame':  51 obs. of  3 variables:
#>  $ Dets_Table    :'data.frame':  3 obs. of  8 variables:
#>  $ Treated_Points:'data.frame':  51 obs. of  3 variables:

We can see the output is a list with x_treat, the treated data frame; Dets_Table, a table describing what happened to each indicator; and Treated_Points, which marks which individual points were adjusted. This is effectively the same output as for treating a numeric vector.

l_treat$Dets_Table
#>      iCode check_SkewKurt0.Pass check_SkewKurt0.Skew check_SkewKurt0.Kurt
#> 1  Flights                FALSE             2.103287             4.508879
#> 2    Goods                FALSE             2.649973             8.266610
#> 3 Services                 TRUE             1.701085             2.375656
#>   winsorise.nwin check_SkewKurt1.Pass check_SkewKurt1.Skew check_SkewKurt1.Kurt
#> 1              1                 TRUE             1.900658            3.3360647
#> 2              2                 TRUE             1.140608            0.1572047
#> 3             NA                   NA                   NA                   NA

We also check the individual points:

l_treat$Treated_Points
#>    Flights Goods Services
#> 1
#> 2
#> 3
#> 4
#> 5
#> 6
#> 7
#> 8
#> 9
#> 10
#> 11         winhi
#> 12
#> 13
#> 14
#> 15
#> 16
#> 17
#> 18
#> 19
#> 20
#> 21
#> 22
#> 23
#> 24
#> 25
#> 26
#> 27
#> 28
#> 29
#> 30   winhi
#> 31
#> 32
#> 33
#> 34
#> 35         winhi
#> 36
#> 37
#> 38
#> 39
#> 40
#> 41
#> 42
#> 43
#> 44
#> 45
#> 46
#> 47
#> 48
#> 49
#> 50
#> 51

Coins

Treating coins is a simple extension of treating a data frame. The coin method simply extracts the relevant data set as a data frame, and passes it to the data frame method. So, more or less, the same arguments are present.

We begin by building the example coin, which will be used for the examples here.

coin <- build_example_coin(up_to = "new_coin")
#> iData checked and OK.
#> iMeta checked and OK.
#> Written data set to .$Data$Raw

Default treatment

The Treat() function can be applied directly to a coin with completely default options:

coin <- Treat(coin, dset = "Raw")
#> Written data set to .$Data$Treated

For each indicator, the Treat() function:

  1. Checks skew and kurtosis using the check_SkewKurt() function
  2. If the indicator fails the test (returns FALSE), applies Winsorisation
  3. Checks skew and kurtosis again
  4. If the indicator still fails, applies a log transformation.

If at any stage the indicator passes the skew and kurtosis test, it is returned without further treatment.

When we run Treat() on a coin, it also stores information returned from f1, f2 and f_pass in the coin:

# summary of treatment for each indicator
head(coin$Analysis$Treated$Dets_Table)
#>     iCode check_SkewKurt0.Pass check_SkewKurt0.Skew check_SkewKurt0.Kurt
#> 1     LPI                 TRUE           -0.3042681           -0.6567514
#> 2 Flights                FALSE            2.1032872            4.5088794
#> 3    Ship                 TRUE           -0.5756680           -0.6814795
#> 4    Bord                FALSE            2.1482360            5.7914905
#> 5    Elec                FALSE            2.2252736            5.7910268
#> 6     Gas                FALSE            2.8294486           10.3346494
#>   winsorise.nwin check_SkewKurt1.Pass check_SkewKurt1.Skew check_SkewKurt1.Kurt
#> 1             NA                   NA                   NA                   NA
#> 2              1                 TRUE             1.900658             3.336065
#> 3             NA                   NA                   NA                   NA
#> 4              1                 TRUE             1.899211             4.346298
#> 5              1                 TRUE             1.717744             2.586062
#> 6              1                 TRUE             1.602518             1.525576

Notice that only one treatment function was used here, since after Winsorisation (f1), all indicators passed the skew and kurtosis test (f_pass).

In general, Treat() tries to collect all information returned from the functions that it calls. Details of the treatment of individual points are also stored in .$Analysis$Treated$Treated_Points.

The Treat() function gives you a high degree of control over which functions are used to treat and test indicators, and it is also possible to specify different functions for different indicators. Let’s begin though by seeing how we can change the specifications for all indicators, before proceeding to individual treatment.

Unless indiv_specs is specified (see later), the same procedure is applied to all indicators. This process is specified by the global_specs argument. To see how to use this, it is easiest to show the default of this argument, which is built into the Treat() function:

# default treatment for all cols
specs_def <- list(
  f1 = "winsorise",
  f1_para = list(na.rm = TRUE,
                 winmax = 5,
                 skew_thresh = 2,
                 kurt_thresh = 3.5,
                 force_win = FALSE),
  f2 = "log_CT",
  f2_para = list(na.rm = TRUE),
  f_pass = "check_SkewKurt",
  f_pass_para = list(na.rm = TRUE,
                     skew_thresh = 2,
                     kurt_thresh = 3.5)
)

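Notice that there are six entries in the list: f1 and f2 name the first and second treatment functions, f_pass names the pass/fail function, and f1_para, f2_para and f_pass_para are lists of additional named parameters passed to each of these functions.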

To understand what the individual parameters do, for example in f1_para, we need to look at the documentation of the function called by f1, which is the winsorise() function.

The arguments of winsorise() are the same parameters as named in the list f1_para, so we can change the maximum number of points to be Winsorised, the skew and kurtosis thresholds, and other things.

To make adjustments, unless we want to redefine everything, we don’t need to specify the entire list. So for example, if we want to change the maximum Winsorisation limit winmax, we can just pass this part of the list (notice we still have to wrap the parameter inside a list):

# treat with max winsorisation of 1 point
coin <- Treat(coin, dset = "Raw", global_specs = list(f1_para = list(winmax = 1)))
#> Written data set to .$Data$Treated
#> (overwritten existing data set)

# see what happened
coin$Analysis$Treated$Dets_Table |>
  head(10)
#>       iCode check_SkewKurt0.Pass check_SkewKurt0.Skew check_SkewKurt0.Kurt
#> 1       LPI                 TRUE           -0.3042681           -0.6567514
#> 2   Flights                FALSE            2.1032872            4.5088794
#> 3      Ship                 TRUE           -0.5756680           -0.6814795
#> 4      Bord                FALSE            2.1482360            5.7914905
#> 5      Elec                FALSE            2.2252736            5.7910268
#> 6       Gas                FALSE            2.8294486           10.3346494
#> 7  ConSpeed                 TRUE            0.4622037            0.1873214
#> 8     Cov4G                 TRUE           -1.3725191            0.5419314
#> 9     Goods                FALSE            2.6499733            8.2666095
#> 10 Services                 TRUE            1.7010849            2.3756557
#>    winsorise.nwin check_SkewKurt1.Pass check_SkewKurt1.Skew
#> 1              NA                   NA                   NA
#> 2               1                 TRUE             1.900658
#> 3              NA                   NA                   NA
#> 4               1                 TRUE             1.899211
#> 5               1                 TRUE             1.717744
#> 6               1                 TRUE             1.602518
#> 7              NA                   NA                   NA
#> 8              NA                   NA                   NA
#> 9               1                FALSE             2.469910
#> 10             NA                   NA                   NA
#>    check_SkewKurt1.Kurt check_SkewKurt2.Pass check_SkewKurt2.Skew
#> 1                    NA                   NA                   NA
#> 2              3.336065                   NA                   NA
#> 3                    NA                   NA                   NA
#> 4              4.346298                   NA                   NA
#> 5              2.586062                   NA                   NA
#> 6              1.525576                   NA                   NA
#> 7                    NA                   NA                   NA
#> 8                    NA                   NA                   NA
#> 9              7.087309                 TRUE           0.03104001
#> 10                   NA                   NA                   NA
#>    check_SkewKurt2.Kurt
#> 1                    NA
#> 2                    NA
#> 3                    NA
#> 4                    NA
#> 5                    NA
#> 6                    NA
#> 7                    NA
#> 8                    NA
#> 9            -0.8888965
#> 10                   NA

Having imposed a much stricter Winsorisation limit (only one point), we can see that now one indicator has been passed to the second treatment function f2, which has performed a log transformation. After doing this, the indicator passes the skew and kurtosis test.
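One way to picture how a partial global_specs list changes only the entries it names is base R's modifyList(), which merges nested lists recursively. This is only a sketch of the idea: whether Treat() uses modifyList() internally is an assumption, and specs_def and user_specs are illustrative names, with specs_def copying the default list shown earlier.

```r
# default specifications, copied from the defaults shown in the text
specs_def <- list(
  f1 = "winsorise",
  f1_para = list(na.rm = TRUE, winmax = 5, skew_thresh = 2,
                 kurt_thresh = 3.5, force_win = FALSE),
  f2 = "log_CT",
  f2_para = list(na.rm = TRUE),
  f_pass = "check_SkewKurt",
  f_pass_para = list(na.rm = TRUE, skew_thresh = 2, kurt_thresh = 3.5)
)

# the user only names the entry to change
user_specs <- list(f1_para = list(winmax = 1))

# recursive merge: winmax is replaced, everything else is kept
specs_used <- utils::modifyList(specs_def, user_specs)
str(specs_used$f1_para)
```

Seen this way, it is clear why the parameter must still be wrapped in a list: the nesting tells the merge which sub-list winmax belongs to.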

By default, if an indicator does not satisfy f_pass after applying f1, it is passed to f2 in its original form. In other words, it is not the output of f1 that is passed to f2, and f2 is applied instead of f1, rather than in addition to it. If you want to apply f2 on top of f1, set combine_treat = TRUE. In this case, if f_pass is not satisfied after f1, then the output of f1 is used as the input of f2. For the defaults of f1 and f2 this approach is probably not advisable because Winsorisation and the log transform are quite different approaches. However, depending on what you want to do, it might be useful.
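The difference between the two behaviours can be seen on a toy vector in base R. This sketch does not call COINr: cap_max() is a hypothetical one-point Winsorisation and log() stands in for the second treatment function.

```r
x <- c(1:10, 30, 100)

# hypothetical f1: reassign the largest point to the second-largest value
cap_max <- function(x) {
  x[which.max(x)] <- sort(x, decreasing = TRUE)[2]
  x
}

# default behaviour: f2 is applied to the original x, instead of f1
f2_instead_of_f1 <- log(x)

# combine_treat = TRUE behaviour: f2 is applied to the output of f1
f2_on_top_of_f1 <- log(cap_max(x))

# the largest treated value differs between the two approaches
c(instead = max(f2_instead_of_f1), on_top = max(f2_on_top_of_f1))
```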

Individual treatment

The global_specs argument specifies the treatment methodology to apply to all indicators. However, the indiv_specs argument (if specified) can be used to override the treatment specified in global_specs for specific indicators. It is specified in exactly the same way as global_specs but requires a parameter list for each indicator that is to have individual specifications applied, wrapped inside one list.

This is probably clearer using an example. To begin with something simple, let’s say that we keep the defaults for all indicators except one, where we change the Winsorisation limit. We will set the Winsorisation limit of the indicator “Flights” to zero, to force it to be log-transformed.

# change individual specs for Flights
indiv_specs <- list(
  Flights = list(
    f1_para = list(winmax = 0)
  )
)

# re-run data treatment
coin <- Treat(coin, dset = "Raw", indiv_specs = indiv_specs)
#> Written data set to .$Data$Treated
#> (overwritten existing data set)

The only thing to remember here is to make sure the list is created correctly. Each indicator that is assigned individual treatment must have its own list, here containing f1_para. Then f1_para itself is a list of named parameter values for f1. Finally, all lists for each indicator have to be wrapped into a single list to pass to indiv_specs. This looks a bit convoluted for changing a single parameter, but gives a high degree of control over how data treatment is performed.

We can now see what happened to “Flights”:

coin$Analysis$Treated$Dets_Table[
  coin$Analysis$Treated$Dets_Table$iCode == "Flights",
]
#>     iCode check_SkewKurt0.Pass check_SkewKurt0.Skew check_SkewKurt0.Kurt
#> 2 Flights                FALSE             2.103287             4.508879
#>   winsorise.nwin check_SkewKurt1.Pass check_SkewKurt1.Skew check_SkewKurt1.Kurt
#> 2              0                FALSE             2.103287             4.508879
#>   check_SkewKurt2.Pass check_SkewKurt2.Skew check_SkewKurt2.Kurt
#> 2                 TRUE          -0.09502644           -0.8305217

Now we see that “Flights” didn’t pass the first Winsorisation step (because nothing happened to it), and was passed to the log transform. After that, the indicator passed the skew and kurtosis check.

As another example, we may wish to exclude some indicators from data treatment completely. To do this, we can set the corresponding entries in indiv_specs to "none". This is the only case where we don’t have to pass a list for each indicator.

# change individual specs for two indicators
indiv_specs <- list(
  Flights = "none",
  LPI = "none"
)

# re-run data treatment
coin <- Treat(coin, dset = "Raw", indiv_specs = indiv_specs)
#> Written data set to .$Data$Treated
#> (overwritten existing data set)

Now if we examine the treatment table, we will find that these indicators have been excluded from the table, as they were not subjected to treatment.

External functions

Any functions can be passed to Treat(), for both treating and checking for outliers. As an example, we can pass the outlier detection function check_outliers() from the performance package.

The following code chunk will only run if you have the ‘performance’ package installed.

library(performance)

# the check_outliers function outputs a logical vector which flags specific points as outliers.
# We need to wrap this to give a single TRUE/FALSE output, where FALSE means it doesn't pass,
# i.e. there are outliers
outlier_pass <- function(x){
  # return FALSE if any outliers
  !any(check_outliers(x))
}

# now call Treat(), passing this function
# we set f_pass_para to NULL to avoid passing default parameters to the new function
coin <- Treat(coin, dset = "Raw",
              global_specs = list(f_pass = "outlier_pass",
                                  f_pass_para = NULL))

# see what happened
coin$Analysis$Treated$Dets_Table |>
  head(10)

Here we see that the test for outliers is much stricter and very few of the indicators pass the test, even after applying a log transformation. Clearly, how an outlier is defined can vary and depend on your application.

Purses

The purse method for Treat() is fairly straightforward. It takes almost the same arguments as the coin method, and applies the same specifications to each coin. Here we simply demonstrate it on the example purse.

# build example purse
purse <- build_example_purse(up_to = "new_coin", quietly = TRUE)

# apply treatment to all coins in purse (default specs)
purse <- Treat(purse, dset = "Raw")
#> Written data set to .$Data$Treated
#> Written data set to .$Data$Treated
#> Written data set to .$Data$Treated
#> Written data set to .$Data$Treated
#> Written data set to .$Data$Treated

Simplified function

The Treat() function is very flexible but comes at the expense of a possibly fiddly syntax. If you don’t need that level of flexibility, consider using qTreat(), which is a simplified wrapper for Treat().

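The main feature of qTreat(), as the example below shows, is that the Winsorisation limit winmax and the skew and kurtosis thresholds are available directly as function arguments.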

The qTreat() function is a generic with methods for data frames, coins and purses. Here, we’ll just demonstrate it on a data frame.

# select three indicators
df1 <- ASEM_iData[c("Flights", "Goods", "Services")]

# treat data frame, changing winmax and skew/kurtosis limits
l_treat <- qTreat(df1, winmax = 1, skew_thresh = 1.5, kurt_thresh = 3)

Now we check what the results are:

l_treat$Dets_Table
#>      iCode check_SkewKurt0.Pass check_SkewKurt0.Skew check_SkewKurt0.Kurt
#> 1  Flights                FALSE             2.103287             4.508879
#> 2    Goods                FALSE             2.649973             8.266610
#> 3 Services                 TRUE             1.701085             2.375656
#>   winsorise.nwin check_SkewKurt1.Pass check_SkewKurt1.Skew check_SkewKurt1.Kurt
#> 1              1                FALSE             1.900658             3.336065
#> 2              1                FALSE             2.469910             7.087309
#> 3             NA                   NA                   NA                   NA
#>   check_SkewKurt2.Pass check_SkewKurt2.Skew check_SkewKurt2.Kurt
#> 1                 TRUE          -0.09502644           -0.8305217
#> 2                 TRUE           0.03104001           -0.8888965
#> 3                   NA                   NA                   NA

We can see that in this case, Winsorising by one point was not enough to bring “Flights” and “Goods” within the specified skew/kurtosis limits. Consequently, f2 was invoked, which uses a log transform and brought both indicators within the specified limits.
