Type:

Package

Title:

Preprocessing Algorithms for Imbalanced Datasets

Version:

1.0.2.1

Maintainer:

Ignacio Cordón <nacho.cordon.castillo@gmail.com>

Description:

Class imbalance usually damages the performance of classifiers. Thus, it is important to treat data before applying a classifier algorithm. This package includes recent resampling algorithms from the literature: (Barua et al. 2014) <doi:10.1109/tkde.2012.232>; (Das et al. 2015) <doi:10.1109/tkde.2014.2324567>; (Zhang et al. 2014) <doi:10.1016/j.inffus.2013.12.003>; (Gao et al. 2014) <doi:10.1016/j.neucom.2014.02.006>; (Almogahed et al. 2014) <doi:10.1007/s00500-014-1484-5>. It also includes a useful interface to perform oversampling.

License:

GPL-2 | GPL-3 | file LICENSE [expanded from: GPL (≥ 2) | file LICENSE]

Encoding:

UTF-8

LazyData:

true

BugReports:

http://github.com/ncordon/imbalance/issues

URL:

http://github.com/ncordon/imbalance

Depends:

R (≥ 3.3.0)

Imports:

bnlearn, KernelKnn, ggplot2, utils, stats, mvtnorm, Rcpp, smotefamily, FNN, C50

Suggests:

testthat, knitr, rmarkdown

RoxygenNote:

7.0.2

VignetteBuilder:

knitr

LinkingTo:

Rcpp, RcppArmadillo

NeedsCompilation:

yes

Packaged:

2020-03-30 12:10:40 UTC; hornik

Author:

Ignacio Cordón [aut, cre], Salvador García [aut], Alberto Fernández [aut], Francisco Herrera [aut]

Repository:

CRAN

Date/Publication:

2020-04-07 06:51:44 UTC

Binary banana dataset

Description

Dataset containing two attributes and a class attribute that, when plotted, represent a banana shape.

Usage

banana
banana_orig

Format

At1: First attribute.
At2: Second attribute.
Class: Two possible classes: positive (banana shape), negative (surrounding of the banana).

Shape

banana: A data frame with 2640 instances, 264 of which belong to the positive class, and 3 variables.

banana_orig: A data frame with 5300 instances, 2376 of which belong to the positive class, and 3 variables.

Source

KEEL Repository.

Imbalanced binary ecoli protein localization sites

Description

Imbalanced binary dataset containing protein traits for predicting their cellular localization sites.

Usage

ecoli1

Format

A data frame with 336 instances, 77 of which belong to the positive class, and 8 variables:

Mcg: McGeoch's method for signal sequence recognition. Continuous attribute.
Gvh: Von Heijne's method for signal sequence recognition. Continuous attribute.
Lip: Von Heijne's Signal Peptidase II consensus sequence score. Discrete attribute.
Chg: Presence of charge on N-terminus of predicted lipoproteins. Discrete attribute.
Aac: Score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins. Continuous attribute.
Alm1: Score of the ALOM membrane spanning region prediction program. Continuous attribute.
Alm2: Score of ALOM program after excluding putative cleavable signal regions from the sequence. Continuous attribute.
Class: Two possible classes: positive (type im), negative (the rest).

Source

KEEL Repository.

Imbalanced binary glass identification

Description

Imbalanced binary classification dataset containing variables to identify types of glass.

Usage

glass0

Format

A data frame with 214 instances, 70 of which belong to the positive class, and 10 variables:

RI: Refractive Index. Continuous attribute.
Na: Sodium, weight percent in component. Continuous attribute.
Mg: Magnesium, weight percent in component. Continuous attribute.
Al: Aluminum, weight percent in component. Continuous attribute.
Si: Silicon, weight percent in component. Continuous attribute.
K: Potassium, weight percent in component. Continuous attribute.
Ca: Calcium, weight percent in component. Continuous attribute.
Ba: Barium, weight percent in component. Continuous attribute.
Fe: Iron, weight percent in component. Continuous attribute.
Class: Two possible glass types: positive (building windows, float processed) and negative (the rest).

Source

KEEL Repository.

Haberman's survival data

Description

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Usage

haberman

Format

A data frame with 306 instances, 81 of which belong to the positive class, and 4 variables:

Age: Age of patient at time of operation. Discrete attribute.
Year: Patient's year of operation. Discrete attribute.
Positive: Number of positive axillary nodes detected. Discrete attribute.
Class: Two possible survival statuses: positive (survival of less than 5 years), negative (survival of more than 5 years).

Source

KEEL Repository.

imbalance: A package to treat imbalanced datasets

Description

Focused on binary class datasets, the imbalance package provides methods to generate synthetic examples and achieve balance between the minority and majority classes in dataset distributions.

Oversampling

Methods to oversample the minority class: racog, wracog, rwo, pdfos, mwmote.

Evaluation

Method to measure imbalance ratio in a given two-class dataset: imbalanceRatio.

Method to visually evaluate algorithms: plotComparison.

Filtering

Method to filter oversampled instances: neater.

Compute imbalance ratio of a binary dataset

Description

Given a two-class dataset, computes its imbalance ratio as (size of minority class) / (size of majority class).

Usage

imbalanceRatio(dataset, classAttr = "Class")

Arguments

dataset

A target data.frame to compute its imbalance ratio.

classAttr

A character containing the class name attribute.

Value

A real number in [0,1] representing the imbalance ratio of dataset.

Examples

data(glass0)
imbalanceRatio(glass0, classAttr = "Class")
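The ratio defined in the Description can be reproduced in base R without the package. This is a minimal sketch, not the package's implementation; the helper name ratioSketch and the toy data frame are assumptions made for illustration.

```r
# Hypothetical helper: minority class size divided by majority class size.
ratioSketch <- function(dataset, classAttr = "Class") {
  counts <- table(dataset[[classAttr]])
  stopifnot(length(counts) == 2)  # only defined for two-class data
  min(counts) / max(counts)
}

# Toy data: 2 positive and 8 negative instances.
toy <- data.frame(
  At1 = seq_len(10),
  Class = factor(c(rep("positive", 2), rep("negative", 8)))
)
ratioSketch(toy)  # 2 / 8 = 0.25
```

A perfectly balanced dataset gives 1, and the value approaches 0 as the minority class shrinks.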

Imbalanced binary iris dataset

Description

Modification of the iris dataset. Measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The possible classifications are positive (setosa) and negative (versicolor + virginica).

Usage

iris0

Format

A data frame with 150 instances, 50 of which belong to the positive class, and 5 variables:

SepalLength: Measurement of sepal length, in cm. Continuous attribute.
SepalWidth: Measurement of sepal width, in cm. Continuous attribute.
PetalLength: Measurement of petal length, in cm. Continuous attribute.
PetalWidth: Measurement of petal width, in cm. Continuous attribute.
Class: Two possible classes: positive (setosa) and negative (versicolor + virginica).

Source

KEEL Repository.

Majority weighted minority oversampling technique for imbalanced dataset learning

Description

Modification of the SMOTE technique which overcomes some of its problems when there are noisy instances, in which case SMOTE would generate more noisy instances out of them.

Usage

mwmote(
  dataset,
  numInstances,
  kNoisy = 5,
  kMajority = 3,
  kMinority,
  threshold = 5,
  cmax = 2,
  cclustering = 3,
  classAttr = "Class"
)

Arguments

dataset

data.frame to treat. All columns, except the classAttr one, have to be numeric or coercible to numeric.

numInstances

Integer. Number of new minority examples to generate.

kNoisy

Integer. Parameter of Euclidean KNN to detect noisy examples as those whose whole kNoisy-neighbourhood is from the opposite class.

kMajority

Integer. Parameter of Euclidean KNN to detect majority borderline examples as those which are in any kMajority-neighbourhood of minority instances. Should be a low integer.

kMinority

Integer. Parameter of Euclidean KNN to detect minority borderline examples as those which are in the kMinority-neighbourhood of majority borderline ones. It should be a large integer. By default, if no value is passed to the function, |S^{+}|/2, where S^{+} is the set of minority examples.

threshold

Numeric. A positive real indicating how much we measure tolerance of closeness to the boundary of minority boundary examples. A large value gives more margin of distance for an example to be considered an important boundary one.

cmax

Numeric. A positive real indicating how much we measure tolerance of closeness to the boundary of minority boundary examples. The larger this number, the more we are valuing boundary examples.

cclustering

Numeric. A positive real for tuning the output of an internal clustering. The larger this parameter, the more area-focused the oversampling is going to be.

classAttr

character. Indicates the class attribute from dataset. Must exist in it.

Value

A data.frame with the same structure as dataset, containing the generated synthetic examples.

References

Barua, Sukarna; Islam, Md. M.; Yao, Xin; Murase, Kazuyuki. MWMOTE: Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning. IEEE Transactions on Knowledge and Data Engineering 26 (2014), Nr. 2, p. 405–425.

Examples

data(iris0)
# Generates new minority examples
newSamples <- mwmote(iris0, numInstances = 100, classAttr = "Class")

Filtering of oversampled data based on non-cooperative game theory

Description

Filters oversampled examples from a binary class dataset, using game theory to decide whether each example is worth keeping.

Usage

neater(
  dataset,
  newSamples,
  k = 3,
  iterations = 100,
  smoothFactor = 1,
  classAttr = "Class"
)

Arguments

dataset

The original data.frame. All columns, except the classAttr one, have to be numeric or coercible to numeric.

newSamples

A data.frame containing the samples to be filtered. Must have the same structure as dataset.

k

Integer. Number of nearest neighbours to use in the KNN algorithm to rule out samples. By default, 3.

iterations

Integer. Number of iterations for the algorithm. By default, 100.

smoothFactor

A positive numeric. By default, 1.

classAttr

character. Indicates the class attribute from dataset and newSamples. Must exist in them.

Details

Uses game theory and Nash equilibria to calculate the probability of each minority example truly belonging to the minority class. It discards examples which, at the final stage of the algorithm, are more likely to be majority examples than minority ones.

Value

Filtered samples as a data.frame with the same structure as newSamples.

References

Almogahed, B.A.; Kakadiaris, I.A. Neater: Filtering of Over-Sampled Data Using Non-Cooperative Game Theory. Soft Computing 19 (2014), Nr. 11, p. 3301–3322.

Examples

data(iris0)
newSamples <- smotefamily::SMOTE(iris0[,-5], iris0[,5])$syn_data
# SMOTE overrides Class attr turning it into class
# and dataset must have same class attribute as newSamples
names(newSamples) <- c(names(newSamples)[-5], "Class")
neater(iris0, newSamples, k = 5, iterations = 100,
       smoothFactor = 1, classAttr = "Class")

Imbalanced binary thyroid gland data

Description

Data to predict patient's hyperthyroidism.

Usage

newthyroid1

Format

A data frame with 215 instances, 35 of which belong to the positive class, and 6 variables:

T3resin: T3-resin uptake test, percentage. Discrete attribute.
Thyroxin: Total serum thyroxin as measured by the isotopic displacement method. Continuous attribute.
Triiodothyronine: Total serum triiodothyronine as measured by radioimmunoassay. Continuous attribute.
Thyroidstimulating: Basal thyroid-stimulating hormone (TSH) as measured by radioimmunoassay. Continuous attribute.
TSH_value: Maximal absolute difference of TSH value after injection of 200 micrograms of thyrotropin-releasing hormone as compared to the basal value. Continuous attribute.
Class: Two possible classes: positive as hyperthyroidism, negative as non-hyperthyroidism.

Source

KEEL Repository.

Wrapper that encapsulates a collection of algorithms to perform a class balancing preprocessing task for binary class datasets

Description

Wrapper that encapsulates a collection of algorithms to perform a class balancing preprocessing task for binary class datasets.

Usage

oversample(
  dataset,
  ratio = NA,
  method = c("RACOG", "wRACOG", "PDFOS", "RWO", "ADASYN", "ANSMOTE", "SMOTE", "MWMOTE",
    "BLSMOTE", "DBSMOTE", "SLMOTE", "RSLSMOTE"),
  filtering = FALSE,
  classAttr = "Class",
  wrapper = c("KNN", "C5.0"),
  ...
)

Arguments

dataset

A binary class data.frame to balance.

ratio

Number between 0 and 1 indicating the desired ratio between minority examples and majority ones, that is, the quotient size of minority class / size of majority class. There are methods, such as ADASYN or wRACOG, to which this parameter does not apply.

method

A character corresponding to the method to apply. Possible methods are: RACOG, wRACOG, PDFOS, RWO, ADASYN, ANSMOTE, SMOTE, MWMOTE, BLSMOTE, DBSMOTE, SLMOTE, RSLSMOTE.

filtering

Logical (TRUE or FALSE) indicating whether to apply filtering of oversampled instances with the neater algorithm.

classAttr

character. Indicates the class attribute from dataset. Must exist in it.

wrapper

A character corresponding to the wrapper to apply if the selected method is wracog. Possibilities are: "C5.0" and "KNN".

...

Further arguments to apply in the selected method.

Value

A balanced data.frame with the same structure as dataset, containing both original instances and new ones.

Examples

data(glass0)
# Oversample glass0 to get an imbalance ratio of 0.8
imbalanceRatio(glass0)
# 0.4861111
newDataset <- oversample(glass0, ratio = 0.8, method = "MWMOTE")
imbalanceRatio(newDataset)
newDataset <- oversample(glass0, method = "ADASYN")
newDataset <- oversample(glass0, ratio = 0.8, method = "SMOTE")
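As a worked illustration of the ratio argument: glass0 has 70 positive and 144 negative instances (214 in total), which gives the 0.4861111 shown above. The number of synthetic minority instances needed to reach at least a target ratio follows directly from the definition. The helper instancesNeeded is hypothetical, and rounding up with ceiling is an assumption; how oversample itself rounds is not stated here.

```r
# Hypothetical helper: smallest number of new minority instances such that
# (nMinority + new) / nMajority >= ratio.
instancesNeeded <- function(nMinority, nMajority, ratio) {
  stopifnot(ratio > 0, ratio <= 1)
  max(0, ceiling(ratio * nMajority - nMinority))
}

nNew <- instancesNeeded(nMinority = 70, nMajority = 144, ratio = 0.8)
nNew               # 46
(70 + nNew) / 144  # about 0.806, the first achievable ratio >= 0.8
```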

Probability density function estimation based oversampling

Description

Generates synthetic minority examples for a numerical dataset approximating a Gaussian multivariate distribution which best fits the minority data.

Usage

pdfos(dataset, numInstances, classAttr = "Class")

Arguments

dataset

data.frame to treat. All columns, except the classAttr one, have to be numeric or coercible to numeric.

numInstances

Integer. Number of new minority examples to generate.

classAttr

character. Indicates the class attribute from dataset. Must exist in it.

Details

To generate the synthetic data, it approximates a normal distribution whose mean is a given example belonging to the minority class and whose variance is the minority class variance multiplied by a constant; that constant is computed so that it minimizes the mean integrated squared error of a Gaussian multivariate kernel function.
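The sampling scheme described above can be sketched in base R. This is not the package's implementation: the function name pdfosSketch is hypothetical, and the bandwidth h is taken from Silverman's rule of thumb as a stand-in for the constant that pdfos obtains by minimizing the mean integrated squared error.

```r
# Sketch of Gaussian-kernel sampling around minority examples.
# Assumption: h comes from Silverman's rule of thumb, standing in for
# the MISE-minimizing constant described in the Details.
pdfosSketch <- function(minority, numInstances) {
  X <- as.matrix(minority)
  rownames(X) <- NULL
  n <- nrow(X)
  d <- ncol(X)
  h <- (4 / (n * (d + 2)))^(1 / (d + 4))
  # Upper-triangular factor R with t(R) %*% R equal to the minority covariance.
  R <- chol(cov(X))
  # Each synthetic point is centred on a randomly chosen minority example.
  centres <- X[sample.int(n, numInstances, replace = TRUE), , drop = FALSE]
  noise <- matrix(rnorm(numInstances * d), ncol = d) %*% R
  as.data.frame(centres + h * noise)
}

set.seed(1)
minority <- iris[iris$Species == "setosa", 1:4]
synthetic <- pdfosSketch(minority, numInstances = 20)
dim(synthetic)  # 20 rows, 4 columns
```

Multiplying standard normal draws by the Cholesky factor gives noise whose covariance is the minority covariance, so h * noise has covariance h^2 times that matrix, matching the "variance multiplied by a constant" wording.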

Value

A data.frame with the same structure as dataset, containing the generated synthetic examples.

References

Gao, Ming; Hong, Xia; Chen, Sheng; Harris, Chris J.; Khalaf, Emad. PDFOS: PDF Estimation Based Oversampling for Imbalanced Two-Class Problems. Neurocomputing 138 (2014), p. 248–259.

Silverman, B. W. Density Estimation for Statistics and Data Analysis. Chapman & Hall, 1986. ISBN 0412246201.

Examples

data(iris0)
newSamples <- pdfos(iris0, numInstances = 100, classAttr = "Class")

Plots comparison between the original and the new balanced dataset.

Description

Plots a grid of pairwise variable comparisons, placing the plot of the original dataset next to that of the balanced one for each pair of attributes.

Usage

plotComparison(dataset, anotherDataset, attrs, cols = 2, classAttr = "Class")

Arguments

dataset

A data.frame. The former imbalanced dataset.

anotherDataset

A data.frame. The balanced dataset. dataset and anotherDataset must have the same columns.

attrs

Vector of character. Attributes to compare. The function generates each possible combination of attributes to build the comparison.

cols

Integer. It indicates the number of columns of the resulting grid. Must be an even number. By default, 2.

classAttr

character. Indicates the class attribute from dataset. Must exist in it.

Value

Plot of 2D comparison between the variables.

Examples

data(iris0)
set.seed(12345)
rwoSamples <- rwo(iris0, numInstances = 100)
rwoBalanced <- rbind(iris0, rwoSamples)
plotComparison(iris0, rwoBalanced, names(iris0), cols = 2, classAttr = "Class")

Rapidly converging Gibbs algorithm.

Description

Allows you to treat imbalanced discrete numeric datasets by generating synthetic minority examples, approximating their probability distribution.

Usage

racog(dataset, numInstances, burnin = 100, lag = 20, classAttr = "Class")

Arguments

dataset

data.frame to treat. All columns, except the classAttr one, have to be numeric or coercible to numeric.

numInstances

Integer. Number of new minority examples to generate.

burnin

Integer. It determines how many of the examples generated for a given one are discarded at first. By default, 100.

lag

Integer. Number of iterations between newly generated examples for a minority one. By default, 20.

classAttr

character. Indicates the class attribute from dataset. Must exist in it.

Details

Approximates the minority distribution using a Gibbs sampler. The dataset must be discretized and numeric. In each iteration, it builds a new sample using a Markov chain. It discards the first burnin iterations and, from then on, validates the example as a new minority example every lag iterations. It generates d (iterations - burnin) / lag examples, where d is the number of minority examples.
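The burn-in and lag bookkeeping can be illustrated with a few lines of base R. The helper keptIterations is hypothetical, and the exact iteration indices the package keeps are an assumption; the sketch only shows how the count (iterations - burnin) / lag per minority example arises.

```r
# Iterations of a single chain that survive burn-in and thinning.
keptIterations <- function(iterations, burnin = 100, lag = 20) {
  if (iterations <= burnin) return(integer(0))
  idx <- seq(burnin + 1, iterations)
  idx[(idx - burnin) %% lag == 0]
}

kept <- keptIterations(iterations = 200, burnin = 100, lag = 20)
kept          # 120 140 160 180 200
length(kept)  # 5, i.e. (200 - 100) / 20 per minority example
# With d = 50 minority examples, one chain each: 50 * 5 = 250 samples.
```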

Value

A data.frame with the same structure as dataset, containing the generated synthetic examples.

References

Das, Barnan; Krishnan, Narayanan C.; Cook, Diane J. RACOG and wRACOG: Two Probabilistic Oversampling Techniques. IEEE Transactions on Knowledge and Data Engineering 27 (2015), Nr. 1, p. 222–234.

Examples

data(iris0)
# Generates new minority examples
newSamples <- racog(iris0, numInstances = 40, burnin = 20, lag = 10,
                    classAttr = "Class")
newSamples <- racog(iris0, numInstances = 100)

Random walk oversampling

Description

Generates synthetic minority examples for a dataset trying to preserve the variance and mean of the minority class. Works on every type of dataset.

Usage

rwo(dataset, numInstances, classAttr = "Class")

Arguments

dataset

data.frame to treat. All columns, except the classAttr one, have to be numeric or coercible to numeric.

numInstances

Integer. Number of new minority examples to generate.

classAttr

character. Indicates the class attribute from dataset. Must exist in it.

Details

Generates numInstances new minority examples for dataset, adding to each numeric column of the j-th example its variance scaled by the inverse of the number of minority examples and a factor following a N(0,1) distribution which depends on the example. When the column is nominal, it uses a roulette scheme.
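For numeric columns, the random walk step can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the name rwoSketch is hypothetical, the step size is assumed to be the column standard deviation divided by the square root of the number of minority examples, and the roulette scheme for nominal columns is left out.

```r
# Sketch for numeric columns only; nominal columns (roulette scheme) are omitted.
rwoSketch <- function(minority, numInstances) {
  X <- as.matrix(minority)
  rownames(X) <- NULL
  n <- nrow(X)
  # Assumed step size per column: standard deviation over sqrt(n).
  step <- apply(X, 2, sd) / sqrt(n)
  base <- X[sample.int(n, numInstances, replace = TRUE), , drop = FALSE]
  r <- matrix(rnorm(numInstances * ncol(X)), ncol = ncol(X))
  # sweep() multiplies column j of r by step[j].
  as.data.frame(base - sweep(r, 2, step, `*`))
}

set.seed(1)
minority <- iris[iris$Species == "setosa", 1:4]
walked <- rwoSketch(minority, numInstances = 30)
dim(walked)  # 30 rows, 4 columns
```

Because the perturbation has zero mean and shrinks as the minority class grows, the synthetic points stay close to the minority examples they start from, which is how the method tries to preserve the class mean and variance.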

Value

A data.frame with the same structure as dataset, containing the generated synthetic examples.

References

Zhang, Huaxiang; Li, Mingfang. RWO-Sampling: A Random Walk Over-Sampling Approach to Imbalanced Data Classification. Information Fusion 20 (2014), p. 99–116.

Examples

data(iris0)
newSamples <- rwo(iris0, numInstances = 100, classAttr = "Class")

Generic methods to train classifiers

Description

Generic methods to train classifiers

Usage

trainWrapper(wrapper, train, trainClass, ...)

Arguments

wrapper

the wrapper instance

train

data.frame of the train dataset without the class column

trainClass

a vector containing the class column for train

...

further arguments for wrapper

Value

A model on which predict can be called.

Examples

myWrapper <- structure(list(), class = "C50Wrapper")
trainWrapper.C50Wrapper <- function(wrapper, train, trainClass){
  C50::C5.0(train, trainClass)
}

Imbalanced binary breast cancer Wisconsin dataset

Description

Binary class dataset containing traits about patients with cancer. Original dataset was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.

Usage

wisconsin

Format

A data frame with 683 instances, 239 of which belong to the positive class, and 10 variables:

ClumpThickness: Discrete attribute.
CellSize: Discrete attribute.
CellShape: Discrete attribute.
MarginalAdhesion: Discrete attribute.
EpithelialSize: Discrete attribute.
BareNuclei: Discrete attribute.
BlandChromatin: Discrete attribute.
NormalNucleoli: Discrete attribute.
Mitoses: Discrete attribute.
Class: Two possible classes: positive (cancer) and negative (not cancer).

Source

KEEL Repository.

Wrapper for rapidly converging Gibbs algorithm.

Description

Generates synthetic minority examples by approximating their probability distribution until sensitivity of wrapper over validation cannot be further improved. Works only on discrete numeric datasets.

Usage

wracog(
  train,
  validation,
  wrapper,
  slideWin = 10,
  threshold = 0.02,
  classAttr = "Class",
  ...
)

Arguments

train

data.frame. An initial dataset to generate the first model. All columns, except the classAttr one, have to be numeric or coercible to numeric.

validation

data.frame. A dataset to compare results of consecutive classifiers. Must have the same structure as train.

wrapper

An S3 object. There must exist a method trainWrapper implemented for the class of the object, and a predict method implemented for the class of the model returned by trainWrapper. Alternatively, it can be the name of one of the wrappers distributed with the package, "KNN" or "C5.0".

slideWin

Number of last sensitivities to take into account to meet the stopping criteria. By default, 10.

threshold

Threshold that the mean of the last slideWin sensitivities should reach. By default, 0.02.

classAttr

character. Indicates the class attribute from train and validation. Must exist in them.

...

further arguments for wrapper.

Details

Until the last slideWin executions of wrapper over the validation dataset reach a mean sensitivity lower than threshold, the algorithm keeps generating samples using a Gibbs sampler and adding to the train dataset those samples misclassified by the model built from the previous train dataset. The initial model is built on the initial train dataset.

Value

A data.frame with the same structure as train, containing the generated synthetic examples.

References

Das, Barnan; Krishnan, Narayanan C.; Cook, Diane J. RACOG and wRACOG: Two Probabilistic Oversampling Techniques. IEEE Transactions on Knowledge and Data Engineering 27 (2015), Nr. 1, p. 222–234.

Examples

data(haberman)
# Create train and validation partitions of haberman
trainFold <- sample(1:nrow(haberman), nrow(haberman)/2, FALSE)
trainSet <- haberman[trainFold, ]
validationSet <- haberman[-trainFold, ]
# Defines our own wrapper with a C5.0 tree
myWrapper <- structure(list(), class = "TestWrapper")
trainWrapper.TestWrapper <- function(wrapper, train, trainClass){
  C50::C5.0(train, trainClass)
}
# Execute wRACOG with our own wrapper
newSamples <- wracog(trainSet, validationSet, myWrapper,
                     classAttr = "Class")
# Execute wRACOG with predefined wrappers for "KNN" or "C5.0"
KNNSamples <- wracog(trainSet, validationSet, "KNN")
C50Samples <- wracog(trainSet, validationSet, "C5.0")

Imbalanced binary yeast protein localization sites

Description

Imbalanced binary dataset containing protein traits for predicting their cellular localization sites.

Usage

yeast4

Format

A data frame with 1484 instances, 51 of which belong to the positive class, and 9 variables:

Mcg: McGeoch's method for signal sequence recognition. Continuous attribute.
Gvh: Von Heijne's method for signal sequence recognition. Continuous attribute.
Alm: Score of the ALOM membrane spanning region prediction program. Continuous attribute.
Mit: Score of discriminant analysis of the amino acid content of the N-terminal region (20 residues long) of mitochondrial and non-mitochondrial proteins. Continuous attribute.
Erl: Presence of "HDEL" substring (thought to act as a signal for retention in the endoplasmic reticulum lumen). Binary attribute. Discrete attribute.
Pox: Peroxisomal targeting signal in the C-terminus. Continuous attribute.
Vac: Score of discriminant analysis of the amino acid content of vacuolar and extracellular proteins. Continuous attribute.
Nuc: Score of discriminant analysis of nuclear localization signals of nuclear and non-nuclear proteins. Continuous attribute.
Class: Two possible classes: positive (membrane protein, uncleaved signal), negative (rest of localizations).

Source

KEEL Repository.
