| Type: | Package |
| Title: | Preprocessing Algorithms for Imbalanced Datasets |
| Version: | 1.0.2.1 |
| Maintainer: | Ignacio Cordón <nacho.cordon.castillo@gmail.com> |
| Description: | Class imbalance usually damages the performance of classifiers. Thus, it is important to treat data before applying a classifier algorithm. This package includes recent resampling algorithms in the literature: (Barua et al. 2014) <doi:10.1109/tkde.2012.232>; (Das et al. 2015) <doi:10.1109/tkde.2014.2324567>; (Zhang et al. 2014) <doi:10.1016/j.inffus.2013.12.003>; (Gao et al. 2014) <doi:10.1016/j.neucom.2014.02.006>; (Almogahed et al. 2014) <doi:10.1007/s00500-014-1484-5>. It also includes a useful interface to perform oversampling. |
| License: | GPL-2 |GPL-3 | file LICENSE [expanded from: GPL (≥ 2) | file LICENSE] |
| Encoding: | UTF-8 |
| LazyData: | true |
| BugReports: | http://github.com/ncordon/imbalance/issues |
| URL: | http://github.com/ncordon/imbalance |
| Depends: | R (≥ 3.3.0) |
| Imports: | bnlearn, KernelKnn, ggplot2, utils, stats, mvtnorm, Rcpp, smotefamily, FNN, C50 |
| Suggests: | testthat, knitr, rmarkdown |
| RoxygenNote: | 7.0.2 |
| VignetteBuilder: | knitr |
| LinkingTo: | Rcpp, RcppArmadillo |
| NeedsCompilation: | yes |
| Packaged: | 2020-03-30 12:10:40 UTC; hornik |
| Author: | Ignacio Cordón [aut, cre], Salvador García [aut], Alberto Fernández [aut], Francisco Herrera [aut] |
| Repository: | CRAN |
| Date/Publication: | 2020-04-07 06:51:44 UTC |
Binary banana dataset
Description
Dataset containing two attributes and a class attribute that, when plotted, represent a banana shape.
Usage
banana
banana_orig
Format
- At1
First attribute.
- At2
Second attribute.
- Class
Two possible classes: positive (banana shape), negative (surrounding of the banana).
Shape
banana: A data frame with 2640 instances, 264 of which belong to the positive class, and 3 variables.
banana_orig: A data frame with 5300 instances, 2376 of which belong to the positive class, and 3 variables.
Source
Imbalanced binary ecoli protein localization sites
Description
Imbalanced binary dataset containing protein traits for predicting their cellular localization sites.
Usage
ecoli1
Format
A data frame with 336 instances, 77 of which belong to the positive class, and 8 variables:
- Mcg
McGeoch's method for signal sequence recognition. Continuous attribute.
- Gvh
Von Heijne's method for signal sequence recognition. Continuous attribute.
- Lip
von Heijne's Signal Peptidase II consensus sequence score. Discrete attribute.
- Chg
Presence of charge on N-terminus of predicted lipoproteins. Discrete attribute.
- Aac
Score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins. Continuous attribute.
- Alm1
Score of the ALOM membrane spanning region prediction program. Continuous attribute.
- Alm2
Score of the ALOM program after excluding putative cleavable signal regions from the sequence. Continuous attribute.
- Class
Two possible classes: positive (type im), negative (the rest).
Source
See Also
Original available in the UCI ML Repository.
Imbalanced binary glass identification
Description
Imbalanced binary classification dataset containing variables to identify types of glass.
Usage
glass0
Format
A data frame with 214 instances, 70 of which belong to the positive class, and 10 variables:
- RI
Refractive Index. Continuous attribute.
- Na
Sodium, weight percent in component. Continuous attribute.
- Mg
Magnesium, weight percent in component. Continuous attribute.
- Al
Aluminum, weight percent in component. Continuous attribute.
- Si
Silicon, weight percent in component. Continuous attribute.
- K
Potassium, weight percent in component. Continuous attribute.
- Ca
Calcium, weight percent in component. Continuous attribute.
- Ba
Barium, weight percent in component. Continuous attribute.
- Fe
Iron, weight percent in component. Continuous attribute.
- Class
Two possible glass types: positive (building windows, float processed) and negative (the rest).
Source
See Also
Original available in the UCI ML Repository.
Haberman's survival data
Description
The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
Usage
haberman
Format
A data frame with 306 instances, 81 of which belong to the positive class, and 4 variables:
- Age
Age of patient at time of operation. Discrete attribute.
- Year
Patient's year of operation. Discrete attribute.
- Positive
Number of positive axillary nodes detected. Discrete attribute.
- Class
Two possible survival statuses: positive (survival of less than 5 years), negative (survival of more than 5 years).
Source
See Also
Original available in the UCI ML Repository.
imbalance: A package to treat imbalanced datasets
Description
Focused on binary class datasets, the imbalance package provides methods to generate synthetic examples and achieve balance between the minority and majority classes in dataset distributions.
Oversampling
Methods to oversample the minority class: racog, wracog, rwo, pdfos, mwmote.
Evaluation
Method to measure the imbalance ratio in a given two-class dataset: imbalanceRatio.
Method to visually evaluate algorithms: plotComparison.
Filtering
Method to filter oversampled instances: neater.
Compute imbalance ratio of a binary dataset
Description
Given a two-class dataset, it computes its imbalance ratio as (size of minority class) / (size of majority class).
Usage
imbalanceRatio(dataset, classAttr = "Class")
Arguments
dataset | A target |
classAttr | A |
Value
A real number in [0,1] representing the imbalance ratio of dataset.
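The ratio described above can be sketched in base R as follows. The helper name `imbalance_ratio_sketch` and the toy data frame are hypothetical and only illustrate the arithmetic; this is not the package's implementation of imbalanceRatio.

```r
# Hypothetical helper: minority size divided by majority size.
# Assumes the class column holds exactly two distinct values.
imbalance_ratio_sketch <- function(dataset, classAttr = "Class") {
  counts <- table(dataset[[classAttr]])
  stopifnot(length(counts) == 2)
  as.numeric(min(counts) / max(counts))
}

# Toy data: 2 positive and 8 negative instances.
toy <- data.frame(
  x = 1:10,
  Class = factor(c(rep("positive", 2), rep("negative", 8)))
)
imbalance_ratio_sketch(toy)  # 2 / 8 = 0.25
```

Because the minority count is always the numerator, the result can never exceed 1, which matches the [0,1] range stated above.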
Examples
data(glass0)
imbalanceRatio(glass0, classAttr = "Class")
Imbalanced binary iris dataset
Description
Modification of the iris dataset. Measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The possible classifications are positive (setosa) and negative (versicolor + virginica).
Usage
iris0
Format
A data frame with 150 instances, 50 of which belong to the positive class, and 5 variables:
- SepalLength
Measurement of sepal length, in cm. Continuous attribute.
- SepalWidth
Measurement of sepal width, in cm. Continuous attribute.
- PetalLength
Measurement of petal length, in cm. Continuous attribute.
- PetalWidth
Measurement of petal width, in cm. Continuous attribute.
- Class
Two possible classes: positive (setosa) and negative (versicolor + virginica).
Source
Majority weighted minority oversampling technique for imbalanced dataset learning
Description
Modification of the SMOTE technique that overcomes some of its problems when there are noisy instances, in which case SMOTE would generate more noisy instances out of them.
Usage
mwmote(
  dataset,
  numInstances,
  kNoisy = 5,
  kMajority = 3,
  kMinority,
  threshold = 5,
  cmax = 2,
  cclustering = 3,
  classAttr = "Class"
)
Arguments
dataset |
|
numInstances | Integer. Number of new minority examples to generate. |
kNoisy | Integer. Parameter of Euclidean KNN to detect noisy examples as those whose whole kNoisy-neighbourhood is from the opposite class. |
kMajority | Integer. Parameter of Euclidean KNN to detect majority borderline examples as those which are in any kMajority-neighbourhood of minority instances. Should be a low integer. |
kMinority | Integer. Parameter of Euclidean KNN to detect minority borderline examples as those which are in the kMinority-neighbourhood of majority borderline ones. It should be a large integer. By default, if no parameter is fed to the function, |
threshold | Numeric. A positive real indicating how much closeness to the boundary is tolerated for minority boundary examples. A large value gives more margin of distance for an example to be considered an important boundary one. |
cmax | Numeric. A positive real indicating how much closeness to the boundary is tolerated for minority boundary examples. The larger this number, the more boundary examples are valued. |
cclustering | Numeric. A positive real for tuning the output of an internal clustering. The larger this parameter, the more focused on specific areas the oversampling is going to be. |
classAttr |
|
Value
A data.frame with the same structure as dataset, containing the generated synthetic examples.
References
Barua, Sukarna; Islam, Md. M.; Yao, Xin; Murase, Kazuyuki. Mwmote–Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning. IEEE Transactions on Knowledge and Data Engineering 26 (2014), Nr. 2, p. 405–425.
Examples
data(iris0)
# Generates new minority examples
newSamples <- mwmote(iris0, numInstances = 100, classAttr = "Class")
Filtering of oversampled data based on non-cooperative game theory
Description
Filters oversampled examples from a binary class dataset using game theory to find out whether keeping an example is worthwhile.
Usage
neater(
  dataset,
  newSamples,
  k = 3,
  iterations = 100,
  smoothFactor = 1,
  classAttr = "Class"
)
Arguments
dataset | The original |
newSamples | A |
k | Integer. Number of nearest neighbours to use in the KNN algorithm to rule out samples. By default, 3. |
iterations | Integer. Number of iterations for the algorithm. By default, 100. |
smoothFactor | A positive |
classAttr |
|
Details
Uses game theory and Nash equilibria to calculate each minority example's probability of truly belonging to the minority class. It discards examples which, at the final stage of the algorithm, have a higher probability of being a majority example than a minority one.
Value
Filtered samples as a data.frame with the same structure as newSamples.
References
Almogahed, B.A.; Kakadiaris, I.A. Neater: Filtering of Over-Sampled Data Using Non-Cooperative Game Theory. Soft Computing 19 (2014), Nr. 11, p. 3301–3322.
Examples
data(iris0)
newSamples <- smotefamily::SMOTE(iris0[,-5], iris0[,5])$syn_data
# SMOTE overrides the Class attribute, turning it into class,
# and dataset must have the same class attribute as newSamples
names(newSamples) <- c(names(newSamples)[-5], "Class")
neater(iris0, newSamples, k = 5, iterations = 100, smoothFactor = 1, classAttr = "Class")
Imbalanced binary thyroid gland data
Description
Data to predict patient's hyperthyroidism.
Usage
newthyroid1
Format
A data frame with 215 instances, 35 of which belong to the positive class, and 6 variables:
- T3resin
T3-resin uptake test, percentage. Discrete attribute.
- Thyroxin
Total serum thyroxin as measured by the isotopic displacement method. Continuous attribute.
- Triiodothyronine
Total serum triiodothyronine as measured by radioimmunoassay. Continuous attribute.
- Thyroidstimulating
Basal thyroid-stimulating hormone (TSH) as measured by radioimmunoassay. Continuous attribute.
- TSH_value
Maximal absolute difference of TSH value after injection of 200 micrograms of thyrotropin-releasing hormone as compared to the basal value. Continuous attribute.
- Class
Two possible classes: positive as hyperthyroidism, negative as non-hyperthyroidism.
Source
See Also
Original available in the UCI ML Repository.
Wrapper that encapsulates a collection of algorithms to perform a class balancing preprocessing task for binary class datasets
Description
Wrapper that encapsulates a collection of algorithms to perform a class balancing preprocessing task for binary class datasets.
Usage
oversample(
  dataset,
  ratio = NA,
  method = c("RACOG", "wRACOG", "PDFOS", "RWO", "ADASYN", "ANSMOTE", "SMOTE",
    "MWMOTE", "BLSMOTE", "DBSMOTE", "SLMOTE", "RSLSMOTE"),
  filtering = FALSE,
  classAttr = "Class",
  wrapper = c("KNN", "C5.0"),
  ...
)
Arguments
dataset | A binary class |
ratio | Number between 0 and 1 indicating the desired ratio between minority examples and majority ones, that is, the quotient size of minority class / size of majority class. There are methods, such as |
method | A |
filtering | Logical (TRUE or FALSE) indicating whether to apply filtering of oversampled instances with |
classAttr |
|
wrapper | A |
... | Further arguments to apply in selected method |
Value
A balanced data.frame with the same structure as dataset, containing both original instances and new ones.
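To make the meaning of ratio concrete, the following sketch computes how many synthetic minority instances would be needed to reach a target ratio. The helper `instances_needed` is hypothetical, and the rounding rule (ceiling) is an assumption; the package may round differently. The counts are those of glass0 as documented above (70 positive out of 214).

```r
# Hypothetical helper: synthetic minority instances required so that
# (minority + new) / majority reaches at least `ratio`.
# Rounding up is an assumption, not necessarily what oversample() does.
instances_needed <- function(nMinority, nMajority, ratio) {
  max(0, ceiling(ratio * nMajority) - nMinority)
}

nMinority <- 70
nMajority <- 214 - 70                                   # 144 majority instances
extra <- instances_needed(nMinority, nMajority, 0.8)    # ceiling(115.2) - 70 = 46
(nMinority + extra) / nMajority                         # 116 / 144, slightly above 0.8
```

If the dataset already meets the target ratio, the helper returns 0 rather than a negative count.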
Examples
data(glass0)
# Oversample glass0 to get an imbalance ratio of 0.8
imbalanceRatio(glass0)
# 0.4861111
newDataset <- oversample(glass0, ratio = 0.8, method = "MWMOTE")
imbalanceRatio(newDataset)
newDataset <- oversample(glass0, method = "ADASYN")
newDataset <- oversample(glass0, ratio = 0.8, method = "SMOTE")
Probability density function estimation based oversampling
Description
Generates synthetic minority examples for a numerical dataset, approximating a Gaussian multivariate distribution which best fits the minority data.
Usage
pdfos(dataset, numInstances, classAttr = "Class")
Arguments
dataset |
|
numInstances | Integer. Number of new minority examples to generate. |
classAttr |
|
Details
To generate the synthetic data, it samples from a normal distribution whose mean is a given example belonging to the minority class and whose variance is the minority class variance multiplied by a constant; that constant is computed so that it minimizes the mean integrated squared error of a Gaussian multivariate kernel function.
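The sampling step can be sketched in base R as below. This is a simplified illustration, not the package's algorithm: the function `pdfos_sketch` is hypothetical, and the bandwidth constant `h` is set with Silverman's rule of thumb as a stand-in for the error-minimizing constant described above.

```r
# Hypothetical sketch: draw each synthetic point from a Gaussian centred on a
# randomly chosen minority example, with covariance h^2 * cov(minority).
pdfos_sketch <- function(minority, numInstances, h) {
  X <- as.matrix(minority)
  R <- chol(cov(X))                      # upper triangular, t(R) %*% R == cov(X)
  centres <- X[sample(nrow(X), numInstances, replace = TRUE), , drop = FALSE]
  z <- matrix(rnorm(numInstances * ncol(X)), nrow = numInstances)
  centres + h * (z %*% R)                # rows of z %*% R have covariance cov(X)
}

set.seed(1)
# Toy minority class: 30 examples with 2 numeric attributes.
minority <- cbind(At1 = rnorm(30, mean = 2), At2 = rnorm(30, mean = -1))
n <- nrow(minority)
d <- ncol(minority)
# Silverman's rule of thumb (stand-in; the method described above minimizes an error estimate).
h <- (4 / (d + 2))^(1 / (d + 4)) * n^(-1 / (d + 4))
synthetic <- pdfos_sketch(minority, numInstances = 100, h = h)
dim(synthetic)  # 100 rows, 2 columns
```

A smaller `h` keeps synthetic points closer to existing minority examples, while a larger one spreads them over the whole minority region.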
Value
A data.frame with the same structure as dataset, containing the generated synthetic examples.
References
Gao, Ming; Hong, Xia; Chen, Sheng; Harris, Chris J.; Khalaf, Emad. Pdfos: Pdf Estimation Based Oversampling for Imbalanced Two-Class Problems. Neurocomputing 138 (2014), p. 248–259.
Silverman, B. W. Density Estimation for Statistics and Data Analysis. Chapman & Hall, 1986. ISBN 0412246201.
Examples
data(iris0)
newSamples <- pdfos(iris0, numInstances = 100, classAttr = "Class")
Plots comparison between the original and the new balanced dataset.
Description
It plots a grid of pairwise variable comparisons, placing the plot of the original dataset next to that of the balanced one for each pair of attributes.
Usage
plotComparison(dataset, anotherDataset, attrs, cols = 2, classAttr = "Class")
Arguments
dataset | A |
anotherDataset | A |
attrs | Vector of |
cols | Integer. It indicates the number of columns of the resulting grid. Must be an even number. By default, 2. |
classAttr |
|
Value
Plot of 2D comparison between the variables.
Examples
data(iris0)
set.seed(12345)
rwoSamples <- rwo(iris0, numInstances = 100)
rwoBalanced <- rbind(iris0, rwoSamples)
plotComparison(iris0, rwoBalanced, names(iris0), cols = 2, classAttr = "Class")
Rapidly converging Gibbs algorithm.
Description
Allows you to treat imbalanced discrete numeric datasets by generating synthetic minority examples, approximating their probability distribution.
Usage
racog(dataset, numInstances, burnin = 100, lag = 20, classAttr = "Class")
Arguments
dataset |
|
numInstances | Integer. Number of new minority examples to generate. |
burnin | Integer. It determines how many of the examples generated from a given one are discarded at the start. By default, 100. |
lag | Integer. Number of iterations between newly generated examples for a minority one. By default, 20. |
classAttr |
|
Details
Approximates the minority distribution using a Gibbs sampler. The dataset must be discretized and numeric. In each iteration, it builds a new sample using a Markov chain. It discards the first burnin iterations and, from then on, every lag iterations it accepts the current example as a new minority example. It generates d (iterations - burnin) / lag examples, where d is the number of minority examples.
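The burn-in and lag bookkeeping described above can be illustrated with a small base R sketch. The helper `kept_iterations` is hypothetical and only shows which iteration indices would yield accepted examples; the exact offsets used by racog are an assumption here, and the Gibbs sampling itself is not reproduced.

```r
# Hypothetical helper: indices of the iterations whose sample is kept,
# i.e. every `lag`-th iteration once the first `burnin` ones are discarded.
kept_iterations <- function(iterations, burnin, lag) {
  its <- seq_len(iterations)
  its[its > burnin & (its - burnin) %% lag == 0]
}

kept <- kept_iterations(iterations = 200, burnin = 100, lag = 20)
kept                      # 120 140 160 180 200

# With d minority examples, each running its own chain,
# the total is d * (iterations - burnin) / lag.
d <- 50
d * length(kept)          # 50 * 5 = 250 synthetic examples
```

A longer burn-in gives the chain more time to move away from its starting example, and a larger lag reduces the correlation between consecutive accepted examples, at the cost of more iterations per synthetic instance.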
Value
A data.frame with the same structure as dataset, containing the generated synthetic examples.
References
Das, Barnan; Krishnan, Narayanan C.; Cook, Diane J. Racog and Wracog: Two Probabilistic Oversampling Techniques. IEEE Transactions on Knowledge and Data Engineering 27 (2015), Nr. 1, p. 222–234.
Examples
data(iris0)
# Generates new minority examples
newSamples <- racog(iris0, numInstances = 40, burnin = 20, lag = 10, classAttr = "Class")
newSamples <- racog(iris0, numInstances = 100)
Random walk oversampling
Description
Generates synthetic minority examples for a dataset, trying to preserve the variance and mean of the minority class. Works on every type of dataset.
Usage
rwo(dataset, numInstances, classAttr = "Class")
Arguments
dataset |
|
numInstances | Integer. Number of new minority examples to generate. |
classAttr |
|
Details
Generates numInstances new minority examples for dataset, adding to each numeric column of the j-th example its variance scaled by the inverse of the number of minority examples and a factor following a N(0,1) distribution which depends on the example. When the column is nominal, it uses a roulette scheme.
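For numeric columns, the random walk step can be sketched in base R as follows. The function `rwo_numeric_sketch` is hypothetical, and the step size used here (column standard deviation divided by the square root of the number of minority examples, with one N(0,1) draw per value) is an assumption based on the formulation in the cited paper; the package's exact scaling may differ. Nominal columns and the roulette scheme are not covered.

```r
# Hypothetical sketch of the numeric random walk step.
rwo_numeric_sketch <- function(minority, numInstances) {
  X <- as.matrix(minority)
  n <- nrow(X)
  stepSize <- apply(X, 2, sd) / sqrt(n)             # assumed per-column step scale
  # Cycle through the minority examples to pick a starting point for each walk.
  base <- X[rep_len(seq_len(n), numInstances), , drop = FALSE]
  r <- matrix(rnorm(numInstances * ncol(X)), nrow = numInstances)
  base - sweep(r, 2, stepSize, `*`)                 # small random displacement
}

set.seed(42)
# Toy minority class: 25 examples with 3 numeric attributes.
minority <- cbind(a = rnorm(25, 5), b = rnorm(25, 0, 2), c = runif(25))
synthetic <- rwo_numeric_sketch(minority, numInstances = 60)
dim(synthetic)                                      # 60 rows, 3 columns
```

Because the displacement has zero mean and shrinks as the minority class grows, the synthetic examples stay close to the originals and the column means are roughly preserved.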
Value
A data.frame with the same structure as dataset, containing the generated synthetic examples.
References
Zhang, Huaxiang; Li, Mingfang. Rwo-Sampling: A Random Walk Over-Sampling Approach To Imbalanced Data Classification. Information Fusion 20 (2014), p. 99–116.
Examples
data(iris0)
newSamples <- rwo(iris0, numInstances = 100, classAttr = "Class")
Generic methods to train classifiers
Description
Generic methods to train classifiers
Usage
trainWrapper(wrapper, train, trainClass, ...)
Arguments
wrapper | the wrapper instance |
train |
|
trainClass | a vector containing the class column for |
... | further arguments for |
Value
A model which is predict callable.
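The idea of a wrapper whose trained model is predict callable can be illustrated with plain S3 dispatch in base R. Everything below is hypothetical (the generic `trainWrapperSketch`, the class `MajorityWrapper` and the toy majority-label model); it mirrors the documented signature but is not the package's own generic.

```r
# Hypothetical generic with the same argument layout as trainWrapper.
trainWrapperSketch <- function(wrapper, train, trainClass, ...) {
  UseMethod("trainWrapperSketch")
}

# Method for a toy wrapper: the "model" just remembers the most frequent label.
trainWrapperSketch.MajorityWrapper <- function(wrapper, train, trainClass, ...) {
  label <- names(which.max(table(trainClass)))
  structure(list(label = label), class = "majorityModel")
}

# A predict method makes the returned model predict callable.
predict.majorityModel <- function(object, newdata, ...) {
  rep(object$label, nrow(newdata))
}

myWrapper <- structure(list(), class = "MajorityWrapper")
labels <- ifelse(iris$Species == "setosa", "positive", "negative")
model <- trainWrapperSketch(myWrapper, iris[, 1:4], labels)
predict(model, iris[1:3, 1:4])   # "negative" three times (100 negative vs 50 positive)
```

The class attribute of the wrapper object selects the training method, so supporting a new classifier only requires defining one more method for the generic.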
See Also
Examples
myWrapper <- structure(list(), class = "C50Wrapper")
trainWrapper.C50Wrapper <- function(wrapper, train, trainClass){
  C50::C5.0(train, trainClass)
}
Imbalanced binary breast cancer Wisconsin dataset
Description
Binary class dataset containing traits about patients with cancer. The original dataset was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg.
Usage
wisconsin
Format
A data frame with 683 instances, 239 of which belong to the positive class, and 10 variables:
- ClumpThickness
Discrete attribute.
- CellSize
Discrete attribute.
- CellShape
Discrete attribute.
- MarginalAdhesion
Discrete attribute.
- EpithelialSize
Discrete attribute.
- BareNuclei
Discrete attribute.
- BlandChromatin
Discrete attribute.
- NormalNucleoli
Discrete attribute.
- Mitoses
Discrete attribute.
- Class
Two possible classes: positive (cancer) and negative (not cancer).
Source
See Also
Original available in the UCI ML Repository.
Wrapper for rapidly converging Gibbs algorithm.
Description
Generates synthetic minority examples by approximating their probability distribution until the sensitivity of wrapper over validation cannot be further improved. Works only on discrete numeric datasets.
Usage
wracog(
  train,
  validation,
  wrapper,
  slideWin = 10,
  threshold = 0.02,
  classAttr = "Class",
  ...
)
Arguments
train |
|
validation |
|
wrapper | An |
slideWin | Number of last sensitivities to take into account to meet the stopping criterion. By default, 10. |
threshold | Threshold that the last |
classAttr |
|
... | further arguments for |
Details
Until the last slideWin executions of wrapper over the validation dataset reach a mean sensitivity lower than threshold, the algorithm keeps generating samples using a Gibbs sampler and adding to the train dataset those samples that are misclassified by the model built in the previous training round. The initial model is built on the initial train dataset.
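The quantity tracked by the stopping criterion, sensitivity over the validation set, is the fraction of positive instances that the model labels as positive. The sketch below shows that computation and how the last slideWin values of a history could be extracted; the helper `sensitivity_sketch` and the sample history are hypothetical, and the comparison against threshold is deliberately left out because its exact form is defined by the package.

```r
# Hypothetical helper: sensitivity = true positives / (true positives + false negatives).
sensitivity_sketch <- function(predicted, actual, positive = "positive") {
  truePositives <- sum(predicted == positive & actual == positive)
  falseNegatives <- sum(predicted != positive & actual == positive)
  truePositives / (truePositives + falseNegatives)
}

actual    <- c("positive", "positive", "positive", "negative", "negative")
predicted <- c("positive", "negative", "positive", "negative", "positive")
sens <- sensitivity_sketch(predicted, actual)   # 2 of 3 positives recovered

# Keeping only the most recent values, as a sliding window of size 3 would.
history <- c(0.50, 0.55, 0.60, 0.61, 0.62)      # made-up sensitivities per round
lastWindow <- tail(history, 3)                  # 0.60 0.61 0.62
```

Tracking only a window of recent values means that early, unstable rounds stop influencing the decision once enough rounds have run.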
Value
A data.frame with the same structure as train, containing the generated synthetic examples.
References
Das, Barnan; Krishnan, Narayanan C.; Cook, Diane J. Racog and Wracog: Two Probabilistic Oversampling Techniques. IEEE Transactions on Knowledge and Data Engineering 27 (2015), Nr. 1, p. 222–234.
Examples
data(haberman)
# Create train and validation partitions of haberman
trainFold <- sample(1:nrow(haberman), nrow(haberman)/2, FALSE)
trainSet <- haberman[trainFold, ]
validationSet <- haberman[-trainFold, ]
# Defines our own wrapper with a C5.0 tree
myWrapper <- structure(list(), class = "TestWrapper")
trainWrapper.TestWrapper <- function(wrapper, train, trainClass){
  C50::C5.0(train, trainClass)
}
# Execute wRACOG with our own wrapper
newSamples <- wracog(trainSet, validationSet, myWrapper, classAttr = "Class")
# Execute wRACOG with predefined wrappers for "KNN" or "C5.0"
KNNSamples <- wracog(trainSet, validationSet, "KNN")
C50Samples <- wracog(trainSet, validationSet, "C5.0")
Imbalanced binary yeast protein localization sites
Description
Imbalanced binary dataset containing protein traits for predicting their cellular localization sites.
Usage
yeast4
Format
A data frame with 1484 instances, 51 of which belong to the positive class, and 9 variables:
- Mcg
McGeoch's method for signal sequence recognition. Continuous attribute.
- Gvh
Von Heijne's method for signal sequence recognition. Continuous attribute.
- Alm
Score of the ALOM membrane spanning region prediction program. Continuous attribute.
- Mit
Score of discriminant analysis of the amino acid content of the N-terminal region (20 residues long) of mitochondrial and non-mitochondrial proteins. Continuous attribute.
- Erl
Presence of "HDEL" substring (thought to act as a signal for retention in the endoplasmic reticulum lumen). Binary attribute. Discrete attribute.
- Pox
Peroxisomal targeting signal in the C-terminus. Continuous attribute.
- Vac
Score of discriminant analysis of the amino acid content of vacuolar and extracellular proteins. Continuous attribute.
- Nuc
Score of discriminant analysis of nuclear localization signals of nuclear and non-nuclear proteins. Continuous attribute.
- Class
Two possible classes: positive (membrane protein, uncleaved signal), negative (rest of localizations).
Source
See Also
Original available in the UCI ML Repository.