| Type: | Package |
| Title: | Imputation of Missing Data in Sequence Analysis |
| Version: | 2.2.0 |
| Description: | Multiple imputation of missing data in a dataset using MICT or MICT-timing methods. The core idea of the algorithms is to fill gaps of missing data, which is the typical form of missing data in a longitudinal setting, recursively from their edges. Prediction is based on either a multinomial or random forest regression model. Covariates and time-dependent covariates can be included in the model. |
| License: | GPL-2 |
| Imports: | Amelia, cluster, dfidx, doRNG, doSNOW, dplyr, foreach, graphics, mlr, nnet, parallel, plyr, ranger, rms, stats, stringr, TraMineR, TraMineRextras, utils, mice, parallelly |
| Suggests: | R.rsp, rmarkdown, testthat (≥ 3.0.0) |
| VignetteBuilder: | R.rsp |
| Config/testthat/edition: | 3 |
| Encoding: | UTF-8 |
| LazyData: | true |
| RoxygenNote: | 7.3.2 |
| URL: | https://github.com/emerykevin/seqimpute |
| BugReports: | https://github.com/emerykevin/seqimpute/issues |
| NeedsCompilation: | no |
| Packaged: | 2025-01-15 14:33:26 UTC; Kevin |
| Author: | Kevin Emery [aut, cre], Anthony Guinchard [aut], Andre Berchtold [aut], Kamyar Taher [aut] |
| Maintainer: | Kevin Emery <kevin.emery@unige.ch> |
| Depends: | R (≥ 3.5.0) |
| Repository: | CRAN |
| Date/Publication: | 2025-01-15 16:10:02 UTC |
Function that adds the clustering result to a seqimp object obtained with the seqimpute function
Description
Function that adds the clustering result to a seqimp object obtained with the seqimpute function
Usage
addcluster(impdata, clustering)
Arguments
impdata | An object of class |
clustering | Clustering made on the multiply imputed datasets. Can be either a data frame or a matrix, where each row corresponds to an observation and each column to an imputed dataset |
Value
Returns a seqimp object containing the cluster to which each sequence in each imputed dataset belongs. Specifically, a column named cluster is added to the imputed datasets.
Transform an object of class seqimp into a dataframe or a mids object
Description
The function converts a seqimp object into a specified format.
Usage
fromseqimp(data, format = "long", include = FALSE)
Arguments
data | An object of class seqimp as created by the function seqimpute |
format | The format in which the seqimp object should be returned. It could be: |
include | Logical that indicates whether the original dataset with missing values should be included. This parameter does not apply if |
Details
The argument format specifies the object that should be returned by the function. It can take the following values:
"long": produces a data set in which imputed data sets are stacked vertically. The following columns are added: 1) .imp referring to the imputation number, and 2) .id the row names of the original dataset
"stacked": the same as "long", but without the inclusion of the two columns .imp and .id
"mids": produces an object of class mids, which is the format used by the mice package.
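The difference between the "long" and "stacked" layouts can be illustrated with base R alone. The sketch below is not the package's code: imp_list is a toy stand-in for a list of completed datasets, and the row order (imputation by imputation) is an assumption.

```r
# Toy stand-in for two completed datasets (hypothetical, not seqimpute output)
imp_list <- list(
  data.frame(T1 = c("no", "yes"), T2 = c("yes", "yes"), row.names = c("a", "b")),
  data.frame(T1 = c("no", "yes"), T2 = c("no", "yes"), row.names = c("a", "b"))
)

# "stacked" layout: completed datasets bound row-wise, no bookkeeping columns
stacked <- do.call(rbind, unname(imp_list))
rownames(stacked) <- NULL

# "long" layout: same rows, preceded by the imputation number (.imp)
# and the row name of the original dataset (.id)
long <- do.call(rbind, lapply(seq_along(imp_list), function(i) {
  d <- imp_list[[i]]
  out <- cbind(.imp = i, .id = rownames(d), d)
  rownames(out) <- NULL
  out
}))

print(long)
```

The .imp and .id columns let an analysis be run separately on each completed dataset and the results matched back to the original rows.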
Value
Returns the seqimp object transformed into the desired format.
Author(s)
Kevin Emery
Examples
## Not run: 
# Imputation with the MICT algorithm
imp <- seqimpute(data = gameadd, var = 1:4)
# The object imp is transformed to a dataframe, where completed datasets are
# stacked vertically
imp.stacked <- fromseqimp(
  data = imp,
  format = "stacked", include = FALSE
)
## End(Not run)
Example data set: Game addiction
Description
Dataset containing variables on the gaming addiction of young people. The data consists of gaming addiction, coded as either 'no' or 'yes', measured over four consecutive years for 500 individuals, three covariates and one time-dependent covariate. The yearly states are recorded in columns 1 (T1_abuse) to 4 (T4_abuse).
The three covariates are
Gender (female or male), Age (measured at time 1), Track (school or apprenticeship).
The time-varying covariate consists of the individual's relationship to gambling at each of the four time points, appearing in columns T1_gambling, T2_gambling, T3_gambling, and T4_gambling. The states are either no, gambler or problematic gambler.
Usage
data(gameadd)
Format
A data frame containing 500 rows, 4 state variables, 3 covariates and a time-dependent covariate.
Plot a seqimp object
Description
Plot a seqimp object. The state distribution plot of the first m completed datasets is shown, possibly alongside the original dataset with missing data.
Usage
## S3 method for class 'seqimp'
plot(x, m = 5, include = TRUE, ...)
Arguments
x | Object of class |
m | Number of completed datasets to show |
include | Logical that indicates whether the original dataset with missing values should be plotted |
... | Arguments to be passed to the seqdplot function |
Author(s)
Kevin Emery
Print a seqimp object
Description
Print a seqimp object
Usage
## S3 method for class 'seqimp'
print(x, ...)
Arguments
x | Object of class |
... | additional arguments passed to other functions |
Author(s)
Kevin Emery
Summary of the types of gaps in a dataset
Description
The seqQuickLook() function provides an overview of the number and size of the different types of gaps in the original dataset.
Usage
seqQuickLook(data, var = NULL, np = 1, nf = 1)
Arguments
data | a data.frame where missing data are coded as NA or a state sequence object built with the seqdef function |
var | the list of columns containing the trajectories. Default is NULL, i.e. all the columns. |
np | number of previous observations in the imputation model of the internal gaps. |
nf | number of future observations in the imputation model of the internal gaps. |
Details
The distinction between internal gaps and specifically located gaps (SLG) depends on the number of previous (np) and future (nf) observations that are set for the MICT and MICT-timing algorithms.
Value
Returns a data.frame object that summarizes, for each type of gap (Internal Gaps, Initial Gaps, Terminal Gaps, LEFT-hand side SLG, RIGHT-hand side SLG, Both-hand side SLG), the minimum length, the maximum length, the total number of gaps and the total number of missing values they contain.
Author(s)
Andre Berchtold and Kevin Emery
Examples
data(gameadd)
seqQuickLook(data = gameadd, var = 1:4, np = 1, nf = 1)
Spotting impossible transitions in longitudinal categorical data
Description
The purpose of seqTrans is to spot impossible transitions in longitudinal categorical data.
Usage
seqTrans(data, var = NULL, trans)
Arguments
data | a data frame containing sequences of a multinomial variable with missing data (coded as |
var | the list of columns containing the trajectories. Default is NULL, i.e. all the columns. |
trans |
|
Value
It returns a matrix where each row is the position of an impossible transition.
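The idea of locating a forbidden transition can be sketched in base R. This is only an illustration under stated assumptions: find_transitions() is a hypothetical helper, it takes the two states separately rather than a "yes->no" string, and its row/column output is not guaranteed to match the layout of the matrix returned by seqTrans.

```r
# Hypothetical helper, not the package's seqTrans implementation
find_transitions <- function(x, from, to) {
  m <- as.matrix(x)
  nc <- ncol(m)
  prev <- m[, -nc, drop = FALSE]  # states at time t - 1
  nxt <- m[, -1, drop = FALSE]    # states at time t
  # A transition is flagged only when both states are observed
  hit <- !is.na(prev) & !is.na(nxt) & prev == from & nxt == to
  pos <- which(hit, arr.ind = TRUE)
  # Report the column in which the transition lands
  pos[, "col"] <- pos[, "col"] + 1L
  pos[order(pos[, "row"], pos[, "col"]), , drop = FALSE]
}

toy <- data.frame(
  T1 = c("no", "yes", "yes"),
  T2 = c("yes", "no", "yes"),
  T3 = c("no", "no", NA)
)
res <- find_transitions(toy, from = "yes", to = "no")
print(res)
```

Transitions into or out of a missing value are ignored here, which is a design choice of this sketch.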
Author(s)
Andre Berchtold and Kevin Emery
Examples
data(gameadd)
seqTransList <- seqTrans(data = gameadd, var = 1:4, trans = c("yes->no"))
Generation of missing data in longitudinal categorical data.
Description
Generation of missing data in sequences based on a Markovian approach.
Usage
seqaddNA(
  data,
  var = NULL,
  states.high = NULL,
  propdata = 1,
  pstart.high = 0.1,
  pstart.low = 0.005,
  pcont = 0.66,
  maxgap = 3,
  maxprop = 0.75,
  only.traj = FALSE
)
Arguments
data | A data frame containing sequences of a categorical (multinomial) variable, where missing data are coded as |
var | A vector specifying the columns of the dataset that contain the trajectories. Default is |
states.high | A list of states with a higher probability of initiating a subsequent missing data gap. |
propdata | Proportion of trajectories for which missing data is simulated, as a decimal between 0 and 1. |
pstart.high | Probability of starting a missing data gap for the states specified in the |
pstart.low | Probability of starting a missing data gap for all other states. |
pcont | Probability of a missing data gap to continue. |
maxgap | Maximum length of a missing data gap. |
maxprop | Maximum proportion of missing data allowed in a sequence, as a decimal between 0 and 1. |
only.traj | Logical, if |
Details
The first time point of a trajectory has a pstart.low probability to be missing. For the next time points, the probability to be missing depends on the previous time point. There are four cases:
1. If the previous time point is missing and the maximum length of a missing gap, which is specified by the argument maxgap, is reached, the time point is set as observed.
2. If the previous time point is missing, but the maximum length of a gap is not reached, there is a pcont probability that this time point is missing.
3. If the previous time point is observed and its state belongs to the list of states specified by states.high, the probability to be missing is pstart.high.
4. If the previous time point is observed but its state does not belong to the list of states specified by states.high, the probability to be missing is pstart.low.
If the proportion of missing data in a given trajectory exceeds the proportion specified by maxprop, the missing data simulation is repeated for the sequence.
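The four cases and the maxprop check can be sketched in base R for a single trajectory. This is a minimal illustration, not the package's implementation: simulate_missing() and add_na() are hypothetical helper names, and the propdata and only.traj arguments are left out.

```r
# Hypothetical helper: draws a missingness indicator for one trajectory
simulate_missing <- function(states, states.high = NULL, pstart.high = 0.1,
                             pstart.low = 0.005, pcont = 0.66, maxgap = 3) {
  n <- length(states)
  miss <- logical(n)
  gaplen <- 0  # length of the gap currently open
  for (t in seq_len(n)) {
    if (t == 1) {
      p <- pstart.low                       # first time point
    } else if (miss[t - 1]) {
      p <- if (gaplen >= maxgap) 0 else pcont  # cases 1 and 2
    } else if (states[t - 1] %in% states.high) {
      p <- pstart.high                      # case 3
    } else {
      p <- pstart.low                       # case 4
    }
    miss[t] <- runif(1) < p
    gaplen <- if (miss[t]) gaplen + 1 else 0
  }
  miss
}

# Hypothetical wrapper: redraw until the share of missing data is acceptable
add_na <- function(states, maxprop = 0.75, ...) {
  repeat {
    miss <- simulate_missing(states, ...)
    if (mean(miss) <= maxprop) break
  }
  states[miss] <- NA
  states
}

set.seed(1)
x <- rep(c("employed", "jobless"), 10)
print(add_na(x, states.high = "jobless", pstart.high = 0.5))
```

Setting the probabilities to 0 or 1 makes the process deterministic, which is a convenient way to check each case in isolation.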
Value
A data frame with simulated missing data.
Author(s)
Kevin Emery
Examples
# Generate MCAR missing data on the mvad dataset
# from the TraMineR package
## Not run: 
data(mvad, package = "TraMineR")
mvad.miss <- seqaddNA(mvad, var = 17:86)
# Generate missing data on mvad where joblessness is more likely to trigger
# a missing data gap
mvad.miss2 <- seqaddNA(mvad, var = 17:86, states.high = "joblessness")
## End(Not run)
Extract all the trajectories without missing value.
Description
Extract all the trajectories without missing value.
Usage
seqcomplete(data, var = NULL)
Arguments
data | either a data frame containing sequences of a multinomial variable with missing data (coded as |
var | the list of columns containing the trajectories. Default is NULL, i.e. all the columns. |
Value
Returns either a data frame or a state sequence object, depending on the type of data that was provided to the function
Author(s)
Kevin Emery
Examples
# Game addiction dataset
data(gameadd)
# Extract the trajectories without any missing data
gameadd.complete <- seqcomplete(gameadd, var = 1:4)
seqimpute: Imputation of missing data in longitudinal categorical data
Description
The seqimpute package implements the MICT and MICT-timing methods. These are multiple imputation methods for longitudinal data. The core idea of the algorithms is to fill gaps of missing data, which is the typical form of missing data in a longitudinal setting, recursively from their edges. The prediction is based on either a multinomial or a random forest regression model. Covariates and time-dependent covariates can be included in the model.
The MICT-timing algorithm is an extension of the MICT algorithm designed to address a key limitation of the latter: its assumption that position in the trajectory is irrelevant.
Usage
seqimpute(
  data,
  var = NULL,
  np = 1,
  nf = 1,
  m = 5,
  timing = FALSE,
  frame.radius = 0,
  covariates = NULL,
  time.covariates = NULL,
  regr = "multinom",
  npt = 1,
  nfi = 1,
  ParExec = FALSE,
  ncores = NULL,
  SetRNGSeed = FALSE,
  end.impute = TRUE,
  verbose = TRUE,
  available = TRUE,
  pastDistrib = FALSE,
  futureDistrib = FALSE,
  ...
)
Arguments
data | Either a data frame containing sequences of a categorical variable, where missing data are coded as |
var | A vector specifying the columns of the dataset that contain the trajectories. Default is |
np | Number of prior states to include in the imputation model for internal gaps. |
nf | Number of subsequent states to include in the imputation model for internal gaps. |
m | Number of multiple imputations to perform (default: |
timing | Logical, specifies the imputation algorithm to use. If |
frame.radius | Integer, relevant only for the MICT-timing algorithm, specifying the radius of the timeframe. |
covariates | List of the columns of the dataset containing covariates to be included in the imputation model. |
time.covariates | List of the columns of the dataset with time-varying covariates to include in the imputation model. |
regr | Character specifying the imputation method. Options include |
npt | Number of prior observations in the imputation model for terminal gaps (i.e., gaps at the end of sequences). |
nfi | Number of future observations in the imputation model for initial gaps (i.e., gaps at the beginning of sequences). |
ParExec | Logical, indicating whether to run multiple imputations in parallel. Setting to |
ncores | Integer, specifying the number of cores to use for parallel computation. If unset, defaults to the maximum number of CPU cores minus one. |
SetRNGSeed | Integer, to set the random seed for reproducibility in parallel computations. Note that setting |
end.impute | Logical. If |
verbose | Logical, if |
available | Logical, specifies whether to consider already imputed data in the predictive model. If |
pastDistrib | Logical, if |
futureDistrib | Logical, if |
... | Named arguments that are passed down to the imputation functions. |
Details
The imputation process is divided into several steps, depending on the type of gaps of missing data. The gaps are imputed in the following order:
Internal gap: there are at least np observations before an internal gap and nf after the gap
Initial gap: gaps situated at the very beginning of a trajectory
Terminal gap: gaps situated at the very end of a trajectory
Left-hand side specifically located gap (SLG): gaps that have at least nf observations after the gap, but fewer than np observations before it
Right-hand side SLG: gaps that have at least np observations before the gap, but fewer than nf observations after it
Both-hand side SLG: gaps that have fewer than np observations before the gap, and fewer than nf observations after it
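The gap types listed above can be illustrated with a small base R classifier for a single trajectory. It is a sketch under an explicit assumption, not the package's internal logic: "observations before" and "after" are counted as the number of time points preceding the start of the gap and following its end. classify_gaps() is a hypothetical helper name.

```r
# Hypothetical helper: labels each run of NA in one trajectory
classify_gaps <- function(x, np = 1, nf = 1) {
  n <- length(x)
  r <- rle(is.na(x))
  end <- cumsum(r$lengths)
  start <- end - r$lengths + 1
  start <- start[r$values]  # keep only the runs of missing values
  end <- end[r$values]
  before <- start - 1       # time points preceding the gap (assumption)
  after <- n - end          # time points following the gap (assumption)
  type <- ifelse(before == 0, "Initial gap",
    ifelse(after == 0, "Terminal gap",
      ifelse(before >= np & after >= nf, "Internal gap",
        ifelse(after >= nf, "Left-hand side SLG",
          ifelse(before >= np, "Right-hand side SLG", "Both-hand side SLG")
        )
      )
    )
  )
  data.frame(start = start, end = end, type = type)
}

x <- c(NA, "a", NA, "b", "c", "d", NA, "e", NA, NA)
print(classify_gaps(x, np = 3, nf = 2))
```

With the default np = 1 and nf = 1, every gap that is neither initial nor terminal is internal under this counting rule, so SLGs only appear once np or nf is larger than 1.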
The primary difference between the MICT and MICT-timing algorithms lies in their approach to selecting patterns from other sequences for fitting the multinomial model. While the MICT algorithm considers all similar patterns regardless of their temporal placement, MICT-timing restricts pattern selection to those that are temporally closest to the missing value. This refinement ensures that the imputation process adequately accounts for temporal dynamics, resulting in more accurate imputed values.
Value
An object of class seqimp, which is a list with the following elements:
data: A data.frame containing the original (incomplete) data.
imp: A list of m data.frame corresponding to the imputed datasets.
m: The number of imputations.
method: A character vector specifying whether MICT or MICT-timing was used.
np: Number of prior states included in the imputation model.
nf: Number of subsequent states included in the imputation model.
regr: A character vector specifying whether multinomial or random forest imputation models were applied.
call: The call that created the object.
Author(s)
Kevin Emery <kevin.emery@unige.ch>, Andre Berchtold, Anthony Guinchard, and Kamyar Taher
References
Halpin, B. (2012). Multiple imputation for life-course sequence data. Working Paper WP2012-01, Department of Sociology, University of Limerick. http://hdl.handle.net/10344/3639
Halpin, B. (2013). Imputing sequence data: Extensions to initial and terminal gaps, Stata's. Working Paper WP2013-01, Department of Sociology, University of Limerick. http://hdl.handle.net/10344/3620
Emery, K., Studer, M., & Berchtold, A. (2024). Comparison of imputation methods for univariate categorical longitudinal data. Quality & Quantity, 1-25. https://link.springer.com/article/10.1007/s11135-024-02028-z
Examples
# Default multiple imputation of the trajectories of game addiction with the
# MICT algorithm
## Not run: 
set.seed(5)
imp1 <- seqimpute(data = gameadd, var = 1:4)
# Default multiple imputation with the MICT-timing algorithm
set.seed(3)
imp2 <- seqimpute(data = gameadd, var = 1:4, timing = TRUE)
# Inclusion in the MICT imputation process of the three background
# characteristics (Gender, Age and Track), and the time-varying covariate
# about gambling
set.seed(4)
imp3 <- seqimpute(
  data = gameadd, var = 1:4, covariates = 5:7,
  time.covariates = 8:11
)
# Parallel computation
imp4 <- seqimpute(
  data = gameadd, var = 1:4, covariates = 5:7,
  time.covariates = 8:11, ParExec = TRUE, ncores = 5, SetRNGSeed = 2
)
## End(Not run)
Plot all the patterns of missing data.
Description
This function plots all patterns of missing data within sequences, based on the seqIplot function.
Usage
seqmissIplot(data, var = NULL, with.complete = TRUE, void.miss = TRUE, ...)
Arguments
data | Either a data frame containing sequences of a categorical variable, where missing data are coded as |
var | A vector specifying the columns of the dataset that contain the trajectories. Default is |
with.complete | Logical, if |
void.miss | Logical, if |
... | Additional parameters passed to the seqIplot function. |
Details
This function uses seqIplot to visualize all patterns of missing data within sequences. For further customization options, refer to the seqIplot documentation.
Author(s)
Kevin Emery
Examples
# Plot all the patterns of missing data
seqmissIplot(gameadd, var = 1:4)
# Plot all the patterns of missing data discarding
# complete trajectories
seqmissIplot(gameadd, var = 1:4, with.complete = FALSE)
Plot the most common patterns of missing data.
Description
This function plots the most frequent patterns of missing data, based on the seqfplot function.
Usage
seqmissfplot(data, var = NULL, with.complete = TRUE, void.miss = TRUE, ...)
Arguments
data | Either a data frame containing sequences of a categorical variable, where missing data are coded as |
var | A vector specifying the columns of the dataset that contain the trajectories. Default is |
with.complete | Logical, if |
void.miss | Logical, if |
... | Additional parameters passed to the seqfplot function. |
Details
This plot function is based on the seqfplot function, allowing users to visualize patterns of missing data within sequences. For details on additional customizable arguments, see the seqfplot documentation.
By default, this function plots the 10 most frequent patterns. The number of patterns to be plotted can be adjusted using the idxs argument in seqfplot.
Author(s)
Kevin Emery
Examples
# Plot the 10 most common patterns of missing data
seqmissfplot(gameadd, var = 1:4)
# Plot the 10 most common patterns of missing data discarding
# complete trajectories
seqmissfplot(gameadd, var = 1:4, with.complete = FALSE)
# Plot only the 5 most common patterns of missing data discarding
# complete trajectories
seqmissfplot(gameadd, var = 1:4, with.complete = FALSE, idxs = 1:5)
Identification and visualization of states that best characterize sequences with missing data
Description
This function identifies and visualizes states that best characterize sequences with missing data at each position (time point), comparing them to sequences without missing data at each position (time point). It is based on the seqimplic function. For more information on the methodology, see the seqimplic documentation.
Usage
seqmissimplic(data, var = NULL, void.miss = TRUE, ...)
Arguments
data | Either a data frame containing sequences of a categorical variable, where missing data are coded as |
var | A vector specifying the columns of the dataset that contain the trajectories. Default is |
void.miss | Logical, if |
... | Parameters to be passed to the seqimplic function |
Value
Returns a seqimplic object that can be plotted and printed.
Author(s)
Kevin Emery
Examples
# For illustration purposes, we simulate missing data on the mvad dataset,
# available in the TraMineR package. The "joblessness" state has a
# higher probability of triggering a missing gap
## Not run: 
data(mvad, package = "TraMineR")
mvad.miss <- seqaddNA(mvad, var = 17:86, states.high = "joblessness")
# The states that best characterize sequences with missing data
implic <- seqmissimplic(mvad.miss, var = 17:86)
# Visualization of the results
plot(implic)
## End(Not run)
Extract all the trajectories with at least one missing value
Description
Extract all the trajectories with at least one missing value
Usage
seqwithmiss(data, var = NULL)
Arguments
data | either a data frame containing sequences of a multinomial variable with missing data (coded as |
var | the list of columns containing the trajectories. Default is NULL, i.e. all the columns. |
Value
Returns either a data frame or a state sequence object, depending on the type of data that was provided to the function
Author(s)
Kevin Emery
Examples
# Game addiction dataset
data(gameadd)
# Extract the trajectories with at least one missing value
gameadd.withmiss <- seqwithmiss(gameadd, var = 1:4)
Summary of a seqimp object
Description
Summary of a seqimp object
Usage
## S3 method for class 'seqimp'
summary(object, ...)
Arguments
object | Object of class |
... | additional arguments passed to other functions |
Author(s)
Kevin Emery