
Type: Package
Title: Imputation of Missing Data in Sequence Analysis
Version: 2.2.0
Description: Multiple imputation of missing data in a dataset using MICT or MICT-timing methods. The core idea of the algorithms is to fill gaps of missing data, which is the typical form of missing data in a longitudinal setting, recursively from their edges. Prediction is based on either a multinomial or random forest regression model. Covariates and time-dependent covariates can be included in the model.
License: GPL-2
Imports: Amelia, cluster, dfidx, doRNG, doSNOW, dplyr, foreach, graphics, mlr, nnet, parallel, plyr, ranger, rms, stats, stringr, TraMineR, TraMineRextras, utils, mice, parallelly
Suggests: R.rsp, rmarkdown, testthat (≥ 3.0.0)
VignetteBuilder: R.rsp
Config/testthat/edition:
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.3.2
URL: https://github.com/emerykevin/seqimpute
BugReports: https://github.com/emerykevin/seqimpute/issues
NeedsCompilation:
Packaged: 2025-01-15 14:33:26 UTC; Kevin
Author: Kevin Emery [aut, cre], Anthony Guinchard [aut], Andre Berchtold [aut], Kamyar Taher [aut]
Maintainer: Kevin Emery <kevin.emery@unige.ch>
Depends: R (≥ 3.5.0)
Repository: CRAN
Date/Publication: 2025-01-15 16:10:02 UTC

Function that adds the clustering result to a `seqimp` object obtained with the `seqimpute` function

Description

Adds a clustering result to a seqimp object obtained with the seqimpute function.

Usage

addcluster(impdata, clustering)

Arguments

impdata

An object of class seqimp as created by the seqimpute function.

clustering

Clustering computed on the multiply imputed datasets. Can be either a data frame or a matrix, where each row corresponds to an observation and each column to an imputed dataset.

Value

Returns a seqimp object containing the cluster to which each sequence in each imputed dataset belongs. Specifically, a column named cluster is added to the imputed datasets.
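
To make the expected shape of the clustering argument concrete, the following base R sketch mimics the behaviour described above: one column of cluster labels per imputed dataset, appended to that dataset as a column named cluster. The helper add_cluster_column and the toy data are hypothetical and are not part of the package; addcluster itself works on a seqimp object rather than on a plain list.

```r
# Two toy "imputed datasets" with three sequences each (illustrative data only)
imp <- list(
  data.frame(T1 = c("no", "yes", "no"), T2 = c("no", "yes", "yes")),
  data.frame(T1 = c("no", "yes", "no"), T2 = c("yes", "yes", "yes"))
)

# One row per observation, one column per imputed dataset
clustering <- cbind(c(1, 2, 1), c(2, 2, 1))

# Hypothetical helper: append column k of `clustering` to dataset k
add_cluster_column <- function(imp, clustering) {
  clustering <- as.matrix(clustering)
  stopifnot(ncol(clustering) == length(imp))
  Map(function(d, k) {
    stopifnot(nrow(d) == nrow(clustering))
    d$cluster <- clustering[, k]
    d
  }, imp, seq_along(imp))
}

with_clusters <- add_cluster_column(imp, clustering)
```

Each element of with_clusters keeps its original columns and gains the cluster labels taken from the matching column of the clustering matrix.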

Transform an object of class `seqimp` into a dataframe or a `mids` object

Description

The function converts a seqimp object into a specified format.

Usage

fromseqimp(data, format = "long", include = FALSE)

Arguments

data

An object of class seqimp as created by the function seqimpute.

format

The format in which the seqimp object should be returned. It can be "long", "stacked" or "mids". See the Details section for their interpretation.

include

Logical indicating whether the original dataset with missing values should be included. This parameter does not apply if format = "mids".

Details

The argument format specifies the object that should be returned by the function. It can take the following values:

"long": produces a data set in which the imputed data sets are stacked vertically. Two columns are added: 1) .imp, the imputation number, and 2) .id, the row names of the original dataset.
"stacked": the same as "long", but without the two columns .imp and .id.
"mids": produces an object of class mids, which is the format used by the mice package.
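
The difference between the "long" and "stacked" layouts can be sketched in base R. This is only an illustration of the layouts described above, built on a hypothetical list of toy data frames; the helper to_long is not part of the package, and fromseqimp should be used on real seqimp objects.

```r
# Two toy completed datasets (illustrative data only)
imp <- list(
  data.frame(T1 = c("no", "yes", "no"), T2 = c("no", "yes", "yes")),
  data.frame(T1 = c("no", "yes", "no"), T2 = c("yes", "yes", "yes"))
)

# Hypothetical helper: stack the datasets and add the .imp and .id columns
to_long <- function(imp) {
  pieces <- Map(function(d, k) {
    data.frame(.imp = k, .id = rownames(d), d,
               check.names = FALSE, stringsAsFactors = FALSE)
  }, imp, seq_along(imp))
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

long <- to_long(imp)             # "long"-like layout: .imp and .id come first
stacked <- do.call(rbind, imp)   # "stacked"-like layout: no extra columns
```

Both objects have one row per sequence and per imputation; only the long one records which imputation and which original row each line comes from.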

Value

The seqimp object transformed into the requested format.

Author(s)

Kevin Emery

Examples

## Not run:
# Imputation with the MICT algorithm
imp <- seqimpute(data = gameadd, var = 1:4)

# The object imp is transformed to a dataframe, where completed datasets are
# stacked vertically
imp.stacked <- fromseqimp(
  data = imp,
  format = "stacked", include = FALSE
)
## End(Not run)

Example data set: Game addiction

Description

Dataset containing variables on the gaming addiction of young people. The data consist of gaming addiction, coded as either 'no' or 'yes', measured over four consecutive years for 500 individuals, together with three covariates and one time-dependent covariate. The yearly states are recorded in columns 1 (T1_abuse) to 4 (T4_abuse).

The three covariates are

Gender (female or male),
Age (measured at time 1),
Track (school or apprenticeship).

The time-varying covariate consists of the individual's relationship to gambling at each of the four time points, appearing in columns T1_gambling, T2_gambling, T3_gambling, and T4_gambling. The states are either no, gambler or problematic gambler.

Usage

data(gameadd)

Format

A data frame containing 500 rows, 4 state variables, 3 covariates and a time-dependent covariate.

Plot a `seqimp` object

Description

Plot a seqimp object. The state distribution plot of the first m completed datasets is shown, possibly alongside the original dataset with missing data.

Usage

## S3 method for class 'seqimp'
plot(x, m = 5, include = TRUE, ...)

Arguments

x

Object of class seqimp

m

Number of completed datasets to show

include

Logical indicating whether the original dataset with missing values should be plotted

...

Arguments to be passed to the seqdplot function

Author(s)

Kevin Emery

Print a `seqimp` object

Description

Print a seqimp object

Usage

## S3 method for class 'seqimp'
print(x, ...)

Arguments

x

Object of class seqimp

...

additional arguments passed to other functions

Author(s)

Kevin Emery

Summary of the types of gaps among a dataset

Description

The seqQuickLook() function provides an overview of the number and size of the different types of gaps in the original dataset.

Usage

seqQuickLook(data, var = NULL, np = 1, nf = 1)

Arguments

data

A data frame where missing data are coded as NA, or a state sequence object built with the seqdef function.

var

The list of columns containing the trajectories. Default is NULL, i.e. all the columns.

np

Number of previous observations in the imputation model of the internal gaps.

nf

Number of future observations in the imputation model of the internal gaps.

Details

The distinction between internal and SLG gaps depends on the number of previous (np) and future (nf) observations that are set for the MICT and MICT-timing algorithms.

Value

Returns a data.frame object that summarizes, for each type of gap (Internal Gaps, Initial Gaps, Terminal Gaps, LEFT-hand side SLG, RIGHT-hand side SLG, Both-hand side SLG), the minimum length, the maximum length, the total number of gaps and the total number of missing values they contain.

Author(s)

Andre Berchtold and Kevin Emery

Examples

data(gameadd)
seqQuickLook(data = gameadd, var = 1:4, np = 1, nf = 1)

Spotting impossible transitions in longitudinal categorical data

Description

The purpose of seqTrans is to spot impossible transitions in longitudinal categorical data.

Usage

seqTrans(data, var = NULL, trans)

Arguments

data

A data frame containing sequences of a multinomial variable with missing data (coded as NA).

var

The list of columns containing the trajectories. Default is NULL, i.e. all the columns.

trans

Character vector gathering the impossible transitions. For example: trans <- c("1->3","1->4","2->1","4->1","4->3")

Value

Returns a matrix where each row is the position of an impossible transition.
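
The idea of locating forbidden transitions can be illustrated with a small base R sketch. The helper find_transitions below is hypothetical, and its output layout (row index plus the column index of the first state of the transition) is an assumption made for this illustration; the exact layout returned by seqTrans may differ.

```r
# Hypothetical helper: locate transitions written as "from->to" in a
# data frame whose columns are consecutive time points
find_transitions <- function(data, trans) {
  m <- as.matrix(data)
  from <- m[, -ncol(m), drop = FALSE]   # states at time t
  to <- m[, -1, drop = FALSE]           # states at time t + 1
  hit <- matrix(paste(from, to, sep = "->") %in% trans, nrow = nrow(m))
  # Each row of the result gives the sequence (row) and the time point (col)
  # at which a listed transition starts
  which(hit, arr.ind = TRUE)
}

# Toy trajectories (illustrative data only)
toy <- data.frame(
  T1 = c("no", "yes", "yes"),
  T2 = c("yes", "no", NA),
  T3 = c("no", "no", "no")
)
hits <- find_transitions(toy, trans = c("yes->no"))
```

Transitions that involve a missing value never match, so the third toy sequence produces no hit.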

Author(s)

Andre Berchtold and Kevin Emery

Examples

data(gameadd)
seqTransList <- seqTrans(data = gameadd, var = 1:4, trans = c("yes->no"))

Generation of missing data in longitudinal categorical data.

Description

Generation of missing data in sequences based on a Markovian approach.

Usage

seqaddNA(
  data,
  var = NULL,
  states.high = NULL,
  propdata = 1,
  pstart.high = 0.1,
  pstart.low = 0.005,
  pcont = 0.66,
  maxgap = 3,
  maxprop = 0.75,
  only.traj = FALSE
)

Arguments

data

A data frame containing sequences of a categorical (multinomial) variable, where missing data are coded as NA.

var

A vector specifying the columns of the dataset that contain the trajectories. Default is NULL, meaning all columns are used.

states.high

A list of states with a higher probability of initiating a subsequent missing data gap.

propdata

Proportion of trajectories for which missing data is simulated, as a decimal between 0 and 1.

pstart.high

Probability of starting a missing data gap for the states specified in the states.high argument.

pstart.low

Probability of starting a missing data gap for all other states.

pcont

Probability that a missing data gap continues.

maxgap

Maximum length of a missing data gap.

maxprop

Maximum proportion of missing data allowed in a sequence, as a decimal between 0 and 1.

only.traj

Logical. If TRUE, only the trajectories (specified in var) are returned. If FALSE, the entire data frame is returned.

Details

The first time point of a trajectory has a pstart.low probability of being missing. For the following time points, the probability of being missing depends on the previous time point. There are four cases:

1. If the previous time point is missing and the maximum length of a missing gap, which is specified by the argument maxgap, is reached, the time point is set as observed.

2. If the previous time point is missing, but the maximum length of a gap is not reached, there is a pcont probability that this time point is missing.

3. If the previous time point is observed and its state belongs to the list of states specified by states.high, the probability of being missing is pstart.high.

4. If the previous time point is observed but its state does not belong to the list of states specified by states.high, the probability of being missing is pstart.low.

If the proportion of missing data in a given trajectory exceeds the proportion specified by maxprop, the missing data simulation is repeated for the sequence.
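
The four cases above can be sketched for a single trajectory in base R. The function simulate_missing is a hypothetical illustration written from this description, not the package's implementation; in particular, how seqaddNA draws its random numbers and handles the propdata argument is not reproduced here.

```r
# Hypothetical sketch: Markovian generation of missing data in one trajectory
simulate_missing <- function(states, states.high = NULL, pstart.high = 0.1,
                             pstart.low = 0.005, pcont = 0.66,
                             maxgap = 3, maxprop = 0.75) {
  n <- length(states)
  repeat {
    miss <- logical(n)
    gap_len <- 0
    for (t in seq_len(n)) {
      if (t == 1) {
        p <- pstart.low                    # first time point
      } else if (miss[t - 1] && gap_len >= maxgap) {
        p <- 0                             # case 1: maximum gap length reached
      } else if (miss[t - 1]) {
        p <- pcont                         # case 2: the gap may continue
      } else if (states[t - 1] %in% states.high) {
        p <- pstart.high                   # case 3: previous state is "high"
      } else {
        p <- pstart.low                    # case 4: any other previous state
      }
      miss[t] <- runif(1) < p
      gap_len <- if (miss[t]) gap_len + 1 else 0
    }
    # Redo the simulation if too much of the trajectory is missing
    if (mean(miss) <= maxprop) break
  }
  states[miss] <- NA
  states
}

set.seed(1)
traj <- rep(c("employment", "joblessness"), each = 6)
traj.miss <- simulate_missing(traj, states.high = "joblessness",
                              pstart.high = 0.5)
```

By construction, no gap in traj.miss is longer than maxgap and the share of missing values does not exceed maxprop.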

Value

A data frame with simulated missing data.

Author(s)

Kevin Emery

Examples

# Generate MCAR missing data on the mvad dataset
# from the TraMineR package
## Not run:
data(mvad, package = "TraMineR")
mvad.miss <- seqaddNA(mvad, var = 17:86)

# Generate missing data on mvad where joblessness is more likely to trigger
# a missing data gap
mvad.miss2 <- seqaddNA(mvad, var = 17:86, states.high = "joblessness")
## End(Not run)

Extract all the trajectories without missing values.

Description

Extract all the trajectories without missing values.

Usage

seqcomplete(data, var = NULL)

Arguments

data

Either a data frame containing sequences of a multinomial variable with missing data (coded as NA) or a state sequence object built with the TraMineR package.

var

The list of columns containing the trajectories. Default is NULL, i.e. all the columns.

Value

Returns either a data frame or a state sequence object, depending on the type of data that was provided to the function.
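
For a plain data frame, the same kind of filtering can be sketched with complete.cases from base R. The helper keep_complete is hypothetical and only handles data frames, whereas seqcomplete also accepts state sequence objects.

```r
# Hypothetical helper: keep the rows whose trajectory columns have no NA
keep_complete <- function(data, var = NULL) {
  if (is.null(var)) var <- seq_len(ncol(data))
  data[complete.cases(data[, var, drop = FALSE]), , drop = FALSE]
}

# Toy data (illustrative only): two trajectory columns and one covariate
toy <- data.frame(
  T1 = c("no", "yes", NA),
  T2 = c("no", NA, "yes"),
  Age = c(NA, 16, 17)
)

complete.traj <- keep_complete(toy, var = 1:2)  # only T1 and T2 are checked
complete.all <- keep_complete(toy)              # every column is checked
```

Restricting var to the trajectory columns keeps the first row even though its covariate is missing, while checking every column drops all three rows.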

Author(s)

Kevin Emery

Examples

# Game addiction dataset
data(gameadd)

# Extract the trajectories without any missing data
gameadd.complete <- seqcomplete(gameadd, var = 1:4)

seqimpute: Imputation of missing data in longitudinal categorical data

Description

The seqimpute package implements the MICT and MICT-timing methods. These are multiple imputation methods for longitudinal data. The core idea of the algorithms is to fill gaps of missing data, which is the typical form of missing data in a longitudinal setting, recursively from their edges. The prediction is based on either a multinomial or a random forest regression model. Covariates and time-dependent covariates can be included in the model.

The MICT-timing algorithm is an extension of the MICT algorithm designed to address a key limitation of the latter: its assumption that position in the trajectory is irrelevant.

Usage

seqimpute(
  data,
  var = NULL,
  np = 1,
  nf = 1,
  m = 5,
  timing = FALSE,
  frame.radius = 0,
  covariates = NULL,
  time.covariates = NULL,
  regr = "multinom",
  npt = 1,
  nfi = 1,
  ParExec = FALSE,
  ncores = NULL,
  SetRNGSeed = FALSE,
  end.impute = TRUE,
  verbose = TRUE,
  available = TRUE,
  pastDistrib = FALSE,
  futureDistrib = FALSE,
  ...
)

Arguments

data

Either a data frame containing sequences of a categorical variable, where missing data are coded as NA, or a state sequence object created using the seqdef function. If using a state sequence object, any "void" elements will also be treated as missing. See the end.impute argument if you wish to skip imputing values at the end of the sequences.

var

A vector specifying the columns of the dataset that contain the trajectories. Default is NULL, meaning all columns are used.

np

Number of prior states to include in the imputation model for internal gaps.

nf

Number of subsequent states to include in the imputation model for internal gaps.

m

Number of multiple imputations to perform (default: 5).

timing

Logical, specifies the imputation algorithm to use. If FALSE, the MICT algorithm is applied; if TRUE, the MICT-timing algorithm is used.

frame.radius

Integer, relevant only for the MICT-timing algorithm, specifying the radius of the timeframe.

covariates

List of the columns of the dataset containing covariates to be included in the imputation model.

time.covariates

List of the columns of the dataset with time-varying covariates to include in the imputation model.

regr

Character specifying the imputation method. Options include "multinom" for multinomial models and "rf" for random forest models.

npt

Number of prior observations in the imputation model for terminal gaps (i.e., gaps at the end of sequences).

nfi

Number of future observations in the imputation model for initial gaps (i.e., gaps at the beginning of sequences).

ParExec

Logical, indicating whether to run multiple imputations in parallel. Setting to TRUE can improve computation time depending on available cores.

ncores

Integer, specifying the number of cores to use for parallel computation. If unset, defaults to the maximum number of CPU cores minus one.

SetRNGSeed

Integer, to set the random seed for reproducibility in parallel computations. Note that calling set.seed() alone does not ensure reproducibility in parallel mode.

end.impute

Logical. If FALSE, missing data at the end of sequences will not be imputed.

verbose

Logical. If TRUE, displays progress and warnings in the console. Use FALSE for silent computation.

available

Logical, specifies whether to consider already imputed data in the predictive model. If TRUE, previous imputations are used; if FALSE, only original data are considered.

pastDistrib

Logical. If TRUE, includes the past distribution as a predictor in the imputation model.

futureDistrib

Logical. If TRUE, includes the future distribution as a predictor in the imputation model.

...

Named arguments that are passed down to the imputation functions.

Details

The imputation process is divided into several steps, depending on the type of gaps of missing data. The gaps are imputed in the following order:

Internal gap: there are at least np observations before the gap and at least nf observations after it
Initial gap: gaps situated at the very beginning of a trajectory
Terminal gap: gaps situated at the very end of a trajectory
Left-hand side specifically located gap (SLG): gaps that have at least nf observations after the gap, but fewer than np observations before it
Right-hand side SLG: gaps that have at least np observations before the gap, but fewer than nf observations after it
Both-hand side SLG: gaps that have fewer than np observations before the gap, and fewer than nf observations after it
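
The gap typology above can be illustrated with a short base R sketch that labels each gap of a single trajectory. The function classify_gaps is hypothetical and rests on a simplifying assumption: the number of observations before (after) a gap is taken to be the number of time points between the start (end) of the trajectory and the gap. The package's internal bookkeeping may count observations differently.

```r
# Hypothetical sketch: label each run of NA in one trajectory
classify_gaps <- function(x, np = 1, nf = 1) {
  runs <- rle(is.na(x))
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  n <- length(x)
  gap_idx <- which(runs$values)          # runs made of missing values
  types <- vapply(gap_idx, function(i) {
    before <- starts[i] - 1              # time points before the gap
    after <- n - ends[i]                 # time points after the gap
    if (before == 0) return("Initial gap")
    if (after == 0) return("Terminal gap")
    if (before >= np && after >= nf) return("Internal gap")
    if (before < np && after >= nf) return("Left-hand side SLG")
    if (before >= np && after < nf) return("Right-hand side SLG")
    "Both-hand side SLG"
  }, character(1))
  data.frame(start = starts[gap_idx], end = ends[gap_idx], type = types,
             stringsAsFactors = FALSE)
}

x <- c("a", NA, "b", "b", NA, "a", "a", "b", NA, "a")
gaps <- classify_gaps(x, np = 2, nf = 2)
```

With np = 2 and nf = 2, the gap at position 2 has only one time point before it, the gap at position 5 has enough on both sides, and the gap at position 9 has only one time point after it.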

The primary difference between the MICT and MICT-timing algorithms lies in their approach to selecting patterns from other sequences for fitting the multinomial model. While the MICT algorithm considers all similar patterns regardless of their temporal placement, MICT-timing restricts pattern selection to those that are temporally closest to the missing value. This refinement ensures that the imputation process adequately accounts for temporal dynamics, resulting in more accurate imputed values.

Value

An object of class seqimp, which is a list with the following elements:

data: A data.frame containing the original (incomplete) data.
imp: A list of m data.frame objects corresponding to the imputed datasets.
m: The number of imputations.
method: A character vector specifying whether MICT or MICT-timing was used.
np: Number of prior states included in the imputation model.
nf: Number of subsequent states included in the imputation model.
regr: A character vector specifying whether multinomial or random forest imputation models were applied.
call: The call that created the object.

Author(s)

Kevin Emery <kevin.emery@unige.ch>, Andre Berchtold, Anthony Guinchard, and Kamyar Taher

References

Halpin, B. (2012). Multiple imputation for life-course sequence data. Working Paper WP2012-01, Department of Sociology, University of Limerick. http://hdl.handle.net/10344/3639

Halpin, B. (2013). Imputing sequence data: Extensions to initial and terminal gaps, Stata's. Working Paper WP2013-01, Department of Sociology, University of Limerick. http://hdl.handle.net/10344/3620

Emery, K., Studer, M., & Berchtold, A. (2024). Comparison of imputation methods for univariate categorical longitudinal data. Quality & Quantity, 1-25. https://link.springer.com/article/10.1007/s11135-024-02028-z

Examples

# Default multiple imputation of the trajectories of game addiction with the
# MICT algorithm
## Not run:
set.seed(5)
imp1 <- seqimpute(data = gameadd, var = 1:4)

# Default multiple imputation with the MICT-timing algorithm
set.seed(3)
imp2 <- seqimpute(data = gameadd, var = 1:4, timing = TRUE)

# Inclusion in the MICT imputation process of the three background
# characteristics (Gender, Age and Track), and the time-varying covariate
# about gambling
set.seed(4)
imp3 <- seqimpute(
  data = gameadd, var = 1:4, covariates = 5:7,
  time.covariates = 8:11
)

# Parallel computation
imp4 <- seqimpute(
  data = gameadd, var = 1:4, covariates = 5:7,
  time.covariates = 8:11, ParExec = TRUE, ncores = 5, SetRNGSeed = 2
)
## End(Not run)

Plot all the patterns of missing data.

Description

This function plots all patterns of missing data within sequences, based on the seqIplot function.

Usage

seqmissIplot(data, var = NULL, with.complete = TRUE, void.miss = TRUE, ...)

Arguments

data

Either a data frame containing sequences of a categorical variable, where missing data are coded as NA, or a state sequence object created using the seqdef function.

var

A vector specifying the columns of the dataset that contain the trajectories. Default is NULL, meaning all columns are used.

with.complete

Logical. If TRUE, complete trajectories will be included in the plot.

void.miss

Logical. If TRUE, treats void elements as missing values. Applies only to state sequence objects created with seqdef. Note that the default behavior of seqdef is to treat missing data at the end of sequences as void elements.

...

Additional parameters passed to the seqIplot function.

Details

This function uses seqIplot to visualize all patterns of missing data within sequences. For further customization options, refer to the seqIplot documentation.

Author(s)

Kevin Emery

Examples

# Plot all the patterns of missing data
seqmissIplot(gameadd, var = 1:4)

# Plot all the patterns of missing data discarding
# complete trajectories
seqmissIplot(gameadd, var = 1:4, with.complete = FALSE)

Plot the most common patterns of missing data.

Description

This function plots the most frequent patterns of missing data, based on the seqfplot function.

Usage

seqmissfplot(data, var = NULL, with.complete = TRUE, void.miss = TRUE, ...)

Arguments

data

Either a data frame containing sequences of a categorical variable, where missing data are coded as NA, or a state sequence object created using the seqdef function.

var

A vector specifying the columns of the dataset that contain the trajectories. Default is NULL, meaning all columns are used.

with.complete

Logical. If TRUE, complete trajectories will be included in the plot.

void.miss

Logical. If TRUE, treats void elements as missing values. Applies only to state sequence objects created with seqdef. Note that the default behavior of seqdef is to treat missing data at the end of sequences as void elements.

...

Additional parameters passed to the seqfplot function.

Details

This plot function is based on the seqfplot function, allowing users to visualize patterns of missing data within sequences. For details on additional customizable arguments, see the seqfplot documentation.

By default, this function plots the 10 most frequent patterns. The number of patterns to be plotted can be adjusted using the idxs argument in seqfplot.

Author(s)

Kevin Emery

Examples

# Plot the 10 most common patterns of missing data
seqmissfplot(gameadd, var = 1:4)

# Plot the 10 most common patterns of missing data discarding
# complete trajectories
seqmissfplot(gameadd, var = 1:4, with.complete = FALSE)

# Plot only the 5 most common patterns of missing data discarding
# complete trajectories
seqmissfplot(gameadd, var = 1:4, with.complete = FALSE, idxs = 1:5)

Identification and visualization of states that best characterize sequences with missing data

Description

This function identifies and visualizes states that best characterize sequences with missing data at each position (time point), comparing them to sequences without missing data at each position (time point). It is based on the seqimplic function. For more information on the methodology, see the seqimplic documentation.

Usage

seqmissimplic(data, var = NULL, void.miss = TRUE, ...)

Arguments

data

Either a data frame containing sequences of a categorical variable, where missing data are coded as NA, or a state sequence object created using the seqdef function.

var

A vector specifying the columns of the dataset that contain the trajectories. Default is NULL, meaning all columns are used.

void.miss

Logical. If TRUE, treats void elements as missing values. This argument applies only to state sequence objects created with seqdef. Note that the default behavior of seqdef is to treat missing data at the end of sequences as void elements.

...

Parameters to be passed to the seqimplic function.

Value

Returns a seqimplic object that can be plotted and printed.

Author(s)

Kevin Emery

Examples

# For illustration purposes, we simulate missing data on the mvad dataset,
# available in the TraMineR package. The "joblessness" state has a
# higher probability of triggering a missing gap
## Not run:
data(mvad, package = "TraMineR")
mvad.miss <- seqaddNA(mvad, var = 17:86, states.high = "joblessness")

# The states that best characterize sequences with missing data
implic <- seqmissimplic(mvad.miss, var = 17:86)

# Visualization of the results
plot(implic)
## End(Not run)

Extract all the trajectories with at least one missing value

Description

Extract all the trajectories with at least one missing value

Usage

seqwithmiss(data, var = NULL)

Arguments

data

Either a data frame containing sequences of a multinomial variable with missing data (coded as NA) or a state sequence object built with the TraMineR package.

var

The list of columns containing the trajectories. Default is NULL, i.e. all the columns.

Value

Returns either a data frame or a state sequence object, depending on the type of data that was provided to the function.

Author(s)

Kevin Emery

Examples

# Game addiction dataset
data(gameadd)

# Extract the trajectories with at least one missing value
gameadd.withmiss <- seqwithmiss(gameadd, var = 1:4)

Summary of a `seqimp` object

Description

Summary of a seqimp object

Usage

## S3 method for class 'seqimp'
summary(object, ...)

Arguments

object

Object of class seqimp

...

additional arguments passed to other functions

Author(s)

Kevin Emery
