Type:

Package

Title:

Sequence Clustering with Discrete-Output HMMs

Version:

0.0.3

Date:

2022-12-21

Author:

Gabriel Budel [aut, cre], Flavius Frasincar [aut]

Maintainer:

Gabriel Budel <gabysp_budel@hotmail.com>

Description:

Provides an implementation of a mixture of hidden Markov models (HMMs) for discrete sequence data in the Discrete Bayesian HMM Clustering (DBHC) algorithm. The DBHC algorithm is an HMM Clustering algorithm that finds a mixture of discrete-output HMMs while using heuristics based on Bayesian Information Criterion (BIC) to search for the optimal number of HMM states and the optimal number of clusters.

License:

GPL (≥ 3)

Encoding:

UTF-8

URL:

https://github.com/gabybudel/DBHC

BugReports:

https://github.com/gabybudel/DBHC/issues

Imports:

seqHMM (≥ 1.0.8), TraMineR (≥ 2.0-7), reshape2 (≥ 1.2.1), ggplot2 (≥ 2.2.1), methods (≥ 4.2.2)

NeedsCompilation:

Repository:

CRAN

RoxygenNote:

7.2.3

Suggests:

testthat (≥ 3.0.0)

Config/testthat/edition:

Packaged:

2022-12-22 07:49:05 UTC; gabys

Date/Publication:

2022-12-22 13:10:15 UTC

Cluster Assignment

Description

Assign sequences to the cluster models that give the highest sequence-to-HMM likelihood. Used in hmm.clust.

Usage

assign.clusters(partition, memberships, sequences, smoothing = 1e-04)

Arguments

partition

A list object with the partition, a mixture of HMMs. Each element in the list is an hmm object (see build_hmm).

memberships

A matrix with cluster memberships for each sequence.

sequences

An stslist object (see seqdef) of sequences with discrete observations.

smoothing

Smoothing parameter for absolute discounting in smooth.probabilities.

Value

The updated matrix with cluster memberships for each sequence.
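The assignment rule described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the log-likelihood matrix `loglik` and the helper `assign_by_loglik` are hypothetical, and in practice each entry would come from evaluating one sequence under one cluster model.

```r
# Hypothetical log-likelihoods: rows = sequences, columns = cluster models.
loglik <- matrix(c(-12.1, -15.3,
                   -20.4, -18.9,
                   -Inf,   -9.7),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("s1", "s2", "s3"), c("hmm1", "hmm2")))

# Hard assignment: each sequence goes to the model with the highest
# log-likelihood (hypothetical helper, not part of DBHC).
assign_by_loglik <- function(loglik) {
  apply(loglik, 1, which.max)
}

cluster <- assign_by_loglik(loglik)
print(cluster)
```

A sequence with log-likelihood minus infinity under one model (here `s3` under `hmm1`) can never be assigned to that model, which is one reason smoothing matters before assignment.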

HMM BIC

Description

Compute the BIC of a single HMM given a threshold epsilon for counting parameters. Auxiliary function used in size.search.

Usage

cluster.bic(hmm, eps = 0.001)

Arguments

hmm

An hmm object (see build_hmm).

eps

A threshold epsilon for counting parameters.

Value

The BIC of hmm.
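For orientation, the textbook definition of BIC is minus twice the log likelihood plus the number of parameters times the log of the sample size. The sketch below assumes that standard form. The helper `bic_sketch` is hypothetical, and this page does not state whether the package takes the sample size to be the number of sequences or the number of observations.

```r
# Standard BIC: -2 * logL + k * log(n). Lower values indicate a better
# trade-off between fit and complexity. Hypothetical helper, not part of DBHC.
bic_sketch <- function(log_lik, n_params, n_obs) {
  -2 * log_lik + n_params * log(n_obs)
}

# Two hypothetical models fitted to the same 100 data points:
bic_small <- bic_sketch(log_lik = -250, n_params = 8,  n_obs = 100)
bic_large <- bic_sketch(log_lik = -245, n_params = 15, n_obs = 100)

# The larger model fits slightly better but pays a higher complexity penalty.
print(c(small = bic_small, large = bic_large))
```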

Count HMM Parameters

Description

Count the number of parameters in an HMM that are larger than a small number epsilon. Auxiliary function used in partition.bic and cluster.bic.

Usage

count.parameters(hmm, eps = 0.001)

Arguments

hmm

An hmm object (see build_hmm).

eps

A threshold epsilon for counting parameters.

Value

The number of parameters larger than eps.
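The thresholded count can be illustrated on plain probability matrices. This is a sketch under stated assumptions: the helper `count_above_eps` is hypothetical, it takes the three parameter sets directly instead of an hmm object, and it does not adjust for the sum-to-one constraints, which the package may or may not do.

```r
# Count entries of the initial, transition and emission probabilities that
# exceed eps (hypothetical helper, not part of DBHC).
count_above_eps <- function(initial, transition, emission, eps = 0.001) {
  sum(initial > eps) + sum(transition > eps) + sum(emission > eps)
}

initial    <- c(0.5, 0.5)
transition <- matrix(c(0.9995, 0.0005,
                       0.2,    0.8), nrow = 2, byrow = TRUE)
emission   <- matrix(c(1,   0,
                       0.3, 0.7), nrow = 2, byrow = TRUE)

# 0.0005 and 0 fall below eps, so they are not counted.
n_params <- count_above_eps(initial, transition, emission)
print(n_params)
```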

Heatmap Emission Probabilities

Description

Plots a heatmap of an HMM's emission probabilities.

Usage

emission.heatmap(emission, base_size = 10)

Arguments

emission

A matrix with emission probabilities (see also build_hmm).

base_size

Numerical, a size parameter for the plots made using ggplot2 (see theme), default = 10.

DBHC Algorithm

Description

Implementation of the DBHC algorithm, an HMM clustering algorithm that finds a mixture of discrete-output HMMs. The algorithm uses heuristics based on BIC to search for the optimal number of hidden states in each HMM and the optimal number of clusters.

Usage

hmm.clust(
  sequences,
  id = NULL,
  smoothing = 1e-04,
  eps = 0.001,
  init.size = 2,
  alphabet = NULL,
  K.max = NULL,
  log_space = FALSE,
  print = FALSE,
  seed.size = 3
)

Arguments

sequences

An stslist object (see seqdef) of sequences with discrete observations or a data.frame.

id

A vector with ids that identify the sequences in sequences.

smoothing

Smoothing parameter for absolute discounting in smooth.probabilities.

eps

A threshold epsilon for counting parameters in count.parameters.

init.size

The number of HMM states in an initial HMM.

alphabet

The alphabet of output labels; if not provided, the alphabet is taken from the stslist object (see seqdef).

K.max

Maximum number of clusters; if not provided, the algorithm searches for the optimal number itself.

log_space

Logical, parameter provided to fit_model for whether to use optimization in log space or not.

print

Logical, whether to print intermediate steps or not.

seed.size

Seed size, the number of sequences to be selected for a seed.

Value

A list with components:

sequences: An stslist object of sequences with discrete observations.
id: A vector with ids that identify the sequences in sequences.
cluster: A vector with found cluster memberships for the sequences.
partition: A list object with the partition, a mixture of HMMs. Each element in the list is an hmm object.
memberships: A matrix with cluster memberships for each sequence.
n.clusters: Numerical, the found number of clusters.
sizes: A vector with the number of HMM states for each cluster model.
bic: A vector with the BICs for each cluster model.

Examples

## Simulated data
library(seqHMM)
output.labels <- c("H", "T")

# HMM 1
states.1 <- c("A", "B", "C")
transitions.1 <- matrix(c(0.8,0.1,0.1,0.1,0.8,0.1,0.1,0.1,0.8), nrow = 3)
rownames(transitions.1) <- states.1
colnames(transitions.1) <- states.1
emissions.1 <- matrix(c(0.5,0.75,0.25,0.5,0.25,0.75), nrow = 3)
rownames(emissions.1) <- states.1
colnames(emissions.1) <- output.labels
initials.1 <- c(1/3,1/3,1/3)

# HMM 2
states.2 <- c("A", "B")
transitions.2 <- matrix(c(0.75,0.25,0.25,0.75), nrow = 2)
rownames(transitions.2) <- states.2
colnames(transitions.2) <- states.2
emissions.2 <- matrix(c(0.8,0.6,0.2,0.4), nrow = 2)
rownames(emissions.2) <- states.2
colnames(emissions.2) <- output.labels
initials.2 <- c(0.5,0.5)

# Simulate
hmm.sim.1 <- simulate_hmm(n_sequences = 100,
                          initial_probs = initials.1,
                          transition_probs = transitions.1,
                          emission_probs = emissions.1,
                          sequence_length = 25)
hmm.sim.2 <- simulate_hmm(n_sequences = 100,
                          initial_probs = initials.2,
                          transition_probs = transitions.2,
                          emission_probs = emissions.2,
                          sequence_length = 25)
sequences <- rbind(hmm.sim.1$observations, hmm.sim.2$observations)
n <- nrow(sequences)

# Clustering algorithm
id <- paste0("K-", 1:n)
rownames(sequences) <- id
sequences <- sequences[sample(1:n, n),]
res <- hmm.clust(sequences, id = rownames(sequences))

#############################################################################

## Swiss Household Data
data("biofam", package = "TraMineR")

# Clustering algorithm
new.alphabet <- c("P", "L", "M", "LM", "C", "LC", "LMC", "D")
sequences <- seqdef(biofam[,10:25], alphabet = 0:7, states = new.alphabet)

## Not run: 
res <- hmm.clust(sequences)

# Heatmaps
cluster <- 1  # display heatmaps for cluster 1
transition.heatmap(res$partition[[cluster]]$transition_probs,
                   res$partition[[cluster]]$initial_probs)
emission.heatmap(res$partition[[cluster]]$emission_probs)
## End(Not run)

## A smaller example, which takes less time to run
subset <- sequences[sample(1:nrow(sequences), 20, replace = FALSE),]

# Clustering algorithm, limiting number of clusters to 2
res <- hmm.clust(subset, K.max = 2)

# Number of clusters
print(res$n.clusters)

# Table of cluster memberships
table(res$memberships[,"cluster"])

# BIC for each number of clusters
print(res$bic)

# Heatmaps
cluster <- 1  # display heatmaps for cluster 1
transition.heatmap(res$partition[[cluster]]$transition_probs,
                   res$partition[[cluster]]$initial_probs)
emission.heatmap(res$partition[[cluster]]$emission_probs)

Get HMM Log Likelihood

Description

Get the log likelihood of an HMM object and check if it is feasible (i.e., contains no illegal emissions). Auxiliary function used in partition.bic.

Usage

model.ll(hmm)

Arguments

hmm

An hmm object (see build_hmm).

Value

The log likelihood of the hmm object. A warning is printed if the model is infeasible (i.e., if the log likelihood is evaluated for a sequence that contains emissions that are assigned probability 0 in the hmm object).

Partition BIC

Description

Compute the BIC of a partition given a threshold epsilon for counting parameters. Auxiliary function used in hmm.clust.

Usage

partition.bic(partition, eps = 0.001)

Arguments

partition

A list object with the partition of HMMs, a mixture of HMMs.

eps

A threshold epsilon for counting parameters in count.parameters.

Value

The BIC of the partition.

Seed Selection Procedure

Description

Seed selection procedure of the DBHC algorithm; it also invokes the size search algorithm (size.search) for a seed. Used in hmm.clust.

Usage

select.seeds(
  sequences,
  log_space = FALSE,
  K,
  seed.size = 3,
  init.size = 2,
  print = FALSE,
  smoothing = 1e-04
)

Arguments

sequences

An stslist object (see seqdef) of sequences with discrete observations.

log_space

Logical, parameter provided to fit_model for whether to use optimization in log space or not.

K

The number of seeds to select, equal to the number of clusters in a partition.

seed.size

Seed size, the number of sequences to be selected for a seed.

init.size

The number of HMM states in an initial HMM.

print

Logical, whether to print intermediate steps or not.

smoothing

Smoothing parameter for absolute discounting in smooth.probabilities.

Value

A partition as a list object with HMMs for the selected seeds.

Sequence-to-HMM Likelihood

Description

Compute the sequence-to-HMM likelihood of an HMM evaluated for a single sequence and check if the sequence contains emissions that are not possible according to the HMM. Auxiliary function used in select.seeds and assign.clusters.

Usage

seq2hmm.ll(hmm)

Arguments

hmm

An hmm object (see build_hmm) containing a single sequence.

Value

The log likelihood of the sequence contained in hmm; the value is set to minus infinity if the sequence contains illegal emissions.
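To make the quantity concrete, the log likelihood of one observed sequence under a discrete-output HMM can be computed with the scaled forward algorithm. The sketch below is a generic base R version, not the package's code: the helper `seq_loglik` is hypothetical and takes the parameters as plain vectors and matrices instead of an hmm object. It returns minus infinity when the sequence is impossible under the model, mirroring the behaviour described above.

```r
# Scaled forward algorithm (hypothetical helper, not part of DBHC).
# obs:        integer vector of symbol indices (columns of `emission`)
# initial:    vector of initial state probabilities
# transition: state-by-state matrix, rows sum to one
# emission:   state-by-symbol matrix, rows sum to one
seq_loglik <- function(obs, initial, transition, emission) {
  alpha <- initial * emission[, obs[1]]
  log_lik <- 0
  for (t in seq_along(obs)) {
    if (t > 1) {
      alpha <- as.vector(alpha %*% transition) * emission[, obs[t]]
    }
    scale <- sum(alpha)
    if (scale == 0) return(-Inf)  # sequence has probability 0 under the model
    log_lik <- log_lik + log(scale)
    alpha <- alpha / scale        # rescale to avoid numerical underflow
  }
  log_lik
}

initial    <- c(0.5, 0.5)
transition <- matrix(c(0.75, 0.25, 0.25, 0.75), nrow = 2)
emission   <- matrix(c(0.8, 0.6, 0.2, 0.4), nrow = 2)

ll <- seq_loglik(c(1L, 2L), initial, transition, emission)

# An emission matrix that never produces symbol 2 makes the sequence illegal.
emission_zero <- matrix(c(1, 1, 0, 0), nrow = 2)
ll_illegal <- seq_loglik(c(1L, 2L), initial, transition, emission_zero)
```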

Size Search Algorithm

Description

The size search algorithm finds the optimal number of HMM states for a set of sequences and returns both the optimal hmm object and the corresponding number of hidden states. Used in select.seeds.

Usage

size.search(sequences, log_space = FALSE, print = FALSE)

Arguments

sequences

An stslist object (see seqdef) of sequences with discrete observations.

log_space

Logical, parameter provided to fit_model for whether to use optimization in log space or not.

print

Logical, whether to print intermediate steps or not.

Value

A list with the optimal number of HMM states and the optimal hmm object.
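A BIC-driven search over model sizes might look like the following sketch. It is an assumption-laden illustration, not the package's procedure: the helper `greedy_size_search`, its arguments `bic_for_size`, `init_size` and `max_size`, and the stopping rule (stop as soon as BIC no longer decreases) are all hypothetical. A toy function stands in for fitting an HMM and computing its BIC.

```r
# Greedy search: grow the number of states while BIC keeps improving
# (hypothetical helper, not part of DBHC).
greedy_size_search <- function(bic_for_size, init_size = 2, max_size = 10) {
  best_size <- init_size
  best_bic  <- bic_for_size(init_size)
  size <- init_size + 1
  while (size <= max_size) {
    candidate <- bic_for_size(size)
    if (candidate >= best_bic) break  # no improvement: keep the previous size
    best_size <- size
    best_bic  <- candidate
    size <- size + 1
  }
  list(size = best_size, bic = best_bic)
}

# Toy stand-in for "fit an HMM with `size` states and return its BIC".
toy_bic <- function(size) (size - 4)^2 + 100

res_size <- greedy_size_search(toy_bic)
print(res_size$size)
```

A greedy rule like this can stop at a local optimum; a real fitting routine would also need to handle the randomness of model estimation.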

Smooth HMM Parameters

Description

Smooth the parameters of an HMM using absolute discounting given a threshold epsilon. Auxiliary function used in select.seeds, assign.clusters, and hmm.clust.

Usage

smooth.hmm(hmm, smoothing = 1e-04)

Arguments

hmm

A raw hmm object (see build_hmm).

smoothing

Smoothing parameter for absolute discounting in smooth.probabilities.

Value

An hmm object with smoothed probabilities.

Smooth Probabilities

Description

Smooth a vector of probabilities using absolute discounting. Auxiliary function used in smooth.hmm.

Usage

smooth.probabilities(probs, smoothing = 1e-04)

Arguments

probs

A vector of raw probabilities.

smoothing

Smoothing parameter for absolute discounting.

Value

A vector of smoothed probabilities.
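Absolute discounting moves a small fixed amount of probability mass from the non-zero entries to the zero entries, so that no outcome is strictly impossible. The sketch below shows one common variant and is not necessarily the rule the package uses: the helper `absolute_discount` is hypothetical, each zero entry receives `smoothing`, and the total added mass is subtracted in equal shares from the non-zero entries.

```r
# One variant of absolute discounting (hypothetical helper, not part of DBHC).
# Assumes every non-zero entry is larger than its share of the discount;
# otherwise the result could contain negative values.
absolute_discount <- function(probs, smoothing = 1e-04) {
  zero   <- probs == 0
  n_zero <- sum(zero)
  if (n_zero == 0 || n_zero == length(probs)) return(probs)
  out <- probs
  out[zero]  <- smoothing
  out[!zero] <- probs[!zero] - smoothing * n_zero / sum(!zero)
  out
}

smoothed <- absolute_discount(c(0.7, 0.3, 0, 0))
print(smoothed)
```

The smoothed vector still sums to one, and a vector without zeros is returned unchanged.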

Heatmap Transition Probabilities

Description

Plots a heatmap of an HMM's initial and transition probabilities.

Usage

transition.heatmap(transition, initial = NULL, base_size = 10)

Arguments

transition

A matrix with transition probabilities (see also build_hmm).

initial

An (optional) vector of initial probabilities.

base_size

Numerical, a size parameter for the plots made using ggplot2 (see theme), default = 10.
