Similarity Calculations

Source: R/lma_simets.R

lma_simets.Rd

Enter a numerical matrix, set of vectors, or set of matrices to calculate similarity per vector.

Usage

lma_simets(a, b = NULL, metric = NULL, group = NULL, lag = 0, agg = TRUE,
  agg.mean = TRUE, pairwise = TRUE, symmetrical = FALSE, mean = FALSE,
  return.list = FALSE)

Arguments

a

A vector or matrix. If a vector, b must also be provided. If a matrix and b is missing, the rows of a will be compared with one another. If a matrix and b is not missing, each row will be compared with b or each row of b.

b

A vector or matrix to be compared with a or rows of a.

metric

A character or vector of characters at least partially matching one of the available metric names (or 'all' to explicitly include all metrics), or a number or vector of numbers indicating the metric by index:

jaccard: sum(a & b) / sum(a | b)
euclidean: 1 / (1 + sqrt(sum((a - b) ^ 2)))
canberra: mean(1 - abs(a - b) / (a + b))
cosine: sum(a * b) / sqrt(sum(a ^ 2) * sum(b ^ 2))
pearson: (mean(a * b) - (mean(a) * mean(b))) / sqrt(mean(a ^ 2) - mean(a) ^ 2) / sqrt(mean(b ^ 2) - mean(b) ^ 2)
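
The formulas above can be checked by hand in base R. The following is a minimal sketch, not the package's implementation: sim_sketch is a hypothetical helper, and treating Canberra terms where a + b is 0 as a perfect match is an assumption made here so that 0 / 0 does not yield NaN. The two vectors are rows 1 and 2 of the document-term matrix shown in the Examples section.

```r
# Base-R sketch of the documented formulas for two numeric vectors.
# `sim_sketch` is a hypothetical helper, not part of lingmatch.
sim_sketch <- function(a, b) {
  # Assumption: terms where a + b == 0 (0 / 0) count as a perfect match (1).
  canberra_terms <- ifelse(a + b == 0, 1, 1 - abs(a - b) / (a + b))
  c(
    jaccard = sum(a & b) / sum(a | b),
    euclidean = 1 / (1 + sqrt(sum((a - b)^2))),
    canberra = mean(canberra_terms),
    cosine = sum(a * b) / sqrt(sum(a^2) * sum(b^2)),
    pearson = (mean(a * b) - mean(a) * mean(b)) /
      sqrt(mean(a^2) - mean(a)^2) / sqrt(mean(b^2) - mean(b)^2)
  )
}

# Rows 1 and 2 of the example document-term matrix
a <- c(1, 0, 0, 0, 1, 1, 1)
b <- c(1, 0, 1, 1, 0, 1, 1)
round(sim_sketch(a, b), 4)
#>   jaccard euclidean  canberra    cosine   pearson
#>    0.5000    0.3660    0.5714    0.6708    0.0913
```

These values agree with the row 2, column 1 entries of the matrices printed in the Examples section.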

group

If b is missing and a has multiple rows, this will be used to make comparisons between rows of a, as modified by agg and agg.mean.

lag

Amount to adjust the b index; either rows if b has multiple rows (e.g., for lag = 1, a[1, ] is compared with b[2, ]), or values otherwise (e.g., for lag = 1, a[1] is compared with b[2]). If b is not supplied, b is a copy of a, resulting in lagged self-comparisons or autocorrelations.
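
The index pairing implied by lag can be illustrated in base R. This is only a sketch of the pairing for lagged self-comparison (b taken as a copy of a). The cosine helper and the toy matrix are hypothetical, and dropping rows that have no lagged partner is an assumption; how lma_simets itself handles those rows is not stated here.

```r
# Hypothetical helper (not part of lingmatch): cosine similarity of two vectors.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Toy matrix; with b missing, b is treated as a copy of a.
a <- rbind(c(1, 0, 1), c(0, 1, 1), c(1, 1, 0))
b <- a
lag <- 1

# Pair a[i, ] with b[i + lag, ]; rows without a partner are dropped here
# (an assumption of this sketch).
idx <- seq_len(nrow(a) - lag)
lagged <- vapply(idx, function(i) cosine(a[i, ], b[i + lag, ]), numeric(1))
lagged
#> [1] 0.5 0.5
```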

agg

Logical: if FALSE, only the boundary rows between groups will be compared; see example.

agg.mean

Logical: if FALSE, aggregated rows are summed instead of averaged.
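
The effect of group, agg, and agg.mean can be approximated in base R for the simple two-group case. This is a sketch under assumptions: euclid is a hypothetical helper applying the documented euclidean formula, the matrix copies the document-term matrix from the Examples section, and each group is selected by label, which matches aggregating consecutive rows only because each speaker's rows are adjacent here.

```r
# Hypothetical helper: the documented euclidean similarity for two vectors.
euclid <- function(a, b) 1 / (1 + sqrt(sum((a - b)^2)))

# Dense copy of the example document-term matrix
dtm <- rbind(
  c(1, 0, 0, 0, 1, 1, 1),
  c(1, 0, 1, 1, 0, 1, 1),
  c(0, 1, 1, 0, 0, 1, 1),
  c(0, 1, 1, 1, 0, 1, 1)
)
speaker <- c("A", "A", "B", "B")
rows_a <- dtm[speaker == "A", ]
rows_b <- dtm[speaker == "B", ]

# agg = TRUE, agg.mean = TRUE: rows within each group are averaged first
agg_mean <- euclid(colMeans(rows_a), colMeans(rows_b))
# agg = TRUE, agg.mean = FALSE: rows within each group are summed instead
agg_sum <- euclid(colSums(rows_a), colSums(rows_b))
# agg = FALSE: only the rows at the group boundary (rows 2 and 3) are compared
boundary <- euclid(dtm[2, ], dtm[3, ])

round(c(agg_mean = agg_mean, agg_sum = agg_sum, boundary = boundary), 4)
#> agg_mean  agg_sum boundary
#>   0.3874   0.2403   0.3660
```

The averaged and boundary values agree with the euclidean column of the grouped output in the Examples section. Note that scale-invariant metrics such as cosine give the same result for sums and means when the groups are aggregated this way.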

pairwise

Logical: if FALSE and a and b are matrices with the same number of rows, only paired rows are compared. Otherwise (and if only a is supplied), all pairwise comparisons are made.

symmetrical

Logical: if TRUE and pairwise comparisons between a rows were made, the results in the lower triangle are copied to the upper triangle.
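
The difference between paired and pairwise comparisons, and what copying the lower triangle means, can be sketched with dense base-R matrices. This is an illustration only: lma_simets returns sparse matrices, and the variable names below are hypothetical. Cosine similarity is used, and the matrix copies the document-term matrix from the Examples section.

```r
# Dense copy of the example document-term matrix
dtm <- rbind(
  c(1, 0, 0, 0, 1, 1, 1),
  c(1, 0, 1, 1, 0, 1, 1),
  c(0, 1, 1, 0, 0, 1, 1),
  c(0, 1, 1, 1, 0, 1, 1)
)

# All pairwise cosine similarities between rows
norms <- sqrt(rowSums(dtm^2))
full <- tcrossprod(dtm) / outer(norms, norms)

# Keep only the lower triangle and diagonal (as with symmetrical = FALSE)
lower <- full
lower[upper.tri(lower)] <- 0

# Copy the lower triangle to the upper triangle (as with symmetrical = TRUE)
sym <- lower
sym[upper.tri(sym)] <- t(sym)[upper.tri(sym)]

# Paired rows only (as with pairwise = FALSE): row i of a with row i of b
a <- dtm[1:2, ]
b <- dtm[3:4, ]
paired <- rowSums(a * b) / (sqrt(rowSums(a^2)) * sqrt(rowSums(b^2)))
round(paired, 4)
#> [1] 0.5 0.8
```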

mean

Logical: if TRUE, a single mean for each metric is returned per row of a.

return.list

Logical: if TRUE, a list-like object will always be returned, with an entry for each metric, even when only one metric is requested.

Value

Output varies based on the dimensions of a and b:

Out: A vector with a value per metric.
In: Only when a and b are both vectors.
Out: A vector with a value per row.
In: Any time a single value is expected per row: a or b is a vector, a and b are matrices with the same number of rows and pairwise = FALSE, a group is specified, or mean = TRUE, and only one metric is requested.
Out: A data.frame with a column per metric.
In: When multiple metrics are requested in the previous case.
Out: A sparse matrix with a metric attribute with the metric name.
In: Pairwise comparisons within an a matrix or between an a and b matrix, when only 1 metric is requested.
Out: A list with a sparse matrix per metric.
In: When multiple metrics are requested in the previous case.

Details

Use setThreadOptions to change parallelization options; e.g., run RcppParallel::setThreadOptions(4) before a call to lma_simets to set the number of CPU threads to 4.

Examples

text <- c(
  "words of speaker A", "more words from speaker A",
  "words from speaker B", "more words from speaker B"
)
(dtm <- lma_dtm(text))
#> 4 x 7 sparse Matrix of class "dgCMatrix"
#>      a b from more of speaker words
#> [1,] 1 .    .    .  1       1     1
#> [2,] 1 .    1    1  .       1     1
#> [3,] . 1    1    .  .       1     1
#> [4,] . 1    1    1  .       1     1

# compare each entry
lma_simets(dtm)
#> $jaccard
#> 4 x 4 sparse Matrix of class "dtCMatrix" (unitriangular)
#>
#> [1,] I         .         .   .
#> [2,] 0.5000000 I         .   .
#> [3,] 0.3333333 0.5000000 I   .
#> [4,] 0.2857143 0.6666667 0.8 I
#>
#> $euclidean
#> 4 x 4 sparse Matrix of class "dtCMatrix" (unitriangular)
#>
#> [1,] I         .         .   .
#> [2,] 0.3660254 I         .   .
#> [3,] 0.3333333 0.3660254 I   .
#> [4,] 0.3090170 0.4142136 0.5 I
#>
#> $canberra
#> 4 x 4 sparse Matrix of class "dtCMatrix" (unitriangular)
#>
#> [1,] I         .         .         .
#> [2,] 0.5714286 I         .         .
#> [3,] 0.4285714 0.5714286 I         .
#> [4,] 0.2857143 0.7142857 0.8571429 I
#>
#> $cosine
#> 4 x 4 sparse Matrix of class "dtCMatrix" (unitriangular)
#>
#> [1,] I         .         .         .
#> [2,] 0.6708204 I         .         .
#> [3,] 0.5000000 0.6708204 I         .
#> [4,] 0.4472136 0.8000000 0.8944272 I
#>
#> $pearson
#> 4 x 4 sparse Matrix of class "dtCMatrix" (unitriangular)
#>
#> [1,]  I          .          .         .
#> [2,]  0.09128709 I          .         .
#> [3,] -0.16666667 0.09128709 I         .
#> [4,] -0.54772256 0.30000000 0.7302967 I
#>
#> attr(,"time")
#> simets
#>      0

# compare each entry with the mean of all entries
lma_simets(dtm, colMeans(dtm))
#>     jaccard euclidean  canberra    cosine   pearson
#> 1 0.5714286 0.4220645 0.4380952 0.7484552 0.1964186
#> 2 0.7142857 0.5166852 0.5986395 0.9128709 0.6454972
#> 3 0.5714286 0.5166852 0.5034014 0.8845380 0.7463905
#> 4 0.7142857 0.5166852 0.5986395 0.9128709 0.6454972

# compare by group (corresponding to speakers and turns in this case)
speaker <- c("A", "A", "B", "B")

## by default, consecutive rows from the same group are averaged:
lma_simets(dtm, group = speaker)
#>                 jaccard euclidean  canberra    cosine    pearson
#> 1, 2 <-> 3, 4 0.5714286 0.3874259 0.5238095 0.6888467 -0.1324532

## with agg = FALSE, only the rows at the boundary between
## groups (rows 2 and 3 in this case) are used:
lma_simets(dtm, group = speaker, agg = FALSE)
#>         jaccard euclidean  canberra    cosine    pearson
#> 2 <-> 3     0.5 0.3660254 0.5714286 0.6708204 0.09128709
