| Type: | Package |
| Title: | Protocol Inspection and State Machine Analysis |
| Version: | 0.2-7 |
| Date: | 2018-05-26 |
| Depends: | R (≥ 2.10), Matrix, gplots, methods, ggplot2 |
| Suggests: | tm (≥ 0.6) |
| Author: | Tammo Krueger, Nicole Kraemer |
| Maintainer: | Tammo Krueger <tammokrueger@googlemail.com> |
| Description: | Loads and processes huge text corpora processed with the sally toolbox (http://www.mlsec.org/sally/). sally acts as a very fast preprocessor which splits the text files into tokens or n-grams. These output files can then be read with the PRISMA package which applies testing-based token selection and has some replicate-aware, highly tuned non-negative matrix factorization and principal component analysis implementation which allows the processing of very big data sets even on desktop machines. |
| License: | GPL-2 or GPL-3 [expanded from: GPL (≥ 2.0)] |
| NeedsCompilation: | no |
| Packaged: | 2018-05-26 15:51:57 UTC; tammok |
| Repository: | CRAN |
| Date/Publication: | 2018-05-26 22:01:47 UTC |
Protocol Inspection and State Machine Analysis
Description
Loads and processes huge text corpora processed with the sally toolbox (<http://www.mlsec.org/sally/>). sally acts as a very fast preprocessor which splits the text files into tokens or n-grams. These output files can then be read with the PRISMA package which applies testing-based token selection and has some replicate-aware, highly tuned non-negative matrix factorization and principal component analysis implementation which allows the processing of very big data sets even on desktop machines.
Details
| Package: | PRISMA |
| Type: | Package |
| Title: | Protocol Inspection and State Machine Analysis |
| Version: | 0.2-7 |
| Date: | 2018-05-26 |
| Depends: | Matrix, gplots, methods, ggplot2 |
| Suggests: | tm (>= 0.6) |
| Author: | Tammo Krueger, Nicole Kraemer |
| Maintainer: | Tammo Krueger <tammokrueger@googlemail.com> |
| Description: | Loads and processes huge text corpora processed with the sally toolbox (<http://www.mlsec.org/sally/>). sally acts as a very fast preprocessor which splits the text files into tokens or n-grams. These output files can then be read with the PRISMA package which applies testing-based token selection and has some replicate-aware, highly tuned non-negative matrix factorization and principal component analysis implementation which allows the processing of very big data sets even on desktop machines. |
| License: | GPL (>=2.0) |
Index of help topics:
PRISMA-package | Protocol Inspection and State Machine Analysis |
asap | The ASAP Data Set |
corpusToPrisma | Convert tm corpus to PRISMA |
estimateDimension | Estimate Inner Dimension |
getDuplicateData | Restores Data with Duplicates |
getMatrixFactorizationLabels | Convert Coordinates of Matrix Factorization to Labels |
loadPrismaData | Load PRISMA Data Files |
plot.prisma | Generics For PRISMA Objects |
plot.prismaDimension | Generics For PRISMA Objects |
plot.prismaMF | Generics For PRISMA Objects |
prismaDuplicatePCA | Matrix Factorization Based on Replicate-Aware PCA |
prismaHclust | Matrix Factorization Based on Hierarchical Clustering |
prismaNMF | Matrix Factorization Based on Replicate-Aware NMF |
thesis | The Thesis Data Set |
Further information is available in the following vignettes:
PRISMA | Quick introduction (source) |
Author(s)
Tammo Krueger, Nicole Kraemer
Maintainer: Tammo Krueger <tammokrueger@googlemail.com>
References
Krueger, T., Gascon, H., Kraemer, N., Rieck, K. (2012). Learning Stateful Models for Network Honeypots. 5th ACM Workshop on Artificial Intelligence and Security (AISEC 2012), accepted.
Krueger, T., Kraemer, N., Rieck, K. (2011). ASAP: Automatic Semantics-Aware Analysis of Network Payloads. Privacy and Security Issues in Data Mining and Machine Learning - International ECML/PKDD Workshop. Lecture Notes in Computer Science 6549, Springer, 50-63.
Examples
# please see the vignette for examples
The ASAP Data Set
Description
Toy data set to show the capabilities of the PRISMA package.
Usage
asap
Format
A prisma object.
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
References
Krueger, T., Kraemer, N., Rieck, K. (2011). ASAP: Automatic Semantics-Aware Analysis of Network Payloads. Privacy and Security Issues in Data Mining and Machine Learning - International ECML/PKDD Workshop. Lecture Notes in Computer Science 6549, Springer, 50-63.
Convert tm corpus to PRISMA
Description
Converts a tm corpus object to a PRISMA object.
Usage
corpusToPrisma(corpus, alpha = 0.05, skipFeatureCorrelation = FALSE)
Arguments
corpus | a tm corpus |
alpha | significance level for the feature tests. If NULL, all features are kept. |
skipFeatureCorrelation | should the grouping of features based on correlation analysis be skipped. |
Value
prismaData | data object representing the tokenized documents as a features x samples matrix. |
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
Examples
if (require("tm") && packageVersion("tm") >= '0.6') {
  data(thesis)
  thesis
  thesis = corpusToPrisma(thesis, NULL, TRUE)
  thesis
}
Estimate Inner Dimension
Description
Matrix factorization methods compress the original data matrix $A \in R^{f,N}$ with $f$ features and $N$ samples into two parts, namely $A = B C$ with $B \in R^{f,k}$, $C \in R^{k,N}$. The function estimateDimension estimates $k$ based on a noise model estimated from a scrambled version of the original data matrix.
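The idea of comparing the data against a scrambled copy can be sketched in base R. This is a minimal illustration under stated assumptions, not the procedure estimateDimension actually runs: the helper names `centeredSV` and `scramble`, the synthetic data, and the simple "largest scrambled singular value" threshold (instead of confidence intervals controlled by alpha) are all hypothetical choices made for the example.

```r
# Illustrative sketch only; this is NOT the PRISMA implementation.
set.seed(1)
f <- 30; N <- 200; k <- 3

# Synthetic data: each of the k components drives its own block of 10 features.
B <- kronecker(diag(k), matrix(1, 10, 1))             # f x k
C <- matrix(runif(k * N), k, N)                       # k x N
A <- B %*% C + matrix(rnorm(f * N, sd = 0.01), f, N)  # f x N, plus small noise

# Singular values of the row-centered matrix (hypothetical helper).
centeredSV <- function(M) svd(M - rowMeans(M))$d

# Scrambling (hypothetical helper): permute every row independently, which
# keeps each feature's distribution but destroys correlations across features.
scramble <- function(M) t(apply(M, 1, sample))

# Crude noise level: largest singular value seen in several scrambled copies.
noiseLevel <- max(replicate(20, centeredSV(scramble(A))[1]))

# Estimated inner dimension: singular values of the data above the noise level.
kHat <- sum(centeredSV(A) > noiseLevel)
print(kHat)
```

Scrambling each row separately leaves the per-feature statistics untouched, so any singular value that clearly exceeds what the scrambled copies produce can be attributed to structure shared across features.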
Usage
estimateDimension(prismaData, alpha = 0.05, nScrambleSamples = NULL)
Arguments
prismaData | A prismaData object loaded via loadPrismaData |
alpha | Error probability for confidence intervals |
nScrambleSamples | The number of scrambled samples that should be used to estimate the noise model. NULL means to use the complete data set. |
Value
estDim | prismaDimension object that can be printed and plotted. |
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
References
R. Schmidt. Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation, 34(3):276 – 280, 1986.
Examples
# please see the vignette for examples
Restores Data with Duplicates
Description
The loadPrismaData function triggers feature selection and data combination methods which subsequently remove duplicate entries for an efficient representation of the data. The getDuplicateData function rebuilds the data matrix with an explicit representation of all duplicate entries.
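The relationship between a compact, duplicate-free matrix and the full matrix can be sketched in base R. The names `compact`, `remapper` and `counts` are hypothetical and chosen for this illustration only; they are not claimed to match the internal fields of a prisma object.

```r
# Illustrative sketch only; field names are assumptions, not PRISMA internals.
# 3 features x 5 documents; documents 1, 3, 4 and documents 2, 5 are identical.
A <- matrix(c(1, 0, 1,
              0, 1, 1,
              1, 0, 1,
              1, 0, 1,
              0, 1, 1), nrow = 3)

# One string key per document (column) to detect duplicates.
key <- apply(A, 2, paste, collapse = "")
isFirst <- !duplicated(key)

compact  <- A[, isFirst, drop = FALSE]   # unique documents only
remapper <- match(key, key[isFirst])     # original column -> unique column
counts   <- tabulate(remapper)           # how often each unique column occurs

# Rebuild the explicit matrix with all duplicates, as getDuplicateData does
# conceptually for a prisma object.
restored <- compact[, remapper, drop = FALSE]
print(counts)
```

Storing each distinct document once together with its multiplicity is what makes replicate-aware algorithms cheaper: they can work on `compact` and weight by `counts` instead of touching every duplicate.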
Usage
getDuplicateData(prismaData)
Arguments
prismaData | prisma data loaded via |
Value
dataWithDuplicates | Data matrix containing explicit copies of all duplicates. |
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
Examples
data(asap)
dataWithDuplicates = getDuplicateData(asap)
Convert Coordinates of Matrix Factorization to Labels
Description
Given a matrix factorization object A = B C, this function returns for each document the index of the inner dimension which has the maximal coordinate. Thus, it converts the fuzzy clustering found in the columns of the C matrix into a hard clustering by returning the position with the maximal coordinate value.
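The conversion from fuzzy coordinates to hard labels amounts to a column-wise argmax. A minimal base R sketch on a hand-made coordinate matrix (the matrix `C` below is invented for illustration and is not taken from a prismaMF object):

```r
# Illustrative sketch only: column-wise argmax on a toy coordinate matrix.
# 3 inner dimensions x 4 documents; each column holds one document's coordinates.
C <- matrix(c(0.7, 0.2, 0.1,
              0.1, 0.1, 0.8,
              0.3, 0.6, 0.1,
              0.5, 0.4, 0.1), nrow = 3)

# For every document, take the index of the largest coordinate.
labels <- apply(C, 2, which.max)
print(labels)
```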
Usage
getMatrixFactorizationLabels(prismaMF)
Arguments
prismaMF | a matrix factorization object. |
Value
labels | vector containing the label assignment for each document. |
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
See Also
Load PRISMA Data Files
Description
Loads files generated by the sally tool (see http://www.mlsec.org/sally/) and represents the data as a binary token/n-gram x documents matrix. After loading, statistical tests are applied to find features which are neither volatile nor constant. Co-occurring features are grouped to further compact the data. See system.file("extdata","sallyPreprocessing.py", package="PRISMA") for a Python script which generates the corresponding .fsally file from a .sally file, which reduces the loading time via loadPrismaData considerably.
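The two compaction steps described above can be illustrated on a toy binary matrix in base R. This is a rough sketch under loud assumptions: fixed frequency cut-offs of 0.2 and 0.8 stand in for the statistical tests controlled by alpha, and features are grouped only when their occurrence patterns are exactly identical. Neither choice is claimed to match what loadPrismaData does internally, and the token names are invented.

```r
# Illustrative sketch only; thresholds and grouping rule are assumptions.
# Binary tokens x documents matrix.
X <- rbind(GET   = c(1, 1, 1, 1, 1, 1),   # constant: present in every document
           index = c(1, 0, 1, 0, 1, 0),
           html  = c(1, 0, 1, 0, 1, 0),   # always co-occurs with "index"
           login = c(0, 1, 0, 1, 0, 1),
           sid42 = c(0, 0, 0, 1, 0, 0))   # volatile: appears only once

# Step 1: drop (nearly) constant and (nearly) volatile features.
# The 0.2 / 0.8 cut-offs are a crude stand-in for a statistical test.
freq <- rowMeans(X)
kept <- X[freq >= 0.2 & freq <= 0.8, , drop = FALSE]

# Step 2: group features with identical occurrence patterns into one row.
key     <- apply(kept, 1, paste, collapse = "")
isFirst <- !duplicated(key)
groups  <- split(rownames(kept), key)
grouped <- kept[isFirst, , drop = FALSE]
rownames(grouped) <- unname(vapply(groups[key[isFirst]], paste,
                                   character(1), collapse = " "))
print(grouped)
```

Here five tokens shrink to two grouped features, which shows why the matrix handed to the factorization routines can be much smaller than the raw token count suggests.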
Usage
loadPrismaData(path, maxLines = -1, fastSally = TRUE, alpha = 0.05, skipFeatureCorrelation = FALSE)
Arguments
path | path of the data file without the .sally extension. loadPrismaData loads path.sally or path.fsally depending on the fastSally switch. |
maxLines | maximal number of lines to read from the data file. -1 means to read all lines. |
fastSally | should the fsally file be used, which drastically decreases loading time. |
alpha | significance level for the feature tests. If NULL, all features are kept. |
skipFeatureCorrelation | should the grouping of features based on correlation analysis be skipped. |
Value
prismaData | data object representing the tokenized documents as a features x samples matrix. |
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
References
See http://www.mlsec.org/sally/ for the sally utility.
Examples
# please see the vignette for examples
# please see system.file("extdata","asap.tar.gz", package="PRISMA") for
# an example sally output
Generics For PRISMA Objects
Description
Print and plot generic for the PRISMA objects.
Usage
## S3 method for class 'prisma'
print(x, ...)
## S3 method for class 'prisma'
plot(x, ...)
Arguments
x | PRISMA data loaded via |
... | not used |
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
See Also
estimateDimension, prismaHclust, prismaDuplicatePCA, prismaNMF
Examples
data(asap)
print(asap)
plot(asap)
Generics For PRISMA Objects
Description
Print and plot generic for the PRISMA dimension objects.
Usage
## S3 method for class 'prismaDimension'
print(x, ...)
## S3 method for class 'prismaDimension'
plot(x, ...)
Arguments
x | PRISMA dimension object generated via |
... | not used |
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
See Also
estimateDimension, prismaHclust, prismaDuplicatePCA, prismaNMF
Examples
# please see the vignette for examples
Generics For PRISMA Objects
Description
Print and plot generic for the PRISMA matrix factorization objects.
Usage
## S3 method for class 'prismaMF'
plot(x, nLines = NULL, baseIndex = NULL, sampleIndex = NULL,
     minValue = NULL, noRowClustering = FALSE, noColClustering = FALSE,
     type = c("base", "coordinates"), ...)
Arguments
x | PRISMA matrix factorization object |
nLines | number of lines that should be plotted |
baseIndex | which bases should be plotted |
sampleIndex | which samples should be plotted |
minValue | cut-off value, i.e., every value smaller than |
noRowClustering | don't cluster the rows |
noColClustering | don't cluster the columns |
type | show the base ( |
... | not used |
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
See Also
estimateDimension, prismaHclust, prismaDuplicatePCA, prismaNMF
Examples
# please see the vignette for examples
Matrix Factorization Based on Replicate-Aware PCA
Description
Efficient implementation of a replicate-aware principal component analysis (PCA).
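One way replicate information can make a PCA cheaper is to compute the covariance from the unique documents weighted by their multiplicities instead of from the full matrix. The base R sketch below checks that both routes give the same covariance. It is an illustration of the general idea, assuming a hypothetical compact matrix `U` and count vector `w`; it does not claim to reproduce how prismaDuplicatePCA is implemented.

```r
# Illustrative sketch only; U and w are invented, not PRISMA internals.
# 3 features x 4 unique documents, with how often each one occurs.
U <- matrix(c(1, 0, 0,
              0, 1, 0,
              1, 1, 0,
              0, 0, 1), nrow = 3)
w <- c(5, 3, 1, 2)
N <- sum(w)

# Replicate-aware route: weighted mean and weighted covariance on U only.
mu <- as.vector(U %*% w) / N
Uc <- U - mu                                # subtract the feature means row-wise
covCompact <- Uc %*% (t(Uc) * w) / N        # Uc diag(w) Uc' / N

# Naive route: expand all duplicates explicitly and compute the covariance.
full <- U[, rep(seq_along(w), w)]
covFull <- tcrossprod(full - rowMeans(full)) / N

# Both covariances agree, so the principal axes agree as well.
print(all.equal(covCompact, covFull))
pcs <- eigen(covCompact, symmetric = TRUE)
```

The weighted route touches 4 columns instead of 11 here; with heavily duplicated network payloads the saving can be far larger.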
Usage
prismaDuplicatePCA(prismaData)
Arguments
prismaData | PRISMA data for which a PCA should be calculated |
Value
prismaPCA | Matrix factorization object $A = B C$, in which the factors are calculated by a replicate-aware PCA |
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
Examples
# please see the vignette for examples
Matrix Factorization Based on Hierarchical Clustering
Description
A matrix factorization A = B C based on the results of hclust is constructed, which holds the mean feature values for each cluster in the matrix B and the indication of the cluster in the matrix C for each data point (i.e. each data point is represented by its assigned cluster center).
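The construction described above can be sketched with the standard hclust and cutree functions from the stats package. The synthetic data, the use of Euclidean distances via dist, and the variable names are assumptions made for this illustration; the sketch does not use or reproduce prismaHclust itself.

```r
# Illustrative sketch only; data and distance choice are assumptions.
set.seed(42)
ncomp <- 2

# 4 features x 10 samples: two well-separated groups of 5 samples each.
A <- cbind(matrix(rnorm(20, mean = 0, sd = 0.1), nrow = 4),
           matrix(rnorm(20, mean = 5, sd = 0.1), nrow = 4))

# Cluster the samples (columns) and cut the tree into ncomp clusters.
hc <- hclust(dist(t(A)), method = "single")
cl <- cutree(hc, k = ncomp)

# B: mean feature values per cluster (features x ncomp).
B <- sapply(seq_len(ncomp), function(j) rowMeans(A[, cl == j, drop = FALSE]))

# C: cluster indicator per sample (ncomp x samples), exactly one 1 per column.
C <- t(sapply(seq_len(ncomp), function(j) as.numeric(cl == j)))

# B %*% C replaces every sample by its assigned cluster center.
recon <- B %*% C
print(max(abs(A - recon)))
```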
Usage
prismaHclust(prismaData, ncomp, method = "single")
Arguments
prismaData | PRISMA data for which a clustering should be calculated. |
ncomp | the number of components that should be extracted. |
method | the method used for clustering. |
Value
prismaHclust | Matrix factorization object containing |
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
See Also
Examples
# please see the vignette for examples
Matrix Factorization Based on Replicate-Aware NMF
Description
Matrix factorization $A = B C$ with strictly positive matrices $B$, $C$ which minimize the reconstruction error $\|A - B C\|$. This replicate-aware version of the non-negative matrix factorization (NMF) is based on the alternating least squares approach and exploits the replicate information to speed up the calculation.
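A plain alternating least squares loop for NMF can be sketched in a few lines of base R. This is a generic textbook-style variant (solve an unconstrained least squares problem for one factor, clamp negative entries to zero, then switch factors) run on synthetic data; it ignores replicate information, PCA initialization, normalization and the time limit, and is not the code behind prismaNMF. The small ridge term is an assumption added purely for numerical safety.

```r
# Illustrative sketch only; generic ALS-NMF, NOT the PRISMA implementation.
set.seed(7)
f <- 8; N <- 40; k <- 2

# Synthetic non-negative data with an exact rank-k factorization.
A <- matrix(runif(f * k), f, k) %*% matrix(runif(k * N), k, N)

B <- matrix(runif(f * k), f, k)   # random non-negative start
ridge <- 1e-9 * diag(k)           # tiny ridge keeps the k x k systems solvable

for (iter in 1:50) {
  # Fix B, solve min ||A - B C|| for C, then clamp negative entries to 0.
  C <- pmax(solve(crossprod(B) + ridge, crossprod(B, A)), 0)
  # Fix C, solve min ||A - B C|| for B, then clamp negative entries to 0.
  B <- pmax(t(solve(tcrossprod(C) + ridge, tcrossprod(C, A))), 0)
}

# Relative reconstruction error in the Frobenius norm.
relErr <- norm(A - B %*% C, "F") / norm(A, "F")
print(relErr)
```

Each half-step only needs a k x k linear system, which is why the approach scales to large feature and sample counts when k is small.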
Usage
prismaNMF(prismaData, ncomp, time = 60, pca.init = TRUE, doNorm = TRUE, oldResult = NULL)
Arguments
prismaData | PRISMA data for which a NMF should be calculated. |
ncomp | either an |
time | seconds after which the calculation should end. |
pca.init | should the |
doNorm | should the |
oldResult | re-use results of a previous run, i.e. |
Value
prismaNMF | Matrix factorization object containing the |
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
References
Krueger, T., Gascon, H., Kraemer, N., Rieck, K. (2012). Learning Stateful Models for Network Honeypots. 5th ACM Workshop on Artificial Intelligence and Security (AISEC 2012), accepted.
R. Albright, J. Cox, D. Duling, A. Langville, and C. Meyer. (2006). Algorithms, initializations, and convergence for the nonnegative matrix factorization. Technical Report 81706, North Carolina State University.
Examples
# please see the vignette for examples
The Thesis Data Set
Description
The 15 sections of a thesis (see references) as a tm-corpus.
Usage
thesis
Format
A tm-corpus.
Author(s)
Tammo Krueger <tammokrueger@googlemail.com>
References
Tammo Krueger. Probabilistic Methods for Network Security. From Analysis to Response. PhD thesis, TU Berlin, 2013. http://opus.kobv.de/tuberlin/volltexte/2013/3881/