| Title: | Mixed N-Grams and Unigram Sequence Segmentation |
| Version: | 0.1.0 |
| Description: | Segmentation of short text sequences - like hashtags - into sequences of separated words, based on a dictionary that may be built on a custom corpus of texts. A unigram dictionary is used to find the most probable word sequence, and an n-grams approach is used to determine possible segmentations given the text corpus. |
| License: | GPL (≥ 3) |
| URL: | https://github.com/theogrost/NUSS |
| BugReports: | https://github.com/theogrost/NUSS/issues |
| Depends: | R (≥ 3.5) |
| Imports: | dplyr, magrittr, Rcpp, stringr, text2vec, textclean, utils |
| Suggests: | testthat (≥ 3.0.0) |
| LinkingTo: | BH, Rcpp |
| Config/testthat/edition: | 3 |
| Encoding: | UTF-8 |
| Language: | en |
| LazyData: | true |
| RoxygenNote: | 7.3.1 |
| NeedsCompilation: | yes |
| Packaged: | 2024-07-31 10:43:30 UTC; theog |
| Author: | Oskar Kosch |
| Maintainer: | Oskar Kosch <contact@oskarkosch.com> |
| Repository: | CRAN |
| Date/Publication: | 2024-08-19 08:20:16 UTC |
Base dictionary with unigrams
Description
Data contains English unigrams, with their replacements, points and ids.
Usage
data(base_dictionary)
Format
A data.frame with four columns: to_search, to_replace, points, id.
References
Created based on Wikipedia unigrams.
Examples
data(base_dictionary)
Perform inverse regex search (C++)
Description
This function takes a character vector patterns of regex patterns (or fixed strings) and searches for matches in the string x. It is inverse in the sense that in grepl a single pattern is matched against multiple strings; this function instead matches multiple patterns against a single string.
Usage
igrepl(patterns, x, fixed = FALSE)
Arguments
patterns | a character vector of regex or fixed patterns. |
x | a string to search for the match. |
fixed | a logical, indicating whether patterns are fixed strings. |
Value
A logical vector of the same length as patterns, with TRUE where the pattern was found.
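For comparison, a minimal base-R sketch of the same check, looping grepl over the patterns (this only illustrates the inverse relationship; igrepl itself uses a C++ implementation):
# Apply grepl() once per pattern against the single string x.
patterns <- c("today", "b.* fox", "jumps over", "vigorous")
x <- "The quick brown fox jumps over the lazy dog"
vapply(patterns, grepl, logical(1), x = x)
# igrepl(patterns, x) is expected to return the same logical vector.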
Examples
igrepl(c("today","b.* fox", "jumps over", "vigorous"),"The quick brown fox jumps over the lazy dog", FALSE)igrepl(c("today","brown fox", "jumps over", "vigorous"),"The quick brown fox jumps over the lazy dog", TRUE)Create n-grams dictionary
Description
ngrams_dictionary returns the data.frame containing the dictionary for ngrams_segmentation.
Usage
ngrams_dictionary(
  texts,
  clean = TRUE,
  ngram_min = 1,
  ngram_max = 5,
  points_filter = 1
)
Arguments
texts | character vector, these are the texts used to create the n-grams dictionary. Case-sensitive. |
clean | logical, indicating if the texts should be cleaned before creating the n-grams dictionary. |
ngram_min | numeric, sets the minimum number of words in n-grams when creating the dictionary. |
ngram_max | numeric, sets the maximum number of words in n-grams when creating the dictionary. |
points_filter | numeric, sets the minimal number of points (occurrences) of an n-gram to be included in the dictionary. |
Value
The output always will be a data.frame with 4 columns: 1) to_search, 2) to_replace, 3) id, 4) points.
Examples
texts <- c("this is science", "science is #fascinatingthing", "this is a scientific approach", "science is everywhere", "the beauty of science")ngrams_dictionary(texts)ngrams_dictionary(texts, clean = FALSE)ngrams_dictionary(texts, clean = TRUE, ngram_min = 2, ngram_max = 2)Segmenting sequences with n-grams.
Description
ngrams_segmentation segments the input sequence into possible segmented text based on the n-grams segmentation approach.
Usage
ngrams_segmentation(
  sequences,
  ngrams_dictionary,
  retrieve = "most-scored",
  simplify = TRUE,
  omit_zero = TRUE,
  score_formula = "points / words.number ^ 2"
)
Arguments
sequences | character vector, sequences to be segmented (e.g., hashtags, with or without the leading '#'). |
ngrams_dictionary | data.frame, containing ids, n-grams to search, words to use for segmentation, and their points. See details. |
retrieve | character vector of length 1, the type of the result data.frame to be returned: 'all', 'first-shortest', 'most-pointed' or 'most-scored'. See the value section. |
simplify | logical, if adjacent numbers should be merged into one, and underscores removed. See the simplification section. |
omit_zero | logical, if words with 0 points should be omitted from the word count. See the simplification section. |
score_formula | character vector of length 1, with the formula to calculate the score. |
Value
The output always will be a data.frame. If retrieve = 'all' is used, the return will include all possible segmentations of the given sequence.
If retrieve = 'first-shortest' is used, the first of the shortest segmentations is returned (with respect to the order of the words' appearance in the dictionary; 1 row).
If retrieve = 'most-pointed' is used, the segmentation with the most total points is returned (1 row).
If retrieve = 'most-scored' is used, the segmentation with the highest score is returned (1 row), calculated as score = points / words.number ^ 2 (or as specified by the user; see the sketch below).
The output is not in the input order. If needed, use lapply over the input to process the sequences one by one and keep the original order.
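A hedged sketch of passing a custom score_formula; the column names points and words.number are assumed to be the ones available, as in the default formula above:
texts <- c("this is science", "science is everywhere")
ndict <- ngrams_dictionary(texts)
# Linear instead of quadratic penalty on the number of words.
ngrams_segmentation("scienceiseverywhere", ndict,
                    retrieve = "most-scored",
                    score_formula = "points / words.number")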
ngrams_dictionary
Dictionary has to be a data.frame with four named columns: 1) to_search, 2) to_replace, 3) id, 4) points.
'to_search' should be a column of type character, containing the n-grams to look for. Word case might be used.
'to_replace' should be a column of type character, containing the n-grams that should be used for creating the segmentation vector if 'to_search' matches the text.
'id' should be a column of type numeric, containing the id of the unigram.
'points' should be a column of type numeric, containing the number of points for the word - the higher, the better. Unigrams with 0 points might be removed from the word count with the omit_zero argument. The ngrams_dictionary argument might be created with the ngrams_dictionary function.
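A dictionary of this structure can also be built by hand; a minimal sketch with illustrative ids and points (not values computed by ngrams_dictionary):
custom_ndict <- data.frame(
  to_search  = c("thisis", "this", "is", "science"),
  to_replace = c("this is", "this", "is", "science"),
  id         = c(1, 2, 3, 4),   # illustrative ids
  points     = c(2, 3, 4, 5)    # illustrative points
)
ngrams_segmentation("thisisscience", custom_ndict)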
Simplification
Two arguments are possible for simplification:
simplify - removes spaces between numbers and removes underscores,
omit_zero - removes ids of 0-pointed unigrams, and omits them in the word count.
By default the segmented sequence will be simplified, and numbers and underscores will be removed from the word count used for score computation, since they are necessary parts of the sequence and thus treated as neutral.
Examples
texts <- c("this is science", "science is #fascinatingthing", "this is a scientific approach", "science is everywhere", "the beauty of science")ndict <- ngrams_dictionary(texts)ngrams_segmentation("thisisscience", ndict)ngrams_segmentation("this_is_science", ndict)ngrams_segmentation("ThisIsScience", ndict)ngrams_segmentation("thisisscience", ndict, simplify=FALSE, omit_zero=FALSE)Mixed N-Grams and Unigram Sequence Segmentation (NUSS) function
Description
nuss returns the data.frame containing the hashtag, its segmented version, the ids of the dictionary words, the number of words it has taken to segment the hashtag, the total number of points, and the computed score.
Usage
nuss(sequences, texts)
Arguments
sequences | character vector, sequences to be segmented (e.g., hashtags, with or without the leading '#'). Case-insensitive. |
texts | character vector, these are the texts used to create the n-grams and unigram dictionaries. Case-insensitive. |
Details
This function is an arbitrary combination of ngrams_dictionary, unigram_dictionary, ngrams_segmentation and unigram_sequence_segmentation, created to easily segment short texts based on a text corpus.
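A rough, hedged sketch of what this combination amounts to; how nuss() actually merges the two result sets is not shown here, and the rbind() step below is an assumption, not the package's internal code:
texts <- c("this is science", "science is everywhere")
ndict <- ngrams_dictionary(texts)
udict <- unigram_dictionary(texts)
res_ngrams  <- ngrams_segmentation("thisisscience", ndict)
res_unigram <- unigram_sequence_segmentation("thisisscience", udict)
rbind(res_ngrams, res_unigram)  # combining step is assumed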
Value
The output will always be a data.frame with the sequences that were segmented, their segmented versions, the ids of the dictionary words used, the word counts, the total points, and the computed scores.
The output is not in the input order. If needed, use lapply over the input to process the sequences one by one and keep the original order.
Examples
texts <- c("this is science", "science is #fascinatingthing", "this is a scientific approach", "science is everywhere", "the beauty of science")nuss(c("thisisscience", "scienceisscience"), texts)Create unigram dictionary
Description
unigram_dictionary returns the data.frame containing the dictionary for unigram_sequence_segmentation.
Usage
unigram_dictionary(texts, points_filter = 1)
Arguments
texts | character vector, these are the texts used to create the unigram dictionary. Case-sensitive. |
points_filter | numeric, sets the minimal number of points (occurrences) of a unigram to be included in the dictionary. |
Value
The output always will be a data.frame with 4 columns: 1) to_search, 2) to_replace, 3) id, 4) points.
Examples
texts <- c("this is science", "science is #fascinatingthing", "this is a scientific approach", "science is everywhere", "the beauty of science")unigram_dictionary(texts)Segmenting sequences with unigrams
Description
unigram_sequence_segmentation segments the input sequence into possible segmented text based on the unigram sequence segmentation approach.
Usage
unigram_sequence_segmentation(
  sequences,
  unigram_dictionary = NUSS::base_dictionary,
  retrieve = "most-scored",
  simplify = TRUE,
  omit_zero = TRUE,
  score_formula = "points / words.number ^ 2"
)
Arguments
sequences | character vector, sequences to be segmented (e.g., hashtags). Case-sensitive. |
unigram_dictionary | data.frame, containing ids, words to search, words to use for segmentation, and their points. See details. |
retrieve | character vector of length 1, the type of the result data.frame to be returned: 'all', 'first-shortest', 'most-pointed' or 'most-scored'. See the value section. |
simplify | logical, if adjacent numbers should be merged into one, and underscores removed. See the simplification section. |
omit_zero | logical, if words with 0 points should be omitted from the word count. See the simplification section. |
score_formula | character vector of length 1, with the formula to calculate the score. |
Details
This function is not intended for segmenting long strings - 70 characters should be considered too long and may take hours to complete. 15 characters take about 0.02 s, 30 characters about 0.03 s.
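These timings can be checked on one's own machine with system.time(); exact figures depend on hardware and dictionary size:
# Shorter and longer sequences, using the built-in base_dictionary.
system.time(unigram_sequence_segmentation("thisisscience"))
system.time(unigram_sequence_segmentation("thisisscienceandthisisart"))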
Value
The output always will be a data.frame. If retrieve = 'all' is used, the return will include all possible segmentations of the given sequence.
If retrieve = 'first-shortest' is used, the first of the shortest segmentations is returned (with respect to the order of the words' appearance in the dictionary; 1 row).
If retrieve = 'most-pointed' is used, the segmentation with the most total points is returned (1 row).
If retrieve = 'most-scored' is used, the segmentation with the highest score is returned (1 row), calculated as score = points / words.number ^ 2 (or as specified by the user).
The output is not in the input order. If needed, use lapply over the input to process the sequences one by one and keep the original order.
unigram_dictionary
Dictionary has to be a data.frame with four named columns: 1) to_search, 2) to_replace, 3) id, 4) points.
'to_search' should be a column of type character, containing the unigrams to look for. Word case might be used.
'to_replace' should be a column of type character, containing the words that should be used for creating the segmentation vector if 'to_search' matches the text.
'id' should be a column of type numeric, containing the id of the unigram.
'points' should be a column of type numeric, containing the number of points for the word - the higher, the better. Unigrams with 0 points might be removed from the word count with the omit_zero argument.
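A minimal hand-built dictionary of this form, with illustrative points (not derived from a corpus):
custom_udict <- data.frame(
  to_search  = c("this", "is", "science"),
  to_replace = c("this", "is", "science"),
  id         = c(1, 2, 3),
  points     = c(3, 4, 5)   # illustrative points
)
unigram_sequence_segmentation("thisisscience", custom_udict)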
Simplification
Two arguments are possible for simplification:
simplify - removes spaces between numbers and removes underscores,
omit_zero - removes ids of 0-pointed unigrams, and omits them in the word count.
By default the segmented sequence will be simplified, and numbers and underscores will be removed from the word count used for score computation, since they are necessary parts of the sequence and thus treated as neutral.
Examples
# With custom dictionary
texts <- c("this is science", "science is #fascinatingthing",
           "this is a scientific approach", "science is everywhere",
           "the beauty of science")
udict <- unigram_dictionary(texts)
unigram_sequence_segmentation('thisisscience', udict)
# With built-in dictionary (English, only lowercase)
unigram_sequence_segmentation('thisisscience')
unigram_sequence_segmentation('thisisscience2024')
unigram_sequence_segmentation('thisisscience2024', simplify = FALSE, omit_zero = FALSE)