- Carlos Goncalves
- Joaquim F. Silva
- Jose C. Cunha
Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11537)
Abstract
Statistical extraction of relevant n-grams in natural language corpora is important for text indexing and classification, since it can be language-independent. We show how a theoretical model identifies the distribution properties of the distinct n-grams and singletons appearing in large corpora, and how this knowledge contributes to understanding the performance of an n-gram cache system used for extraction of relevant terms. We show how this approach allowed us to evaluate the benefits of using Bloom filters for excluding singletons and of using static prefetching of nonsingletons in an n-gram cache. In the context of the distributed and parallel implementation of the LocalMaxs extraction method, we analyze the cache miss ratio and size, and the efficiency of n-gram cohesion calculation with LocalMaxs.
Acknowledgements to FCT MCTES and NOVA LINCS UID/CEC/04516/2019.
Keywords
- Large corpora
- Statistical extraction
- Multiword terms
- Parallel processing
- n-gram cache performance
- Cloud computing
1 Introduction
Multiword expressions in natural language texts are n-grams (sequences of \(n\ge 1\) consecutive words). Statistical extraction of relevant expressions, useful for text indexing and classification, can be language-independent; thus it can be included in the initial stages of extraction pipelines, followed by language-specific syntactic/semantic filtering. The increased availability of large corpora [1, 2] due to the Web growth challenges statistical extraction methods. We focus on n-gram distribution models and on parallel and distributed tools for extracting relevant expressions from large corpora. LocalMaxs [3, 4], a multiphase statistical extraction method, has a \(1^{st}\) phase for collecting n-gram frequency statistics, a \(2^{nd}\) phase for calculating an n-gram cohesion metric, and a \(3^{rd}\) phase for applying an n-gram relevance filtering criterion. The computational complexity of methods such as LocalMaxs depends on n-gram distribution properties. Thus we proposed [5] a theoretical model predicting the n-gram distribution as a function of corpus size and n-gram size (\(n\ge 1\)), validated empirically for estimating the numbers of distinct n-grams, \(1\le n \le 6\), with English and French corpora from 2 Mw (\(10^{6}\) words) to 1 Gw (\(10^{9}\) words) [6]. It allows identifying how the numbers of distinct n-grams tend asymptotically to plateaux as the corpora grow toward infinity. Due to the large numbers of distinct n-grams in large corpora, memory limitations become critical, motivating optimizations for space-efficient n-gram data structures [7]. We pursue an orthogonal approach, using parallel computing for acceptable execution times, overcoming the memory limitations by data partitioning across more machines, and using data distribution for scalable storage of the n-gram statistical data [8, 9].
Guided by the theoretical model estimates, we developed a parallel architecture for LocalMaxs with: an on-demand dynamic n-gram cache to keep the n-gram frequency data, used for supporting the cohesion calculations in the \(2^{nd}\) phase; a distributed in-memory store as a repository of the n-gram global frequency values in the corpus and of the cohesion and relevance values; and a workflow tool for specifying multiphase methods, supported by a distributed implementation with a configurable number of virtual machines. LocalMaxs execution performance for extracting relevant 2-grams and 3-grams from English corpora up to 1 Gw was shown to be scalable, with almost linear relative speed-up and size-up, with up to 48 virtual machines on a public cloud [8, 9]. However, that implementation achieves low efficiency relative to a single ideal sequential machine, because the on-demand dynamic n-gram cache is unable to overcome the communication overheads caused by the n-gram references missing in the cache, which require remote fetching of the n-gram global frequency counts. To improve the n-gram cache efficiency, we discuss two new aspects, as extensions to the LocalMaxs parallel architecture. The first consists in filtering out the singleton n-grams. To evaluate this, we extend the theoretical model to predict the distribution of singleton n-grams, \(1 \le n \le 6\), applying it to English corpora from a few Mw to infinity. Then we show that this singleton filtering with Bloom filters [10] leads to a reduction of the n-gram cache miss ratio, whose magnitude depends on the evolution of the number of singletons as the corpus size grows. The second improvement relies on static prefetching of the n-gram statistical data into the cache. For a multiphase method, this can be performed completely in the \(1^{st}\) phase (collecting n-gram statistics), so that during a subsequent phase where the n-gram cache is used for cohesion metric and relevance calculation there is no cache miss overhead. For LocalMaxs, this leads to a virtually \(0\%\) cache miss ratio for any corpus size. In the paper we discuss the background (Sect. 2), the theoretical model and the distribution of singleton n-grams (Sect. 3), the two n-gram cache improvements (Sect. 4) and the obtained results (Sect. 5).
2 Background
Relevant expressions, e.g. “United Nations”, can be used to summarize, index or cluster documents. Due to their semantic richness, their automatic extraction from raw text is of great interest. Extraction approaches can be linguistic, statistical or hybrid [11, 12]. Most of the statistical ones are language-neutral [13], using metrics such as Mutual Information [14], Likelihood Ratio [15], or \({\varPhi }^{2}\) [16]. Among the latter, LocalMaxs [3, 4] extracts multiword relevant expressions [17].
LocalMaxs. It relies on a generic cohesion metric, called “glue” (such as \(SCP_{f}\), Eq. (1) below; Dice [4]; or Mutual Information), and on a generic relevance criterion (Eq. (2) below) that, for a given input set of n-grams (\(n\ge 2\)), identifies the ones considered relevant, according to the strength of their internal co-occurrence:

\[ SCP_{f}\left( w_{1}{\scriptstyle \cdots }w_{n}\right) =\frac{f\left( w_{1}{\scriptstyle \cdots }w_{n}\right) ^{2}}{\frac{1}{n-1}\sum _{i=1}^{n-1}f\left( w_{1}{\scriptstyle \cdots }w_{i}\right) \times f\left( w_{i+1}{\scriptstyle \cdots }w_{n}\right) }\qquad (1) \]

where \(f\left( w_{1}{\scriptstyle \cdots }w_{i}\right) \) is the frequency of the n-gram \(\left( w_{1}{\scriptstyle \cdots }w_{i}\right) \), \(i\ge 1\), in the corpus. The denominator contains the frequencies of all “leftmost and rightmost sub n-grams” (abbreviated, for simplicity, as “sub n-grams”) of sizes 1 to \(n-1\) contained in \(\left( w_{1}{\scriptstyle \cdots }w_{n}\right) \). E.g., considering the 5-gram “European Court of Human Rights”, the sub n-grams whose frequencies are needed for the glue calculation are: the 1-grams, “European” and “Rights”; the 2-grams, “European Court” and “Human Rights”; the 3-grams, “European Court of” and “of Human Rights”; and the 4-grams, “European Court of Human” and “Court of Human Rights”.
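The sub n-gram enumeration and the \(SCP_{f}\) glue can be sketched in a few lines (a minimal illustration; the function names and the toy frequency table are ours, not part of the LocalMaxs implementation):

```python
def sub_ngrams(ngram):
    """Leftmost and rightmost sub n-grams (sizes 1..n-1) of a word tuple."""
    n = len(ngram)
    return [ngram[:i] for i in range(1, n)] + [ngram[n - i:] for i in range(1, n)]

def scp_f(ngram, freq):
    """SCP_f glue of an n-gram given a frequency table `freq` (Eq. (1)):
    f(W)^2 over the average product of frequencies of each left/right split."""
    n = len(ngram)
    avg = sum(freq[ngram[:i]] * freq[ngram[i:]] for i in range(1, n)) / (n - 1)
    return freq[ngram] ** 2 / avg

w = ("European", "Court", "of", "Human", "Rights")
print(sub_ngrams(w))  # the 8 sub n-grams listed in the text example
```

For the 5-gram of the example, `sub_ngrams` returns exactly the two 1-grams, two 2-grams, two 3-grams and two 4-grams enumerated above.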
LocalMaxs Relevance Criterion. Let \(W=\left( w_{1}{\scriptstyle \cdots }w_{n}\right) \) be an n-gram and \(g\left( .\right) \) a generic cohesion metric. Let \({\varOmega }_{n-1}\left( W\right) \) be the set of \(g\left( .\right) \) values for all contiguous \((n-1)\)-grams within the n-gram W, and \({\varOmega }_{n+1}\left( W\right) \) the set of \(g\left( .\right) \) values for all contiguous \((n+1)\)-grams containing the n-gram W. W is a relevant expression iff:

\[ \forall x\in {\varOmega }_{n-1}\left( W\right) ,\forall y\in {\varOmega }_{n+1}\left( W\right) :\;\left( n=2\,\wedge \,g\left( W\right) >y\right) \,\vee \,\left( n>2\,\wedge \,g\left( W\right) >\frac{x+y}{2}\right) \qquad (2) \]
For the example \(W=\)(“European Court of Human Rights”), the sets are: \({\varOmega }_{n-1}\left( W\right) =\left\{ g\left( \text{European Court of Human}\right) ,\,g\left( \text{Court of Human Rights}\right) \right\} \); and \({\varOmega }_{n+1}\left( W\right) =\left\{ g\left( Y\right) \right\} \), such that \(Y=\left( w_{L}\;W\right) \) or \(Y=\left( W\;w_{R}\right) \), where the symbols \(w_{L}\) and \(w_{R}\) stand for unigrams appearing in the corpus, and Y is the \((n+1)\)-gram obtained by concatenating \(w_{L}\) or \(w_{R}\) with W.
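Read operationally, the criterion compares the glue of W against its shorter and longer neighbours. A sketch of the decision, assuming the \({\varOmega }\) sets have already been computed as lists of glue values (our naming; the separate \(n=2\) case follows the published criterion, since a 2-gram has no \((n-1)\)-gram glues):

```python
def is_relevant(n, glue_w, omega_minus, omega_plus):
    """LocalMaxs relevance test for an n-gram W with glue `glue_w`.
    `omega_minus` / `omega_plus` hold the glue values of the (n-1)-grams
    within W and of the (n+1)-grams containing W, respectively."""
    if n == 2:
        # no (n-1)-gram glues exist: compare only against the (n+1)-grams
        return all(glue_w > y for y in omega_plus)
    return all(glue_w > (x + y) / 2
               for x in omega_minus for y in omega_plus)

# A 3-gram whose glue dominates the average of every neighbour pair:
print(is_relevant(3, 0.9, [0.5, 0.6], [0.2]))  # True
```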
Parallel LocalMaxs Architecture. Figure 1a shows the logical dependencies of LocalMaxs for extracting relevant n-grams, \(2\le n\le 5\). For a given maximum n-gram size \(n_{MAX}\), the relevant n-grams, \(2\le n \le n_{MAX}\), are identified in the corpus in three phases: (1) counting all n-gram occurrences, \(1\le n \le \left( n_{MAX}+1\right) \); (2) calculating the glue (\(g_{2\cdots \left( n_{MAX}+1\right) }\)) for all distinct n-grams, \(2\le n\le \left( n_{MAX}+1\right) \); and (3) applying a relevance criterion to all distinct nonsingleton n-grams, \(2\le n\le n_{MAX}\). The workflow is executed [8, 9] by a collection of virtual machines, each with one controller (for the LocalMaxs functions: count, glue, relevance), one server (for storing the n-gram data), and local n-gram caches (Fig. 1b).
In phase one, the n-gram counting is performed in parallel by different controllers acting on equal-size input corpus partitions. It generates the distinct n-gram tables, one for each n-gram size, containing the total counts of all the n-gram occurrences in the corpus. These tables, partitioned by n-gram hashing, are stored in a distributed collection of servers, thus supporting a repository of the global n-gram frequency counts in the corpus (at the end of phase one). For K machines, each server \(S\left( j\right) \) in each machine j, \(1\le j\le K\), keeps a local n-gram table (\(D_{i}\left( j\right) \)) for n-grams of size i, \(1\le i\le \left( n_{MAX}+1\right) \). The set of local n-gram tables (\(1\le i\le \left( n_{MAX}+1\right) \)) within each server \(S\left( j\right) \) is \(\left\{ D_{1}\left( j\right) ,D_{2}\left( j\right) ,\cdots ,D_{i}\left( j\right) ,\cdots ,D_{n_{MAX}+1}\left( j\right) \right\} \). The set of distinct n-grams of size i in the corpus (\(D_{i}\)) is the union of the disjoint local n-gram tables \(D_{i}\left( j\right) \) in all servers \(1\le j\le K\). In each machine (j), there is one controller (\(Ctrl\left( j\right) \)) co-located with one local server \(S\left( j\right) \). Phase two input consists of a set of distinct n-grams whose glues must be calculated. These n-grams and their frequency counts are found, by each machine controller, in the local server n-gram tables. However, the frequencies of the sub n-grams required for the glue calculation of each distinct n-gram must be fetched from the global distributed repository. So, in this phase, the repeated sub n-gram references used by the glue calculations justify a per-machine n-gram cache for each n-gram size (\(C_{1}\), ..., \(C_{n_{MAX}}\)) (Fig. 1b). Each local cache entry has the frequency of a distinct n-gram. At the end of phase two all the distinct n-gram entries in the global repository become updated with their glue values.
The input to phase three, for each machine controller, consists of the local n-gram tables updated by phase two, used to evaluate the n-gram relevance, finally stored in the local n-gram table. At the end, for all tables (\(D_{i}\left( j\right) \)) of the global repository, each entry has: a unique n-gram identification, its global frequency, its glue value, and its relevance flag (yes/no). As the corpus data is unchanged during LocalMaxs execution, a static work distribution leads to a balanced load in all phases, since the local table sizes are approximately equal, \(\left| D_{i}\left( j\right) \right| \approx \left| D_{i}\right| /K\) (for each n-gram size i, \(1 \le i\le n\), and machine j, \(1 \le j \le K\)), and the controller input partitions are of equal sizes.
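The repository's partitioning of the n-gram tables by hashing can be sketched as follows (illustrative Python; `K`, `server_of` and the toy counts are our assumptions, not the paper's code):

```python
import hashlib

K = 4  # number of machines (illustrative)

def server_of(ngram, k=K):
    """Map an n-gram (tuple of words) to the index j of the server S(j)
    holding its global frequency entry, by hashing the n-gram."""
    digest = hashlib.md5(" ".join(ngram).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % k

# Local tables D_i(j): here, one dict per server for a single n-gram size.
tables = [dict() for _ in range(K)]
counts = {("United", "Nations"): 5, ("the",): 120, ("Human", "Rights"): 3}
for ng, c in counts.items():
    tables[server_of(ng)][ng] = c
```

Because the hash determines a unique server for each n-gram, the local tables are disjoint and their union is the set of distinct n-grams, as described above.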
3 A Theoretical Model for n-gram Distribution
We review a theoretical model [5] for the efficient estimation of the number of distinct n-grams (\(n\ge 1\)) for any corpus size, for each given language. Here the model is extended to predict the number of singleton n-grams.
Distinct n-grams. By the Zipf-Mandelbrot Law [18, 19] and the Poisson distribution:

\[ f\left( r,c,n\right) =\frac{\left( 1+\beta \left( n\right) \right) ^{\alpha \left( n\right) }\times f\left( 1,c,n\right) }{\left( r+\beta \left( n\right) \right) ^{\alpha \left( n\right) }}\qquad (3) \]

where \(f\left( r,c,n\right) \) is the absolute frequency of the \(r^{th}\) most frequent n-gram of size n in a corpus C of \(c=\left| C\right| \) words. The most frequent n-gram of size n is ranked \(r=1\) with frequency \(f\left( 1,c,n\right) \), and the least frequent n-gram of size n has rank \(r=D\left( c,n\right) \), i.e., the number of distinct n-grams of size n in corpus C. For each language, \(\alpha \left( n\right) \) and \(\beta \left( n\right) \) are approximately constant in Eq. (3). As confirmed empirically, the relative frequency \(p_{1}\left( n\right) \) of the first-ranked n-gram tends to be constant wrt the corpus size: \(f\left( 1,c,n\right) =p_{1}\left( n\right) \times c\). Thus, \(\left( \left( 1+\beta \left( n\right) \right) ^{\alpha \left( n\right) }\times f\left( 1,c,n\right) \right) \) in Eq. (3) is constant for each corpus size, hence the frequency of each rank follows a power law with \(\alpha \left( n\right) >0\). Let the random variable X be the number of occurrences of the n-gram w of rank r in a corpus in language l. By the Poisson distribution, the probability of w occurring at least once is:

\[ P\left( X\ge 1\right) =1-e^{-\lambda }\qquad (4) \]

where \(\lambda \) is the Poisson parameter, the expected frequency of n-gram w in that corpus. For each rank r, we have \(\lambda =f\left( r,c,n\right) \). Thus, \(Dist\left( l,c,n\right) \), the expected number of distinct n-grams of size n in a corpus of size c in language l, is:

\[ Dist\left( l,c,n\right) =\sum _{r=1}^{v\left( n\right) }\left( 1-e^{-f\left( r,c,n\right) }\right) \qquad (5) \]
For each n-gram size n there is a corresponding language n-gram vocabulary of specific size \(v\left( n\right) \), which in our interpretation includes all different word inflections as distinct. The parameters \(\alpha \), \(\beta \), \(p_{1}\), v were estimated empirically for the English language, for 1-grams to 6-grams [5], using a set of Wikipedia corpora from 2 Mw to 982 Mw (Table 1). In Fig. 2a, the curves for the estimates (Es) of the numbers of distinct n-grams (\(1\le n\le 6\)) are shown dotted and those for the observed data (Obs) are filled, corresponding to a relative error (\(Es/Obs-1\)) generally below \(1\%\) [5]. Above well-identified corpus size thresholds, for each n-gram size, the number of distinct n-grams reaches an asymptotic plateau determined by the finite vocabulary size at a given time epoch. Any further corpus increase just increases the frequencies of the existing n-grams.
Number of Singleton n-grams. From the Poisson distribution, the number of distinct n-grams with frequency \(k\ge 0\) is estimated as:

\[ D_{k}\left( c,n\right) =\sum _{r=1}^{v\left( n\right) }\frac{\lambda _{r}^{k}\;e^{-\lambda _{r}}}{k!}\qquad (6) \]

where \(\lambda _{r}=f\left( r,c,n\right) \). For \(k=1\) it estimates the number of singletons (Fig. 2b), \(1\le n\le 6\). The number of singletons increases with the corpus size, as new ones keep appearing until a maximum, and then decreases gradually due to the vocabulary finiteness. Singletons remain a significant proportion of the distinct n-grams over a wide range of corpus sizes: e.g., their proportions fall below \(80\%\) only for corpora around 8 Mw, 1 Gw, 4 Gw, 16 Gw and 131 Gw, respectively, for 2-grams, 3-grams, 4-grams, 5-grams and 6-grams. The singleton 1-gram proportion is above \(55\%\) for corpora up to 16 Gw.
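The singleton estimate (frequency \(k=1\)) can be sketched the same way, exhibiting the rise-and-fall behavior just described (again with made-up placeholder parameters, not the fitted English values):

```python
import math

def expected_with_freq_k(c, k, alpha, beta, p1, v):
    """Poisson estimate of the number of distinct n-grams occurring
    exactly k times, with lambda_r given by the Zipf-Mandelbrot ranks."""
    total = 0.0
    for r in range(1, v + 1):
        lam = ((1 + beta) ** alpha * p1 * c) / ((r + beta) ** alpha)
        total += lam ** k * math.exp(-lam) / math.factorial(k)
    return total

# Singletons (k = 1) first grow with the corpus size, then vanish:
for c in (10**3, 10**5, 10**8):
    print(c, round(expected_with_freq_k(c, 1, 1.2, 2.0, 0.06, 10_000), 1))
```

With these toy parameters the singleton count peaks at an intermediate corpus size and tends to zero for very large corpora, once every vocabulary entry has been observed more than once.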
4 n-gram Cache System
Caching has been widely studied: in linguistics [20], Zipf distributions [21], Web search [22], and mining [23]. An n-gram cache is useful to statistical methods. In [9] a dynamic on-demand n-gram cache exploits repetitions in texts, reducing the access overheads to a remote store in LocalMaxs \(2^{nd}\) phase. Overheads were further reduced [9] by an n-gram cache warm-up using combined metric calculations, and by using more machines, thus reducing the per-machine number of n-gram misses (albeit nonlinearly) and the miss time penalty. That reduction is still not enough. Thus, we discuss two new improvements, validated experimentally for English corpora up to 1 Gw. Firstly (Sect. 4.2), we filter the large proportion of singletons in the corpus out of the n-gram cache, using Bloom filters [10]. In [24], alternatives involving Bloom filters, caching and disk/in-memory storage were evaluated, but focused on performance and scalability in text mining. Distinctively, we developed an n-gram cache for \(n\ge 1\) and analyzed the Bloom filter efficiency as a function of the number of singletons, from small corpus sizes up to infinity. Secondly (Sect. 4.3), using static prefetching we achieved a \(0\%\) n-gram cache miss ratio.
An n-gram Cache in LocalMaxs \(2^{nd}\) Phase. For each glue calculation, references to sub n-grams are generated and submitted as cache input references, to check if they are already in the local n-gram cache; otherwise they must first be fetched from the global n-gram repository. The set of references, \(all_{glue_{g_{n}}Ref}\left( j\right) \), contains all sub n-gram occurrences for the glue calculation (\(g_{n}\)) of the distinct n-grams of size n in table \(D_{n}(j)\) in machine j. The set of distinct sub n-grams (sizes 1 to \((n-1)\)) found within \(all_{glue_{g_{n}}Ref}\left( j\right) \) is \(D_{all_{1\cdots \left( n-1\right) }}\left( j\right) = D_{1_{in}D_{2}\cdots D_{n}} \left( j\right) \cup D_{2_{in}D_{3}\cdots D_{n}}\left( j\right) \cup \cdots \cup D_{\left( n-1\right) _{in}D_{n}} \left( j\right) \). Each set \(D_{i_{in}D_{n}}\), \(1\le i\le (n-1)\), contains the distinct sub n-grams of size i occurring within the n-grams of the \(D_{n}\) table (Eq. (1)). For a single machine, \(D_{i_{in}D_{n}}\) is \(D_{i}\), \(1\le i\le (n-1)\), the set of distinct n-grams of size i in the corpus. For multiple machines, each machine handles the distinct sub n-gram references in its local tables (\(D_{2}(j)\), \(D_{3}(j)\), etc.).
4.1 Dynamic On-Demand n-gram Cache
A dynamic on-demand n-gram cache, assumed unbounded, is able to contain all the distinct sub n-grams of each local n-gram table. We analyzed the behavior of the cold-start (first occurrence) misses for an initially empty cache. If there is not enough memory, the cache capacity misses are also handled. To reduce the cold miss overhead we built an n-gram cache warm-up [9] using combined glues: whereas the single glue calculation for 2-grams (\(g_{2}\)) only requires access to the 1-gram cache, for the combined glues of 2-grams up to 6-grams (\(g_{2\cdots 6}\)) the 1-gram cache (\(C_{1}\)) is reused five times, the 2-gram cache (\(C_{2}\)) is reused four times, and so on for caches \(C_{3}\), \(C_{4}\), \(C_{5}\). In a single machine, the global miss ratio (mr) of an unbounded cache system with subcaches \(C_{1}\) to \(C_{5}\) used for glue \(g_{2\cdots 6}\) is the proportion of distinct sub n-grams wrt the total number of cache references:

\[ mr=\frac{\left| D_{all_{1\cdots 5}}\right| }{\left| all_{g_{2\cdots 6}}Ref\right| }\qquad (7) \]
The miss ratio decreases with the corpus size and increases with the glue calculation complexity (n-gram size). Using the theoretical model for a single machine, we predicted the evolution of the miss ratio of the dynamic on-demand n-gram cache (Fig. 3a) for glue \(g_{2\cdots 6}\): it varies from around \(11\%\), for a corpus size close to 10 Mw, to an asymptotic limit of around \(1.5\%\) in the plateaux (beyond 1 Tw). Results were experimentally validated for English corpora up to 1 Gw.
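The cold-start behavior of the on-demand cache can be sketched as a counting wrapper around the remote fetch (illustrative Python; `repo` stands in for the distributed repository, and the class name is ours):

```python
class OnDemandCache:
    """Unbounded on-demand n-gram cache that counts cold-start misses.
    `fetch` stands in for the remote repository access."""
    def __init__(self, fetch):
        self.fetch = fetch
        self.store = {}
        self.refs = 0
        self.misses = 0

    def get(self, ngram):
        self.refs += 1
        if ngram not in self.store:
            self.misses += 1                    # first occurrence: cold miss
            self.store[ngram] = self.fetch(ngram)
        return self.store[ngram]

    @property
    def miss_ratio(self):
        return self.misses / self.refs if self.refs else 0.0

repo = {("the",): 100, ("court",): 7, ("the", "court"): 5}
cache = OnDemandCache(repo.__getitem__)
for ng in [("the",), ("court",), ("the",), ("the",), ("court",)]:
    cache.get(ng)
print(cache.miss_ratio)  # 2 distinct n-grams out of 5 references -> 0.4
```

The miss ratio is exactly the ratio of distinct referenced n-grams to total references, which is why repetition-rich larger corpora drive it down, as Eq. (7) expresses.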
Effect of Multiple Machines (\(K>1\)). Due to a multiplication effect of the nonsingletons (mostly the common ones, e.g. “the”, “and”) cited by multiple distinct n-grams spread across multiple machines [9], the number of distinct sub n-grams for glue calculation in each machine is not reduced by a factor of K wrt the number of distinct n-grams in the corpus, unlike the number of cache references per machine, which is reduced to 1/K of the single machine case. Thus, the per-machine miss ratio of a dynamic on-demand n-gram cache increases with K for each corpus size. Indeed we have shown [9] that, for each n-gram size n, the miss ratio follows a power trend: \(mr(K)\propto K^{b\left( n\right) }\), \(0<b\left( n\right) <1\).
4.2 Bloom Filters for Singletons in an On-Demand n-gram Cache
Singletons can be filtered by Bloom filters [10], trained with the nonsingletons occurring in the corpus. For the majority of singletons the Bloom filter says “definitely not in set”; for all the nonsingletons and a minority of singletons it says “possibly in set”. The percentage of false positives is kept low enough by an adequate Bloom filter implementation. During phase one of LocalMaxs, each server (Sect. 2) generates a Bloom filter for each local n-gram table. At the end of phase one, after the servers have updated all the n-gram frequencies, the filters have been trained with all the nonsingletons. At the beginning of phase two, each machine controller gets a copy of the trained Bloom filters.
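A Bloom filter for this purpose can be sketched in a few lines (a toy implementation deriving the k bit positions from slices of a SHA-256 digest; not the implementation used in the paper):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over word-tuple n-grams (illustrative only)."""
    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(" ".join(item).encode("utf-8")).digest()
        for i in range(self.k):  # k 4-byte slices -> k bit positions
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# Train on the nonsingletons only, as in phase one:
nonsingletons = [("United", "Nations"), ("Human", "Rights")]
bf = BloomFilter(m_bits=1024, k_hashes=4)
for ng in nonsingletons:
    bf.add(ng)
```

Training only on nonsingletons means singleton lookups are rejected with high probability, while nonsingletons are never falsely rejected, since Bloom filters have no false negatives.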
Single Machine Case (\(K = 1\)). The proportion of the total number of singletons (\(S_{all}\)) wrt the total number of distinct n-grams in the corpus (\(D_{all}\)) is:

\[ SF_{all}=\frac{S_{all}}{D_{all}}\qquad (8) \]

also illustrating the \(SF_{all}\) ratio for the case of glue \(g_{2\cdots 6}\). Thus:

\[ mr_{BF}=mr\times \left( 1-SF_{all}\right) \qquad (9) \]

where mr and \(mr_{BF}\) are, respectively, the miss ratio without and with Bloom filters. The case of glue \(g_{2\cdots 6}\) is illustrated in Fig. 3, where the miss ratios of the individual caches \(C_{1}\) to \(C_{5}\) are shown dotted, as well as the global miss ratio of the cache system \(C_{1+\cdots +5}\) (filled). The curves result from the model predictions and were experimentally validated for corpus sizes up to 1 Gw.
Due to the composition of the individual miss ratios of the caches \(C_{1}\) to \(C_{5}\), the miss ratio with Bloom filters (Fig. 3b) for glue \(g_{2\cdots 6}\) is mostly dominated by the larger populations of the larger n-gram sizes. It varies from about \(1\%\) for the smallest corpus (about 10 Mw) to \(1.5\%\) in the plateau. The reduction, wrt not using Bloom filters, is due to the increased filtering effectiveness in handling the large global proportion of singleton n-grams, \(1\le n\le 5\). The miss ratio has a non-monotonic behavior, with a single peak of about \(3\%\) at around 100 Gw; however, for corpora up to 1 Gw it always remains below about \(1.3\%\).
Effect of Multiple Machines (\(K > 1\)). The per-machine miss ratio is:

\[ mr_{BF}=\frac{\hat{D}_{all}-\hat{S}_{all}}{\left| all_{glue}Ref\right| /K}\qquad (10) \]

where \(\hat{D}_{all}=\left( \sum _{j=1}^{K} D_{all_{1\cdots \left( n-1\right) }}\left( j\right) \right) /K\) is the per-machine number of distinct sub n-grams of sizes 1 to \(n-1\), and \(\hat{S}_{all}\) is the per-machine number of singleton sub n-grams of sizes 1 to \(n-1\). Due to their multiplicative effect with K (Sect. 4.1), the number of per-machine nonsingletons follows \(\hat{NS}_{all}\propto K^{-b_{NS}}\), \(0< b_{NS}<1\) (\(b_{NS}\) empirically determined). Thus \(mr_{BF}\) increases with K. Table 2 shows experimental values of the miss ratios without and with Bloom filters, for a LocalMaxs implementation using \(K=16\) machines in a public cloud [25], compared to the single machine case, for corpora of sizes 205 Mw and 409 Mw, when calculating glue \(g_{234}\).
Overall, Bloom filters lead to a reduction in the cache size and in the miss ratio, both determined by the singleton ratio (\(SF_{all}\)). As this ratio tends to zero, the Bloom filter effect diminishes progressively.
4.3 n-gram Cache with Static Prefetching
Whenever one can identify the set of distinct n-grams in a corpus and their frequencies in a \(1^{st}\) phase, one can anticipate their fetching into the n-gram cache before the execution of a \(2^{nd}\) phase starts. The Fixed Frequency Accumulation Set (FAset) FA for ensuring a static hit ratio \(h_{S}\left( n,FA\right) \) is the minimal subset of distinct n-grams of a given size n whose cumulative sum of frequencies is a given percentage of the total number of occurrences of the n-grams of size n:

\[ h_{S}\left( n,FA\right) =\frac{\sum _{ng\in FA}freq_{inCorpus}\left( ng\right) }{\left| Set_{All_{n}}\right| }\qquad (11) \]

where ng is a distinct n-gram within the FAset, \(freq_{inCorpus}\left( ng\right) \) is its frequency in the corpus C, and \(\left| Set_{All_{n}}\right| =\left| C\right| -(n-1)\) for \(n\ge 1\). When applying this concept to an n-gram cache one must use, as the denominator of Eq. (11), the set of cache input references \(all_{glue}Ref_{n\text{-}gram}\) instead of the set \(Set_{All_{n}}\). For glue \(g_{2}\) of the 2-grams in the \(D_{2}\) table, LocalMaxs requires access to all the sub-unigram occurrences (in a total of \(all_{g_{2}}Ref_{1\text{-}gram}=2\times \left| D_{2}\right| \)). The FAset to be loaded into cache \(C_{1}\) is the subset of the elements of the set \(D_{1_{in}D_{2}}\) (Sect. 2) whose accumulated sum of frequencies of occurrence within the 2-grams of the \(D_{2}\) table ensures a desired static hit ratio (\(h_{S_{C_{1}}}\)) for the 1-gram cache. For a combined glue, e.g. \(g_{234}\), using caches \(C_{1}\), \(C_{2}\) and \(C_{3}\) (1-grams to 3-grams), let \(freq_{in\;all_{g_{234}}Ref_{i\text{-}gram}}\left( ng\right) \) (\(1\le i\le 3\)) be the frequency of a distinct n-gram ng occurring in the set of cache input references \(all_{g_{234}}Ref_{i\text{-}gram}\). To ensure a target static hit ratio \(h_{S}\) (or miss ratio \(mr_{S}\)), the FAset must enforce the following proportion of hits (nbrHits) wrt the total number of cache references (\(\left| all_{g_{234}}Ref\right| =\sum _{i=1}^{3}\left| all_{g_{234}}Ref_{i\text{-}gram}\right| \), for glue \(g_{234}\)):

\[ \frac{nbrHits}{\left| all_{g_{234}}Ref\right| }=\frac{\sum _{i=1}^{3}\sum _{ng\in FA_{i}}freq_{in\;all_{g_{234}}Ref_{i\text{-}gram}}\left( ng\right) }{\left| all_{g_{234}}Ref\right| }\ge h_{S}=1-mr_{S}\qquad (12) \]
Options for selecting the distinct n-grams for the FAset are: (i) all the distinct n-grams; (ii) only the nonsingletons; (iii) a subset of the distinct n-grams. Option (i) seems the best, but there is no need to include the singletons, which suggests option (ii). If there is not enough memory for all the nonsingletons in the cache, option (iii) must be taken, ensuring the maximum number of hits per n-gram under the existing memory constraints [17]. The LocalMaxs workflow (Fig. 1a) allows the FAsets for each n-gram size to be completely calculated in phase one, overlapped with the n-gram counting, using dedicated threads. As the machine allocation to LocalMaxs tasks in all phases is made before execution starts, one can also prefetch the FAsets into the corresponding machines in phase one. Thus the FAset calculation and prefetching times are hidden from the total execution time, as long as the additional thread overheads are kept small. In option (ii), by prefetching all distinct nonsingletons completely in phase one, the nonsingleton miss overheads in phase two are eliminated, leading to a \(0\%\) overall miss ratio.
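The greedy selection underlying option (iii), taking the most frequent n-grams until the target cumulative frequency of Eq. (11) is reached, can be sketched as follows (our naming; `refs` is a toy reference-count table):

```python
def faset(freqs, target_hit_ratio):
    """Minimal FAset: greedily take the most frequent n-grams until their
    cumulative share of the references reaches the target static hit ratio."""
    total = sum(freqs.values())
    fa, acc = [], 0
    for ng, f in sorted(freqs.items(), key=lambda kv: -kv[1]):
        if acc / total >= target_hit_ratio:
            break
        fa.append(ng)
        acc += f
    return fa

refs = {("the",): 50, ("of",): 30, ("court",): 15, ("zeugma",): 5}
print(faset(refs, 0.80))  # [('the',), ('of',)] covers 80% of references
```

Taking n-grams in descending frequency order makes the selected set minimal for the requested hit ratio, which matches the FAset definition above.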
Multiple Machines Case (\(K>1\)). The FAset size per machine decreases with K, as does the number of distinct sub n-grams per machine [17]. But, unlike the dynamic on-demand cache, the miss ratio with static prefetching can be kept constant wrt K by adjusting the per-machine FAset according to the number of machines; e.g., for a \(0\%\) miss ratio, all the nonsingletons in the per-machine distinct n-gram tables must always be included in the local FAset.
Experimental Results. We compared the communication and glue calculation times of static prefetching of all nonsingletons versus on-demand caching. In each machine (j), phase two takes a total time \(T_{2}\left( j\right) \) consisting of time components for: input \(T_{input}\left( j\right) \); local glue calculation \(T_{Glue}\left( j\right) \); sub n-gram fetch \(T_{comm}\left( j\right) \); and glue output \(T_{output}\left( j\right) \). The input/output consists of local machine interactions between the co-located server and controller (Fig. 1b), being the same in both cache cases. Table 3 shows the communication and glue times of \(g_{234}\) (machine average), for two corpus sizes, in LocalMaxs phase two [9, 17] in a public cloud [25] with 16 machines (each with 64 GB RAM and 4 vCPU @ 1.5 GHz).

\(\hat{T}_{comm}\) includes the per-machine times for n-gram cache fetch: local access and remote (\(\hat{T}_{RemoteFetch}\)). \(\hat{T}_{Glue}\) is the local per-machine glue calculation time. For the static prefetching cache, \(\hat{T}_{RemoteFetch}\) is zero. The static prefetching time of the nonsingletons is accounted for in phase one, overlapped with the counting.
4.4 Cache Alternatives
For glue \(g_{2\cdots 6}\) and three corpus sizes, Table 4 shows the cache miss ratio and size, and the efficiency of the glue calculation (values shown as triples \(\left\{ (mr);(Size);(E)\right\} \)) for a single machine. This efficiency (with \(K=1\)) reflects the communication overheads suffered by a single real machine versus an ideal machine, i.e., \(E=T_{0}/T_{1}\). The miss ratio and size are analyzed first, followed by the efficiency.

Cache Miss Ratio and Size. These values result from the model predictions of the numbers of distinct n-grams and singletons (Sect. 3). The values for the 8 Mw and 1 Gw corpora agree with the empirical data from real English corpora [6]. The first line shows the miss ratio and size expressions. The remaining lines show: (i) for the on-demand cache, its miss ratio (cache system \(C_{1}\), ..., \(C_{5}\)), from \(11.06\%\) in the 8 Mw corpus to \(2.11\%\) in the 1 Tw corpus, and its size (the number of distinct n-grams) (Sect. 4.1); (ii) for the dynamic cache with Bloom filter, its miss ratio, from \(0.76\%\) in the 8 Mw corpus to \(2.08\%\) in the 1 Tw corpus (where the singletons have practically disappeared, Fig. 2b), and its size (the number of nonsingletons) (Sect. 4.2); (iii) for the static prefetching case, with the FAset filled with all the nonsingletons (Sect. 4.3), the values of \(43.3\%\), \(27.7\%\) and \(0.04\%\), respectively for the 8 Mw, 1 Gw and 1 Tw corpora, correspond to singleton misses, which involve no fetching overhead, leading to an effective miss ratio of \(0\%\).
Efficiency. The glue efficiency E, for \(K\ge 1\), wrt the glue computation in an ideal (no overheads) sequential machine (\(T_{0}=\left( \sum _{i=2}^{6}\left| D_{i}\right| \right) \times t_{glue}\), for \(g_{2\cdots 6}\)), is:

\[ E=\frac{T_{0}/K}{\hat{T}}=\frac{T_{0}/K}{T_{0}/K+\hat{T}_{comm}}=\frac{1}{1+1/G}\qquad (13) \]

where \(t_{glue}\) is the per-n-gram local glue time; \(\hat{T}\) is the per-machine execution time; \(\hat{T}_{comm}=\left( all_{g_{2\cdots 6}Ref}/K\right) \times t_{fetch}\times mr\) is the per-machine n-gram miss communication time; \(t_{fetch}\) is the per-n-gram remote fetch time; and \(G=\left( T_{0}/K\right) /\hat{T}_{comm}\) is the computation-to-communication granularity ratio, which includes: (i) the algorithm-dependent term \(f_{a}=\left( \sum _{i=2}^{6}\left| D_{i}\right| \right) /all_{g_{2\cdots 6}Ref}\), i.e. the number of glue operations per memory reference, approximately constant with the corpus size for each glue, e.g., around 0.10 for \(g_{2\cdots 6}\); (ii) the measured implementation-dependent ratio \(f_{i}=\frac{t_{fetch}}{t_{glue}}\approx 20\), staying almost constant wrt the corpus size and the number of machines used (\(1\leftrightarrow 48\)); and (iii) the factor 1/mr. Thus, in LocalMaxs, \(G=\frac{f_{a}}{f_{i}\times mr}\approx \frac{0.10/20}{mr}=\frac{0.005}{mr}\). For example, \(E\ge 90\%\implies G\ge 10\implies mr\le 0.05\%\), which can only be achieved by a static prefetching cache (Sect. 4.3). Indeed, Table 4 shows that for the on-demand cache the efficiency values are very low, even with Bloom filters, where \(E\le 50\%\) always. In general, other methods exhibiting higher values of the algorithm-dependent term \(f_{a}\) will tolerate less demanding (i.e., higher) miss ratio values: e.g., if the \(f_{a}\) term is around 100, then \(mr=50\%\) would be sufficient to ensure \(E=90\%\).
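The efficiency relation can be checked numerically (a small sketch using the numeric relation \(G\approx 0.005/mr\), i.e. \(f_{a}\approx 0.10\) and \(f_{i}\approx 20\), quoted above):

```python
def efficiency(f_a, f_i, mr):
    """E = 1 / (1 + 1/G), with granularity G = f_a / (f_i * mr),
    matching the G ~ 0.005/mr relation for LocalMaxs glue g_{2..6}."""
    g = f_a / (f_i * mr)
    return 1.0 / (1.0 + 1.0 / g)

# Miss ratios from the text: ~11% (small corpus, on-demand),
# ~1.5% (plateau), and the 0.05% needed for E >= 90%.
for mr in (0.11, 0.015, 0.0005):
    print(f"mr={mr:.4%}  E={efficiency(0.10, 20, mr):.1%}")
```

With these values the on-demand miss ratios yield efficiencies in the few-percent to ~25% range, consistent with the 6% to 26% figures reported in the conclusions, while only a sub-0.05% miss ratio reaches ~90%.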
5 Conclusions and Future Work
We found out that for the statistical extraction method LocalMaxs the miss ratio of a dynamic on-demandn-gram cache is lower bounded by the proportion of distinctn-grams in acorpus. The proportion of distinctn-grams wrt the total number of cache references, i.e. the miss ratio, decreases monotonically with thecorpus size tending to an asymptoticplateau, e.g., ranging from\(11\%\) (for the smallercorpora) to\(1.5\%\) (in theplateaux region) for Englishcorpora when considering a single machine. However, these miss ratio values imply very low efficiency of the glue calculation wrt an ideal sequential machine, from\(6\%\) to\(26\%\). This is due to the significant amount of cold-startn-gram misses needed to fill up then-gram cache with the frequencies of all distinctn-grams. To overcome these limitations we have shown that Bloom filters or static prefetching can significantly improve on the cache miss ratio and size. Bloom filters benefits were found related to the distribution of the singletons along the entirecorpus size spectrum. By extending a previously proposed theoretical model [5], we found out that the number of singletons first increases with thecorpus size until a maximum and then it decreases gradually, tending to zero as the singletons disappear for very largecorpora, e.g., in the Tw region for the Englishcorpora. This behavior of the singletons determines the effectiveness of the Bloom filters which achieve a reduction of the miss ratio, namely, to the range from\(1\%\) (for the smallercorpora) to\(1.5\%\) (in theplateaux region) for Englishcorpora for a single machine. However, the corresponding efficiency is still low, always below\(49\%\). Hence, using ann-gram cache with static prefetching of the nonsingletons is of utmost importance for higher efficiency. 
We have shown that, in a multiphase method like LocalMaxs, where it is possible during a \(1^{st}\) phase (in overlap with the n-gram frequency counting) to anticipate and prefetch the set of n-grams needed, one can ensure a \(0\%\) miss ratio in the \(2^{nd}\) phase of glue calculation, leading to \(100\%\) efficiency. For a static prefetching cache (Sect. 4.3), it is possible, by design, to keep a constant miss ratio, leading to a constant efficiency with respect to the number of machines. The above improvements were implemented within the LocalMaxs parallel and distributed architecture (Sect. 2), experimentally validated for corpora up to 1 Gw. Although this study was conducted in the context of LocalMaxs, the main achievements apply to other statistical multiphase methods that access large-scale n-gram statistical data and thus potentially benefit from an n-gram cache. For corpora beyond 1 Gw, we conjecture that the global behavior of the n-gram distribution, as predicted, remains essentially valid, as the model relies on the plausible hypothesis of a finite n-gram vocabulary for each language and n-gram size, at each temporal epoch. We will proceed with this experimentation for corpora beyond 1 Gw, although fully uncut huge Tw (\(10^{12}\) words) corpora are not easily available yet [26].
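The two-phase scheme can be sketched as follows; names and the in-memory stand-in for the distributed n-gram servers are assumptions for illustration, not the paper's code. Phase 1 counts frequencies per corpus partition; since the set of n-grams the glue phase will reference is known after counting, each machine can prefetch all of them before phase 2, which then runs without misses.

```python
# Illustrative sketch of static prefetching in a two-phase pipeline
# (assumed names; remote servers are modeled as local dicts).
from collections import Counter

def count_ngrams(words, n):
    """Phase 1: n-gram frequency counting over one corpus partition."""
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

def prefetch(needed, remote_tables):
    """Fill a local cache with every needed n-gram before glue calculation."""
    cache = {}
    for ng in needed:
        for table in remote_tables:   # in the real system: remote fetches
            if ng in table:
                cache[ng] = cache.get(ng, 0) + table[ng]
    return cache

# Two partitions of a toy corpus, counted independently (phase 1).
part1 = count_ngrams("the quick fox saw the quick dog".split(), 2)
part2 = count_ngrams("the quick fox ran".split(), 2)

# After phase 1 the needed bigram set is known, so it is prefetched;
# the glue phase (phase 2) then never misses.
needed = set(part1) | set(part2)
local = prefetch(needed, [part1, part2])
misses = sum(1 for ng in needed if ng not in local)
print(f"miss ratio in glue phase: {misses / len(needed):.0%}")
```

The key property is that the prefetch cost is incurred once, in overlap with counting, rather than being paid as on-demand misses scattered throughout the glue calculation.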
References
1. Google Ngram Viewer. https://books.google.com/ngrams
2. Lin, D., et al.: New tools for web-scale n-grams. In: LREC (2010)
3. da Silva, J.F., Dias, G., Guilloré, S., Pereira Lopes, J.G.: Using LocalMaxs algorithm for the extraction of contiguous and non-contiguous multiword lexical units. In: Barahona, P., Alferes, J.J. (eds.) EPIA 1999. LNCS (LNAI), vol. 1695, pp. 113–132. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48159-1_9
4. da Silva, J.F., et al.: A local maxima method and a fair dispersion normalization for extracting multiword units. In: Proceedings of the 6th Meeting on the Mathematics of Language, pp. 369–381 (1999)
5. da Silva, J.F., et al.: A theoretical model for n-gram distribution in big data corpora. In: IEEE International Conference on Big Data, pp. 134–141 (2016)
6. Parallel LocalMaxs. http://cjsg.ddns.net/~cajo/phd/
7. Arroyuelo, D., et al.: Distributed text search using suffix arrays. Parallel Comput. 40(9), 471–495 (2014)
8. Goncalves, C., et al.: A parallel algorithm for statistical multiword term extraction from very large corpora. In: IEEE 17th International Conference on High Performance Computing and Communications, pp. 219–224 (2015)
9. Goncalves, C., et al.: An n-gram cache for large-scale parallel extraction of multiword relevant expressions with LocalMaxs. In: IEEE 12th International Conference on e-Science, pp. 120–129. IEEE Computer Society (2016)
10. Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
11. Daille, B.: Study and implementation of combined techniques for automatic extraction of terminology. In: The Balancing Act: Combining Symbolic and Statistical Approaches to Language. MIT Press (1996)
12. Velardi, P., et al.: Mining the web to create specialized glossaries. IEEE Intell. Syst. 23(5), 18–25 (2008)
13. Pearce, D.: A comparative evaluation of collocation extraction techniques. In: 3rd International Conference on Language Resources and Evaluation (2002)
14. Church, K.W., et al.: Word association norms, mutual information, and lexicography. Comput. Linguist. 16, 22–29 (1990)
15. Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19, 61–74 (1993)
16. Church, K.W., et al.: Concordance for parallel texts. In: 7th Annual Conference for the New OED and Text Research, pp. 40–62 (1991)
17. Goncalves, C.: Parallel and distributed statistical-based extraction of relevant multiwords from large corpora. Ph.D. dissertation, FCT/UNL (2017)
18. Zipf, G.K.: The Psychobiology of Language: An Introduction to Dynamic Philology. MIT Press, Cambridge (1935)
19. Mandelbrot, B.B.: On the theory of word frequencies and on related Markovian models of discourse. In: Structures of Language and Its Mathematical Aspects, vol. 12, pp. 134–141. American Mathematical Society (1961)
20. Kuhn, R.: Speech recognition and the frequency of recently used words: a modified Markov model for natural language. In: Proceedings of the 12th Conference on Computational Linguistics, COLING 1988, vol. 1, pp. 348–350. ACM (1988)
21. Breslau, L., et al.: Web caching and Zipf-like distributions: evidence and implications. In: Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies, INFOCOM 1999, vol. 1, pp. 126–134 (1999)
22. Baeza-Yates, R., et al.: The impact of caching on search engines. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2007, pp. 183–190. ACM (2007)
23. Yang, Q., et al.: Web-log mining for predictive web caching. IEEE Trans. Knowl. Data Eng. 15(4), 1050–1053 (2003)
24. Balkir, A.S., et al.: A distributed look-up architecture for text mining applications using MapReduce. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1–11 (2011)
25. Luna Cloud. http://www.lunacloud.com
26. Brants, T., et al.: Large language models in machine translation. In: Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 858–867 (2007)
Author information
Authors and Affiliations
Instituto Superior de Engenharia de Lisboa, Lisbon, Portugal
Carlos Goncalves
NOVA Laboratory for Computer Science and Informatics, Caparica, Portugal
Carlos Goncalves, Joaquim F. Silva & Jose C. Cunha
Corresponding author
Correspondence to Carlos Goncalves.
Editor information
Editors and Affiliations
University of Algarve, Faro, Portugal
João M. F. Rodrigues
University of Algarve, Faro, Portugal
Pedro J. S. Cardoso
University of Algarve, Faro, Portugal
Jânio Monteiro
University of Algarve, Faro, Portugal
Roberto Lam
University of Amsterdam, Amsterdam, The Netherlands
Valeria V. Krzhizhanovskaya
University of Amsterdam, Amsterdam, The Netherlands
Michael H. Lees
University of Tennessee at Knoxville, Knoxville, TN, USA
Jack J. Dongarra
University of Amsterdam, Amsterdam, The Netherlands
Peter M.A. Sloot
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Goncalves, C., Silva, J.F., Cunha, J.C. (2019). n-gram Cache Performance in Statistical Extraction of Relevant Terms in Large Corpora. In: Rodrigues, J., et al. (eds.) Computational Science – ICCS 2019. Lecture Notes in Computer Science, vol. 11537. Springer, Cham. https://doi.org/10.1007/978-3-030-22741-8_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-22740-1
Online ISBN: 978-3-030-22741-8