GB2197097A - Data entry using dynamic word predication algorithm - Google Patents

Data entry using dynamic word predication algorithm

Info

Publication number
GB2197097A
GB2197097A
Authority
GB
United Kingdom
Prior art keywords
word
words
data input
sub
vocabulary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB08626367A
Other versions
GB8626367D0 (en)
Inventor
Adrian Pickering
Andrew Swiffen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Dundee
Original Assignee
University of Dundee
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by University of Dundee
Priority to GB08626367A
Publication of GB8626367D0
Publication of GB2197097A
Legal status: Pending

Abstract

A method of data input comprises establishing a database or vocabulary of words and entering a word prefix. The word prefix is used to select all words from the vocabulary commencing with that prefix to form a sub-set for subsequent processing. The subsequent processing involves ordering the members of the sub-set in accordance with their potential relevance and truncating the ordered sub-set so as to identify a predetermined number of the most potentially relevant words. The truncated ordered sub-set is subsequently presented to enable the intended input word to be selected, if present, from among the presented words. The words are ordered in accordance with their recency of use, or their historical frequency of use, or preferably by a combination of the two.

Description

SPECIFICATION

Dynamic word prediction algorithm

This invention relates to an algorithm for dynamically developing a list of the most likely words that follow a given word prefix. The algorithm is adaptive in that it takes account of the past patterns of use of the words in developing its predictions. It has application in effort-efficient user interfaces in the field of Man-Machine interaction. In particular, the technique enables significant effort savings in computer interfaces for those whose physical abilities are impaired.
A 'word' is an ordered sequence of elements (normally characters) picked from an alphabet of elements, bounded by word delimiters. A delimiter is normally an element from another set which is disjoint from the word alphabet set. A word prefix is a sub-sequence made up from zero or more of the first elements of a particular word. All the words together make up the vocabulary (the set of all known words), from which higher-order structures such as sentences are built, which in turn make up a 'text'.
As words are chosen to make up sentences it is possible to record their usage statistics and use these to develop predictions of what is most likely to follow. The prediction subset can be refined every time further prefix information becomes available, and can be presented to the user so that a selection can be made from the subset in order to complete the word. To develop the entire subset of the vocabulary which applies at a particular instant is trivial. In practice, the cardinality of these subsets is so large that it is neither possible nor desirable to present the user with all their members. Thus it is necessary to order the set and truncate it, so that the prediction list is assessable by the user and a member is selectable.
The algorithm presently to be described seeks to ensure that the set ordering is optimal and only the most likely words are offered to the user. The member words that are not displayed remain accessible through use of the conventional input method, normally the QWERTY keyboard of a computer terminal. However, every time another element is added to the word prefix, the subset can be further refined. The selection device for choosing the predictions can be any of the wide variety available; examples being function keys on a conventional computer terminal, a touch-sensitive screen or the use of a 'mouse'.
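The cycle described above (filter by prefix, order, truncate, present) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the vocabulary, its recency values, and all names here are illustrative.

```python
# Illustrative sketch of the predict cycle: filter the vocabulary by prefix,
# order the resulting subset, truncate it to p members for presentation.

def predict(vocabulary, prefix, p=5):
    """Return up to p candidate words sharing `prefix`, most recent first.

    `vocabulary` maps each word to its instantaneous distance d'_k
    (smaller means more recently used)."""
    subset = {w: d for w, d in vocabulary.items() if w.startswith(prefix)}
    ordered = sorted(subset, key=subset.get)  # ascending d'_k: most recent first
    return ordered[:p]                        # truncate to the p best predictions

vocab = {"the": 1, "their": 40, "there": 3, "then": 12, "therefore": 250,
         "thermal": 900, "word": 2}
print(predict(vocab, "the", p=3))  # → ['the', 'there', 'then']
```

Words outside the truncated list remain reachable by simply continuing to type, which refines the prefix and hence the subset.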
A word w belongs to a vocabulary V = {w1, w2, ..., wn} containing n words. A text is an ordered sequence of t words in which words may be repeated, e.g. w4 w1 w4 w7 ... The text is fully defined by an index set T = {Gk | Gk ⊆ I}, where I is the set of all numbers 1..t, k = 1..n, and the disjoint sets Gk contain the word positions for wk. The cardinality of Gk, |Gk|, corresponds to the number of times wk was used. Typically the text demonstrates a Zipf distribution, expressed as:

|Gk| / t = pk = μ / k

where

μ = 1 / ( ln(n) + γ )

with γ being Euler's constant (≈0.57721). Note that μ is a very slow function of n when n >> 100. A rank ordering over V is denoted by k, with w1 being the highest-frequency, lowest-ranking word.
All members wk of V have a corresponding average distance dk between their uses, given by:

dk = k / μ words

Thus, in order to capture all words up to some frequency rank threshold k', text samples of dk' words must be examined. A special case of this is the average distance between the occurrences of the rarest word, which must be the same as the text sample size. Thus a relationship exists between n and t:

t = n ( ln(n) + γ )

The Zipf distribution is an expression of the 'least cost' behaviour of humans when generating information-carrying material, when it is analysed a posteriori. It is essentially a static description and takes no account of the temporal behaviour of word selection; words fall into disuse and others increase in popularity as time proceeds and context changes.
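The relationships above can be checked numerically. The function names below are illustrative, and γ is taken as 0.57721 as in the text; the printed values reproduce the figures quoted later for the 1000-word implementation.

```python
import math

GAMMA = 0.57721  # Euler's constant, as used in the text


def mu(n):
    """Zipf constant: the fraction of the text taken by the rank-1 word."""
    return 1.0 / (math.log(n) + GAMMA)


def avg_distance(k, n):
    """Average distance d_k = k / mu between uses of the rank-k word."""
    return k / mu(n)


def sample_size(n):
    """Text size t = n(ln n + gamma) needed to see the rarest word once."""
    return n * (math.log(n) + GAMMA)


print(round(mu(1000), 4))        # → 0.1336
print(round(sample_size(1000)))  # → 7485
```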
The recency of a word can be expressed by its instantaneous word distance d'k, which is the span between the last position of wk and the last word in the text (the current word). If the spans are measured in words, the values of d'k are unique, lie in the range 1..t and are non-contiguous. The average of all the past d'k spans will tend towards dk.
Recency defines another ordering on V in terms of d'k; the word wx is said to be more recent than wy if d'x < d'y. Studies have shown that words in texts with Zipf-type distributions are found more quickly, on average, if they are examined in recency order. The 'move to front' strategy of self-organising lists and the caching principle are practical embodiments of this principle used in other computing domains. In the context of prediction presentation, recency ordering provides a better basis on which to perform the set truncation.
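The 'move to front' strategy mentioned above can be sketched as a self-organising list; the class name and interface are illustrative, not from the patent.

```python
# A minimal move-to-front list illustrating recency ordering: each use of a
# word moves it to the front, so the list is always in recency order.

class MoveToFrontList:
    def __init__(self):
        self.words = []  # index 0 = most recently used

    def use(self, word):
        if word in self.words:
            self.words.remove(word)
        self.words.insert(0, word)  # used word moves to the front

    def by_recency(self):
        return list(self.words)


mtf = MoveToFrontList()
for w in ["the", "cat", "sat", "the", "mat"]:
    mtf.use(w)
print(mtf.by_recency())  # → ['mat', 'the', 'sat', 'cat']
```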
The prediction subset V' is defined:

V' = {wk | wk ∈ V and all wk have a common prefix}

An ordering on V, and thus on V', enables the first few members of V' to be chosen to be presented to the user. These form the set of predictions P ⊆ V' such that |P| ≤ p, with p being the maximum number of shown predictions.
Recency ordering does not perform so well when the d'k are comparable with the dk, since the members of P tend to be the same as the last few words in the text. This situation is encountered when trying to produce predictions with a zero initial prefix, which yields a prediction subset equal to V. Here the historical ordering, based upon the frequency ordering of V, might provide a better basis for truncation. But, since the instantaneous distance d'k is of greater importance for dynamic adaptation, it may not be economic to maintain information relating to absolute word frequency as well. This frequency information might take the form of word usage counts or a running average of dk. A discrete approximation sufficient for identifying the p most frequent words may be all that is necessary.
Each word in the vocabulary V has associated with it (a) recency information and, optionally, (b) frequency information. Both forms of information enable an ordering over V to be defined, which may be used for developing the most suitable predictions P. Both need keeping up to date as the text being created is developed. Frequency information is relatively simple to update, since a word's frequency is affected only when that word is used. However, the recency information changes each time the text expands. If the number of words in the vocabulary, n, is large then updating the recency information can be impractical, and the following methods may be employed to reduce the processing effort.
(1) The distance unit can be increased. In order that the d'k remain unique, the unit can be increased up to the average distance of the most frequent word, d1. Assuming a Zipf distribution, the formula given above gives a minimum dk of about 10 words.
Thus the recency information can be kept in units of 10 words and thus will only need updating after every tenth word. A further advantage of using a greater span unit is economy in the storage of the recency number.
(2) If a non-unique recency value is acceptable, then the span unit can be increased much further. Indeed, the span can be such as to ensure that the p most regularly used words will always retain the highest recency. A problem with non-unique recency values is that, failing any further information, the truncation of V' to P can be somewhat arbitrary. However, if frequency information is available it can be used in conjunction with the recency information to resolve ties and produce a better P.
(3) The current value of the text size, t (in span units), can be written into the recency information field when the word is used. This can be done at the same time as the frequency information is updated. To derive d'k for the recency ordering of V', this stored value is subtracted from the current t whenever it is needed. This avoids having to run explicitly through the entire vocabulary, updating every recency value, after each span is complete.
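Method (3) can be sketched as follows, assuming a span-unit counter t and an illustrative per-word field holding the value of t at last use; the names are not from the patent.

```python
# Sketch of method (3): store the text position (in span units) at which each
# word was last used, and derive d'_k lazily by subtraction. No per-word
# update is needed as the text grows.

t = 0          # current text size, in span units
last_used = {}  # word -> value of t when the word was last used


def record_use(word):
    """Update recency (and, in a full system, frequency) on use."""
    last_used[word] = t


def instantaneous_distance(word):
    """d'_k derived on demand by subtracting the stored field from t."""
    return t - last_used[word]


record_use("hello")
t += 3          # three spans of text go by
record_use("world")
print(instantaneous_distance("hello"))  # → 3
print(instantaneous_distance("world"))  # → 0
```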
The algorithm will now be presented by reference to an actual implementation embodied in a small, portable typing aid for those with motion disabilities. The microcomputer-based aid has provision for a vocabulary of up to 1000 words and the ability to display and use up to 5 predictions. The vocabulary is 'complete' in that new words are captured and added when they occur for the first time. With n = 1000, μ = 1/7.485 = 0.1336 and t = 7485 words by the above formulae. For efficiency reasons explicit updating of the recency values is performed. The average distance for the pth-rank word is d5 = 37.42 words, and thus the span between updates should well exceed this. A convenient span is the 150-200 words of a typical paragraph, which provides a natural break in the text when updating can be performed. The frequency information provides the additional information for fixing the exact truncation position in V'. The cardinality of the recency information should allow the current 'working' vocabulary to be regarded as recent. If 25% of the vocabulary is to be so treated, then dk for the 250th word is 1871 words, or about 8-12 spans.
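As a check, the figures quoted for this implementation follow directly from the formulae above; the variable names below are illustrative.

```python
import math

# Design numbers for the n = 1000 word, p = 5 prediction typing aid.
n, p = 1000, 5
mu = 1.0 / (math.log(n) + 0.57721)  # ≈ 0.1336
t = n / mu                          # expected text size ≈ 7485 words
d_p = p / mu                        # distance for the rank-5 word ≈ 37.4 words
d_250 = 250 / mu                    # rank-250 word: ≈ 1871 words
print(round(t), round(d_p, 1), round(d_250))  # → 7485 37.4 1871

spans = d_250 / 175                 # with a ~175-word paragraph span
print(round(spans, 1))              # → 10.7, within the 8-12 spans quoted
```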
The frequency information is simply a word incidence count. Its size should be estimated from the formulae, allowing for use well past the expected text size t. The highest expected count for the 1000-word vocabulary is μt = 1000.
The ordering is performed by simple numeric comparison on the concatenation of recency and frequency, treating the whole as an unsigned binary-coded number. Recency is represented using one's complement so that the ordering is compatible with that of the word count, which is represented conventionally. In the specific example cited, 15 bits are used to generate the ordering, identified b0 to b14, with b0 having the lowest and b14 the highest binary weighting. The highest-weighted 3 bits, b12 to b14, are used for the recency representation and the remaining 12 bits, b0 to b11, are used to accommodate the count. The exact split of the available bits determines when the transition between recency and word-count ordering occurs; minimum recency is represented by all zeros, thus leaving the word count with its true weighting.
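A minimal sketch of this packed comparison key, assuming a 3-bit recency field and a 12-bit count as in the cited example; the function name and clamping behaviour are illustrative, not taken from the patent.

```python
# 15-bit ordering key: 3 high bits hold the recency (one's-complemented
# instantaneous distance, so a smaller d' sorts higher), 12 low bits hold
# the usage count. Plain integer comparison then orders the words.

RECENCY_BITS, COUNT_BITS = 3, 12


def ordering_key(d_prime, count):
    """Pack recency and frequency into one unsigned comparison key."""
    d_prime = min(d_prime, (1 << RECENCY_BITS) - 1)  # clamp to 3 bits
    recency = ((1 << RECENCY_BITS) - 1) - d_prime    # one's complement: recent = large
    count = min(count, (1 << COUNT_BITS) - 1)        # clamp to 12 bits
    return (recency << COUNT_BITS) | count


# A recently used word outranks a frequent but stale one ...
assert ordering_key(d_prime=0, count=3) > ordering_key(d_prime=7, count=4000)
# ... while equally recent words fall back to their counts (true weighting).
assert ordering_key(d_prime=2, count=10) > ordering_key(d_prime=2, count=9)
print("ordering checks pass")
```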
Other systems seeking to employ the ordering algorithm herein described should use the formulae and guidelines to produce an implementation to suit the specific circumstances.

Claims (13)

GB08626367A | 1986-11-04 | 1986-11-04 | Data entry using dynamic word predication algorithm | Pending | GB2197097A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
GB08626367A | GB2197097A (en) | 1986-11-04 | 1986-11-04 | Data entry using dynamic word predication algorithm


Publications (2)

Publication Number | Publication Date
GB8626367D0 (en) | 1986-12-31
GB2197097A (en) | 1988-05-11

Family

ID=10606795

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
GB08626367A | Pending | GB2197097A (en) | 1986-11-04 | 1986-11-04 | Data entry using dynamic word predication algorithm

Country Status (1)

Country | Link
GB (1) | GB2197097A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
EP1054329A3 (en) * | 1999-05-19 | 2002-04-24 | Sun Microsystems, Inc. | Methods and apparatus for processing a document
WO2004111871A1 (en) * | 2003-06-18 | 2004-12-23 | Zi Corporation | Configurable information identification system and method
WO2006024157A1 (en) * | 2004-08-31 | 2006-03-09 | Research In Motion Limited | Handheld electronic device with text disambiguation
WO2007040872A1 (en) * | 2005-09-30 | 2007-04-12 | Motorola Inc. | Method of predictive text input in which phrases are input in abbreviated form
US7475004B2 | 2004-08-31 | 2009-01-06 | Research In Motion Limited | Handheld electronic device with text disambiguation
WO2009090262A3 (en) * | 2008-01-18 | 2010-03-25 | Ugochukwu Akuwudike | Method and system for storing and retrieving characters, words and phrases

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US4438505A (en) * | 1979-02-09 | 1984-03-20 | Sharp Kabushiki Kaisha | Electronic dictionary and language interpreter with auto-search key for deriving a full-length word and its associated translation word based on a partial word entered
EP0172357A2 (en) * | 1984-06-28 | 1986-02-26 | Yoshiro Akiyama | Display type dictionary apparatus


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
EP1054329A3 (en) * | 1999-05-19 | 2002-04-24 | Sun Microsystems, Inc. | Methods and apparatus for processing a document
US8364706B2 | 2003-06-18 | 2013-01-29 | Zi Corporation Of Canada, Inc. | System and method for information identification
WO2004111871A1 (en) * | 2003-06-18 | 2004-12-23 | Zi Corporation | Configurable information identification system and method
US8489383B2 | 2004-08-31 | 2013-07-16 | Research In Motion Limited | Text disambiguation in a handheld electronic device with capital and lower case letters of prefix objects
GB2434237A (en) * | 2004-08-31 | 2007-07-18 | Research In Motion Ltd | Handheld electronic device with text disambiguation
US7475004B2 | 2004-08-31 | 2009-01-06 | Research In Motion Limited | Handheld electronic device with text disambiguation
GB2434237B (en) * | 2004-08-31 | 2010-01-06 | 2012244 Ontario Inc | Handheld electronic device with text disambiguation
WO2006024157A1 (en) * | 2004-08-31 | 2006-03-09 | Research In Motion Limited | Handheld electronic device with text disambiguation
US8768685B2 | 2004-08-31 | 2014-07-01 | Blackberry Limited | Handheld electronic device with text disambiguation
US9588596B2 | 2004-08-31 | 2017-03-07 | Blackberry Limited | Handheld electronic device with text disambiguation
WO2007040872A1 (en) * | 2005-09-30 | 2007-04-12 | Motorola Inc. | Method of predictive text input in which phrases are input in abbreviated form
WO2009090262A3 (en) * | 2008-01-18 | 2010-03-25 | Ugochukwu Akuwudike | Method and system for storing and retrieving characters, words and phrases
US8903718B2 | 2008-01-18 | 2014-12-02 | Ugochukwu Akuwudike | Method and system for storing and retrieving characters, words and phrases

Also Published As

Publication number | Publication date
GB8626367D0 (en) | 1986-12-31

Similar Documents

Publication | Title
KR100604039B1 (en) | Dynamic database reordering process and reordering device
US5485609A (en) | Online background predictors and prefetchers for locality management
JP5147712B2 (en) | System, computer program and method for forming data input
US7765214B2 (en) | Enhancing query performance of search engines using lexical affinities
CN102893239A (en) | System and method for inputting text into electronic devices
US20100211381A1 (en) | System and Method of Creating and Using Compact Linguistic Data
US10303685B2 (en) | Data table performance optimization
RU2006133906A (en) | Interfaces "man-computer"
EP0588921A4 (en) | Data compression using multiple levels
KR20090007343A (en) | Alphanumeric data entry device and method using multi-character keys on a keypad
GB2197097A (en) | Data entry using dynamic word predication algorithm
CN109542248A (en) | A control method and control device for incremental dictionary data updates
CN105808636B (en) | Hypertext link pushing system based on APP information data
Ktistakis et al. | Succinct BWT-based sequence prediction
US6223174B1 (en) | Method and apparatus for performing radix lookups using valid bit tables with pointers
WO2004006122A2 (en) | System and method of creating and using compact linguistic data
US6185570B1 (en) | Method and apparatus for performing radix lookups using transition bits and fields in transition tables
JPH0683812A (en) | Kana/kanji conversion device for a document input device
Brixtel | Maximal repeats enhance substring-based authorship attribution
CN1191702C | Chinese character input method for a simplified keyboard
Aljehane | Grammar-based preprocessing for PPM compression and classification
Wisniewski | Compression of index term dictionary in an inverted-file-oriented database: some effective algorithms
Tapsai et al. | TLS-ART-MC, a new algorithm for Thai word segmentation
JPWO2004081821A1 | Article data search server, article data search method, and article data search program
Abate Tefera et al. | Compressed Amharic Text: a Prediction by Partial Match context-modeling algorithm
