CROSS-REFERENCE TO RELATED APPLICATION
This application is a continuation of, and claims priority to, co-pending, commonly owned, U.S. patent application Ser. No. 12/142,622 filed Jun. 19, 2008, and entitled, “GENERATION AND USE OF AN EMAIL FREQUENT WORD LIST,” which is herein incorporated by reference in its entirety.
BACKGROUND
In recent years, electronic mail (“email”) has become one of the most important forms of communication for various personal and business uses. The growth of email communications has been spurred, at least in part, by the increasing number of devices capable of remotely accessing email. For example, many mobile devices, such as cellular phones, smartphones, and personal digital assistants (“PDAs”), are now capable of remotely and wirelessly accessing email through various pull-based and push-based email access protocols.
A typical user's mailbox may contain hundreds or thousands of emails on a wide variety of topics ranging from the user's plans for lunch at her favorite cafe to the user's input regarding her workgroup's latest business project. A user's emails may also be utilized to infer information about the user. For example, a higher frequency of emails to certain people may indicate that the user has a closer relationship with those people. As a result, a user's mailbox can be a valuable source of relevant information about the user, especially for application programs that can utilize or benefit from such information.
It is with respect to these considerations and others that the disclosure made herein is presented.
SUMMARY
Technologies are described herein for generating, organizing, storing, and utilizing a frequent word list associated with a user's mailbox. In particular, through the utilization of the technologies and concepts presented herein, an application program interface (“API”) is described that is adapted to generate a frequent word list based on email messages contained in a user's mailbox and to respond to requests from external application programs requesting the frequent word list. The frequent word list may include a mapping of words to a frequency of use for each of the words.
In one method, an index scan is performed on catalogs to retrieve search data that maps words to emails containing the words. The search data is provided across multiple mailboxes. A universal frequent word list is generated based on the search data. The mailbox specific frequent word list is generated based on the universal frequent word list.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended that this Summary be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram showing an email server configured to generate a mailbox specific frequent word list, in accordance with one embodiment;
FIG. 2 is a block diagram showing a process flow for generating the mailbox specific frequent word list, in accordance with one embodiment;
FIG. 3 is a flow diagram showing an illustrative method for generating the mailbox specific frequent word list, in accordance with one embodiment; and
FIG. 4 is a computer architecture diagram showing aspects of an illustrative computer hardware architecture for a computing system capable of implementing aspects of the embodiments presented herein.
DETAILED DESCRIPTION
The following detailed description is directed to technologies for generating, organizing, storing, and using a frequent word list associated with a user's mailbox. An API is described herein that is adapted to generate a frequent word list based on emails contained in a user's mailbox. This frequent word list is referred to herein as a mailbox specific frequent word list because it contains only words associated with the user's mailbox. The API may further be adapted to respond to requests from application programs or other services requesting the mailbox specific frequent word list.
While the subject matter described herein is presented in the general context of program modules that execute in conjunction with the execution of an operating system and application programs on a computer system, those skilled in the art will recognize that other implementations may be performed in combination with other types of program modules. Generally, program modules include routines, programs, components, data structures, and other types of structures that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the subject matter described herein may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.
In the following detailed description, references are made to the accompanying drawings that form a part hereof, and which are shown by way of illustration specific embodiments or examples. Referring now to the drawings, in which like numerals represent like elements through the several figures, aspects of a computing system and methodology for generating, organizing and storing a frequent word list for a given mailbox will be described.
FIG. 1 shows an illustrative email server 100 in which multiple caller applications 102A-102B (collectively referred to as caller applications 102) request a frequent word list for a specific mailbox associated with a given user. An example of a frequent word list for a specific mailbox is a mailbox specific frequent word list 104. For the sake of simplicity, in the example shown in FIG. 1, each of the caller applications 102 requests the mailbox specific frequent word list 104. However, it should be appreciated that the caller applications 102 may each request frequent word lists for other mailboxes. It should further be appreciated that other types of application programs and/or services may request the mailbox specific frequent word list 104 and the like.
According to embodiments, the mailbox specific frequent word list 104 includes a list of frequent words found in a user's mailbox and a corresponding frequency associated with each of the words. The list of frequent words may be sorted in order of frequency. For example, the most frequent words may be shown at the top of the mailbox specific frequent word list 104, and the remaining words may be shown in a descending order of frequency. The frequency may be specified as a raw frequency (e.g., the absolute number of email messages that include a word) or a percentage/ratio (e.g., the number of email messages that include a word in relation to the total number of messages across the user's mailbox).
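By way of illustration only, the following Python sketch shows one way a sorted list with both a raw frequency and a ratio could be derived from the message bodies of a single mailbox. The function name `frequent_words`, the whitespace tokenization, and the sample data are assumptions made for this example and are not part of the disclosure.

```python
from collections import Counter

def frequent_words(emails, top_n=None):
    """Return (word, raw_count, ratio) tuples sorted by descending frequency.

    raw_count is the number of emails containing the word; ratio is
    raw_count divided by the total number of emails in the mailbox.
    """
    counts = Counter()
    for body in emails:
        # Count each word at most once per email (document frequency).
        counts.update(set(body.lower().split()))
    total = len(emails)
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    rows = [(word, n, n / total) for word, n in ranked]
    return rows if top_n is None else rows[:top_n]

# Hypothetical mailbox containing three short messages.
mailbox = ["lunch at cafe", "project update", "lunch project plan"]
top = frequent_words(mailbox, top_n=2)
```

In this sketch, words with equal counts are ordered alphabetically, which is one possible tie-breaking choice among many.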
The mailbox specific frequent word list 104 may be formatted in Extensible Markup Language (“XML”) or other suitable representation. An example of an XML data structure for an entry in the mailbox specific frequent word list 104 is shown below.
<TopNWord=“______” Frequency=“______”> </TopNWord>
The “TopNWord” tag specifies a word found in a user's mailbox. The “Frequency” property specifies the frequency with which the word is found in the user's mailbox. It should be appreciated that other forms for representing entries in the mailbox specific frequent word list 104 may be contemplated by those skilled in the art.
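As one possible reading of the entry format described above, the following Python sketch serializes word and frequency pairs to well-formed XML using the standard library. The enclosing “TopNWords” element and the “Word” attribute name are assumptions introduced for this example; only the “TopNWord” tag and the “Frequency” property come from the description.

```python
import xml.etree.ElementTree as ET

def to_xml(word_frequencies):
    """Serialize (word, frequency) pairs as TopNWord elements."""
    root = ET.Element("TopNWords")  # hypothetical container element
    for word, frequency in word_frequencies:
        # Attribute values must be strings, so the count is converted.
        ET.SubElement(root, "TopNWord", {"Word": word, "Frequency": str(frequency)})
    return ET.tostring(root, encoding="unicode")

xml_text = to_xml([("apple", 5), ("bear", 3)])

# Round trip: parse the text back into (word, frequency) pairs.
parsed = [(e.get("Word"), int(e.get("Frequency"))) for e in ET.fromstring(xml_text)]
```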
As shown in FIG. 1, the email server 100 includes a variety of application programs, such as an advertising application 108A, a voice transcription application 108B, and an organization application 108C (collectively referred to as applications 108). The advertising application 108A includes a first caller application 102A, which is adapted to transmit a request for the mailbox specific frequent word list 104 to a search API 112. The voice transcription application 108B includes a second caller application 102B, which is adapted to transmit a request for the mailbox specific frequent word list 104 to the search API 112. The organization application 108C includes a third caller application 102, which is adapted to transmit a request for the mailbox specific frequent word list 104 to the search API 112.
In one embodiment, the advertising application 108A may tailor advertisements to a user based on the contents of a mailbox specific frequent word list 104 associated with the user. For example, the mailbox specific frequent word list 104 may include a high frequency of baby-related words, such as “crib,” “diapers,” and “stroller.” As a result, the advertising application 108A may recognize these baby-related words and tailor advertisements to the user in accordance with baby-related products and services. For example, tailored advertisements may be displayed to the user within an ad-supported web application, such as a hosted email application.
In another embodiment, the voice transcription application 108B may supplement a transcription dictionary 114 with proper nouns, slang, abbreviations, and other colloquial terminology found in the mailbox specific frequent word list 104. Voice transcription applications are increasingly included in email application programs, especially in unified messaging application programs, whereby a voicemail or other audio message is transcribed into text so that a user can “read” the voicemail. In an exemplary implementation, the voice transcription application 108B may receive an audio sequence of speech and then phonetically map the audio sequence to one or more words in the transcription dictionary 114. This implementation may be adequate when the audio sequence corresponds to words in the transcription dictionary 114. However, problems can occur when the audio sequence corresponds to words not found in the transcription dictionary 114.
In an example, an audio sequence may include the name “Gautam,” which is a name that is common in some non-U.S. countries. An American implementation of the transcription dictionary 114 may not include proper nouns or foreign names, such as Gautam. As a result, the voice transcription application 108B may incorrectly transcribe the audio representation of Gautam as “Gotham,” “got him,” or “got them.” Alternatively, the voice transcription application 108B may indicate that it does not recognize the word by providing an error message.
The mailbox specific frequent word list 104 may indicate that the name Gautam is frequently used in the user's emails. As such, the voice transcription application 108B may add Gautam to the transcription dictionary 114. In one embodiment, the voice transcription application 108B may place a greater weight on words, such as Gautam, that are frequently included in the user's emails over similarly sounding counterparts, such as Gotham, that are not frequently included in the user's emails. By supplementing the transcription dictionary 114 with colloquial words associated with a user, the accuracy of the voice transcription application 108B can be significantly improved. In particular, the transcription dictionary 114 can be effectively customized for a given user by adding words from the user's own real-world vocabulary found in the mailbox specific frequent word list 104.
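The weighting described above may be sketched, under simplifying assumptions, as follows. The function `pick_transcription`, the optional acoustic scores, and the add-one smoothing are all hypothetical choices made for this example; the description does not specify how the weight is computed.

```python
def pick_transcription(candidates, mailbox_frequencies, base_scores=None):
    """Choose among similar-sounding candidates, favoring mailbox words.

    candidates: words a recognizer considers acoustically plausible.
    mailbox_frequencies: lowercase word -> number of emails containing it.
    base_scores: optional word -> acoustic score (hypothetical); default 1.0.
    """
    base_scores = base_scores or {}

    def weight(word):
        acoustic = base_scores.get(word, 1.0)
        # Add-one smoothing keeps a nonzero weight for words absent
        # from the mailbox, so they can still be chosen.
        return acoustic * (1 + mailbox_frequencies.get(word.lower(), 0))

    return max(candidates, key=weight)

# Hypothetical frequencies taken from a mailbox specific frequent word list.
frequencies = {"gautam": 42, "gotham": 0}
choice = pick_transcription(["Gotham", "Gautam", "got him"], frequencies)
```

With equal acoustic scores, the candidate that appears most often in the user's emails is selected, which mirrors the Gautam and Gotham example above.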
In yet another embodiment, the organization application 108C may generate email tags based on frequently used words found in the mailbox specific frequent word list 104. As used herein, an email tag refers to a word that is associated with emails. The email tags essentially serve as reference markers, enabling users to quickly identify, browse, and search for classes of emails as specified by the email tags. By restricting email tags to the most frequently used words, more relevant email tags can be provided for various automatic and manual tagging applications.
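As a minimal sketch of automatic tagging restricted to frequent words, the following Python example suggests tags for a single email. The function `suggest_tags`, the `max_tags` parameter, and the sample frequencies are assumptions made for illustration.

```python
def suggest_tags(email_text, frequent_words, max_tags=3):
    """Return up to max_tags frequent words found in the email, most frequent first."""
    words_in_email = set(email_text.lower().split())
    matches = [(w, f) for w, f in frequent_words.items() if w in words_in_email]
    # Highest frequency first; alphabetical order breaks ties.
    matches.sort(key=lambda m: (-m[1], m[0]))
    return [w for w, _ in matches[:max_tags]]

# Hypothetical mailbox specific frequent word list (word -> frequency).
frequent = {"budget": 40, "project": 55, "lunch": 12}
tags = suggest_tags("Project budget review before lunch on Friday", frequent, max_tags=2)
```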
It should be appreciated that the applications 108 described herein are merely exemplary. Other applications that can utilize or benefit from the data provided in the mailbox specific frequent word list 104 may be contemplated by those skilled in the art. It should further be appreciated that the applications 108 may be external applications executed on other computers. For example, the advertising application 108A may be an external application that is capable of communicating with the email server 100 through a network (not shown).
As shown in FIG. 1, the email server 100 further includes a plurality of catalogs 116 and a universal frequent word list 118. As described in greater detail below with respect to FIG. 2, the search API 112 is adapted to search the catalogs 116 for frequent words across multiple mailboxes. Upon receiving the frequent words from the catalogs 116, the search API 112 may generate the universal frequent word list 118. The universal frequent word list 118 may contain a list of frequent words across multiple mailboxes and a frequency associated with each of the words. The search API 112 may utilize the universal frequent word list 118 to generate mailbox specific frequent word lists, such as the mailbox specific frequent word list 104, as requested by the applications 108.
Referring now to FIG. 2, additional details will be provided regarding the operation of the search API 112. In particular, FIG. 2 shows an illustrative process flow 200 for generating the mailbox specific frequent word list 104. The process flow 200 begins at 202, where the caller application 102 transmits to the search API 112 a request for a mailbox specific frequent word list, such as the mailbox specific frequent word list 104, associated with a given user. In one embodiment, the request may specify, among other things, the number of entries included in the mailbox specific frequent word list 104, the minimum/maximum frequency of the entries included in the mailbox specific frequent word list 104, and the minimum/maximum age of the entries included in the mailbox specific frequent word list 104.
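The request parameters described above may be modeled, purely as an illustrative sketch, with a small Python data class. The class name `WordListRequest`, its field names, and the helper `apply_request` are hypothetical and are not defined by the description.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WordListRequest:
    mailbox_id: str
    max_entries: Optional[int] = None        # number of entries to return
    min_frequency: Optional[int] = None      # lower bound on frequency
    max_frequency: Optional[int] = None      # upper bound on frequency
    max_age_seconds: Optional[float] = None  # staleness tolerance; not used below

def apply_request(request, word_frequencies):
    """Filter and truncate a {word: frequency} mapping per the request."""
    rows = [
        (word, freq)
        for word, freq in word_frequencies.items()
        if (request.min_frequency is None or freq >= request.min_frequency)
        and (request.max_frequency is None or freq <= request.max_frequency)
    ]
    # Most frequent first; alphabetical order breaks ties.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows if request.max_entries is None else rows[: request.max_entries]

request = WordListRequest(mailbox_id="user-1", max_entries=2, min_frequency=2)
result = apply_request(request, {"apple": 5, "bear": 3, "cafe": 1, "deal": 4})
```

The age bound is carried on the request so that the server can decide whether its cached data is fresh enough; it does not affect the filtering step shown here.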
The process flow 200 proceeds to 204, where upon receiving the request for the mailbox specific frequent word list 104, the search API 112 performs an index scan on the catalogs 116. The catalogs 116 may include search data 206, which contains an inverted index data structure mapping words to the emails that contain the words. The emails may be identified by a document identifier. For example, an illustrative entry in the catalogs 116 may include the following:
    “apple”: {0, 1, 3, 6, 9}
    “bear”: {2, 3, 5}
The conventional purpose of the inverted index data structure is to enable fast searching of emails. For example, if a user wants to find all documents that include the word apple, a search engine can access the inverted index data structure to quickly determine that emails corresponding to each of the document identifiers {0, 1, 3, 6, 9} include the word “apple.” In one embodiment, the catalogs 116 are created and maintained by the email server 100. For example, the EXCHANGE SERVER 2007 email server from MICROSOFT CORPORATION maintains global catalogs containing a variety of searchable data across multiple domains.
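For readers unfamiliar with inverted indexes, the following Python sketch builds one from a handful of documents and performs a lookup. The function `build_inverted_index` and the sample documents are assumptions for illustration and do not reflect how any particular email server maintains its catalogs.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the sorted list of document identifiers containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical emails keyed by document identifier.
documents = {0: "apple pie", 1: "green apple", 2: "brown bear", 3: "bear eats apple"}
index = build_inverted_index(documents)

# A search for "apple" is a single dictionary lookup.
apple_hits = index.get("apple", [])
```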
The process flow 200 proceeds to 208, where the search API 112 receives the search data 206 in response to performing the index scan. Once the search API 112 receives the search data 206, the process flow 200 proceeds to 210, where the search API 112 generates the universal frequent word list 118 based on the search data 206. In one embodiment, the search API 112 generates the universal frequent word list 118 by counting the number of document identifiers associated with each of the words in the search data 206. In the example shown above, the word “apple” is included in five emails, while the word “bear” is included in three emails. As such, “apple” has a frequency of five, and “bear” has a frequency of three.
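The counting step can be expressed in a few lines of Python. The dictionary layout of `search_data` and the function name `universal_frequent_word_list` are assumptions for this sketch; the values reproduce the “apple” and “bear” example.

```python
# Inverted index from the example: word -> set of document identifiers.
search_data = {"apple": {0, 1, 3, 6, 9}, "bear": {2, 3, 5}}

def universal_frequent_word_list(search_data):
    """Frequency of a word = number of document identifiers listed for it."""
    return {word: len(doc_ids) for word, doc_ids in search_data.items()}

universal = universal_frequent_word_list(search_data)
```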
The process flow 200 proceeds to 212, where the search API 112 creates the mailbox specific frequent word list 104 based on the universal frequent word list 118. The universal frequent word list 118 includes words and associated frequencies across multiple mailboxes. As such, the search API 112 may filter the universal frequent word list 118 for only words contained in emails associated with a specific mailbox. In one embodiment, the email server 100 maintains a mapping for each mailbox and its corresponding emails. This mapping may be used by the search API 112 to filter the universal frequent word list 118. The process flow 200 then proceeds to 214, where the search API 112 provides the mailbox specific frequent word list 104 to the caller application 102.
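One way the filtering step might work is sketched below in Python. The sketch assumes that a word is kept when at least one of its document identifiers belongs to the mailbox, and that the frequency reported is the cross-mailbox frequency from the universal list; recounting within the mailbox would be an alternative that the description does not rule out. All names and data are hypothetical.

```python
def mailbox_word_list(universal, search_data, mailbox_emails):
    """Keep universal-list entries whose word occurs in one of the mailbox's emails."""
    rows = [
        (word, frequency)
        for word, frequency in universal.items()
        if search_data[word] & mailbox_emails  # non-empty intersection
    ]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows

search_data = {"apple": {0, 1, 3, 6, 9}, "bear": {2, 3, 5}, "cafe": {7}}
universal = {word: len(ids) for word, ids in search_data.items()}

# Hypothetical server-side mapping of each mailbox to its emails.
mailbox_to_emails = {"alice": {0, 1, 2, 3}, "bob": {5, 6, 7, 9}}
alice_list = mailbox_word_list(universal, search_data, mailbox_to_emails["alice"])
```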
The mailbox specific frequent word list 104 may be formatted in XML or other suitable representation. Although not so limited, the mailbox specific frequent word list 104 may be stored as a folder associated item (“FAI”) and compressed using suitable compression technology. In one embodiment, the mailbox specific frequent word list 104 may be represented by a data structure specifying a particular mailbox, which is identified by a mailbox identifier. An exemplary code representation of the mailbox specific frequent word list 104, which is denoted as “TopNWords,” is shown below.
    using System;

    /// <summary>
    /// TopNWords represents the most frequent words
    /// occurring in a mailbox. This data may be
    /// used for voice mail transcription and other
    /// applications.
    /// </summary>
    internal sealed class TopNWords
    {
        /// <summary>
        /// Constructor
        /// </summary>
        /// <param name="mailboxGuid">Identifier of the mailbox this list belongs to.</param>
        internal TopNWords(Guid mailboxGuid)
        {
        }
    }
As shown above, the mailbox identifier, “mailboxGuid,” associates the mailbox specific frequent word list 104 with a particular mailbox.
Further, the mailbox specific frequent word list 104 may include a data structure containing words and a frequency associated with each of the words. An exemplary code representation of this data structure, which is denoted as “WordFrequency,” is shown below.
    /// <summary>
    /// Encapsulates a word and its frequency
    /// </summary>
    internal struct WordFrequency
    {
        /// <summary>
        /// The keyword
        /// </summary>
        internal string Word;

        /// <summary>
        /// Number of documents the keyword
        /// occurs in.
        /// </summary>
        internal int Frequency;
    }
As shown above, the data structure “WordFrequency” includes a “Word” and an associated “Frequency.”
Turning now to FIG. 3, additional details will be provided regarding the operation of the search API 112. In particular, FIG. 3 is a flow diagram illustrating aspects of one method provided herein for generating the mailbox specific frequent word list 104. In one embodiment, the search API 112 includes a plurality of objects or other entities capable of performing one or more of the operations described below.
It should be appreciated that the logical operations described herein are implemented (1) as a sequence of computer implemented acts or program modules running on a computing system and/or (2) as interconnected machine logic circuits or circuit modules within the computing system. The implementation is a matter of choice dependent on the performance and other requirements of the computing system. Accordingly, the logical operations described herein are referred to variously as states, operations, structural devices, acts, or modules. These operations, structural devices, acts, and modules may be implemented in software, in firmware, in special purpose digital logic, and any combination thereof. It should be appreciated that more or fewer operations may be performed than shown in the figures and described herein. These operations may also be performed in a different order than those described herein.
Referring to FIG. 3, a routine 300 begins at operation 302, where the search API 112 receives, from one of the caller applications 102, a request for a mailbox specific frequent word list, such as the mailbox specific frequent word list 104, for a given mailbox. The request may also specify, among other things, the number of entries included in the mailbox specific frequent word list 104, the minimum/maximum frequency of the entries included in the mailbox specific frequent word list 104, and the minimum/maximum age of the entries included in the mailbox specific frequent word list 104. Upon receiving the request for the mailbox specific frequent word list 104, the routine 300 proceeds to operation 304.
At operation 304, the search API 112 determines whether the universal frequent word list 118 has been created. If the universal frequent word list 118 has not been created, then the routine 300 proceeds to operation 306, where the search API 112 performs an index scan on the catalogs 116 to retrieve the search data 206. In one embodiment, the search data 206 includes an inverted index data structure mapping words to the email identifiers corresponding to emails containing the words. Upon retrieving the search data 206, the routine 300 proceeds to operation 308, where the search API 112 generates the universal frequent word list 118 based on the search data 206. In one embodiment, the universal frequent word list 118 includes a mapping of the words to a frequency associated with each of the words across multiple mailboxes. The frequency may be determined by counting the number of email identifiers corresponding to each of the words. Upon generating the universal frequent word list 118, the routine 300 proceeds to operation 312.
If the universal frequent word list 118 has been created, then the routine 300 proceeds to operation 310, where the search API 112 determines whether the universal frequent word list 118 is current. As previously described, the request transmitted by the caller applications 102 may specify the minimum or maximum age of the entries in the mailbox specific frequent word list 104. If the universal frequent word list 118 is not current, then the routine 300 proceeds to operation 306, where the search API 112 performs an index scan on the catalogs 116 to retrieve the search data 206, and to operation 308, where the search API 112 updates the universal frequent word list 118 based on the search data 206. Upon updating the universal frequent word list 118, the routine 300 proceeds to operation 312.
If the universal frequent word list 118 is current, then the routine 300 proceeds to operation 312, where the search API 112 generates the mailbox specific frequent word list 104 based on the universal frequent word list 118. In one embodiment, the search API 112 filters the universal frequent word list 118 to retain only the words and corresponding frequencies that are associated with the given mailbox. The filtered words and corresponding frequencies then form the mailbox specific frequent word list 104, which may be sorted according to the frequencies. Upon generating the mailbox specific frequent word list 104, the routine 300 proceeds to operation 314, where the search API 112 transmits the mailbox specific frequent word list 104 to the caller applications 102 in response to their request.
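The control flow of the routine 300 may be summarized by the following Python sketch, which caches the universal list and rebuilds it only when it is missing or stale. The class `SearchApi`, the `scan_catalogs` callable, the injected clock, and the use of a maximum age in seconds as the test for “current” are all assumptions made for this example. As a further assumption, the frequencies returned are those of the universal list.

```python
import time

class SearchApi:
    def __init__(self, scan_catalogs, mailbox_to_emails, clock=time.monotonic):
        self._scan_catalogs = scan_catalogs          # hypothetical: returns {word: set(doc_ids)}
        self._mailbox_to_emails = mailbox_to_emails  # mailbox id -> set(doc_ids)
        self._clock = clock
        self._search_data = None
        self._universal = None
        self._built_at = None
        self.scan_count = 0                          # number of index scans performed

    def _refresh(self):
        # Operations 306 and 308: index scan, then count identifiers per word.
        self._search_data = self._scan_catalogs()
        self._universal = {w: len(ids) for w, ids in self._search_data.items()}
        self._built_at = self._clock()
        self.scan_count += 1

    def get_word_list(self, mailbox_id, max_age_seconds=float("inf")):
        # Operation 304: has the universal list been created?
        if self._universal is None:
            self._refresh()
        # Operation 310: is the list current enough for this request?
        elif self._clock() - self._built_at > max_age_seconds:
            self._refresh()
        # Operation 312: keep only words appearing in this mailbox's emails.
        emails = self._mailbox_to_emails.get(mailbox_id, set())
        rows = [(w, self._universal[w])
                for w, ids in self._search_data.items() if ids & emails]
        rows.sort(key=lambda row: (-row[1], row[0]))
        return rows  # operation 314: returned to the caller

catalog = {"apple": {0, 1, 3, 6, 9}, "bear": {2, 3, 5}}
now = [0.0]  # fake clock so the example is deterministic
api = SearchApi(lambda: catalog, {"alice": {0, 1}, "bob": {2, 5}}, clock=lambda: now[0])
first = api.get_word_list("alice")                     # triggers the initial scan
now[0] = 120.0
second = api.get_word_list("bob", max_age_seconds=60)  # list is stale, so rescan
```

Injecting the clock keeps the staleness check testable without waiting for real time to pass.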
Referring now to FIG. 4, an exemplary computer architecture diagram showing aspects of a computer 400 is illustrated. An example of the computer 400 is the email server 100. The computer 400 includes a processing unit 402 (“CPU”), a system memory 404, and a system bus 406 that couples the memory 404 to the CPU 402. The computer 400 further includes a mass storage device 412 for storing one or more program modules 414 and one or more databases 416. Examples of the program modules 414 may include the search API 112 and the applications 108. Examples of the databases 416 may include the catalogs 116, the universal frequent word list 118, the mailbox specific frequent word list 104, and the transcription dictionary 114. The mass storage device 412 is connected to the CPU 402 through a mass storage controller (not shown) connected to the bus 406. The mass storage device 412 and its associated computer-readable media provide non-volatile storage for the computer 400. Although the description of computer-readable media contained herein refers to a mass storage device, such as a hard disk or CD-ROM drive, it should be appreciated by those skilled in the art that computer-readable media can be any available computer storage media that can be accessed by the computer 400.
By way of example, and not limitation, computer-readable media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. For example, computer-readable media includes, but is not limited to, RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, digital versatile disks (“DVD”), HD-DVD, BLU-RAY, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 400.
According to various embodiments, the computer 400 may operate in a networked environment using logical connections to remote computers through a network 418. The computer 400 may connect to the network 418 through a network interface unit 410 connected to the bus 406. It should be appreciated that the network interface unit 410 may also be utilized to connect to other types of networks and remote computer systems. The computer 400 may also include an input/output controller 408 for receiving and processing input from a number of input devices (not shown), including a keyboard, a mouse, a microphone, and a game controller. Similarly, the input/output controller 408 may provide output to a display or other type of output device (not shown).
Based on the foregoing, it should be appreciated that technologies for generating and using a mailbox specific frequent word list are presented herein. Although the subject matter presented herein has been described in language specific to computer structural features, methodological acts, and computer readable media, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific features, acts, or media described herein. Rather, the specific features, acts and mediums are disclosed as example forms of implementing the claims.
The subject matter described above is provided by way of illustration only and should not be construed as limiting. Various modifications and changes may be made to the subject matter described herein without following the example embodiments and applications illustrated and described, and without departing from the true spirit and scope of the present invention, which is set forth in the following claims.