Movatterモバイル変換


[0]ホーム

URL:


CA2236015A1 - System for customized electronic identification of desirable objects - Google Patents

System for customized electronic identification of desirable objects
Download PDF

Info

Publication number
CA2236015A1
CA2236015A1CA002236015ACA2236015ACA2236015A1CA 2236015 A1CA2236015 A1CA 2236015A1CA 002236015 ACA002236015 ACA 002236015ACA 2236015 ACA2236015 ACA 2236015ACA 2236015 A1CA2236015 A1CA 2236015A1
Authority
CA
Canada
Prior art keywords
user
target
target object
sets
objects
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA002236015A
Other languages
French (fr)
Inventor
Frederick S. M. Herz
Jason M. Eisner
Jonathan M. Smith
Steven L. Salzberg
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by IndividualfiledCriticalIndividual
Priority claimed from PCT/US1996/017981external-prioritypatent/WO1997016796A1/en
Publication of CA2236015A1publicationCriticalpatent/CA2236015A1/en
Abandonedlegal-statusCriticalCurrent

Links

Landscapes

Abstract

This invention relates to customized electronic identification of desirable objects, such as news articles, in an electronic media environment, and in particular to a system that automatically constructs both a "target profile"
for each target object in the electronic media based, for example, on the frequency with which each word appears in an article relative to its overall frequency of use in all articles, as well as a "target profile interest summary" for each user, which target profile interest summary describes the user's interest level in various types of target objects. The system then evaluates the target profiles against the users' target profile interest summaries to generate a user-customized rank ordered listing of target objects most likely to be of interest to each user so that the user can select from among these potentially relevant target objects, which were automatically selected by this system from the plethora of target objects that are profiled on the electronic media. Users' target profile interest summaries can be used to efficiently organize the distribution of information in a large scale system consisting of many users interconnected by means of a communication network. Additionally, a cryptographically-based pseudonym proxy server is provided to ensure the privacy of a user's target profile interest summary, by giving the user control over the ability of third parties to access this summary and to identify or contact the user.

Description

CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/179~1 SYSTEM FOR CUSTOMT7,F I> ELECTRONIC II)ENTIFICATION
OF DESIRABLE OBJE~CTS

CROSS-REFERENCE TO RELATED APPLICATIONS
This patent application is a contiml~tiQn-in-part of U.S. Patent Application ~;erial No. 08/346,425, filed November 28, 1994 and titled "SYSTEM AND METHOD FOR
SCHEDI~LING BROADCAST OF AND ACCESS TO VIDEO PROGRAMS ,~ND
OTHER DATA USING CUSTOMER PROFrr FS~, which application is assigned to the same ~c~ignee as the present application.
FlFJT n OF INVENTION
This invention relates to customized electronic identification of desirable ob jects, such as news articles, in an electronic media environment, and in particular to a system that ~lltom~tic~lly constructs both a "target profile" for each target object in the electronic rmedia based, for example, on the frequency with which each word appears in an article relative to its overall frequency of use in all articles, as well as a "target profile interest sull"llal)~" for each user, which target profile interest summary describes the user's interest level in various types of tar get objects. The system then evaluates the target profiles against the users' target profile interest summaries to generate a user-customized rank ordered listing of target objects most likely to be of interest to each user so that the user can select from among these potentially relevant target objects, which were autom~tic~lly selected by this system from the plethora oftarget objects that are profiled on the electronic media. Users' target profile interest sllmm~ries can be used to efficiently ol~ ~lize the distribution of information in a large scale system consisting of many users interconnected by means of a communic:ation network. Additionally, a cryptographically based proxy server is provided to ensure the privacy of a user's target profile interest summary, by giving the user control over the ability ofthird parties to access this sl.. ;1.y and to identify or contact the user.
PROBLEM
It is a problem in the field of electronic media to enable a user to access i,lrolll,ation of relevance and interest to the user without requiring the user to expend an excessive amount oftime and energy searching for the il~ollllaLion. Electronic media, such as on-line il~llllaLion sources, provide a vast amount of information to users, typically in the form of "articles," each of which comprises a publication item or document that relates to a specific " topic. The difficulty with electronic media is that the amount of information available to the user is overwh~lming and the article repository systems that are connected on-line are not CA 0223601~ 1998-04-27 o~ d in a manner that sufficiently simplifies access to only the articles of interest to the user. Presently, a user either fails to access relevant articles because they are not easily identified or expends a significant amount of time and energy to conduct an exhaustive search of all articles to identify those most likely to be of interest to the user. Furthermore, 5 even if the user c~ n~lllcts an ~rh~ tive search, present il ro,lllalion searching techniques do not necessarily accurately extract only the most relevant articles, but also present articles of marginal relevance due to the functional limitations of the il~ll.laLion sealchi-lg techniques. There is also no Pxi~ting system which automatically estimates the inherent quality of a n article or other target object to ~ ,"i~h among a number of articles or target 10 objects identified as of possible interest to a user.
Therefore, in the field of i~lro~ Lion retrieval, there is a long-st~n~ling need for a system which enables users to navigate through the plethora of information. Withcon~ ,ialization of communication networks, such as the Internet, the growth of available information has increased. Customization of the information delivery process to the user's 15 unique tastes and ill~elt;~L~ is the llltim~te solution to this problem. However, the techniques which have been proposed to date either only address the user's interests on a superficial level or provide greater depth and int~.lli~ton~.e at the cost of unwanted d~m~n~ls on the user's time and energy. While many resealcllel~ have agreed that traditional methods have been lacking in this regard, no one to date has succ~s~fi-lly addressed these problems in a holistic 20 manner and provided a system that can fully learn and reflect the user's tastes and interests.
This is particularly true in a practical coll..llelcial context, such as on-line services available on the Intemet. There is a need for an hlroll,laLion retrieval system that is largely or entirely passive, unobtrusive, lmdem~n~lin~ o fthe user, and yet both precise and comprehensive in its ability to leam and truly represent the user's tastes and interests. Present infommation 25 retrieval systems require the user to specify the desired information retrieval behavior through cumbersome interfaces.
Users may receive information on a computer network either by actively retrieving the infommation or by passively receiving infommation that is sent to them. Just as users of h~rolllldlion retrieval systems face the problem oftoo much information, so do users who 30 are targeted with electronic junk mail by individuals and olg~ ions. An ideal system would protect the user from unsolicited advertising, both by autom~tic~lly extracting only the most relevant messages received by electronic ma il, and by preserving the CA 0223601~ 1998-04-27 confid~-nti~lity ofthe user's ~-erele-lces, which should not be freely available to others on the network.
Researchers in the field of published article h~rolll]ia~ion retrieval have devoted considerable effort to finding efflcient and accurate methods of allowing users to select 5 articles of interest from a large set of articles. The most widely used methods of information retrieval are based on keywvld m7ll~h;ll~, the user specifies a set of keywords which the user thinks are exclusively found in the desired articles and the il~vllllc~Lion retrieval computer retrieves a3il articles which contain those keywords. Such methods are fast, but are notoriously ulueliable, as users may not think of the right keywords, or the 10 keywords may be used in unwanted articles in an irrelevant or unexpected context. .A s a result, the iilrvllllaLion retrieval computers retrieve ma~iy articles which are unwanted by the user. The logical cc,lllbi ldLion of keywords and the use of wild-card search parameters help improve the accuracy of keyword sea~ lg but do not completely solve the problern of inaccurate search results.
Starting in the 1960's, an alternate approach to information retrieval was developed:
users were presented with an article and asked if it co"l~i"ecl the information they wanted, or to quantify how close the information contained in the article was to what they wanted.
Each article was described by a profiile which comprised either a list of the words in the article or, in more advanced systems, a table of word frequencies in the article. Since a measure of similarity between articles is the distance between their profiles, the measured similarity of article profiles can be used in article retrieval. For example, a user searc]~ing for information on a subject can write a short description of the desired information. The h~ullllclLion retrieval computer generates an article profile for the request and then retrieves articles with profLles similar to the profLle generated for the request. These requests can l:hen be reflned using "relevance feedback", where the user actively or passively rates the articles retrieved as to how close the il~c,llllclLion collL~Ied therein is to what is desired. The information retrieval computer then uses this relevance feedbac~ olmaLion to refine the request profLle and the process is repeated until the user either finds enough articles or tires of the search.
A number of researchers have looked at methods for selecting articles of most interest to users. An article titled "Social IllrollllaLion filtering: algoliLI3~i]i]is for automating 'word of mouth"' was p~lbli~hed at the CHi-95 Procee-l;llg~ by Patti Maes et al and descriibes . the Ringo hlrorlll~Lion retrieval system which leco"i",ton~s musical selections. The Ringo CA 0223601~ 1998-04-27 W O 97/16796 PCTrUS96/17981 system requires active feedback from the users -- users must m~nll~lly specify how much they like or dislike each mllei~ selection. The Ringo system ..~ e a complete list of users ratings of music selections and makes ~eco,.",.enrl~tions by finding which selections were liked by mllltiple people. However, the Ringo system does not take advantage of any 5 available descriptions of the music, such as structured descriptions in a data base, or free text, such as that contained in music reviews. An article titled "Evolving agents for personalized information filtering", published at the Proc. 9th IEEE Conf. on AI for Aprlic~tic-n~ by Sheth and Maes, described the use of agents for information filtering which use genetic algolilh l,s to learn to categorize Usenet news articles. In this system, users 10 must define news categories and the users actively indicate their opinion of the selected articles. Their system uses a list of keywords to lep-ese--L sets of articles and the records of users' interests are updated using genetic algo-i~hl--s.
A number of other research groups have looked at the automatic generation and labeling of clusters of articles for the purpose of b. owsing through the articles. A group at 15 Xerox Parc published a papêr titled "Scatter/gather: a cluster-based approach to browsing large article collecti~ n~" at the 15 Ann. Int'l SIGIR '92, ACM 318-329 (Cutting et al. 1992).
This group developed a method they call "scatter/gather" for pe.r~ l.,ng information retrieval searches. In this method, a collection of articles is "scattered" into a small number of clusters, the user then chooses one or more of these clusters based on short sumrnaries 20 of the cluster. The selected clusters are then "gathered" into a subcollection, and then the process is repeated. Each iteration of this process is expected to produce a small, more focused collection. The cluster "summaries" are generated by picking those words which appear most frequently in the cluster and the titles of those articles closest to the center of the cluster. However, no feetlb~k from users is collected or stored, so no performance 25 irnprovement occurs over time.
Apple's Advanced Technology Group has developed an interface based on the concept of a "pile of articles". This int~ ce is described in an article titled "A 'pile' metaphor for supporting casual o,~ n of information in Human factors in computer systems" published in CHl '92 Conf. Proc. 627-634 by Mander, R. G. Salomon and Y.
30 Wong. 1992. Another article titled "Content awareness in a file system interface implçm~nfing the 'pile' metaphor for ~ ;,;l-g i~ro...~alion" was published in 16 Ann. Int'l SIGIR '93, ACM 260-269 by Rose E. D. et al. The Apple interface uses word frequencies to automatically file articles by picking the pile most similar to the article being filed. This -CA 0223601~ 1998-04-27 W O 97/16796 PCTrUS96/179~1 system functions to cluster articles into subpiles, determine key words for inrle~irlg by picking the words with the largest TF/IDF (where TF is term (word) frequency and rDF is the inverse document frequency) and label piles by using the determined key words.
Numerous patents address illrol .llalion retrieval methods, but none develop records of a user's interest based on passive monitQring of which articles the user accesses None of the systems described in these patents pre sent computer ar hitectllres to allow fast retrieval of articles di~ll;buLed across many comr~lt~rs. None of the systems described in these patents address issues of using such article retrieval and ,.,~cl,;.~g method.s for purposes of collllllel.;e or of m~tr~hin~ users with common interests or developing records of users' interests. U.S. Patent No. 5,321,833 issued to Chang et al. teaches a meth,od in which users choose terms to use in an i~u~"alion retrieval query, and specify the relative wei~h~ing.~ of the dilrerell~ terms. The Chang system then c~lc~ tes multiple levels of weighting criteria. U.S. Patent No. 5,301,109 issued to T .~n~ er et al. teaches a method for retrieving articles in a multiplicity of l~nf~ges by constructing "latent vectors" (S~D or PCA vectors) which represent correlations between the different words. U.S. Paten,t No.
5,331,554 issued to Graham et al. discloses a method for retrieving segment~ of a mlanual by colllpaling a query with nodes in a decision tree. U.S. Patent No. 5,331,556 addresses techniques for deriving morphological part-of-speech information and thus to make use of the similarities of dirre,l~lll forms of the same word (e.g. "article" and "articles").
Therefore, there presently is no il~rolllla~ion retrieval and delivery system operable in an electronic media environment that enables a user to access illrollllaLion of relevance and interest to the user without requiring the user to expend an excessive amount of tirne and energy.
SOLUTION
The above-described problems are solved and a technical advance achieved in the field by the system for customized electronic identification of desirable objects in an electronic media environment, which system enables a user to access target objects of relevance and interest to the user without requiring the user to expend an excessive amount of time and energy. Profiles of the target objects are stored on electronic media and are accessible via a data commllnic.~tion network. In many applications, the target obje-cts are informational I n nature, and so may themselves be stored on electronic media and be accsc~ible via a data commllnic~ti~ n network.

CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981Relevant definitions of terms for the purpose of this description include: (a.) an object available for access by the user, which may be either physical or electronic in nature, is termed a "target object", (b.) a digitally repf~se..Led profile indicating t hat target object's attributes is termed a "target profile", (c.) the user looking for the target object is termed a 5 "user", (d.) a profile holding that user's attributes, in~ rlin~ age/zip code/etc. is termed a "user profile", (e.) a ~.."....~.y of digital profiles of target objects that a user likes and/or dislikes, is terrned the "target profile interest s~ y" of that user, (f.) a profile consisting of a collection of attributes, such that a user likes target objects whose profiles are similar to this collection of attributes, is termed a "search profile" or in some contexts a "query" or 10 "query profile," (g.) a specific embodiment of the target profile interest ~ n. . .~ . y which cc - "l" ;~es a set of search profiles is termed the "search profile set" of a user, (h.) a collection oftarget objects with similar profiles, is termed a "cluster," (i.) an aggregate profile forrned by averaging the attributes of all tar get objects in a cluster, termed a "cluster profile," a.) a real number determined by calc~ ting the statistic~l variance of the profiles of all target 15 objects in a cluster, is terrned a "cluster variance," (k.) a real number determined by ~lclll~ting the maximum distance between the profiles of any two target objects in a cluster, is termed a "cluster ~ met~r."
The system for electronic i~lenfifi~ati~n of desirable objects of the present invention automatically constructs both a target profile for each target object in the electronic media 20 based, for example, on the frequency with which each word appears in an article relative to its overall frequency of use in all articles, as well as a "target profile interest summary" for each user, which target profile interest summary describes the user's interest level in various types oftarget objects. The system then evaluates the target profiles against the users' target profile interest ~,."""~,ies to generate a user-customized rank ordered listing of tar get 25 objects most likely to be of interest to each user so that the user can select from among these potentially relevant target objects, which were ~-tc m~tically selected by this system from the plethora of target objects available on the electronic media.
Because people have mllltiple interests, a target profile interest summary for a single user must replesen~ multiple areas of interest, for example, by consisting of a set of 30 individual search profiles, each of which id~ntifies one of the user's areas of interest. Each user is presented with those target objects whose profiles most closely match the user's interests as described by the user's target profile interest snmm~ y. Users' target profile interest summaries are automatically updated on a contimling basis to reflect each user's CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96tl79~1 çh~nging~ interests. In ~d~litic)n~ target objects can be g~rouped into clusters based on their similarity to each other, for example, based on similarity of their topics in the case where the target objects are published articles, and menus autom~tic~lly generated for each cluster of target objects to allow users to navigate throughout the clusters and m~n-l~lly locate 5 target objects of interest. For reasons of c-~nfid~nti~lity and privacy, a particular user may not wish to make public all of the interests recorded in the user's target profile interest summary, particularly when these interests are determined by the user's purchasing patterns.
The user may desire that all or part of the target profile interest summary be kept confidenti~l, such as h.r~,lllaLion relating to the user's political, religious, fin~nei 1l or 10 p~,h~i..g behavior; indeed, conficl~nti~lity with respect to purchasing behavior is the user's legal right in many states. It is therefore nececc~ry that data in a user's target profile interest summary be protected from unwanted disclosure except with the user's agreement. A.t the same time, the user's target profile interest sllmm~ries must be ~cces.cible to the relevant servers that perform the m~t~hing of target objects to the users, if the benefit oi' this 15 matching is desired by both providers and con~ul,lers of the target objects. The disclosed system provides a sol~ltion to the privacy problem by using a proxy server which acts as an interrnediary between the ih~fol.l.~Lion provider and the user. The proxy server dissoc,iates the user's true identity from the pseudonym by the use of cryptographic techniques. The proxy server also permits users to control access to their target profile interest sumrnaries 20 and/or user profiles, ine~ ing provision of this h~rcjllll~Lion to marketers and advertisers if they so desire, possibly in exchange for cash or other considerations. Marketers may purchase these profiles in order to target adver~icçmçntc to particular users, or they may purchase partial user profiles, which do not include enough hlrollll~Lion to identify the individual users in question, in order to carry out standard kinds of demographic analysis 25 and market research on the res-llting cl~t~b~ce of partial user profiles.
In the ~le~ll~d embodiment of the invention, the system for customized electronic identific~tion of desirable objects uses a fundamental methodology for accurately and efficiently m~tr.hing users and target objects by ~--tom~tic~lly c~lclll~ting, using~ and updating profile i.lrc,llll~Lion that describes both the users' interests and the target objects' 30 characteristics. The target objects may be published articles, purchasable items, or even other people, and their properties are stored, and/or represented and /or denoted an the electronic media as ~digital) data. Examples of target objects can in~ludç~ but are not limited to: a newspaper story of potential interest, a movie to watch, an item to buy, e -mail CA 0223601~ 1998-04-27 to receive, or another person to collt;spol1d with. In all these cases, the information delivery process in the p~c;r~l~ed embodiment is based on deLt:lll.,lfing the eimil~rity between a profile for the target object and the profiles of target objects for which the user (or a similar user) has provided positive feeclb?~l ~ in the past. The individual data that describe a target 5 object and constitute the target object's profile are herein termed "attributes" of the target object. Attributes may in~ludP; but are not limited to, the following: (1) long pieces of text ( a newspaper story, a movie review, a product description or an adverti.e~m~nt), (2) short pieces of text (name of a movie's director, name of town from which an adverti.eem~nt was placed, name of the l~n~-~ge in which an article was written), (3) numeric measurements 10 (price of a product, rating given to a movie, reading level of a book), (4) associations with other types of objects (list of actors in a movie, list of persons who have read a document).
Any of these attributes, but especially the numeric ones, may correlate with the quality of the target object, such as measures of its popularity (how often it is ~ccessed) or of user satisfaction (number of complaints received).
The plc;relled embodiment ofthe system for customi7ed electronic i~lçntific~tion of desirable objects operates in an electronic media envho"~llenL for acceeqing these target objects, which may be news, electronic mail, other published docllmente, or product descriptions. The system in its broadest construction comprises three conceptual modules, which may be separate entities distributed across many impl~ g systems, or combined 20 into a lesser subset of physical entities. The specific embodiment of this system disclosed herein illustrates the use of a first module which autom~tic~lly constructs a "target profile"
:~or each target object in the electronic media based on various descriptive attributes of the target object. A second module uses interest feedba~ from users to construct a "target profile interest s~lmm~ryll for each user, for example in the form of a "search profile set"
25 coneieting of a plurality of search profiles, each of which corresponds to a single topic of high interest for the user. The system further inc~llldes a profile processing module which estim~tes each us er's interest in various target objects by reference to the users' target profile interest summaries, for example by comparing the target profiles of these target objects against the search profiles in users' search profile sets, and generates for each us er 30 a c~ o"..,~d rank-ordered listing oftarget objects most likely to be of interest to that user.
Each user's target profile interest summary is autom~tiC~lly updated on a contin-ling basis to reflect the user's ch~nging interests.

CA 0223601~ 1998-04-27 W O97/16796 PCT~US96/179.51 Target objects may be of various sorts, and it is somPtime~ advantageous to use a single system that delivers and/or clusters target objects of several distinct sorts at once, in a unified L~ll~w~lk. For ~x~mplç, users who exhibit a strong interest in certain novels may also show an interest in certain movies, presumably of a similar nature. A system in which - 5 some target objects are novels and other target objects are movies can discover such a correlation and exploit it in order to group particular novels with particular movies, e.g., for clustering purposes, or to l eco.... ~ n-l the movies to a user who has demons~ ed interest in the novels. Similarly, if users who exhibit an interest in certain World Wide Web sites also exhibit an interest in certain products, the system can match the products with the sites 10 and thereby recommend to the markete}s of those products that they place ad~el l ;~e~ n at those sites, e.g., in the form of hypertext links to their own sites.
The ability to measure the similarity of profiles describing target objects and a user's interests can be applied in two basic ways: filtering and browsing. Filtering is useful when large numbers of target objects are described in the electronic media s pace. These tiarget 15 objects can for ~x~mple be articles that are received or potentially received by a user, who only has time to read a small fraction of them. For example, one might potentially receive all items on the AP news wire service, all items posted to a number of news group's, all adverti~m.~nt~ in a set of n~w~a~ , or all unsolicited electronic mail, but few people have the time or inclination to read so many articles. A filtering system in the system for 20 customized electronic identification of desirable objects automatically selects a set of articles that the user is likely to wish to read. The accuracy of this filtering system improves over time by noting which articles the user reads and by gen~,.a~illg a measurement of the depth to which the user reads each article. This ;"rc., ."i~i;on is then us ed to update the user's target profile interest summary. Browsing provides an alternate method of selecting a small 25 subset of a _arge number of target objects, such as articles. Articles are org~ni7ed so that users can actively navigate among groups of articles by moving from one group to a larger, more general group, to a smaller, more specific group, or to a closely related group. ]_ach individual article forms a one-member group of its own, so that the user can navigate to and from individual article s as well as larger groups. The methods used by the systern for 30 customized electronic identification of desirable objects allow articles to be grouped into clusters and the clusters to be grouped and merged into larger and larger clusters. These hierarchies of clusters then form the basis for m~m~ing and navigational systems to allow CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981 the rapid se~cLi~-g of large l~unlbel:~ of articles. This same clustering technique is applicable to any type of target objects that can be profiled on the electronic media.
There are a number of variations on the theme of developing and using profiles for article retrieval, with the basic implçm.o.nt~tion of an on-line news clipping service 5 ~ s~ P the pl ~;r~,l ed embodiment of the invention. Variations of this basic system are disclosed and comprise a system to filter electronic mail, an .o~t~n~inn for retrieval of target objects such as purchasable items which may have more complex descriptions, a system to ~lltom~tic~lly build and alter m~mling systems for bl~w~ g and searching through large numbers of target objects, and a system to construct virtual cornml-nities of people with 10 common interests. These intelligent filters and browsers are necessary to provide a truly passive, int~llig~nt system interface. A user interface that permits intuitive browsing and filtering represents for the first time an intelligent system for determining the ~ffinifie~
between users and target objects. The clet~ilçd, cOlllpl ehe~ re target profiles and user-specific target profile interest summaries enable the system to provide responsive 15 routing of specific queries for user i,~""~lion access. The i,lro",.~Lion maps so produced and the application of users' target profile interest s~lmm~ries to predict the information consumption patterns of a user allows for pre-caching of data at locations on the data communication network and at times that ..,;.~i.";,e the traffic flow in the communication network to thereby efflciently provide the desired information to the user and/or conserve 20 valuable storage space by only storing those target objects (or segments thereof) which are relevant to the user's interests.
RRIEF DESCR~TION OF 1 HE D~AWING
Figure 1 illustrates in block diagram form a typical arçhitectllre of an electronic media system in which the system for customized electronic iclçntific~tinn of desirable 25 objects of the present invention can be implem~nted as part of a user server system;
Figure 2 illustrates in block diagram form one embodiment of the system for customized electronic idçntific~tion of desirable objects;
Figures 3 and 4 illustrate typical n~Lwo,k trees;
Figure 5 illustrates in flow diagram form a method for autom~tic~lly generating 30 article profiles and an associated hierarchical menu system;
Figures 6-9 illustrate examples of menu gelle,~Li.~g process;
Figure 10 illustrates in flow diagram form the operational steps taken by the system for customized electronic identification of desirable objects to screen articles for a user;

CA 0223601~ 1998-04-27 W O 97/16796 PCTAJS96/179;B1 Figure 1 1 illustrates a hiel ~ ~,lfical cluster tree e A~II~Jlf;
Figure 12 illustrates in fiow diagram form the process for de~el-nillalion of likelihood of interest by a specific user in a selected target object;
Figures 13A-B illustrate in flow diagram form the automatic clustering process;
Figure 14 illustrates in flow diagram form the use of the pseudonymous server;
Figure 15 illustrates in flow diagram form the use of the system for ~Cce~ing information in response to a user query; and Figure 16 illustrates in flow diagram form the use of the system for accessing information in response to a user query when the system is a distributed net~vork I O implf?mf?nt~tion.
J)ETAII,ED DESCRIPTION
MEASURING SIMILARII~Y
This section describes a general procedure for automatically measuring the similarity between two target objects, or, more p.~;sely, beLwt;el- target profiles that are autom~ti~ y generated for each of the two target objects. This similarity detellllinalion process is applicable to target objects in a wide variety of contexts. Target objects being colllpal ed can bê, as an example but not limited to: textual doc~mf?nt~, human beings, movies, or mutual funds. It is ~csllmed that the target profiles which describe the target objects are stored at one or more locations in a data commllnic~tif~n network on data storage media associated with a c~mp~lter system. The comr~lted~ ~iLy measurements serve as input to additional processes, which f~fnction to enable human users to locate desired target objects using a large computer system. These additional processes estim~te a human user's interest in various target objects, or else cluster a plurality of target objects in to logically coherent groups. The methods used by these additional processes might in principle be implemf?nted on either a single computer or on a computer nt;~wolk. Jointly or separately, they form the unde~ fillg for various sorts of database systems and il~ol lllaLion retrieval systems.
Target Objects and Attributes In c~ ic~l IllrollllaLion Retrieval (~) technology, the user is a literate human and the target objects in question are textual docllmf?nti stored on data storage devices interconnected to the user via a computer network. That is, the target objects consist entirely of text, and so are digitally stored on the data storage devices within the computer network. However, there are other target object f~om~in~ that present related retrieval problems that are not capable of being solved by present information retrieval technology:

CA 0223601~ 1998-04-27 ~ ~ r (a.) the user is a film bu~ and the target objects are movies available on videotape.
(b.) the user is a consumer and the target objects are used cars being sold.
(c.) the user is a consumer and the target objects are products being sold through promotional deals.
S (d.) the user is an investor and the target objects are publicly traded stocks, mutual funds and/or real estate pl opel lies.
(e.) the user is a student and the target objects are classes being offered.
(f.) the user is an activist and the target objects are Congressional bills of potential concern. (g.) the user is a direct-mail marketer and the target objects are potential 1 0 customers.
(h.) the user is a net-surfer and the target objects are pages, servers, or newsgroups available on the World Wide Web.
(i.) the user is a philanthropist and the target objects are charities.
(j.) the user is ill and the target objects are medical spe~ t~
(k.) the user is an employee and the target objects are potential employers.
(1.) the user is an employer and the target objects are potential employees.
(m.) the user is an b~lç~ red executive and the target objects are electronic mail messages addressed to the user.
(n.) the user is a lonely heart and the target objects are potential conversation pa~ . (o.) the user is in search of an expert and the target objects are users, with known retrieval habits, of an document retrieval system.
(p.) the user is a social worker and the target objects are families that may need extra visits.
(q.) the user is an oncologist and the target objects are women for whom m~mmograms may be advisable.
(r.) the user is an auto insurance company and the target objects are potential customers In all these cases, the user wishes to locate some small subset of the target objects -- such as the target objects that the user most desires to rent, buy, invçstip;~te7 meet, read, give m~mmograms to, insure, and so forth. The task is to help the user identify the most interesting target objects, where the user's interest in a target object is defined to be a numerical measurement of the user's relative desire to locate that object rather than others.

CA 0223601~ 1998-04-27 W O 97/16796 PCTAJS96/179~1 The generality of this problem motivates a general approach to solvin the information retrieval problems noted above. It is ~c~lmed that many target objects are known to the system for custQmi7ed electronic idP.ntific~tion of desirable objects, and that specifically, the system stores (or has the ability to leconsLl.lct) several pieces of - 5 i~ lalion about each target object. These pieces of illrollll~ion are termed "attributes":
collectively, they are said to form a profile ofthe target object, or a "target profile." For example, where the system for ~ o~ ed electronic id~ntific~tion of desirable objects is activated to identify movies of interest, the system is likely be concerned with the values of attributes such as these:
(a.) title of movie, (b.) name of director, (c.) Motion Picture Association of America (MPAA) child-appro~ P.nes~ rating (0=G, l=PG,...), (d.) date of release, (e.) number of stars granted by a particular critic, (f.) number of stars granted by a second critic, (g.) number of stars granted by a third critic, (h.). full text of review by the third critic, (i.). list of customers who have previously rented this movie, ( j.) list of actors.
Each movie has a dirr~l~llL set of values for these attributes. This example conveniently illustrates three kinds of attributes. Attributes c-g are numeric attributes, of the sort that might be found in a d~t~ha~e record. It is evident that they can be used to help the user identify target objects (movies) of interest. For example, the user might previously have rented many Parental Guidance (PG) films, and many films made in the 1970's. This gener~li7~tion is useful: new films with values for one or both attributes that are numerically similar to these (such as MPAA rating of 1, release date of 1975) are judged similar to the films the user already likes, and ~helt;rol~ of probable interest. Attributes a-b and h are textual attributes. They too are important for helping the user locate desired f~lms. For 30 ~.~;"~ , perhaps the user has shown a past interest in films whose review text (attribute h) contains words like "chase," "explosion," "explosions," "hero," "gripping," and "superb."
This generalization is again useful in identifying new films of interest. Attribute i is an associative attribute. It records associations between the target objects in this domain, CA 0223601~ 1998-04-27 namely movies, and ancillary target objects of an entirely di~ele,lL sort, namely h-lm~n~
A good in~lic~tiQn that the user wants to rent a particular movie is that the user has previously rented other movies with similar attribute values, and this holds for attribute I
just as it does for attributes a-h. For example, if the user has often liked movies that S customer Cl7 and customer Clgo have rented, then the user may like other such movies, which have similar values for attribute i. Attribute j is another example of an associative attribute, lecoldillg associations between target objects and actors. Notice that any of these attributes can be made subject to ~th~ntic~tion when the profile is constructed, through the use of digital ~ign~tllres; for example, the target object could be accompanied by a digitally signed note from the MPAA, which note names the target object and specifies its ~llth~.ntic value for attribute c .
These three kinds of attributes are co,.""ol- numeric, textual, and associative. In the classical information retrieval problem, where the target objects are docllm~nts (or more generally, coherent document sections extracted by a text segmçnt~tiQn method), the system might only consider a single, textual attribute when measuring similarity: the full text of the target object. However, a more sophisticated system would consider a longer target profile, including numeric and associative attributes:
(a.) full text of document (textual), (b.) title (textual), (c.) author (textual), (d.) l~n~l~ge in which document is written (textual), (e.) date of creation (numeric), (f.) date of last update (numeric), (g.) Iength in words (numeric), (h.) readinglevel (numeric), (i.) quality of document as rated by a third\_party editorial agency (mlm~ric), (3.) list of other readers who have retrieved this document (associative).
As another domain example, consider a domain where the user is an advertiser andthe target objects are potential customers. The system might store the following attributes for each target object (potential customer):
(a.) first two digits of zip code (textual), (b.) first three digits of zip code (textual), (c.) entire five-digit zip code (textual), "

CA 0223601~ 1998-04-27 W O 97/16796 PCTAUS96/179~1 (d.) ~ t~nre of reCi(lçnre from advertiser's nearest physical sto,c;flo,rL (numeric), (e.) annual family income (numeric), (f.) number of children (numeric), (g.) Iist of previous items purchased by this potential customer (associative), list of filen~mes stored on this potential cu~Lolllel's client computer (associative), list of movies rented by this potential c~lstom~r (associative), list of investm.ont~ in this potential crlstom~or's investm.ent portfolio (associative), list of docllment~ retrieved by this potential cl-~tomer (associative), written response to Rorschach inkblot test (textual), mllltiple-choice l~:~onses by this ~l.stomer to 20 self-image quçstiQn~ (20 textual attributes).
As always, the notion is that similar consumers buy similar products. It shoul.d be noted that diverse sorts of hlrc l ll,~lion are being used here to characterize con~llmers~ :Erom their consumption patterns to their literary taste s and psychological peculiarities, and that this fact illustrates both the flexibility and power of the system for customized electronic i(ientific~tion of desirable objects of the present invention. Diverse sorts of information can be used as attributes in other domains as well (as when physical, economic, psychological and interest-related questions are used to profile the applicants to a dating service, which is indeed a possible domain for the present system), and the advertiser domain is simply an example.
As a final domain example, consider a domain where the user is an stock market investor and the target objects are publicly traded corporations. A great many attributes might be used to characterize each corporation, inrlu~ling but not limited to the follov~ing:
(a.) type of business (textual), ~b.) corporate mission st~tement (textual), (c.) number of employees during each of the last 10 years (ten separate numeric alllibutes), (d.) percentage growth in number of employees during each ofthe last 10 years, (e.) dividend payment issued in each of the last 40 quarters, as a percentage ofcurrent share price, (f.) percentage appreciation of stock value during each of the last 40 quarters, Iist of shareholders (associative), (g.) composite text of recent articles about the corporation in the fin~nr.i~l press (textual). It is worth noting some additional attributes that are of interest in some CA 0223601~ 1998-04-27 WO 97tl6796 PCT/US96/17981 cic m~in~ In the case of docl ImP.nte and certain other dom~inc~ it is useful to know the source of each target object (for Px~mplç, ,GrGlGed journal article vs. UPI newswire article vs.
Usenet llGW:i~;lOLIp posting vs. question-answer pair from a question-and-answer list vs.
tabloid newspaper article vs....); the source may be represented as a single-term textual 5 attribute. Lll~,l L~ o~tive ~ttrih~ltes for a hypertext docllmPnt are the list of docllm~nfs that it links to, and the list of do~lmPnt~ that link to it. DocllmP.nt~ with similar citations are similar with respect to the former aKribute, and docllmPnt~ that are cited in the same places are similar with respect to the latter. A convention may optionally be adopted that any document also links to itself. Especially in systems where users can choose whether or not 10 to retrieve a target object, a target object's popularity (or circulation) can be usefully measured as a numeric attribute specifying the number of users who have retrieved that object. Related measurable numeric attributes that also indicate a kind of popularity include the number of replies to a target object, in the domain where target objects are mes.c~ges posted to an electronic community such as an computer bulletin board or newsgroup, and 15 the number of links leading to a target object, in the domain where target objects are interlinked hypertext docllmPnt~ on the World Wide Web or a similar system. A target object may also receive explicit numeric ev~ tion~ (another kind of numeric attribute) from various groups, such as the Motion Picture Association of America (~AA), as above, which rates movies' applup.;~t~ne~ for children, or the American Medical Association, 20 which might rate the accuracy and novelty of medical, Gse~h-;h papers, or a random survey sample of users (chosen from all users or a selected set of experts), who could be asked to rate nearly anything. Certain other types of evaluation, which also yield numeric attributes, may be carried out me~h~niç~lly. For example, the difficulty of reading a text can be a~s~s~ecl by standard procedures that count word and sentence lengths, while the vulgarity 25 of a text could be defined as (say) the number of wlgar words it collLaills, and the expertise of a text could be crudely assessed by COU11L~Ig the ..u,..ber of similar texts its author had previously retrieved and read using the invention, perhaps co~nillg this count to texts that have high approval ratings from critics. Finally, it is possible to synthesize certain textual attributes . ~ h~ c~lly~ for example to reconstruct the script of a movie by applying speech 30 recognition techniques to its soundtrack or by applying optical character recognition techniques to its closed-caption subtitles.

CA 0223601~ 1998-04-27 WO 97/16796 PCT/US96/179~,1 Decomposing Complex Attributes A1thn~gh textual and associative attributes are large and complex pieces of data, for information retrieval purposes they can be decomposed into smaller, simpler nurmeric attributes. This means that arLy set of attributes can be replaced by a (usually larger) set of S numeric attributes, and hence that any profile can be lt;presellLed as a vector of numbers denotirLg the values ofthese n11mPric attributes. In particular, a textual attribute, such as the full text of a movie review, can be replaced by a collection of numeric attributes that represent scores to denote the presence and ~i nific~nce ofthe words "aardvark," "aback,"
"abacus," and so on through "~,ylll~Llgy" in that text. The score of a word in a text may be 10 defined in numerous ways. The simplest d~finition is that the score is the rate ofthe word in the text, which is computed by computing the number of times the word occurs in the text, an d dividing this number by the total number of words in the text. This sort of sc,ore is often called the "term frequency" (TF) of the word. The definition of term frequency may optionally be modified to weight dirrelt lll portions of the text unequally: for example, any l 5 occurrence of a word in the text's title might be counted as a 3 -fold or more generally k-fold occurrence (as if the title had been repeated k times within the text), in order to reflect a heuristic assumption that the words in the title are particularly important indicators oi.~the text's content or topic.
However, for lengthy textual attributes, such as the text of an entire document, the 20 score of a word is typically defined to be not merely its term frequency, but its 1erm frequency mllltipli~d by the n~;~ted logarithm of the word's "global frequency," as measured with respect to the textual attribute in question. The global frequency of a word, which effectively measures the word's u~ cllll~Li~eness, is a fraction between 0 and l, defined to be the fraction of all target objects for which the textual attribute in question 25 contains this word. This ~ sted score is often known in the art as TF/IDF ("1:erm frequency times inverse document frequency"). When global frequency of a word is taken into account in this way, the cn.. o~-, ~lhlrolllla~ e words have scores colllp~lively close to zero, no matter how often or rarely they appear in the text. Thus, their rate has little influence on the object's target profile. Alternative methods of c~1c1-1~ting word scores 30 include latent s~m~ntic in~ ing or probabilistic models.
Ir~stead of breaking the text into its component words, one could alternatively break the text into o~ llap~ g word bigrams (sequences of 2 adj~c~nt words), or more generally, word n-grams. These word n-grams may be scored in the same way as individual words.

CA 0223601~ 1998-04-27 W O 97/16796 PCTrUS96/17981 Another possibility is to use character n-grams. For example, this sentence contains a sequence of overlapping character 5-grams which starts "for e", "or ex", "r exa", " exam", "examp", etc. The sl?ntçnce may be characterized, imprecisely but usefully, by the score of each possible character 5-gram ("aaaaa", "aaaab", ... "zzzzz") in the senf~nce Conceptually S speaking, in the character 5-gram case, the textual attribute would be decomposed into at least 265 = 11,881,376 mlm~nc~ s Of course, for a given target object, most of these numeric attributes have values of 0, since most 5-grams do not appear in the target object attributes. These zero values need not be stored anywhere. For purposes of digital storage, the value of a textual attribute could be characterized by storing the set of character 5-grams 10 that actually do appear in the text, together with the nonzero score of each one. Any 5-gram that is no t includecl in the set can be ~sllm~d to have a score of zero. The decomposition of textual attributes is not limited to attributes whose values are expected to be long texts.
A simple, one-term textual attribute can be replaced by a collection of numeric attributes in exactly the same way. Consider again the case where the target objects are movies. The 15 "name of director" attribute, which is textual, can be replaced by numeric attributes giving the scores for "Federico-Fellini," "Woody-Allen," "Terence- Davies," and so forth, in that attribute. For these one-term textual attributes, the score of a word is usually defined to be its rate in the text, without any consideration of global frequency. Note that under these conditions, one of the scores is 1, while the other scores are 0 and need not be stored. For 20 example, if Davies did direct the film, then it is "Terence-Davies" whose score is 1, since "Terence-Davies" conctit~te~ 100% of the words in the textual value of the "name of director" attribute. It might seem that nothing has been gained over simply regarding the textual attribute as having the string value "Terence-Davies." However, the trick of decomposing every non-numeric attribute into a collection of numeric attributes proves 25 useful for the clustering and decision tree methods described later, which require the attribute values of di~el~n~ objects to be averaged and/or ordinally ranked. Only numeric attributes can be averaged or ranked in this way. Just as a textual attribute may be decomposed into a number of component terms (letter or word n-grams), an associative attribute may be decomposed into a number of component associations. For instance, in a 30 domain where the target objects are movies, a typical associative attribute used in profiling a movie would be a list of customers who have rented that movie. This list can be replaced by a collection of numeric attributes, which give the "association scores" between the movie and each of the customers known to the system. For example, the 165th such numeric CA 0223601~ 1998-04-27 attribute would be the association score between the movie and customer #165, where the association score is defined to be 1 if c l~tomer #165 has previously rented the movie, and 0 otherwise. In a subtler rçfin~m~nt, this association s core could be defined to be the degree of interest, possibly zero, that customer #165 exhibited in the movie, as deterrnined by relevance feedb~rL~ (as described below). As another example, in a domain where target objects are col-lpal~ies, an associative attribute indicating the major shareholders of the COlll~ y would be deco~ osed into a collection of association scores, each of which would indicate the percentage of the COlllphlly (possibly zero) owned by some particular individual or corporate body. Just as with the term scores used in decomposing lengthy textual attributes, each association score may optionally be ~ sted by a mllltirlicative factor: for example, the association score between a movie and customer #165 might be multipli,ed by the negated logarithm of the "global frequency" of customer #165, i.e., the fraction of all movies that have been rented by custom~r #165. Just as with the term scores used in decomposing textual attributes, most association scores found when decomposing aparticular value of an associative attribute are zero, and a similar economy of storage may be gained in exactly the same manner by storing a list of only those ancillary objects with which the target object has a nonzero association score, together with their respective association scores.
Similarity Measures What does it mean for two target objects to be similar? More precisely, how should one measure the degree of similarity? Many approaches are possible and any reasonable metric that can be computed over the set of target object profiles can be used, where target objects are considered to be similar if the distance between their profiles is small according to this metric. Thus, the following ~lerel.ed embodiment of a target object similarity measurement system has m any variations.
First, define the distance between two values of a given attribute accordi]lg towhether the attribute is a numeric, associative, or textual attribute. If the attribute is numeric, then the (1i~t~nre between two values of the attribute is the absolute value of the difference between the two values. (Other d~finition~ are also possible: for example, the distance between prices pl and p2 might be defined by l(pl-p2)l/(max(pl,p2)+]), to recognize that when it comes to customer interest, $50 00 and $~020 are very similar, whereas $3 and $23 are not.) If the attribute is associative, then its value V may be decomposed as described above into a collection of real numbers, representinz the CA 0223601~ 1998-04-27 association scores between the target object in question and various ancillary objects. V
may therefore be regarded as a vector with components Vl, V2, V3, etc."c;plese~ g the association scores between the object and ancillary objects 1, 2, 3, etc., respectively. The di~t~nre between two vector values V and U of an associative attribute is then computed S using the angle distance measure, arccos (VU' /sqrt((Vv')( UlJt)). (Note that the three inner products in this c~ ion have the form XY' = Xl Yl + X2 Y2 + X3 Y3~..., and that for efflcient computation, terms of the form X; Y; may be omitted from this sum if either of the scores X; and Yj is zero.) Finally, if the attribute is textual, then its value V may be decomposed as described above into a collection of real numbers, rel~res~ g the scores 10 of various word n-grams or character n-grams in the text. Then the value V may again be regarded as a vector, and the distance between two values is again defined via the angle f~iet~nce measure. Other ~imil~rity metrics between two vectors, such as the dice measure, may be used instead. It happens that the obvious alternative metric, F.~lr.lidç~n ~ t~nçr, does not work well: even similar texts tend not to overlap subst~nti~lly in the content words 15 they use, so that texts encountered in practice are all subst~nti~lly orthogonal to each other, ~sllming that TF/IDF scores are used to reduce the inflllrnce of non-content words. The scores oftwo words in a textual attribute vector may be correlated; for example, "Kennecly"
and "JFK" tend to appear in the same docllmRnt.~ Thus it may be advisable to alter the text somewhat before computing the scores of terms in the text, by using a synonym dictionary 20 that groups together similar words. The effect of this optional pre-alteration is that two texts using related words are measured to be as similar as if they had actually used the same words. One terhnitl~e is to ~lgmrnt the set of words actually found in the article with a set of synonyms or other words which tend to co-occur with the words in the article, so that "Kennedy" could be added to every article that mf~ntionc "lFK." Alternatively, words found 25 in the article may be wholly replaced by synonyms, so that "JFK" might be replaced by "Kennedy" or by "John F. Kennedy" wherever it appears. In either case, the result is that docllm.onts about Kennedy and documents about JFK are adjudged similar. The synonym dictionary may be sensitive to the topic of the document as a whole; for example, it may recognize that "crane" is likely to have a dirre.ell~ synonym in a document that mentions 3 0 birds than in a document that mentions construction. A related technique is to replace each word by its morphological stem, so that "staple", "stapler", and "staples" are all replaced by "staple." Common function words ("a", "and", "the" ...) c an infl~lf?nc~e the calculated similarity of texts without regard to their topics, and so are typically removed from the text CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/179~il before the scores of terms in the text are computed. A more general approach to recognizing synonyms is to use a revised ll-ea~we ofthe ~1iet~n~e between textual attribute vectors V and U, namely arccos(AV(AU)' /sqrt (AV(AV)t AU(AU)t), where the matrix A
is the dimensionality-red~lcing linear ll~tllsrollllaLion (or an apploxi.~ ion thereto) - S determined by collecting the vector values of the textual attribute, for all target objects known to the system, and applying singular value decomposition to the reslllting collec,tion.
The same approach can be applied to the vector values of associative attributes. The above definitions allow us to determine how close together two target objects are with respect to a single attribute, wl~c~Lhel nllm~ric; ~eSoci~tive~ or textual. The fliet~nre between two target objects X and Y with respect to their entire multi-attribute profiles Px and Py is then denoted d(X,Y) or d(PX, Py) and defined as:
(((~ist~n~ e with respect to attribute a)(weight of attribute a))k + ((tliet~nsewith respect to attribute b)(weight of attribute b))k + ((~lict~nce with respectto attribute c)(weight of attribute C))k + ,,, )k where k is a fixed positive real number, typically 2, and the weights are non- negative real numbers indicating the relative importance of the various attributes. For ~x~mrle, if the target objects are consumer goods, and the weight of the "color" attribute is colllpal~lively very small, then price is not a consideration in determining similarity: a user who likes a brown m~es~ge cushion is predicted to show equal interest in the same cushion m~mlf~ctllred in blue, and vice-versa. On the other hand, if the weight of the "color"
attribute is col.lp~-lively very high, then users are predicted to show interest primarily in products whose colors they have liked in the past: a brown massage cushion and a blue massage cushion are not at all the same kind of target object, however similar in other attributes, and a good experience with one does not by itself inspire much interest in the other. Target objects may be of various sorts, and it is som~tim~s advantageous to use a single system that is able to compare tar get objects of distinct sorts. For example, in a system where some target objects are novels while other target objects are movies, it is desirable to judge a novel and a movie similar if their profiles show that similar users like them (an associative attribute). However, it is important to note that certain attributes specified in the movie's target profile are nnd~fined in the novel's target profile, and vice versa: a novel has no "cast list" associative attribute and a movie has no "reading level"
numeric attribute. In general, a system in which target objects fall into distinct sorts may s.,",~l;",es have to measure the similarity oftwo target objects for which somewhat di~elenl CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981 sets of attributes are dP-fin~d This requires an ext~n.cit n to the rliet~nr~e metric d(*,*) defined above. In certain applications, it is sufflcient when cal-yhlg out such a co-llp~ison simply to disregard ~LLLlibuLes that are not defined for both target objects: this allows a cluster of novels to be m~t~hed with the most similar cluster of movies, for example, by S cons;~l~.rin~ only those attributes that novels and movies have in common However, while this method allows comparisons between (say) novels and movies, it does not define a proper metric over the collll~illed space of novel s and movies and therefore does not allow clustering to be applied to the set of all target objects. When ntocess~ry for clustering or other purposes, a metric that allows colllp~ison of any two target objects (whether of the same or diLrerell~ sorts) can b e defined as follows. If a is an attribute, then let Max(a) be an upper bound on the distance between two values of attribute a; notice that if attribute a is an associative or textual attribute, this distance is an angle determined by arccos, so that Max( a) may be chosen to be 180 degrees, while if attribute a is a numeric attribute, a sufficiently large number must be selected by the system deeign~ors. The ~ t~n~e between two values of attribute a is given as before in the case where both values are d~fin~1; the distance between two lln-lefined values is taken to be zero; finally, the ~1iSt~nce between a defined value and an ~m-1~fined value is always taken to be Max(a)/2. This allows us to determine how close together two target objects are with respect to an attribute a, even if attribute a does not have a defined value for both target objects. The ~ t~nce d(*,*) between two target objects with respect to their entire multi-attribute profiles is then given in terms of these individual attribute ~i~t~ncçs exactly as before. It is assumed that one ~ttrih~-te in such a system specifies the sort of target object ("movie", "novel", etc.), and that this attribute may be highly weighted if target objects of diL[~ sorts are considered to be very di~elenl despite any attributes they may have in common.
Ul~ l7lNG T)IE SIMILARITY MEASUREME~T
IVls~trhin~ Buyers and Sellers A simple application of the similarity measurement is a system to match buyers with sellers in small-volume markets, such as used cars and other used goods, artwork, or employment. Sellers submit profiles of the goods (target objects) they want to sell, and buyers submit profiles of the goods (target objects) they want to buy. Partirip~nt~ may submit or withdraw these profiles at any time. The system for customized electronic identification of desirable objects computes the similarities between seller-submitted profiles and buyer-submitted profiles, and when two profiles match closely (i.e., the CA 0223601~ 1998-04-27 W O 97/16796 PCTAJS96/179~l similarity is above a threshold), the corresponding seller and buyer are notified of ~ ach other's identities. To prevent users from being flooded with responses, it may be desirable to limit the number of notifications each user receives to a fixed number, such as ten per day.
S Filtering: Relevance Fee(lh~
A filtering system is a device that can search through many target objects and estim~te a given user's interest in each target object, so as to identify those that are of greatest interest to the user. The filtering system uses relevance feed back to refine its knowledge of the user's interests: whenever the filtering system icl~ntifies a target object as 10 potentially i.ll~;. e~Lillg to a user, the user (if an on-line user) provides feedback as to whether or not that target object really is of interest. Such feedbacl~ is stored long-terrn in sull--l~.~ed form, as part of a clatab~ce of user fee~back h~c,llll~Lion, and may be provided either actively or passively. In active feedback, the user explicitly indicates his or her interest, for instance, on a scale of-2 (active distaste) through 0 (no special interest) to 10 15 (great interest). In passive feedb~ the system infers the user's interest from the user's behavior. For tox~mple7 if target objects are textual docl-m~nt~, the system might monitor which docllmçntc the user chooses to read, or not to read, and how much time the use}
spends reading them. A typical formula for ~.cçcci"g interest in a document via passive feedback, in this dom~in~ on a scale of 0 to 10, might be:
20 + 2 if the second page is viewed, + 2 if all pages are viewed, + 2 if more than 30 seconds was spent viewing the docllmf~nt + 2 if more than one minute was spent viewing the document, + 2 if the minlltes spent viewing the document are greater than half the number of pages.
If the target objects are electronic mail .. ~ , interest points might also be added in the case of a particularly lengthy or particularly prompt reply. If the target objects are purchasable goods, interest points might be added for target objects that the user actually purchases, with further points in the case of a large-quantity or high-price purchase. In any domain, further points might be added for target objects that the user accecces early in a 30 session, on the grounds that users access the object s that most interest them first. Clther potential sources of passive feedb~ include an electronic measurement of the extent to which the user's pupils dilate while the user views the target object or a description of the target object. It is possible to col.lbi,.e active and passive feedback. One option is to take CA 0223601~ 1998-04-27 a w~ighled average of the two ratings. Another option is to use passive feedbact~ by default, but to allow the user to .oY~minc and actively modify the passive feedbarl~ score. In the scenario abo ve, for instance, an uninteresting article may snmetimeS remain on the display device for a long period while the user is engaged in unrelated business, the passive ~eedbaçk score is then inapl)lopliately high, and the user may wish to correct it before contim~ing In the plerellGd embodiment of the invention, a visual in~lic~tor, such as a sliding bar or in~lic~tor needle on the user's screen, can be used to continllollcly display the passive feedb~r1~ score çstim~ted by the system for the target object being viewed, unless the user has m~ml~lly adjusted the in~ tor by a mouse operation or other means in order to refiect a di~elellL score for this target object, after which the indicator displays the active feedback score selected by the user, and this active feedbac~ score is used by the system instead of the passive feeclback score. In a variation, the user cannot see or adjust the indicator until just after the user has fini~hed viewing the target object. Regardless how a user's feedb~çk is comrutetl~ it is stored long-term as part of that user's target profile interest Sllmm~ry, Filtering: Determining Topical Interest Through Similarity Relevance feedback only determines the user's interest in certain target objects:
namely, the target objects that the user has actually had the opportunity to evaluate (whether actively or passively). For target objects that the user has not yet seen, the filtering system must çstim~te the user's interest. This estimation task is the heart of the filtering problem, and the reason that the similarity measurement is hll~ol L~IL. More concretely, the pl~r~ll ed embodiment of the filtering system is a news d;~pillg service that periodically pl esel"s the user with news articles of potential interest. The user provides active and/or passive feedback to the system relating to these presented articles. However, the system does not have feedb~c.l~ information from the user for articles that have never been presented to the user, such as new articles that have just been added to the database, or old articles that the system chose not to present to the user. Similarly, in the dating service domain where target objects are prospective romantic palLlle~, the system has only received feedb~çk on old flames, not on prospective new loves.
As shown in flow diagram forrn in Figure 12, the evaluation of the likelihood ofinterest in a particular target object for a specific user can autom~tir~lly be computed. The interest that a given target object X holds for a user U is assumed to be a sum of two q~l~ntiti~S q~U, X), the intrinsic "quality" of X plus f(U, X), the "topical interest" that users CA 0223601~ 1998-04-27 W O 97/16796 PCTAUS96/179~1 like U have in target objects like X. For any target object X, the intrinsic quality measure q(U, X) is easily ~ctim~ted at steps 1201-1203 directly from numeric attributes ofthe target object X. The computation process begins at step 1201, where certain deci~n~ted nurneric attributes of target object X are specifically selected, which attributes by their very nature - 5 should be positively or negatively correlated with users' interest. Such attributes, termed "quality attributes," have the normative plcJpel~y that the higher (or in some cases lower) their value, the more illL~l esLiilg a user is expected to find them. Quality attributes of target object X may incltl~e~ but are not limited to, target object X's popularity among users in general, the rating a particular reviewer has given target object X the age (time since authorship -- also known as out~1~tetlnecs) of target object X, the number of vulgar words used in target object X, the price of target object X, and the amount of money that the company selling target object X has donated to the user's favorite charity. At step 1202, each of the selected attrikl~tes is multiplied by a positive or negative weight indicative of the strength of user U's plerelellce for those target objects that have high values for this attribute, which weight must be retrieved from a data file storing quality attribute weights for the selected user. At step 1203, a weighted sum of the identified weighted selected attributes is compllted to det.ormine the intrinsic quality measure q(U, X). At step 1204, the summarized weighted relevance fee-lk~cL- data is retrieved, wherein some relevance fee-lbar.k points are weighted more heavily than others and the stored relevance data can be sllmm~rized to some degree, for example by the use of search profile sets. The rnore difficult part of d~Lel ll.il"llg user ~s interest in target object X is to find or compute at step 1205 the value off(U, X), which denotes the topical interest that users like U generally ]have in target objects like X. The method of determining a user's interest relies on the follo~ving hPllrictic:: when X and Y are similar target objects (have similar attributes), and U and ~,' are similar users (have similar attributes), then topical interest f(U, ~) is predicted to have a similar value to the value of topical interest f(V, Y). This heuristic leads to an effective method because e~ ed values of the topical interest function f(*, *) are actually know n for certain argllmpntc to that function: specifically, if user V has provided a relevance-feedback rating of r(V, Y) for target object Y, then insofar as that rating lc~plest;;ll~s user Vls true interest in target object Y, we have r(V, Y) = q(V, Y) + f~V, Y) and can çstim~te f(V, Y) as r(V, Y) - q(V, Y). Thus, the problem of çstim~ting topical interest at all points becomes a problem of interpolating among these estim~tes of topical interest at selected points, such as the feedback estim~te of f(V, Y) a s r(V, Y) - q(V, Y). This CA 0223601~ 1998-04-27 interpolation can be accomplished with any standard smoothing technique, using as input the known point estim~tes of the value of the topical interest function f(*, *), and determining as output a function that ~pl~x;~ leS the entire topical interest function f(*, )-Not all point estim~t~s of the topical interest function f(*, *) should be given equal weight as inputs to the smoothing algoliLhln. Since passive relevance fee~lhack is less reliable than active relevance feedbac~, point estim~tes made from passive relevance feedback should be weighted less heavily than point çstim~tÇs made from active relevance feedba~lr, or even not used at all. In most domains, a user's interests may change over time and, therefore, estim~teS oftopic al interest that derive from more recent feedback should also be weighted more heavily. A user's interests may vary according to mood, so estim~tes of topical interest that derive from the current session should be weighted more heavily for the duration of the current session, and past estim~tes of topical interest made at approxim~tçly the current time of day or on the current weekday should be weighted more heavily. Finally, in domains where users are trying to locate target objects of long-term interest (inv~s~ r~ romantic pa~le~:j, pen pals, employers, employees, suppliers, service providers) from the possibly meager il~ol ..l~lion provided by the target profiles, the users are usually not in a position to provide reliable imme~ te feedback on a target object, but can provide reliable feedback at a later date. An estim~te of topical interest f~V, Y) should 20 be weighted more heavily if user V has had more experience with target object Y. Indeed, a useful strategy is for the system to track long-term feedback for such target objects. For example, if target profile Y was created in 1990 to describe a particular investment that was available in 1990, and that was purchased in 1990 by user V, then the system solicits relevance feedba(ck from user V in the years 1990, 1991, 1992, 1993, 1994, 1995, etc., and 25 treats these as s~lccec~ively stronger inl1ications of user V's true interest in target profile Y, and thus as indications of user V's likely interest in new invectm~nte whose current profiles resemble the original 1990 illvc~ profile Y. In particular, if in 1994 and 1995 user V
is well-disposed toward his or her 1990 purchase of the investnnent described by target profile Y, then in those years ard later, the system tends to recornmend additional 30 investments when they have profiles like target profile Y, on the grounds that they too will turn out to be sati~f~ctQry in 4 to 5 years. It makes these recomm~n~tions both to user V
and to users whose inve~L..le..L portfolios and other attributes are similar to user V's. The relevance feedback provided by user V in this case may be either active (feedback =

CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/179~1s~ticf~ction ratings provided by the h~ve~Lor V) or passive (feedb~ck = difference bet~Neen average annual return of the in~ and average annual return o f the Dow Jones index portfolio since purchase of the investmPnt, for example).
To effectively apply the smoothing techniq~-ç, it is necess~ry to have a definition of 5 the similarity ~ t~n~.e between (U, X) and (V, Y), for any users U and V and any target objects X and Y. We have already seen how to define the fli~t~n~e d(X, Y) between two target objects X and Y, given their attributes. We may regard a pair such as (U, X) .lS an Pxtf~nded object that bears all the attributes of target X and all the attributes of user U; then the distance between (U, X) and (V, Y) may be computed in exactly the same way. This 10 approach requires user U, user V, and all other users to have some attributes of their o-,vn stored in the system: for example, age (numeric), social security number (textual), and list of docllments previously retrieved (associative). It is these attributes that determine the notion of "similar users." Thus it is desirable to generate profiles of users (termed ''luser profiles") as well as profiles of target objects (termed "target profiles"). Some attributes 15 employed for profiling users may be related to the attributes employed for profiling target objects: for example, using associative attributes, it is possible to characterize target objects such as X by the interest that various users have shown in them, and ~imlllt~neouslly to characterize users such as U by the interest that they have shown in various target objects.
In addition, user profiles may make use of any attributes that are useful in characteriizing 20 hllm~n~, such as those s~ggçsted in the example domain above where target object!i are potential con~llm~rs Notice that user U's interest can be estim~ted even if user U is a new user or an off-line user who has never provided any feedb~k because the relevance feedback of users whose attributes are similar to U's attributes is taken into account.
For some uses offltering systems, when çsl ;I ~ g topical interest, it is applop-iate 25 to make an additional "presumption of no topical interest" (or "bias toward zero"). To understand the ~sp~fillnçss of such a presumption, suppose the system needs to deterJ~nine whether target object X is topically interesting to the user U, but that users like user U have never provided feedb~lr on target objects even remotely like target object X. The presumption of no topical interest says that if this is so, it is because users like user U are 30 simply not interested in such target objects and therefore do not seek them out and interact with them. On this presumption, the system should estim~te topical interest f(U, X) to be low. Formally, this P.x~mple has the characteristic that (U, X) is far away from all the points (V, Y) where feedback is available. In such a case, topical interest f(U, X) is presumed to CA 0223601~ 1998-04-27 be close to zero, even if the value of the topical interest function f(*, *) is high at all the faraway surrounding points at which its value is known. When a smoothing technique is used, such a pl~su~ ,lion of no topical interest can be introduced, if applupliate, by manipulating t he input to the smoothing technique. In addition to using observed values 5 ofthe topical interest fimr.tion f~*, *) as input, the trick is to also introduce fake observations of the form topical interest f(V, Y) = O for a lattice of points (V, Y) distributed throughout the mlllti-limPn~ional space. These fake observations should be given relatively low weight as inputs to the smoothing ~ rithm The more strongly they are w~ighte-l the stronger the presumption of no interest.
The following provides another simple example of an estim~tion technique that has a prçsl-mrtiQn of no interest. Let g be a dec ~il-g function from non-negative real numbers to non-ne~,~Livt; real numbers, such as g(x) =ex or g(x) = min(l, x-k ) where k > 1. F~tim~te topical interest f(U, X) with the following g-weighted average:

U ~ VO -q( Y,Y)) *g(distance g ( U,X)A (V, Y)) ~g(distances(U, V)/~ V,Y)) Here the s~mm~tiQns are over all pairs (V, Y) such that user V has provided fee~lb~ck r(V, Y) on target object Y, i.e., all pairs (V, Y) such that relevance feedback r(V, Y) is c~fin~l Note that both with this technique and with conv~ontit-n~l smoothing techniques, the estim~te of the topical interest f(U, X) is not necessarily equal to r(U, X) -q(U, X ), even when r([J, X) is (lefinecl Filtering: Adjusting Weights and Residue Fee-lh ~k The method described above requires the filtering system to measure distances between (user, target object) pairs, such as the rli~t~n~e between (U, X) and (V, Y). Given the means described earlier for measuring the rli~t~nce between two multi-attribute profiles, the method must ILeleru,e ac~or;~te a weight with each attribute used in the profile of (user, target object) pairs, that is, with each attribute used to profile either users or target objects.
2~ These weights specify the relative importance of the attributes in establishing similarity or difference, and therefore, in determining how topical interest is generalized from one (user, target object) pair to another. Additional weights determine which attributes of a target object contribute to the quality function q, and by how much. It is possible and often desirable for a filtering system to store a di~e, ~ set of weights for each user. For example, a user who thinks of two-star films as having materially diLrelenl topic and style from CA 0223601~ 1998-04-27 W O 97/16796 PCTAUS96/179,~1 four-star films wants to assign a high weight to "number of stars" for purposes of the similarity ~ t~n~e measure d(*, *); this means that interest in a two-star f;llm does not necessarily signal interest in an otherwise similar four-star film, or vice-versa. If the user also agrees with the critics, and actually prefers four-star films, the user also wants to assign - S "number of stars" a high positive weight in the dt;Lc,.. l~ alion of the quality function q. In the same way, a user who dislikes vulgarity wants to assign the "vulgarity score" attriibute a high negative weight in the dt;l~llli~llalion of the quality fim- tion q, although the "vulgarity score" attribute does not necessarily have a high weight in determining the topical similarity of two films.
10 ~ttrihl~te weights (ofboth sorts) may be set or adjusted by the system arimini~tl-ator or the individual user, on either a temporary or a permanent basis. However, it is often desirable for the filtering system to learn attribute weights autom~tic~lly~ based on relev,ance feedback The optimal attribute weights for a user U are those that allow the most accwrate prediction of user ~s interests. That is, with the distance measure and quality func:tion 15 defined by these attrihlltc weights, user U's interest in target object X q(U, X) + f(U, X), can be accurately e~ ted by the techniques above. The effectiveness of a particular set of attrihute weights for user U can therefore be gauged by seeing how well it predicts user U's known interests.
Formally, suppose that user U has previously provided feerlkacl~ on target objects 20 Xl, ~, X3, ... Xn, and that the feedback ratings are r(U, X,), r(U, X2), r(U, X3), r(U, Xn).
Values of fee~hack ratings r(*,*) for other users and other target objects may also be known.
The system may use the following procedure to gauge the effectiveness of the set of attribute weights it currently stores for user U: (I) For each 1 <= I <= n, use the estim~tion techniques to estim~te q(U, Xl) + f(U, Xi) from all known values of feedbac~ ratings r. Call this 25 estimate ai. (ii) Repeat step (i), but this time make the estim~te for each 1 <= i ~:= n without using the feeclhack ratings r(U, ~) as input, for any j such that the distance d(X;, XJ) is smaller than a fixed threshold. That is, estim~te each q(U, X;) + f(U, X;) from other values of feedbacL- rating r only; in particular, do not use r(U, X;) itself. Call this estin-~te bi. The di~. ~nce ai - b; is herein termed the "residue feedhaf k rl~,(U, Xi) of user U on target 30 object Xi." (iii) Compute user U's error measure, (al - bl)2+ (a2 - b2)2 + (a3 - b3)2+ ... + (an -b") .
A gradient-descent or other numerical o~ n method may be used to adjust user U's attribute weights so that this error measure reaches a (local) ",i,)"",."~ This CA 0223601~ 1998-04-27 W O97/16796 PCT~US96/17981approach tends to work best if the smoothing technique used in estim~tion is such that the value of f(V, Y) is strongly affected by the point estim~te r(V, Y) - q(V, Y) when the latter value is provided as input. Otherwise, the presence or absence of the single input feedback rating r(U, Xi), in steps (i)-(ii) may not make ai and bi very different from each other. A
5 slight variation of this learning teçhniqlle adjusts a single global set of at tribute weights for all users, by ~djlleting the weights so as to . . .; I-;. . .i,e not a particular user's error measure but rather the total error measure of all users. These global weights are used as a default initial setting for a new user who has not yet provided any feedb~ck Gradient descent can then be employed to adjust this user's individual weights over time. Even when the attribute 10 weights are chosen to . . .; . .; . . .; ,e the error measure for user U, the error measure is generally still positive, m~aninp~ that residue feedb~-~.k from user U has not been reduced to 0 on all target objects. It is useful to note that high residue f~b~ from a user U on a target object X indicates that user U liked target object X unexpectedly well given its profile, that is, better than the smoothing model could predict from user U's opinions on target objects with 15 similar profiles. Similarly, low residue fee~lb~lr indicates that user U liked target object X
less than was expected. By definition, this unexplained ~-~;r~;lence or dispreference cannot be the result of topical similarity, and therefore must be regarded as an inrlir.~tiQn of the intrinsic quality of target object X. It follows that a useful quality attribute for a target object X is the average amount of residue feedba~l~ r,es(V, X) from users on that target 20 object, averaged over all users V who have provided relevance feedback on the target object.
In a variation of this idea, residue feedb~ck is never averaged indiscriminately over all users to form a new attribute, but instead is smoothed to consider users' similarity to each other.
Recall that the quality measure q(U, X) depends on the user U as well as the target object X, so that a g*en target object X may be perceived by different users to have di~e.enl 25 quality. In this variation, as before, q(U, X) is calculated as a weighted sum of various quality attributes that are dependent only on X, but then an additional term is added, namely an estim~te of r~ (U, X) found by applying a smoothing al~,oliLl~ to known values of rl~5 (V, X~. Here V ranges over all users who have provided relevance feedback on target object X and the smoothing algorithm is sensitive to the t1ict~nce~e d(U, V) from each such user V
30 to user U.
Using the Similarity Computation for Cll.sl~. ~
A method for dtofining the distance between any pair of target objects was disclosed above. Given this distance measure, it is simple to apply a standard clustering algorithm, CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/179~1 such as k- means, to group the target objects into a number of clusters, in such a way that similar target objects tend to be grouped in the same cluster. It is clear that the res ll~ting clusters can be used to improve the efficiency of msltching buyers and sellers in the application described in section "Mslt~hing Buyers and Sellers" above: it is not necçssslry to ~Illp~ every buy profile to every sell profile, but only to compale buy profiles and sell profiles that are similar enough to appear in the same cluster. As exrlslined below, the results of the cl~tering procedure can also be used to make filtering more efficient, and in the service of querying and browsing tasks.
The k-means clustering method is familiar to those skilled in the art. Briefly put, it finds a grouping of points (target profiles, in this case, whose numeric coordinates are given by numeric decomposition of their attributes as described above) to ~ e the t1i~tsln~e between points in the clusters and the centers of the clusters in which they are located. This is done by alternating between sl~igning each point to the cluster which has the nearest center and then, once the points have been 51~'; i neri, colllpu~ g the (new) center of e ach cluster by averaging the coordinates of the points (target profiles) located in this cluster.
Other clustering methods can be used, such as "soft" or "fuzzy" k-means clustering, in which objects are allowed to belong to more than one cluster. This can be cast ~ a clustering problem similar to the k-means problem, but now the criterion being opLillli~ed is a little different:

~i ~C iiC d~Xi, Xc) where C ranges over cluster numbers, i ranges over target objects, xi is the numeric vector corresponding to the profile of target object number i, xc is the mean of all the numeric vectors corresponding to target profiles of target objects in cluster number C, termed the "cluster profile" of cluster C, d(*, *) is the metric used to measure ~ t~nce between two target profiles, and ic is a value between 0 and 1 that inllic.~tçs how much target object number i is associated with cluster number C, where i is an indicator matrix with the property that for each i, Sl~M SU~ C I SUB iC = 1. For k-means clustering, iiC is either 0 or 1.
Any of these basic types of clustering might be used by the system:
1) Association-based clustering, in which profiles contain only associative attributes, and thus distance is defined entirely by associations. This kind of CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981 clustering generally (a) clusters target objects based on the ~imil~nty of the users who like them or (b) clusters users based on the similarity of the target objects they like. In this app.~,a.,h, the system does not need any i~ maLion about target objects or users, except for their history of interaction with each other.
2) Content-based ri~l~tçring in which profiles contain only non-associative attributes. This kind of clustering (a) clusters target objects based on the similarity of their non-associative attributes (such as word frequencies) or (b) clusters users base d on the ~llil~ il~ of their non-associative attributes (such as demographics and psychographics). In this approach, the system does not need to record any information about users' historical patterns of information access, but it does need information about the intrinsic properties of users and/or target objects.
3) Uniform hybrid method, in which profiles may contain both associative and non-associative attributes. This method combines la and 2a, or lb and 2b. The e d(PX, Py) between two profiles Px and Py may be computed by the general similarity-measurement methods described earlier.
4) Sequential hybrid method. First apply the k-means procedure to do la7 so that articles are labeled by cluster based on which user read them, then use supervised dustering (,.. ~x;.. -- likelihood dis~,lh~ L methods) using the word frequencies to do the process of method 2a described above. This tries to use knowledge of who read what to do a better job of clustering based on word frequ~ncies One could similarly combine the methods lb and 2b described above.

Hierarchical clustering of target objects is often useful. Hierarchical clustering produces a tree which divides the target objects first into two large clusters of roughly similar objects; each of these clusters is in turn divided into two or more smaller clusters, which in turn are ea ch divided into yet smaller clusters until the collection of target objects has been entirely divided into "clusters" con~i~ting of a single object each, as diagrammed in Figure 8 In this diagram, the node d denotes a particular target object d, or equivalently, a single-member cluster consi~7Lillg of this target object. Target object d is a member of the cluster (a, b, d), which is a subset ofthe cluster (a7 b, c, d, e, f), which in turn is a subset of all target objects. The tree shown in Figure 8 would be produced from a set of target objects such as those shown geometrically in Figure 7. In Figure 7, each letter lep~esellls a target object, and axes xl and x2 ~ s~lll two of the many numeric attributes on which the target CA 0223601~ 1998-04-27 W O 97/16796 PCTAJS96/179~,1 objects differ. Such a cluster tree may be created by hand, using human judgment to f'orm clusters and subclusters of similar objects, or may be created autom~tis~lly in either of two standard ways: top-down or bottom-up. In top-down hierarchical clustering, the set of all target objects in Figure 7 would be divided into the clusters (a, b, c, d, e, f) and (g, h, i, j, k).
5 The clllet~ring algorithm would then be reapplied to the target objects in each cluster, so that the cluster (g, h, i, j, k) is subpartitioned into the clusters (g, k) and (h, i, j), and so on to arrive at the tree shown in Figure 8. In bottom-up hierarchical clustering, the set of all target objects in Figure 7 would be grouped into numerous small clusters, namely (a, b), d, (c, f), e, (g,k), (h, i), and j. These clusters would then themselves be grouped into the larger 10 clusters (a, b, d), (c, e, f), (g, k), and (h, i j), acco,di"g to their cluster profiles. These larger clusters would themselves be grouped into (a, b, c, d, e, f) and (g, k, h, i, j), and so on until all target objects had been grouped togeth~.r, resulting in the tree of Figure 8 . Note that for bottom-up clustering to work, it must be possible to apply the clustering algo,il~ to a set of existing clusters. This requires a notion of the rliet~nce between two clusters. The 15 method disclosed above for measuring the ~liet~nce between target objects can be applied directly, provided that clusters are profiled in the same way as target objects. It is only n~cçee~ry to adopt the convention that a cluster's profile is the average of the target profiles of all the target objects in the cluster; that is, to determine the cluster's value for a p~ven attribute, take the mean value ofthat attribute across all the target objects in the cluster. For 20 the mean value to be well-defined, all attributes must be numeric, so it is necess~ry as usual to replace each textual or associative attribute with its decomposition into numeric attributes (scores), as described earlier. For example, the target profile of a single Woody Allen film would assign "Woody-Allen" a score of 1 in the "name-of-director" field, while giving "Federico-Fellinii" and "Terence-Davies" scores of 0. A cluster that consisted of 20 films 25 directed by Allen and 5 directed by Fellini would be profiled with scores of 0.8, 0.2, a~nd 0 respectively, bec~lleç, for example, 0.8 is the average of 20 ones and 5 zeros.
Searching for Target Objects Given a target object with target profile P, or alternatively given a search profile P, a hierarchical cluster tree of target objects makes it possible for the system to search 30 efficiently for target objects with target profiles similar to P. It is only necess~rily to navigate through the tree, autom~tic~lly, in search of such target profiles. The system for customized electronic identification of desirable objects begins by considering the largest, top-level clusters, and selects the cluster whose profile is most similar to target profile P.

CA 0223601~ 1998-04-27 In the event of a near-tie, multiple clusters may be selectec~ Next, the system considers all s~ S~ ofthe selected clusters, and this time selects the subcluster or subclusters whose profiles are closest to target profile P. This r~fin~m~nt process is iterated until the clusters selected on a given step are sufficiently small, and these are the desired clusters of target S objects with profiles most similar to target profile P. Any hierarchical cluster tree therefore serves as a decision tree for identifying target objects. In pseudo-code form, this process is as follows (and in flow diagram form in Figures 13A and 13B):
1. Tniti~1i7e list of identified target objects to the empty list at step 13A00 2. Tniti~li7e the current tree T to be the hierarchical cluster tree of all objects at step 13A01 and at step 13A02 scan the current cluster tree for target objects similar to P, using the process detailed in Figure 13B. At step 13A03, the list of target objects is returned.
3. At step 13B00, the variable I is set to 1 and for each child subtree Ti ofthe root of tree T, is retrieved.
4. At step 13B02, calculate d(P, Pi), the similarity ~ t~nce between P and Pi, 5. At step 13B03, if d(P, Pi) < t, a threshold, branch to one of two options
6. If tree Ti contains only one target object at step 13B04, add that target object to list of identified target objects at step 13B05 and advance to step 13B07.
7. If tree Ti cont~in~ multiple target objects at step 13B04, scan the ith child subtree for target objects similar to P by invoking the steps of the process of Figure 13B
recursively and then recurse to step 3 (step 13A01 in Figure 13A) with T bound for the duration ofthe recursion to tree Ti in order to search in tree Ti for target objects with profiles similar to P.
In step 5 of this pseudo-code, smaller thresholds are typically used at lower levels of the tree, for example by making the threshold an affine function or other function of the cluster variance or cluster ~i~met~r of the cluster Pi. If the cluster tree is distributed across a plurality of servers, as described in the section of this description titled "Network Context of the Browsing System", this process may be .oxecllte~l in distributed fashion as follows:
steps 3-7 are executed by the server that stores the root node of hierarchical cluster tree T, and the recursion in step 7 to a subcluster tree Ti involves the tr~n~mi~ion of a search request to the server that stores the root node of tree Ti, which server carries out the recursive step upon receipt of this request. Steps 1 -2 are carried out by the processor that CA 0223601~ 1998-04-27 W O 97/16796 PCTAUS96/179~1 initiates the search, and the server that executes step 6 must send a message identifying the target object to this initi~ting processor, which adds it to the list.
A~.. ,.g that low-level clusters have been already been formed through clustering, there are alternative search methods for identifying the low-level cluster whose profile is S most similar to a given target profile P. A standard back-propagation neural net is one such method: it should be trained to take the attributes of a target object as input, and produce as output a unique pattern that can be used to identify the app~u~ liate low-level cluster. For m~ciml1m accuracy, low-level clusters that are similar to each other (close together in the cluster tree) should be given similar idellliryillg patterns. Another approach is a ~L~l,iald 10 de-~ision bree that con~ prs the ~ttrihl~tes of target profile P one at a time until it can ide,ntify the appropriate cluster. If profiles are large, this may be more rapid than considering all attributes. A hybrid approach to seal-,lfing uses distance measurements as described above to navigate through the top few levels of the hierarchical cluster tree, until it reaches an cluster of intermedi~te size whose profile is similar to target profile P, and then continues 15 by using a decision tree speci~li7Pd to search for low-level subclusters ofthat intçrmç~ te cluster.
One use of these sealcl~illg techniques is to search for target objects that match a search profile from a user's search profile set. This form of sealching is used repeatedly in the news clipping service, active navigation, and Virtual Community Service applications, 20 described below. Another use is to add a new target object quickly to the cluster tree. An ~rieting cluster that is similar to the new target object can be located rapidly, and the new target object can be added to this cluster. If the object is beyond a certain threshold dist;mce from the cluster center, then it is advisable to start a new cluster. Several variants of this incremental clustering scheme can be used, and can be built using variants of subroutines 25 available in advanced statistical packages. Note that various methods can be used to locate t he new target objects that must be added to the cluster tree, depending on the ar~hitectllre used. In one method, a "webcrawler" program running on a central computer periodically scans all servers in search of new target objects, calculates the target profiles of these objects, and adds them to the hierarchical cluster tree by the above method. In anolther, 30 whenever a new target object is added to any ofthe servers, a software "agent" at that server calculates the target profile and adds it to the hierarchical cluster tree by the above method.

CA 0223601~ 1998-04-27 ~0 97/16796 PCTAJS96/17981 Rapid Profiling In some dom~ine~ c~ mp'~e profiles oftarget objects are not always easy to construct automatically. When target objects are wallpaper paL~ s, for example, an attribute such as "genre" (a single textual term such as "Art-Deco," "Children's," "Rustic," etc.) may be a 5 matter of j~ grn~ont and opinion"lifficlllt to det~rmin~ except by cone llting a human. More significantly, if each wallpaper pattern has an associative attribute that records the positive or negative relevance feedb~ck to that pattern from various human users (cone lm~.rs), then all the association scores of any newly introduced pattern are initially zero, so that it is initially unclear what other patterns are similar to the new pattem with respect to the users 10 who like them. Indeed, if this associative attribute is highly weightetl, the initial lack of relevance feedback h~ollllaLion may be difficult to remedy, due to a vicious circle in which users of moderate-to-high interest are needed to provide relevance feedb~cl~ but relevance feedback is needed to identify users of moderate-to-high interest. Fortunately, however, it is often possible in principle to determine certain attributes of a new target object by 15 extraordinary methods, in~.lu-lin~ but not limited to methods that consult a human. For example, the system can in principle determine the genre of a wallpaper pattern by conelllting one or more randomly chosen individuals from a set of known human experts, while to determine the numeric association score between a new wallpaper pattern and a particular user, it can in principle show the pattern to the that user and obtain relevance 20 feedback. Since such requests inconvenience people, however, it is important not to determine all difflcult attributes this way, but only the ones that are most important for purposes of classifying the docllment "Rapid profiling" is a method for selecting those numeric attributes that are most important to dt;Le-lllille. (Recall that all attributes can be decomposed into numeric attributes, such as association scores or term scores.) First, a set 25 of existing target objects that already have complete or largely complete profiles are clustered using a k-means algo-iLIIlll. Next, each of the resllhing clusters is ~eei ned a unique identifying number, and each clustered target object is labeled with the identifying number of its cluster. Standard methods then allow construction of a single deciei~n tree that can determine any target object's cluster number, with substantial accuracy, by 30 considering the attributes of the target object, one at a time. Only attributes that can if necessary be determined for any new target object are used in the construction of this decision tree. To profile a new target object, the decision tree is traversed downward from its root as far as is desired. The root of the decision tree considers some attribute of the -CA 0223601~ 1998-04-27 W O 97/16796 PCTAJS96/17981target object. If the value of this attribute is not yet known, it is determined by a mel:hod app-~,p-iale to that ~ le, for ~mpl.o, if the attribute is the association score of the target object with user #4589, then relevance feeclback (to be used as the value of this attribute) is solirited from user #4589, perhaps by the ruse of adding the possibly uninteresting target 5 object to a set of objects that the system leco-""~ e to the user's attention, in order to find out what the user thinks of it. Once the root attribute is det~rrnine-l, the rapid profiling method d~ec~n~e the de~eion tree by one level, choosing one of the decision ~ul)L ees of the root in accordance with the determined value of the root attribute. The root of this chosen subtree considers another attribute of the target object, whose value is likewise determined 10 by an ~plopli~le method. The process c an be repeated to determine as many attributes as desired, by whatever methods are available, although it is ordinarily stopped after a small number of attributes, to avoid the burden of determining too many attributes.
It should be noted that the rapid profiling method can be used to identify impoltant attributes in any sort of profile, and not just profiles of target objects. In particular, recall 15 that the disclosed method for d~lel-. i.~ g topical interest through similarity requires users as well as target objects to have profiles. New users, like new target objects, may be profiled or partially profiled through the rapid profiling process. For example, when user profiles include an associative attribute that records the user's relevance fee~lb~c k on a ll target objects in the system, the rapid profiling procedure can rapidly form a rough 20 characterization of a new user's interests by soliciting the user's feedback on a small number of eignific~nt target objects, and perhaps also by determining a small n umber of other key attributes of the new user, by on-line queries, telephone surveys, or other means. Once the new user has been partially profiled in this way, the methods disclosed above predict that the new user's interests resemble the known interests of other users with similar profiles.
25 In a variation, each user's user profile is subdivided into a set of long-term attributes, such as demographic characteristics, and a set of short-term attributes that help to identi~y the user's temporary desires and emotional state, such as the user's textual or multiple-choice answers to questions whose answers reflect the user's mood. A subset of the user's long-term attributes are determined when the user first registers with the system, through 30 the use of a rapid profiling tree of long-term attributes. In addition, each time the user logs on to the system, a subset of the user's short-term attributes are additionally determined, through the use of a separate rapid profiling tree that asks about short-term attributes Market Research CA 0223601~ 1998-04-27 W O 97/16796 PCTAUS96/17981A technique similar to rapid profiling is of interest in market lesealch (or voter research). Suppose that the target objects are con~l-mers. A particular attribute in each target profile in~ t-o~ whether the consumer described by that target profile h as purchased product X. A decision tree can be built that attempts to determine what value a consumer 5 has for this attribute, by consideration of the other attributes in the consumer's profile. This decision tree may be traversed to det~rrnine whether additional users are likely to purchase product X. More generally, the top few levels of the decision tree provide il.rol~l,alion, valuable to advertisers who are planning mass-market or direct-mail c~mp~i~n~7 about the most significant characteristics of con~llm~rs of product X.
Similar illrollllalion can alternatively be extracted from a collection of consumer profiles without recourse to a decision tree, by considering attributes one at a time, and identifying those attributes on which pro duct X's consumers differ ~i~nifiç~ntly from its non-consumers. These techniques serve to characterize consumers of a particular product;
they can be equally well applied to voter lese~.,l~ or other survey research, where the 15 objective is to characterize those individuals from a given set of surveyed individuals who favor a particular c~n~ te, hold a particular opinion, belong to a particular demographic group, or have some other set of ~lictin~li~hing attributes. Researchers may wish to purchase batches of analyzed or unanalyzed user profiles from which personal identifying information has been removed. As with any ~t~ti~tic~l database, statistical conclusions can 20 be drawn, and relationships between attributes can be elllçi~l~qted using knowledge discovery techniques which are well known in the art.
SUPPORTING ARC H l l ECTURE
The following section describes the l~lerelled computer and network architecture for implelllell~ g the methods described in this patent.
25 Electronic Media System Arrhitect--re Figure 1 illustrates in block diagram form the overall architecture of an electronic media system, known in the art, in which the system for c--~tomi~cl electronic identification of desirable objects of the present invention can be used to provide user customized access to target objects that are available via the electronic media system. In particular, the 30 electronic media system comprises a data co~ ic~tion facility that interconnects a plurality of users with a number of illrOl lnalion servers. The users a re typically individuals, whose personal computers (terminals) Tl-T" are connçcted via a data co,.,...~ ic~tions link, such as a modem and a telephone connection established in well-known fashion, to a CA 0223601~ 1998-04-27 W O97/16796 PCTrUS96/179~il teleco~ unication network N User i-~J----aLion access soIlw~ut; is resident on the user's personal computer and serves to co ~ -ir~te over the data co ;ç~tions link and the t~lecnmml-nir~tion nt:Lw~1k N with one of the plurality of network vendors Vl-Vk (America Online, Prodigy, CompuServe, other private co p~l~ies or even universities) who provide ~ S data interconnection service with selected ones ofthe i1~o.-.-alion servers Il -Im The user can, by use of the user i..fo~ Lion access sonw~ e, interact with the h~ lion servers Il ~Im to request and obtain access to data that resides on mass storage systems -SSm that are part of the information server apparatus New data is input to this system y users via their personal comr~lt~rs T, -Tn and by commercial i.~o----aLion services by populating their mass 10 storage systems SSI -SSm with commercial data Each user terminal T -~ a~d theinformation servers I, -Im have phone numbers or IP addresses on the network N which enable a data commnnir~tion link to be established between a particular user terrninal 1', -Tn and the srlected il~,----aLion server Il -Im A user's electronic mail address also uniquely if lPntifies the user and the user's network vendor V, -Vk in an industry-standard format such l 5 as ~ls~rn~me~aol com or username~netcom com The network vendors Vl -Vk provide access passwords for their subscribers (selected users), through which the users can ac,cess the information servers I, ~Im The subscribers pay the network vendors Vl -Vk for the access services on a fee schedule that typically inrl~ldes a monthly subscription fee and usage based charges A difficulty with this system is that there are numerous information 20 servers Il ~3[n~ located around the world, each of which provides access to a set of i.~o. .nation of di~fering format, content and topics and via a cataloging system that is typically wnique to the particular information server Il -3[1~ The inf'o....aLion is comprised of indi~,idual "files," which can contain audio data, video data, graphics data, text data, structured ~l~t~h~ce data and co---l,i--~Lions thereo~ In the terrninology of this patent, each target object 25 is associated with a unique file for target objects that are hlru..~.~Lional in nature and can be digitally ,t;p1~c~ 1 eA, the file directly stores the i-~----dLional content of the target object, while for target objects that are not stored electronically, such as purchasable goods, the file cl nt~in~ an identifying description of the target object Target objects stored electronically as text files can include commercially provided news articles, published docllm.ont~, letters, 30 user-generated do~ , descriptions of physical objects, or combinations ofthese classes of data The o.g~ on of the files co~ g the information and the native format of the data cont~inrd in files ofthe same concepL-Ial type may vary by information server 3[1 -3~"

CA 0223601~ 1998-04-27 WO 97/16796 PCT~US96/17981 Thus, a user can have difficulty in locating files that contain the desired i,-ro~ Lion, because the information may be contained in files whose information server cataloging may not enable the user to locate them. Ful Ll-t; ---ore, there is no standard catalog that defines the presence and services provided by all information servers I~ ,. A user therefore does not have simple access to i,lrul,llaLion but must expend a signifi~nt ~mollnt of time and energy to excerpt a se~ of the il~rc,l lll~Lion that may be relevant to the user from the plethora of illru~Lionthat is generated and populated on this system. Even if the user co.. ;l~ the nece~ry resources to this task, ~ ting information retrieval processes lack the accuracy and effiçi~ncy to ensure that the user obtains the desired i.~ol-nalion. It is obvious that 10 within the constructs of this electronic media system, the three modules of the system for custQmi7ed electronic j(lçntific~tion of desirable objects can be impl~m~nted in a distributed manner, even with various modules being impl~mented on and/or by difrelt;nt vendors within the electronic media system. For example, the h.ro...-~Lion servers Il -Im can include the target profile generation module while the net~,vork vendors Vl -V" may implement the 15 user profile generation module, the target profile interest sllmm~ry generation module, and/or the profile processing module. A module can itself be impl~mPnted in a distributed manner, with numerous nodes being present in the network N, each node serving a population of users in a particular geographic area. The totality of these nodes comprises the fi m~tion~lity of the particular module. Various other partitions of the modules and their 20 functions are possible and the examples provided herein represent illustrative examples and are not intçnded to limit the scope of the claimed invention. ~or the purposes of pseudonymous creation and update of users' target profile interest sl~mm~ries (as described below), the vendors Vl -V3~ may be aug~nented with some number of proxy servers, which provide a mççh~ni~m for ongoing pseudonymous access and profile building through the 25 method described herein. At least one trusted validation server must be in place to lmini~ter the creation of pseudonyms in the system.
An important characteristic of this system for c -etomi7~d electronic identification of desirable objects is its responsiveness, since the int.on~1ed use of the system is in an interactive mode. The system utility grows with the number of the users and this increases 30 the number of possible consumer/product relationships between users and target objects.
A system that serves a large group of users must ~ interactive performance and the disclosed method for profiling and clustering target objects and users can in turn be used for CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981 g the distribution of data among the members of a virtual cnmml-nity and througha data comml-nic~tions network, based on users' target profile interest s -mm~ries.
Network El~ ,..ls and System Chara~ .C.~s The various processors interconnected by the data communication network N as 5 shown in Figure 1 can be divided into two classes and grouped as illustrated in Fig~11re 2:
clients and servers. The clients Cl-Cn are individual user's c~mp~tPr systems which are connP,cted to servers Sl-S5 at various times via data comm--nications links. Each ofthe clients Ci is typically associated with a single server Sj, but these associations can change over time. The clients Cl-Cn both interface with users and produce and retrieve files to and 10 from servers. The clients Cl-Cn are not nP,CP~S~rily continuously on-line, since they typically serve a single user and can be movable systems, such as laptop computers, ~;vhich can be c~nnected to the data collllllunications nelwulh N at any of a number of local:ions.
Clients could also be a variety of other computers, such as computers and kiosks providing access to customized inrollllaLion as well as targeted advertising to many users, where the 15 users identify themselves with passwords or with smart cards. A server Si is a computer system that is presumed to be continuously on-line and functions to both collect files from various sources on the data c~mm--niC~tion network N for access by local clients C 1 -Cn and collect files from local clients Cl-Cn for access by remote clients. The server Si is equipped with pel~ ,L~llL storage, such as a m~gnPtir, disk data storage medillm, and are interconnçcted 20 with other servers via data comm--niç~tiQns links. The data c~".l....l-ir,~tions links can be of ~ubiLl~y topology and arrhitect--re, and are described herein for the purpose of ~implicity as point-to-point links or, more precisely, as virtual point-to-point links. The servers ',l-S5 comprise the network vendors Vl-Vk as well as the information servers I~ " of Fi~iure 1 and the functions performed by these two classes of modules can be merged to a greater or 25 lesser extent in a single server Si or distributed over a number of servers in the: data commllnic~qti(m network N. Prior to proceeding with the description of the pleiE~ d embodiment of the invention, a number of terms are dPfinPd Figure 3 illustrates in block diagram form a representation of an arbitrarily selected network topology for a plurality of servers A-D, each of which is interconnected to at least one other server and typically also 30 to a plurality of clients p-s. Servers A-D are interconnected by a collection of point to point data communications links, and server A is connected to client r, server B is connected to clients p-q, while server D is connected to client s. Servers transmit encrypted or unencry-pted messages amongst themselves: a message typically contains the textual and/or ~1-CA 0223601~ 1998-04-27 ~raphic informzltion stored in a particular file, and also COn~lllS data which describe the type and origin of this file, the name of the server that is supposed to receive the message, and the purpose for which the file contents are being l"ln!~."ille~1 Some messages are not associated with any file, but are sent by one server to other servers for control reasons, for S example to request L~ ".;~iQn of a file or to announce the availability of a new file.
~essages can be forwarded by a server to another server, as in the case where server A
~r~n~mits a message to server D via a relay node of either server C or servers B, C. It is generally preferable to have multiple paths through the network, with each path being characterized by its pe,rol"~allce capability and cost to enable the network N to optimize l 0 traffic routing.
Pro~y Servers and Pseudonymous Trqn~qctions While the method of using target profile interest sl~mm~ries presents many advantages t o both target object providers and users, there are important privacy issues for both users and providers that must be resolved if the system is to be used freely and without inhibition by users without fear of invasion of privacy. It is likely that user s desire that some, if not all, of the user-specific information in their user profiles and target profile interest sllmm~ries remain confid~nti~l~ to be disclosed only under certain circllm.ct~nces related to certain types of tr~n~actions and according to t heir personal wishes for ~1i~çring levels of confid~nti~lity regarding their purchases and expressed interests.
However, complete privacy and in~cce~s.~ibility of user tr~n~ctions and profile summary "~l"l~Lion would hinder impl~."e."~ n ofthe system for customized electronic identification of desirable objects and would deprive the user of many of the ad vantages derived through the system's use of user-specific information. In many cases, complete and total privacy is not desired by all parties to a transaction. For example, a buyer may desire to be targeted for certain m~iling~ that describe products that are related to his or her interests, and a seller may desire to target users who are predicted to be interested in the goods and services that the seller provides. Indeed, the ~-sçfillnecs of the technology described herein is contingent upon the ability of the system to collect and co,l-pale data about many users and many target objects. A comp~u"~ise between total user anonymity and total public disclosure of the user's search profiles or target profile interest summary is a pseudonym. A pse~ldonym is an artifact that allows a service provider to commllnic.ate with users and build and accllmul~te records of their p~ elences over time, while at the same time ~ i"ing ignorant of the users' true identities, so that users can keep their purchases CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/179'31 or p-GrGIG.lces private. A second and equally important requirement of a pseudonym sy:stem is that it provide for digital credlo.nti~l~, which are used to guarantee that the user represented by a particular pseudonym has certain properties. These credenti~l~ may be granted Oll the basis of result of activities and tr~n~actions cond~-cted by means of the system~ for customized electronic identification of desirable objects, or on the basis of other activities and transactions conducte.d on the network N of the present system, on the basis of users' activities outside of network N. For example, a service provider may require proof that the purchaser has sufflcient funds on deposit at his/her bank, which might possibly not be on a network, before agreeing to transact business with that user. The user, therefore, must provide the service provider with proof of funds (a cred~nti~l) from the bank, while still not disclosing the user's true identity to the service provider.
Our method solves the above problems by conlbining the pseudonym granting and credential L-~-~rGl methods taught by D. Chaum and J.H. Evertse, in the paper titled "A
secure and privacy-plole-;Lillg protocol for L-~ g personal i~lro~.l.aLion between olg~ ons," with the implçment~tion of a set of one or more proxy servers distributed throughout the network N. Each proxy server, for example S2 in Figure 2, is a server which communicates with clients and other servers S5 in the network either directly or through anonyl. iGillg mix paths as detailed in the paper by D.Chaum titled "Untraceable Electronic Mail, Return Addresses, and Digital Pseudonyms," published in CommlmicatiQn~ of the ACM, Volume 24, Number 2, February 1981. Any server in the network N may be configured to act as a proxy server in addition to its other functions. Each proxy server provides service to a set of users, which set is termed the "user base" of that proxy server.
A given proxy server provides three sorts of service to each user U in its user base, as follows:
1. The first function of the proxy server is to bidirectionally transfer communications between user U and other entities such as i~ro~ alion servers (possibly in~ ing the proxy server itself) and/or other users. Specifically, letting S denote the server that is directly associated with user U's client processol; the proxy server communicates with server S (and thence with user U), either throughanonymizing mix paths that obscure the identity of server S and user U, in ~,vhich case the proxy server knows user U only through a secure pseudonym, Ol else through a conventional virtual point-to-point connection, in which case the proxy -CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981 server knows user U by user ~s address at server S, which address may be regarded as a non-secure pseudonym for user U.
2. A second function of the proxy server is to record user-specific in~ll,a~ion associated with user U. This user-specific i,~rul--la~ion intl~ldçs a user S profile and target profile interest s~mm~ry for user U, as well as a list of access control instructions specified by user U, as described below, and a set of one-time return addresses provided by user U that can be used to send mes~ges to user U
without knowing user U's true identity. All of this user-specific information isstored in a database that is keyed by user U's pseudonym (whether secure or non-secure) on the proxy server.
3. A third function of the proxy server is to act as a selective ~- w~ding agent for ~ln~olicited communications that are addressed to user U: the proxy server forwards some such commlmir~tinn~ to user U and rejects others, in accordance with the access control instructions specified by user U.
Our combined method allows a given user to use either a single pseudonym in all transactions where he or she wishes to remain pseudonymous, or else different pseudonyms for di~eren~ types of transactions. In the latter case, each service provider might transact with the user under a di~e,e." pseudonym for the user. More generally, a coalition of service providers, all of whom match users with the same genre of target objects, might agree to transact with the user using a cn~ 1 ll l loll pseudonym, so that the target profile interest summary associated with that pseudonym would be complete with respect to said genre of target objects. When a user employs several pseudonyms in order to transact with different coalitions of service providers, the user may freely choose a proxy server to service each pseudonym; these proxy servers may be the same or di~ele..l.
From the service provider's perspective, our system provides security, in that it can guarantee that users of a service are legitim~tely entitled to the services used and that no user is using multiple ps~lclonyms to c--mm--nicate with the same provider. This uniqueness of pseudonyms is important for the purposes of this application, since the transaction ~u,."alion gathered for a given individual must represent a complete and consistent picture of a single user's activities with respect to a given service provider or coalition of service providers; otherwise, a user's target profile interest sl-mm~ry and user profile would not be able to ~ se-" the user's i."e,e~ to other parties as completely and accurately as possible.

~1 4- c CA 0223601~ 1998-04-27 W O 97/16796 PCTAJS96/179;~1 The service provider must have a means of protection from users who violate previously agreed upon terms of service. For example, if a user that uses a given pseudonym engages in activities that violate the terrns of service, then the service provider should be able to take action against the user, such as denying the user service and bl~ tin~l the user S from tr~n~tiQn~ with other parties that the user might be tempted to defraud. This type of ~it~l~tion might occur when a user employs a service provider for illegal activities or dpfr~ t~
in payments to the service provider. The method of the paper titled "Security without identification: Tr~n~acti~n systems to make Big-Brother obsolete", published in the Comm~nic~tiQns of the ACM, 28(10), Oct. 1985; pp.l030-1044, incorporated herein,provides for a ., .~ . . to enforce protection against this type of behavior through the use of resolution credçnti~l~, which are crecl~nti~ that are periodically provided to individuals contingent upon their behaving consistent with the agreed upon terms of service between the user and ;~r~J""i~l;on provider and network vendor entities (such as regular payment for services rendered, civil conduct, etc.). For the user's safety, if the issuer of a resolution credential refuses to grant this resolution credt-.nti~l to the user, then the refusal may be appealed to an adjudicating third party. The integrity of the user profiles and target profile interest sllmm~ries stored on proxy servers is important: if a seller relies on such user-specific ;,,rul ",~;on to deliver promotional offers or other material to a particular class of users, but not to other users, then the user-specific information must be accurate and untampered with in any way. The user may likewise wish to ensure that other parties not tamper with the user's user profile and target profile interest sllmm~ry, since such modification could degrade the system's ability to match the user with the most appropliate target objects. This is done by providing for the user to apply digital sign~t~lres to the control messages sent by the user to the proxy server. Each pseudonym is paired with a public cryptographic key and a private cryptographic key, where the private key is known only to the user who holds that pseudonym; when t he user sends a control message to a proxy server under a given pseudonym, the proxy server uses the pseudonym's public key to verify that the message has been digitally signed by someone who knows the pseudonym's private key. This prevents other parties from masquerading as the user.
Our approach, as disclosed in this applic~ti-~n, provides an improvement over the prior art in privacy-protected pseudonymy for n~;lwclk subscribers such as taught in U.S.
Patent 5,245,656, which provides for a name ~l~ sl~lor station to act as an int~rm~di~ry between a service provider and the user. However, while U.S. Patent 5,245,656 provides CA 0223601Ct 1998-04-27 that the information L~ ed between the end user U and the service provider be doubly encrypted, the fact that a relationship exist s between user U and the servicé provider is known to the name tr~nel~t~r, and this fact could be used to co..-plc..llise user U, for ~7mple if the selvice provider spe~ s in the provision of content that is not deemed 5 acc~le by user U's peers. The method of U.S. Patent 5,245,656 also omits a method for the convenient updating of pseudonymous user profile information, such as is provided in this application, and does not provide for assurance of unique and cre(~nti~led registration of pseudonyrns from a credenti~ling agent as is also provided in this application, and does not provide a means of access control to the user based on profile information and 10 con~lition~l access as will be subsequently described. The method described by Loeb et al.
also does not describe any provision for crecl~nti~lc~ such as might be used for ~lthçntic~ting a user's right to access particular target objects, such as target objects that are intended to be available only upon payment of a subscription fee, or target objects that are intend ed to be unavailable to younger users.
15 Pro~y Server D~s~ lion In order that a user may ensure that some or all of the information in the user's user profile and target profile interest sllmm~ry remain dissociated from the user's true identity, the user employs as an intermediary any one of a number of proxy servers available on the data communication network N of Figure 2 (for example, server S2). The proxy servers 20 function to ~ ~lice the true identity of the user from other parties on the data communication network N. The proxy server replesellLs a given us er to either single network vendors and information servers or coalitions thereof. A proxy server, e.g. S2, is a server computer with CPU, main memory, secondary disk storage and network communication function and with a ~l~t~h~e function which retrieves the target profile 25 interest summary and access control instructions associated with a particular pseudonym P, which represents a particular user U, and performs bi-directional routing of co~target objects and billing i.,rol,.~lion between the user at a given client (e.g. C3) and other network entities such as network vendors Vl-Vk and ;.,rc., ...~ion servers I1-Im. Each proxy server ..,~ an encrypted target profile interest summary associated with each allocated 30 pseudonym in its pseudonym t1~t~h~ce D . The actual user-specific information and the associated ps~ clonyms need not be stored locally on the proxy server, but may alternatively be stored in a distributed fashion and be remotely addressable from the proxy server via point-to-point connections .

CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981 The proxy server supports two types of bi-directional connections: point-to-point connections and pseudonymous comle-,lions through mix paths, as taught by D.Chaum in the paper titled "Untraceable Electronic Mail, Return Addresses, and Digital Pseudonyms", Communications of the ACM, Volume 24, Number 2, February 1981. The normal 5 connections between the proxy server and illr~ llaLion servers, for example a connection between proxy server S2 and il~ nl~Lion server S4 in Figure 2, are accomplished through the point-to-point Gol-l-e~ n protocols provided by n~;lwulh N as described in the "Electronic Media System Architect--re" section of this application. The normal type of point-to-point connectione may be used between S2-S4, for eY~mrle, since the dissociation 10 of the user and the pseudonym need only occur between the client C3 and the proxy server S2, where the pseudonym used by the user is available. Knowing that an information provider such as S4 communicates with a given pseudonym P on proxy server S2 does not COIllplull~~Se the true identity of user U. The bidirectional connection between the user and the proxy server S2 can also be a normal point-to-point connectiQn but it may instead be 15 made anonymous and secure, if the user desires, though the con~i~t~nt use of an anollyll~,llg mix protocol as taught by D.Chaum in the paper titled "Untraceable Electronic Mail, Return Addresses, and Digital Pseudonyms", Communications of the ACM, Volume 24, Number 2, February 1981. This mix procedure provides untraceable secure anonymous mail between to parties with blind return addresses through a set of forwarding and return 20 routing servers termed "mixes". The mix routing protocol, as taught in the Chaum paper, is used with the proxy server S2 to provide a registry of persistent secure pseudonyms that can be employed by users other than user U, by il.ru~l~laLion providers Il-Im, by vemdors V1-Vk and by other proxy servers to co~ ;r~te with the users in the proxy server~; user base on a contin~in~ basis. The security provided by this mix path protocol is distrilbuted 25 and resistant to traffic analysis attacks and other known forms of analysis which may be used by malicious parties to try and ascertain the true identity of a pseudonym bearer.
Breaking the protocol requires a large number of parties to maliciously collude or be cryptographically cul-lp--,-llised. In addition an extension to the method is taught where the user can include a return path definition in the message so the information server S4 can 30 return the requçsted information to the user's client processor C3 . We utilize this feature in a novel fashion to provide for access and reachability control under user and proxy server control.

CA 0223601~ 1998-04-27 W O 97/16796 PCTAUS96/17981V~ t~r and Allocation of a Unique Pseudonym Chaum's pseudonym and credçnt~ s~l~n~e system, as described in a publication by D. Chaum and J.H. Evertse, titled "A secure and privacy-protecting protocol for tr~n~mitting personal il~llllaLion between Olg~ ons~ll has several desirable properties 5 for use as a component in our system. The system allows for individuals to use dirre~e~
pseudonyms with dirr~ e -L ol~..;,;1liolls (such as banks and coalitions of service providers).
The ol~,~n;,~l;ons which are presented with a pseudonym have no more inrol--laLion about the individual than the pseudonym itself and a record of previous tr~n.c~cti~ns carried out underthat pseudonym. Additionally, cred~nti~lc, which l~-~se--L facts about a pseudonym 10 that an o~ ;on is willing to certify, can be granted to a particular pseudonym, and transferred to other pseudonyms that the same user employs. For, example, the user can use dirre.~ pseudonyms with dirrele~L olgani~Lions (or disjoint sets of o-g~ l;ons), yet still present credenti~l~ that were granted by one or~ni7~ti~n~ under one pseudonym, in order to transact with another b-~ aLion under another pseudonym, without revealing that the 15 two pseudonyms correspond to the same user. Credenti~l~ may be granted to provide assurances regarding the pseudonym bearer's age, fin~n~ i~l status, legal status, and the like.
For example, cred.~nti~l~ signifying "legal adult" may be issued to a pseudonym based on il rc,.lllaLion known about the corresponding user by the given is suing org~ni7~tion. Then, when the credential is Ll ~l~rel . ed to another pseudonym that 1 epl esel-Ls the user to another 20 disjoint ol~ lion, pl~se-lLaLion ofthis credential on the other pseudonym can be taken as proof of legal adulthood, which might satisfy a condition of terms of service.
Credential-issuing o,~.~ l;on~ may also certify particular facts about a user's demographic profile or target profile interest summary, for example by granting a credenti~l that asserts "the bearer of this pseudonym is either well-read or is middle-aged and works for a large 25 company"; by p~rse~ g this credential to another entity, the user can prove eligibility for (say) a discount without revealing the user 's personal data to that entity.
lition~lly, the method taught by Chaum provides for assurances that no individual may ~ll t;s~ond with a given o~ l ;Qn or coalition of o,~,~ n~ ns using more than one pseudonym; that credçnt~ may not be feasibly forged by the user; and t hat cred~nti~lc 30 may not be Ll~u sr~ d from one user's pseudonym to a dirrelt;l-L user's pseudonym. Finally, the method provides for expiration of credenti~le and for the iS~C~l~nCe of "black marks"
against Individuals who do not act accordi~ to the terms of service that they are ~o~rtçnrled This is done through the resolution credential mech~ni~m as described in Chaum's work, CA 0223601~ 1998-04-27 W O 97/16796 PCTAJS96/179~1 in which re~ tion~ are issued periodically by o~ ;,nl;ons to pseudonyms that are in ~Jood st~ndin~ If a user is not issued this rçsolution credçnti~l by a particular o,~ on or coalition of ol~ nl;on, then this user cannot have it available to be transferred to other pseudonyms which he uses with other o~ nl;ons. Therefore, the user cannot convince S these other o,~ lion~ that he has acted acco,.lance with terms of service in other dealings. If this is the case, then the O~ nl ion can use this lack of resolution crede:ntial to infer that the user is not in good st~ntling in his other de~ling~ In one approach 01 ~,n n; ~ l ;one(or other users) may issue a list of quality related cred~nti~ls based UpOll the experience of transaction (or interaction) with the user which may act similarly to a letter 10 of recommçn~tion as in a resume. If such a credçnti~l is issued from multipleo,~ lions, their values become averaged. In an ~It~rn~tive variation olg~ nl;ons may be issued credçnti~l~ from users such as customers which may be used to indicate to other future users quality of service which can be expected by subsequent users on the basis of various criteria.
In our impl~."~"l~l;on, a pseudonym is a data record con~i~ting oftwo fields. The first field specifies the address of the proxy server at which the pseudonyrn is registered.
The second field co"l~,~s a unique string of bits (e.g., a random binary number) that is associated with a particular user; cred~nti~l~ take the form of public-key digital ~ign~ltllres computed on this number, and the number itself is issued by a pseudonym ~tlmini~tering 20 server Z, as depicted in Figure 2, and detailed I n a generic form in the paper by D. Chaum and J.H. Evertse, titled "A secure and privacy-protecting protocol for tr~n.~mitting pers,onal m~lion between ol~ nl;ons.". It is possible to send il~ollllalion to the user holding a given pseudonym, by enveloping the i~llllalion in a control message that specifies the pseudonym and is addressed to the proxy server that is narned in the first field of the 25 pseudonym; the proxy server may rOl w~d the ~.rollllalion to the user upon receipt of the control message.
While the user may use a single pseudonyrn for all transactions, in the more general case a user has a set of several pseudonyms, each of which 1 epl es~l-ls the user in his or her interactions with a single provider or coalition of service providers. Each pseudonym in the 30 pseudonym set is dç~i~n~ted for transactions with a di~el~ l coalition of related service providers, and the pseudonyms used with o ne provider or coalition of providers cannot be linked to the pseudonyms used with other disjoint coalitions of providers . All of the user's tr~n~ctioni with a given coalition can be linked by virtue of the fact that they are conducted CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981under the same pse~ldonym, and therefore can be combined to define a unified picture, in the form of a user profile and a target profile interest s~lmm~ry, of the user's interests vis-a-vis the service or services provided by said coalition. There are other circ~lm~t~n~e~
for which the use of a pseudonym may be useful and the present description is in no way 5 intended to limit the scope of the claimed invention for example, the previously described rapid profiling tree could be used to pseudonymously acquire i~ a~ion about the user which is con~id~red by the user to be sensitive such as that h~rol~ ion which is of interest to such entities as insurance c~n~ iee, me-lic~l speçi~listc~ family counselors or dating services.
10 Detailed Protocol ~ our system, the ol~ ;on~ that the user U interacts with are the servers S l-Sn on the network N. However, rather than directly corresponding with each server, the user employs a proxy server, e.g. S2, as an intermçfli~ry between the local server of the user' s own client and the h~ Lion provider or network vendor. Mix paths as described by15 D.Chaum in the paper titled 'lunkaceable Electronic Mail, Return Addresses, and Digital Pseudonyrns", Commllnic~tiQns ofthe ACM, Volume 24, Number 2, February 1981 allow for untraceability and security between the client, such as C3, and the proxy server, e.g. S2.
Let S~M,K) lc~lesellL the digital signing of message M by modular eXponenti~tiQn with key K as detailed in a paper by Rivest, R.L., Shamir, A., and Adleman, L. Titled "A method for 20 obtaining digital ~ign~tllres and public-key cryptosystems", published in the Comm. ACM
21, 2 Feb.120-126. Once a user applies to server Z for a pseudonym P and is granted a signed ps~d~nym signed with the private key SKz of server Z, the following protocol takes place to establish an entry for the user U in the proxy server S2's database D. 1. The user now sends proxy server S2 the pse~1dcnym, which has been signed by Z to int1ic~te the 25 ~lltht~ntic~ity and llni~l~n~s ofthe pse~ nym. The user also generates a PKp, SKp key pair for use with the granted pse~ldonym, where is the private key associated with the pseudonym and PKp is the public key associated with the pseudonym. The user forms a request to establish pseudonym P on proxy server S2, by sending the signed pseudonym S(P, SKz) to the proxy server S2 along with a request to create a new d~t~h~e entry, inc1e~çd by P, and 3 0 the public key PKp. It envelopes the message and transmits it to a proxy server S2 through an ~lollyll.~ 3 mix path, along with an anonymous return envelope header. 2. The proxy server S2 receives the ~i~t~q-h~ce creation entry request and associated certified pseudonym message. The proxy server S2 checks to ensure that the requested pseudonym P is signed CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/179;Bl by server Z and if so grants the request and creates a database entry for the pseudon~m, as well as storing the user's public key PKp to ensure that on]Ly the user U can make requests in the future using pse~ldonyrn P. 3. The structure of the user's d~t~h~ce entry consists of a user profile as det~iled herein, a target profile interest ~...,....~.y as det~iled herein, and a ~ 5 Boolean co.. ~ ion of access control criteria as detailed below, along with the associated public key for the pseudonyrn P. 4. At any tiime after d~t~ba~e entry for Pseudonyrn P is established, the user U may provide proxy server S2 with credenti~l~ on that pseudonym, provided by third parties, which credçnt~ make certain assertions about that pseudonym.
The proxy server may verify those cred~nti~l~ and make app.~ iate morlific~tions to the 10 user's profilLe as required by these cred~nti~1~ such as recording the user's new demographic status as an adult. It may a]Lso store those credçnti~l~ so that it can present them to service providers on the user's behalLf.
The above steps may be repeated, with either the same or a di~e~ proxy server, each time user U requires a new pseudonym for use with a new and disjoint coa]Litiion of 15 providers. In practice there is an extremely small probability that a given pseudonym may have already been a]Llocated by due to the random nature of the pseudonym generation process carried out by Z. If this highLy un]Likely event occurs, then the proxy server S 2 may reply to the user with a signed message in-iic~ting that the generated pseudonym has already been allocated, and asking for a new pse~ldonym to be generated.
20 Pseudonymous Control of an Information Server Once a proxy server S2 has ~llthPnti-.~t~l and registered a user's pseudonym, the user may begin to use the services of the proxy server S2, in interacting with other network entities such as service providers, as exemplified by server S4 in Figure 2, an inforrnation service provider node connected to the network. The user controls the proxy server S2 by 25 forrning digitally encoded requests that the user subsequently transmits to the proxy server S2 over the network N. The nature and format of these requests will vary, since the proxy server may be used for any of the services described in this application, such as the brow~illg, querying, and other navigational functions described below.
In a generic scenario, the user wishes to communicate under pseudonym P with a 30 particular hlrulllla~ion provider or user at address A, where P is a pseudonym allocated to the user and A is either a public network address at a server such as S4, or another pseudonym that is registered on a proxy server such as S4. (In the most common v ersion of this scenario, address A is the address of an h~llllaLion provider, and the user is CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981 req~ hng that the in formation provlder send target objects of interest.) The user must form a request R to proxy server S2, that requests proxy server S2 to send a message to address A and to ro~w~ud the response back to the user. The user may thereby comm~lnic~te with other parties, either non-pseudonymous parties, in the case where address A is a public S network address, or pse~lclonymous parties, in the case where address A is a pseudonym held by, for example, a busille~,~ or another user who prefers to operate pseudonymously.
In other scenarios, the request R to proxy server S2 formed by the user may have.li~rt;~ contpnt For ~mple, request R may instruct proxy server S2 to use the methods described later in this description to retrieve from the most convenient server a particular piece of i,lrolll,alion that has been mlllti~ t to many servers, and to send this information to the user. Conversely, request R may instruct proxy server S2 to mll1tic~ct to many servers a file associated with a new target object provided by the user, as described below. If the user is a subscriber to the news clipping service described below, request R may instruct proxy server S2 to forward to the user all target objects that the news clipping service has sent to proxy server S2 for the user's attention. If the user is employing the active navigation service described below, request R may instruct proxy server S2 to select a particular cluster from the hierarchical cluster tree and provide a menu of its subclusters to the user, or to activate a query that temporarily affects proxy server S2's record of the user's target profile interest summary. If the user is a member of a virtual community as described below, request R may instruct proxy server S2 to forward to the user all messages that have been sent to the virtual community.
Regardless of the content of request R, the user, at client C3, initi,qtÇ~ a connection to the user's local server S 1, and instructs server S 1 to send the request R along a secure mix path to the proxy server S2, initi~ting the following sequence of actions:
1. The user's client processor C3 forms a signed message S(R, SKp), which is paired with the user's pseudonym P and (if the request R requires a response) a secure one-time set of return envelopes, to form a message M. It protects the message M with an multiply enveloped route for the outgoing path. The enveloped route s provide for secure commllnic~tif7n between Sl and the proxy server S2. The m~ g~ M is enveloped in the most deeply nested m~c~ge and is therefore difficultto recover should the message be intercepted by an eavesdropper.

CA 0223601~ 1998-04-27 W O 97/16796 PCTrUS96/179~1 2. The message M is sent by client C3 to its local server S1, and is then routed by the data comm~mic~9tiQn neLwolk N from server S1 through a set of rnixes as dictated by the outgoing envelope set and arrives at the selected proxy server S2.
3. The proxy server S2 separates the received message M into the request message R, the pseudonym P, and (if incl~lded) the set of envelopes for the rleturn path. The proxy server S2 uses pseudonym P to index and retrieve the corresponding record in proxy server S2's d~t~k~e, which record is stored in local storage at the proxy server S2 or on other distributed storage media ~cces~ible to proxy server S2 via the nt;twulk N. This record contains a public key PKp, user-specific il~ru~ tion~ and credçnti~l~ associated with pselldonym P. The proxy server S2 uses the public key PKp to check that the signed version S(R, S~; ~ ofrequest message R is valid.
4. Provided that the ~;~Lule on request message R is valid, the proxy server S2 acts on the request R. For example, in the generic scenario described al~ove,request message R incllld~s an embedded message Ml and an address A to ~,vhom message M1 should be sent; in this case, proxy server S2 sends m~s~ge M1 l:o theserver named in address A, such as server S4. The comm--nic.~tic n is done usingsigned and optionally encrypted messages over the normal point to point connections provided by the data commllnic~tion network N. When necess.~,y in order to act on embedded message Ml, server S4 may P~h~n~e or be caused to e~ch~nge further signed and optionally encrypted messages with proxy server S2, still over normal point to point connectione, in order to negotiate the release of user-specific ;.,rol"~ ;on and credenti~l~ from proxy server S2. In particular, server S4 may require server S2 to supply credP.nti~ls proving that the user is entitled to the information requested -- for example, proving that the user is a subscriber in good st~n~inp to a particular il~o,..l~tion service, that the user is old enough to legally receive adult material, and that the user has been offered a particular discount (by means of a special discount credential issued to the user's pseudonym).
5. If proxy server S2 has sent a message to a server S4 and server ',4 has created a ~e~ollse M2 to mto~ge M1 to be sent to the user, then server S4 transmits the response M2 to the proxy server S2 using normal network point-to-point connections.

CA 0223601~ 1998-04-27 6. The proxy server S2, upon receipt of the response M2, creates a return message Mr COlllpliSillg the response M2 embedded in the return envelope set that was earlier L ~ led to proxy server S2 by the user in the original message M. Itn~lll;lx the return mçs~ge Mr along the pseudonymous mix path specified by this return envelope set, so that the l~,~onse M2 reaches the user at the user's client processor C3.
7. The response M2 may contain a request for electronic payment to the information server S4. The user may then respond by means of a me~s~ge M3 L~ led by the same means as described for message Ml above, which message M3 encloses some form of anonymous paylllt;llL Alternatively, the proxy server may respond automatically with such a payment, which is debited from an account ",~ ed by the proxy server for this user.
8. Either the response mec~ge M2 from the i.-ro----ation server S4 to the user, or a subsequent message sent by the proxy server S2 to the user, may contain advertising material that is related to the user's request and/or is targeted to the user.
Typically, if the user has just retrieved a target object X, then (a) either proxy server S2 or hlru---laLion server S4 det~rmines a weighted set of adver~i~ementc that are "associated with" target object X, (b) a subset of this set is chosen randomly, where the weight of an advertie~m~nt is proportional to the probability that it is inr~ ded in the subset, and (c) proxy server S2 selects from this subset just those advertie~m~ntc that the user is most likely to be interested in. In the variation where proxy server S2 deterrnines the set of advertisements associated with target ob3ect X then this set typically consists of all advertisements that the proxy server's owner has been paid to ~liss~min~te and whose target profiles are within a threshold similarity distance of the target profile of target object X. In the variation where proxy server S4 determines the set of advertisell.e..Ls associated with target object X, advertisers typically purchase the right to include adverti~çm~nt~ in this set. In either case, the weight of an adverti~m~nt is deterrnined by the arnount that anadvertiser is willing to pay. Following step (c), proxy server S2 retrieves the s~1-octed advertising m~t~ri~l and transmits it to the user's client processor C3, where it will be displayed to the user, within a specified length of time after it is received, by a trusted process running on the user's client processor C3. When proxy server S2 L-~n5i--~iL~ an adverti~ment, it sends a message to the advertiser, indicating that CA 0223601~ 1998-04-27 W O 97/16796 PCTrUS96/179'31 the adverti~pmçnt has been L~ ed to a user with a particular predicted le~vel of interest. The message may also in~1is~te the identity of target object X. In return, the advertiser may L.~-slliL an electronic payment to proxy server S2; proxy server S2 retains a service fee for itself, optionally fo,w~uds a ser~vice fee to i..roll"alion S server S4, and the balance is folw~led to the user or used to credit the user's account on the proxy server.
9. Ethe ~t;~onse M2 ~,.~ or identifies a target object, the passive and/or active relevance feedb~c1~ that the user provides on this object is tabulated by a process on the user's client processor C3. A s--mm~ry of such relevance feedbackinformation, digitally signed by client processor C3 with a proprietary private key SKC3, is periodically tr~n~mitted through an a secure mix path to the proxy server S2, whereupon the search profile gene-~lion module 202 resident on serv~er S2 updates the approp-iate target profile interest :,.. ~.y associated with pseudonym P, provided that the signature on the summary message can be ~lthPnticated with the corresponding public key PKc3 which is available to all tabulating process that are ensured to have integrity.
When a con~nmpr enters into a fin~nri~l relationship with a particular i.~,----aLion server based on both parties agreeing to terms for the relationship, a particular pseudonym may be PYten~led for the con~nmPr with respect to the given provider as detailed :in the 20 previous section. When entering into such a relationship, the consumer and the service provider agree to certain ternns. However, if the user violates the terms of this relatio nship, the service provider may decline to provide service to the pseudonym under which it transacts with the user. In addition, the service provider has the recourse of refusing to provide resolution credenti~l~ to the pseudonym, and may choose to do so until the 25 pseudonym bearer returns to good st~ntling Pre-Fetching of Target Objects In some circ~m~t~nces~ a user may request access in sequence to many files, which are stored on one or more h~oll..alion servers. This behavior is common when navigating a hypertext system such as the World Wide Web, or when using the target object browsing 30 system described below.
In general, the user requests access to a particular target object or menu of target objects; once the corresponding file has been Ll;n~ ed to the user's client processor, the user views its contents and makes another such request, and so on. Each request may take CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981 many seconds to satisfy, due to retrieval and tr~n~mi~ion delays. However, to the extent that the seq~lrnce of requests is predictable, the system for c~ o"~ cl electronic identification of desirable objects can respond more quickly to each request, by retrieving or starting to retrieve the apl)lopliate files even before the user requests them. This early 5 retrieval is terrned "pre-fetrhing of files."
Pre-fetching of locally stored data has been heavily studied in memory hierarchies, including CPU caches and secondary storage (disks), for several dec~des A leader in this area has been A. J. Smith of Berkeley, who i~ ntified a variety of srh~mes and analyzed opportunities using ~"~Lc~ivc traces in both cl~t~b~es and CPU caches. His conclusion was
10 that general schemes only really paid off where there was some re~son~kle chance that seq~lrnti~l access was occurring, e.g, in a seq~lrnti~l read of data. As the h~l~nces between various latencies in the memory hierarchy shifted during the late 1980's and early l990's, J. M. Smith and others idrntifird further opportunities for pre-fetr.hing of both locally stored data and network data. In particular, deeper analysis of patterns in work by Blaha showed 1~ the possibility of using expert systems for deep pattern analysis that could be used for pre-fetrhinp Work by J. M. Smith proposed the use of reference history trees to anticipate references in storage hierarchies where there was some historical data. Recent work by Touch and the Berkeley work addressed the case of data on the World-Wide Web, where the large size of images and the long l~tenf.ies provide extra incentive to pre-fetch; Touch's 20 technique is to pre-send when large bandwidths permit some speculation using HTML
storage lcrclcnces rmh~1ded in WEB pages, and the Berkeley work uses techniques similar to J. M. Srnith's reference histories speci~li7ecl to the srm~ntics of HTML data.
Successful pre-fetching depends on the ability of the system to predict the nextaction or actions of the user. In the context of the system for customized electronic 25 idrntifir,~tion of desirable objects, it is possible to cluster users into groups according to the similarity of their user profiles. Any of the well-known pre-fetching methods that collect and utilize agglc~le st~tietirs on past user behavior, in order to predict future user behavior, may then be implrmented in so as to collect and utilize a separate set of statistics for each cluster of users. In this way, the system generalizes its access pattern statistics from each 30 user to similar users, without generalizing among users who have subst~nti~lly di~clclll interests. The system may further collect and utilize a similar set of statistics that describes the aggregate behavior of all users; in cases where the system cannot confidently make a prediction as to what a particular user will do, because the relevant st~ti~tic~ concerning that CA 0223601~ 1998-04-27 W O 97/16796 PCTnJS96/1791~1 user's user cluster are derived from only a small amount of data, the system may instead make its predictions based on the aggregate st~ticti(~s for all users, which are derived from a larger amount of data. For the sake of concretçn~ss, we now describe a particular inst~nti~tion of a pre-fetching system, that both employs these inciphtc and that makes its 5 pre-fetching decisions through accurate measurement of the expected cost and benefit of each potential pre-fetch.
Pre-fetr.hing exhibits a cost-benefit tradeoff. Let t denote the approx;...~te. n~lmber of minntes that pre-fetched files are retained in local storage (before they are delel:ed to make room for other pre-fetched files). If the system elects to pre-fetch a file corresponding 10 to a target object X then the user benefits from a fast ~ onse at no extra cost, provided that the user explicitly requests target object X soon lhele~[ler. However, if the user does not request target object X within t mimltes of the pre-fetch, then the pre-fetch was worthless, and its cost is an added cost that must be borne (directly or indirectly) by the user. The first scenario therefore provides benefit at no cost, while the second scenario incurs a cost at no 15 benefit. The system tries to favor the first scenario by pre-fetching only those files th~at the user will access anyway. Depending on the user's wishes, the system may pre-fetch either conservatively, where it controls costs by pre-fet~hing only files that the user is extremely likely to request explicitly (and that are relatively cheap to retrieve), or more aggressively, where it also pre-fetches files that the user is only moderately likely to request explicitly, 20 thereby increasing both the total cost and (to a lesser degree) the total benefit to the user.
In the system described herein, pre-fetching for a user U is accomplished 13y the user's proxy server S. Whenever proxy server S retrieves a user-requested file F frt)m an information server, it uses the identity of this file F and the characteristics of the user, as described below, to identify a group of other files Gl...Gk that the user is likely to access 25 soon. The user's request for file F is said to "trigger" files GlGk. Proxy server S
pre-fetches each of these triggered files Gi as follows:
1. Unless file Gi is already stored locally (e.g., due to previous pre-fetch), proxy server S retrieves file Gi from an appropriate information server and stores it locally.
302. Proxy server S tim~ct~mps its local copy of file Gi as having just been pre-fetched, so that file Gi will be retained in local storage for a ...;1l;,,,..,,. of app.ux.,n~tçly t mimltes before being deleted.

CA 0223601~ 1998-04-27 WO 97/16796 PCT~US96/17981 Whenever user U (or, in principle, any other user registered with proxy server S) requests proxy server S to retrieve a file that has been pre-fetched and not yet deleted, proxy server S Gan then retrieve the file from local storage rather than from another server. In a variation on steps 1-2 above, proxy server S pre-fetches a file Gi somewhat dirr~re~lLly, so that 5 pre-fetched files are stored on the user's client processor q rather than on server S:
1. If proxy server S has not pre-fetched file Gi in the past t mimltee it retrieves file Gi and Ll~ SllllLS it to user U's client processor q.
2. Upon receipt of the message sent in step 1, client q stores a local copy of file Gi if one is not currently stored.
3. Proxy server S notifies client q that client q should l;",~ ",p its local copy of file Gi; this notification may be combined with the message tr~nemitted in step 1, if any.
4. Upon receipt of the message sent in step 3, client q timest~mrs its local copy of file Gi as having just been pre-fetched, so that file Gi will be retained in local storage for a minimllm of al)plox~l"aLely t minutes before being deleted.
During the period that client q retains file Gi in local storage, client q can respond to any request for file Gi (by user U or, in principle, any other user of client q) immerli~tely and without the ~ceist~n~e of proxy server S.
The difflcult task is for proxy server S, each time it retrieves a file F in response to 20 a request, to identify the files Gl...Gk that should be triggered by the request for file F and pre-fetched imme~ t~ly. Proxy server S employs a cost-benefit analysis, pe,ro",i~llg each pre-fetch whose benefit exceeds a user-determined multiple of its cost; the user may set the multiplier low for aggressive prefetching or high for conservative pr~fetching These pre-fetches may be pelrc,lllled in parallel. The benefit of pre-fetching file Gi imme~i~tely 25 is defined to be the expected number of seconds saved by such a pre-fetch, as col"pa,~d to a situation where Gi is left to be retrieved later (either by a later pre-fetch, or by the user's request) if at all. The cost of pre-fetching file Gi imme~ t~ly is defined to be the expected cost for proxy server S to retrieve file Gi, as determined for example by the network locations of server S and file Gi and by hlro~naLion provider charges, times 1 m-inus the 30 probability that proxy server S will have to retrieve file Gi within t mim~tes (to satisfy either a later pre-fetch or the user's explicit request) if it is not pre-fetched now.
The above ~ finitiQne of cost and benefit have some attractive properties. For example, if users tend to retrieve either file Fl or file F2 (say) after file F, and tend only in CA 0223601~ 1998-04-27 W O 97/16796 PCTrUS96/179~11 the former case to s~lbseql~Pntly retrieve file Gl, then the system will generally not pre--fetch Gl imm~1i~tPly after retrieving file F: for, to the extent that the user is likely to retrieve file F2, the cost of the pre-fetch is high, and to the extent that the user is likely to retrie~e file Fl instead, the benefit ofthe pre-fetch is low, since the system can save as much or nearly as much time by waiting until the user chooses Fl and pre-fetching Gl only then.The proxy server S may e~ e the nec~s~.y costs and benefits by adhering ~o the following ~i~ciplin~.:
1. Proxy server S m~intAins a set of disjoint clusters of the users in it ~ userbase, clustered according to their user profiles.
2. Proxy server S .. ,;~ i.. c an initially empty set PFT of "pre-fetch triples"
<C,F,G>, where F and G are files, and where C i(i~ntifi~s either a cluster of users or the set of all users in the user base of proxy server S. Each pre-fetch triple in the set PFT is associated with several stored values specific to that triple. Pre-fetch lriples and their associated values are IllAi~Ail~çd accoldillg to the rules in 3 and 4.3. Whenever a user U in the user base of proxy server S makes a request R2 for a file G, or a request R2 that triggers file G, then proxy server S takes the following ac.tion~
a. For C being the user cluster co~-l Ai~ g user U, and then again for C being the set of all users:
b. For any request R0 for a file, say file F, made by user U durin~i the t minlltes strictly prior to the request R2:
c. If the triple <C,F,G> is not currently a member of the set PFT, it is added to the set PFT with a count of 0, a trigger-count of 0, a target-count of 0, a total benefit of 0, and a tim~stAmr whose value is the current date and time.
d. The count of the triple <C,F,G>is increased by one.
e. If file G was not triggered or explicitly retrieved by any request that user U made strictly in between requests RO and R2, then the target-count of the triple <C,F,G> is increased by one.
f. If request R2 was a request for file G, then the total benefit of triple cC,F,G> is increased either by the time elapsed between request RO and request R2, or by the expected time to retrieve file G, whichever is less.
g. If request R2 was a request for file G, and G was triggered or explicitly retrieved by one or more requests that user U made strictly in between requests R0 and R2, _59_ CA 0223601~ 1998-04-27 W ~ 97/16796 PCT~US96/17981 with Rl denoting the earliest such request, then the total benefit of triple <C,F,G>
is decreased either by the time elapsed between request Rl and request R2, or by the expected time to retrieve file G, whichever is less.
4. If a user U requests a file F, then the trigger-count is incrernçntecl by onefor each triple currently in the set PFT such that the triple has form <C,F,G>, where user U is in the set or cluster id~ntifiecl by C.
5. The "age" of a triple <C,F,G> is defined to be the number of days elapsed between its tim~t~mp and the current date and time. If the age of any triple <C,F,G>
exceeds a fixed coll:,L~-L number of days, and also eyçee~l~ a fixed coll~L~.I multiple of the triple's count, then the triple may be deleted from the set PFT.
Proxy server S can therefore decide rapidly which files G should be triggered by a request for a given file F from a given user U, as follows.
1. Let C0 be the user cluster ~..~ p user U, and C1 be the set of all users.
2. Server S constructs a list L of all triples <CO,F,G> such that <CO,F,G>
appears in set PFT with a count exceefling a fixed threshold.
- 3. Server S adds to list L all triples <Cl,F,G> such that <CO,F,G> does not appear on list L and <Cl,F,G> appears in set PFT with a count exceeding another fixed threshold. 4. For each triple <C,F,G> on list L:
5. Server S computes the cost of triggering file G to be expected cost of retrieving file Gi, times 1 minus the quotient of the target-count of <C,F,G> by the trigger-count of <C,F,G>.
6. Server S computes the benefit of triggering file G to be the total benefit of <C,F,G> divided by the count of <C,F,G>.
7. Finally, proxy server S uses the computed cost and benefit, as described earlier, to decide whether file G should be triggered. The approach to pre-fetching just described has the advantage that all data storage and manipulation concerning pre-fetching decisions by proxy server S is handled locally at proxy server S.
However, this "user-based" approach does lead to d~lplic~ted storage and effort across proxy servers, as well as incomplete data at each individual proxy server.
That is, the hlrollllaLion inclic~ting what files are frequently retrieved after file F is scattered in an uncoordinated way across numerous proxy servers. An alternative,"file-based" approach is to store all such h~.,llllalion with file F itself. Thedifference is as follows. In the user-based approach, a pre-fetch triple <C,F,G> in CA 0223601~ 1998-04-27 W O 97/16796 PCTAUS96/179~1 server S's set PFT may mention any file F and any file G on the n~lwol]~, but isr~ l to clusters C that are subsets of the user base of server S. By contrast, in the file-based approach, a pre-fetch triple <C,F,G> in server S's set PFT may mention any user cluster C and any file G on the net-work, but is restrirte~l to files F that are stored on server S. (Note that in the file-based approach, user clustering is n~;lwolk wide, and user clusters may include users from di~Te.ellL proxy servers.) When a proxy server S2 sends a request to server S to retrieve file F for a user U, server S2 inrlic~tes in this m~ss~e the user U's user cluster C0, as well as the user U's value for the user-d~le-llfmed multiplier that is used in cost-benefit analysis.
Server S can use this h.rolll-~lion, together -with all its triples in its set PFT of the form <CO,F,G> and <Cl,F,G>, where C1 is the set of all users everywhere on the network, to deterrnine (exactly as in the user-based approach) which files Gl...Gk are triggered by the request for file F When server S sends file F back to proxyserver S2, it also sends this list of files Gl...Gk, so that proxy server S2 can proceed to pre-fetch files G1.. Gk.
The file-based approach requires some additional data tr~n~mi~ion Recall that under the user-based approach, server S must execute steps 3c-3g above for any ordered pair of requests R0 and R2 made within t minllte~s of each other by a user who employs server S as a proxy server. Under the file-based approach, server S must ~c~lte steps 3c-3g above 20 for any ordered pair of requests R0 and R2 made within t mim ltes of each other, by any user on the network, such that R0 requests a file stored on server S. Therefore, when a user makes a request R2, the user's proxy server must send a notification of request R2 to all servers S such that, during the prece-ling t mim-tes (where the variable t may now depend on server S), the user has made a request R0 for a file stored on server S. This notification 25 need not be sent immetli~t~ly, and it is generally more efficient for each proxy server to buffer up such notifications and send them periodically in groups to the appl-Jp.iate servers.
Access And Reachability Control of Users and User-Specific Information Although users' true i~lentities are protected by the use of secure mix paths, pseudol.yl~ y does not guarantee complete privacy. In particular, advertisers can in 30 principle employ user-specific data to barrage users with unwanted solicitations. The general solution to this problem is for proxy server S2 to act as a representative on behalf of each user in its user base, perrnitting access to the user and the user's private data only in CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981 acco,.lance with criteria that have been set by the user. Proxy server S2 can restrict access in two ways:
1. The proxy server S2 may restrict access by third parties to server S2's pse~ldonymous cl~t~h~e of user-specific inrc~ lalion. When a third party such as an advertiser sends a message to server S2 requesting the release of user-specific information for a pseudonym P, server S2 re fuses to honor the request unless the ....-.~ge in~ dçs credçnti~lc for the accessor adequate to prove that the accessor is entitled to this il~ollllaLion. The user ~ccori~te-l with pseudonym P may at any time send signed control messages to proxy server S2, specifying the credenti~l~ or Boolean combinations of cred~onti~l~ that proxy server S2 should thenceforth consider to be adequate grounds for releasing a specified subset of the information associated with pseudonym P. Proxy server S2 stores these access criteria with its d~t~k~e record for pseudonym P. ~or example, a user might wish to proxy server S2 to release purchasing illrollllalion only to selected i~ lion providers, to ~,l~iLal~le ol~ ;r nC (that is, olg~ ;Qn~ that can provide a g~>v~ ment-issued credential that is issued only to registered charities), and to market reseal~,hel., who have paid user U for the right to study user U's purchasing habits.
2. The proxy server S2 may restrict the ability of third parties to send electronic messages to the user. When a third party such as an advertiser attempts to send information (such as a textual message or a request to enter into spoken or written real -time co--~ -ir~ti~n) to pseudonym P, by sending a message to proxyserver S2 requesting proxy server S2 to forward the information to the user at pseudonym P, proxy server S2 will refuse to honor the request, unless the message in~ d~s credtonti~l~ for the accessor ~deq~l~te to meet the requirements the user has chosen to impose, as above, on third parties who wish to send hlrolllla~ion to the user. If the message does include adequate credçnfi~lc~ then proxy server S2 removes a single-use pseudonymous return address envelope from it s database record for pseudonym P, and uses the envelope to send a message co..l~i.,;..g the specified inrc,llllaLion along a secure mix path to the user of pseudonym P. If the envelope being used is the only envelope stored for pseudonym P, or more generally if the supply of such envelopes is low, proxy server S2 adds a notation to this message before sending it, which notation in~1iC~tÇS to the user's local server that it should send additional envelopes to proxy server S2 for future use.

CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981 In a more general variation, the user may instruct the proxy server S2 to irnpose more eo~ ~ ~plPY requirements on the ~ulLillg of requests by third parties, not simply boolean combinations of required credçnti~l~ The user may impose any Boolean colllbhlalion of simple requirements that may in~ de7 but are not limited to, the following:
(a.) the accessor (third party) is a particular party (b.) the ~cce~or has provided a particular credPnti~l (c.) satisfying the request would involve ~ sllre to the accessor of a certain fact about the user's user proffle (d.) satisfying the request would involve disclosure to the ~ccessor of the user's target profile interest s ll--l-,~.y (e.) satisfying the request would involve disclosure to the accessor of statistical summary data, which data are computed from the user's user profile or target profile interest sllmm~ry together with the user profiles and target profile interest s-lmm~ries of at least n other users in the user base of the proxy server (~) the content of the request is to send the user a target object, and this target object has a particular attribute (such as high reading level, or low vulgarity, or an authenticated Parental Guidance rating from the MPAA) (g.) the content ofthe request is to send the user a target object, and this target object has been digitally signed with a particular private key (such as the private key used by the National Pharm~ceutis~1 Association to certify approved doc~lmçnt~) (h.) the content of the request is to send the user a target object, and the target l~rofile has been digitally signed by a profile ~lthlontic~tion agency, guar~ntePing that the target profile is a true and accurate profile of the target object it claims to describe, with all attributes ~llth~ntiç~ted ~1.) the content ofthe request is to send the user a target object, and the target profile of this target object is within a specified distance of a particular search profile specified by the user (j.) the content ofthe request is to send the user a target object, and the proxy server S2,.by using the user's stored target pro~1e interest sl~....,.~.y, estim~tes the user's likely interest in the target object to be above a specified threshold (k.) the accessor indicates its willingn~c~ to make a particular payment to the user in .oxçh~nge for the fillfillm~nt of the request CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981 The steps l~u~l~d to create and ~ the user's access-control re4ui~elllt;"Ls are as follows:
1. The user composes a boolean colllbinaLion of predic~tes that apply to requests; the reslllting complex predicate should be true when applied to a request that the user wants proxy server S2 to honor, and false other~,vise. The complexpredicate may be encoded in another form, for efflciency.
2. The cc , t pledicaLe is signed with SKp, and ~ ed from the user's client processor C3 to the proxy server S2 through the mix path Pnf~losed in a packet that also collL~ IIs the user's pseudonym P.
3. The proxy server S2 receives the packet, verifies its ~lth~nticity using PKp and stores the access control instructions specified in the packet as part of its database record for pseudonym P.
The proxy server S2 enforces access control as follows:
1. The third party (acc~ccor) Ll~ll~ls a request to proxy server S2 using the l S normal point-to-point connections provided by the network N . The request may be to access the target profile interest sllmm~ries associated with a set of pseudonyms Pl...Pn, or to access the user profiles associated with a set of pseudonyms Pl ...Pn, or to rOIwald a message to the users associated with pseudonyms P1...Pn. The accessor may explicitly specify the pseudonyms Pl...Pn, or may ask that Pl...Pn be chosen to be the set of all pselldonyms registered with proxy server S2 that meet specified conditions.
2. The proxy server S2 indexes the r1~t~h~e record for each pseudonym Pi (1 <= I <= n), retrieves the access requilell~llLs provided by the user associated with Pi, and determines whether and how the tr~ncmitted request should be satisfied for Pi. If the requirements are s~tiefie~l~ S2 proceeds with steps 3a-3c.
3a. If the request can be s~ti~fied but only upon payment of a fee, the proxy server S2 Ll~ lLS a payment request to the accessor, and waits for the accessor to send the payment to the proxy server S2. Proxy server S2 retains a service fee and forward s the balance of the paylllellL to the user associated with pseudonym Pi, via an anonymous return packet that this user has provided.
3b. If the request can be satisfied but only upon provision of a cred~nfi~l~ theproxy server S2 transmits a credential request to the accessor, and waits for the accessor to send the credential to the proxy server S2.

CA 0223601~ 1998-04-27 W O 97/16796 PCTAUS961179~1 3c. The proxy server S2 ~atiefiçs the request by disclosing user-specific inro,n~alion to the accessor, by providing the accessor with a set of single-useenvelopes to comml~n: ,ate directly with the user, or by ~1 w~uding a message to the user, as requested. 4. Proxy server S2 optionally sends a m.o.e~ge to the accessor, ;, .. I;e~ g why each of the denied requests for Pl .. Pn was denied, and/or indic,ating how many requests were saticfied S. The active and/or passive relevance feedbac~ provided by any user U with respect to any target object sent by any path from the acc~sssor is tabulated by the above-described tabulating process resident on user U's client processor C3. As described above, a sl-mm~ny of such information is periodically tr~n~mitte.d lo the proxy server S2 to enable the proxy server S2 to update that user's target profile interest ~Ullllll~ly and user profile.
The access control criteria can be applied to solicited as well as unsolicited tr~n~mi~ion~ That is, the proxy server can be used to protect the user from inapplopliate 15 or misrepresented target objects that the user may request. If the user requests a ~arget object from an information server, but the target object tums out not to meet the access controlcriteria, thentheproxyserverwillnotpemlittheillrollll~LLionserverto Ll~l~ iL the target object to the user, or to charge the user for such tr~n~mi~sion For example, to guard against target objects whose profiles have been tampered with, the user may specify an 20 access control criterion that requires the provider to prove the target profile's accuracy by means of a digital sign~tllre from a profile ~llthçntication agency. As another example, the parents of a child user may instruct the proxy server that only target objects that have been digitally signed by a recognized child protection ol~ iQn may be ~ ed to theuser; thus, the proxy server will not let the user retrieve pornography, even from a rogue 25 i lrollllalion server that is willing to provide pornography to users who have not supplied an adulthood credçnti~l Distribution of Information with Mnlti~qct Trees The graphical repres~nt~tion of the network N presented in Figure 3 shows that at least one of the data comml-nications links can be ~ (ed, as shown in Figure 4, while 30 still çn~hling the network N to ~ ll--l messages among all the servers A-D. By ~limin~tinn we mean that the link is unused in the logical design of the network, rather than a physical disconnection of the link. The graphs that result when all red--nd~nt data communications links are el;",;l-~ed are termed "trees" or "connecte.d acyclic graphs." A

W O 97/16796 PCTnUS96/17981 graph where a m~ ge could be L ~ ed by a server through other servers and then return to the Ll~ g server over a di~elellL ori in~ting data co---,-~ -ic~ti~ns link is termed a "cycle." A tree is thus an acyclic graph whose edges (links) connect a set of graph "nodes" (servers). The tree can be used to ~ffici~ntly broadcast any data file to selected servers in a set of i-.~e~col-l-c~,Led servers.
The tree structure is attractive in a comm-lnic~tit)ns network because much information distribution is multicast in nature -- that is, a piece of i~rol"ialion available at a single source must be distributed to a multiplicity of points where the information can be ~ccesse~l This technique is widely known: for example, "FAX trees" are in common use in political o,~ l;t)n~, and mllhi~ t trees are widely used in distribution of mnltimçrli~ data in the Internet; for example, see "Scalable Fee~b~k Control for Multicast Video Distribution in the Internet," (Jean - Chrysostome Bolot, Thierry Turletti, & Ian W~k~m~n Ccmr~lt~r Co"-l"unication Review, Vol. 24, # 4, Oct. '94, Procee~1ing~ of SIGCOMM'94, pp. 58 - 67) or "An Ar~hitech~re For Wide-Area ~ -ltic~t Routing," (Stephen Deering, Deborah Estrin, Dino Farinacci, Van Jacobson, Ching-Gung Liu, & Liming Wei, Computer Communication Review, Vol. 24, # 4, Oct. '94, Procee~in~ of SIGC0MM'94, pp. 126 -135). While there are many possible trees that can be overlaid on a graph reprçs~nt~tion of a network, both the nature of the networks (e.g., the cost of tr~n~mitting data over a link) and their use (for example, certain nodes may exhibit more frequent intt;~;o".,."-ni~tion) can make one choice of tree better than another for use as a m--ltic~ct tree. One of the most difiicult problems in practical neLw~,k design is the construction of "good" m~ltic~ct trees, that is, tree choices which exhibit low cost (due to data not l-~ve~sil-g links llnneces~rily) and good pelrc.l~ ce (due to data frequently being close to where it is needed) Constructing a M~ cast Tree Algorithms for constructing mllltic~ct trees have either been ad-hoc, as is the case of the Deering, et al. Internet m--ltic~t kee, which adds clients as they request service by grafting them into the existing tree, or by construction of a minim--m cost sp~nning tree. A
distributed algorithm for creating a sp~nning tree (defined as a tree that connects, or "spans,"
all nodes of the graph) on a set of Ethernet bridges was developed by Radia Perlman ("Interconn~ction~ Bridges and Routers," Radia Perlman, Addison-Wesley, 1992). Creating a minimal-cost sr~nning tree for a graph depends on having a cost model for the arcs of the graph (corresponding to communications l inks in the com~unications network). In the case of Fth.ornet bridges, the default cost (more complic~te~l costing models for path costs are CA 0223601~ 1998-04-27 W O 97/16796 PCTAUS96/179~11 cllc~ed on pp. 72-73 of Perlman) is calculated as a simple ~ t~n~.e measure to the root;
thus the ~IJ~ g tree . ~ es the cost to the root by first electing a unique root ancl then constructing a ~ tree based on the ~ n~xc from the root. In this algorithm, the root is elected by leco~ ,e to a numeric ID cont~ined in "configuration messages": the server w 5 hose ID has ...;~ .....~ ."- numeric value is chosen as the root. Several problems exist with this algorithm in general. First, the method of using an ID does not necees~rily select the best root for the nodes illLe~o~ ected in the tree. Second, the cost model is simplistic.
We first show how to use the similarity-based methods described above to select the servers most interested in a group of target objects, herein termed "core servers" for that 10 group. Next we show how to construct an unrooted m~ ic~ct tree that can be used to broadcast files to these core servers. Finally, we show how files corresponding to target objects are actually broadcast through the multicast tree at the initiative of a client, and how these files are later retrieved from the core servers when clients request them.Since the choice of core servers to distribute a file to depends on the set of users who 15 are likely to retrieve the file (that is, the set of users who are likely to be interested in the corresponding target object), a separate set of core servers and hence a separate multicast tree may be used for each topical group of target objects. Throughout the description below, servers may communicate among themselves through any path over which messages can travel; the goal of each multicast tree is to O~ ~iGe the multicast distribution of files 20 corresponding to target objects of the corresponding topic. Note that this problem is completely distinct from selecting a multiplicity of spanning trees for the complete set of interconnected nodes as disclosed by Sincoskie in U.S. Patent No. 4,706,080 and the publication titled "F~n~led Bridge Algorithms for Large Networks" by W. D. Sincoskie and C. J. Cotton, published January 1988 in IEEE Network on pages 16-24. The trees in 25 this disclosure are intentionally deeigned to interconnect a selected subset of nodes in the system, and are sl~cce~fill to the degree that this subset is relatively small.
Multicast Tree Construction r, oce~lure A set of topical multicast trees for a set of homogenous target objects may be constructed o r reconstructed at any time, as follows. The set of target objects is grouped 30 into a fixed number of topical clusters Cl...Cp with the methods described above, for example, by choosing Cl ...Cp to be the result of a k-means clustering of the set of target objects, or alternatively a covering set of low-level clusters from a hierarchical cluster tree .

CA 0223601~ 1998-04-27 WO 97/16796 PCT~US96/17981 of these target objects. A mllltic~ct tree MT(c) is then constructed from each cluster C in Cl...Cp, by the following procedure:
1. Given a set of proxy servers, Sl...Sn, and a topical cluster C. It is ~sumçd that a general mllltic~et tree MTfi"~ that co..~ all the proxy servers Sl...Sn has S previously been constructed by well-known methods.
2. Each pair <Si C> is associated with a weight, w(Si, C), which is int~n~led to covary with the ~ ectçrl number of users in the user base of proxy server Si who will subsequently access a target object from cluster C This weight is computed by proxy server Si in any of several ways, all of which make use of the similarity I0 measurement computation described herein.
One variation makes use ofthe following steps: (a) Proxy server Si randomly selects a target object T from cluster C. (b) For each pseudonym in its local d~st~b~cç, with associated user U, proxy server Si applies the techniques disclosed above to user U's stored user profile and target profile interest summary in order to estim~te the interest w(U, T) that user U has in t he selected target object T. The aggregate interest w(Si, T) that the user base of proxy server Si has in the target object T is defined to be the sum of these interest values w(U, T). Alternatively, w(Si, T) may be defined to be the sum of values s(w(U, T)) over all U in the user base. Here s(*) is a sigmoidal function that is close to 0 for small ar~m~nt~
and close to a consL~ll P,~ for large ar~-m~nt~; thus s(w(l~J, T)) estim~tes the probability that user U will access target object T, which probability is assumed to be independent of the probability that any other user will access target object T. In a variation, w(Si, T) is made to estim~te the probability that at least one user from the user base of Si will access target object T: then w(Si, T) may be defined as the maximum of values w(U, T), or of 1 minus the product over the users U of the quantity (1 - s(w(U, T))). (c)Proxy server Si repeats steps (a)-(b) for several target objects T selected randomly from cluster C, and averages the several values of w(Si, T) thereby compl-ted in step (b) to determine the desired quantity w(Si C), which quantity lt;~lc;s~ the expected aggregate interest by the user base of proxy server Si in the target objects of cluster C.
In another V~ri~tion~ where target profile interest ~, - " ", li1 l ies are embodied as search profile sets, the following procedure is followed to compute w(Si, C): (a). For each search profile Ps in the locally stored search profile set of any user in the user base of proxy server Si, proxy server Si computes the ~ t~n~e dOES, Pc) between the search profile and the cluster profile Pc of cluster C. (b). w(Si,C) is chosen to be the m~imllm value of (-d(PS

CA 0223601~ 1998-04-27 WO 97/16796 PCT~US96/179,Bl ,Pc)/r) across all such search profiles Ps~ where r is computed as an affine function of the cluster ~ meter of cluster C. The slope and/or i,lLel cep~ of this affine function are chosen to be smaller (thereby il~ g w(Si, C)) for servers Si for which the target object provider wishes to improve pelrollllance~ as may be the case if the users in the user base of proxy 5 server Si pay a premium for improved pelr~Jllllance, or if pelrOIlllallce at Si will otherwise be unacceptably low due to slow network connections.
In another variation, the proxy server Si is modified so that it ..,~;"~ c nol: only target profile interest ~u~ ll~ies for each user in its user base, but also a single aggregate target profile interest summary for the entire user base. This aggregate target profile interest sl-mm~ry is delellnilled in the usual way from relevance fee~lb~k, but the relevance feedb~ on a target object, in this case, is considered to be the frequency with which users in the user base retrieved the target object when it was new. Whenever a user retri~ves a target object by means of a request to proxy server Si, the aggregate target profile interest sl-mm~ry for proxy server Si is updated. In this variation, w(Si, C) I s e~l;",~led by the following steps:
(a) Proxy server Si randomly selects a target object T from cluster C.
(b) Proxy server Si applies the teçhniques disclosed above to its stored ag~é~aLe target profile interest sl-mm~ry in order to estim~te the aggregate interest w(Si, T) that its ag~egaLed user base had in the selected target object T, when new;
this may be interpreted as an estim~te of the likelihood that at least one memlber of the user base will retrieve a new target object similar to T.
(c)Proxy server Si repeats steps (a)-(b) for several target objects T selected randomly from cluster C, and averages the several values of w(Si, T) thlereby computed in step (b) to determine the desired quantity w(Si, C), which quantity leylèsellL~ the expected aggregate interest by the user base of proxy server Si in the target objects of cluster C
3. Those servers Si from among Sl...Sn with the greatest weights w(Si, (') are deeign~te~ "core servers" for cluster C. In one variation, where it is desired to select 2 fixed number of core servers, those servers Si with the greatest values of w(Si, C) are selected.
30 In another variation, the value of w(Si, C) for each server Si is compared against a. fixed threshold w,~"", and those servers Si such that w(Si, C) equals or PX( ee~ls w,~"~, are selected as core servers. If cluster C lep,ese"Ls a narrow and spe~ i7ecl set of target objects, as often happens when the clusters Cl...Cp are numerous, it is usually ~dequ~te to selec:t only CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981 a small number of core server cluster C, thereby obtaining substantial advantages in computational efficiency in steps 4-5 below .4. A complete graph G(C) is constructed whose vertices are the decign~ted core servers for cluster C. For each pair of core servers, the cost of Lln.l~lll;ll;l.~ a message S b~:lwwll those core servers along the ~he~pe~l path is e~l;" ,~1 e~l, and the weight of the edge connecting those core servers is taken to be this cost. The cost is determined as a suitable function of average Lln~ ;on charges, average tr~n~mi~ion delay, and worst-case or near-worst-case tr~n~miecion delay.
5. The ml~ltic~t tree MT(C) is computed by standard methods to be the minimllm 10 spanning tree (or a near-minim-lm sp~nning tree) for G(C), where the weight of an edge between two core servers is taken to b e the cost of lln~ a message between those two core servers. Note that MT(C) does not contain as vertices all proxy servers S 1. . . Sn, but only the core servers for cluster C.
6. A message M is formed describing the cluster profile for cluster C, the core 15 servers for cluster C and the topology ofthe mlllti~ct tree MT(C) constructed on those core servers. Message M is broadcast to all proxy servers S1...Sn by means of the general mllltic~t treeMlfi"~. Each proxy server Si, upon receipt of message M, extracts the cluster profile of cluster C, and stores it on a local storage device, together with certain other information that it determines from message M, as follows. If proxy server Si is named in 20 m~s~ge M as a core server for cluster C, then proxy server Si extracts and stores the subtree of MT(C) in~1~1ced by all core servers whose path ~ t~nee from Si in the graph MT(C) is less than or equal to d, where d is a co~ positive integer (usually from 1 to 3). If m~c~ge M does not name proxy server Si as a core server for Ml (C), then proxy server Si extracts and stores a list of one or more nearby core servers that can be inexpensively 25 contacted by proxy server Si over virtual point-to-point links.
In the network of Figure 3, to illustrate the use of trees, as applied to the system of the present invention, consider the following simple example where it is assumed that client r provides on-line information for the network, such as an electronic newspaper. This illfo~ aLion can be structured by client r into a preall~lged form, comprising a number of 30 files, each of which is associated with a di~elGlll target object. In the case of an electronic newspaper, the files can contain textual lepl~s~ ;nns of stock prices, weather forecasts, editorials, etc. The system determines likely clem~nrl for the target objects associated with these files in order to oplil~ e the distribution of the files through the network N of CA 0223601~ 1998-04-27 W O 97/16796 PCTrUS96/17981 interconnected clients p-s and proxy servers A-D. Assume that cluster C consists of text articles relating to the aerospace industry; further assume that the target profile ir~terest - ;es stored at proxy servers A and B for the users at clients p and r intiic~te that these users are strongly illlelc~,led in such articles. Then the proxy servers A and B are selected 5 as core servers for the mllltic~ct tree MT(C). The mllltic~et tree MT(C) is then computed to consist of the core servers, A and B, ct nnected by an edge that lepl~sellLs the least costly virtual point-to-point link between A and B (either the direct path A-B or the indirect path A-C-B, depending on the cost).
Global Requests to Mnlti~~ct Trees One type of message that may be tr~ncmitted to any proxy server S is tenmed a "global request m~Cc~ Such a m~oCc~ge M triggers the broadcast of an embedded request R to all core servers in a mllltir.~ct tree MT(C). The content of request R and the iclentity of cluster C are incl~ded in the message M, as is a field indicating that message M is a global request message. In addition, the message M contains a field S",t wblich is unspecified except under certain circ~mct~n~es described below, when it names a specific core server. A global request message M may be ll~ led to proxy server S by a user registered with proxy server S, which tr~ncmicci~)n may take place along a pseudonymous mix path, or it may be Ll;~ ed to proxy server S from another proxy server, along a virtual point-to-point connection.
When a proxy server S receives a message M that is marked as a global request message, it acts as follows: 1. If proxy server S is not a core server for topic C, it retrieves its locally stored list of nearby core servers for topic C, selects from this list a nearby core server S', and L1~I 711U~S a copy of mess~ge M over a virtual point-to-point cnnntoctiQn to cor e server S'. If this tr~ncmiscion fails, proxy server S repeats the procedure with other core servers on its list. 2. If proxy server S is a core server for topic C, it executes the following steps: (a) Act on the request R that is embedded in message M. (b) Set S~;"Tr to be S(C) Retrieve the locally stored subtree of MT(C), and extract from it a list L of all core servers that are directly linked to S,~ in this subtree. (d) If the message M specifies a value For S,,~t and S~ ~.,t appears on the list L, remove S~y;t from the list L. Note that list L may be empty before this step, or may become empty as a result of this step. (e) For each server Si in list L, transmit a copy of m~ss~ge M from server S to server Si over a virtual point-to-point connection, where the S,jUt field of the copy of message M has been altered to S~""r If Si cannot be reached in a reasonable amount of time by any virtual point-to-point com1ection CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981(for ~ , '-, server Si is broken), recurse to step (c) above with SO~,g bound to Sc"~ and Sc"rr bound to S {\sub I} for the duration of the recursion.
When server S' in step 1 or a server Si in step 2(e) receives a copy of the global request . . .~ e M, it acts according to exactly the same steps. As a result, all core servers S eventually receive a copy of global request m~ss~ge M and act on the embedded request R, unless some core servers carmot be reached. Even if a core server is unreachable, step (e) ensures that the broadcast can continue to other core servers in most circ~lm~t~nces provided that d > l; higher values of d provide additional insurance against unreachable core servers.
10 M-~lti~ Files The system for custom,7ed electronic information of desirable objects ~xec~ltes the following steps in order to introduce a new target object into the system. These steps are ini~i~ted by an entity E, which may be either a user entering colll~ s via a keyboard at a client processor q, as illustrated in Figure 3, or an automatic software process resident on 15 a client or server processor q. 1. Processor q forms a signed request R, which asks the receiver to store a copy of a file F on its local storage device. File F, which is ~ ed by client q on storage at client q or on storage ?~cc~c~ le by client q over the network, contains the i~ ,n~Lional content of or an identifying description of a target object, as described above. The request R also in~llld.o~ an address at which entity E may be contacted 20 (possibly a pseudonymous address at some proxy server D), and asks the receiver t o store the fact that file F is ~A;~ -çd by an entity at said address. 2. Processor q embeds request R in a message M1, which it pseudonymously ~ sll,.ls to the entity E's proxy server D as described above. Message Ml instructs proxy server D to broadcast request R along an appropriate multicast tree. 3. Upon receipt of me~s~ge M1, proxy server D examines the 25 doubly embedded file F and computes a target profile P for the corresponding target object.
It compares the target profile P to each of the cluster profiles for topical clusters C 1. . .Cp described above, and chooses Ck to be the cluster with the smallest similarity distance to profile P. 4. Proxy server D sends itself a global request message M instructing itself to broadcast request R along the topical multicast tree MT(Ck). 5. Proxy server D notifies 30 entity E through a pseudonymous cc,,,,,,~ ic~tion that file F has been multicast along the topical multicast tree for cluster Ck.
As a result of the procedure that server D and other servers follow for acting on global request messages, step 4 eventually causes all core servers for topic Ck to act on CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/179$1 request R and the~ ~rOl e store a local copy of file F. In order to make room for file F on its local storage device, a core server Si may have to delete a less useful file. There are several ways to choose a file to delete. One option, well known in the art, is for Si to choose to delete the least recently accessed file. In another variation, Si deletes a file that it believes few users will access. In this v~ri~tion~ whenever a server Si stores a copy of a file F, il: also computes and stores the weight w(Si, CF), where CF is a cluster coneieting of the single target object associated with file F. Then, when server Si needs to delete a file, it chooses to delete the file F with the lowest weight w(Si, CF). TO refiect the fact that files are accessed less as they age, server Si periodically multiplies its stored value of w(Si, CF) by a decay factor, such as 0.95, for each file F that it then stores. Alter natively, instead of using a decay factor, server Si may periodically leco-l,p.lte aggregate interest w(Si, CF) for each file F that it stores; the aggregate interest changes over time because target objects typically have an age attribute that the system considers in çstim~ting user interest, as described above.
If entity E later wishes to remove file F from the network, for example because it has just multicast an updated version, it pseudonymously transmits a digitally signed grlobal request mece~ge to proxy server D, reqU~eting all proxy servers in the multicast tree Ml'(Ck) to delete any local copy of file F that they may be storing.
Queries to Multicast Trees In ad(litiQn to global request m~Ss~ree~ another type of message that may be Ll~"~;l1l;1led to any proxy server S is termed a "query message." When Ll~ ed to a proxy server, a query ml~e~ causes a reply to be sent to the originator of the message; this reply will contain an answer to a given query Q if any of the servers in a given mllltic~et tree MT(C) are able to answer it, and will otherwise intlie?lte that no answer is available. The query and the cluster C are named in the query message. In ~d~liticn, the query message contains a field S~ t which is unspecified except under certain ~;h~ CeS described below, when it names a specific core server. When a proxy server S receives a message M
that is marked as a query message, it acts as follows:l. Proxy server S sets Ar to be the return address for the client or server that ll~ Led message M to server S. Ar may be either a network address or a pseudonymous address 2. If proxy server S is not a core server for cluster C, it retrieves its locally stored list of nearby core servers for to]pic C, selects from this list a nearby core server S', and transmits a copy of the locate mess.~ge M
t over a virtual point-to-point connection to core server S'. If this tr~nemieeif,n fails, proxy CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/~7981 server S repeats the procedure with other core servers on its list. Upon receiving a reply, it fol w~ds this reply to address A,. 3 . If proxy server S is a core server for cluster C, and it is able to answer query Q using locally stored il~.~ Lion, then it Ll~ls~ a "positive"
reply to Ar cc . .l~;";llg the answer. 4. If proxy server S is a core server for topic C, but it is 5 unable to answer query Q using locally stored il~lll~Lion, then it carries out a parallel depth-first search by e~e~ tin~ the following steps: (a) Set L to be the empty list. (b) Retrieve the locally stored subtree of MT(C). For each server Si directly linked to S~ulr in this subtree, other than S"" (if specified), add the ordered pair (Si, S) to the list L. (c) If L
is empty, transmit a "negative" reply to address Ar saying that server S cannot locate an 10 ans~ver to query Q, and t~ AIÇ the execution of step 4; otherwise proceed to step (d). (d) Select a list Ll of one or more server pairs (Ai, Bi) from the list L. For each server pair (Ai, Bi) on the list L1, form a locate message M(Ai, Bi), which is a copy of message M whose S"st field has been modified to specify Bi, and Ll~ls~ this message M(Ai, Bi) to server Ai over a virtual point-to-point connection. (e~ For each reply received (by S) to a message 15 sent in step (d), act as follows: (I) If a "positive" reply arrives to a locate message M(Ai, Bi), then forward this reply to A, and telll~illa~e step 4, immçf~iAtrly. (ii) If a "negative"
reply arrives to a locate message M(Ai, Bi), then remove the pair (Ai, Bi) from the list Ll.
(iii) If the mçssAge M(Ai, Bi) could not be s~cces~fillly delivered to Ai, then remove the pair (Ai, Bi) from the list Ll, and add the pair (Ci, Ai) to the list Ll for each Ci other than 20 Bi that is directly linked to Ai in the locally stored subtree of MT(C). (f) Once Ll no longer contains any pair (Ai, Bi) for which a message M(Ai, Bi) has been sent, or after a fixed period of time has elapsed, return to step (c).
Rel, ;e~ Files from a Multicast Tree When a processor q in the network wishes to retrieve the file associated with a given 25 target object, it e~ceclltçs the following steps. These steps are initiAted by an entity E, which maybeeitherauserentrringco.. ,An~lsviaakeyboardataclientq, asillustratedinFigure 3, or an ~llt- mAtic so~ware process resident on a client or server processor q. 1. Processor q forms a query Q that asks whether the recipient (a core server for cluster C) still stores a file F that was previously mllltiçAct to the mllltir.Act tree MT(C); if so, the recipient server 3 0 should reply with its own server name . Note that processor q must already know the name of file F and the identity of cluster C; typically, this illrullllalion is provided to entity E by a service such as the news clipping service or browsing system described below, which must identify files to the user by (name, mllltir,Aet topic) pair. 2. Processor q forms a query CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/179~;1 m~e~e M that poses query Q to the m11ltic~et tree MT(C). 3. Processor q pseudonymously s message M to the user's proxy server D, as described above. 4; Processor q receives a response M'2 to message M. 5. If the response M2 is "positive," that is, it naLmes a server S that still stores file F, then processor q pseudonymously instructs the user's proxy - Sserver D to retrieve file F from server S. If the retrieval fails because server S has deleted file F since it answered the query, then client q returns to step l. 6. If the response ~2 is "negative," that is, it inrlic~tes that no server in MT(C) still stores file F, then processor q forms a query Q that asks the reçipient for the address A of the entity that " ,~ file F;
this entity will ordinarily ...~ ;.;.. a copy of file F in~l~finite1y All core servers in MT(C) 10 ordinarily retain this i-~,r.,--nation (unless instructed to delete it by the "~ i,.g entity), even if they delete file F for space reasons. Therefore, processor q should receive a response providing address A, whereupon processor q pseudonymously instructs the user's proxy server D to retrieve file F from address A.
When multiple versions of a file F exist on local servers throughout the data lS communication network N, but are not marked as alternate versions of the same file, the system's ability to rapidly locate files similar to F (by llc;alhlg them as target objects and applying the methods disclosed in "Searching for Target Objects" above) makes it possible to find all the alternate versions, even if they are stored remotely. These related data files maythenbe.e~n.~i1çdbyanymethod. Inasimple;..~.,l;s.l;on, allversionsoftheda1tafile 20 would be replaced with the version that had the latest date or version number. In another inet~nfiRtion, each version would be a~tQm~tically annotated with l~relellces or pointers to the other versions.
NEWS CLIPPING SERVICE
The system for c11~tomi7çd electronic identific~tion of desirable objects of the25 present invention can be used in the electronic media system of Figure l to implement an ~tom~tic news clipping service which learns to select (filter) news articles to match a user's interests, based solely on which articles the user chooses to read. The system for c11ctomi7ed electronic identification of desirable objects generates a target profile for each article that enters the electronic media system, based on the relative fTequency of occulTence 30 of the words con~ained in the article. The system for customized electronic identific~tion of desirable objects also generates a search profile set for each user, as a function ~f the target profiles of the articles the user has accessed and the relevance feedbaç~ the user has provided on these articles. As new articles are received for storage on the mass s1:orage CA 0223601~ 1998-04-27 W O 97/16796 PCTAUS96/17981systems SSl -SSm of the i,lrol",~Lion servers I~ the system for cuslo"~ed electronic identification of desirable objects generates their target profiles. The generated target profiles are later co,llp~t;d to the search profiles in the users' search profile sets, and those new articles whose tar get profiles are closest (most similar) to the closest search profile in 5 a user's search profile set are id~ntified to that user for possible reading. The computer program providing the articles to the user monitors how much the user reads (the number of screens of data and the number of . . .;. ~ çs spent reading), and adjusts the search profiles in the user's search profile set to more closely match what the user appalellLly prefers to read. T he details of the method used by this system are disclosed in flow diagram form in 10 Figure 5. This method requires selecting a specific method of c~lc~ ting user-specific search profile sets, of me~ ring similarity between two profiles, and of updating a user's search profile set (or more generally target profile interest s--mm~ry) based on what the user read, and the examples disclosed herein are ~ mples of the many possible impl~m~nt~tions that can be used and should not be construed to limit the scope of the system.
5 ~nit;~ e Users' Search Profile Sets The news clipping service in~t~nti~tes target profile interest sllmm~ries as search profile sets, so that a set of high-interest search profiles is stored for each user. The search profiles associate d with a given user change over time. As in any application involving search profiles, they can be initially determined for a new user (or explicitly altered by an 20 existing user) by any of a number of procedures, in~ ling the following pl~r~lled methods : (1) asking the user to specify search profiles directly by giving keywords and/or numeric attributes, (2) using copies of the profiles of target objects or target clusters that the user in-lic.~tes are representative of his or her interest, (3) using a standard set of search profiles copied or otherwise determined from the search profile sets of people who are 25 demographically similar to the user.
Retrieve New Articles from Article Source Articles are available on-line from a wide variety of sources. In the ~l~rel,ed embodiment, one would use the current days news as supplied by a news source, such as the AP or Reuters news wire. These news articles are input to the electronic media system by 30 being loaded into the mass storage system SS4 of an information server S4. The article profile module 201 of the system for customized electronic id.~ntifiç~tion of desirable objects can reside on the i~ rOl ll.ation server S4 and operates pursuant to the steps illustrated in the flow diagram of Figure 5, where, as each article is received at step 501 by the CA 0223601~ 1998-04-27 W O 97/16796 PCTAUS96/179~1 information server S4, the article profile module 201 at step 502 generates a target profile for the article and stores the target profile in an article inrl.eYing memory (typically part of mass storage system SS4 for later use in selectively delivering articles to users. This method is equally useful for s~lecting which articles to read from electronic news groups and S electronic bulletin boards, and can be used as part of a system for screening and O~ .;.Ig electronic mail ("e-mail").
C~lc~ te Article Profiles A target profile is computed for each new article, as described earlier. The most important ~ttri~lte of the target profile is a textual attribute that stands for the entire teYt of 10 the article. This teYtual attribute is represented as described earlier, as a vector of numbers, which numbers in the pref~ d embodiment include the relative frequencies (TE~/IDF
scores) of word occurrences in this article relative to other comparable articles. The server must count the frequency of occurrence of each word in the article in order to con~u~e the TF/IDF scores.
These news articles are then hierarchically clustered in a hierarchical cluster tree at step 503, which serves as a decision tree for detelllli~ g which news articles are closest to the user's interest. The resulting clusters can be viewed as a tree in which the top of the tree incl~ldes all target objects and branches further down the tree represent divisions of t]he set of target objects into sllccec~ively smaller subclusters of target objects. Each cluster has a cluster profile, so that at each node of the tree, the average target profile (centroid) of all target objects stored in the subtree rooted at that node is stored. This average of target profiles is computed over the repr~s~nt~ti~ n of target profiles as vectors of numeric attributes, as described above.
Compare Current Articles' Target Profiles to a User's Search Profiles The process by which a user employs this appal~lus to retrieve news articles of interest is illustrated in flow diagram form in Figure 1 1. At step 1 101, the user logs into the data comml-nir~tion network N via their client processor Cl and activates the news reading program. This is accomplished by the user establishing a pseudonymous data communications comle~ion as described above to a proxy server S2, which providesfront-end access to the data c--mml-n; ~tion network N. The proxy server S2 I~ .i a list of authorized pseudonyms and their corresponding public keys and provides access and billing control. The user has a search profile set stored in the local data storage m~ m on the proxy server S2. When the user requests access to ,"news" at step 1102, the ]profile CA 0223601~ 1998-04-27 m~t~hinp~ module 203 resident on proxy server S2 seq~lenti~lly considers each search profile p" from the user's search profile set to deterrnine which news articles are rnost likely of interest to the user. The news articles were autom~tic~lly clustered into a hierarchical cluster tree at an earlier step so that the dete....iil~l;Qn can be made rapidly for each user.
The hierarchical cluster tree serves as a decision tree for de~e ----ni--g which articles' target profiles are most similar to search profile Pk: the search for relevant articles begins at the top of the tree, and at each level of the tree the branch or b-~hes are s~lected which have cluster profiles closest to Pk This process is recursively executed until the leaves of the tree are reached, identif~ing individual articles of interest to the user, as described in the section "Sealclli--g for Target Objects" above.
A variation on this process exploits the fact that many users have similar h~le~ls.
Rather than carry out steps 5-9 of the above process separately for each search profile of each user, it is possible to achieve added ~ffici~n-,y by carrying out these steps only once for each group of similar search profiles, thereby satisfying many users' needs at once. In this variation, the system begins by non-hierarchically clustering all the search profiles in the search profile sets of a large number of users. For each cluster k of search profiles, with cluster profile Pk, it uses the method desc-il)ed in the section "Sea cl~ , for Target Objects"
to locate articles with target profiles similar to Pk Each located article is then identified as of interest to each user who has a search profile 1 epl esenled in cluster k of search profiles.
Notice that the above variation attempts to match clusters of search profiles with similar clusters of articles. Since this is a symmetrical problem, it may instead be given a symmetrical solution, as the following more general variation shows. At some point before the m~tr~hin~ process commences, all the news articles to be considered are clustered into a hierarchical tree, termed the "target profile cluster tree," and the search profiles of all users to be considered are clustered into a second hierarchical tree, termed the "search profile cluster tree." The following steps serve to find all m~t~hes between individual target profiles from any target profile cluster tree and individual search profiles from any search prof le cluster tree: 1. For each child subtree S of the root of the search profile cluster tree (or, let S be the entire search profile cluster tree if it co--lains only one search profile): 2.
Compute the cluster profile Ps to be the average of all search profiles in subtree S 3. For each subcluster (child subtree) T of the root of the target profile cluster tree (or, let T be the entire target profile cluster tree if it contains only one target profile): 4. Compute the cluster profile PT to be the average of all target profiles in subtree T 5. Calculate d(~'s~ PT~

CA 0223601~ 1998-04-27 W O 97/16796 PCTAUS96/179$1 the cli~t~nce between Ps and PT 6. If d(PS, PT) < t, a threshold, 7. If S COI.I;1;nC only one search profile and T conLaills only one target profile, decl are a match between that search profile and that target profile,8. otherwise recurse to step 1 to find all m~tçhes bel:ween search profiles in tree S and target profiles in tree T.
- S The threshold used in step 6 is typically an affine fim~tion or other function of the greater of the cluster variances (or cluster t~ ) of S and T. Whenever a match is declared bt;Lwcen a search profile and a target profile, the target object that contributed the target profile is ic1~ntifiecl as being of interest to the user who contributed the search p:rofile.
Notice that the process can be applied even when the set of users to be considered or the set of target objects to be considered is very small. In the case of a single user, the process reduces to the method given for identifying articles of interest to a single user. In the case of a single target object, the process conctit~ltes a method for identifying users to whom that target object is of interest.
Present List of Articles to User Once the profile correlation step is completed for a selected user or group of users, at step 1104 the profile processing module 203 stores a list of the identified articles for presentation to each user. At a user's request, the profile processing system 203 retrieves the gel,t;l ~Led list of relevant articles and presents this list of titles of the selected articles to the user, who can then select at step 1105 any article for viewing. (If no titles are available, then the first s~ntenc.e(s) of each article can be used.) The list of article titles is sorted according to the degree of similarity of the article's target profile to the most similar search profile in the user's search profile set. The resll1ting sorted list is either ll~n~ (ccl in real time to the user client processor Cl, if the user is present at their client processor Cl, or can be Ll~ l ";l l ecl to a user's mailbox, resident on the user's client processor Cl or stored within 25 the server S2 for later retrieval by the user; other methods of tr~n~mi~ion include f~c,cimile L~ iQn of the printed list or telephone tr~n~mi~ n by means of a text-to-speech system. The user can then L,~,sll,il a request by computer, f~ccimile, or telephone to indicate which of the identified articles the user wishes to review, if any. The user can still access all articles in any i,~",laLion server S4 to which the user has authorized access, 30 however, those lower on the generated list are simply further from the user's interests, as determined by the user's search profile set. The server S2 retrieves the article from the local data storage meflil-m or from an i,~c,l"~aLion server S4 and presents the article one screen at -7~-CA 0223601~ 1998-04-27 W O97/16796 PCTAUS96/17981a time to the user's client processor Cl. The user can at any time select another article for reading or exit the process.
Monitor Which Articles Are Read The user's search profile set generator 202 at step 1 107 monitors which articles the 5 user reads, keeping track of how many pages of text are viewed by the user, how much time is spent viewing the article, and whether all pages of the article were viewed. This information can be combined to measure the depth of the user's interest in the article, yielding a passive relevance feedb~cl~ score, as described earlier. Although the exact details depend on the length and nature of the articles being sealclled, a typical formula might be:
10 measure of article attra~iliveness =0.2 if the second page is ~ccçssed + 0.2 if all pages are ~ccesse~l + 0.2 if more than 30 seconds was spent on the article + 0.2 if more than one minute was spent on the article + 0.2 if the mimltçs spent in the article are greater than half the number of pages.
The computed measure of article attractiveness can then be used as a w~ighting 15 function to adjust the user's search profile set to thereby more accurately reflect the user's dyn~mic~lly ~h~nging interests.
Update User Proflles Updating of a user's generated search profile set can be done at step 1 108 using the method described in copending U.S. Patent Application Serial No. 08/346,425. When an 20 article is read, the server S2 shifts each search profile in the set slightly in the direction of the target profiles of those nearby articles for which the computed measure of article attractiveness was high. Given a search profile with attributes u,k from a user's search profile set, and a set of J articles available with attributes d,k (~e~llmed correct for now), where I indexes users, j indexes articles, and k indexes attributes, user I would be predicted 25 to pick a set of P distinct articles to I l lil~il ";,e the sum of d(ul, bj) over the chosen articles j. The user's desired attributes u;" and an article's attributes d,k would be some form of word freq~l~n~ies such as TF/IDF and potentially other attributes such as the source, reading level, and length of the article, while d(ul, dj) is the distance between these two attribute vectors (profiles) using the similarity measure described above. If the user picks a different set of 30 P articles than was predicted, the user search profile set generation module should try to adjust u and/ord to more accurately predict the articles the user selected. In particular, u and/or dj should be shifted to increase their similarity if user I was predicted not to select aTticle j but did select it, and perhaps also to decrease their ~imil~rity if user I was predicted CA 0223601~ 1998-04-27 to select article j but did not. A pl~;rt; -c~d method is to shift u for each wrong prediction that user I will not select article j, using the formula: Ujk' = ujk - e(ujk d~,) Here ul is chosen to be the search profile from user I's search profile set that is closest to target profile. If e is positive, this adjll~tm~nt increases the match between user 5 I's search profile set and the target profiles of the articles user I actually selects, by rnaking UI closer to dj for the case where the algorithm failed to predict an article that the ~fiewer selected. The size of e determines how many example articles one must see to change the search profile subst~nti~lly. If e is too large, the algorithm becomes unstable, but for sufficiently small e, it drives u to its correct value. In general, e should be plopolLicnal to 10 the measure of article attractiveness; for ~mple, it should be relatively high if user I
spends a long time reading article j. One could in theory also use the above formula to decrease the match in the case where the algorithm predicted an article that the user did not read, by making e negative in that case. However, there is no guarantee that u will move in the correct direction in that case. One can also shift the attribute weights WI of user I by 15 using a similar al~ iLI-l--: wj~ = (Wik - elUjk - d~ k (wik - e¦ u; ~ - d,k¦~ T h i s i s palticularly important if one is co-~ i.. g word frequencies with other attributes. As b~efore, this increases the match if e is positive -- for the case where the algorithm failed to predict an article that the user read, this time by decreasing t he weights on those characteristics for which the user's target profile ul differs from the article's profile dj Again, the size of e 20 determines how many example articles one must see to replace what was originally believed. Unlike the procedure for adjusting u, one also make use of the fact that the above algorithm decreases the match if e is negative -- for the case where the algo-i~ -. predicted an article that the user did not read. The denominator of the c;A~ression prevents weights from ~hrinkin~ to zero over time by renorm~li7ing the modified weights WI' SO that they sum 25 to one. Both u and w can be adjusted for each article acce~e~l When e is small, as it should be, there is no confiict between the two parts of the algo. iLhl... The selectedi user's search profile set is updated at step 1108.
Further Applications of the Fill~ .g Technolog~r The news clipping ser~rice may deliver news articles (or adverti~ment~ and 30 coupons for purchasables) to off-line users as well as to users who are on-line. All:hough the off-line users may have no way of providing relevance feedb~l~.k the user profile of an off-line user U may be similar to the profiles of on-line users, for example because user U
, is demographically similar to these other users, and the level of user U's interest in CA 0223601~ 1998-04-27 particular target objects can thelt:ro-~ be estim~ted via the general interest-estim~ti~n methods described earlier. In one application, the news clipping service chooses a set of news articles (respectively, ad~ ;e~ "1~; and coupons) that are predicted to be of interest to user U, thereby determining the content of a customized newspaper (respectively, S advertising/coupon circular) that may be printed and physically sent to user U via other methods. In general, the target objects in~ decl in the printed document delivered to user U are those with the highest median predicted interest among a group G of users, where group G consists of either the single off-line user U, a set of off-line users who are demographically similar to user U, or a set of off-line users who are in the same geographic 10 area and thus on the same newspaper delivery route. In a variation, user group G is clustered into several subgroups G1...Gk; an average user profile Pi is created from each subgroup Gi; for each article T and each user profile Pi, the interest in T by a hypothetical user with user profile Pi is pre~icte~, and the interest of article T to group G is taken to be the maximum interest in article T by any of these k hypothetical users; finally, the 15 customized newspaper for user group G is constructed from those articles of greatest interest to group G.
The filtering technology of the news clipping service is not limited to news articles provided by a single source, but may be ~ ncle~l to articles or target objects collected from any number of sources. For exarnple, rather than identifying new news articles of interest, 20 the technology may identify new or updated World Wide Web pages of interest. In a second appli~tion termed "broadcast clipping," where individual users desire to broadcast mess~g~s to all interested users, the pool of news articles is replaced by a pool of messages to be broadcast, and these me~ges are sent to the broadcast-clipping-service subscribers most interested in them. In a third application, the system scans the transcripts of all 25 real-time spoken or written rli~c~s~ions on the network that are ~;ullellLly in progress and design~ted as public, and employs the news-clipping technology to rapidly identify ~i~c~ CionS that the user may be interested in joining, or to rapidly identify and notify users who may be i-llelt;~led in joining an on~oing ~li scl lecion. In a fourth application, the method is used as a post-process that filters and ranks in order of interest the many target objects 30 found by a convçntion~l database search, such as a search for all homes selling for under $200,000 in a given area, for all 1994 news articles about Marcia Clark, or for all Italian-l~n~-~ge films. In a fifth application, the method is used to filter and rank the links in a hypertext document by estim~ting the user's interest in the document or other object -CA 0223601~ 1998-04-27 W O 97/16796 PCTAJS96/179~1 associated with each link. In a sixth application, paying advertisers, who may be companies or individuals, are the source of adverti~ç~ or other messages, which take the place of the news articles in the news clipping service. A consumer who buys a product is deemed to have provided positive re1evance feedbac.k on adverti~ for that product, and a 5 consumer who buys a product app~uG-lLly because of a particular adve.~;~ç.~ (for example, by using a coupon clipped from that adverti~çm~nt) is de~med to have provided particularly high relevance feedbacl~ on that advG. ~ Such feedback may be communicated to a proxy server by the consumer's client processor (if the cqn~11mer is making the purchase electronically), by the retail 10 vendor, or by the credit-card reader (at the vendor's establi~hm~nt) that the co~ ..e~ uses to pay for the purchase. Given a d~t~b~e of such relevance feedba~.k, the disclosed technology is then used to match adverfi~çm~nte with those users who are most interested in them; advG l;c~ selected for a user are presented to that user by any one of several means, inc~ ling electronic mail, ~1tom~fic display on the users screen, or printing them 15 on a printer at a retail establi~hm~nt where the consumer is paying for a purchase. The threshold t1i~t~n~e used to identify interest may be increased for a particular advertisement, ca~1sing the system to present that adverti~m~nt to more users, in accordance with the amount that the advertiser is willing to pay.
A further use of the capabilities of this system is to manage a user's hlve~ lle~l 20 portfolio. Instead of reco~ 1ing articles to the user, the system leco"",.en-1~ target objects that are investm~nts As illustrated above by the example of stock rmarket inv~ , many dirre~elll attributes can be used together to profile each investment. The user's past invçstm~nt behavior is characterized in the user's search profile set or target profile interest ~ .y, and this h~llllalion is used to match the user with stock 25 opportunities (target objects) similar in nature to past inve~ The rapid profiling method described above may be used to determinç a rough set of pl ~;rel ellces for new users.
Quality attributes used in this system can include negatively weighted attributes, such as a measurement of fl~1ct~1~fi~-ns in dividends historically paid by the investmçnt, a quality attribute that would have a strongly negative weight for a conservative investor dependent 30 on a regular flow of investm~nt income. Furthermore, the user can set filter parameters so that the system can rnonitor stock prices and ~11tom~tic~11y take certain actions, such as placing buy or sell orders, or e-mailing or paging the user with a notific~tion when certain stock pelr~,lll~lce charact~ori~tiçs are met. Thus, the system can imme~ tely notify th,e user CA 0223601~ 1998-04-27 W 0 97/16796 PCTnUS96/17981 when a st~iected stock reaches a predc~c--llined price, without the user having to monitor the stock market activity. The user's inv~etm~nts can be profiled in part by a "type of hlv~ Pnt" attribute (to be used in coninnr~ti~n with other attributes), which rlietinf~ieh.qe among bonds, mutual funds, growth stocks, income stocks, etc., to thereby seglllGIll the 5 user's portfolio according to invf~ type. Each illVt~ ll type can then be m~n~ged to identify invc iLlllclll opportunities and the user can identify the desired ratio of invc~l",c"l capital for each type.
E-mail Filter In ~ddition to the news clipping service described above, the system for customized 10 electronic i~l~ontific~qtit~n of desirable objects filnr.fionc in an e\_mail c"vi~u~ ent in a similar but slightly different manner. The news clipping service selects and retrieves news information that would not otherwise reach its subscribers. ~ut at the same time, large numbers of e-mail mess~ges do reach users, having been generated and sent by humans or automatic programs. These users need an e-mail filter, which autom~tir.~lly processes the 15 messages received. The necees~ry ploces~ ; inr.l~lde~s a de~c""h,aLion ofthe action to be taken with each mçss~gç~ inr.l~-tlin~ but not limited to: filing the message, notifying the user of receipt of a high priority message, ~tom~tic~lly responding to a message. The e-mail filter system must not require too great an inve~ .l on the part of the user to learn and use, and the user must have c~ nfidt-nr.e in the applopl;~ n~ee ofthe actions autom~fic~lly 20 taken by the system. The same filter may be applied to voice mail meee~ges or facsimile m~ee~?,e that have been converted into electronically stored text, whether autom~fic~lly or at the user's request, via the use of w ell-known techniques for speech recognition or optical character recognition.
The filtering problem can be defined as follows: a message processing function 25 MPF(*) maps from a received message (document) to one or more of a set of actions. The actions, which may be quite specific, may be either pred~fined or customized by the use r.
Each action A has an a~pl~,pli~tçn~e.e function FA (*,*) such that FA (U,D) returns a real number, l~e~. .l ;. .g the appl Opl ;~tçn~ce of selecting action A on behalf of user U when us er U is in receipt of message D. For example, if D comes from a credible source and is 30 marked urgent, then discarding the message has a high cost to the user and has low appropriateness, so that Fdj~C2rd (U,D) is small, whereas alerting the user of receipt of the message is highly appropriate, so that F,l~, (U,D) is large. Given the determined CA 0223601~ 1998-04-27 W O 97/16796 PCTAUS96/179~1 applopl;~t~ness function, the function MPF~)) is used to autom~tic~lly select the appropliate action or actions. As an example, the following set of actions might be useful:
1. Urgently notify user of receipt of mecc~ge 2. Insert mecc~ge into queue for user to read later 3. Insert m~cc~g~ into queue for user to read later, and S~lg~eSt that user reply 4. Insert messagç into queue for user to read later, and suggest that user forvvard it to individual R
5. Sullll~ e message and insert sllmm~ry into queue 6. Forward message to user's secleL~y 7. File m~ss~ge in directory X
8. File message in directory Y
9. Delete message (i.e., ignore message and do not save) 10. Notify sender that further messages on this subject are unwanted Notice that actions 8 and 9 in the sample list above are dçci~ned to filter out mP.C.S~s that are undesirable to the user or that are received from undesirable sources, such as pesky salespersons, by deleting the ullw~led message and/or s~-n-ling a reply that indicates that messages of this type will not be read. The applv~ t~n~sc functions must be tailored to describe the ap~lopl;~t~ness of carrying out each action given the target profile for a particular document, and then a message processing function MPF can be found which is in some sense optimal with respect to the ~plopl;atçnPss function. One reasonable choice of MPF always picks the action with highest appropl;~t~nesc, an d in cases where mllltiple actions are highly applupli~e and are also cnmp~tihle with each other, selects more than one action: for example, it may automatically reply to a message and also file the same m~Cc~e in du~;~o-y X, so that the value of MPF(D) is the set \{reply, file in directory X\}.
In cases where the applvp~;~tçness of even the most appropliate action falls below a user-specified threshold, as should happen for messages of an ~mf~mili~r type, the c;ystem asks the user for co"~" ",.1 ;on of the action(s) selected by MPF. In addition, in cases where MPF selects one action over another action that is nearly as app~upliate, the system also asks the user for colLllllaLion: for example, mail should not be deleted if it is nearly as 30 applupliate to let the user see it.
It is possible to write ~lOpl ;~ s functions m~ml~lly, but the time necessary and lack of user expertise render this solution impractical. The automatic training of this system is preferable, using the automatic user profiling system described above. Each received CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981 document is viewed as a target object whose profile in~ çs such attributes as the entire text ofthe doc~lmtont (.~.~sen~ed as TF/IDF scores), doc~-m~nt sender, date sent, document length, date of last document received from this sender, key words, list of other addl essees, etc. It was disclosed above how to estim~te an interest function on profiled target objects, 5 using relevance fee~lb~c~ together with ...ea~u-~d similarities among target objects and among users. In the con text of the e-mail filter, the task is to ~Stim~te several applop.;~ ss functions FA (*,*), one per action. This is h~n~1led with exactly the same method as was used earlier to estim~te the topical interest function f(*,*). Relevance feedback in this case is provided by the user's observed actions over time: whenever user 10 U chooses action A on document D, either freely or by choosing or co..rll...ing an action recomm~nded by the system, this is taken to mean that the app-op-;~t~ne~s of action A on document D is high, particularly if the user takes this action A immerli~tçly after seeing document D. A pre~umpLion of no approp~ n~c (corresponding to the earlier presumption of no interest) is used so that action A is considered i.lapplup.iate on a 15 document unless the user or similar users have taken action A on this document or similar documents. In particular, if no similar document has been seen, no action is considered especially appl~,p.iate, and the e-mail filter asks the user to specify the a~ ,p.i~Le action or confirm that the action chosen by the e-mail filter is the applop-iate one.
Thus, the e-mail filter learns to take particular actions on e-mail messages that have 20 certain attributes or combinations of attributes. For example, mess~gçs from John Doe that originate in the (212) area code may prompt the system to forward a copy by fax n to a given fax number, or to file the me~ge in directory X on the user's client processor. A variation allows active requests of this form from the us er, such as a request that any message from John Doe be rO. wa. ~ied to a desired fax number until further notice.
25 This active user input requires the use of a natural l~n~ ge or form-based intçrf~ce for which specific co,~ tls are associated with particular attributes and coll.billaLions of attributes.
Update Notification A very illlpOI ~ll and novel characteristic of the ar~ hitectllre is the ability to identify 30 new or llptl~ed target objects that are relevant to the user, as determined by the user's search profile set or target profile interest summary. ("Updated target objects" include revised versions of docllmPnts and new models of purchasable goods.) The system may notify the user ofthese relevant target objects by an electronic notification such as an e-mail message CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/179l~1 or f~c~imile tr~n~mi~eion. In the variation where the system sends an e-mail mese~e~ the user's e-mail filter can then respond app-up-i~Lely to the notification, for instance, by bringing the noti~c~tion imme~ tely to the user's personal attention, or by autom~tic~11y sublnillillg an electronic request to purchase the target object named in the notification. A
S simple ~x~mr)1e of the 1atter response is for the e-mail filter to retrieve an on-line document at a nominal or zero charge, or request to buy a purchasable of limited quantity such as a used product or an a~ction~ble.
AC~TIVE NAVIGATION (BROWSING) B~ g by Navigating Through a Cluster Tree A hierarchical cluster tree imposes a useful olg,.. ~ ion on a collection oftarget objects. The tree is of direct use to a user who wishes to browse through all the target objects in the tree. Such a user may be exploring the collection with or without a well-specified goal. The tree's division of target objects into coherent clusters provides an efficient method whereby the user can locate a target object of interest. The user first chooses one of the highest level (largest) clusters from a menu, and is presented with a menu listing the subclusters of said cluster, whereupon the user may select one of these subclusters. The system locates the subcluster, via the app-~pliate pointer that was stored with the larger cluster, and allows the user to select one of its subclusters from another menu. This process is . epea~ed until the user comes t o a leaf of the tree, which yields the details of an actual target object. Hierarchical trees allow rapid selection of one target object from a large set. In ten menu selections from menus of ten items (subclusters) each, one can reach 101~ =10,000,000,000 (ten billion) items. In the p-e~ed embodimellt, the user views the menus on a computer screen or tern~inal screen and selects from them with a keyboard or mouse. However, the user may also make selections over the telephone., with a voice synthe~i7~r reading the menus and the user selecting subclusters via the telephone's touch-tone keypad. In another variation, the user sim~llt~neously m~int~in~ two connections to the server, a t~le!phc ne voice conn~ion and a fax connection; the server sends successive menus to the user by fax, while the user selects choices via the telephone's toucln-tone keypad.
Just as user profiles commonly include an associative attribute indicating the user's degree of interest in each target object, it is useful to a~lgm~nt user profiles with an additional associative attribute indicating the user's degree of interest in each cluster in the hierarchical cluster tree. This degree of interest may be estim~ted numerically as the CA 0223601~ 1998-04-27 W O 97/16796 PCTAJS96/17981number of subclusters or target objects the user has selected from menus associated with the given cluster or its subclusters, eA~ressed as a proportion of the total l-ul~ber of subclusters or target objects the user has s~lecte(l This associative attribute is particularly valuable if the hierarchical tree was built using "soft" or "fuzzy" rlllstering~ which allows a subcluster 5 or target object to appear in multiple clusters: if a target document appears in both the "sports" and the "humor" clusters, and the user selects it from a menu associated with the "humor" cluster, then the system increases its association between the user and the "humor"
cluster but not its association between the user and the "sports" cluster.
Labeling Clusters Since a user who is navigating the cluster tree is repeatedly expected to select one of several subclusters from a menu, these subclusters must be usefully labeled (at step 503), in such a way as to s~l~gest their content to the human user. It is strai~;L~r~ w~ .1 to include some basic h~ ion about each subcluster in its label, such as the number of target objects the subcluster co~ ills (possibly just l) and the number of these that have been 15 added or updated recently. However, it is also necess~ry to display ad-litiQn~ lrollllaLion that indicates the cluster's content. This content-desçriptive illr~ Lion may be provided by a human, particularly for large or frequently acce~ed clusters, but it may also be generated autom~tic~lly The basic automatic technique is simply to display the cluster's "char~c.teri~tic value" for each of a few highly weighted attributes. With numeric attributes, 20 this may be taken to mean the cluster's average value for that attribute: thus, if the "year of release" attribute is highly weighted in predicting which movies a user will like, then it is useful to display average year of release as part of each cluster's label. Thus the user sees that one cluster consists of movies that were released around 1962, while another consists of movies from around lg82. For short textual attributes, such as "title of movie" or "title 25 of document," the system can display the attribute's value for the cluster member (target object) whose profile is most similar to the cluster's profile (the mean profile for all 1ll~lll1)~l ~ of the cluster), for example, the title of the most typical movie in the cluster. For longer textual attributes, a useful technique is to select those terms for which the amount by which the term's average TF/IDF score across members of the cluster exceeds the term's 30 ~verage TF/IDF score across all tar get objects is greatest, either in absolute terms or else as a fraction of the standard deviation of the term's TFtIDF score across all target objects.
The selected terms are replaced with their morphological stems, ~limin;qting duplicates (so that if bot h "slept" and "sleeping" were selected, they would be replaced by the single term CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/179~1 'sleep") and optionally ~ g close ~yllollylns or collocates (so that i~both llnurseql and m~iiç~lll were selected, they might both be replaced by a single term such as 'Imlrse,'l llmedical,ll llmç.~ ine,ll or 'Ihospitalll). The reslllting set ofterms is displayed as part ofthe label. Finally, if freely redistributable thumbnail photographs or other graphical images are associated with some of the target objects in the cluster f or labeling purposes, then the system can display as part of the label the image or images whose associated target objects have target profiles most similar to the cluster profile.
Users' navi~tion~l patterns may provide some useful feedb~ as to the quality of the labels. In particular, if users often select a particular cluster to explore, but then quickly backtrack and try a di~t;-enl cluster, this may signal that the first clusterls label is miQ1~ding Insofar as other terms and ~ttrihutes can pro vide llnext-best" ~lt~rn~tive labels for the first cluster, such 'Inext-bestll labels can be automatically substituted for the mi~le~qtling label. In addition, any user can locally relabel a cluster for his or her own convenience. Although a cluster label provided by a user is in general visible only to that user, it is possible to make global use of these labels via a "user labels'l textual attribute for target objects, which attribute is defined for a given target object to be the conc~tçn~tion of all label s provided by any user for any cluster coll~ g that target object. This attlibute infl~nr~es similarity j~dgm~nt~: for example, it may induce the system to regard target articles in a cluster often labeled "Sports News" by users as being mildly similar to alticles in an otherwise ~ imil~r cluster often labeled "Intern~tion~l News'l by users, precisely because the "user labels" attribute in each cluster profile is strongly associated with the term IlNews.l' The "user label" attribute is also used in the ~tom~tiC generation of label<;, just as other textual attributes are, so that if the user-generated labels for a cluster often include "Sports," the term "Sports" may be inclllded in the automatically generated label as well.
It is not ~-çces~ . y for menus to be displayed as simple lists of labeled options,; it is possible to display or print a menu in a form that shows in more detail the relation of the di~e~t;nL menu options to each other. Thus, in a variation, the menu options are visually laid out in two dimensions or in a perspective drawing of three tlimçn~ ns. Each option is displayed or printed as a textual or graphical label. The physical coordinates at which the options are displayed or printed are generated by the following seq~ .nc.e of steps: (1) construct for each option the cluster profile of the cluster it represents, (2) construc~: from each cluster profile its decomposition into a numeric vector, as described above, (3) apply singular value decomposition (SVD) to determine the set of two or three orthogonal linear CA 0223601~ 1998-04-27 W O97/16796 PCT~US96/17981 axes along which these numeric vectors are most greatly di~ te-1 and (4) take the coordinates of each option to be the projected cooldhla~es of that option's numeric vector along said axes. Step (3) may be varied to det~rmine a set of, say, 6 axes, so that step (4) lays out the options in a 6-(1imP-n~ional space; in this case the user may view the ~omf~tric~
5 projection of the 6-~lim~n~ional layout onto any plane passing through the origin, and may rotate this viewing plane in order to see differing configurations of the options, which emphasize similarity with respect to riiff~.ring alllibuLes in the profiles of the associated clusters. In the visual representation, the sizes of the cluster labels can be varied according to the number of objects col " ~ d in the col,e~ollding clusters. In a further variation, all 10 options from the parent menu are displayed in some number of dim~neion~, as just described, but with the option corresponding to the current menu replaced by a more prominent subdisplay of the options on the current menu; optionally, the scale of this composite display may be gradually increased over time, thereby increasing the area of the screen devoted to showing the options on the current menu, and giving the visual hllpl ~ssion 15 that the user is regarding the parent cluster and l'zooming in" on the current cluster and its subclusters.
Further Navigational It should be appreciated that a hierarchical cluster-tree may be configured withm~ iple cluster selections b~ cl.;.~g from each node or the same labeled clusters presented 20 in the form of single branches for multiple nodes ordered in a hierarchy. In one variation, the user is able to perform lateral navigation between neighboring clusters as well, by requesting that the system search for a cluster whose cluster profile resembles the cluster profile of the ~;ullellLly selected cluster. If this type of navigation is performed at the level of individual objects (leaf ends), then automatic hyperlinks may be then created as 25 navigation occurs. This is one way that nearest neighbor clustering navigation may be performed. For example, in a domain where target objects are home pages on the World Wide Web, a collection of such pages could be laterally linked to create a " virtual mall. "
The simplest way to use the automatic m~mling system described above is for the user to begin brow~ g at the top of the tree and moving to more specific subclusters.
30 However, in a variation, the user optionally provides a query consisting of textual and/or other attributes, from which query the system constructs a profile in the manner described herein, optionally altering textual attributes as described herein before decomposing them into numeric attributes. Query profiles are similar to the search profiles in a user's search CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/179~1 profile set, except that their attributes are ~.~li-itly spe~-ified by a user, most ofte:n for one-time usage, and unlike search profiles, they aLre not ~utom~tic~lly updated to reflect ~-.h~l~..g inLel~L~. A typical query in the domain of text articles might have "Tell me a~bout the relation between Galileo and the Medici family" as the value of its "text of ar~,icle"
S ~ttrib~lt~; and 8 as the value of its "reading difficlllty" attribute (that is, 8th-grade level). The system uses the method of section "Searching for Target Objects" above to autom~tic~lly locate a small set of one or more clusters with profiles similar to the query profile, for example, the articles they contain are written at roughly an 8th-grade level and tend to m~.ntion Galileo and the Medicis. The user may start blow~ g at any ofthese clusters, and 10 can move from it to subclusters, superclusters, and other nearby clusters. For a user who is looking for something in particular, it is generally less efficient to start at the largest cluster and repe~te~11y select smaller subclusters than it is to write a brief description of what one is looking for and then to move to nearby clusters if the objects initially leco...."en-led are not precisely those desired.
~ltho~lgh it is cll~tom~ry in il.rollllaLion retrieval systems to match a query to a document, an interesting variation is possible where a query is m~t-~.he(l to an already al~wt;lt d q~lesfi~n The relevant domalin is a c~lstom~r service center, eleckonic news~,roup , or Better B~l~iness Bureau where q~lestiQn~ are frequently answered. Each new question-answer pair is recorded for future reference as a target object, with a textual attribute that spe~.ifi~ the question together with the answer provided. As explained earlier with }eference to document titles, the question should be weighted more heavily than the answer when this textual attribute is decomposed into TF/IDF scores. A query specifying "Tell me about the relation between Galileo and the Medici family" as t he value of this attribute therefore locates a cluster of similar questions together with their answers. In a variation, each question-answer pair may be profiled with two separate textual attributes, one for the question and one for the answer. A query might then locate a cluster by specifying only the question attribute, or for completenes~, both the question attribute and the ~lower-weighted) answer attribute, to be the text "Tell me about the relation between Galileo and the Medici family."
The filtering teçhnolQgy described earlier can also aid the user in navigating among the target objects. When the system pl~s~llL~ the user with a menu of subclusters of a cluster C oftarget objects, it can ~imlllt~neously present an additional menu ofthe most interesting target objects in cluster C, so that the user has the choice of ~ccec~ing a subcluster or CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981directly ~cces~ing one of the target objects. If this additional menu lists n target objects, then for each I between 1 and n inclusive, in increasing ord er, the I'h most ploll~inent choice on this ~rlrlition~l menu, which choice is denoted Top(C,i), is found by considering all target objects in cluster C that are further than a threshold distance t from all of Top(C,l), S Top(C,2), ... Top(C, I-l), and selecting the one in which the user's interest is estim~ted to be highest If the threshold tli~t~nf~e t is 0, then the menu rçs llting from this procedure simply displays the n most i--lc;-~Li-lg objects in cluster C, but the threshold ~ t~nce may be increased to achieve more variety in the target objects displayed. Generally the threshold tli~t~n~e t is chosen to be an affine function or other function of the cluster variance or 10 cluster ~ meter ofthe cluster C.
As a novelty feature, the user U can "masquerade" as another user V, such as a pr~)min~nt inte1lectu~1 or a celebrity supermodel; as long as user U is m~quprading as user V, the filtering technology will reco.. ~ 1 articles not according to user U's ~rerele.lces, but rather according to user V's pr~rl;rellces. Provided that user U has access to the l S user-specific data of user V, for example because user V has leased these data to user U for a fin~nci~l consideration, then user U can m~querade as user V by instructing user U's proxy server S to temporarily substitute user V's user profile and target profile interest summary for user U's. In a variation, user U has access to an average user profile and an composite target profile interest summary for a group G of users; by instructing proxy server 20 ~ to substitute these for user ~s user-specific data, user U can m~qllçrade as a typical member of group G, as is useful in exploring group plerelellces for sociological, political, or market research. More generally, user U may "partially masquerade" as another user V
or group G, by instructing proxy server S to temporarily replace user ~s user-specific data with a weighted average of user U~s user-specific data and the user-specific data for user V
25 and group G.
Menu Or~ni7~t~cn Although the topology of a hierarchical cluster tree is fixed by the techniques that build the tree, the hierarchical menu presented to the user for the user's navigation need not be exactly isomorphic to the cluster tree. The menu is typically a somewhat modified 30 version of the cluster tree, reorganized m~ml~lly or ~ntom~tically so that the clusters most interesting to a user are easily accessible by the user. In order to autom~tic~lly reorganize the menu in a user-specific way, the system first attempts automatically to identify ~Yi.~ting clusters that are of interest to the user. The system may identify a cluster as interesting CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/179l~1 because the user often accecces target objects in that cluster -- or, in a more sophi~tic~ted variation, because the user is predicted to have high interest in the cluster's profile, using the methods disclosed herein for e,l;,.,nl;,-g interest from relevance feedb?~ck Several techniq~ es can then be used to make interesting clusters more easily S accessible. The system can at the user's request or at all times display a special list of the most intcl c~lhlg clusters, or the most hllelc~Lii~Lg subclusters of the current cluster, so that the user can select one ofthese clusters based on its label and jump directly to it. In general, when the system constructs a list of interesting clusters in this way, the I~ most plolll;l~
choice on the list, which choice is denoted Top(I), is found by considering all appl~pliate 10 clusters C that are further than a threshold rlict~nre t from all of Top(1), Top(2), ... Topl~I-l), and selecting the one in which the user's interest is ~stim~ted to be hi~hest Here the threshold tiict~nce t is optionally dependent on the colll~uLed cluster variance or cluster diameter of the profiles in the latter cluster. Several techniques that reol~ e the hierarchical menu tree are also useful. First, menus can be leol~r.~i,ed so that the most 15 interesting subcluster choices appear earliest on the menu, or are visually marked as interesting; for example, their labels are displayed in a special color or type face, or are displayed together with a number or graphical image in~lic~ting the likely level of int~erest.
Second, illLc-c~Lillg clusters can be moved to menus higher in the tree, i.e., closer to the root of the tree, so that they are easier to access if the user starts browsing at the root of the tree.
20 Third, uninteresting clusters can be moved to menus lower in the tree, to make roomL for interesting clusters that are being moved higher. Fourth, clusters with an especiall~y low interest score (lc~lese~ g active dislike) can simply be suppressed from the menus; thus, a user with children may assign an cAll elllely negative weight to the "vulgarity" attribute in the determination of q, so that vulgar clusters and doc -m~nte will not be available at a]l. As 25 the interesting clusters and the docum~ntc in them migrate toward the top of the tree, a c~lstomi7ed tree develops that can be more efflciently navigated by the particular user. If menus are chosen so that each menu item is chosen with approximately equal probability, then the expected number of choices the user has to make is .";,~i",i~ed If, for examLple, a user frequently ~cceseed target objects whose profiles resembled the cluster profile of cluster 30 (a, b, d) in Figure 8 then the menu in Figure 9 could be modified to show the structure illustrated in Figure 10.
In the variation where the general techniques disclosed herein for e~ 1; .-g a user's interest from relevance feedback are used to identify interesting clusters, it is possible for CA 0223601~ 1998-04-27 WO 97/16796 PCT~US96/17981 a user U to supply "temporary re1evance fçedb~clr" to in~lic~te a temporary interest that is added to his or her usual illlele~L~. This is done by entering a query as described above, i.e., a set of textual and other attributes that closely match the user's interests of the moment .
This query becomes "active," and affects the :,y~Lelll's detellllillaLion of interest in either of two ways. In one approach, an active query is treated as if it were any other target object, and by virtue of being a query, it is taken to have received relevance fee~lb~cl~ that in~lic~tçs especially high interest. In an alternative approach, target objects X whose target profiles are similar to an active query's profile are simply considered to have higher quality q(U, X), in that q(U, X) is incr~m~nted by a term that increases with target object X's similarity to the query profile. Either strategy affects the usual interest ~etim~tes: clusters that match user U's usual interests (and have high quality q(*)) are still considered to be of interest, and clusters w hose profiles are similar to an active query are adjudged to have especially high interest. Clusters that are similar to both the query and the user's usual interests are most interesting of all. The user may modify or deactivate an active query at any ti me while browsing. In ~ itinn, if the user discovers a target object or cluster X of particular interest while browsing, he or she may replace or ~ mf~nt the original (perhaps vague) query profile with the target profile of target object or cluster X t hereby amplifying or reffning the original query to in~1ic~te an particular interest in objects similar to X. For example, suppose the user is bl~w~h~g through doc~1nn~nte, and specifies an initial query co~ g the word "Lloyd's," so that the system predicts docllm~nte co~ g the word "Lloyd's" to be more interesting and makes them more easily ~ccloeeihle, even to the point of listing such doc:~1mf nte or clusters of such documents, as described above. In particular, certain articles about insurance colll~;ll;llg the phrase "Lloyd's of London" are made more easily accessible, as are certain pieces of Welsh fiction col.~ ;,,g phrases like "Lloyd's father." The user browses while this query is active, and hits upon a useful article des.ilibing the relation of Lloyd's of London to other British insurance houses; by replacing or ~11gm~nting the query with the full text of this article, the user can turn the attention of the system to other documents that resemble this article, such as doc11mPnte about British insurance houses, rather than Welsh folk tales.
In a system where queries are used, it is useful to include in the target profiles an associative attribute that records the associations between a target object and whatever terms are employed in queries used to find that target object. The association score of target object X with a particular query term T is defined to be the mean relevance feedback on CA 022360l~ l998-04-27 W O 97/16796 PCTAUS96/179~1 target object X, averaged over just those ~ccçcc~c of target object X that were made while a query co~ g term T was active, multiplied by the n~.g~tetl logarithrn of term T's global frequency in all queries. The effect of this associative attribute is to increase the measured similarity of two docllm~nts if they are good responses to queries that contain the ;same 5 terms. A further maneuver can be used to improve the accuracy of .es~ollses to a quel y: in the sllmm~tion used to det~rmine the qualit,v q(U, X) of a target object X, a term is incl~lded that is proportional to the sum of association scores between target object X and each term in the active query, if any, so that target objects that are closely associated with terms in an active query are d~lel--,il-ed to have higher quality and therefore higher interest for the user.
10 To complement the system's ~uLomalic reo~ n ofthe hierarchical cluster kee, the user can be given the ability to lt;ol~a~ e the tree m~ml~lly, as he or she sees fit. Any changes are optionally saved on the user's local storage device so that they will affec,t the presentation of the tree in future sessions. For example, the user can choose to mo ve or copy menu options to other menus, so that useful clusters can thereafter be chosen directly lS from the root menu ofthe tree or from other easily ~cceesed or topically applopliate m~enus.
In an other example, the user can select clusters Cl, c, ... Ck listed on a particular me]lu M
and choose to remove these clusters from the menu, replacing them on the menu with a single aggregate duster M' co"~ ;"g all the target objects from clusters Cl, c., ... C~ . In this case, the imme~ te subclusters of new cluster M' are either taken to be clusters ('1, c, 20 ... Ck themselves, or else, in a variation similar to the "scatter-gather" method, are ~lltom~tic~lly computed by clustering the set of all the subclusters of clusters Cl, C2, ... Ck accol dillg to the similarity of the cluster profiles of these subclusters.
Ti l~c onic Mall In one application, the blow~ g te~-.hni~les described above may be applied~ to a 25 domain where the target objects are purchasable goods. When shoppers look for goods to purchase over the TntP.rnet or other electronic media, it is typically n-oceCc~ry to display thousands or tens of thousands of products in a fashion that helps consumers _nd the items they are looking for. The current practice is to use hand-crafted menus and sub-merrus in which similar items are grouped together. It is possible to use the ~l-tom~ted clustering and 30 browsing methods described above to more effectively group and present the items.
Pu~cllasable items can be hierarchically clustered using a plurality of di~t;lc~lll criteria.
Useful attributes for a purchasable item include but are not limited to a textual descri,ption and pred~fined category labels (if available), the unit price of the item, and an associative CA 0223601~ 1998-04-27 attribute listing the users who have bought this item in the past. Also useful is an associative attribute indicating which other items are often bought on the same shopping "trip" as this item; items that are often bought on the same trip will be judged similar with respect to this attribute, so tend to be grouped together. Retailers may be interested in 5 ~ltili7ing a similar teehnique for purposes of predicting both the nature and relative quantity of items whieh are likely to be popular to their partieular elientele. This pre-liction may be made by using aggregate ~ul-;ha~ g records as the seareh profile set from whieh a eollection of target objeets is recommPncled Fstim~ted c~lstompr d~m~n(l which is indicative of (relative) inventory quantity for each target object item is ~letPrmined by I0 measuring the eluster varianee of that item eol,lpalt;d to another target objeet item (whieh is in stock).
As described above, hierarchically clustering the purehasable target objeets results in a hierarehical menu system, in which the target objects or clusters of target objects that appear on eaeh menu ean be labeled by names or ieons and displayed in a two tlim~n.cional 15 or three--limçncional menu in which similar items are displayed physically near each other or on the same graphieally represented "shelf." As described above, this grouping occurs both at the level of speeific items (such as standard size Ivory soap or large Breck sharnpoo) and at the level of elasses of items (sueh as soaps and shampoos). When the user selects a elass of items (for inct~nce, by elieking on it), then the more speeific level of detail is 20 displayed. It is neither necessary nor desirable to limit each item to appearing in one group;
customers are more likely to find an object if it is in multiple categories. Non-purchasable objects such as artwork, adverticçmentc~ and free samples may also be added to a display of purchasable objects, if they are associated with (liked by) subst~nti~lly the same users as are the purehasable objeets in the display.
25 Network Contest of the Bn~ System The files assoeiated with target objeets are typieally distributed aeross a large number of different servers Sl-So and clients Cl-Cn. Each file has been entered into the data storage me~ m at some server or client in any one of a number of ways, inclll~in~, but not l~mited to: sc.~nning, keyboard input, e-mail, FTP tr~ncmiccion automatic synthesis from 30 another file under the control of another computer program. While a system to enable users to efficiently locate target objects may store its l~iel~cl-ical cluster tree on a single centralized m~chin~, greater efficiency can be achieved if the storage of the hierarchical cluster tree is distributed across many m~ehinPs in the network. Eaeh eluster C, incl~lcling CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981 single-member clusters (target objects), is digitally ~ t;se lled by a file F, which is mllltic~ct to a topical mlllt1c~et tree MT(Cl); here cluster Cl is either cluster C itself or some supercluster of cluster C. In this way, file F is stored at multiple servers, for red--n-l~n~.y. The file F that lGp.esellls cluster C co,llaills at least the following data:
~ S 1. The cluster profile for cluster C, or data sufflcient to ~eco~ L-llct this cluster profile. 2.
The number oftarget objects contained in cluster C. 3. A human-readable label for c]iuster C, as described in section "Labeling Clusters" above. 4. If the cluster is dividedl into subclusters, a list of pointers to files reprPs~nting the subclusters. Each pointer is an ordered pair cot~ g n~minp first, a file, and second, a mllltic~t tree or a specific server where that file is stored. 5. If the cluster Col.si~ls of a single target object, a pointer to the file corresponding to that target object.
The process by which a client m~hine can retrieve the file F from the m-lltis~ct tree MT(Cl) is described above in section "Retrieving Fi]ies from a ~Illtic~ct Tree." Once it has retrieved file F, the client can perform further tasks pertaining to this cluster, such as displaying a labeled menu of subclusters, from which the user may select subclusters for the client to retrieve next.
The advantage of this distributed imple~ l;on is threefold. First, the system can be scaled to larger cluster sizes and numbers oftarget objects, since much more se~cl~ g and data retrieval can be carried out con ;u- - ~l-lly. Second, the system is fault-tolerant in that partial m~t-~.hing can be achieved even if portions of the system are temporarily unavailable.
It is important to note here the robustness due to recll-ntl~ncy inherent in our design - data is replic~t~ at tree sites so that even if a server is down, the data can be located else~;vhere.
The distributed hierarchical cluster tree can be created in a distributed fashion, that is, with the participation of many processors. Indeed, in most applications it should be recreated from time to time, because as users interact with target objects, the associative attributes in the target profiles of the target objects change to reflect these interactions; the system's similarity measurements can therefore take these interactions into account when judging similarity, which allows a more perspicuous cluster tree to be built The key technique is the following procedure for merging n disjoint cluster trees, represented 30 respectively by files FlFn in distributed fashion as described above, into a connbined cluster tree that collL~ s all the target objects from all these trees. The files Fl...Fn are described above, except that the cluster labels are not incl~ldecl in the repres~nt~ti( n. The following steps are ~xec~lted by a server Sl, in response to a request message from another ~97~

CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981server S0, which request mess~A~e incl~ldes pointers to the f~les F1...Fn. 1. Retrieve files F1. ..Fn. 2. Let L and M be empty lists. 3 . For each fiie Fi from among F1 ..;Fn: 4. If file Fi co~ s pointers to subcluster files, add these pointers to list L. 5. If file Fi ~ SelllS
a single target object, add a pointer to file Fi to list L. 6. For each pointer X on list L, 5 retrieve the file that pointer P points to and extract the cluster profile P(X) that this file stores. 7. Apply a t~ stering algo.iLI.I.l to group the pointers X on list L accoldillg to the (1ietAn~ç~e between their les~,e~ e cluster profiles P(X). 8. For each (nonempty) rçs~llting group C of pointers: 9. If C contains only one pointer, add this pointer to list M; 10.
otherwise, if C co~ ;"e exactly the same subcluster pointers as does one ofthe files Fi from 10 among F1...Fn, then add a pointer to file Fi to list M; 11. otherwise: 12. Select an a l,iLlaly server S2 on the network, for example by randomly selecting one ofthe pointers in group C and choosing the server it points to. 13. Send a request message to server S2 that includes the subcluster pointers in group C and requests server S2 to merge the corresponding ~lhr~ ter trees. 14. Receive a response from server S2, col~ l n;l l;l~g a pointer 15 to a file G that represents the merged tree. Add this pointer to list M. 15. For each file Fi from among F1...Fn: 16. If list M does not include a pointer to file Fi, send a message to the server or servers storing Fi instructing them to delete file Fi. 17. Create and store a file F
that represents a new cluster, whose subcluster pointers are exactly the subcluster pointers on list M. 18. Send a reply message to server S0, which reply message contains a pointer 20 to file F and indicates that file F I ep~ esenla the merged cluster tree.
With the help of the above procedure, and the mllltic.Aet tree MT full that incl~ldçs all proxy servers in the network, the distributed hierarchical cluster tree for a particular ,domain of target objects is constructed by merging many local hierarchical cluster trees, as follows. 1. One server S (plcrcl~bly one with good connectivity) is elected from the tree.
25 2. Server S sends itself a global request m~eeAge that causes each proxy server in MTfi~"
(that is., each proxy server in the n~lwull~) to ask its clients for files for the cluster tree. 3.
The clients of each proxy server transmit to the proxy server any files that they ~I~A;l~ >
which files lc~lcacllL target objects from the applo~liate domain that should be added to the cluster tree. 4. Server S forms a request R1 that, upon receipt, will cause the recipient 30 server S 1 to take the following actions: (a) Build a hierarchical cluster tree of all the files stored on server S1 that are ~A;Il~ ed by users in the user base of S1. These files correspond to target objects from the app~ liate domain. This cluster tree is typically stored entirely on S 1, but may in principle be stored in a distributed fashion.

CA 0223601~ 1998-04-27 W O 97/16796 PCTnJS96/179~11 (b) Wait until all servers to which the server Sl has propagated request R have sent the .. , ~ reply . . .~ges cc,. .l i~;~ .;ng pointers to cluster trees. (c) Merge together the cluster tree created in step 5(a) and the cluster trees supplied in step 5(b), by s~.ntling any server (such as S1 itself) a message req~lesting such a merge, as described above. (d) IJpon 5 receiving a reply to the message sent in (c), which reply in-~.ludes a pointer to a file 1 eplese~ p the merged cluster tree, rc,l w~d this reply to the sender of request R1, unless this is S 1 itself. 5 . Server S sends itself a global request message that causes all servers in MTfi" to act on ~mhed(led request Rl . 6. Server S receives a reply to the mess~pe it sent in 5(c). This reply inc~ es a pointer to a file F that lep.t;senls the completed hierarchical 10 cluster tree. Server S mllltic~etc file F to all proxy servers in MTfi~. Once the hierarchical cluster tree has been created as above, server S can send ~ lition~l messages through the cluster tree, to arrange that mllltic~ct trees MT(C) are created for sufficiently large clu.sters C, and that each file F is multicast to the tree MT(C), where C is the cm~llçst cluster CU~ ;II;IIg file F.
l~ TCHING USERS FOR VIRTUAL COMMUNITIES
Virtual Communities Computer users frequently join other users for ~1iccllCcions on cQmr~lt~or bulletin boards, n~w~g.oups, mailing lists, and real-time chat sessions over the computer net~;vork which may be typed (as with Internet Relay Chat (IRC)), spoken (as with Internet phone), 20 or videocol-re-enced. These forums are herein termed "virtual commllniti~s " In current practice, each virtual co.. ~ y has a specified topic, and users discover c~.. -~.;l ies of interest by word of mouth or by c~ a long list of co.. ~ es (typically hundreds or thollc~n~ls). The users then must decide for Illt;lllselves which of thousands of messages they find interesting from among those posted to the s~lecte~l virtual co.. ~.. ,.;lies, that is, made publicly available to members of those co.. ;l;es If they desire, they mav also write ad~iition~l messages and post them to the virtual co~ .;lies of their choice. The existence of thousands of Internet bulletin boards (also termed newsgroups) and countless more Internet mailing lists and private bulletin board services (BBS's) demonstrates the very strong interest among members of the electronic community in forums for the rliccllccil)n of 30 ideas about almost any subject im~gin~hle. Presently, virtual comm~nity creation proceeds in a haphazard form, usually inetig~ted by a single individual who decides that a topic is worthy of ~liccuccion There are protocols on the Internet for voting to determine whether CA 0223601~ 1998-04-27 W O 97/16796 PCTrUS96/17981 a newsgroup should be created, but there is a large hierarchy of ncw.,~uu~s ~which begin with the prefix "alt.") that do not follow this protocol.
The system for customized electronic itlentific~tiQn of desirable objects described herein can of course fimctiQn as a browser for bulletin boards, where target objects are taken 5 to be bulletin boards, or subtopics of bulletin boards, and each target profile is the cluster profile for a cluster of docl-m~nts posted on some bulletin board. Thus, a user can locate bulletin boards of interest by all the navigational teçhniq~les described above, inçlllrling bluw~ g and querying. However, this method only serves to locate ~xicting virtual comm--nitieS Because people have varied and varying complex il.Lere,Ls, it is desirable to 10 automatically locate groups of people with common interests in order to form virtual comm~mitieS~ The Virtual Community Service (VCS) described below is a network-based agent that seeks out users of a network with common interests, dynamically creates bulletin boards or electronic mailing lists for those users, and introduces them to each other electronically via e-mail. It is useful to note that once virtual comm~mities have been 15 created by VCS, the other bluw.,illg and filtering technologies described above can subsequently be used to help a user locate particular virtual comm--nities (whether pre-existing or automatically generated by VCS); similarly, since the m.q~g~s sent to a given virtual comrnunity may vary in interest and urgency for a user who has joined that u-mmllnity, these browsing and filtering technologies (such as the e-mail filter) can also be 20 used to alert the user to urgent messages and to screen out uninteresting ones.
The fim~tiQn~ of the Virtual Community Service are general functions that could be implemented on any network ranging from an office network in a small co".p~"y to the World Wide Web or the Internet. The four main steps in the procedure are: 1. Scan postings to existing virtual ct~mm--nities 2. Identify groups of users with common ~llLelt;~Ls~ 3 . Match users with virtual comm~mities~ creating new virtual comm--nities when necessary. 4. Continue to enroll additional users in the ~icting virtual co,.""..~ es More generally, users may post messages to virtual comm~mities pseudonymously, even employing di~elellL pseudonyms for di~el~ virtual commllnities. (Posts not employing a pseudonymous mix path may, as usual, be considered to be posts employing 30 a non-secure pseudonym, namely the user's true network address.) Therefore, the above steps may be expressed more generally as follows 1. Scan pseudonymous postings to existing virtual commlmities. 2. Identify groups of pseudonyms whose associated users have common interests. 3. Match pseudonymous users with virtual comm--nitie CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981 creating new virtual co.. ~ s when necec~.y 4. Continue to enroll ad~iition~
pseu~lonymous users in the eYi~ting virtual comm-lniti~
Each of these steps can be carried out as described below.
Scanning Using the terhnolo~y described above, Virtual Col.. I-;ly Service constantly scans all the messages posted to all the nt;w~gloups and electronic mailing lists on a given network, and constructs a target profile for each message found. The network can be the Tnt.ornet or a set of bulletiin boards l";.;~ ;"r(l by America Online, Prodigy, or CompuServe, or a smaller set of bulletin boards that might be local to a single o~ ;on, for exarnple 10 a large coll-pally, a law firm, or a university. The sç~nning activity need not be confined to bulletin boards and mailing lists that were created by Virtual Co.~.. .;ty Service, but may also be used to scan the activity of commlmities that predate Virtual Commllnity Serv~ice or are otherwise created by means outside the Virtual Comml~nity Service system, provided that these commlmities are public or otherwise grant their permi~sion The target profile of each message includes textual attributes specifying the title and body text of the message. In the case of a spoken rather than written message, the Lltter attribute may b e computed from the acoustic speech data by using a speech recognition system. The target profile also incl~ldes an associative attribute listing the author(s) and ign~ted reririent(s) of the message, where the recipients may be individuals and/or e:ntire 20 virtual commllnities; if this attribute is highly wçighte-l, then the system tends to regard messages among the same set of people as being similar or related, even if the topical silllila i~y of the messages is not clear from their content, as may happen when some of the m~cS~s are very short. Other important attributes include the fraction of the message that co~ s of quoted material from previous messages, as well as attributes that are generally 25 useful in characterizing docllment~7 such as the mes~ge's date, length, and reading l~vel.
Virtual Community ~(ientification Next, Virtual Commllnity Service ~Ut~lllp~S to identify groups of pseudonymous users with co.l.,no.l interests. These groups, herein termed "pre-commlmiti~," are ~ senLed as sets of pseudonyms. Whenever Virtual Commllnity Service identifies a pre-co,l....ulli~y~
30 it will subsequently attempt to put the users in said pre-community in contact with each other, as described below. Each pre-community is said to be "determined" by a cluster of messages, pseudonymous users, search profiles, or target objects.

CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981In the usual method for determir~ing pre-comml-nities, Virtual Co.l...~ y Service clusters the messages that were sç~nned and profiled in the above step, based on the similarity of those messages ' computed target profiles, thus automatically finding threads of rli~c~ls~iQn that show common interests among the users. Naturally, riiecuscions in a 5 single virtual comml-nity tend to show coll..nol1 interests; however, this method uses all the texts from every available virtual community, inclu~ling bulletin boards and electronic mailing lists. Indeed, a user who wishes to initiate or join a ~ c~ls~iQn on some topic may send a "feeler message" on that topic to a special mailing list clesi n~tecl for feeler mess ages; as a con~equrnce of the sç~nning procedure described above, the feeler message is 10 automatically grouped with any similarly profiled messages that have been sent to this special mailing list, to topical mailing lists, or to topical bulletin boards. The clustering step employs "soft clustering," in which a message may belong to multiple clusters and hence to multiple virtual co.. ".. ;l;es. Each cluster of messages that is found by Virtual Commllnity Service and that is of sllffirient size (for ~ mple7 10 -20 di~elelll messages) 15 detrrmin~c a pre-community whose members are the pseudonymous authors and recipients of the me~g~c in the cluster. More precisely, the pre-community consists of the various pseudonyms under which the messages in the cluster were sent and received.
Alternative methods for determining a pre-community, which do not require the sn~nning step above, include the following: 1. Pre-comml-nities can be generated by 20 grouping together users who have similar interests of any sort, not merely Individuals who have already written or received m.oc~grs about similar topics. If the user profile associated with each pseudonym indicates the user's lll~GlG~S, for example through an associative attribute that in~1ir.~tes the doc ~mrntc or Web sites a user likes, then pseudonyms can be clustered based on the similarity of their associated user profiles, and each of the resulting 25 clusters of pseudonyms ~letrrmines a pre-community comprising the pseudonyms in the cluster. 2. If each pseudonym has an associated search profile set formed through participation in the news clipping service described above, then all search profiles of all pseudonymous users can be clustered based on their similarity, and each cluster of search profiles determines a pre-community whose members are the pseudonyms from whose 30 search profile sets the search profiles in the cluster are drawn. Such groups of people have been reading about the same topic (or, more generally, accessing similar target objects~ and so presumably share an interest. 3. If users participate in a news clipping service or any other filtering or l)row~ing system for target objects, then an individual user can CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/179$1 pse~lclonymously request the fr rm~tion of a virtual co~ ulllly to discuss a particular cluster of one or more target objects known to that system. This cluster of target objects determines a pre-comm-mity consisting ofthe pseudonyms of users determined to be most interested in that cluster (for example, users who have search profiles similar to the cluster 5 pro file), together with the pseudonym of the user who requested formation of the ~,irtual community.
t~hin~ Users with Communities Once Virtual Comm--nity Service identifies a cluster C of messages, users, search profiles, or target objects that determines a pre-co~ "" " ~l-; Ly M, it attempts to arrange for the 10 members of this pre-co,.~""""ly to have the chance to participate in a common virtual comm~1nity V. In many cases, an existing virtual community V may suit the needs of the pre-comm--nity M. Virtual Community Service first attempts to find such an ~ ting c~-mm--nity V. In the case where cluster C is a cluster of mçss~ge~, V may be chosen to be any existing virtual c~-mmllnity such that the cluster profile of cluster C is within a threshold 15 distance ofthe mean profile of the set of messages recently posted to virtual community V;
in the case where cluster C is a cluster of users, V may be chosen to be any ~cieting virtual comm--nity such that the cluster profile of cluster C is within a threshold distance of the mean user profile of the active members of virtual comm--nity V; in the case where the cluster C is a cluster of search profiles, V may be chosen to be any ~xi~tin~ vir~ual 20 community such that the cluster profile of cluster C is within a threshold ~ t~n~e of the cluster profle of the largest cluster r~c ~lting from clustering all the search profiles of active members of virtual community V; and in the case where the cluster C is a cluster of one or more target objects chosen from a separate browsing or filtering system, V may be chosen to be any t~xi~ting virtual co~ llulfily initi~ted in the same way from a cluster whose cluster 25 profile in that other system is within a threshold distance of the cluster profile of cluster C.
The threshold distance used in each case is optionally dependent on the cluster variance or cluster diameter of the profile sets whose means are being compared.
If no existing virtual community V meets these conditions and is also willing toaccept all the users in pre-community M as new members, then Virtual Collllllunily Service 30 attempts to create a new virtual community V. Regardless of whether virtual community V is an e~Cieting cOIlllllul~iLy or a newly created community, Virtual Community ~ervice sends an e-mail message to each pseudonym P in pre-community M whose associated user U does not already belong to virtual community V (under pseudonym P) and has not CA 0223601~ 1998-04-27 previously turned down a request to join virtual community V. The e-mail message informs user U of the existence of virtual co.. ..~ y V, and provides instructions which user U
may follow in order to join virtual commlmity V if desired, these instructions vary depe~ g on whether virtual co.. u....... -ly V is an existing commlmity or a new community.
5 The message inr.llldes a credrnt~ granted to pseudonym P, which credf~.nti~l must be pl~se -l~d by user U upon joining the virtual community V, as proof that user U was actually invited to join. If user U wishes to join virtual cc,....--un.ly V under a di~e.~n~ pseudonym Q, user U may first transfer the crec~rnti~l from pseudonym P to pseudonym Q, as described above. The e-mail message further provides an indication of the common interests of the 10 comml-nity, for example by inr.lur~ing a list of titles of messages recently sent to the community, or a charter or introductory message provided by the comm~mity (if available), or a label gen~led by the methods described above that id~ntifies the content of the cluster of messages, user profiles, search profiles, or target objects that was used to identif~ the pre-community M.
If V~tual C~ mm--nity Service must create a new co.. ,.. l,-;ly V, several methods are available for enabling the members of the new community to communicate with each other.
If the pre-commllnity M is large, for example co.~ ;""-g more than S0 users, then Virtual Community Service typically establishes either a multicast tree, as described below, or a widely-distributed bulletin board, ~c~ignin~ a name to the new bulletin board. If the 20 pre-con~l.u~ ily M has fewer members, for example 2-~0, Virtual Coll..llLllfil~,~ Service typically establishes either a multicast tree, as described below, or an e-mail mailing list.
If the new virtual community V was determined by a cluster of mesc~ge~, then Virtual Commllnity Service kicks off the di~clls~ion by distributing these messages to all members of virtual commllnity V. In addition to bulletin boards and mailing lists, alternative fora that 25 can be created and in which virtual commllnities can gather include real-time typed or spoken conversations (or engagement or distributed multi-user applications inrln-ling video games) over the computer network and physical meeting~, any of which can be scheduled by a partly automated process wherein Virtual Community Service requests mçeting time preferences from all members of the pre-co--l~--uniLy M and then notifies these individuals 30 of an app.~p. iate mreting time.
Continued Enrollment Even after creation of a new virtual community, Virtual Community Service continues to scan other virtual comml-ni~ies for new messages whose target profiles are CA 0223601~ 1998-04-27 W O 97/16796 PCT~US96/17981 similar to the cO~ llul~ily's cluster profile (average mess~ge profile). Copies of any such messages are sent to the new virtual COIILIIIU11ilY~ and the pseudonymous authors of these messages, as well as users who show high interest in reading such messages, are infa,rmed by Virtual Community Service (as for pre-comlllul~Ly members, above) that they may want 5 to join the community. Each such user can then decide whether or not to join the comm--nity. In the case of Internet Relay Chat (IRC), if the target profile of mess~gf!s in a real time dialog are (or become) similar to that of a user, VCS may also send an urgent e-mail message to such user whereby the user may be automatically notified as soon as the dialog appe~ ~, if desired.
Wlth these f~riliti~.c Virtual COllllllull~Ly Service provides ~lltom~tic creation of new virtual co. ~ ." " ~ c in any local or wide- area network, as well as m~intf n~nce of all virtual comm~nitieS on the network, inrl.lrling those not created by Virtual Community Service.
The core technology underlying Virtual Community Service is creating a search and clustering mech~nicm that can find articles that are "similar" in that the users share 15 interests. This is precisely what was described above. One must be sure that ~,'irtual Ccmm-mity Service does not bombard users with notices about comm--nities in which they have no real interest. On a very small network a human could be "in the loop", sc~nning proposed virtual commllniti(~s and perhaps even giving them names. But on larger networks Virtual ColllllluniLy Service has to run in fully automatic mode, since it is likely to find a 20 large number of virtual comm~-nitieS.
Delivering Messages to a Virtual Community Once a virtual cnmm--nity has been i~l~ntifie-l, it is straigllLrolw~d for Virtual Comm--nity Service to establish a mailing list so that any member of the virtual comrnunity may distribute e-mail to all other members. Another method of distribution is to use a 25 conventional network bulletin board or newsgroup to distribute the messages to all servers in the network, where they can be ~cc~c~ed by any member of the virtual comrnunity.
However, these simple methods do not take into account cost and pelrulmallce advantages which accrue from op~ g the construction of a mnltic~et tree to carry messages to the virtual comm--nity. Unlike a newsgroup, a multicast tree distributes messages to only a 30 selected set of servers, and unlike an e-mail mailing list, it does so efficiently.
A separate mlllti~ct kee MT(V) is ..~ ;"ec~ for each virtual community V, by useof the following four procedures. 1. To construct or reconstruct this multicast tree, the core servers for virtual col.. ~ ;ly V are taken to be those proxy servers that serve at least ûne CA 0223601~ 1998-04-27 pseudonymous member of virtual coll",.ul--Ly V. Then the mlllti~ ct tree MT(V) is established via steps 4 -6 in the section ~ rnlti5s~t Tree Construction Procedure" above. 2.
When a new user joins virtual cnmmlmity V, which is an ~xi~tinE virtual co~ "Ly, the user sends a message to the user's proxy server S. If user's proxy server S is not already a 5 core server for V, then it is d~i n~ted as a core server and is added to the mllltic~ct tree MT(V), as follows. If more than k servers have been added since the last time the multicast tree MT(V) was rebuilt, where k is a function of the nu~ber of core servers already in the tree, then the entire tree is simply rebuilt via steps 4-6 in the section "Multicast Tree Construction Procedure" above. Otherwise, server S retrieves its locally stored list of 10 nearby core servers for V, and chooses a server S 1. Server S sends a control message to S 1, in~ ting that it would like to be added to the mllltic~et tree Ml(V). Upon receipt ofthis m~Qc~g~ server S 1 retrieves its locally stored subtree Gl of MT(V), and forms a new graph G from Gl by removing all degree- I vertices other than S 1 itself. Server S 1 transmits graph G t o server S, which stores it as its locally stored subtree of MT(V). Finally, server S sends 15 a message to itself and to all servers that are vertices of graph G, instructing these servers to modify their locally stored subtrees of MT(V) by adding S as a vertex and adding an edge between Sl and S. 3. When a user at a client q wishes to send a message F to virtual comm~lnity V, client q embeds message F in a request R instructing the recipient to store message F locally, for a limited time, for access by member s of virtual collllllul~y V.
20 Request R incl~-des a credential proving that the user is a member of virtual co~ ul"~y V
or is otherwise entitled to post messages to virtual c~-mmllnity V (for example is not "black marked" by that or other virtual community members). Client q then bro~dc~t~ request R
to all core servers in the multicast tree Ml(V), by means of a global request message Ll~ ed to the user's proxy server as described above. The core servers satisfy request 25 R, provided that they can verify the in~lnded cred.o.nti~l 4. In order to retrieve a particular message sent to virtual community V, a user U at client q initi~tes the steps described in section "Retrieving Files from a Multicast Tree," above. If user U does not want to retrieve a particular message, but rather wants to retrieve all new messages sent to virtual community V, then user U pse~ldonymously instructs its proxy server (which is a core server 30 for V) to send it all messages that were mllltic~t to MT(V) after a certain date. In either case, user U must provide a credenti~l proving user U to be a member of virtual community V, or otherwise entitled to access messages on virtual comml~nity V.

CA 0223601~ 1998-04-27 WO 97/16796 PCT/US96/179~31 SUMMARY
A method has been presented for ~lltom~ti~lly selecting articles of interest to a user.
The method generates sets of search profiles for the users based on such attributes as the relative frequency of occurrence of words in the articles read by the users, and uses 1:hese S search profiles to ~ffici~.ntly identify future articles of interest. The methods is characterized by passive monitoring (users do not need to PYplic.itly rate the anticles), multiple search profiles per user (reflecting interest in ml~ltiple topics) and use of elements of the search profiles which are a~ltom~tic~lly determined from the data (notably, the TF/IDF measure based on word frequencies and descriptions of purchasable items). A method has also been 10 presented for automatically generating menus to allow users to locate and retrieve articles on topics of interest. This method clusters articles based on their similarity, as measured by the relative frequency of word occurrences. Clusters are labeled either with article titles or with key words extracted from the article. The method can be applied to large sets of articles distributed over many m~chines It has been further shown how to extend the above methods from articles to any class of target objects for which profiles can be generated, inclufling news articles, reference or work articles, electronic mail, product or service descriptions, people (based on the articles they read, demographic data, or the products they buy), and electronic bulletin boards (based on the articles posted to them). A particular con~equence of being able to group people by 20 their interests is that one can form virtual co.. ~ ";~ ies of people of common interest, who can then correspond with one another via electronic mail.

Claims (34)

WE CLAIM:
1. A method for providing a user with access to selected ones of a plurality of target objects and sets of target object characteristics that are accessible via an electronic storage media, where said users are connected via user terminals and data communication connections to a target server system which includes saidelectronic storage media, said method comprising the steps of:
automatically generating target profiles for target objects and sets of target object characteristics stored in said electronic storage media, each of said target profiles being generated from the contents of an associated one of said target objects and sets of target object characteristics;
automatically generating at least one user target profile interest summary for a user at a user terminal, each said user target profile interest summary being generated from ones of said target objects and sets of target object characteristics accessed by said user; and enabling access to said plurality of target objects and sets of target object characteristics stored on said electronic storage media by users via said targetprofiles.
2. The method of claim 1 wherein said step of enabling access comprises:
correlating said user target profile interest summaries, generated for an identified user, with said generated target profiles to identify ones of said plurality of target objects and sets of target object characteristics stored on said electronic storage media that are likely to be of interest to said identified user.
3. The method of claim 2 wherein said step of enabling access further comprises: transmitting a list, that identifies at least one of said identified ones of said plurality of target objects and sets of target object characteristics, to said identified user; and providing access to a selected one of said plurality of target objects and sets of target object characteristics stored on said electronic storage media in response to said identified user selecting an item from said list.
4. The method of claim 3 wherein said step of providing access comprises:
transmitting data, in response to said identified user activating a one of said user terminals to identify said selected item on said list, indicative of said identified user's selection of said selected item from said one user terminal to said target server via a one of said data communication connections.
5. The method of claim 4 wherein said step of providing access further comprises:
retrieving, in response to receipt of said data from said one user terminal, a target object identified by said selected item from said electronic storage media; and transmitting said retrieved target object to said one user terminal for display thereon to said identified user.
6. The method of claim 2 wherein said step of enabling access further comprises: transmitting at least one of said identified ones of said plurality of target objects and sets of target object characteristics, to said identified user in advance of said user requesting said at least one of said identified ones of said plurality of target objects and sets of target object characteristics.
7. The method of claim 2 wherein said step of enabling access further comprises: transmitting a list, that identifies at least one of said identified ones of said plurality of target objects and sets of target object characteristics, to said identified user; and transmitting said identified ones of said plurality of target objects and sets of target object characteristics stored on said electronic storage media from said target server system to a designated server located closer via said electronic communications connections to said user terminal than said target server system.
8. The method of claim 7 wherein said step of providing access comprises:
transmitting data, in response to said identified user activating a one of said user terminals to identify said selected item on said list, indicative of said identified user's selection of said selected item from said one user terminal to said designated server via a one of said data communication connections.
9. The method of claim 8 wherein said step of providing access further comprises:
retrieving, in response to receipt of said data from said one user terminal, a target object identified by said selected item from said designated server; and transmitting said retrieved target object to said one user terminal for display thereon to said identified user.
10. The method of claim 1 wherein said target object is a document having at least one page, said step of enabling access comprises:
automatically generating a user target profile interest summary for an identified user that is indicative of target objects and sets of target object characteristics retrieved by said identified user as well as the number of pages of said retrieved documents accessed by said identified user.
11. The method of claim 10 wherein said automatically generated user target profile interest summaries are also indicative of a length of time said identified user accessed said retrieved target objects and sets of target objectcharacteristics.
12. The method of claim 1 wherein said step of automatically generating target profiles comprises:
automatically generating a hierarchical menu that directs said users to at least a subset of said plurality of target objects and sets of target object characteristics stored on said electronic media, comprising:
sorting all target objects and sets of target object characteristics in said subset into a plurality of clusters of target objects and sets of target object characteristics based on an empirical measure of similarity of content of said target objects and sets of target object characteristics, and generating a hierarchical menu that identifies a content in common of target objects and sets of target object characteristics sorted into each of said plurality of clusters, to enable said identified user to identify ones of said plurality of target objects and sets of target object characteristics stored on said electronic storage media that are likely to be of interest to said identified user.
13. The method of claim 12 wherein said step of automatically generating a hierarchical menu further comprises:
ascribing a cluster profile to each of said plurality of clusters.
14. The method of claim 12 wherein said step of sorting comprises:
dividing said plurality of target objects and sets of target object characteristics into at least two clusters based upon said empirical measure of similarity of content of said target objects and sets of target object characteristics, subdividing each of said at least two clusters into at least two subclusters based upon said empirical measure of similarity of content of said target objects and sets of target object characteristics, and repeating said step of subdividing to produce a multi-level hierarchy of identified clusters.
15. The method of claim 14 wherein said step of generating a hierarchical menu comprises:
ascribing a cluster profile to each cluster produced by all steps of dividing and subdividing in said step of sorting.
16. The method of claim 15 wherein said step of ascribing comprises:
identifying at least one term in said generated target profiles produced for ones of said plurality of target objects and sets of target object characteristics sorted into a cluster that is indicates of the target content of said ones of said plurality of target objects and sets of target object characteristics sorted into said cluster.
17. The method of claim 15 wherein said step of ascribing comprises:

selecting at least one target object of said ones of said plurality of target objects and sets of target object characteristics sorted into said cluster that are closest to the center of the cluster, and ascribing a cluster profile that is indicative of the target content of said ones of said plurality of target objects and sets of target object characteristics sorted into said cluster, said cluster profile comprising elements of at least one of: a title of said selected at least one target object and sets of target object characteristics, and a set of words contained in the target profile of said selected at least one target object cluster which have the highest relative frequency.
18. Apparatus for providing a user with access to selected ones of a plurality of target objects and sets of target object characteristics that are accessible via an electronic storage media, where said users are connected via user terminals and data communication connections to a target server system which includes saidelectronic storage media, comprising:
means for automatically generating target profiles for target objects and sets of target object characteristics stored in said electronic storage media, each of said target profiles being generated from the contents of an associated one of said target objects and sets of target object characteristics;
means for automatically generating at least one user target profile interest summary for a user at a user terminal, each said user target profile interest summary being generated from ones of said target objects and sets of target object characteristics accessed by said user; and means for enabling access to said plurality of target objects and sets of targetobject characteristics stored on said electronic storage media by users via said target profiles.
19. The apparatus of claim 18 wherein said means for enabling access comprises:
means for correlating a user target profile interest summary, generated for an identified user, with said generated target profiles to identify ones of said plurality of target objects and sets of target object characteristics stored on said electronic storage media that are likely to be of interest to said identified user.
20. The apparatus of claim 19 wherein said means for enabling access further comprises:
means for transmitting a list, that identifies at least one of said identified ones of said plurality of target objects and sets of target object characteristics, to said identified user; and means for providing access to a selected one of said plurality of target objectsand sets of target object characteristics stored on said electronic storage media in response to said identified user selecting an item from said list.
21. The apparatus of claim 20 wherein said means for providing access comprises:
means for transmitting data, in response to said identified user activating a one of said user terminals to identify said selected item on said list, indicative of said identified user's selection of said selected item from said one user terminal to said target server via a one of said data communication connections.
22. The apparatus of claim 21 wherein said means for providing access further comprises:
means for retrieving, in response to receipt of said data from said one user terminal, a target object identified by said selected item from said electronic storage media; and means for transmitting said retrieved target object to said one user terminal for display thereon to said identified user.
23. The apparatus of claim 19 wherein said means for enabling access further comprises:
means for transmitting at least one of said identified ones of said plurality oftarget objects and sets of target object characteristics, to said identified user in advance of said user requesting said at least one of said identified ones of said plurality of target objects and sets of target object characteristics.
24. The apparatus of claim 19 wherein said means for enabling access further comprises:
means for transmitting a list, that identifies at least one of said identified ones of said plurality of target objects and sets of target object characteristics, to said identified user; and means for transmitting said identified ones of said plurality of target objects and sets of target object characteristics stored on said electronic storage media from said target server system to a designated server located closer via said electronic communications connections to said user terminal than said target server system.
25. The apparatus of claim 24 wherein said means for providing access comprises:
means for transmitting data, in response to said identified user activating a one of said user terminals to identify said selected item on said list, indicative of said identified user's selection of said selected item from said one user terminal to said designated server via a one of said data communication connections.
26. The apparatus of claim 25 wherein said means for providing access further comprises:
means for retrieving, in response to receipt of said data from said one user terminal, a target object identified by said selected item from said designated server;
and means for transmitting said retrieved target object to said one user terminal for display thereon to said identified user.
27. The apparatus of claim 20 wherein said target object is a document having at least one page, said means for enabling access comprises:
means for automatically generating a user target profile interest summary for an identified user that is indicative of target objects and sets of target object characteristics retrieved by said identified user as well as the number of pages of said retrieved documents accessed by said identified user.
28. The apparatus of claim 27 wherein said automatically generated user target profile interest summary is also indicative of a length of time said identified user accessed said retrieved target objects and sets of target object characteristics.
29. The apparatus of claim 22 wherein said means for automatically generating target profiles comprises:
means for automatically generating a hierarchical menu that directs said users to at least a subset of said plurality of target objects and sets of target object characteristics stored on said electronic media, comprising:
means for sorting all target objects and sets of target object characteristics in said subset into a plurality of clusters of target objects and sets of target object characteristics based on an empirical measure of similarity of content of said target objects and sets of target object characteristics, and means for generating a hierarchical menu that identifies the content in common of target objects and sets of target object characteristics sorted into each of said plurality of clusters, to enable said identified user to identify ones of said plurality of target objects and sets of target object characteristics stored on said electronic storage media that are likely to be of interest to said identified user.
30. The apparatus of claim 29 wherein said means for automatically generating a hierarchical menu further comprises:
means for ascribing a cluster profile to each of said plurality of clusters.
31. The apparatus of claim 29 wherein said means for sorting comprises:
means for dividing said plurality of target objects and sets of target object characteristics into at least two clusters based upon said empirical measure of similarity of content of said target objects arid sets of target object characteristics, means for subdividing each of said at least two clusters into at least two subclusters based upon said empirical measure of similarity of content of said target objects and sets of target object characteristics, and means for repeating said step of subdividing to produce a multi-level hierarchy of identified clusters.
32. The apparatus of claim 31 wherein said means for generating a hierarchical menu comprises:
means for ascribing a cluster profile to each cluster produced by all said means for dividing and subdividing.
33. The apparatus of claim 32 wherein said means for ascribing comprises:
means for identifying at least one term in said generated target profiles produced for ones of said plurality of target objects and sets of target object characteristics sorted into a cluster that is indicative of the target content of said ones of said plurality of target objects and sets of target object characteristics sorted into said cluster.
34. The apparatus of claim 32 wherein said means for ascribing comprises:
means for selecting at least one target object of said ones of said plurality oftarget objects and sets of target object characteristics sorted into said cluster that are closest to the center of the cluster; and means for ascribing a cluster profile that is indicative of the target content of said ones of said plurality of target objects and sets of target object characteristics sorted into said cluster, said cluster profile comprising elements of at least one of:
a title of said selected at least one target object, and a set of words contained in the target profile of said selected at least one target object cluster which have the highest relative frequency.
CA002236015A1995-10-311996-10-29System for customized electronic identification of desirable objectsAbandonedCA2236015A1 (en)

Applications Claiming Priority (3)

Application NumberPriority DateFiling DateTitle
US55119895A1995-10-311995-10-31
US08/551,1981995-10-31
PCT/US1996/017981WO1997016796A1 (en)1995-10-311996-10-29System for customized electronic identification of desirable objects

Publications (1)

Publication NumberPublication Date
CA2236015A1true CA2236015A1 (en)1997-05-09

Family

ID=29406067

Family Applications (1)

Application NumberTitlePriority DateFiling Date
CA002236015AAbandonedCA2236015A1 (en)1995-10-311996-10-29System for customized electronic identification of desirable objects

Country Status (1)

CountryLink
CA (1)CA2236015A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US7716088B2 (en)1998-06-252010-05-11Amazon.Com, Inc.Method and system for electronic commerce using multiple roles
CN111078858A (en)*2018-10-192020-04-28阿里巴巴集团控股有限公司Article searching method and device and electronic equipment
CN113452580A (en)*2021-07-212021-09-28福州富昌维控电子科技有限公司Method for automatically checking network equipment faults in batches and test host end
CN114298308A (en)*2021-12-082022-04-08南京林业大学Furniture personalized customization system and method based on interactive genetic algorithm
CN114330963A (en)*2021-10-142022-04-12北京爱奇艺科技有限公司Video-based object evaluation method, system, device and storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US7716088B2 (en)1998-06-252010-05-11Amazon.Com, Inc.Method and system for electronic commerce using multiple roles
US7912756B2 (en)1998-06-252011-03-22Amazon.Com, Inc.Method and system for electronic commerce using multiple roles
CN111078858A (en)*2018-10-192020-04-28阿里巴巴集团控股有限公司Article searching method and device and electronic equipment
CN111078858B (en)*2018-10-192023-06-09阿里巴巴集团控股有限公司Article searching method and device and electronic equipment
CN113452580A (en)*2021-07-212021-09-28福州富昌维控电子科技有限公司Method for automatically checking network equipment faults in batches and test host end
CN113452580B (en)*2021-07-212022-07-26福州富昌维控电子科技有限公司Method for automatically checking network equipment faults in batches and test host end
CN114330963A (en)*2021-10-142022-04-12北京爱奇艺科技有限公司Video-based object evaluation method, system, device and storage medium
CN114298308A (en)*2021-12-082022-04-08南京林业大学Furniture personalized customization system and method based on interactive genetic algorithm

Similar Documents

PublicationPublication DateTitle
US5754939A (en)System for generation of user profiles for a system for customized electronic identification of desirable objects
US6460036B1 (en)System and method for providing customized electronic newspapers and target advertisements
US6029195A (en)System for customized electronic identification of desirable objects
WO1997016796A1 (en)System for customized electronic identification of desirable objects
US8229782B1 (en)Methods and systems for processing distributed feedback
US8676645B2 (en)Identification of users for advertising using data with missing values
HargittaiOpen portals or closed gates? Channeling content on the World Wide Web
US7110983B2 (en)Methods for matching, selecting, narrowcasting, and/or classifying based on rights management and/or other information
US7085806B1 (en)Method and apparatus for recommending a match to another
US6112181A (en)Systems and methods for matching, selecting, narrowcasting, and/or classifying based on rights management and/or other information
US5918014A (en)Automated collaborative filtering in world wide web advertising
CA2236015A1 (en)System for customized electronic identification of desirable objects
AU1562402A (en)System for customized electronic identification of desirable objects
AU2008261113A1 (en)System for Customized Electronic Identification of Desirable Objects
AU2012216241A1 (en)System for Customized Electronic Identification of Desirable Objects
MwaipopoMarketing communications via interactive media in African developing countries: some research perspectives
Lane et al.The Effect of the Internet on the Advertising Industry in a Consumer Culture

Legal Events

DateCodeTitleDescription
EEERExamination request
FZDEDead

[8]ページ先頭

©2009-2025 Movatter.jp