US20120078871A1 - Detecting duplicate and near-duplicate files - Google Patents

Publication number
US20120078871A1
Authority
US
United States
Prior art keywords
search result
candidate search
candidate
duplicate
determining
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/313,913
Inventor
William Pugh
Monika H. Henzinger
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2001-01-24
Filing date
2011-12-07
Publication date
2012-03-29
Application filed by Individual
Priority to US13/313,913
Publication of US20120078871A1
Assigned to GOOGLE INC. (assignment of assignors interest; see document for details). Assignors: PUGH, WILLIAM; HENZINGER, MONIKA H.
Legal status: Abandoned (current)

Abstract

Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints match.
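
The three steps in the abstract can be sketched in Python as follows. This is only an illustrative sketch, not the method as claimed: the choice of overlapping word windows as "parts", the use of SHA-256, the hash-based assignment of each part to exactly one list, and the names `extract_parts`, `fingerprints`, `is_near_duplicate`, `num_lists`, and `part_size` are all assumptions introduced here.

```python
import hashlib


def extract_parts(text, part_size=3):
    """(i) Extract parts: overlapping windows of `part_size` words (an assumption)."""
    words = text.lower().split()
    if not words:
        return []
    if len(words) < part_size:
        return [" ".join(words)]
    return [" ".join(words[i:i + part_size])
            for i in range(len(words) - part_size + 1)]


def fingerprints(text, num_lists=4, part_size=3):
    """Return the set of fingerprints, one per populated list."""
    lists = [[] for _ in range(num_lists)]
    # (ii) Assign each extracted part to one of a predetermined number of lists.
    # Hashing the part to pick the list is an assumption of this sketch.
    for part in extract_parts(text, part_size):
        digest = hashlib.sha256(part.encode("utf-8")).digest()
        lists[digest[0] % num_lists].append(part)
    # (iii) Generate a fingerprint from each populated list; empty lists are skipped.
    result = set()
    for index, members in enumerate(lists):
        if members:
            payload = f"{index}:" + "\n".join(members)
            result.add(hashlib.sha256(payload.encode("utf-8")).hexdigest())
    return result


def is_near_duplicate(doc_a, doc_b, **options):
    """Two documents are treated as near-duplicates if any fingerprint matches."""
    return bool(fingerprints(doc_a, **options) & fingerprints(doc_b, **options))
```

Because a small edit changes only the parts that overlap it, the lists that receive none of the changed parts keep their fingerprints, which is why a single matching fingerprint is taken as evidence of near-duplication. Raising `num_lists` in this sketch makes a match more tolerant of local edits.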

Claims (21)

1. A method for filtering a plurality of candidate search results to remove near-duplicates, the method comprising:
a) for one or more of the plurality of candidate search results, determining that one candidate search result of the one or more candidate search results is a near-duplicate of another of the plurality of candidate search results by
1) determining that a cluster identifier of the one candidate search result matches a cluster identifier of the other candidate search result; and
2) in response to determining that a cluster identifier of the one candidate search result matches a cluster identifier of the other candidate search result, concluding that the one candidate search is a near-duplicate of the other candidate search result; and
b) in response to the determination that the one candidate search result is a near-duplicate of the other candidate search result, rejecting the one candidate search result thereby defining a filtered set of search results including only those of the plurality of candidate search results that have not been rejected.
8. An apparatus, comprising:
at least one processor; and
at least one storage device storing a processor executable program which, when executed by the at least one processor, causes the at least one processor to perform operations comprising:
a) for one or more of the plurality of candidate search results, determining that one candidate search result of the one or more candidate search results is a near-duplicate of another of the plurality of candidate search results by
1) determining that a cluster identifier of the one candidate search result matches a cluster identifier of the other candidate search result; and
2) in response to determining that a cluster identifier of the one candidate search result matches a cluster identifier of the other candidate search result, concluding that the one candidate search is a near-duplicate of the other candidate search result; and
b) in response to the determination that the one candidate search result is a near-duplicate of the other candidate search result, rejecting the one candidate search result thereby defining a filtered set of search results including only those of the plurality of candidate search results that have not been rejected.
15. A computer storage device encoded with a computer program, the computer program comprising instructions that, when executed by data processing apparatus, cause the data processing apparatus to perform operations comprising:
a) for one or more of the plurality of candidate search results, determining that one candidate search result of the one or more candidate search results is a near-duplicate of another of the plurality of candidate search results by
1) determining that a cluster identifier of the one candidate search result matches a cluster identifier of the other candidate search result; and
2) in response to determining that a cluster identifier of the one candidate search result matches a cluster identifier of the other candidate search result, concluding that the one candidate search is a near-duplicate of the other candidate search result; and
b) in response to the determination that the one candidate search result is a near-duplicate of the other candidate search result, rejecting the one candidate search result thereby defining a filtered set of search results including only those of the plurality of candidate search results that have not been rejected.
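
The filtering step recited in the independent claims above can be sketched as follows. This is a hedged illustration only: the representation of a candidate as a `(result, cluster_id)` pair, the function name `filter_near_duplicates`, and the policy of keeping the first candidate seen for each cluster identifier are assumptions, since the claims do not say which member of a matching pair is rejected.

```python
def filter_near_duplicates(candidates):
    """Return the candidates that survive near-duplicate rejection.

    `candidates` is an iterable of (result, cluster_id) pairs in ranked order.
    A candidate whose cluster identifier matches that of an earlier candidate
    is concluded to be a near-duplicate and is rejected.
    """
    seen_cluster_ids = set()
    filtered = []
    for result, cluster_id in candidates:
        if cluster_id in seen_cluster_ids:
            # Cluster identifiers match: reject this candidate.
            continue
        seen_cluster_ids.add(cluster_id)
        filtered.append(result)
    return filtered


ranked = [("page-a", 17), ("page-b", 42), ("page-a-mirror", 17)]
filtered_results = filter_near_duplicates(ranked)
```

Here `page-a-mirror` shares cluster identifier 17 with `page-a`, so only `page-a` and `page-b` remain in the filtered set.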
US13/313,913 | 2001-01-24 | 2011-12-07 | Detecting duplicate and near-duplicate files | Abandoned | US20120078871A1 (en)

Priority Applications (1)

Application Number | Publication | Priority Date | Filing Date | Title
US13/313,913 | US20120078871A1 (en) | 2001-01-24 | 2011-12-07 | Detecting duplicate and near-duplicate files

Applications Claiming Priority (4)

Application Number | Publication | Priority Date | Filing Date | Title
US09/768,947 | US6658423B1 (en) | 2001-01-24 | 2001-01-24 | Detecting duplicate and near-duplicate files
US10/608,468 | US7366718B1 (en) | 2001-01-24 | 2003-06-27 | Detecting duplicate and near-duplicate files
US12/049,278 | US9275143B2 (en) | 2001-01-24 | 2008-03-15 | Detecting duplicate and near-duplicate files
US13/313,913 | US20120078871A1 (en) | 2001-01-24 | 2011-12-07 | Detecting duplicate and near-duplicate files

Related Parent Applications (1)

Application Number | Relation | Publication | Priority Date | Filing Date | Title
US12/049,278 | Continuation | US9275143B2 (en) | 2001-01-24 | 2008-03-15 | Detecting duplicate and near-duplicate files

Publications (1)

Publication Number | Publication Date
US20120078871A1 (en) | 2012-03-29

Family

ID=29550477

Family Applications (4)

Application Number | Status | Publication | Priority Date | Filing Date | Title
US09/768,947 | Expired - Lifetime | US6658423B1 (en) | 2001-01-24 | 2001-01-24 | Detecting duplicate and near-duplicate files
US10/608,468 | Expired - Lifetime | US7366718B1 (en) | 2001-01-24 | 2003-06-27 | Detecting duplicate and near-duplicate files
US12/049,278 | Expired - Fee Related | US9275143B2 (en) | 2001-01-24 | 2008-03-15 | Detecting duplicate and near-duplicate files
US13/313,913 | Abandoned | US20120078871A1 (en) | 2001-01-24 | 2011-12-07 | Detecting duplicate and near-duplicate files

Country Status (1)

Country | Link
US (4) | US6658423B1 (en)

US7676465B2 (en)*2006-07-052010-03-09Yahoo! Inc.Techniques for clustering structurally similar web pages based on page features
GB2440174A (en)*2006-07-192008-01-23Chronicle SolutionsDetermining similarity of electronic documents by comparing hashed alphabetically ordered phrases
US7925620B1 (en)2006-08-042011-04-12Hyoungsoo YoonContact information management
US8015162B2 (en)*2006-08-042011-09-06Google Inc.Detecting duplicate and near-duplicate files
US8944332B2 (en)*2006-08-042015-02-03Intermec Ip Corp.Testing automatic data collection devices, such as barcode, RFID and/or magnetic stripe readers
US8732314B2 (en)*2006-08-212014-05-20Cisco Technology, Inc.Generation of contact information based on associating browsed content to user actions
US8010534B2 (en)2006-08-312011-08-30Orcatec LlcIdentifying related objects using quantum clustering
US8099415B2 (en)*2006-09-082012-01-17Simply Hired, Inc.Method and apparatus for assessing similarity between online job listings
US7730316B1 (en)*2006-09-222010-06-01Fatlens, Inc.Method for document fingerprinting
US8321197B2 (en)*2006-10-182012-11-27Teresa Ruth GaudetMethod and process for performing category-based analysis, evaluation, and prescriptive practice creation upon stenographically written and voice-written text files
JP4721447B2 (en)*2006-10-202011-07-13キヤノン株式会社 Imaging apparatus, content management method, program, and storage medium
FR2907934B1 (en)*2006-10-272009-02-06Inst Nat Rech Inf Automat COMPUTER TOOL FOR MANAGING DIGITAL DOCUMENTS
US8661029B1 (en)2006-11-022014-02-25Google Inc.Modifying search result ranking based on implicit user feedback
US8301621B2 (en)*2006-11-072012-10-30At&T Intellectual Property I, L.P.Topic map for navigational control
US8156112B2 (en)2006-11-072012-04-10At&T Intellectual Property I, L.P.Determining sort order by distance
US8326819B2 (en)2006-11-132012-12-04Exegy IncorporatedMethod and system for high performance data metatagging and data indexing using coprocessors
US7660793B2 (en)2006-11-132010-02-09Exegy IncorporatedMethod and system for high performance integration, processing and searching of structured and unstructured data using coprocessors
US8214517B2 (en)*2006-12-012012-07-03Nec Laboratories America, Inc.Methods and systems for quick and efficient data management and/or processing
US7844581B2 (en)*2006-12-012010-11-30Nec Laboratories America, Inc.Methods and systems for data management using multiple selection criteria
JP4732314B2 (en)*2006-12-052011-07-27キヤノン株式会社 Image processing apparatus and image processing method
US8577866B1 (en)2006-12-072013-11-05Google Inc.Classifying content
US8983970B1 (en)2006-12-072015-03-17Google Inc.Ranking content using content and content authors
US7926026B2 (en)*2006-12-202011-04-12Sap AgGraphical analysis to detect process object anomalies
US7756798B2 (en)*2007-03-062010-07-13Oracle International CorporationExtensible mechanism for detecting duplicate search items
US8938463B1 (en)2007-03-122015-01-20Google Inc.Modifying search result ranking based on implicit user feedback and a model of presentation bias
US8694374B1 (en)2007-03-142014-04-08Google Inc.Detecting click spam
US20080235163A1 (en)*2007-03-222008-09-25Srinivasan BalasubramanianSystem and method for online duplicate detection and elimination in a web crawler
US7617195B2 (en)*2007-03-282009-11-10Xerox CorporationOptimizing the performance of duplicate identification by content
US7698317B2 (en)*2007-04-202010-04-13Yahoo! Inc.Techniques for detecting duplicate web pages
US9092510B1 (en)2007-04-302015-07-28Google Inc.Modifying search result ranking based on a temporal element of user feedback
US7849399B2 (en)*2007-06-292010-12-07Walter HoffmannMethod and system for tracking authorship of content in data
US20090012984A1 (en)*2007-07-022009-01-08Equivio Ltd.Method for Organizing Large Numbers of Documents
US20090043767A1 (en)*2007-08-072009-02-12Ashutosh JoshiApproach For Application-Specific Duplicate Detection
US8694511B1 (en)2007-08-202014-04-08Google Inc.Modifying search result ranking based on populations
US20090063470A1 (en)*2007-08-282009-03-05Nogacom Ltd.Document management using business objects
EP2186250B1 (en)2007-08-312019-03-27IP Reservoir, LLCMethod and apparatus for hardware-accelerated encryption/decryption
US8370751B2 (en)*2007-08-312013-02-05Sap AgUser interface customization system
US20090089326A1 (en)*2007-09-282009-04-02Yahoo!, Inc.Method and apparatus for providing multimedia content optimization
US8271493B2 (en)*2007-10-112012-09-18Oracle International CorporationExtensible mechanism for grouping search results
US8909655B1 (en)2007-10-112014-12-09Google Inc.Time based ranking
US8010524B2 (en)*2007-10-292011-08-30International Business Machines CorporationMethod of monitoring electronic media
US7363329B1 (en)2007-11-132008-04-22International Business Machines CorporationMethod for duplicate detection on web-scale data in supercomputing environments
US8131751B1 (en)2008-01-182012-03-06Google Inc.Algorithms for selecting subsequences
US8111912B2 (en)*2008-02-152012-02-07Yahoo! Inc.Cost-effective image metadata creation using near-duplicate image detection
US8055078B2 (en)*2008-02-282011-11-08Yahoo! Inc.Filter for blocking image-based spam
US9058378B2 (en)*2008-04-112015-06-16Ebay Inc.System and method for identification of near duplicate user-generated content
US8244551B1 (en)2008-04-212012-08-14Monster Worldwide, Inc.Apparatuses, methods and systems for advancement path candidate cloning
US7930306B2 (en)*2008-04-302011-04-19Msc Intellectual Properties B.V.System and method for near and exact de-duplication of documents
US9128945B1 (en)2008-05-162015-09-08Google Inc.Query augmentation
US8463770B1 (en)2008-07-092013-06-11Amazon Technologies, Inc.System and method for conditioning search results
US20100082356A1 (en)*2008-09-302010-04-01Yahoo! Inc.System and method for recommending personalized career paths
US8073818B2 (en)*2008-10-032011-12-06Microsoft CorporationCo-location visual pattern mining for near-duplicate image retrieval
US8701185B2 (en)*2008-10-142014-04-15At&T Intellectual Property I, L.P.Method for locating fraudulent replicas of web sites
US9043375B2 (en)*2008-10-172015-05-26Software Analysis And Forensic Engineering CorporationSearching the internet for common elements in a document in order to detect plagiarism
TW201027375A (en)2008-10-202010-07-16IbmSearch system, search method and program
US8041710B2 (en)*2008-11-132011-10-18Microsoft CorporationAutomatic diagnosis of search relevance failures
US8849772B1 (en)*2008-11-142014-09-30Emc CorporationData replication with delta compression
US8447740B1 (en)*2008-11-142013-05-21Emc CorporationStream locality delta compression
US8751462B2 (en)*2008-11-142014-06-10Emc CorporationDelta compression after identity deduplication
US8190581B2 (en)*2008-12-032012-05-29At&T Intellectual Property I, L.P.Real-time content detection in ISP transmissions
US7934161B1 (en)2008-12-092011-04-26Jason Adam DeniseElectronic search interface technology
KR20100066920A (en)*2008-12-102010-06-18한국전자통신연구원Electronic document processing apparatus and its method
US8396865B1 (en)2008-12-102013-03-12Google Inc.Sharing search engine relevance data between corpora
US20100169311A1 (en)*2008-12-302010-07-01Ashwin TengliApproaches for the unsupervised creation of structural templates for electronic documents
US8284418B2 (en)*2009-01-052012-10-09International Business Machines CorporationDocument information acquisition and notification of duplicate document storage
US8001462B1 (en)*2009-01-302011-08-16Google Inc.Updating search engine document index based on calculated age of changed portions in a document
CN101499098B (en)*2009-03-042012-07-11阿里巴巴集团控股有限公司Web page assessed value confirming and employing method and system
US9009146B1 (en)2009-04-082015-04-14Google Inc.Ranking search results based on similar queries
US8180773B2 (en)*2009-05-272012-05-15International Business Machines CorporationDetecting duplicate documents using classification
US8412848B2 (en)*2009-05-292013-04-02Exagrid Systems, Inc.Method and apparatus for content-aware and adaptive deduplication
US8762348B2 (en)*2009-06-092014-06-24Emc CorporationSegment deduplication system with compression of segments
US8401181B2 (en)*2009-06-092013-03-19Emc CorporationSegment deduplication system with encryption of segments
US8731190B2 (en)*2009-06-092014-05-20Emc CorporationSegment deduplication system with encryption and compression of segments
US8341538B1 (en)*2009-06-092012-12-25Symantec CorporationSystems and methods for reducing redundancies in quality-assurance reviews of graphical user interfaces
US8447760B1 (en)2009-07-202013-05-21Google Inc.Generating a related set of documents for an initial set of documents
US8635223B2 (en)2009-07-282014-01-21Fti Consulting, Inc.System and method for providing a classification suggestion for electronically stored information
US9245007B2 (en)*2009-07-292016-01-26International Business Machines CorporationDynamically detecting near-duplicate documents
US8959062B2 (en)*2009-08-132015-02-17Hitachi Solutions, Ltd.Data storage device with duplicate elimination function and control device for creating search index for the data storage device
US8612446B2 (en)2009-08-242013-12-17Fti Consulting, Inc.System and method for generating a reference set for use during document review
US8498974B1 (en)2009-08-312013-07-30Google Inc.Refining search results
US8972391B1 (en)2009-10-022015-03-03Google Inc.Recent interest based relevance scoring
US8874555B1 (en)2009-11-202014-10-28Google Inc.Modifying scoring data based on historical changes
CN101702179B (en)*2009-12-012012-07-25百度在线网络技术(北京)有限公司Method and device for removing duplication from data mining
US8615514B1 (en)2010-02-032013-12-24Google Inc.Evaluating website properties by partitioning user feedback
US8924379B1 (en)2010-03-052014-12-30Google Inc.Temporal-based score adjustments
US8959093B1 (en)2010-03-152015-02-17Google Inc.Ranking search results based on anchors
US11367295B1 (en)2010-03-232022-06-21Aurea Software, Inc.Graphical user interface for presentation of events
US10546311B1 (en)2010-03-232020-01-28Aurea Software, Inc.Identifying competitors of companies
US8452106B2 (en)*2010-03-232013-05-28Microsoft CorporationPartition min-hash for partial-duplicate image determination
US9760634B1 (en)2010-03-232017-09-12Firstrain, Inc.Models for classifying documents
US10643227B1 (en)2010-03-232020-05-05Aurea Software, Inc.Business lines
US20110238686A1 (en)*2010-03-242011-09-29Microsoft CorporationCaching data obtained via data service interfaces
US8650195B2 (en)*2010-03-262014-02-11Palle M PedersenRegion based information retrieval system
US9489350B2 (en)*2010-04-302016-11-08Orbis Technologies, Inc.Systems and methods for semantic search, content correlation and visualization
US8626770B2 (en)2010-05-032014-01-07International Business Machines CorporationIceberg query evaluation implementing a compressed bitmap index
JP5008748B2 (en)*2010-05-132012-08-22株式会社日立ソリューションズ Search method, integrated search server, and computer program
US9311395B2 (en)2010-06-102016-04-12Aol Inc.Systems and methods for manipulating electronic content based on speech recognition
US8601076B2 (en)2010-06-102013-12-03Aol Inc.Systems and methods for identifying and notifying users of electronic content based on biometric recognition
US9623119B1 (en)2010-06-292017-04-18Google Inc.Accentuating search results
WO2012009672A1 (en)*2010-07-162012-01-19Altruik, Inc.System and method for improving webpage indexing and optimization
US9152660B2 (en)2010-07-232015-10-06Donato DiorioData normalizer
US8832083B1 (en)2010-07-232014-09-09Google Inc.Combining user feedback
CN102375813B (en)*2010-08-092016-12-21深圳市世纪光速信息技术有限公司Search engine re-scheduling system and method
US8332408B1 (en)2010-08-232012-12-11Google Inc.Date-based web page annotation
KR101452664B1 (en)2010-09-032014-10-22아이패러다임스, 엘엘씨Systems and methods for document analysis
US8606787B1 (en)2010-09-152013-12-10Google Inc.Social network node clustering system and method
US8346792B1 (en)2010-11-092013-01-01Google Inc.Query generation using structural similarity between documents
CN102541918A (en)*2010-12-302012-07-04阿里巴巴集团控股有限公司Method and equipment for identifying repeated information
US9002867B1 (en)2010-12-302015-04-07Google Inc.Modifying ranking data based on document changes
US8527516B1 (en)*2011-02-252013-09-03Google Inc.Identifying similar digital text volumes
JP5708107B2 (en)*2011-03-232015-04-30日本電気株式会社 Duplicate file detection device
US8504673B2 (en)2011-03-312013-08-06Verisign, Inc.Traffic like NXDomains
US8582872B1 (en)*2011-06-302013-11-12Google Inc.Place holder image detection via image clustering
CN103106585B (en)*2011-11-112016-05-04阿里巴巴集团控股有限公司The real-time repetition removal method and apparatus of product information
US20130144847A1 (en)*2011-12-052013-06-06Google Inc.De-Duplication of Featured Content
US9690789B2 (en)*2011-12-092017-06-27Veritas Technologies LlcArchive systems and methods
CN103294671B (en)*2012-02-222018-04-27深圳市世纪光速信息技术有限公司The detection method and system of document
US9015080B2 (en)2012-03-162015-04-21Orbis Technologies, Inc.Systems and methods for semantic inference and reasoning
US8995771B2 (en)2012-04-302015-03-31Microsoft Technology Licensing, LlcIdentification of duplicates within an image space
US8583687B1 (en)2012-05-152013-11-12Algebraix Data CorporationSystems and methods for indirect algebraic partitioning
WO2013173802A1 (en)*2012-05-172013-11-21Google Inc.Systems and methods for crawling and indexing content
WO2013173805A1 (en)2012-05-172013-11-21Google Inc.Systems and methods re-ranking ranked search results
JP5352712B2 (en)*2012-05-292013-11-27株式会社日立ソリューションズ Search method, integrated search server, and computer program
US10135462B1 (en)*2012-06-132018-11-20EMC IP Holding Company LLCDeduplication using sub-chunk fingerprints
US10007731B2 (en)*2012-09-122018-06-26Google LlcDeduplication in search results
JP5526209B2 (en)*2012-10-092014-06-18株式会社Ubic Forensic system, forensic method, and forensic program
CN103778163A (en)*2012-10-262014-05-07广州市邦富软件有限公司Rapid webpage de-weight algorithm based on fingerprints
US9189531B2 (en)2012-11-302015-11-17Orbis Technologies, Inc.Ontology harmonization and mediation systems and methods
US10229367B2 (en)*2013-02-062019-03-12Jacob DrewCollaborative analytics map reduction classification learning systems and methods
US9720669B2 (en)2013-05-302017-08-01The Boeing CompanyDeployment of software across an enterprise system
US9183499B1 (en)2013-04-192015-11-10Google Inc.Evaluating quality based on neighbor features
US9477973B2 (en)2013-06-252016-10-25International Business MachinesVisually generated consumer product presentation
US9251160B1 (en)*2013-06-272016-02-02Symantec CorporationData transfer between dissimilar deduplication systems
RU2580036C2 (en)2013-06-282016-04-10Закрытое акционерное общество "Лаборатория Касперского"System and method of making flexible convolution for malware detection
US20150032729A1 (en)*2013-07-232015-01-29Salesforce.Com, Inc.Matching snippets of search results to clusters of objects
CN103455641B (en)*2013-09-292017-02-22北大医疗信息技术有限公司Crossing repeated retrieval system and method
KR101577376B1 (en)*2014-01-212015-12-14(주) 아워텍System and method for determining infringement of copyright based on the text reference point
US20170330153A1 (en)2014-05-132017-11-16Monster Worldwide, Inc.Search Extraction Matching, Draw Attention-Fit Modality, Application Morphing, and Informed Apply Apparatuses, Methods and Systems
US20150379022A1 (en)*2014-06-272015-12-31General Electric CompanyIntegrating Execution of Computing Analytics within a Mapreduce Processing Environment
US10102480B2 (en)2014-06-302018-10-16Amazon Technologies, Inc.Machine learning service
US10963810B2 (en)*2014-06-302021-03-30Amazon Technologies, Inc.Efficient duplicate detection for machine learning data sets
CN104090976B (en)*2014-07-212017-06-23北京奇虎科技有限公司The method and device of search engine crawler capturing webpage
US9984166B2 (en)2014-10-102018-05-29Salesforce.Com, Inc.Systems and methods of de-duplicating similar news feed items
US10592841B2 (en)2014-10-102020-03-17Salesforce.Com, Inc.Automatic clustering by topic and prioritizing online feed items
US9805099B2 (en)*2014-10-302017-10-31The Johns Hopkins UniversityApparatus and method for efficient identification of code similarity
CN105630802A (en)2014-10-302016-06-01阿里巴巴集团控股有限公司Webpage duplication removal method and apparatus
US10282353B2 (en)*2015-02-262019-05-07Accenture Global Services LimitedProactive duplicate identification
CN115795147A (en)2015-05-202023-03-14电子湾有限公司Method and system for searching
TWI570579B (en)*2015-07-232017-02-11葆光資訊有限公司An information retrieving method utilizing webpage visual features and webpage language features and a system using thereof
RU2614561C1 (en)*2015-12-182017-03-28Закрытое акционерное общество "Лаборатория Касперского"System and method of similar files determining
WO2017210618A1 (en)2016-06-022017-12-07Fti Consulting, Inc.Analyzing clusters of coded documents
US10229315B2 (en)2016-07-272019-03-12Intuit, Inc.Identification of duplicate copies of a form in a document
US10664645B2 (en)*2016-10-072020-05-26AlltheroomsSystem and method for transposing web content
US10599614B1 (en)2018-01-022020-03-24Amazon Technologies, Inc.Intersection-based dynamic blocking
CN110427391B (en)*2018-04-282023-07-28伊姆西Ip控股有限责任公司Method, apparatus and computer program product for determining duplicate data
JP6990634B2 (en)*2018-08-212022-02-03株式会社日立ハイテク State prediction device and semiconductor manufacturing device
US10970321B1 (en)*2018-09-202021-04-06Vinod J MapranathIntelligent term and query generation and feedback based selective execution
US10872028B2 (en)*2019-03-012020-12-22Red Hat Israel, Ltd.Methods and systems for identifying duplicate jobs in a continuous integration environment
US20210004795A1 (en)*2019-07-032021-01-07Sap SeAnomaly and fraud detection using duplicate event detector
US11282078B2 (en)2019-07-032022-03-22Sap SeTransaction auditing using token extraction and model matching
US12154179B2 (en)2019-07-032024-11-26Sap SeTransaction policy audit
US20210360001A1 (en)*2020-05-152021-11-18Ebay Inc.Cluster-based near-duplicate document detection
US11430244B2 (en)*2020-12-232022-08-30Cylance Inc.Statistical data fingerprinting and tracing data similarity of documents
US11797486B2 (en)2022-01-032023-10-24Bank Of America CorporationFile de-duplication for a distributed database

Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US6438741B1 (en)*1998-09-282002-08-20Compaq Computer CorporationSystem and method for eliminating compile time explosion in a top down rule based system using selective sampling
US6618727B1 (en)*1999-09-222003-09-09Infoglide CorporationSystem and method for performing similarity searching
US6832218B1 (en)*2000-09-222004-12-14International Business Machines CorporationSystem and method for associating search results
US6978419B1 (en)*2000-11-152005-12-20Justsystem CorporationMethod and apparatus for efficient identification of duplicate and near-duplicate documents and text spans using high-discriminability text fragments

Family Cites Families (56)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US6850252B1 (en)1999-10-052005-02-01Steven M. HoffbergIntelligent electronic appliance system and method
US5465299A (en)*1992-12-031995-11-07Hitachi, Ltd.Electronic document processing system and method of forming digital signature
US5850490A (en)*1993-12-221998-12-15Xerox CorporationAnalyzing an image of a document using alternative positionings of a class of segments
US6505160B1 (en)*1995-07-272003-01-07Digimarc CorporationConnected audio and other media objects
US5778395A (en)*1995-10-231998-07-07Stac, Inc.System for backing up files from disk volumes on multiple nodes of a computer network
US5832182A (en)1996-04-241998-11-03Wisconsin Alumni Research FoundationMethod and system for data clustering for very large databases
US5909677A (en)1996-06-181999-06-01Digital Equipment CorporationMethod for determining the resemblance of documents
US6052693A (en)*1996-07-022000-04-18Harlequin Group PlcSystem for assembling large databases through information extracted from text sources
US5745900A (en)*1996-08-091998-04-28Digital Equipment CorporationMethod for indexing duplicate database records using a full-record fingerprint
US6285999B1 (en)1997-01-102001-09-04The Board Of Trustees Of The Leland Stanford Junior UniversityMethod for node ranking in a linked database
US6088707A (en)*1997-10-062000-07-11International Business Machines CorporationComputer system and method of displaying update status of linked hypertext documents
US6134532A (en)1997-11-142000-10-17Aptex Software, Inc.System and method for optimal adaptive matching of users to most relevant entity and information in real-time
US6374251B1 (en)1998-03-172002-04-16Microsoft CorporationScalable system for clustering of large databases
US6119124A (en)*1998-03-262000-09-12Digital Equipment CorporationMethod for clustering closely resembling data objects
US6263348B1 (en)*1998-07-012001-07-17Serena Software International, Inc.Method and apparatus for identifying the existence of differences between two files
US6363377B1 (en)*1998-07-302002-03-26Sarnoff CorporationSearch data processor
US6240409B1 (en)*1998-07-312001-05-29The Regents Of The University Of CaliforniaMethod and apparatus for detecting and summarizing document similarity within large document sets
US6493709B1 (en)*1998-07-312002-12-10The Regents Of The University Of CaliforniaMethod and apparatus for digitally shredding similar documents within large document sets in a data processing environment
US6317722B1 (en)1998-09-182001-11-13Amazon.Com, Inc.Use of electronic shopping carts to generate personal recommendations
US6360215B1 (en)*1998-11-032002-03-19Inktomi CorporationMethod and apparatus for retrieving documents based on information other than document content
US6415297B1 (en)1998-11-172002-07-02International Business Machines CorporationParallel database support for workflow management systems
JP2000187668A (en)*1998-12-222000-07-04Hitachi Ltd Grouping and deduplication methods
US6564202B1 (en)1999-01-262003-05-13Xerox CorporationSystem and method for visually representing the contents of a multiple data object cluster
US6347313B1 (en)1999-03-012002-02-12Hewlett-Packard CompanyInformation embedding based on user relevance feedback for object retrieval
CA2365245C (en)*1999-03-312008-12-09British Telecommunications Public Limited CompanyServer computer for guaranteeing files integrity
US6574657B1 (en)*1999-05-032003-06-03Symantec CorporationMethods and apparatuses for file synchronization and updating using a signature list
US6564261B1 (en)1999-05-102003-05-13Telefonaktiebolaget Lm Ericsson (Publ)Distributed system to intelligently establish sessions between anonymous users over various networks
US6873982B1 (en)*1999-07-162005-03-29International Business Machines CorporationOrdering of database search results based on user feedback
US6389433B1 (en)*1999-07-162002-05-14Microsoft CorporationMethod and system for automatically merging files into a single instance store
US6718363B1 (en)1999-07-302004-04-06Verizon Laboratories, Inc.Page aggregation for web sites
US6542889B1 (en)2000-01-282003-04-01International Business Machines CorporationMethods and apparatus for similarity text search based on conceptual indexing
WO2001084313A2 (en)2000-05-022001-11-08Sun Microsystems, Inc.Method and system for achieving high availability in a networked computer system
US6697800B1 (en)2000-05-192004-02-24Roxio, Inc.System and method for determining affinity using objective and subjective data
US6665661B1 (en)*2000-09-292003-12-16Battelle Memorial InstituteSystem and method for use in text analysis of documents and records
US7069367B2 (en)2000-12-292006-06-27Intel CorporationMethod and apparatus for avoiding race condition with edge-triggered interrupts
US6658423B1 (en)2001-01-242003-12-02Google, Inc.Detecting duplicate and near-duplicate files
WO2003017055A2 (en)2001-08-152003-02-27Visa International Service AssociationMethod and system for delivering multiple services electronically to customers via a centralized portal architecture
US7203343B2 (en)2001-09-212007-04-10Hewlett-Packard Development Company, L.P.System and method for determining likely identity in a biometric database
US7343365B2 (en)2002-02-202008-03-11Microsoft CorporationComputer system architecture for automatic context associations
US6871200B2 (en)2002-07-112005-03-22Forensic Eye Ltd.Registration and monitoring system
US7434220B2 (en)2002-09-162008-10-07Clearcube Technology, Inc.Distributed computing infrastructure including autonomous intelligent management system
US7228351B2 (en)2002-12-312007-06-05International Business Machines CorporationMethod and apparatus for managing resource contention in a multisystem cluster
US20040139072A1 (en)2003-01-132004-07-15Broder Andrei Z.System and method for locating similar records in a database
US20060168006A1 (en)2003-03-242006-07-27Mr. Marvin ShannonSystem and method for the classification of electronic communication
US20040210575A1 (en)2003-04-182004-10-21Bean Douglas M.Systems and methods for eliminating duplicate documents
US7831679B2 (en)2003-10-152010-11-09Microsoft CorporationGuiding sensing and preferences for context-sensitive services
US8296304B2 (en)2004-01-262012-10-23International Business Machines CorporationMethod, system, and program for handling redirects in a search engine
AU2005264153B2 (en)2004-07-212012-04-05Microsoft Israel Research And Development (2002) LtdA method for determining near duplicate data objects
US7966327B2 (en)2004-11-082011-06-21The Trustees Of Princeton UniversitySimilarity search system with compact data structures
US20060149820A1 (en)2005-01-042006-07-06International Business Machines CorporationDetecting spam e-mail using similarity calculations
US7433869B2 (en)2005-07-012008-10-07Ebrary, Inc.Method and apparatus for document clustering and document sketching
US7739314B2 (en)2005-08-152010-06-15Google Inc.Scalable user clustering based on set similarity
US7747614B2 (en)2005-10-312010-06-29Yahoo! Inc.Difference control for generating and displaying a difference result set from the result sets of a plurality of search engines
US7472121B2 (en)2005-12-152008-12-30International Business Machines CorporationDocument comparison using multiple similarity measures
US8015162B2 (en)2006-08-042011-09-06Google Inc.Detecting duplicate and near-duplicate files
US8843915B2 (en)*2011-07-282014-09-23Hewlett-Packard Development Company, L.P.Signature-based update management

Cited By (52)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US10205781B1 (en)2000-09-142019-02-12Network-1 Technologies, Inc.Methods for using extracted features to perform an action associated with selected identified image
US9348820B1 (en) | 2000-09-14 | 2016-05-24 | Network-1 Technologies, Inc. | System and method for taking action with respect to an electronic media work and logging event information related thereto
US10621227B1 (en) | 2000-09-14 | 2020-04-14 | Network-1 Technologies, Inc. | Methods for using extracted features to perform an action
US10621226B1 (en) | 2000-09-14 | 2020-04-14 | Network-1 Technologies, Inc. | Methods for using extracted features to perform an action associated with selected identified image
US8782726B1 (en) | 2000-09-14 | 2014-07-15 | Network-1 Technologies, Inc. | Method for taking action based on a request related to an electronic media work
US8904465B1 (en) | 2000-09-14 | 2014-12-02 | Network-1 Technologies, Inc. | System for taking action based on a request related to an electronic media work
US8904464B1 (en) | 2000-09-14 | 2014-12-02 | Network-1 Technologies, Inc. | Method for tagging an electronic media work to perform an action
US10552475B1 (en) | 2000-09-14 | 2020-02-04 | Network-1 Technologies, Inc. | Methods for using extracted features to perform an action
US10540391B1 (en) | 2000-09-14 | 2020-01-21 | Network-1 Technologies, Inc. | Methods for using extracted features to perform an action
US10521470B1 (en) | 2000-09-14 | 2019-12-31 | Network-1 Technologies, Inc. | Methods for using extracted features to perform an action associated with selected identified image
US10521471B1 (en) | 2000-09-14 | 2019-12-31 | Network-1 Technologies, Inc. | Method for using extracted features to perform an action associated with selected identified image
US9558190B1 (en) | 2000-09-14 | 2017-01-31 | Network-1 Technologies, Inc. | System and method for taking action with respect to an electronic media work
US10303713B1 (en) | 2000-09-14 | 2019-05-28 | Network-1 Technologies, Inc. | Methods for using extracted features to perform an action
US9256885B1 (en) | 2000-09-14 | 2016-02-09 | Network-1 Technologies, Inc. | Method for linking an electronic media work to perform an action
US9544663B1 (en) | 2000-09-14 | 2017-01-10 | Network-1 Technologies, Inc. | System for taking action with respect to a media work
US9282359B1 (en) | 2000-09-14 | 2016-03-08 | Network-1 Technologies, Inc. | Method for taking action with respect to an electronic media work
US10303714B1 (en) | 2000-09-14 | 2019-05-28 | Network-1 Technologies, Inc. | Methods for using extracted features to perform an action
US9883253B1 (en) | 2000-09-14 | 2018-01-30 | Network-1 Technologies, Inc. | Methods for using extracted feature vectors to perform an action associated with a product
US10305984B1 (en) | 2000-09-14 | 2019-05-28 | Network-1 Technologies, Inc. | Methods for using extracted features to perform an action associated with selected identified image
US10108642B1 (en) | 2000-09-14 | 2018-10-23 | Network-1 Technologies, Inc. | System for using extracted feature vectors to perform an action associated with a work identifier
US9529870B1 (en) | 2000-09-14 | 2016-12-27 | Network-1 Technologies, Inc. | Methods for linking an electronic media work to perform an action
US10073862B1 (en) | 2000-09-14 | 2018-09-11 | Network-1 Technologies, Inc. | Methods for using extracted features to perform an action associated with selected identified image
US9536253B1 (en) | 2000-09-14 | 2017-01-03 | Network-1 Technologies, Inc. | Methods for linking an electronic media work to perform an action
US9538216B1 (en) | 2000-09-14 | 2017-01-03 | Network-1 Technologies, Inc. | System for taking action with respect to a media work
US10063936B1 (en) | 2000-09-14 | 2018-08-28 | Network-1 Technologies, Inc. | Methods for using extracted feature vectors to perform an action associated with a work identifier
US10367885B1 (en) | 2000-09-14 | 2019-07-30 | Network-1 Technologies, Inc. | Methods for using extracted features to perform an action associated with selected identified image
US10063940B1 (en) | 2000-09-14 | 2018-08-28 | Network-1 Technologies, Inc. | System for using extracted feature vectors to perform an action associated with a work identifier
US9781251B1 (en) | 2000-09-14 | 2017-10-03 | Network-1 Technologies, Inc. | Methods for using extracted features and annotations associated with an electronic media work to perform an action
US10057408B1 (en) | 2000-09-14 | 2018-08-21 | Network-1 Technologies, Inc. | Methods for using extracted feature vectors to perform an action associated with a work identifier
US9805066B1 (en) | 2000-09-14 | 2017-10-31 | Network-1 Technologies, Inc. | Methods for using extracted features and annotations associated with an electronic media work to perform an action
US9807472B1 (en) | 2000-09-14 | 2017-10-31 | Network-1 Technologies, Inc. | Methods for using extracted feature vectors to perform an action associated with a product
US9824098B1 (en) | 2000-09-14 | 2017-11-21 | Network-1 Technologies, Inc. | Methods for using extracted features to perform an action associated with identified action information
US9832266B1 (en) | 2000-09-14 | 2017-11-28 | Network-1 Technologies, Inc. | Methods for using extracted features to perform an action associated with identified action information
US9275143B2 (en) | 2001-01-24 | 2016-03-01 | Google Inc. | Detecting duplicate and near-duplicate files
US20080162478A1 (en)* | 2001-01-24 | 2008-07-03 | William Pugh | Detecting duplicate and near-duplicate files
US9798798B2 (en)* | 2001-03-19 | 2017-10-24 | FTI Technology, LLC | Computer-implemented system and method for selecting documents for review
US20140122450A1 (en)* | 2001-03-19 | 2014-05-01 | Fti Technology Llc | Computer-Implemented System And Method For Identifying Duplicate And Near Duplicate Messages
US20150100595A1 (en)* | 2001-03-19 | 2015-04-09 | Fti Technology Llc | Computer-Implemented System And Method For Identifying Related Messages
US8914331B2 (en)* | 2001-03-19 | 2014-12-16 | Fti Technology Llc | Computer-implemented system and method for identifying duplicate and near duplicate messages
US20160314190A1 (en)* | 2001-03-19 | 2016-10-27 | Fti Technology Llc | Computer-Implemented System And Method For Selecting Documents For Review
US9384250B2 (en)* | 2001-03-19 | 2016-07-05 | Fti Technology Llc | Computer-implemented system and method for identifying related messages
US9165089B2 (en)* | 2012-04-16 | 2015-10-20 | International Business Machines Corporation | Presenting unique search result contexts
US20140046932A1 (en)* | 2012-04-16 | 2014-02-13 | International Business Machines Corporation | Presenting unique search result contexts
US9104785B2 (en)* | 2012-04-16 | 2015-08-11 | International Business Machines Corporation | Presenting unique search result contexts
US20130275408A1 (en)* | 2012-04-16 | 2013-10-17 | International Business Machines Corporation | Presenting Unique Search Result Contexts
US9536016B2 (en)* | 2013-01-16 | 2017-01-03 | Google Inc. | On-disk multimap
US10671686B2 (en) | 2013-02-28 | 2020-06-02 | International Business Machines Corporation | Processing webpage data
US9607331B2 (en) | 2013-08-01 | 2017-03-28 | Google Inc. | Near-duplicate filtering in search engine result page of an online shopping system
WO2015013954A1 (en)* | 2013-08-01 | 2015-02-05 | Google Inc. | Near-duplicate filtering in search engine result page of an online shopping system
US9342849B2 (en) | 2013-08-01 | 2016-05-17 | Google Inc. | Near-duplicate filtering in search engine result page of an online shopping system
US20150370880A1 (en)* | 2014-06-20 | 2015-12-24 | International Business Machines Corporation | Identifying items on a second website already browsed at a first website
RU2825549C1 (en)* | 2024-02-29 | 2024-08-27 | Общество с ограниченной ответственностью "СЕРЧИНФОРМ" | Method of hashing files for fast search of duplicates

Also Published As

Publication number | Publication date
US7366718B1 (en) | 2008-04-29
US6658423B1 (en) | 2003-12-02
US20080162478A1 (en) | 2008-07-03
US9275143B2 (en) | 2016-03-01

Similar Documents

Publication | Publication Date | Title
US7366718B1 (en) | | Detecting duplicate and near-duplicate files
US7779002B1 (en) | | Detecting query-specific duplicate documents
US8015162B2 (en) | | Detecting duplicate and near-duplicate files
US8548972B1 (en) | | Near-duplicate document detection for web crawling
Haveliwala et al. | | Evaluating strategies for similarity search on the web
Chakrabarti | | Mining the Web: Discovering knowledge from hypertext data
Dill et al. | | Self-similarity in the web
Wan et al. | | CollabRank: towards a collaborative approach to single-document keyphrase extraction
US7650330B1 (en) | | Information extraction from a database
US8473510B1 (en) | | Locating meaningful stopwords or stop-phrases in keyword-based retrieval systems
US8423885B1 (en) | | Updating search engine document index based on calculated age of changed portions in a document
Liu | | Information retrieval and Web search
Tajima et al. | | Cut as a querying unit for WWW, Netnews, and E-mail
Shi et al. | | Mining related queries from web search engine query logs using an improved association rule mining model
Hannappel et al. | | MSEEC-a multi search engine with multiple clustering
Caverlee et al. | | Discovering objects in dynamically-generated web pages
Qiu et al. | | Detection and optimized disposal of near-duplicate pages
Lin | | Intelligent Internet Information Systems in Knowledge Acquisition: Techniques and Applications
Bhimireddy et al. | | A survey to fix the threshold and implementation for detecting duplicate web documents
Lee et al. | | Mining navigation behaviors for term suggestion of search engines
Wen | | Enhancing Web Search through Web Structure Mining
Eskicioğlu | | A Search Engine for Turkish with Stemming
Rigaux et al. | | Web search
LIN | | ACQUISITION: TECHNIQUES AND APPLICATIONS
Zhang | | Virtual Question Answering System for CINDI.

Legal Events

Date | Code | Title | Description
AS | Assignment
Owner name: GOOGLE INC., CALIFORNIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: PUGH, WILLIAM; HENZINGER, MONIKA H.; SIGNING DATES FROM 20010118 TO 20010120; REEL/FRAME: 028376/0317
STCB | Information on status: application discontinuation
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

