Fingerprint (computing)

From Wikipedia, the free encyclopedia

Digital identifier derived from the data by an algorithm

In computer science, a fingerprinting algorithm is a procedure that maps an arbitrarily large data item (such as a computer file) to a much shorter bit string, its fingerprint, that uniquely identifies the original data for all practical purposes, just as human fingerprints uniquely identify people for practical purposes. This fingerprint may be used for data deduplication. This is also referred to as file fingerprinting, data fingerprinting, or structured data fingerprinting.

Fingerprints are typically used to avoid the comparison and transmission of bulky data. For instance, a web browser or proxy server can efficiently check whether a remote file has been modified by fetching only its fingerprint and comparing it with that of the previously fetched copy.
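The check described above can be sketched as follows. This is a minimal illustration, not a description of any particular browser or proxy: the 64-bit BLAKE2b digest stands in for an unspecified fingerprint function, and the names `fingerprint` and `has_changed` are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a 64-bit fingerprint as 16 hex characters.

    BLAKE2b truncated to 8 bytes is an assumption made for this sketch;
    any fingerprint function with a short output would do.
    """
    return hashlib.blake2b(data, digest_size=8).hexdigest()


def has_changed(cached_fp: str, remote_fp: str) -> bool:
    """Compare two fingerprints instead of the two full files."""
    return cached_fp != remote_fp


# The client keeps only the fingerprint of the copy it fetched earlier.
cached_fp = fingerprint(b"<html>version 1</html>")

# The server would send just the fingerprint of its current copy.
same_remote_fp = fingerprint(b"<html>version 1</html>")
new_remote_fp = fingerprint(b"<html>version 2</html>")

print(has_changed(cached_fp, same_remote_fp))
print(has_changed(cached_fp, new_remote_fp))
```

Only 8 bytes per file cross the network for the comparison, regardless of the size of the file itself.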

Fingerprint functions may be seen as high-performance hash functions used to uniquely identify substantial blocks of data where cryptographic hash functions may be unnecessary.

Special algorithms exist for audio and video fingerprinting.

Properties

Virtual uniqueness

Compounding

Algorithms

Rabin's algorithm

Rabin's fingerprinting algorithm is the prototype of the class.^[1] It is fast and easy to implement, allows compounding, and comes with a mathematically precise analysis of the probability of collision. Namely, the probability of two strings r and s yielding the same w-bit fingerprint does not exceed max(|r|,|s|)/2^(w-1), where |r| denotes the length of r in bits. The algorithm requires the prior choice of a w-bit internal "key", and this guarantee holds as long as the strings r and s are chosen without knowledge of the key.

Rabin's method is not secure against malicious attacks. An adversarial agent can easily discover the key and use it to modify files without changing their fingerprint.
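A rough sketch of the idea behind Rabin's scheme follows: the input is read as a polynomial over GF(2), and the fingerprint is its remainder modulo a key polynomial of degree w. Several things here are assumptions made for illustration. The key `POLY` is a fixed, public constant (x^64 + x^4 + x^3 + x + 1, commonly tabulated as irreducible over GF(2)), whereas the scheme calls for a randomly chosen secret key. The leading 1 bit, added so that inputs differing only in leading zero bytes get different fingerprints, is one common convention rather than something the text above specifies. The function names are hypothetical.

```python
# Key polynomial of degree 64, stored as an integer bit mask.
# ASSUMPTION: a fixed public constant; the collision bound only holds
# when the key is random and unknown to whoever picks the inputs.
POLY = (1 << 64) | 0b11011


def poly_mod(a: int, p: int) -> int:
    """Remainder of polynomial a divided by p over GF(2) (slow reference)."""
    deg_p = p.bit_length()
    while a.bit_length() >= deg_p:
        # Align the leading term of p with that of a and cancel it.
        a ^= p << (a.bit_length() - deg_p)
    return a


def rabin_fingerprint(data: bytes, poly: int = POLY) -> int:
    """Streaming remainder of (1-bit followed by data) modulo poly."""
    w = poly.bit_length() - 1      # degree of the key polynomial
    fp = 1                         # ASSUMPTION: implicit leading 1 bit
    for byte in data:
        for shift in range(7, -1, -1):
            fp = (fp << 1) | ((byte >> shift) & 1)
            if fp >> w:            # degree reached w, so reduce
                fp ^= poly
    return fp


message = b"hello world"
fp_stream = rabin_fingerprint(message)
# Cross-check against plain polynomial division of the whole message.
as_poly = (1 << (8 * len(message))) | int.from_bytes(message, "big")
fp_reference = poly_mod(as_poly, POLY)
print(fp_stream == fp_reference)
```

For scale, with w = 64 the bound max(|r|,|s|)/2^(w-1) works out to at most 2^23 / 2^63 = 2^-40 for two strings of 1 MiB (2^23 bits) each, provided the key is chosen at random and independently of the strings.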

Cryptographic hash functions

Main article: cryptographic hash function

Mainstream cryptographic-grade hash functions generally can serve as high-quality fingerprint functions, are subject to intense scrutiny from cryptanalysts, and have the advantage that they are believed to be safe against malicious attacks.

A drawback of cryptographic hash algorithms such as MD5 and SHA is that they take considerably longer to execute than Rabin's fingerprint algorithm. They also lack proven guarantees on the collision probability. Some of these algorithms, notably MD5, are no longer recommended for secure fingerprinting. They are still useful for error checking, where purposeful data tampering is not a primary concern.
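As an illustration of using a cryptographic hash as a fingerprint function, the sketch below hashes a stream in fixed-size chunks with Python's standard `hashlib` module, so that a large file never has to be held in memory at once. The helper name `stream_fingerprint`, the choice of SHA-256 as the default, and the 64 KiB chunk size are assumptions made for this example.

```python
import hashlib
import io


def stream_fingerprint(stream, algorithm: str = "sha256",
                       chunk_size: int = 65536) -> str:
    """Hash a binary stream chunk by chunk and return the hex digest."""
    hasher = hashlib.new(algorithm)
    # iter() with a sentinel keeps reading until read() returns b"".
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()


# In-memory streams stand in for files opened with open(path, "rb").
original = io.BytesIO(b"A" * 200_000)
tampered = io.BytesIO(b"A" * 199_999 + b"B")  # last byte differs

fp_original = stream_fingerprint(original)
fp_tampered = stream_fingerprint(tampered)
print(fp_original == fp_tampered)
```

Passing a different `algorithm` name accepted by `hashlib.new` swaps the hash function without changing the calling code.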

Perceptual hashing

This section is an excerpt from Perceptual hashing.

Perceptual hashing is the use of a fingerprinting algorithm that produces a snippet, hash, or fingerprint of various forms of multimedia.^[2]^[3] A perceptual hash is a type of locality-sensitive hash: its outputs are analogous if the features of the multimedia are similar. This is in contrast to cryptographic hashing, which relies on the avalanche effect of a small change in input value creating a drastic change in output value. Perceptual hash functions are widely used in finding cases of online copyright infringement as well as in digital forensics, because hashes of similar data are correlated and so similar data can be found (for instance, copies with a differing watermark).
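To make the contrast with cryptographic hashing concrete, here is a minimal sketch of one very simple perceptual hash, the "average hash": each pixel contributes one bit saying whether it is brighter than the image mean, and two hashes are compared by Hamming distance. This is only one of many perceptual hashing techniques and is not taken from the text above. The tiny 4x4 grayscale grid is an assumption; practical implementations first downscale a real image to a small fixed size. The function names are hypothetical.

```python
def average_hash(pixels):
    """Return an integer with one bit per pixel: 1 if brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


# A 4x4 grayscale "image": dark left half, bright right half.
image = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 202, 212],
    [18, 28, 208, 218],
]
# A uniformly brightened copy, and a negative of the image.
brighter = [[min(255, v + 10) for v in row] for row in image]
inverted = [[255 - v for v in row] for row in image]

print(hamming_distance(average_hash(image), average_hash(brighter)))  # 0
print(hamming_distance(average_hash(image), average_hash(inverted)))  # 16
```

The brightened copy keeps the same hash because every pixel stays on the same side of the mean, while the negative flips all 16 bits. A cryptographic hash of the same pixel data would differ unpredictably in both cases.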

Application examples

NIST distributes a software reference library, the American National Software Reference Library, that uses cryptographic hash functions to fingerprint files and map them to software products. The HashKeeper database, maintained by the National Drug Intelligence Center, is a repository of fingerprints of "known to be good" and "known to be bad" computer files, for use in law enforcement applications (e.g. analyzing the contents of seized disk drives).

Content similarity detection

This section is an excerpt from Content similarity detection § Fingerprinting.

Fingerprinting is currently the most widely applied approach to content similarity detection. This method forms representative digests of documents by selecting a set of multiple substrings (n-grams) from them. The sets represent the fingerprints and their elements are called minutiae.^[4]^[5] A suspicious document is checked for plagiarism by computing its fingerprint and querying its minutiae against a precomputed index of fingerprints for all documents of a reference collection. Minutiae matching those of other documents indicate shared text segments and suggest potential plagiarism if they exceed a chosen similarity threshold.^[6] Computational resources and time are limiting factors to fingerprinting, which is why this method typically compares only a subset of minutiae to speed up the computation and allow for checks in very large collections, such as the Internet.^[4]
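The pipeline described above (n-grams, a sampled subset of minutiae, a precomputed index, and a similarity threshold) can be sketched roughly as follows. All specifics are assumptions chosen for illustration: word n-grams of length 3, CRC-32 as the per-n-gram hash, "keep hashes divisible by `keep_every`" as the sampling rule, and the fraction of matched minutiae as the similarity score. The cited works use their own, more refined selection and scoring strategies.

```python
import zlib
from collections import Counter, defaultdict


def minutiae(text: str, n: int = 3, keep_every: int = 4) -> set:
    """Hash every word n-gram and keep a sampled subset of the hashes."""
    words = text.lower().split()
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    hashes = {zlib.crc32(gram.encode("utf-8")) for gram in grams}
    # ASSUMPTION: simple modulo sampling; keep_every=1 keeps everything.
    return {h for h in hashes if h % keep_every == 0}


def build_index(collection: dict, **options) -> dict:
    """Map each minutia to the ids of the reference documents containing it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for m in minutiae(text, **options):
            index[m].add(doc_id)
    return index


def query(index: dict, text: str, threshold: float = 0.5, **options) -> dict:
    """Return {doc_id: share of the query's minutiae found in that document}."""
    found = minutiae(text, **options)
    if not found:
        return {}
    hits = Counter()
    for m in found:
        for doc_id in index.get(m, ()):
            hits[doc_id] += 1
    scores = {doc_id: count / len(found) for doc_id, count in hits.items()}
    return {d: s for d, s in scores.items() if s >= threshold}


references = {
    "essay": "the quick brown fox jumps over the lazy dog near the river bank",
    "manual": "press the red button twice to reset the device before first use",
}
suspicious = "yesterday the quick brown fox jumps over the lazy dog again"

index = build_index(references, keep_every=1)
print(query(index, suspicious, threshold=0.5, keep_every=1))
```

With `keep_every=1` every n-gram is compared. Larger values shrink the index and speed up queries at the cost of missing some shared passages, which is the trade-off the paragraph above describes.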

References

[1] Rabin, M. O. (1981). "Fingerprinting by random polynomials". Center for Research in Computing Technology, Harvard University, Report TR-15-81.
[2] Buldas, Ahto; Kroonmaa, Andres; Laanoja, Risto (2013). "Keyless Signatures' Infrastructure: How to Build Global Distributed Hash-Trees". In Riis, Nielson H.; Gollmann, D. (eds.). Secure IT Systems. NordSec 2013. Lecture Notes in Computer Science. Vol. 8208. Berlin, Heidelberg: Springer. doi:10.1007/978-3-642-41488-6_21. ISBN 978-3-642-41487-9. "Keyless Signatures Infrastructure (KSI) is a globally distributed system for providing time-stamping and server-supported digital signature services. Global per-second hash trees are created and their root hash values published. We discuss some service quality issues that arise in practical implementation of the service and present solutions for avoiding single points of failure and guaranteeing a service with reasonable and stable delay. Guardtime AS has been operating a KSI Infrastructure for 5 years. We summarize how the KSI Infrastructure is built, and the lessons learned during the operational period of the service."
[3] Klinger, Evan; Starkweather, David. "pHash.org: Home of pHash, the open source perceptual hash library". pHash.org. Retrieved 2018-07-05. "pHash is an open source software library released under the GPLv3 license that implements several perceptual hashing algorithms, and provides a C-like API to use those functions in your own programs. pHash itself is written in C++."
[4] Hoad, Timothy; Zobel, Justin (2003), "Methods for Identifying Versioned and Plagiarised Documents" (PDF), Journal of the American Society for Information Science and Technology, 54 (3): 203–215, CiteSeerX 10.1.1.18.2680, doi:10.1002/asi.10170, archived from the original (PDF) on 30 April 2015, retrieved 14 October 2014
[5] Stein, Benno (July 2005), "Fuzzy-Fingerprints for Text-Based Information Retrieval", Proceedings of the I-KNOW '05, 5th International Conference on Knowledge Management, Graz, Austria (PDF), Springer, Know-Center, pp. 572–579, archived from the original (PDF) on 2 April 2012, retrieved 7 October 2011
[6] Brin, Sergey; Davis, James; Garcia-Molina, Hector (1995), "Copy Detection Mechanisms for Digital Documents", Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (PDF), ACM, pp. 398–409, CiteSeerX 10.1.1.49.1567, doi:10.1145/223784.223855, ISBN 978-1-59593-060-6, S2CID 8652205, archived from the original (PDF) on 18 August 2016, retrieved 7 October 2011