| diff | |
|---|---|
| Original authors | Douglas McIlroy (AT&T Bell Laboratories) |
| Developers | Various open-source and commercial developers |
| Initial release | June 1974; 51 years ago (1974-06) |
| Written in | C |
| Operating system | Unix, Unix-like, V, Plan 9, Inferno |
| Platform | Cross-platform |
| Type | Command |
| License | Plan 9: MIT License |
diff is a shell command that compares the content of files and reports differences. The term diff is also used to identify the output of the command and is used as a verb for running the command. To diff files, one runs diff to create a diff.[1]
Typically, the command is used to compare text files, but it does support comparing binary files. If one of the input files contains non-textual data, then the command defaults to brief mode, in which it reports only a summary indication of whether the files differ. With the --text option, it always reports line-based differences, but the output may be difficult to understand since binary data is generally not structured in lines as text is.[2]
Although the command is primarily used ad hoc to analyze changes between two files, a special use is creating a patch file for use with the patch command, which was specifically designed to use a diff output report as a patch file. POSIX standardized the diff and patch commands, including their shared file format.[3]
The original diff utility was developed in the early 1970s for the Unix operating system at Bell Labs in Murray Hill, New Jersey. It was part of the 5th Edition of Unix released in 1974,[4] and was written by Douglas McIlroy and James Hunt. This research was published in a 1976 paper co-written with James W. Hunt, who developed an initial prototype of diff.[5] The algorithm this paper described became known as the Hunt–Szymanski algorithm.
McIlroy's work was preceded and influenced by Steve Johnson's comparison program on GECOS and Mike Lesk's proof program. Proof also originated on Unix and, like diff, produced line-by-line changes and even used angle-brackets (">" and "<") for presenting line insertions and deletions in the program's output. The heuristics used in these early applications were, however, deemed unreliable. The potential usefulness of a diff tool provoked McIlroy into researching and designing a more robust tool that could be used in a variety of tasks, but perform well within the processing and size limitations of the PDP-11's hardware. His approach to the problem resulted from collaboration with individuals at Bell Labs including Alfred Aho, Elliot Pinson, Jeffrey Ullman, and Harold S. Stone.
In the context of Unix, the use of the ed line editor provided diff with the natural ability to create machine-usable "edit scripts". These edit scripts, when saved to a file, can, along with the original file, be reconstituted by ed into the modified file in its entirety. This greatly reduced the secondary storage necessary to maintain multiple versions of a file. McIlroy considered writing a post-processor for diff where a variety of output formats could be designed and implemented, but he found it more frugal and simpler to have diff be responsible for generating the syntax and reverse-order input accepted by the ed command.
In 1984, Larry Wall created the patch utility (releasing its source code on the mod.sources and net.sources newsgroups[6][7][8]) for patching text files. It uses the output from diff, plus the diff input file with the content before changes, to create a file with the content after changes.
X/Open Portability Guide issue 2 of 1987 includes diff. Context mode was added in POSIX.1-2001 (issue 6). Unified mode was added in POSIX.1-2008 (issue 7).[9]
In diff's early years, common uses included comparing changes in the source of software code and markup for technical documents, verifying program debugging output, comparing filesystem listings and analyzing computer assembly code. The output targeted for ed was motivated by the need to provide compression for a sequence of modifications made to a file.[citation needed] The Source Code Control System (SCCS) and its ability to archive revisions emerged in the late 1970s as a consequence of storing edit scripts from diff.
Unlike edit distance notions used for other purposes, diff is line-oriented rather than character-oriented, but it is like Levenshtein distance in that it tries to determine the smallest set of deletions and insertions to create one file from the other.
The operation of diff is based on solving the longest common subsequence problem.[5] In this problem, we are given two sequences of items:

    a b c d f g h j q z
    a b c d e f g i j k r x y z

and we want to find a longest sequence of items that is present in both original sequences in the same order. That is, we want to find a new sequence which can be obtained from the first original sequence by deleting some items, and from the second original sequence by deleting other items. We also want this sequence to be as long as possible. In this case it is
a b c d f g j z
From a longest common subsequence it is only a small step to get diff-like output: if an item is absent in the subsequence but present in the first original sequence, it must have been deleted (as indicated by the '-' marks, below). If it is absent in the subsequence but present in the second original sequence, it must have been inserted (as indicated by the '+' marks).

    e   h i   q   k r x y
    +   - +   -   + + + +

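The step from a longest common subsequence to the '+' and '-' marks can be sketched in a few lines of Python. This is a minimal illustration that uses a plain dynamic-programming table, which is an assumption made for clarity and not the algorithm the original diff used; the function names `lcs_table` and `diff_marks` are hypothetical.

```python
def lcs_table(a, b):
    """Return a table where table[i][j] is the LCS length of a[i:] and b[j:]."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def diff_marks(a, b):
    """Return (mark, item) pairs: ' ' for common, '-' for deleted, '+' for inserted."""
    table = lcs_table(a, b)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append((" ", a[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            # Dropping a[i] loses nothing, so report it as a deletion.
            out.append(("-", a[i]))
            i += 1
        else:
            out.append(("+", b[j]))
            j += 1
    # Whatever remains on either side is pure deletion or pure insertion.
    out.extend(("-", item) for item in a[i:])
    out.extend(("+", item) for item in b[j:])
    return out


first = "a b c d f g h j q z".split()
second = "a b c d e f g i j k r x y z".split()
marks = diff_marks(first, second)
common = [item for mark, item in marks if mark == " "]
changes = [(mark, item) for mark, item in marks if mark != " "]
print(" ".join(common))
print(" ".join(mark + item for mark, item in changes))
```

The table costs time and memory proportional to the product of the two lengths, which is acceptable for a short illustration but is the kind of cost that practical diff implementations are designed to avoid.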
The diff command accepts two arguments, as in diff original new. Commonly, each argument identifies a normal file, but if the two arguments identify directories, then the command compares corresponding files in the directories. With the -r option, it recursively descends matching subdirectories to compare files with corresponding relative paths.
The example below shows the original and new file content as well as the resulting diff output in the default format. By default, diff outputs plain text, but GNU diff uses color highlighting when the --color option is used.[citation needed]
original:

    This part of the
    document has stayed the
    same from version to
    version. It shouldn't
    be shown if it doesn't
    change. Otherwise, that
    would not be helping to
    compress the size of the
    changes.

    This paragraph contains
    text that is outdated.
    It will be deleted in the
    near future.

    It is important to spell
    check this dokument. On
    the other hand, a
    misspelled word isn't
    the end of the world.
    Nothing in the rest of
    this paragraph needs to
    be changed. Things can
    be added after it.

new:

    This is an important
    notice! It should
    therefore be located at
    the beginning of this
    document!

    This part of the
    document has stayed the
    same from version to
    version. It shouldn't
    be shown if it doesn't
    change. Otherwise, that
    would not be helping to
    compress the size of the
    changes.

    It is important to spell
    check this document. On
    the other hand, a
    misspelled word isn't
    the end of the world.
    Nothing in the rest of
    this paragraph needs to
    be changed. Things can
    be added after it.

    This paragraph contains
    important new additions
    to this document.

output:

    0a1,6
    > This is an important
    > notice! It should
    > therefore be located at
    > the beginning of this
    > document!
    >
    11,15d16
    < This paragraph contains
    < text that is outdated.
    < It will be deleted in the
    < near future.
    <
    17c18
    < check this dokument. On
    ---
    > check this document. On
    24a26,29
    >
    > This paragraph contains
    > important new additions
    > to this document.

In this default format, a stands for added, d for deleted and c for changed. The line number of the original file appears before the single-letter code and the line number of the new file appears after. The less-than and greater-than signs (at the beginning of lines that are added, deleted or changed) indicate which file the lines appear in. Addition lines are added to the original file to appear in the new file. Deletion lines are deleted from the original file to be missing in the new file.
By default, lines common to both files are not shown. Lines that have moved are shown as added at their new location and as deleted from their old location.[10] However, some diff tools highlight moved lines.
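How the a, d and c commands and their line numbers fit together can be sketched with Python's standard difflib module. This is an illustrative approximation, assuming difflib's matching in place of diff's own algorithm, so the alignment it picks may differ from a real diff on some inputs; `line_range` and `normal_diff` are hypothetical helper names.

```python
import difflib


def line_range(start, stop):
    """Format the 0-based half-open range [start, stop) as 1-based diff line numbers."""
    if stop == start:
        return str(start)        # empty range: the line just before the gap
    if stop == start + 1:
        return str(stop)         # a single line
    return f"{start + 1},{stop}"


def normal_diff(old, new):
    """Yield default-format diff lines for two lists of lines (without newlines)."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue             # common lines are not shown
        code = {"insert": "a", "delete": "d", "replace": "c"}[tag]
        yield f"{line_range(i1, i2)}{code}{line_range(j1, j2)}"
        for line in old[i1:i2]:
            yield f"< {line}"
        if tag == "replace":
            yield "---"
        for line in new[j1:j2]:
            yield f"> {line}"


old = ["one", "two", "three", "four"]
new = ["zero", "one", "too", "three"]
print("\n".join(normal_diff(old, new)))
```

For these two short inputs the sketch reports an addition before line 1 (0a1), a change of line 2 into line 3 (2c3) and a deletion of line 4 (4d4).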
An ed script can be generated by modern versions of diff with the -e option. The resulting edit script for this example is as follows:

    24a

    This paragraph contains
    important new additions
    to this document.
    .
    17c
    check this document. On
    .
    11,15d
    0a
    This is an important
    notice! It should
    therefore be located at
    the beginning of this
    document!

    .

In order to transform the content of the original file into the content of the new file using ed, one appends two lines to this diff file, one containing a w (write) command and one containing a q (quit) command (e.g. by printf "w\nq\n" >> mydiff). Here the diff file is named mydiff, and the transformation happens when we run ed -s original < mydiff.
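The reason the script lists its commands from the bottom of the file upwards is that each command then refers to line numbers that earlier commands have not yet shifted. The following Python sketch mimics that behavior for the a, c and d commands only; it is a toy interpreter written for illustration, not a replacement for ed, and `apply_ed_script` is a hypothetical name.

```python
import re

# A command is a line number or range followed by a (append), c (change) or d (delete).
COMMAND = re.compile(r"^(\d+)(?:,(\d+))?([acd])$")


def apply_ed_script(lines, script):
    """Apply a diff -e style script (a, c and d commands only) to a list of lines."""
    result = list(lines)
    commands = iter(script)
    for command in commands:
        match = COMMAND.match(command)
        if match is None:
            raise ValueError(f"unsupported command: {command!r}")
        first = int(match.group(1))
        last = int(match.group(2) or first)
        action = match.group(3)
        text = []
        if action in "ac":
            # Input text runs up to a line holding a single period.
            for line in commands:
                if line == ".":
                    break
                text.append(line)
        if action == "a":
            result[last:last] = text           # insert after line `last`
        elif action == "c":
            result[first - 1:last] = text      # replace lines first..last
        else:
            result[first - 1:last] = []        # delete lines first..last
    return result


old = ["one", "two", "three", "four"]
script = ["4d", "2c", "too", ".", "0a", "zero", "."]
print(apply_ed_script(old, script))
```

Because the deletion of line 4 runs first and the insertion at line 0 runs last, no command has to account for lines added or removed by another.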
The Berkeley distribution of Unix made a point of adding the context format (-c) and the ability to recurse on filesystem directory structures (-r), adding those features in 2.8 BSD, released in July 1981. The context format of diff introduced at Berkeley helped with distributing patches for source code that may have been changed minimally.
In the context format, any changed lines are shown alongside unchanged lines before and after. The inclusion of any number of unchanged lines provides a context to the patch. The context consists of lines that have not changed between the two files and serves as a reference to locate the lines' place in a modified file and find the intended location for a change to be applied regardless of whether the line numbers still correspond. The context format introduces greater readability for humans and reliability when applying the patch, and an output which is accepted as input to the patch program. This intelligent behavior is not possible with the traditional diff output.
The number of unchanged lines shown above and below a change hunk can be defined by the user, even zero, but three lines is typically the default. If the context of unchanged lines in a hunk overlaps with an adjacent hunk, then diff will avoid duplicating the unchanged lines and merge the hunks into a single hunk.
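The merging of nearby changes into one hunk can be observed with Python's difflib, whose `get_grouped_opcodes` method groups changes with a given number of context lines. This is a sketch for illustration and assumes difflib's grouping behaves like the rule described above; the 30-line input is made up.

```python
import difflib

old = [f"line {n}" for n in range(1, 31)]
new = list(old)
new[4] = "line 5 changed"      # change at line 5
new[9] = "line 10 changed"     # change at line 10, close enough to share context
new[24] = "line 25 changed"    # change at line 25, far from the others

matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
hunks = list(matcher.get_grouped_opcodes(n=3))

# Report the 1-based span of original lines that each hunk covers.
spans = [(hunk[0][1] + 1, hunk[-1][2]) for hunk in hunks]
print(spans)
```

With three lines of context, the changes at lines 5 and 10 would have overlapping context, so they end up in one hunk covering lines 2 to 13, while the change at line 25 gets a hunk of its own covering lines 22 to 28.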
A "!" represents a change between lines that correspond in the two files, whereas a "+" represents the addition of a line, and a "-" the removal of a line. A blank space represents an unchanged line. At the beginning of the patch is the file information, including the full path and a time stamp delimited by a tab character. At the beginning of each hunk are the line numbers that apply for the corresponding change in the files. A number range appearing between sets of three asterisks applies to the original file, while sets of three dashes apply to the new file. The hunk ranges specify the starting and ending line numbers in the respective file.
The command diff -c original new produces the following output:

    *** /path/to/original	timestamp
    --- /path/to/new	timestamp
    ***************
    *** 1,3 ****
    --- 1,9 ----
    + This is an important
    + notice! It should
    + therefore be located at
    + the beginning of this
    + document!
    +
      This part of the
      document has stayed the
      same from version to
    ***************
    *** 8,20 ****
      compress the size of the
      changes.
      
    - This paragraph contains
    - text that is outdated.
    - It will be deleted in the
    - near future.
    -
      It is important to spell
    ! check this dokument. On
      the other hand, a
      misspelled word isn't
      the end of the world.
    --- 14,21 ----
      compress the size of the
      changes.
      
      It is important to spell
    ! check this document. On
      the other hand, a
      misspelled word isn't
      the end of the world.
    ***************
    *** 22,24 ****
    --- 23,29 ----
      this paragraph needs to
      be changed. Things can
      be added after it.
    +
    + This paragraph contains
    + important new additions
    + to this document.

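A context diff of the same shape can be produced programmatically with `difflib.context_diff` from the Python standard library. The sketch below uses two short, made-up inputs and omits timestamps; it shows the format's markers and is not a claim about how diff itself is implemented.

```python
import difflib

old = ["one", "two", "three", "four"]
new = ["zero", "one", "too", "three"]

# lineterm="" because the input lines carry no trailing newlines.
context = list(
    difflib.context_diff(old, new, fromfile="original", tofile="new", n=3, lineterm="")
)
print("\n".join(context))
```

The result has one hunk: the original side lists "! two" and "- four" among its unchanged lines, and the new side lists "+ zero" and "! too", matching the roles of "!", "+" and "-" described above.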
The unified format (or unidiff)[11][12] inherits the technical improvements made by the context format, but produces a smaller diff with old and new text presented immediately adjacent. Unified format is usually invoked using the "-u" command-line option. This output is often used as input to the patch program. Many projects specifically request that "diffs" be submitted in the unified format, making unified diff format the most common format for exchange between software developers.
Unified context diffs were originally developed by Wayne Davison in August 1990 (in unidiff, which appeared in Volume 14 of comp.sources.misc). Richard Stallman added unified diff support to the GNU Project's diff one month later, and the feature debuted in GNU diff 1.15, released in January 1991. GNU diff has since generalized the context format to allow arbitrary formatting of diffs.
The format starts with the same two-line header as the context format, except that the original file is preceded by "---" and the new file is preceded by "+++". Following this are one or more change hunks that contain the line differences in the file. The unchanged, contextual lines are preceded by a space character, addition lines are preceded by a plus sign, and deletion lines are preceded by a minus sign.
A hunk begins with range information and is immediately followed with the line additions, line deletions, and any number of the contextual lines. The range information is surrounded by double at signs, and combines onto a single line what appears on two lines in the context format (above). The format of the range information line is as follows:

    @@ -l,s +l,s @@ optional section heading

The hunk range information contains two hunk ranges. The range for the hunk of the original file is preceded by a minus symbol, and the range for the new file is preceded by a plus symbol. Each hunk range is of the format l,s where l is the starting line number and s is the number of lines the change hunk applies to for each respective file. In many versions of GNU diff, each range can omit the comma and trailing value s, in which case s defaults to 1. Note that the only really interesting value is the l line number of the first range; all the other values can be computed from the diff.
The hunk size for the original file should be the sum of all contextual and deletion (including changed) hunk lines. The hunk size for the new file should be the sum of all contextual and addition (including changed) hunk lines. If the hunk size information does not correspond with the number of lines in the hunk, then the diff could be considered invalid and be rejected.
Optionally, the hunk range can be followed by the heading of the section or function that the hunk is part of. This is mainly useful to make the diff easier to read. When creating a diff with GNU diff, the heading is identified by regular expression matching.[13]
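The range line and the size check described above can be sketched as a small Python parser. The regular expression and the names `parse_hunk_header` and `hunk_is_consistent` are assumptions made for this illustration, and the treatment of a completely empty line as context is a leniency chosen here, not something the format requires.

```python
import re

# @@ -l,s +l,s @@ optional section heading   (",s" may be omitted)
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@ ?(.*)$")


def parse_hunk_header(line):
    """Return (old_start, old_size, new_start, new_size, heading) for a range line."""
    match = HUNK_HEADER.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_size, new_start, new_size, heading = match.groups()
    return (
        int(old_start),
        int(old_size) if old_size is not None else 1,   # omitted size defaults to 1
        int(new_start),
        int(new_size) if new_size is not None else 1,
        heading,
    )


def hunk_is_consistent(header, body):
    """Check that the sizes in the header match the line counts in the hunk body."""
    _, old_size, _, new_size, _ = parse_hunk_header(header)
    # Context lines count for both files; an empty line is treated as context here.
    old_count = sum(1 for line in body if line[:1] in (" ", "-", ""))
    new_count = sum(1 for line in body if line[:1] in (" ", "+", ""))
    return (old_count, new_count) == (old_size, new_size)


body = [
    " this paragraph needs to",
    " be changed. Things can",
    " be added after it.",
    "+",
    "+This paragraph contains",
    "+important new additions",
    "+to this document.",
]
print(parse_hunk_header("@@ -22,3 +23,7 @@"))
print(hunk_is_consistent("@@ -22,3 +23,7 @@", body))
```

The body used here is the last hunk of the running example: three context lines give an original size of 3, and three context lines plus four additions give a new size of 7.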
If a line is modified, it is represented as a deletion and an addition. Since the hunks of the original and new file appear in the same hunk, such changes appear adjacent to one another.[14] An occurrence of this in the example below is:

    -check this dokument. On
    +check this document. On

The command diff -u original new produces the following output:

    --- /path/to/original	timestamp
    +++ /path/to/new	timestamp
    @@ -1,3 +1,9 @@
    +This is an important
    +notice! It should
    +therefore be located at
    +the beginning of this
    +document!
    +
     This part of the
     document has stayed the
     same from version to
    @@ -8,13 +14,8 @@
     compress the size of the
     changes.
     
    -This paragraph contains
    -text that is outdated.
    -It will be deleted in the
    -near future.
    -
     It is important to spell
    -check this dokument. On
    +check this document. On
     the other hand, a
     misspelled word isn't
     the end of the world.
    @@ -22,3 +23,7 @@
     this paragraph needs to
     be changed. Things can
     be added after it.
    +
    +This paragraph contains
    +important new additions
    +to this document.

To successfully separate the file names from the timestamps, the delimiter between them is a tab character. This is invisible on screen and can be lost when diffs are copy/pasted from console/terminal screens.
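The unified layout, including the tab between file name and timestamp, can be reproduced with `difflib.unified_diff` from the Python standard library. The inputs and the two dates below are made up for illustration; the sketch shows the format and is not diff's own implementation.

```python
import difflib

old = ["one", "two", "three", "four"]
new = ["zero", "one", "too", "three"]

unified = list(
    difflib.unified_diff(
        old,
        new,
        fromfile="original",
        tofile="new",
        fromfiledate="2024-01-01",   # hypothetical timestamps
        tofiledate="2024-01-02",
        n=3,
        lineterm="",                 # input lines carry no trailing newlines
    )
)
print("\n".join(unified))

# The header separates file name and timestamp with a tab character.
name, stamp = unified[0][len("--- "):].split("\t", 1)
print(repr(name), repr(stamp))
```

Splitting on the tab recovers the name and the timestamp separately, which is exactly what fails if the tab has been turned into spaces by copying the diff from a terminal.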
There are some modifications and extensions to the diff formats that are used and understood by certain programs and in certain contexts. For example, some revision control systems, such as Subversion, specify a version number, "working copy", or any other comment instead of or in addition to a timestamp in the diff's header section.
Some tools allow diffs for several different files to be merged into one, using a header for each modified file that may look something like this:
Index: path/to/file.cpp
The special case of files that do not end in a newline is not handled. Neither unidiff nor the POSIX diff standard defines a way to handle this type of file. (Indeed, such files are not "text" files by strict POSIX definitions.[15]) GNU diff and git produce "\ No newline at end of file" (or a translated version) as a diagnostic, but this behavior is not portable.[16] GNU patch does not seem to handle this case, while git-apply does.[17]
The patch program does not necessarily recognize implementation-specific diff output. GNU patch is, however, known to recognize git patches and act a little differently.[18]
Changes since 1975 include improvements to the core algorithm, the addition of useful features to the command, and the design of new output formats. The basic algorithm is described in the papers An O(ND) Difference Algorithm and its Variations by Eugene W. Myers[19] and A File Comparison Program by Webb Miller and Myers.[20] The algorithm was independently discovered and described in Algorithms for Approximate String Matching by Esko Ukkonen.[21] The first editions of the diff program were designed for line comparisons of text files expecting the newline character to delimit lines. By the 1980s, support for binary files resulted in a shift in the application's design and implementation.
GNU diff and diff3 are included in the diffutils package with other diff and patch related utilities.[22]
Postprocessors sdiff and diffmk render side-by-side diff listings and apply change marks to printed documents, respectively. Both were developed elsewhere in Bell Labs in or before 1981.[citation needed]
Diff3 compares one file against two other files by reconciling two diffs. It was originally conceived by Paul Jensen to reconcile changes made by two people editing a common source. It is also used by revision control systems, e.g. RCS, for merging.[23]
Emacs has Ediff for showing the changes a patch would provide in a user interface that combines interactive editing and merging capabilities for patch files.
Vim provides vimdiff to compare from two to eight files, with differences highlighted in color.[24] While it historically invoked the diff program, modern Vim uses git's fork of the xdiff library (LibXDiff) code, providing improved speed and functionality.[25]
GNU Wdiff[26] is a front end to diff that shows the words or phrases that changed in a text document of written language, even in the presence of word-wrapping or different column widths.
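The idea of comparing words instead of lines, so that re-wrapped text does not register as a change, can be sketched in Python with difflib. This is an illustration of the concept only and says nothing about how Wdiff is implemented; `word_diff` is a hypothetical function, and the [-...-] and {+...+} markers follow the style of Wdiff's default output.

```python
import difflib


def word_diff(old_text, new_text):
    """Return a one-line word-level diff of two texts, ignoring how lines are wrapped."""
    old_words = old_text.split()   # splitting on any whitespace discards line breaks
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.extend(old_words[i1:i2])
            continue
        if i1 < i2:
            pieces.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if j1 < j2:
            pieces.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(pieces)


old_text = "check this dokument. On\nthe other hand"
new_text = "check this\ndocument. On the other hand"
print(word_diff(old_text, new_text))
```

Although the two inputs break their lines in different places, only the misspelled word is reported as changed.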
colordiff is a Perl wrapper for 'diff' and produces the same output but with colorization for added and deleted bits.[27] diff-so-fancy and diff-highlight are newer analogues.[28] "delta" is a Rust rewrite that highlights changes and the underlying code at the same time.[29]
Patchutils contains tools that combine, rearrange, compare and fix context diffs and unified diffs.[30]
Utilities that compare source files by their syntactic structure have been built mostly as research tools for some programming languages;[31][32][33] some are available as commercial tools.[34][35] In addition, free tools that perform syntax-aware diff include:
spiff is a variant of diff that ignores differences in floating point calculations with roundoff errors and whitespace, both of which are generally irrelevant to source code comparison. Bellcore wrote the original version.[41][42] An HPUX port is the most current public release. spiff does not support binary files. spiff outputs to the standard output in standard diff format and accepts inputs in the C, Bourne shell, Fortran, Modula-2 and Lisp programming languages.[43][44][41][45][42]
LibXDiff is an LGPL library that provides an interface to many algorithms from 1998. An improved Myers algorithm with Rabin fingerprint was originally implemented (as of the final release of 2008),[46] but git and libgit2's fork has since expanded the repository with many of its own. One algorithm called "histogram" is generally regarded as much better than the original Myers algorithm, both in speed and quality.[47][48] This is the modern version of LibXDiff used by Vim.[25]
diff – Shell and Utilities Reference, The Single UNIX Specification, Version 5 from The Open Group
In git-style diffs, the "before" state of each patch refers to the initial state before modifying any files,..
The easiest way to start editing in diff mode is with the "vimdiff" command. This starts Vim as usual, and additionally sets up for viewing the differences between the arguments. vimdiff file1 file2 [file3] [file4] [...file8] This is equivalent to: vim -d file1 file2 [file3] [file4] [...file8]
This does indeed show that histogram diff slightly beats Myers, while patience is much slower than the others.
diff: compare two files – Shell and Utilities Reference, The Single UNIX Specification, Version 5 from The Open Group
diff(1) – Plan 9 Programmer's Manual, Volume 1
diff(1) – Inferno General commands Manual