
Diff, Patch, and Friends

``Kernel patches'' may sound like magic, but the two tools used to create and apply patches are simple and easy to use. If they weren't, some Linux developers would be too lazy to use them.

Best of all, they can be very useful to you, even if you never touch a line of source code.

by Michael K. Johnson

Diff is designed to show you the differences between files, line by line. It is fundamentally simple to use, but takes a little practice. Don't let the length of this article scare you; you can get some use out of diff by reading only the first page or two. The rest of the article is for those who aren't satisfied with very basic uses.

While diff is often used by developers to show differences between different versions of a file of source code, it is useful for far more than source code. For example, diff comes in handy when editing a document which is passed back and forth between multiple people, perhaps via e-mail. At Linux Journal, we have experience with this. Often both the editor and an author are working on an article at the same time, and we need to make sure that each (correct) change made by each person makes its way into the final version of the article being edited. The changes can be found by looking at the differences between two files.

However, it is hard to show off how helpful diff can be in finding these kinds of differences. To demonstrate with files large enough to really show off diff's capabilities would require that we devote the entire magazine to this one article. Instead, because far fewer of our readers are likely to be fluent in Latin than in English, we will give a Latin example from Winnie Ille Pu, a translation by Alexander Lenard of A. A. Milne's Winnie-the-Pooh (ISBN 0-525-48335-7). This will make it harder for the average reader to see differences at a glance and show how useful these tools can be in finding changes in much larger documents.

Quickly now, find the differences between these two passages:

Ecce Eduardus Ursus scalis nunc tump-tump-tump
occipite gradus pulsante post Christophorum
Robinum descendens. Est quod sciat unus et solus
modus gradibus desendendi, non nunquam autem
sentit, etiam alterum modum exstare, dummodo
pulsationibus desinere et de no modo meditari
possit. Deinde censet alios modos non esse. En,
nunc ipse in imo est, vobis ostentari paratus.
Winnie ille Pu.

Ecce Eduardus Ursus scalis nunc tump-tump-tump
occipite gradus pulsante post Christophorum
Robinum descendens. Est quod sciat unus et solus
modus gradibus descendendi, nonnunquam autem
sentit, etiam alterum modum exstare, dummodo
pulsationibus desinere et de eo modo meditari
possit. Deinde censet alios modos non esse. En,
nunc ipse in imo est, vobis ostentari paratus.
Winnie ille Pu.

You may be able to find one or two changes after some careful comparison, but are you sure you have found every change? Probably not: tedious, character-by-character comparison of two files should be the computer's job, not yours.

Use the diff program to avoid eyestrain and insanity:

diff -u 1 2
--- 1   Sat Apr 20 22:11:53 1996
+++ 2   Sat Apr 20 22:12:01 1996
@@ -1,9 +1,9 @@
 Ecce Eduardus Ursus scalis nunc tump-tump-tump
 occipite gradus pulsante post Christophorum
 Robinum descendens. Est quod sciat unus et solus
-modus gradibus desendendi, non nunquam autem
+modus gradibus descendendi, nonnunquam autem
 sentit, etiam alterum modum exstare, dummodo
-pulsationibus desinere et de no modo meditari
+pulsationibus desinere et de eo modo meditari
 possit. Deinde censet alios modos non esse. En,
 nunc ipse in imo est, vobis ostentari paratus.
 Winnie ille Pu.

There are several things to notice here:

Perhaps the main thing to notice is that you didn't need this description of how to interpret diff's output in order to find the differences. It is rather easy to compare two adjacent lines and see the differences.
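
If you want to try this yourself without typing in any Latin, here is a minimal sketch. The scratch directory and the file names old.txt and new.txt are my own illustrative choices, not files from this article, and the script assumes a diff that supports -u, such as GNU diff.

```shell
# Scratch directory and file names are illustrative, not from the article.
workdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$workdir/old.txt"
printf 'alpha\nBETA\ngamma\n' > "$workdir/new.txt"

# diff exits with status 1 when the files differ; "|| true" keeps a
# script that runs under "set -e" from stopping at this point.
changes=$(cd "$workdir" && diff -u old.txt new.txt || true)

# Drop the two header lines (they carry timestamps) and print the rest.
body=$(printf '%s\n' "$changes" | tail -n +3)
printf '%s\n' "$body"

rm -rf "$workdir"
```

In the printed output, the line starting with - comes from old.txt, the line starting with + comes from new.txt, and lines starting with a space are the same in both files.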

It's not always this easy

Unfortunately, if too many adjacent lines have been changed, interpretation isn't as immediately obvious; but by knowing that each marked line has been changed in some way, you can figure it out. For instance, in this comparison, where the file 3 contains the damaged contents, and file 4 (identical to file 2 in the previous example) contains the correct contents, three lines in a row are changed, and now each line with a difference is not shown directly above the corrected line:

diff -u 3 4
--- 3   Sun Apr 21 18:57:08 1996
+++ 4   Sun Apr 21 18:56:45 1996
@@ -1,9 +1,9 @@
 Ecce Eduardus Ursus scalis nunc tump-tump-tump
 occipite gradus pulsante post Christophorum
 Robinum descendens. Est quod sciat unus et solus
-modus gradibus desendendi, non nunquam autem
-sentit, etiam alterum nodum exitare, dummodo
-pulsationibus desinere et de no modo meditari
+modus gradibus descendendi, nonnunquam autem
+sentit, etiam alterum modum exstare, dummodo
+pulsationibus desinere et de eo modo meditari
 possit. Deinde censet alios modos non esse. En,
 nunc ipse in imo est, vobis ostentari paratus.
 Winnie ille Pu.

It takes a little more work to find the added mistakes: ``nodum'' for ``modum'' and ``exitare'' for ``exstare''. Imagine, though, that 50 lines in a row each had a one-character change. This begins to resemble the old job of going through the whole file, character by character, looking for changes. All we've done is (potentially) shrink the amount of comparison you have to do.

Fortunately, there are several tools for finding these kinds of differences more easily. GNU Emacs has ``word diff'' functionality. There is also a GNU ``wdiff'' program which helps you find these kinds of differences without using Emacs.

Let's look first at GNU Emacs. For this example, files 5 and 6 are exactly the same, respectively, as files 3 and 4 before. I bring up emacs under X (which provides me with colored text), and type:

M-x ediff-files RET
5 RET
6 RET

In the new window which pops up, I press the space bar, which tells Emacs to highlight the differences. Look at Figure 1 and see how easy it is to find each changed word.

Figure 1. ediff-files 5 6

GNU wdiff is also very useful, especially if you aren't running X. A pager (such as less) is all that is required, and then only for large differences. The exact same set of files (5 and 6), compared with the command wdiff -t 5 6, is shown in Figure 2.

Figure 2. wdiff -t 5 6

If you are getting extra character sequences like ESC[24 instead of getting underline and reverse video, it's probably because you are using less, which by default doesn't pass through all escape characters. Use less -r instead, or use the more pager. Either should work.

wdiff uses the termcap database (that's what the -t option is for) to find out how to enable underline and reverse video, and not all termcap entries are correct. In some instances, I've found that the linux termcap entry works well for other terminals, since the codes for turning underline and reverse video on and off don't differ very much across terminals. To use the linux termcap entry, you can do this:

TERM=linux wdiff -t 5 6 | less -r

This will work only with Bourne shell derivatives such as bash, not with csh or tcsh. But since you need to do this only to correct for a broken termcap database, this limitation shouldn't be too much of a problem.
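
The reason the TERM=linux prefix is safe to use is that, in a Bourne-style shell, an assignment written in front of a command applies to that one command only. Here is a small sketch of that behavior. It uses sh -c as a stand-in for wdiff, which is my assumption so that the example runs on systems without wdiff installed.

```shell
# Remember the current value (empty if TERM is unset).
before=${TERM-}

# The assignment in front of the command applies to that command alone.
inside=$(TERM=linux sh -c 'printf "%s" "$TERM"')

# The calling shell's own TERM is untouched.
after=${TERM-}

printf 'inside the command: %s\n' "$inside"
[ "$before" = "$after" ] && echo "TERM unchanged in the calling shell"
```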

wdiff isn't always built with the termcap support needed to underline and reverse video, and it's not always what you want even if you have a working termcap database, so there's an alternate output format that is just as easy to understand. We'll kill two birds with one stone by also showing off wdiff's ability to deal with re-wrapped paragraphs while showing off its ability to work without underline and reverse video. File 8 is the same as the correct file 2, shown at the beginning of this article, but file 7 (the corrupted one) now has much shorter lines, which makes them even harder to compare ``by eye'':

Ecce Eduardus Ursus scalis
nunc tump-tump-tump occipite
gradus pulsante post
Christophorum Robinum
descendens. Est quod sciat
unus et solus modus gradibus
desendendi, non nunquam autem
sentit, etiam alterum nodum
exitare, dummodo pulsationibus
desinere et de no modo
meditari possit. Deinde censet
alios modos non esse. En, nunc
ipse in imo est, vobis
ostentari paratus.
Winnie ille Pu.

wdiff is not confused by the differently-wrapped lines. The command wdiff 7 8 produces this output:

Ecce Eduardus Ursus scalis nunc tump-tump-tump
occipite gradus pulsante post Christophorum
Robinum descendens. Est quod sciat unus et solus
modus gradibus
[-desendendi, non nunquam-]
{+descendendi, nonnunquam+} autem
sentit, etiam alterum [-nodum
exitare,-] {+modum exstare,+} dummodo
pulsationibus desinere et de [-no-] {+eo+}
modo meditari
possit. Deinde censet alios modos non esse. En,
nunc ipse in imo est, vobis ostentari paratus.
Winnie ille Pu.

Remember the + and - characters? They mean the same thing with wdiff as they mean with diff. (Consistent user interfaces are wonderful.)

Chunks

Near the beginning of this article, I promised to explain this line:

@@ -1,9 +1,9 @@

that describes the chunk that diff found differences in. In each file, the chunk starts on line 1 and is 9 lines long. In this small example, the chunk contains the whole file. With larger files, only the changed lines and a few unchanged lines around them, called the context, are shown.

In files 9 and 10, I've inserted a lot of blank lines in the middle of the paragraph, in order to show what multiple chunks look like. File 9 is damaged, file 10 is correct (except for the blank lines in the middle of the paragraph):

diff -u 9 10

--- 9   Mon Apr 22 15:46:37 1996
+++ 10  Mon Apr 22 15:46:14 1996
@@ -1,7 +1,7 @@
 Ecce Eduardus Ursus scalis nunc tump-tump-tump
 occipite gradus pulsante post Christophorum
 Robinum descendens. Est quod sciat unus et solus
-modus gradibus desendendi, non nunquam autem
+modus gradibus descendendi, nonnunquam autem
 
 
 
@@ -33,7 +33,7 @@
 
 
 sentit, etiam alterum modum exstare, dummodo
-pulsationibus desinere et de no modo meditari
+pulsationibus desinere et de eo modo meditari
 possit. Deinde censet alios modos non esse. En,
 nunc ipse in imo est, vobis ostentari paratus.
 Winnie ille Pu.

So you see that two seven-line chunks are shown here: one starting at line 1 and one starting at line 33.
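
To see multiple chunks on your own machine, here is a rough sketch. The 20-line file and the names old.txt and new.txt are made up for illustration. It changes two lines that are far apart and prints only the chunk header lines.

```shell
workdir=$(mktemp -d)

# Build a 20-line file, then a copy with lines 3 and 18 changed.
i=1
while [ "$i" -le 20 ]; do
  echo "line $i"
  i=$((i + 1))
done > "$workdir/old.txt"
sed -e 's/^line 3$/line three/' -e 's/^line 18$/line eighteen/' \
  "$workdir/old.txt" > "$workdir/new.txt"

# Keep only the chunk header lines; "|| true" absorbs diff's exit
# status of 1, which only means that the files differ.
headers=$(diff -u "$workdir/old.txt" "$workdir/new.txt" | grep '^@@' || true)
printf '%s\n' "$headers"

rm -rf "$workdir"
```

With GNU diff this should print two headers, one chunk covering lines 1 to 6 and one covering lines 15 to 20. Each chunk is only six lines long, rather than seven, because the start or the end of the file cuts the context short.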

You should notice several things here:

``Patches'' (or ``diffs'') are the output of the diff program. They include all the chunks of changes between the two files.

Other formats

This only scratches the surface of diff. For one thing, the three lines of unchanged context are configurable. Instead of using the -u option, you can use the -U lines option to specify any reasonable number of lines of context. You can even specify -U 0 if you don't want any context at all, though that is rarely useful.
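
Here is a small sketch of what changing the amount of context looks like. It assumes a diff that accepts -U, such as GNU diff, and the file names are again only illustrative.

```shell
workdir=$(mktemp -d)

# Build a 20-line file, then a copy with only line 10 changed.
i=1
while [ "$i" -le 20 ]; do
  echo "line $i"
  i=$((i + 1))
done > "$workdir/old.txt"
sed 's/^line 10$/line ten/' "$workdir/old.txt" > "$workdir/new.txt"

# One line of context on each side of the change; "tail -n +3" skips
# the two header lines, which carry file names and timestamps.
one_line=$(diff -U 1 "$workdir/old.txt" "$workdir/new.txt" | tail -n +3 || true)

# No context at all: only the changed lines remain.
no_context=$(diff -U 0 "$workdir/old.txt" "$workdir/new.txt" | tail -n +3 || true)

printf '%s\n\n%s\n' "$one_line" "$no_context"

rm -rf "$workdir"
```

A patch made with -U 0 gives the patch program nothing to match against if the file has shifted, which is one reason it is rarely what you want.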

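What does the -u (or -U lines) argument mean? It specifies the unified diff format, which is the particular format covered here. Other formats include context diffs, produced with the -c option, and normal diffs, produced when no format option is given.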

You will almost never want to create context or normal diffs, but it may be useful to recognize them from time to time. Context diffs are marked by the use of the character ! to mark changes, and normal diffs are marked by the use of the characters < and > to mark changes.

Here are examples:

diff -c 1 2
*** 1	Sat Apr 20 22:11:53 1996
--- 2	Sat Apr 20 22:12:01 1996
***************
*** 1,9 ****
  Ecce Eduardus Ursus scalis nunc tump-tump-tump
  occipite gradus pulsante post Christophorum
  Robinum descendens. Est quod sciat unus et solus
! modus gradibus desendendi, non nunquam autem
  sentit, etiam alterum modum exstare, dummodo
! pulsationibus desinere et de no modo meditari
  possit. Deinde censet alios modos non esse. En,
  nunc ipse in imo est, vobis ostentari paratus.
  Winnie ille Pu.
--- 1,9 ----
  Ecce Eduardus Ursus scalis nunc tump-tump-tump
  occipite gradus pulsante post Christophorum
  Robinum descendens. Est quod sciat unus et solus
! modus gradibus descendendi, nonnunquam autem
  sentit, etiam alterum modum exstare, dummodo
! pulsationibus desinere et de eo modo meditari
  possit. Deinde censet alios modos non esse. En,
  nunc ipse in imo est, vobis ostentari paratus.
  Winnie ille Pu.

diff 1 2
4c4
< modus gradibus desendendi, non nunquam autem
---
> modus gradibus descendendi, nonnunquam autem
6c6
< pulsationibus desinere et de no modo meditari
---
> pulsationibus desinere et de eo modo meditari
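
If you receive a diff and are not sure which format it is in, the marker characters give it away. This sketch, with made-up file names, produces a normal diff and then pulls out just the changed-line markers from a context diff.

```shell
workdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$workdir/old.txt"
printf 'alpha\nBETA\ngamma\n' > "$workdir/new.txt"

# Normal format: no option at all. "2c2" means line 2 was changed.
normal=$(cd "$workdir" && diff old.txt new.txt || true)

# Context format: changed lines are marked with "!" in both halves.
context_marks=$(cd "$workdir" && diff -c old.txt new.txt | grep '^!' || true)

printf '%s\n\n%s\n' "$normal" "$context_marks"

rm -rf "$workdir"
```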

There are a few other important things to note here:

Using Patches

When someone changes a file that other people have copies of (source code, documentation, or just about any other text file), they often send patches instead of (or in addition to) making the entire new file available. If you have the old file and the patches, you might wish that you could have a program apply the patches. You might think that normal diff format, which was made to look like input to the ed program, would be the best way to accomplish this.

As it turns out, this is not true.

A program called patch has been written which is specifically designed to apply patches to files (change the files as specified in the patch). It correctly recognizes all the formats of patches and applies them. With unified and context diffs, patch can usually apply patches, even if lines have been added or removed from the file, by looking for unchanged context lines. Only if the context lines have themselves been changed is patch likely to fail.

To apply patches with patch, you normally have a file containing the patch (we'll call it patchfile), and then run patch:

patch < patchfile

Patch is very verbose. If it gets confused by anything, it stops and asks you in English (it was written by a linguist, not a computer scientist) what you want to do. If you want to learn more about patch, the man page is unusually readable.
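
Here is a sketch of the whole round trip: one person creates a patch, and another applies it to a copy of the old file. The names old.txt, new.txt, and mine.txt are hypothetical. Since patch is not installed on every system, the script checks for it first.

```shell
workdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$workdir/old.txt"
printf 'alpha\nBETA\ngamma\n' > "$workdir/new.txt"

# The author of the change creates the patch file. diff exits with
# status 1 because the files differ, hence "|| true".
(cd "$workdir" && diff -u old.txt new.txt > patchfile || true)

# The recipient has only a copy of the old file.
cp "$workdir/old.txt" "$workdir/mine.txt"

if command -v patch >/dev/null 2>&1; then
  # Naming the target file explicitly means patch does not have to
  # work it out from the file names recorded in the patch header.
  patch "$workdir/mine.txt" < "$workdir/patchfile"
  if cmp -s "$workdir/mine.txt" "$workdir/new.txt"; then
    result="patched copy matches new.txt"
  else
    result="patched copy differs from new.txt"
  fi
else
  result="patch is not installed"
fi
echo "$result"

rm -rf "$workdir"
```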

Other Related Tools

If you read the RCS article in the May issue (Take Command: Keeping Track of Change, LJ #25, May 1996), you may have noticed that the article talked a bit about a program called rcsdiff. rcsdiff is really just a front end to diff. That is, it looks for arguments that it understands (such as revision numbers and the filename) and prepares two files representing the two versions of the file you are examining. It then calls diff with the remaining options. The RCS article used -u to get the unified format without explaining what it meant, but you can use -c to get context diffs, or use -U lines to choose the amount of context you get in a unified diff, or use any other diff options you like.

You may notice that rcsdiff produces more verbose output than normal diff. From the RCS article:

rcsdiff -u -r1.3 -r1.6 foo
==============================================
RCS file: foo,v
retrieving revision 1.3
retrieving revision 1.6
diff -u -r1.3 -r1.6
--- foo 1996/02/01 00:34:15     1.3
+++ foo 1996/02/01 01:05:28     1.6
@@ -1,2 +1,6 @@
 This is a test of the emergency
-RCS system.  This is only a test.
+RCS version control system.
+This is only a test.
+
+I'm now adding a few lines for
+the next version.

It looks just like a normal unified diff except for the first 5 lines.

This doesn't prevent you from sending patches to people. The patch program is extremely good about ignoring extraneous information. It can even ignore news or mail headers, extra comments written in a file outside a patch, and people's signatures following patches. Patch tells you when it is determining whether text is part of a patch or not by saying ``Hmm...''

If you don't care how two files differ, but just want to know whether they differ, the cmp program will tell you. It works not only for text files, but also for binary files. In this example, the files 5 and 6 are different; 2 and 4 are the same:

cmp 5 6
5 6 differ: char 159, line 4
cmp 2 4

Notice that when two files are the same, cmp doesn't say anything at all. It only tells you explicitly if the files differ. For use in writing shell scripts, cmp also returns true if the files are the same and false if they differ, as shown by this shell session:

if cmp 5 6 ; then
  echo "same"
else
  echo "different"
fi
5 6 differ: char 159, line 4
different
if cmp 2 4 ; then
  echo "same"
else
  echo "different"
fi
same
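
In a script you usually don't want the ``differ'' message cluttering the output. The -s option tells cmp to stay silent and report only through its exit status. A minimal sketch, with scratch files a, b, and c made up for the purpose:

```shell
workdir=$(mktemp -d)
printf 'one\ntwo\n' > "$workdir/a"
cp "$workdir/a" "$workdir/b"
printf 'one\n2\n' > "$workdir/c"

# -s suppresses all output; only the exit status is used.
if cmp -s "$workdir/a" "$workdir/b"; then ab=same; else ab=different; fi
if cmp -s "$workdir/a" "$workdir/c"; then ac=same; else ac=different; fi

echo "a and b: $ab"
echo "a and c: $ac"

rm -rf "$workdir"
```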

There are several other programs with related functionality. In particular, diff3 can be used to merge together two different files that have both been edited from a common ancestor file. That common ancestor must exist in order for diff3 to work correctly.
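
As a rough sketch of such a merge, assuming GNU diff3 and its -m option, here are two people each changing a different line of the same ancestor file. The file names mine, ancestor, and yours are illustrative.

```shell
workdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\ndelta\nepsilon\n' > "$workdir/ancestor"
printf 'ALPHA\nbeta\ngamma\ndelta\nepsilon\n' > "$workdir/mine"
printf 'alpha\nbeta\ngamma\ndelta\nEPSILON\n' > "$workdir/yours"

# Argument order: my version, the common ancestor, your version.
# -m writes the merged file to standard output.
merged=$(diff3 -m "$workdir/mine" "$workdir/ancestor" "$workdir/yours")
printf '%s\n' "$merged"

rm -rf "$workdir"
```

Because the two changes are far apart, the merged result contains both of them. When both people change the same lines, diff3 instead marks the conflict in its output and leaves the decision to you.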

The info pages which are shipped with diff are probably installed on your system. If you want to learn more about diff, try the command info diff or use info mode from within emacs or jed.

diff, wdiff, patch, and emacs are available via ftp from the canonical GNU ftp archive, prep.ai.mit.edu, in the directory /pub/gnu/

Michael K. Johnson's wife Kim likes A. A. Milne and briefly studied Latin (unlike Michael, whose experience with Latin was limited to singing in choir), which is why she owns Winnie Ille Pu as well as Tela Charlottae (Charlotte's Web).