From  CertCities.com
Column
Inside the Kernel
Setting the Stage with Stream Editing
Emmett walks you through an exercise using the "sed" tool.

by Emmett Dulaney

9/15/2007 -- Two of my favorite tools in the Unix/Linux toolbox are sed and awk. While sed is the "stream editor" and awk is a quick programming language, the truth of the matter is that they complement each other so well that I rarely use one without the other.

I recently had the occasion to work with two examples that were similar enough to require one or both of these tools, but were unrelated in scope. Both examples, however, show the beauty and power of what these tools can do. I'll be the first to admit that I'm not the cleanest programmer in the world -- in fact, the term "spaghetti code" was invented to describe my work -- but one of the advantages of these tools is that it's possible to accomplish what you need to without thinking too much about how the code looks.

I'll walk through the first example this month, then set the stage for the second example and discuss it next month. The following are sample lines of a colon-delimited employee database that includes five fields: unique ID number, name, department, phone number and address:

1218:Kris Cottrell:Marketing:219.555.5555:123 Main Street
1219:Nate Eichhorn:Sales:219.555.5555:1219 Locust Avenue
1220:Joe Gunn:Payables:317.555.5555:21974 Unix Way
1221:Anne Heltzel:Finance:219.555.5555:652 Linux Road
1222:John Kuzmic:Human Resources:219.555.5555:984 Bash Lane
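
To follow along, one option is to wrap the sample lines in a small shell function and pipe its output into each command. The name sample_db is a hypothetical helper invented for this sketch (the real database would be a file). Here, cut is used only to confirm that the phone number is the fourth colon-delimited field:

```shell
# sample_db is a hypothetical helper that prints the five sample records.
sample_db() {
cat <<'EOF'
1218:Kris Cottrell:Marketing:219.555.5555:123 Main Street
1219:Nate Eichhorn:Sales:219.555.5555:1219 Locust Avenue
1220:Joe Gunn:Payables:317.555.5555:21974 Unix Way
1221:Anne Heltzel:Finance:219.555.5555:652 Linux Road
1222:John Kuzmic:Human Resources:219.555.5555:984 Bash Lane
EOF
}

# Print only the fourth colon-delimited field (the phone number).
sample_db | cut -d: -f4
# 219.555.5555
# 219.555.5555
# 317.555.5555
# 219.555.5555
# 219.555.5555
```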

This database has been in existence since the beginning of the company and has since grown to include everyone who now works, or has ever worked, for the company. Given that, a number of proprietary scripts read from the database and the company can't afford to be without it. The problem is that the telephone company has changed the 219 prefix to 260 and all entries in the database need to be changed.

This is precisely the task for which sed was created. As opposed to standard (interactive) editors, a stream editor works its way through a file and makes changes based on the rules it was given. The rule, in this case, is to change "219" to "260." It is not quite that simple, however; if you use the command

sed 's/219/260/'

the result won't be completely what you want (the change falls in the ID number on the second line, in the address on the third and in the phone number on the others):

1218:Kris Cottrell:Marketing:260.555.5555:123 Main Street
1260:Nate Eichhorn:Sales:219.555.5555:1219 Locust Avenue
1220:Joe Gunn:Payables:317.555.5555:26074 Unix Way
1221:Anne Heltzel:Finance:260.555.5555:652 Linux Road
1222:John Kuzmic:Human Resources:260.555.5555:984 Bash Lane

The changes in the first, fourth and fifth lines are correct, which produces only a 60 percent accuracy rate. In the second line, the first occurrence of "219" falls in the employee ID number rather than in the phone number, so the ID is changed and the phone number is left alone. If you wanted to change more than the very first occurrence in a line, you could slap a "g" (for global) into the command:

sed 's/219/260/g'

That is not what you want to do in this case, however; the employee ID number shouldn't change. Similarly, in the third line, no change at all should be made since the employee doesn't have this telephone prefix. Nevertheless, a change was made erroneously to the address because it contains the value that's being searched for.
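
To see why the global flag makes matters worse here, this sketch runs it against the sample lines. The sample_db function is a hypothetical helper used only for illustration; it simply prints the five records shown earlier:

```shell
# sample_db is a hypothetical helper that prints the five sample records.
sample_db() {
cat <<'EOF'
1218:Kris Cottrell:Marketing:219.555.5555:123 Main Street
1219:Nate Eichhorn:Sales:219.555.5555:1219 Locust Avenue
1220:Joe Gunn:Payables:317.555.5555:21974 Unix Way
1221:Anne Heltzel:Finance:219.555.5555:652 Linux Road
1222:John Kuzmic:Human Resources:219.555.5555:984 Bash Lane
EOF
}

# The g flag replaces every "219" on a line, not just the first one.
sample_db | sed 's/219/260/g'
# 1218:Kris Cottrell:Marketing:260.555.5555:123 Main Street
# 1260:Nate Eichhorn:Sales:260.555.5555:1260 Locust Avenue
# 1220:Joe Gunn:Payables:317.555.5555:26074 Unix Way
# 1221:Anne Heltzel:Finance:260.555.5555:652 Linux Road
# 1222:John Kuzmic:Human Resources:260.555.5555:984 Bash Lane
```

On the second line the phone number is now fixed, but the ID number and the street number have been altered along with it.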

The first rule of using sed is to identify what makes the location of the string you're looking for unique. If the telephone prefix were encased in parentheses, it would be much easier to isolate. That's not the case in this database, though, and the task becomes a bit more complicated.
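
One way to look for that uniqueness, offered here as a sketch rather than as part of the original exercise, is to list every occurrence of the string together with the character on either side of it. This assumes a grep that supports the -o option (GNU and BSD grep do; it is not required by POSIX), and sample_db is a hypothetical helper that prints the sample records:

```shell
# sample_db is a hypothetical helper that prints the five sample records.
sample_db() {
cat <<'EOF'
1218:Kris Cottrell:Marketing:219.555.5555:123 Main Street
1219:Nate Eichhorn:Sales:219.555.5555:1219 Locust Avenue
1220:Joe Gunn:Payables:317.555.5555:21974 Unix Way
1221:Anne Heltzel:Finance:219.555.5555:652 Linux Road
1222:John Kuzmic:Human Resources:219.555.5555:984 Bash Lane
EOF
}

# Show each "219" with one character of context on both sides,
# then count how often each distinct context appears.
sample_db | grep -o '.219.' | sort | uniq -c
```

On the sample data, the four phone-number hits are the only ones that have a colon in front and a period behind. The ID and street-number hits have a digit in front, and the remaining address hit is followed by a "7."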

In this case, you could say that it must appear at the beginning of the field (denoted by a colon) and get a result that is much closer:

sed 's/:219/:260/'

The second line now comes out right, but look at the address on the third:

1218:Kris Cottrell:Marketing:260.555.5555:123 Main Street
1219:Nate Eichhorn:Sales:260.555.5555:1219 Locust Avenue
1220:Joe Gunn:Payables:317.555.5555:26074 Unix Way
1221:Anne Heltzel:Finance:260.555.5555:652 Linux Road
1222:John Kuzmic:Human Resources:260.555.5555:984 Bash Lane

The accuracy has now increased to 80 percent, but there's still the problem of the third line. As the colon helped to identify the start of the string, it may be tempting to turn to the period to identify the end:

sed 's/:219./:260./'

But the result still isn't what you want. Notice the third line:

1218:Kris Cottrell:Marketing:260.555.5555:123 Main Street
1219:Nate Eichhorn:Sales:260.555.5555:1219 Locust Avenue
1220:Joe Gunn:Payables:317.555.5555:260.4 Unix Way
1221:Anne Heltzel:Finance:260.555.5555:652 Linux Road
1222:John Kuzmic:Human Resources:260.555.5555:984 Bash Lane

Since the period has the special meaning of "any character," the search finds a match whether the 219 is followed by a literal period, a "7" or any other single character. Whatever that character happens to be, it gets replaced with a period. There's no problem with the replacement side of things, but the search needs to be tweaked. By using the backslash (\) character, it's possible to override the special meaning of the period and specify that you are indeed looking for a period and not any single character:

sed 's/:219\./:260./'

The result becomes:

1218:Kris Cottrell:Marketing:260.555.5555:123 Main Street
1219:Nate Eichhorn:Sales:260.555.5555:1219 Locust Avenue
1220:Joe Gunn:Payables:317.555.5555:21974 Unix Way
1221:Anne Heltzel:Finance:260.555.5555:652 Linux Road
1222:John Kuzmic:Human Resources:260.555.5555:984 Bash Lane

And the mission is accomplished.
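
Since awk splits each line into fields, another way to reach the same result is to restrict the substitution to the fourth field. This is only an alternative sketch, not the approach used above, and sample_db is a hypothetical helper standing in for the real file:

```shell
# sample_db is a hypothetical helper that prints the five sample records.
sample_db() {
cat <<'EOF'
1218:Kris Cottrell:Marketing:219.555.5555:123 Main Street
1219:Nate Eichhorn:Sales:219.555.5555:1219 Locust Avenue
1220:Joe Gunn:Payables:317.555.5555:21974 Unix Way
1221:Anne Heltzel:Finance:219.555.5555:652 Linux Road
1222:John Kuzmic:Human Resources:219.555.5555:984 Bash Lane
EOF
}

# Split on colons, rewrite a leading "219." in field 4 only,
# and print every record with colons restored between fields.
sample_db | awk -F: 'BEGIN { OFS = ":" } { sub(/^219\./, "260.", $4); print }'
# 1218:Kris Cottrell:Marketing:260.555.5555:123 Main Street
# 1219:Nate Eichhorn:Sales:260.555.5555:1219 Locust Avenue
# 1220:Joe Gunn:Payables:317.555.5555:21974 Unix Way
# 1221:Anne Heltzel:Finance:260.555.5555:652 Linux Road
# 1222:John Kuzmic:Human Resources:260.555.5555:984 Bash Lane
```

Anchoring on the field number rather than on the surrounding punctuation means that ID numbers and addresses are never candidates for the change, whatever digits they happen to contain.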

The second example involves a database of books that includes the ISBN of each title. Prior to the beginning of this year, ISBNs were 10 digits long and included an identifier for the publisher and a unique number for each book. As of January, ISBNs for new books are 13 digits long. Old books (those published prior to the first of this year) have both the old 10-digit number and a new 13-digit number. For this example, the existing 10-digit number will stay in the database and a new field will be added to the end of each entry holding the ISBN-13 number.

To come up with the ISBN-13 number for the existing entries in the database, you start with "978" then use the first nine digits of the old ISBN number. The 13th digit is a mathematical calculation (a "check digit") obtained by doing the following:

  1. Add all the odd-placed digits of the 12-digit number together.
  2. Multiply all the even-placed digits by 3 and add them together.
  3. Add the total of Step 2 to the total of Step 1.
  4. Find out what you need to add to round the number up to the nearest ten (0 if the total is already a multiple of ten). This value becomes the 13th digit.

For example, consider the 10-digit ISBN of 0743477103. It first becomes 978074347710. Then:

  1. 9+8+7+3+7+1=35
  2. 7*3=21; 0*3=0; 4*3=12; 4*3=12; 7*3=21; 0*3=0; 21+0+12+12+21+0=66
  3. 66+35=101
  4. 110-101=9. The ISBN-13 thus becomes: 9780743477109
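
The arithmetic in the steps above can be sketched in plain shell. This is one possible illustration of the check-digit calculation alone, not the script promised for next month, and the function name isbn13_from_isbn10 is a hypothetical one chosen for this example:

```shell
# isbn13_from_isbn10 is a hypothetical helper name used for illustration.
isbn13_from_isbn10() {
    # Prefix "978" to the first nine characters of the ISBN-10.
    body="978$(printf '%.9s' "$1")"
    sum=0
    pos=1
    while [ "$pos" -le 12 ]; do
        digit=$(printf '%s' "$body" | cut -c"$pos")
        if [ $((pos % 2)) -eq 1 ]; then
            sum=$((sum + digit))        # odd-placed digits count once
        else
            sum=$((sum + 3 * digit))    # even-placed digits count three times
        fi
        pos=$((pos + 1))
    done
    # Amount needed to reach the next multiple of ten (0 if already there).
    check=$(( (10 - sum % 10) % 10 ))
    printf '%s%s\n' "$body" "$check"
}

isbn13_from_isbn10 0743477103   # prints 9780743477109
```

The old tenth digit is simply dropped, since it is the ISBN-10's own check digit and plays no part in the new calculation.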

The beginning database resembles

0743477103:Macbeth:Shakespeare, William

And you want the resulting database to resemble

0743477103:Macbeth:Shakespeare, William:9780743477109

Next month, we'll look at how to create a script to generate this. In the meantime, feel free to play around with it and see if you can come up with your own way to do so. Hint: sed will not do it all, and is only one part of the toolbox.


Emmett Dulaney is the author of several books on Linux, Unix and certification.

 

 
