101communication LLC CertCities.com -- The Ultimate Site for Certified IT Professionals

 Notes from Underground   James Ervin

 Log Analysis 101
The what, where, why and how of setting up logs on your Unix system.
by James Ervin  
3/6/2002 -- Log management and analysis is a critical, yet often neglected part of system administration. Computers can generate a remarkable "paper" trail in short order. To complicate matters, consider this: If you were searching for a needle in a haystack, would you be able to tell if it was already gone? The integrity of logs is as important as their interpretation. Any organization can benefit from a logging infrastructure, but you should have these questions in mind before embarking: What do you want to log, to what purpose, and how do you intend to present your results?

Why Log
Motivations for logging depend on your particular situation. Most log analysis is retrospective: That is, you need to be able to process logs and obtain information from them, but not in real time. Retrospective analysis is appropriate for identifying trends over a period of time: analyzing traffic patterns on a Web site, identifying performance bottlenecks and times of peak usage, tracking mail delivery, and so on. Other tasks, such as monitoring systems for failures and identifying security breaches in progress, call for real-time analysis.

A second motivation for logging, not to be overlooked, is to prepare for the inevitable audit or budget cutback. Statistics carry real justificatory power when the auditor comes a-calling. Additionally, many institutions now impose log retention requirements on their IT departments. Though the information you end up logging may not be edifying or useful in the technical sense, there may be legal requirements your logging architecture must meet. Many's the system administrator who lived to regret eliminating a critical portion of the digital paper trail, even two or three years after the fact. I've found that the best policy is to save as many logs as possible, for as long as possible, but it's impossible to globally define what "possible" means. Luckily, offline storage is cheaper by the hour.

What to Log (or Not)
Most applications have variable levels of logging, from shy and reticent to extremely talkative, so the question becomes what not to log while retaining the information you want. Applications that perform their own logging usually have some quirks in how they go about it, but any application should allow you to adjust at least the location and the verbosity of the log file.

The Apache Web server, for instance, performs its own logging. It creates two logs by default: the error log and the access log. You can adjust the verbosity of the error log; however, only the format of the access log can be customized. The Apache access log, in other words, doesn't have a concept of different log priorities: All information in the access log has the same value, in the eyes of Apache.

Most of the backbone Internet programs (FTP, DNS, Sendmail and others) use the syslog protocol to route their log messages. Syslog is like a traffic cop for log information. It allows you to redirect log information to multiple files, or even to other applications that might analyze the logs on the fly. It's also extremely granular, which accounts for its popularity. Syslog defines a log priority as the combination of a severity level and a facility. Sendmail, the popular mail transfer agent, can use the syslogd program, an implementation of the syslog protocol, as a conduit for routing its logs (given that the same person authored both programs, this isn't surprising). So, in the syslogd configuration file, /etc/syslog.conf, this line:

mail.debug    /var/log/maillog

indicates that any messages from the "mail" facility with a severity of "debug" or higher (in practice, everything the mail facility produces, since debug is the lowest level) will get appended to the file /var/log/maillog.

Several additional facilities and severity levels are defined in the syslog Request for Comments, RFC 3164, such as kern for the kernel message facility and emerg for messages with emergency severity. Additionally, the facilities local0 through local7 are set aside for local use, so you can assign them to your own applications. Keep in mind that there is a hierarchy to the severity levels, though not to the facilities. These are the RFC-defined severity levels, in order of increasing urgency: debug, informational, notice, warning, error, critical, alert, and emergency (written in syslog.conf as debug, info, notice, warning, err, crit, alert and emerg). Normally, a line in /etc/syslog.conf will log all messages of that severity and higher for the specified facility, but some implementations of syslog, such as that included in Red Hat Linux, allow you to select only log messages of a specific severity level.
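
To make the severity hierarchy concrete, here is a minimal Python sketch of how a single facility.severity selector decides whether to catch a message. The function name selector_matches is my own invention, not part of any syslog implementation, and the "=" prefix for an exact match follows the syntax of the Linux syslogd mentioned above; other implementations may spell it differently or not support it at all.

```python
# Severity keywords as they appear in syslog.conf, from least to most urgent.
SEVERITIES = ["debug", "info", "notice", "warning", "err", "crit", "alert", "emerg"]


def selector_matches(selector: str, facility: str, severity: str) -> bool:
    """Return True if a 'facility.severity' selector would catch the message.

    Illustrative only: handles one facility (or '*') and one named severity,
    optionally prefixed with '=' for an exact match.
    """
    sel_facility, sel_severity = selector.split(".", 1)
    if sel_facility not in ("*", facility):
        return False
    if sel_severity.startswith("="):
        # Exact-match extension found in some syslogd implementations.
        return sel_severity[1:] == severity
    # Traditional behavior: the named severity and everything more urgent.
    return SEVERITIES.index(severity) >= SEVERITIES.index(sel_severity)


print(selector_matches("mail.debug", "mail", "err"))     # True
print(selector_matches("mail.=info", "mail", "err"))     # False
print(selector_matches("kern.crit", "kern", "warning"))  # False
```

Note that the facility is compared only for equality, while the severity is compared by position in the list: that is the "hierarchy for severities, none for facilities" rule in code form.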

The Presentation Layer: Digesting Apache and Sendmail Logs
The Apache Web server is the classic example of an application that doesn't need to be monitored in real time, but generates logs far too verbosely for eyeball analysis over any significant length of time. To analyze massive amounts of log information, you need a tool that can generate summary information. My experience with logging tools is that typically, the first one you try answers some questions, but mainly serves to raise new ones. Then you go searching for another tool, and another, and another, ad infinitum. While it doesn't look like it on the surface, log analysis can be one of the system administrator's more rewarding and interesting tasks, since it generates immediate proof of success, or failure, if that's your thing. Either way, pretty graphs tend to impress.

The premier free tool for Apache log analysis is Analog, currently at version 5.21 and updated at breakneck pace. It takes one or more Apache log files as input and generates a text or HTML report summarizing client browser types, requests per hour, and more. It lacks some of the features of commercial log analysis suites (Sawmill, Summary, WebTrends or Urchin), such as the ability to track an individual user's progress through the site. Also, it must be reconfigured if you want to change the report generation; it cannot be configured "on-the-fly" from a Web interface, as some of the commercial offerings can. Its customizability and extensibility, though, more than compensate for these deficiencies. Several helper applications are available for the core product that permit it to generate presentation-worthy charts, analyze Sendmail logs in addition to Apache logs, and automate log generation. Additionally, it's written in C, and is quite fast; analyzing several gigabytes of logs at a time is not a problem.
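
To give a feel for what "summary information" means, here is a rough Python sketch that produces one of the simplest reports, requests per hour of day. It has nothing to do with how Analog is implemented; it assumes the access log is in Common Log Format, and the sample lines and addresses are made up for illustration.

```python
import re
from collections import Counter

# Matches a Common Log Format line:
# host ident user [day/month/year:hour:minute:second zone] "request" status bytes
CLF_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ '
    r'\[(?P<day>\d{2})/(?P<month>\w{3})/(?P<year>\d{4}):(?P<hour>\d{2}):\d{2}:\d{2} [^\]]+\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def requests_per_hour(lines):
    """Count requests by hour of day (00-23), skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = CLF_LINE.match(line)
        if match:
            counts[match.group("hour")] += 1
    return counts


# Invented sample data standing in for a real access log.
sample = [
    '192.0.2.10 - - [06/Mar/2002:09:15:02 -0500] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.11 - - [06/Mar/2002:09:47:40 -0500] "GET /logo.gif HTTP/1.0" 200 512',
    '192.0.2.10 - - [06/Mar/2002:14:03:11 -0500] "GET /missing.html HTTP/1.0" 404 -',
]
for hour, count in sorted(requests_per_hour(sample).items()):
    print(f"{hour}:00  {count}")
```

A real analyzer adds dozens of such reports, copes with custom log formats, and does it all fast enough for gigabytes of input, which is why a dedicated tool is worth the setup time.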

For More Information

Counterpane's Log Analysis Resources -- Maintained by Counterpane Security, home of Bruce Schneier's Cryptogram newsletter, this is the most complete, best annotated collection of links to log analysis sites and services on the Web. It also hosts the LogAnalysis mailing list. A heavy emphasis on security pervades the site.

Log Analysis Software -- Another set of log analysis links.

Sendmail FAQ -- A short guide to analyzing Sendmail logs.

Apache Web logs are instructive, because they demonstrate a truism of logging: consistency of format equals ease of analysis. The format of Apache logs is well documented and can be explicitly defined via the LogFormat directive in the Apache configuration file. In fact, the worst bottleneck in analyzing Apache logs is, by far, the time required to convert the IP addresses in the log files into hostnames. For performance reasons, the Apache Web server is distributed with the directive "HostnameLookups Off" in its configuration file, so that the Web server doesn't have to query DNS just to write each request to the log. This is a good idea if you expect your Web server to actually receive any hits. However, you'll end up performing these DNS lookups en masse during log analysis. Several applications are available that can accelerate batch DNS lookups. My personal favorite is fastresolve, which takes an Apache log file as input and spits out the same file, with hostnames in place of IP addresses.
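
The idea behind batch resolution can be sketched in a few lines of Python. This is only an illustration of the caching step, not a description of how fastresolve works; the function names are mine, and the sketch assumes each log line begins with an IPv4 address, as in the default access log format.

```python
import re
import socket

# An IPv4 address at the very start of the line, followed by whitespace.
IP_AT_START = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})(?=\s)")


def reverse_lookup(ip):
    """Resolve one IPv4 address to a hostname; fall back to the address itself."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ip


def resolve_log(lines, lookup=reverse_lookup):
    """Yield log lines with the leading IP replaced by its hostname.

    Each distinct address is looked up once and cached, which is where most
    of the savings come from when a few clients account for many requests.
    """
    cache = {}
    for line in lines:
        match = IP_AT_START.match(line)
        if match:
            ip = match.group(1)
            if ip not in cache:
                cache[ip] = lookup(ip)
            line = cache[ip] + line[match.end():]
        yield line
```

The lookup function is a parameter so the caching logic can be tried without touching DNS. Lookups here still happen one at a time, so a slow or unresponsive name server will stall the whole run; dedicated tools go further than this sketch does.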

Sendmail, on the other hand, produces logs in a less consistent format. To quote the Sendmail Installation and Operation Guide found in its distribution: "Some [log] fields may be omitted if they do not contain interesting information." Consequently, there are fewer and less advanced tools for analyzing its output. The best tool I've found for this purpose, besides Analog, is Lire. Lire produces reports in text or XML that can be converted into numerous other formats, including PDF and RTF, which makes the suits happy. Lire is actually a modular log analysis suite that operates somewhat independently of the type of log it analyzes. However, because of its use of XML and XSL and the very extensive summary data it produces, Lire may be slower than other, simpler tools.
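
Omitted fields are the main reason a Sendmail parser has to be forgiving. The following Python sketch shows one tolerant approach: treat everything after the queue ID as a loose bag of key=value pairs. The sample line is invented, and splitting on ", " is a simplification that will mishandle any value that itself contains a comma followed by a space.

```python
import re

# syslog prefix, then "queue-id: key=value, key=value, ..."
SENDMAIL_LINE = re.compile(r"sendmail\[\d+\]: (?P<qid>\w+): (?P<fields>.+)$")


def parse_sendmail_line(line):
    """Return (queue_id, fields) for a sendmail delivery line, or None.

    Fields are kept in a dict so that callers can use .get() and cope with
    whichever fields a given line happens to omit.
    """
    match = SENDMAIL_LINE.search(line)
    if not match:
        return None
    fields = {}
    for part in match.group("fields").split(", "):
        key, sep, value = part.partition("=")
        if sep:  # ignore fragments that are not key=value
            fields[key] = value
    return match.group("qid"), fields


# Invented example; note that it carries no msgid= field.
line = ("Mar  6 12:00:01 mail sendmail[1234]: g26H01x01234: "
        "from=<alice@example.com>, size=1024, nrcpts=1, "
        "relay=client.example.com [192.0.2.20]")
queue_id, fields = parse_sendmail_line(line)
print(queue_id, fields.get("size"), fields.get("msgid", "(omitted)"))
```

Compare this with an Apache access log, where every field sits in a fixed position and a single pattern describes every line.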

Centralized Logging
If your Web site is of considerable size, or if you have several servers under your belt, you may find yourself feeling constrained. A Web site that gets 2 million hits daily can easily generate half a gigabyte of logs. Additionally, analysis of huge amounts of data is likely to interfere with the speed of the server. Establishing a logging server separate from your production environment begins to look like a good idea. Of course, then you have to get the logs to the logging server.

UNIX syslog has always been able to send log messages across the network via the unreliable UDP protocol. However, in today's world, it's not the best idea to send critical system information over an unreliable protocol, unencrypted to boot, even if you control your network with an iron fist. Several replacements for the syslogd daemon are available that obviate these issues:

  • Syslog Next-Generation, or syslog-ng, can sort log messages based on their contents, and transmits logs using the reliable TCP protocol.
  • Modular syslog, aka msyslog, allows you to send logs securely to a MySQL or PostgreSQL database, and can create encrypted logs, for those with a need to protect log data.
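
For the curious, the difference between the two transports is easy to see in code. This Python sketch builds a message in the RFC 3164 layout and offers both a UDP and a TCP sender. The function names, the abbreviated facility table, and the newline that terminates the TCP message are my own assumptions for illustration; check your log server's documentation for the framing and port it expects. Neither sender encrypts anything.

```python
import socket
from datetime import datetime

# Numeric codes from RFC 3164 (facility table abbreviated). Note that the
# numbers run opposite to urgency: 0 is emergency, 7 is debug.
FACILITIES = {"kern": 0, "user": 1, "mail": 2, "daemon": 3, "auth": 4, "local0": 16}
SEVERITIES = {"emerg": 0, "alert": 1, "crit": 2, "err": 3,
              "warning": 4, "notice": 5, "info": 6, "debug": 7}
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def format_message(facility, severity, hostname, tag, text, when):
    """Build an RFC 3164-style packet: <PRI>TIMESTAMP HOSTNAME TAG: text."""
    pri = FACILITIES[facility] * 8 + SEVERITIES[severity]
    stamp = f"{MONTHS[when.month - 1]} {when.day:2d} {when:%H:%M:%S}"
    return f"<{pri}>{stamp} {hostname} {tag}: {text}".encode("ascii", "replace")


def send_udp(packet, host, port=514):
    """Fire and forget: there is no acknowledgement, so loss goes unnoticed."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))


def send_tcp(packet, host, port):
    """Stream delivery: a failed connection raises an error the caller can see."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(packet + b"\n")


packet = format_message("mail", "info", "web1", "sendmail", "test message",
                        datetime(2002, 3, 6, 12, 0, 1))
print(packet)
```

The point of the comparison is in the two senders: send_udp returns happily whether or not anyone is listening, while send_tcp at least tells you when the logging server cannot be reached.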

For Apache logs, there are several free options for redirecting log messages:

  • The mod_log_spread module uses a central host.
  • The mod_log_mysql module redirects logs to a MySQL database.
  • The Apache error log can be redirected to syslogd within Apache as described in the Apache manual. However, the access log can only be piped to another program.
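
Since the access log can be piped to a program, a small script is enough to hand it to syslogd yourself. Here is a hedged Python sketch of such a piped logger: it reads log lines from standard input and forwards each one to the local syslog daemon. The ident "httpd-access" and the choice of the local0 facility are arbitrary assumptions, and the syslog module it uses is only available on Unix.

```python
import sys


def forward(lines, emit):
    """Pass each non-empty log line to emit(); return how many were sent."""
    sent = 0
    for line in lines:
        line = line.rstrip("\n")
        if line:
            emit(line)
            sent += 1
    return sent


def main():
    # Unix-only standard library module; imported here so that forward()
    # can be exercised on any platform.
    import syslog

    syslog.openlog("httpd-access", facility=syslog.LOG_LOCAL0)
    forward(sys.stdin, lambda line: syslog.syslog(syslog.LOG_INFO, line))

# When installed as a piped logger, finish the script with a call to main().
```

From there, a local0 line in /etc/syslog.conf decides where the access log ends up, whether that is a local file or a central logging host.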

Lies, Damned Lies, and Statistics
The last, best piece of advice I have for serious log junkies is to never, ever trust your tools completely. These aren't hammers; they're computer applications, and prone to err even without your help. If the results a particular tool produces aren't what you expect, try a competing tool. Given two analysis programs and identical input, even something as apparently simple as counting Web site hits can generate different results, depending on the internal workings of the chosen tools. If, however, every attempt to retrieve your boss's desired results fails, remember: "facts are stubborn, but statistics are more pliable" -- Mark Twain.
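
Here is a small Python illustration of how two reasonable definitions of a "hit" disagree on identical input. Both counting rules and the sample records are hypothetical; they don't correspond to any particular analysis tool.

```python
def count_all_requests(records):
    """Definition 1: every logged request is a hit."""
    return len(records)


def count_page_views(records, page_suffixes=(".html", "/")):
    """Definition 2: only successful requests for pages, not images or errors."""
    return sum(
        1
        for path, status in records
        if 200 <= status < 300 and path.endswith(page_suffixes)
    )


# Hypothetical (path, status) pairs standing in for parsed access log entries.
records = [
    ("/", 200),
    ("/logo.gif", 200),
    ("/style.css", 200),
    ("/news.html", 200),
    ("/news.html", 304),
    ("/old-page.html", 404),
]
print(count_all_requests(records), count_page_views(records))
```

Six hits or two, from the same six lines. Neither number is wrong, which is exactly why it pays to find out what your tool is counting before you put its output in a report.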

James Ervin is alone among his coworkers in enjoying Michelangelo Antonioni films, but in his more lucid moments suspects that they're not entirely wrong.

 

All content copyright 2000-03 101communications LLC, unless otherwise noted. All rights reserved.
Reprints allowed with written permission from the publisher. For more information, e-mail