Todd Hoff at High Scalability wrote an interesting piece on what and how to log. His position is that you essentially need to "log everything all the time". However, curiously missing from his list of what "everything" means is the full detail on how application release/update procedures impacted the environment.
This is a problem we see all too often: extensive system logging efforts but no visibility into the change management processes. Without the complete picture, you spend your time studying symptoms and guessing at the root event, rather than quickly identifying the root event and spending your time on a solution. Under the pressure of a significant outage, it is hard to overstate the value of having the right tools at your disposal.
From my more detailed comment to Todd's post:
Info you can't get from normal system and application logs:
1. When did the application change?
2. What was changed? What are all of the code, data, and content assets related to that change?
3. Exactly what procedures were run to produce the change? Who ran the commands? What variables/inputs did the procedures use?
4. What nodes did those procedures touch?
5. What commands can I run to immediately put everything back into a last known good state? (often through a "roll-forward" rather than a true "roll-back" procedure)
The common perception is that it just isn't possible or practical to collect this kind of data in an automated and authoritative manner. It is, but it depends on the correct choice of build, deployment, and configuration management tooling.
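To make the five questions concrete, here is a minimal sketch of what a deployment tool could write to a change log each time it runs. Nothing here comes from a specific tool: the `ChangeRecord` fields, the `record_change` and `changes_touching` helpers, the node names, and the `deploy --release 2.4.0` restore command are all hypothetical and stand in for whatever your build and deployment tooling actually exposes.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class ChangeRecord:
    """One entry in a hypothetical deployment change log."""
    changed_at: str        # 1. when the application changed (UTC, ISO 8601)
    release: str           # 2. what changed: release identifier
    assets: list           # 2. code, data, and content assets in the change
    procedure: str         # 3. procedure that produced the change
    operator: str          # 3. who ran it
    inputs: dict           # 3. variables/inputs the procedure used
    nodes: list            # 4. nodes the procedure touched
    restore_command: str   # 5. command that returns to the last known good state


def record_change(log, **fields):
    """Append a timestamped ChangeRecord to `log` as a JSON line and return it."""
    record = ChangeRecord(
        changed_at=datetime.now(timezone.utc).isoformat(), **fields
    )
    log.append(json.dumps(asdict(record), sort_keys=True))
    return record


def changes_touching(log, node):
    """Return the logged changes that touched `node`, oldest first."""
    entries = [json.loads(line) for line in log]
    return [entry for entry in entries if node in entry["nodes"]]


# Example usage with made-up values; a real tool would fill these in itself.
log = []
rec = record_change(
    log,
    release="2.4.1",
    assets=["app.war", "schema-17.sql", "static-content.tgz"],
    procedure="rolling-update",
    operator="alice",
    inputs={"batch_size": 2},
    nodes=["web01", "web02"],
    restore_command="deploy --release 2.4.0",  # hypothetical roll-forward command
)
suspects = changes_touching(log, "web01")
```

The point of the sketch is that the record is written by the deployment procedure itself, not reconstructed afterwards. During an outage, a query like `changes_touching(log, "web01")` then answers "what changed on this node, who changed it, and how do I undo it" directly.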