> the problem isn't that we don't need flares, the problem is we have
> too many flares. The solution is a better coordination of the many flares
> to a single flare that summarizes the entire domino-effect of failures.
Right! So: "Information is data that makes a difference."
I'm not sure, but it sounds to me (#Nomenclature) that when you say "too many flares" you're talking about a whole bunch of "real-time operational metrics".
But not every datum should become a flare … an alert. That's what my subject line meant to imply. (Context: what I was responding to in your blog.)
For me, a "flare" signifies a pre-specified condition, i.e. "that freakin' HD has failed".
So yes, what you write about here is exactly what I guessed / intuited.
Example (from the avionics R&D project I worked on in Sydney … Micronav International, in Point Edward): any known failure mode will create a recognizable set of readings in our Built-In Test Equipment.
Q: How to take that data and produce information meaningful to the attendant? (BTW: we finished the design for the BITE, but unfortunately the project failed before we got to that next stage.)
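That step we never got to could be sketched in a few lines: each known failure mode defines a signature set of readings, and the monitor matches the raw flares against those signatures, reporting one diagnosis instead of the whole pile. To be clear, every name and signature below is hypothetical, invented for illustration, not anything from the Micronav design.

```python
# Hypothetical sketch: collapse a set of raw readings into diagnostic
# alerts by matching them against known failure-mode signatures.
# All failure modes and reading names here are made up.

FAILURE_SIGNATURES = {
    # failure mode -> the set of out-of-range readings it should produce
    "HD_FAILURE": {"disk_io_errors", "smart_reallocated_sectors"},
    "PSU_DEGRADED": {"voltage_low", "fan_rpm_high", "temp_high"},
}

def summarize(flares):
    """Return (diagnoses, unexplained) for a set of raw flares.

    A failure mode is diagnosed only when its *entire* signature is
    present; any leftover flares are reported as unexplained.
    """
    flares = set(flares)
    diagnoses = []
    for mode, signature in FAILURE_SIGNATURES.items():
        if signature <= flares:      # every expected reading is present
            diagnoses.append(mode)
            flares -= signature      # those flares are now explained
    return diagnoses, sorted(flares)

diagnoses, leftover = summarize(
    {"disk_io_errors", "smart_reallocated_sectors", "voltage_low"}
)
print(diagnoses)  # ['HD_FAILURE']
print(leftover)   # ['voltage_low']
```

The point of the sketch is the shape of the answer, not the matching rule: the attendant sees one flare per diagnosed failure mode, plus a short list of readings that no known signature explains.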
> Currently, that's an unsolved problem, but people are thinking about it.
Well I'd love to chat with those people! 🙂
I was using #monitoring/#notification … it seems to me that's accurate … that it failed to communicate makes me wonder just where I stepped into cartoon fiction.
> _If you use PagerDuty or Nagios you'll know that it comes with an
> inbox full of noise.
A) nice to see familiar nomenclature. A signal that isn't significant isn't noise … but it sure ain't useful!
B) I haven't "used" anything. Only 3 days ago did I realize that Rackspace (which I've known of for years) is deeply involved in OpenStack (which is quite new to me). Looking into that got me here.
Ganglia made perfect sense to me.
Your videos were also totally meaningful. (From memory … StackTach and Stacky?) BTW I left a question about Stacky on one of your YT videos.
FWIW I have Nagios docs loaded in a set of tabs; just now reading up on Ceilometer to get a sense of the plumbing.
> How to make that salient is key.
Contact … solid copy … salience = information; irrelevant "operational metrics" ain't. (Can't call those metrics noise/static, since they're not random/entropic.)
But yes, precisely that: the way I used the terms, "information" is a subset of all readings.
FWIW I first encountered this when babysitting NORAD/SAC multiplexing. (DEW Line … I'm old.) Channels, groups, super-groups … amazing similarities with cloud / instances / servers.
Sure, an overview of all metrics would give an experienced operator a sense of system status as a whole, but what was paramount was to have "flares" that were diagnostic.