I get asked this all the time and have no way of telling people without manually writing out each probe config, which takes ages.
Are there any other solutions?
I have to answer this, along with "what can you monitor?" and "can you monitor this?", often enough. Sadly, I agree it is pretty hard to tell. I have written documentation of our basic monitoring templates, covering all the checkpoints they include and their thresholds. Of course these only cover a subset of what can be monitored, so when customers want something else I have to write it out. Still, it helps, and we'll likely go further with this and build different sets of monitoring offerings, each including different things to monitor (and naturally carrying a different price). These documents also serve as a changelog and technical documentation for our templates.
I've written configuration reporting scripts and applications to help with this, but in truth doing it properly would require much more effort. The fact is, sometimes it's not enough to hand over the template documents and say "this is what we should monitor"; sometimes people want to see what is actually being monitored. BryanKMorrow posted the "probe configuration archive" in this community, which fetches probe configuration data and stores it in the NiS database. That is probably a good starting point for reporting on configurations: you can create reusable SQL queries against it. Though, due to the structure of the configuration files, this is not quite so simple a task.
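As a rough sketch of what a reusable query might look like, here is a minimal example against an assumed section/key/value table. The real table layout produced by the probe configuration archive in NiS will differ; this uses an in-memory SQLite database purely so the idea runs standalone.

```python
import sqlite3

# Hypothetical schema for archived probe configuration; the actual
# NiS table structure is different, so treat this as illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE probe_config (
        robot   TEXT,
        probe   TEXT,
        section TEXT,
        key     TEXT,
        value   TEXT
    )
""")
conn.executemany(
    "INSERT INTO probe_config VALUES (?, ?, ?, ?, ?)",
    [
        ("robot1", "cdm", "disk/alarm/fixed_default", "active", "yes"),
        ("robot1", "cdm", "disk/alarm/fixed_default", "threshold", "10"),
        ("robot2", "cdm", "disk/alarm/fixed_default", "threshold", "5"),
    ],
)

# A reusable, parameterized query: which robots carry a given
# probe/section/key, and with what value?
QUERY = """
    SELECT robot, value
    FROM probe_config
    WHERE probe = ? AND section = ? AND key = ?
    ORDER BY robot
"""
rows = conn.execute(
    QUERY, ("cdm", "disk/alarm/fixed_default", "threshold")
).fetchall()
print(rows)  # [('robot1', '10'), ('robot2', '5')]
```

Once the query is parameterized like this, answering "what are we actually monitoring on robot X?" becomes a matter of swapping the bind values rather than writing a new report.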
I am also working on a similar probe that will store all of the probe configurations using the probe_config_get callback. It stores them by section->key->value; however, this means a large number of inserts into the database at execution time, so I'm working in SQLite currently. I still need to add some error handling and batch the inserts into fewer SQL transactions. I will post it when it is completed. I also have a sample Unified Report that dynamically loads all the probe sections so you can report on specific key/value pairs. This is a starting point for getting the information you are looking for.
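A minimal sketch of the transaction-batching idea, assuming the probe_config_get response has already been flattened into a section->key->value dict (the callback's actual PDS layout differs, so the input shape here is an assumption):

```python
import sqlite3

def store_config(conn, robot, probe, config):
    """Insert a probe's section->key->value config in one transaction.

    `config` maps section names to {key: value} dicts -- an assumed
    flattening of the probe_config_get response, not its real format.
    """
    rows = [
        (robot, probe, section, key, value)
        for section, keys in config.items()
        for key, value in keys.items()
    ]
    # One transaction for the whole batch instead of a commit per row,
    # which is where most of the insert-time cost goes in SQLite.
    with conn:
        conn.executemany(
            "INSERT INTO probe_config VALUES (?, ?, ?, ?, ?)", rows
        )
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE probe_config "
    "(robot TEXT, probe TEXT, section TEXT, key TEXT, value TEXT)"
)
n = store_config(conn, "robot1", "cdm", {
    "disk/alarm/fixed_default": {"active": "yes", "threshold": "10"},
    "cpu/alarm": {"active": "no"},
})
print(n)  # 3
```

Wrapping `executemany` in a single `with conn:` block is the main trick: SQLite commits are expensive, so collapsing thousands of per-row commits into one transaction is usually the biggest win.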
My answer to this is that you monitor the things that allow you to predict and react to events that cost you money so that you can reduce that cost.
For example, I went through an effort with a new application group within my company, and one of the requirements was to monitor for a change of CPU type. The supporting argument was "well, because we want to know." All this sort of monitoring does is cost everyone something, with the only benefit being satisfied curiosity.
On the other hand, another group looked at the past 6 months of cases, grouped them by root cause, sorted them by time spent to resolve, picked the top 10 off the list, and monitored for the causes of those cases. That translated to direct cost savings across the board: happier customers, happier employees, better workflow, etc.
Underlying it, I think, is a subtle difference in how you ask the "what do I monitor" question. You don't want to monitor a metric and then figure out what it means. You want to find a recurring issue, identify the cause, and then develop monitoring that predicts the occurrence of the issue. The prediction part is important: you don't want to monitor for the failure itself, you want to monitor for the predicting symptoms. It's the difference between monitoring the flow of lava in a volcanic eruption and monitoring the seismic activity that happens before the lava flows.
If you are presenting data back to your non-technical users, it gets trickier. If you've done your monitoring correctly (in my opinion), then you have a bunch of data that predicts a failure but isn't interesting to look at. It's like the check engine light on your car. Here's where you make decisions to monitor things that have no direct value but make for nice graphs: total memory used, for instance, or total CPU. There's lots of variation, you can color-code it so it looks interesting, and people intuitively believe that small numbers are good (even though they indicate waste), so you don't need documentation. In some respects you are looking to recreate the Windows Task Manager.
Sorry, not enough coffee yet today and I misread the question.
But one of the results of the process above is documentation about what you are doing and why.
That formalization of the "what is monitored" process makes it easy to answer the question of what you are actually monitoring. Put it all in packages and/or scripts that do the configuration and you have predictability. As soon as individuals start tweaking settings on their own, you are in a losing battle to keep track of the answer.
Good explanation, Garin.
Request for a Report that provides all Robots, their probes and contained configuration settings
Has anyone found a way to collect this information?
I have a probe in testing currently that can do what you ask. It stores all the information in a SQLite database in the probe directory and has callbacks that let you create CSV reports for spreadsheet import. It allows you to create probe 'profiles' so you can filter out the noise from the configurations. It won't work on probes that don't store their configurations in CFG files (e.g. the new SNMP Collector), nor will it return baseline information.
If you are interested in this field solution, please email me at firstname.lastname@example.org
Perfect Bryan, thank you.
I'll reply to you by e-mail!
I tried to email you about the probe, but I don't think the email went through. I sent it to both the ca.com and broadcom.com addresses.
Afraid to tell you that he is no longer with the organization.
That's a disappointing loss.
Huge loss, very sorry to hear this
I wrote a script that uses the various probe callbacks to query every hub, for every robot, for every probe and every config, and pulled that back into an MSSQL DB. This can be used for reporting, as well as a "backup" of the current state of monitoring for each of our customers.
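A hedged sketch of that hub -> robot -> probe traversal, with the SDK call abstracted behind an injected `request` callable. The callback names and response shapes below are assumptions for illustration, not the real Nimsoft API; in practice you would wire `request` to whatever SDK binding you use.

```python
def walk_probes(request):
    """Walk hub -> robot -> probe using an injected `request` callable.

    `request(address, callback)` stands in for the real SDK call; the
    callback names and dict shapes here are assumptions, not the actual
    Nimsoft/UIM API.
    """
    rows = []
    for hub in request("/", "gethubs"):
        for robot in request(hub["addr"], "getrobots"):
            for probe in request(robot["addr"], "probe_list"):
                rows.append((hub["name"], robot["name"], probe["name"]))
    return rows

# Stubbed request function so the traversal can be exercised offline.
def fake_request(address, callback):
    data = {
        ("/", "gethubs"): [{"name": "hub1", "addr": "/hub1"}],
        ("/hub1", "getrobots"): [{"name": "robotA", "addr": "/hub1/robotA"}],
        ("/hub1/robotA", "probe_list"): [{"name": "cdm"}, {"name": "ntservices"}],
    }
    return data[(address, callback)]

print(walk_probes(fake_request))
# [('hub1', 'robotA', 'cdm'), ('hub1', 'robotA', 'ntservices')]
```

Keeping the transport injectable like this also makes the traversal testable without a live hub, and the flat (hub, robot, probe) rows map directly onto an insert into the reporting database.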