Good morning! In efforts to try and help others with accomplishing the Herculean task of monitoring Microsoft SQL Server, I have decided to start putting together a series of documents around my tweaks to the various probes in order to get solid performance monitoring as well as alarming. In past lives, I have used best of breed monitoring applications to monitor my SQL Servers, as integration with an enterprise class monitoring platform/ticketing system wasn't a priority or need. These monitoring platforms do, admittedly, install with essentially everything you could ever want metrics on already humming away. They do not, however, have any other purpose in life beyond monitoring MS SQL Server - which simply won't do in my current environment (MSP).
My goal was to get additional insight into the SQL Server logs. SQL Server does a great job of logging everything you could ever possibly want. Of all of the events that SQL Server logs, it logs a subset of those events to the Windows Server event logs. This is good news! The task of gaining additional insight into concerning SQL Server events just got a lot easier - we can leverage the ntevl probe to monitor for critical errors. In my case, I have defined a critical error as an error that is severity 17 and above (these require administrative effort to resolve). Severity 16 and under are errors that are user correctable (syntax errors, permissions, and other general errors that are user correctable) and so I am not as concerned with these. In order to monitor for these with the ntevl probe, you need the EventIDs (or do you? more on this later). Every SQL Server version is different, and while these generally stay fairly consistent, you'll want to check for the specifics for your SQL Server, which you can do as follows:
select message_id, severity, is_event_logged, text from sys.messages where severity >= 17 and language_id = 1033 order by severity desc, message_id;
A few notes on the where clause: I was looking only for errors that are severity 17 and higher (those are errors I defined as critical) and the language_id 1033 maps to us_english, which is my locality. If you need a different locality and aren't sure what the language_id is, just grab the lcid from your locality:
select name, alias, lcid, msglangid from sys.syslanguages;
Now let's head on to some other notes about the result set from the messages catalog view. You'll notice the column is_event_logged which, surprisingly enough, maps to whether or not SQL Server is going to log that event to the Windows Event Log. Not everything is event logged, but a lot of it is. You can modify these as you see fit by using the sp_altermessage stored procedure (full documentation: https://msdn.microsoft.com/en-us/library/ms175094.aspx). An example of enabling event logging on a pre-existing error:
exec sp_altermessage @message_id = 5125, @parameter = N'WITH_LOG', @parameter_value = 'true';
So that covers all the SQL Server specifics, now on to the ntevl probe. For my purposes, I created a profile for every...single...eventID. See:
Why would I do this, when there is that handy little Source/Publisher Name (which would be MSSQLSERVER in this case) and severity (which would be sufficient with error and critical - 2 profiles)? You can kind of see the reason just below. The run command on match capabilities. I did this by eventID so that I had the flexibility to run different automated responses for different issues. If I get an error 822, 823 or 824, I want to immediately kick off database consistency checks, for instance. If you don't care about that functionality, you could absolutely use the source and severity fields and have hundreds fewer profiles than I have.
This was one of those things that I had to get done ASAP when creating my SQL Server super package in order to sleep soundly at night. I have attached the profile definitions to hopefully make life easy for other DBAs - I want you to sleep soundly too. You can stuff these in right after your last profile definition, but before the </watcher> tag at the end of the ntevl.cfx config file from your ntevl probe.
I am planning on doing documentation similar to this for all of the other probes and the subsequent customization that I found to be crucial to getting UIM up to snuff with monitoring my SQL Server environments. Hope this helps and feel free to make suggestions!
- Added a modified .cfx with probe definitions that have alarm level set to critical.
- Added EID 3041 (failed backup) - previous version assumed it would be caught in the context of a SQL Server Agent Job failure. This has been added to account for those who implement backups with proper try...catch logic to prevent failing their jobs. Alarm level is major.