Nucleus Datakonsult Agent Alerts Management Pack

Home

About me and Nucleus
Partners

Articles, tips etc.
My blog
Links
Microsoft articles
Forums

Course demo files




Microsoft Most Valuable Professional


Overview
SQL Server writes messages to the EventLog when various things happen in the database engine. You can tell SQL Server Agent to monitor the EventLog for some specific SQL Server errors, and let Agent can notify an operator (send e-mail) and/or start a job. The problem has always been to know what alerts to define - this article is an attempt to rectify that.

You might already have some monitoring application that has support for SQL Server, like OP5 ,HP Operations Manager or Microsoft Operations Manager. If you do and are happy with that solution, then stick with it. But if you don't have such a solution, or if you feel that your current software's support for SQL Server EventLog monitoring can be improved, then read on.

Versions of SQL Server
This article applies to SQL Server 2005 and 2008, where not noted otherwise.

Intended audience
I assume a reasonable knowledge level of SQL Server. You should be able to find your way around the tools and understand the most common terms and acronyms. If you don't meet those criteria, then I suggest you buy some tool which has an interface which is easier to use, hand this over to a senior DBA or perhaps hire a consultant to do the planning and implementation. If you want to use the T-SQL script at the end of this article, then you should be able to read and understand T-SQL scripts.

The e-mail part
You probably want to send an email when the event occurs. I suggest you use the Database Mail support which was introduced in SQL Server 2005 - a pure SMTP implementation. I discourage you from using the old "SQL Mail" support since it uses MAPI. SQL Server Books Online describes how to configure Database Mail, so I won't talk about it here. Check out Dan Guzman's script to configure Database Mail using T-SQL instead of running the Wizard on each SQL Server Instance.

Frequency of EventLog polling
You can control this using a registry setting. Do this at own risk, it isn't supported or documented. The name of the registry key is EventLogPeekInterval. Where to find it depends on things like whether it is a default or a named instance, the instance name and also version of SQL Server. To put it bluntly, if you can't find it, then you shouldn't mess with it. The default value is every 20 seconds.

Defining the alert
You define an alert using the procedure msdb.dbo.sp_add_alert or SSMS (SQL Server Agent, Alerts folder). If you are about to use the T-SQL script and sp_add_alert, then please do read about the procedure in Books Online, and decide what values you want for each parameter. In the script below you will find what values I use, but that might not be the right values for you. I won't discuss the parameters here since they are documented in Books Online.

You also want to say what action to be performed. I will assume that you want to send an email (msdb.dbo.sp_add_notification). This requires that you configured the Database Mail support, configured Agent to use the Profile you created, re-started Agent, and finally configured an Operator. You will see that in the script below, I hard-wired the operator name to 'Tibor'. I did this so I don't have to pass the operator name for every single alert I define. Change this to your liking.

For the alert, you say whether you want to trigger on a certain error number or all errors with a certain severity level (I'm ignoring Perfmon and WMI alerts here). You can't have both. The difficult part is deciding on what to trigger on. You don't want to go too wide so you will be mail-bombed - but you don't want to miss important messages either.

One thing you will notice that if you trigger for a certain severity level that there are messages with that level that you may want to exclude. We used to be able to enter such "Non Alertable Errors" in the registry key NonAlertableErrors (as described here). This functionality seems to have been removed, I'm afraid. Let me know if your tests show that the reg key actually does anything in 2005 or 2008.

How will this evolve and how can you help?
In a blog post I wrote 2009-02-23, I had hopes for some community project to get started for this. That didn't happen, so I decided to write this article. I don't expect that I managed to find the "right" messages, of course. If you have some messages you find obvious should be there or vice versa, then let me know. For instance, there might be some error which always follow some particular other error - meaning it isn't necessary to define alerts for both. Just keep in mind that we cannot exclude certain errors if we included the whole severity level for that error. I expect that many of us will end up with some differences in their alert definitions - one size doesn't fit all. But over time I hope that the suggested definitions on this page can serve as a good starting point. So, please send me an email if you find obvious things to add or remove from the list. If I use your recommendation, then I'll add your name (and URL if you provide) to the Acknowledgements section, unless you ask otherwise. You find my mail address here.

My suggested alerts
I have learned that some readers don't feel confident with adapting and running the TSQL code. Instead, they prefer to have the definitions listed and add the alerts using point&click in SSMS or some other means. That is fine, so below you find my suggestions, as a table, and further down you have the T-SQL code.

First up are the alerts defined for the whole severity level:
Suggested alert nameSeverity level
Severity level 1616
Severity level 1717
Severity level 1818
Severity level 1919
Severity level 2020
Severity level 2121
Severity level 2222
Severity level 2323
Severity level 2424
Severity level 2525

And then we have the alerts defined for specific error numbers:
Suggested alert name(severity)Err noNeed enableDesc
Error 0060112601yCould not continue scan with NOLOCK due to data movement.
Error 0067410674nException occurred in destructor of RowsetNewSS 0x%p. This error may indicate a problem related to releasing pre-allocated disk blocks used during bulk-insert operations.
Error 0070810708nServer is running low on virtual address space or machine is running low on virtual memory. Reserved memory used %d times since startup. Cancel query and re-run, decrease
Error 0080610806naudit failure (a page read from disk failed to pass basic integrity checks)
Error 0082510825nA read of the file '%ls' at offset %#016I64x succeeded after failing %d time(s) with error: %ls. Additional messages in the SQL Server error log and system event log may
Error 0097310973nDatabase %ls was started . However, FILESTREAM is not compatible with the READ_COMMITTED_SNAPSHOT and ALLOW_SNAPSHOT_ISOLATION options. Either remove the FILESTREAM fi
Error 01205131205yTransaction (Process ID %d) was deadlocked on %.*ls resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
Error 03401103401nErrors occurred during recovery while rolling back a transaction. The transaction was deferred. Restore the bad page or file, and re-run recovery.
Error 03410103410nData in filegroup %s is offline, and deferred transactions exist. Use RESTORE to recover the filegroup, or drop the filegroup if you never intend to recover it. Log trunc
Error 03414103414nAn error occurred during recovery, preventing the database '%.*ls' (database ID %d) from restarting. Diagnose the recovery errors and fix them, or restore from a known go
Error 03422103422nDatabase %ls was shutdown due to error %d in routine '%hs'. Restart for non-snapshot databases will be attempted after all connections to the database are aborted.
Error 03452103452nRecovery of database '%.*ls' (%d) detected possible identity value inconsistency in table ID %d. Run DBCC CHECKIDENT ('%.*ls').
Error 03619103619nCould not write a checkpoint record in database ID %d because the log is out of space. Contact the database administrator to truncate the log or allocate more space to th
Error 03620103620nAutomatic checkpointing is disabled in database '%.*ls' because the log is out of space. Automatic checkpointing will be enabled when the database owner successfully chec
Error 03959103959nVersion store is full. New version(s) could not be added. A transaction that needs to access the version store may be rolled back. Please refer to BOL on how to configure
Error 05029105029nWarning: The log for database '%.*ls' has been rebuilt. Transactional consistency has been lost. The RESTORE chain was broken, and the server no longer has context on the
Error 05144105144nAutogrow of file '%.*ls' in database '%.*ls' was cancelled by user or timed out after %d milliseconds. Use ALTER DATABASE to set a smaller FILEGROWTH value for this file
Error 05145105145nAutogrow of file '%.*ls' in database '%.*ls' took %d milliseconds. Consider using ALTER DATABASE to set a smaller FILEGROWTH for this file.
Error 05182105182nNew log file '%.*ls' was created.
Error 08539108539nThe distributed transaction with UOW %ls was forced to commit. MS DTC became temporarily unavailable and forced heuristic resolution of the transaction. This is an inform
Error 08540108540nThe distributed transaction with UOW %ls was forced to rollback.
Error 09001109001nThe log for database '%.*ls' is not available. Check the event log for related error messages. Resolve any errors and restart the database.
Error 141571014157nThe subscription created by Subscriber '%s' to publication '%s' has expired and has been dropped.
Error 141611014161nThe threshold [%s:%s] for the publication [%s] has been set. Make sure that the logreader and distribution agents are running and can match the latency requirement.
Error 171731017173nIgnoring trace flag %d specified during startup. It is either an invalid trace flag or a trace flag that cannot be specified during server startup.
Error 171791017179nCould not use Address Windowing Extensions because the 'lock pages in memory' privilege was not granted.
Error 178831017883nProcess %ld:%ld:%ld (0x%lx) Worker 0x%p appears to be non-yielding on Scheduler %ld. Thread creation time: %I64d. Approx Thread CPU Used: kernel %I64d ms, user %I64d ms.
Error 178841017884nNew queries assigned to process on Node %d have not been picked up by a worker thread in the last %d seconds. Blocking or long-running queries can contribute to this cond
Error 178871017887nIO Completion Listener (0x%lx) Worker 0x%p appears to be non-yielding on Node %ld. Approx CPU Used: kernel %I64d ms, user %I64d ms, Interval: %I64d.
Error 178881017888nAll schedulers on Node %d appear deadlocked due to a large number of worker threads waiting on %ls. Process Utilization %d%%.
Error 178901017890nA significant part of sql server process memory has been paged out. This may result in a performance degradation. Duration: %d seconds. Working set (KB): %I64d, committed
Error 178911017891nResource Monitor (0x%lx) Worker 0x%p appears to be non-yielding on Node %ld. Memory freed: %I64d KB. Approx CPU Used: kernel %I64d ms, user %I64d ms, Interval: %I64d.
Error 184011418401nLogin failed for user '%.*ls'. Reason: Server is in script upgrade mode. Only administrator can connect at this time.%.*ls
Error 205721020572nSubscriber '%s' subscription to article '%s' in publication '%s' has been reinitialized after a validation failure.
Error 205741020574nSubscriber '%s' subscription to article '%s' in publication '%s' failed data validation.
"Need enable" means that you need to enable writing to eventlog for this error (it isn't written to eventlog by default) using sp_altermessage. This requires sp3 for 2005 and sp1 for 2008.

About the T-SQL code
As I already mentioned, if you don't feel comfortable with reading, modifying and using the code below, then don't. I will not reply to email about how the code work, how to use it, where to execute it etc. Hire a consultant or hand this over to somebody with more T-SQL experience if you feel unsecure.
I prefer to have my own database for these types of things - I generally name it 'maint'.

The first part is a procedure that acts as a wrapper around sp_add_alert and sp_add_notification. This makes the code for each alert definition a little shorter.

The second part is the alert definitions. First the sever severity levels. I decided to go all the way down to 16, because there are a lot of errors with 16 which I want to be notified for and I didn't want to add each and every one of those individually. I wish we could block out some, but NonAlertableError doesn't seem to work anymore. Then comes the error numbers within each severity level. Note that I also added a couple that do not by default go to the event log. If you want to be alerted for such errors, you first need to configure the errors to be written to the EventLog. This functionality was added in sp3 for SQL Server 2005 and sp1 for SQL Server 2008 (blogged here). I'm sure there are plenty other errors that don't go to the EventLog that we want to be alerted for, please let me know. There were so many potential errors for severity level 10, so I used some T-SQL code to generate the calls to create_alert_notification.
Just execute to text, copy text and paste to query window and execute that. You can of course write a cursor around this, if you feel like it.

The wrapper procedure
USE maint
GO
IF OBJECT_ID('create_alert_notification') IS NOT NULL DROP PROC create_alert_notification
GO

CREATE PROC create_alert_notification
@msgid INT
,@sev INT
AS
---------------------------------------------------------------------------------------------------
--This procedure create alert and operator notification
--It is a wrapper around sp_add_alert and sp_add_notification
--Paramaters:
-- @msgid error number to define alert for. Specify 0 if for severity level.
-- @sev severity level to define alert for. Specify 0 if for error number.
--One of above must be 0 and the other must be > 0

--IMPORTANT: Walk the code and replace ward-wired values.
--Operator name is obvious, but also check other relevant parameter. Adjust to suit you!

--2009-05-29 Tibor Karaszi
---------------------------------------------------------------------------------------------------
DECLARE
@alert_name sysname
,@ret INT
--Not both @msgid and @sev can be <> 0
IF @msgid <> 0 AND @sev <> 0
BEGIN
RAISERROR
('Cannot have both error number and severity <> 0.', 16, 0)
RETURN -101
END

SET
@alert_name =
CASE
WHEN @sev = 0 THEN 'Error ' + (RIGHT('00000' + CAST(@msgid AS VARCHAR(20)),5))
ELSE 'Severity level ' + CAST(@sev AS VARCHAR(20))
END

BEGIN
TRY
EXEC @ret = msdb.dbo.sp_add_alert
@name = @alert_name
,@message_id = @msgid
,@severity = @sev
,@delay_between_responses = 600 --10 minutes
,@include_event_description_in = 1 --Email

EXEC msdb.dbo.sp_add_notification
@alert_name = @alert_name
,@operator_name=N'Tibor'
,@notification_method = 1
END TRY

BEGIN CATCH
DECLARE
@err_str VARCHAR(2000)
,
@err_sev tinyint
,@err_state tinyint
,@err_proc VARCHAR(200)
SET @err_str = ERROR_MESSAGE() + ' Error rooted in procdure "' + ERROR_PROCEDURE() + '".'
SET @err_sev = ERROR_SEVERITY()
SET @err_state = ERROR_STATE()
RAISERROR(@err_str, @err_sev, @err_state)
END CATCH
/*
--Test execution
EXEC create_alert_notification @msgid = 1105, @sev = 0
EXEC create_alert_notification @msgid = 0, @sev = 18
EXEC create_alert_notification @msgid = 55000, @sev = 0
EXEC create_alert_notification @msgid = 55000, @sev = 18
*/
GO


The alert definitions
--Code below defines SQL Server Agent alerts.
--Don't use it if you don't understand the code.
USE maint
GO

--Query to play with, to investigate sysmessages
SELECT message_id, severity, is_event_logged, TEXT, language_id
FROM sys.messages
WHERE language_id = 1033
AND severity < 16
AND is_event_logged = 1
ORDER BY severity DESC, message_id

--All severity level 16 and higher
EXEC create_alert_notification @msgid = 0, @sev = 16
EXEC create_alert_notification @msgid = 0, @sev = 17
EXEC create_alert_notification @msgid = 0, @sev = 18
EXEC create_alert_notification @msgid = 0, @sev = 19
EXEC create_alert_notification @msgid = 0, @sev = 20
EXEC create_alert_notification @msgid = 0, @sev = 21
EXEC create_alert_notification @msgid = 0, @sev = 22
EXEC create_alert_notification @msgid = 0, @sev = 23
EXEC create_alert_notification @msgid = 0, @sev = 24
EXEC create_alert_notification @msgid = 0, @sev = 25

--Other, selected errors:
--Level 14
EXEC create_alert_notification @msgid = 18401, @sev = 0

--Level 13
--None that by default goes to eventlog
--Only add below if you are on 2005sp3 or 2008sp1, or higher:
EXEC sp_altermessage @message_id = 1205, @parameter = 'WITH_LOG', @parameter_value = 'true'
EXEC create_alert_notification @msgid = 1205, @sev = 0

--Level 12
--None that by default goes to eventlog
--Only add below if you are on 2005sp3 or 2008sp1, or higher:
EXEC sp_altermessage @message_id = 601, @parameter = 'WITH_LOG', @parameter_value = 'true'
EXEC create_alert_notification @msgid = 601, @sev = 0

--Level 10
--There are so many of these so we auto-generate the calls
--(which also auto-adapt to prior version which might how have some message).
--We create a temp tables with alerts we want, and then use that when we select from sys.messages.
--You can of course make the temp table a permanent table, add version information etc.
--Execute below into text and take the generated call, paste them to query windows and execute.
IF OBJECT_ID('tempdb..#alerts_to_include') IS NOT NULL DROP TABLE #alerts_to_include
GO
CREATE TABLE #alerts_to_include
(message_id INT PRIMARY KEY, short_msg VARCHAR(90), already_defined bit DEFAULT 0)

INSERT INTO #alerts_to_include(message_id, short_msg)
SELECT 674, 'Exception occurred in destructor of RowsetNewSS 0x%p...'
UNION ALL
SELECT 708, 'Server is running low on virtual address space or machine is running low on virtual...'
UNION ALL
SELECT 806, 'audit failure (a page read from disk failed to pass basic integrity checks)...'
UNION ALL
SELECT 825, 'A read of the file %ls at offset %#016I64x succeeded after failing %d time(s) wi..'
UNION ALL
SELECT 973, 'Database %ls was started . However, FILESTREAM is not compatible with the READ_COM...'
UNION ALL
SELECT 3401, 'Errors occurred during recovery while rolling back a transaction...'
UNION ALL
SELECT 3410, 'Data in filegroup %s is offline, and deferred transactions exist...'
UNION ALL
SELECT 3414, 'An error occurred during recovery, preventing the database %.*ls (database ID %d)...'
UNION ALL
SELECT 3422, 'Database %ls was shutdown due to error %d in routine %hs.'
UNION ALL
SELECT 3452, 'Recovery of database %.*ls (%d) detected possible identity value inconsistency...'
UNION ALL
SELECT 3619, 'Could not write a checkpoint record in database ID %d because the log is out of space...'
UNION ALL
SELECT 3620, 'Automatic checkpointing is disabled in database %.*ls because the log is out of spac...'
UNION ALL
SELECT 3959, 'Version store is full. New version(s) could not be added.'
UNION ALL
SELECT 5029, 'Warning: The log for database %.*ls has been rebuilt.'
UNION ALL
SELECT 5144, 'Autogrow of file %.*ls in database %.*ls was cancelled by user or timed out...'
UNION ALL
SELECT 5145, 'Autogrow of file %.*ls in database %.*ls took %d milliseconds.'
UNION ALL
SELECT 5182, 'New log file %.*ls was created.'
UNION ALL
SELECT 8539, 'The distributed transaction with UOW %ls was forced to commit...'
UNION ALL
SELECT 8540, 'The distributed transaction with UOW %ls was forced to rollback. '
UNION ALL
SELECT 9001, 'The log for database %.*ls is not available.'
UNION ALL
SELECT 14157, 'The subscription created by Subscriber %s to publication %s has expired...'
UNION ALL
SELECT 14161, 'The threshold [%s:%s] for the publication [%s] has been set.'
UNION ALL
SELECT 17173, 'Ignoring trace flag %d specified during startup'
UNION ALL
SELECT 17179, 'Could not use Address Windowing Extensions because the lock pages in mem...'
UNION ALL
SELECT 17883, 'Process %ld:%ld:%ld (0x%lx) Worker 0x%p appears to be non-yielding on Scheduler...'
UNION ALL
SELECT 17884, 'New queries assigned to process on Node %d have not been picked up by a worker...'
UNION ALL
SELECT 17887, 'IO Completion Listener (0x%lx) Worker 0x%p appears to be non-yielding...'
UNION ALL
SELECT 17888, 'All schedulers on Node %d appear deadlocked due to a large number of...'
UNION ALL
SELECT 17890, 'A significant part of sql server process memory has been paged out...'
UNION ALL
SELECT 17891, 'Resource Monitor (0x%lx) Worker 0x%p appears to be non-yielding on Node %ld...'
UNION ALL
SELECT 20572, 'Subscriber %s subscription to article %s in publication %s has been reinitiali...'
UNION ALL
SELECT 20574, 'Subscriber %s subscription to article %s in publication %s failed...'

SELECT
'EXEC create_alert_notification @msgid = ' +
CAST(message_id AS VARCHAR(10)) +
', @sev = 0'
FROM sys.messages
WHERE message_id IN
(
SELECT message_id FROM #alerts_to_include
)
AND
language_id = 1033

--Query to play with to generate above:
/*
SELECT 'SELECT ' + CAST(message_id AS varchar(10)) +
', ''' + REPLACE(CAST(text AS varchar(90)), '''', '') + '''' +
CHAR(13) + CHAR(10) + ' UNION ALL'
FROM sys.messages
WHERE language_id = 1033
AND severity < 16
AND is_event_logged = 1
ORDER BY severity DESC, message_id
*/