The following is a summary of rules available out-of-the-box with the DB2 for LUW cartridge. Default threshold values can be changed or scoped to specific values, generally through registry variables. These rules can be copied, modified, disabled, or customized in a wide variety of ways.
This section describes the following rules:
Rules related to instance and database availability, sessions, CPU, and memory.
This alarm fires when a collection fails to retrieve its data.
This alarm fires when the database is not running (down). In this case, users will not be able to connect or retrieve any information from the database.
This alarm fires when the average connection time to the database exceeds a predefined threshold (Fatal: 20ms, Warning: 10ms).
This alarm fires when the database’s average response time exceeds a predefined threshold (Fatal: 20ms, Warning: 10ms).
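Many rules in this summary compare a metric against a Warning and a Fatal threshold. The following is a minimal sketch of that comparison, not the cartridge's actual rule code; the function name `classify_exceeds` is hypothetical, and treating "exceeds" as strictly greater than is an assumption.

```python
from typing import Optional


def classify_exceeds(value: float, warning: float, fatal: float) -> Optional[str]:
    """Return "Fatal", "Warning", or None for a metric that alarms when it is too high."""
    # Assumption: "exceeds" means strictly greater than the threshold.
    if value > fatal:
        return "Fatal"
    if value > warning:
        return "Warning"
    return None


# Response-time thresholds quoted above, in milliseconds.
severity = classify_exceeds(12.0, warning=10.0, fatal=20.0)
print(severity)
```

The Fatal threshold is checked first so that a value above both thresholds reports the higher severity.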
This alarm fires when the instance is not running (down). In this case, users will not be able to connect or retrieve any information from the instance.
This alarm is invoked when a change in major DB2 release version is detected. It is designed to remind Foglight administrators to verify that the Foglight user permissions are applicable to the new version, to ensure continuous monitoring.
This alarm is invoked when the CPU utilization of the DB2 agents exceeds the baseline.
This alarm is invoked when the CPU utilization of the DB2 agents exceeds the baseline.
This alarm is invoked when the CPU utilization of the DB2 agents exceeds the baseline.
This alarm is invoked when the utilization of the database memory pool exceeds a predefined threshold (Fatal: 95%, Warning: 90%).
This alarm is invoked when the memory consumption of the DB2 agents exceeds the baseline.
This alarm is invoked when the utilization of the instance’s memory pool exceeds a predefined threshold (Fatal: 95%, Warning: 90%).
This alarm is invoked when the memory consumption of the DB2 agents exceeds the baseline.
This alarm is invoked when the memory consumption of the DB2 agents exceeds the baseline.
This alarm is invoked when the status of the HADR connection to the database is disconnected. Ensuring that the database is not disconnected is critical in order to prevent a crisis situation.
This alarm fires when a database failover occurs and the database role changes from Primary to Standby or vice versa.
This alarm is invoked when the HADR log gap exceeds a predefined threshold (Fatal: 16384KB, Warning: 4096KB). Keeping the gap between the standby and the primary database as small as possible is important for a fast switchover and for preventing large data losses in a crisis situation.
This alarm is invoked when the HADR state is not PEER. Ensuring that the database is in PEER state is important in order to prevent data loss in a crisis situation.
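The HADR connection, log gap, and state checks described above could be evaluated together along the following lines. This is an illustrative sketch only: the `HadrSample` type, its field names, and the status strings are hypothetical and do not represent the cartridge's data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HadrSample:
    # Hypothetical fields; real collections may name or encode these differently.
    connect_status: str  # for example "CONNECTED" or "DISCONNECTED"
    state: str           # for example "PEER"
    log_gap_kb: int      # gap between primary and standby, in KB


def hadr_alarms(sample: HadrSample,
                warning_kb: int = 4096,
                fatal_kb: int = 16384) -> List[str]:
    """Return a description of every HADR condition that would raise an alarm."""
    alarms = []
    if sample.connect_status.upper() == "DISCONNECTED":
        alarms.append("HADR connection is disconnected")
    if sample.state.upper() != "PEER":
        alarms.append("HADR state is not PEER")
    # Assumption: "exceeds" means strictly greater than the threshold.
    if sample.log_gap_kb > fatal_kb:
        alarms.append("HADR log gap: Fatal")
    elif sample.log_gap_kb > warning_kb:
        alarms.append("HADR log gap: Warning")
    return alarms


print(hadr_alarms(HadrSample("CONNECTED", "PEER", 5000)))
```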
This alarm is invoked when a cluster caching facility (CF) server that currently functions as backup CF server is not in PEER state, and therefore cannot serve as a primary CF in case the current primary fails.
This alarm is invoked when a pureScale member is either in error state or not running on its designated host.
Rules related to storage utilization including tablespaces, file systems, and log space.
This alarm fires when the file system is running out of space, that is, when its utilization exceeds a predefined threshold (Fatal: 90%, Warning: 80%). When a file system reaches its full capacity, writing data to it is no longer possible. This inability to add data is critical if the database’s containers reside on this file system, in which case the database’s functionality will be affected.
This alarm fires when the most recent resize attempt has failed.
This alarm fires when the percentage of used tablespace exceeds a predefined threshold (Fatal: 90%, Warning: 80%). If the tablespace is configured to extend automatically (Autostorage), this property is taken into account, and the alarm is not invoked. A tablespace is a set of containers that contain data. A tablespace whose storage space becomes full can no longer store additional data; as a result, the application’s functionality may be adversely affected.
This alarm fires when the percentage of used space within a tablespace exceeds a predefined threshold.
This alarm fires when the percentage of used log space exceeds a predefined threshold (Fatal: 90%, Warning: 80%).
This alarm fires when the percentage of used log space exceeds a predefined threshold (Fatal: 90%, Warning: 80%).
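The Autostorage exemption described above can be sketched as follows. The function and parameter names are hypothetical, and computing utilization from page counts is an assumption about how the percentage is derived.

```python
from typing import Optional


def tablespace_alarm(used_pages: int,
                     total_pages: int,
                     auto_extend: bool,
                     warning_pct: float = 80.0,
                     fatal_pct: float = 90.0) -> Optional[str]:
    """Return "Fatal", "Warning", or None for one tablespace."""
    # A tablespace that extends automatically does not raise the alarm.
    if auto_extend:
        return None
    # Guard against empty or unreported tablespaces (an assumption of this sketch).
    if total_pages <= 0:
        return None
    used_pct = 100.0 * used_pages / total_pages
    if used_pct > fatal_pct:
        return "Fatal"
    if used_pct > warning_pct:
        return "Warning"
    return None


print(tablespace_alarm(used_pages=95, total_pages=100, auto_extend=False))
print(tablespace_alarm(used_pages=95, total_pages=100, auto_extend=True))
```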
Rules related to agent registration and monitoring configuration.
This alarm is invoked when at least one of the monitoring metrics configuration parameters (MON_REQ_METRICS, MON_ACT_METRICS) is set to NONE. With this setting the collection retrieves only partial data, so the relevant dashboards display only partial data.
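As an illustration of the check described above, the sketch below scans a dictionary of configuration values for the two parameters. The dictionary shape and the function name are hypothetical; how the values are actually retrieved from DB2 is outside the scope of this sketch.

```python
from typing import Dict, List

MONITORED_PARAMETERS = ("MON_REQ_METRICS", "MON_ACT_METRICS")


def parameters_set_to_none(db_cfg: Dict[str, str]) -> List[str]:
    """Return the monitoring parameters whose value is NONE."""
    return [
        name
        for name in MONITORED_PARAMETERS
        # Parameters missing from the dictionary are not flagged (an assumption).
        if db_cfg.get(name, "").strip().upper() == "NONE"
    ]


print(parameters_set_to_none({"MON_REQ_METRICS": "BASE", "MON_ACT_METRICS": "NONE"}))
```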
This alarm fires when at least one of the required monitor switches is off. For versions below 9.7.0.1, the required switches are UOW_SW_STATE, STATEMENT_SW_STATE, LOCK_SW_STATE, SORT_SW_STATE, TABLE_SW_STATE, BUFFERPOOL_SW_STATE, and TIMESTAMP_SW_STATE. For version 9.7.0.1 and above, monitor switches are not required. If one of the required switches is off, part of the data will be missing from the collections and from the relevant dashboards.
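The version-dependent behavior described above could look like the following sketch. The switch names come from the text; the dotted version string format, the dictionary of switch states, and the function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

REQUIRED_SWITCHES = (
    "UOW_SW_STATE", "STATEMENT_SW_STATE", "LOCK_SW_STATE", "SORT_SW_STATE",
    "TABLE_SW_STATE", "BUFFERPOOL_SW_STATE", "TIMESTAMP_SW_STATE",
)
SWITCHES_NOT_REQUIRED_FROM = (9, 7, 0, 1)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as "9.5.0.3" into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def switches_off(version: str, switch_states: Dict[str, bool]) -> List[str]:
    """Return the required monitor switches that are off for this version."""
    if parse_version(version) >= SWITCHES_NOT_REQUIRED_FROM:
        return []  # Monitor switches are not required from 9.7.0.1 onward.
    # A switch missing from the dictionary is treated as off (an assumption).
    return [name for name in REQUIRED_SWITCHES if not switch_states.get(name, False)]


states = {name: True for name in REQUIRED_SWITCHES}
states["LOCK_SW_STATE"] = False
print(switches_off("9.5.0.3", states))
print(switches_off("10.1.0.0", states))
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexically, where "10.1" would sort before "9.7".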
Rules related to Fast Communication Manager (FCM) connections between instance members.
This alarm is invoked when the status of the connection to an FCM member is Congested or Inactive. FCM availability issues affect the networking traffic between the instance members, and result in performance issues and instance unavailability.
Rules related to request time, wait time, cache performance, and other historical metrics.
This alarm fires when the index hit ratio of the database’s I/O activity falls below a predefined threshold (Fatal: 0%, Warning: 70%).
This alarm fires when the overall cache hit ratio of the database’s I/O activity falls below a predefined threshold (Fatal: 0%, Warning: 70%).
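Unlike the utilization rules, the hit ratio rules fire when the value falls below a threshold. The sketch below uses the conventional hit ratio formula based on logical and physical reads; that this matches the cartridge's own calculation is an assumption, and the function names are hypothetical.

```python
from typing import Optional


def hit_ratio_pct(logical_reads: int, physical_reads: int) -> Optional[float]:
    """Percentage of logical reads satisfied without a physical read."""
    if logical_reads <= 0:
        return None  # No read activity, so no ratio to report.
    return 100.0 * (logical_reads - physical_reads) / logical_reads


def classify_below(value: float, warning: float = 70.0, fatal: float = 0.0) -> Optional[str]:
    """Return "Fatal", "Warning", or None for a metric that alarms when it is too low."""
    # Assumption: "falls below" means strictly less than the threshold.
    if value < fatal:
        return "Fatal"
    if value < warning:
        return "Warning"
    return None


ratio = hit_ratio_pct(logical_reads=1000, physical_reads=400)
print(ratio, classify_below(ratio))
```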
This alarm is invoked when the percentage of sort overflows incurred by DB2 agents exceeds a predefined threshold (Fatal: 80%, Warning: 60%). Sort overflows can result from various causes, such as a small buffer pool cache, excessive buffer pool throughput, a large number of cache-based sorts, and a DB2 process that does not keep up with the workload. A sort overflow causes the sort to be performed on disk, thereby resulting in performance issues.
This alarm is invoked when the total hit ratio of a specific database buffer pool falls below a predefined threshold (Fatal: 0%, Warning: 70%). Buffer pool operations occur when data is being read from or written to the database memory. A high percentage of buffer pool-related I/O activity can indicate that the buffer cache is either too small or inefficiently used. Such a situation can result in performance issues and excessive disk activity.
This alarm fires when the index hit ratio of the database’s I/O activity falls below a predefined threshold (Fatal: 0%, Warning: 70%).
This alarm fires when the overall cache hit ratio of the database’s I/O activity falls below a predefined threshold (Fatal: 0%, Warning: 70%).
This alarm is invoked when the percentage of sort overflows incurred by DB2 agents exceeds a predefined threshold (Fatal: 80%, Warning: 60%). Sort overflows can result from various causes, such as a small buffer pool cache, excessive buffer pool throughput, a large number of cache-based sorts, and a DB2 process that does not keep up with the workload. A sort overflow causes the sort to be performed on disk, thereby resulting in performance issues.
This alarm fires when deadlocks are encountered for the database (Threshold: 1). A deadlock should be investigated by the DBA, as it can result in a rollback of uncommitted data, which puts application data at risk.
This alarm fires when lock timeouts are encountered for the database (Threshold: 0 lock timeouts).
This alarm is invoked when, during the last lock tree snapshot, the lock was detected as exceeding a predefined number of seconds (Threshold: 90 seconds). Frequent blocking locks can cause waits when data modifications take place, and possibly result in performance issues.
This alarm is invoked when, during the last lock tree snapshot, the lock was detected as exceeding a predefined number of seconds (Threshold: 90 seconds). The alarm contains the lock details.
Rules that generate alarms requiring attention, including log messages and diagnostic information.
This alarm fires when a message is read from the DB2 diagnostic log and its severity meets or exceeds the minimum severity threshold (warning / event / error / critical / severe). The message includes important information about database issues from the DB2 diagnostic log file.
This alarm fires when at least one message is found in the diagnostic log with a severity level that matches or exceeds the minimum alarm severity. The message count reports the number of messages found for each severity.
This alarm fires when the size of the diagnostic log file exceeds a predefined threshold (Fatal: 1000MB).
This alarm is invoked when an operation fails. Review the message to find the cause of the problem.
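The per-severity message count described above might be computed as in the sketch below. The severity ranking simply follows the order in which the levels are listed in this summary; whether that order reflects the actual ranking used by the rule is an assumption, as are the function name and the input shape.

```python
from collections import Counter
from typing import Dict, Iterable

# Assumption: lowest to highest, taken from the order listed in this summary.
SEVERITY_ORDER = ["warning", "event", "error", "critical", "severe"]


def count_alarming_messages(severities: Iterable[str], minimum: str) -> Dict[str, int]:
    """Count messages per severity, keeping only those at or above the minimum."""
    floor = SEVERITY_ORDER.index(minimum.lower())
    counts = Counter(
        sev.lower()
        for sev in severities
        if sev.lower() in SEVERITY_ORDER and SEVERITY_ORDER.index(sev.lower()) >= floor
    )
    return dict(counts)


counts = count_alarming_messages(["Warning", "Error", "Error", "Severe"], minimum="error")
print(counts)
# The alarm would fire when at least one qualifying message was found.
print(bool(counts))
```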
Rules related to database backup, summary, and recovery operations.
This alarm is invoked when the last backup for the database partition has failed. If the database has never been backed up, you risk losing all data in the event of storage device (hardware) failure or mistaken deletion. Lack of backup also prevents recovering data to the requested restore point, before any unwanted changes took place.
This alarm is invoked when the last backup for the database partition has failed. If the database has never been backed up, you risk losing all data in the event of storage device (hardware) failure or mistaken deletion. Lack of backup also prevents recovering data to the requested restore point, before any unwanted changes took place.
This alarm is invoked when the number of days that have passed since the last valid full database backup exceeds a predefined registry variable value (Fatal: 31 days, Warning: 7 days). If no valid full backup of the database has been carried out for several days, you risk losing data in the event of storage device (hardware) failure or mistaken deletion.
This alarm is invoked when no valid full backup date is found. If the database has never been backed up, you risk losing all data in the event of storage device (hardware) failure or mistaken deletion. Lack of backup also prevents recovering data to the requested restore point, before any unwanted changes took place.
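The two backup-age rules above (days since the last valid full backup, and no valid full backup found) can be sketched together. The function name and return strings are hypothetical; the default thresholds are the ones quoted above, and the comparison is assumed to be strictly greater than.

```python
from datetime import datetime
from typing import Optional


def full_backup_alarm(last_full_backup: Optional[datetime],
                      now: datetime,
                      warning_days: int = 7,
                      fatal_days: int = 31) -> Optional[str]:
    """Return an alarm description based on the age of the last valid full backup."""
    if last_full_backup is None:
        return "No valid full backup found"
    age_days = (now - last_full_backup).days  # Whole days elapsed.
    if age_days > fatal_days:
        return "Fatal"
    if age_days > warning_days:
        return "Warning"
    return None


now = datetime(2024, 3, 1)
print(full_backup_alarm(datetime(2024, 2, 20), now))
print(full_backup_alarm(None, now))
```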