By Guy Warren, CEO of ITRS
It should come as no surprise that having a substantial amount of IT often means having lots of monitoring tools. But if one of these tools runs into problems and cannot relay information from a crucial transaction application, how would your firm know? The health of your monitoring tools is as important as the health of the applications and infrastructure they monitor. This is why you need a monitor-of-monitors.
Financial services institutions run several monitoring tools across their vast IT environments to ensure constant availability of their business services. They do this by checking, at regular intervals, the underlying technical services that enable the business.
As IT services grow more complex, spanning from on-premises to the cloud, the potential for IT service disruption, and the associated cost to the business, increases. A disruption or outage can have severe implications not just for an organisation’s revenues, but also for its reputation. If an incident disrupts service, firms will not only have to rebuild investor trust, they may also become susceptible to regulatory inquiries and fines.
Why do monitoring services fail?
Monitoring services can fail for a variety of reasons, but what is abundantly clear is that IT services monitoring plays a very important role in avoiding an outage. When an outage does occur, it is likely due to one or more of the following:
- The service was not being monitored, either because monitoring was never configured or because the monitoring model was outdated
- Monitoring was in place, but no alerts, or too many alerts, were configured
- Alerts failed to catch the operator’s attention, or were lost among too many others in a “sea of red”
This is exactly why it is critical to monitor the health of the monitoring system itself, so that it does not become one of the root causes of an outage.
5 ways to monitor your monitoring tools
Drawing on the insights into the pitfalls of monitoring services mentioned above, here are five fundamental checks to ensure the robustness of your monitoring system.
- Are all your monitoring systems working?
This sounds simple, but firms need to check that monitoring is available for every service, at all times. One way to do this is to apply a simple severity rule to the sampling status of each monitored service. The sampling status then confirms whether the service is in fact being monitored.
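As an illustration only, a severity rule on sampling status might look like the sketch below. It assumes each monitored service reports the time of its last successful sample. The function names, the thresholds and the data layout are hypothetical and do not come from any particular monitoring product.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds: how many sampling intervals may be missed
# before the monitoring of a service is flagged.
WARN_MISSED = 2
CRIT_MISSED = 5


def sampling_severity(last_sample, now, interval):
    """Return OK, WARNING or CRITICAL based on how stale the last sample is."""
    if last_sample is None:
        # Never sampled: the service is not actually being monitored.
        return "CRITICAL"
    missed = (now - last_sample) / interval  # timedelta / timedelta gives a float
    if missed >= CRIT_MISSED:
        return "CRITICAL"
    if missed >= WARN_MISSED:
        return "WARNING"
    return "OK"


def unhealthy_services(last_samples, now, interval):
    """Map each service whose sampling is stale to its severity."""
    result = {}
    for name, last_sample in last_samples.items():
        severity = sampling_severity(last_sample, now, interval)
        if severity != "OK":
            result[name] = severity
    return result
```

Feeding this a dictionary of service names and last-sample times yields only the services whose monitoring has gone quiet, which is the list a monitor-of-monitors would need to raise.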
- Ensuring monitoring of physical and virtual servers
Modern IT infrastructures often consist of a mix of physical and virtual servers, each playing a vital role in delivering various services. Check if all the configured application services are covered in monitoring, whilst keeping in mind that there may be more than one application service on a single server.
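A coverage check of this kind can be sketched as a comparison between what is configured and what is monitored. The example assumes both inventories are available as mappings from server name to the set of application services on that server. The function name and inventory format are hypothetical.

```python
def coverage_gaps(configured, monitored):
    """Return, per server, the configured application services with no monitoring.

    Both arguments map a server name (physical or virtual) to a collection of
    application service names. A server may host more than one service.
    """
    gaps = {}
    for server, services in configured.items():
        missing = set(services) - set(monitored.get(server, ()))
        if missing:
            gaps[server] = sorted(missing)
    return gaps
```

Comparing per server rather than per service name means a service that is monitored on one host but not on another still shows up as a gap.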
- Ensuring certificate compliance
Digital certificates allow firms to verify the identity of the sender or receiver of an electronic message, protecting their websites, networks and devices. Every certificate has an expiry date written into it. Yet unless those dates are actively tracked, an expired certificate often goes unnoticed until it is too late. There needs to be a way to find, and fix, digital certificates that are about to expire. Monitoring tools can help.
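As a minimal sketch of such a check, the Python standard library can read the expiry date of a server's TLS certificate and turn the remaining days into a severity. The thresholds (30 and 7 days) and the function names are assumptions made for illustration. Note that with the default verification context, a certificate that has already expired fails the handshake with an `ssl.SSLCertVerificationError` instead of returning a date.

```python
import socket
import ssl
import time

SECONDS_PER_DAY = 86400


def fetch_not_after(host, port=443, timeout=5.0):
    """Fetch the 'notAfter' field of the certificate presented by host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now_epoch=None):
    """Days remaining before the certificate expires (negative if expired)."""
    if now_epoch is None:
        now_epoch = time.time()
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now_epoch) / SECONDS_PER_DAY


def expiry_severity(days_left, warn_days=30, crit_days=7):
    """Hypothetical severity rule: warn early, escalate close to expiry."""
    if days_left <= crit_days:
        return "CRITICAL"
    if days_left <= warn_days:
        return "WARNING"
    return "OK"
```

A scheduled job could call `fetch_not_after` for each endpoint in the estate and alert on anything that is not `OK`, leaving enough time to renew.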
- Understanding the health of your monitoring system
Effective monitoring alerts play a crucial role in guiding troubleshooting decisions when incidents occur. During an incident, decisions such as restarting a process, restarting a module or failing over to a backup are taken on the basis of those alerts. Consequently, it is important that the health status of the monitoring estate is available to everyone who takes these decisions.
This can be done by reserving a place for the underlying monitoring health on the mission-critical dashboards themselves. The decision maker then knows whether they are relying on correct monitoring data, or whether a break in monitoring services may itself be causing the alert.
Additionally, a date and time display that ticks every second assures the viewer that the dashboard is showing the latest state and has not frozen because of a local workstation issue.
- Keeping on top of reporting and audit
Finally, it is key that the monitoring team publishes daily, weekly or monthly reports to all stakeholders on:
- Lists of servers covered in monitoring and the metrics and regular expressions being monitored
- The data which was evaluated to define an alert, including the data which didn’t breach a threshold
- Lists of applications covered in monitoring
- Lists of existing issues in monitoring
- Lists of critical, warning alerts per application, per server
- Lists of alerts disabled or snoozed
- Lists of alert recipients configured (email and mobile).
Stakeholders are then expected to pinpoint any gaps in the configured monitoring.
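Two of the report items above, alert counts per application and server and the list of disabled or snoozed alerts, could be produced along the following lines. This is a sketch under assumptions: the alert record layout (`application`, `server`, `severity`, `snoozed`) is a hypothetical schema, not the export format of any specific tool.

```python
from collections import Counter


def alert_summary(alerts):
    """Summarise alert records for a stakeholder report.

    Returns a pair: active alert counts keyed by (application, server,
    severity), and a list of (application, server, severity) tuples for
    alerts that are snoozed or disabled.
    """
    counts = Counter()
    snoozed = []
    for alert in alerts:
        key = (alert["application"], alert["server"], alert["severity"])
        if alert.get("snoozed"):
            snoozed.append(key)
        else:
            counts[key] += 1
    return dict(counts), snoozed
```

Keeping snoozed alerts in their own list, rather than silently dropping them from the counts, makes it easier for stakeholders to question why an alert was silenced.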
Luckily, services and products do exist that can partner with mission-critical financial enterprises to continuously mature their monitoring templates as enterprise datacenters transition to hybrid IT. Despite the rapid changes, the core principles of effective monitoring and observability have stood the test of time.
With ITRS Geneos you can monitor and contextualize everything in one single tool, from legacy systems to cutting-edge new technology, from applications, servers, VMs, databases, middleware and cloud services to containers.