Around-the-clock system admin and engineering support for mission-critical and revenue-critical networks, sites, and services running in clouds, colos, or hybrid environments.
We monitor your full infrastructure from our global monitoring nodes plus a secure monitoring node deployed inside your firewall. We carefully tune the nodes to provide rapid notification of critical and actionable failures and impairments. When such problems occur, our team of experienced system administrators and engineers rapidly deploys to validate and resolve the problem.
During and after critical incidents, we log and communicate all actions taken. After the event, we review everything from monitoring through resolution to improve reliability and response.
All for a fixed monthly cost based on deployment scale.
All of our system monitoring is automated, using Nagios and other tools. All system events, warnings, alerts, and tickets are managed in a secure, client-accessible ticketing system. Our monitoring, paging, ticket tracking, and reporting systems are integrated and web-accessible. Every monitored item has well-defined thresholds that generate warnings, alerts, daytime work tickets, or escalations for immediate action.
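As a rough illustration of how per-item thresholds can map a reading to one of those outcomes, here is a minimal Python sketch. It is not our production configuration: the `Thresholds` class, the `classify` function, the ordering of the four levels, and the disk-usage numbers are all assumptions made up for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    OK = "ok"
    WARNING = "warning"
    ALERT = "alert"
    DAYTIME_TICKET = "daytime work ticket"
    ESCALATION = "escalation for immediate action"


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical per-item thresholds, lowest to highest severity."""
    warning: float
    alert: float
    daytime_ticket: float
    escalation: float


def classify(value: float, t: Thresholds) -> Action:
    """Return the most severe action whose threshold the value has reached."""
    if value >= t.escalation:
        return Action.ESCALATION
    if value >= t.daytime_ticket:
        return Action.DAYTIME_TICKET
    if value >= t.alert:
        return Action.ALERT
    if value >= t.warning:
        return Action.WARNING
    return Action.OK


# Illustrative numbers only: percent of disk space used on one volume.
DISK_USAGE = Thresholds(warning=70, alert=80, daytime_ticket=90, escalation=97)
```

With these example numbers, `classify(85, DISK_USAGE)` yields `Action.ALERT`, while a reading of 98 would be escalated for immediate action.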
Every alert or other incident is assessed regularly, and thresholds, test frequency, and required response are tuned. Initial deployments (including legacy client-deployed monitoring) tend to be too verbose for effective action. However, after a short period of monitoring, our tuning enables us to focus response efforts on actionable problems.
This tuning turns data floods, chaos, and over-communication into crisp, focused, actionable information. And then each week we review, refine, and continually improve.
Once the monitoring nodes identify an event requiring immediate action, our well-honed systems and staff spring into action:
Following agreed recovery procedures maintained in a shared wiki-based run-book:
In the unlikely event that the initial alert does not reach the admin, we escalate to the back-up team. This ensures that an actual, trained admin or engineer is on the case within minutes.
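The escalation rule can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the function `page_with_escalation`, the tier names, and the `notify` callback (assumed to return `True` when the page is acknowledged within the allowed window) are hypothetical and do not describe our actual paging system.

```python
from typing import Callable, Optional, Sequence


def page_with_escalation(
    chain: Sequence[str],
    notify: Callable[[str], bool],
) -> Optional[str]:
    """Page each tier in order and return the first one that acknowledges.

    `notify(tier)` is assumed to send the page and report whether it was
    acknowledged in time. Returns None if no tier acknowledges.
    """
    for tier in chain:
        if notify(tier):
            return tier
    return None


# Hypothetical tiers: the on-call admin first, then the back-up team.
ESCALATION_CHAIN = ["on-call-admin", "backup-team"]


def primary_unreachable(tier: str) -> bool:
    """Stand-in for a real pager: the first page goes unanswered."""
    return tier == "backup-team"


owner = page_with_escalation(ESCALATION_CHAIN, primary_unreachable)
```

In this example the first page goes unanswered, so `owner` ends up as `"backup-team"`. A real pager would also need to handle the case where nobody in the chain acknowledges, which the sketch reports as `None`.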
Effective monitoring and rapid response help with real-time events, but to truly improve the reliability, performance, and stability of a site or service, more is required. To that end, all incidents and alerts are reviewed by our most senior engineering staff, providing feedback for our clients, our system administrators, our engineers, and our monitoring systems.
Each week, we identify problem clusters, emerging trends, and other reliability issues. Then, we identify the action to be taken to improve system reliability. Sometimes run-book updates are needed, sometimes system or network tuning is required, sometimes new monitoring methods need to be added. And sometimes reliability improvements require client actions. In each case, we communicate the planned action and flag the situation for continuing assessment.
The result of this closed-loop feedback process is improved short- and long-term reliability.
With a Red-Alert agreement in place, we will have a trained person on-site or on-line as needed and when needed.
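One simple way to surface problem clusters from a week of incidents is to count how often the same host and check recur. The Python sketch below shows that idea only. The `(host, check)` record shape, the `problem_clusters` function, the cutoff of three incidents, and the sample data are all assumptions for illustration, not a description of our review tooling.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical incident record: (host, check name).
Incident = Tuple[str, str]


def problem_clusters(
    incidents: Iterable[Incident],
    min_count: int = 3,
) -> List[Tuple[Incident, int]]:
    """Return (incident key, count) pairs seen at least `min_count` times,
    most frequent first."""
    counts = Counter(incidents)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


# Made-up week of incidents for the example.
week = (
    [("web1", "http")] * 4
    + [("db1", "disk")] * 3
    + [("web2", "load")]
)

clusters = problem_clusters(week)
```

Here `clusters` contains the `web1` HTTP check (4 incidents) and the `db1` disk check (3 incidents). The single `web2` load incident falls below the cutoff and is left for routine handling.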
Log a request on-line, send an e-mail, or drop us a note on Slack. We'll promptly schedule the work and keep you updated.
Many data center tasks are best done regularly. We can arrange weekly, monthly, or quarterly visits to fit the need.
Copyright 2024 Netzinga, LLC. • All rights reserved. • All trademarks and service marks are the property of their respective owners.