Improving your reliability in clouds and colos.

Around-the-clock system admin and engineering support for mission-critical and revenue-critical networks, sites, and services running in clouds, colos, or in hybrid environments.

We monitor your full infrastructure from our global monitoring nodes PLUS a secure monitoring node deployed inside your firewall. We carefully tune the nodes to provide rapid notification of critical and actionable failures and impairments. When such problems occur, our team of experienced system administrators and engineers responds rapidly to validate and resolve the problem.

During and after critical incidents, we log and communicate all actions taken. After the event, we review everything from monitoring through resolution to improve reliability and response.

All for a fixed monthly cost based on deployment scale.


Tuned Monitoring

All of our system monitoring is automated, using Nagios and other tools. All system events, warnings, alerts and tickets are managed in a secure, client-accessible ticketing system. Our monitoring, paging, ticket tracking and reporting systems are integrated and web-accessible. Every item monitored has well-defined thresholds to generate warnings, alerts, daytime work tickets, or escalations for immediate action.
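As a rough illustration of per-item thresholds, the following is a minimal sketch rather than our production configuration. The action names, their ordering, and the disk-usage limits are assumptions chosen for the example; the function simply returns the action attached to the highest limit a measurement has reached.

```python
def classify(value: float, limits: dict[str, float]) -> str:
    """Return the action whose limit is the highest one that `value` has reached."""
    reached = [(limit, action) for action, limit in limits.items() if value >= limit]
    return max(reached)[1] if reached else "ok"


# Hypothetical limits for a disk-usage check, in percent.
DISK_USAGE_PERCENT = {
    "warning": 80.0,
    "daytime_ticket": 85.0,
    "alert": 90.0,
    "escalation": 97.0,
}

print(classify(92.5, DISK_USAGE_PERCENT))
```

In practice each monitored item would carry its own limits, and those limits are what the tuning process adjusts over time.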

Every alert or other incident is assessed regularly, and thresholds, test frequency, and required response are tuned. Initial deployments (including legacy client-deployed monitoring) tend to be too verbose for effective action. After a short period of monitoring, however, our tuning enables us to focus response efforts on actionable problems.

This tuning turns data floods, chaos, and over-communication into crisp, focused, actionable information. And then each week we review, refine, and continually improve.

Real-Time Recovery

Once the monitoring nodes identify an event requiring immediate action, our well-honed systems and staff spring into action:

  1. Filter to eliminate false positives.
  2. Consolidate alerts for true positives and send to the on-duty system administrator and to the trouble ticket system.
  3. Notify all parties who request real-time alerts.
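The filter and consolidate steps above can be sketched as follows. This is an illustration only: the alert fields and the false-positive rule (ignore a check until it has failed a set number of times in a row) are assumptions, not a description of the actual filtering logic.

```python
from collections import defaultdict


def consolidate(raw_alerts: list[dict], min_failures: int = 2) -> dict[str, list[str]]:
    """Drop likely false positives, then group the remaining checks by host."""
    by_host = defaultdict(list)
    for alert in raw_alerts:
        # Assumed false-positive rule: a check must fail at least
        # `min_failures` times in a row before it counts.
        if alert["consecutive_failures"] >= min_failures:
            by_host[alert["host"]].append(alert["check"])
    return dict(by_host)


# Hypothetical raw alerts from the monitoring nodes.
raw = [
    {"host": "web1", "check": "http", "consecutive_failures": 3},
    {"host": "web1", "check": "load", "consecutive_failures": 2},
    {"host": "db1", "check": "ping", "consecutive_failures": 1},
]

for host, checks in consolidate(raw).items():
    print(f"{host}: {', '.join(checks)}")
```

Each consolidated entry would then become one page to the on-duty system administrator and one trouble ticket, instead of one message per failing check.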

Following agreed recovery procedures maintained in a shared wiki-based run-book:

  1. The system admin acknowledges the alert.
  2. The system admin verifies the problem has not self-resolved.
  3. The system admin begins recovery.
  4. If needed, the system admin escalates to an engineer.

In the unlikely event the initial alert does not reach the admin, an escalation to the back-up team is sent. This ensures that an actual, trained admin or engineer is on the case within minutes.
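A minimal sketch of that escalation rule, assuming a five-minute acknowledgement window. The window length and the names used here are illustrative assumptions, not stated service terms.

```python
def who_to_page(seconds_since_alert: float, acknowledged: bool,
                ack_timeout: float = 300.0) -> str:
    """Decide who should be paged for an alert at this moment."""
    if acknowledged:
        return "nobody"  # someone is already on the case
    if seconds_since_alert >= ack_timeout:
        return "backup_team"  # the initial page went unanswered
    return "on_duty_admin"


print(who_to_page(60, acknowledged=False))
print(who_to_page(420, acknowledged=False))
```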

Reliability Engineering

Effective monitoring and rapid response help resolve real-time events, but truly improving the reliability, performance, and stability of a site or service requires more. To that end, all incidents and alerts are reviewed by our most senior engineering staff, providing feedback for our clients, our system administrators, our engineers, and our monitoring systems.

Each week, we identify problem clusters, emerging trends, and other reliability issues. Then, we identify the action to be taken to improve system reliability. Sometimes run-book updates are needed, sometimes system or network tuning is required, and sometimes new monitoring methods need to be added. And sometimes reliability improvements require client actions. In each case, we communicate the recommended action and flag the situation for continuing assessment.
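One simple way to surface problem clusters is to count how often the same host and check appear in a week of incidents. The sketch below assumes that definition of a cluster and a cutoff of three occurrences; both are illustrative choices, and a real review also relies on engineering judgment.

```python
from collections import Counter


def problem_clusters(incidents: list[tuple[str, str]], min_count: int = 3):
    """Return (host, check) pairs seen at least `min_count` times, most frequent first."""
    counts = Counter(incidents)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


# Hypothetical week of incidents as (host, check) pairs.
week = [("web1", "http")] * 4 + [("db1", "disk")] * 3 + [("web2", "ping")]

for (host, check), n in problem_clusters(week):
    print(f"{host} {check}: {n} incidents")
```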

The result of this closed-loop feedback process is improved short- and long-term reliability.

On-Site Data Center Services

Since 2004 we have resolved more than 200,000 issues for dozens of clients.

Who are we? Experienced people who work for you, on-site for emergencies and routine issues. People who can take care of everything from drive replacements to full colo deployments. People providing responsive service at a reasonable cost.

Red-Alert Emergency Work

With a Red-Alert agreement in place, we will have a trained person on-site or on-line as needed and when needed.

  • Colo Failure Events
  • Hardware or Equipment Failures
  • Urgent Console Recovery
  • On-Site Recovery Support

On-Demand Data Center Work

Log a request on-line, send an e-mail, or drop us a note on Slack. We'll promptly schedule the work and keep you updated.

  • Drive Replacement
  • Trouble-Shooting
  • Equipment Installation
  • Vendor Escort and Assistance

Periodic and Preventative Work

Many data center tasks are best done regularly. We can arrange weekly, monthly, or quarterly visits to fit the need.

  • Visual Inspection and Issue ID
  • Inventory Audits and Correction
  • Cable Management and Clean-Up
  • Power Audits and Balancing

Copyright 2024 Netzinga, LLC. • All rights reserved. • All trademarks and service marks are the property of their respective owners.