Back to Projects
Network Operations

Self-Healing Network Monitor – Ansible, Playwright & AI Analytics

87%
Of incidents auto-remediated

Engineered an autonomous network monitoring platform combining Ansible remediation playbooks, Playwright-driven synthetic testing, and an AI layer that provides real-time health analysis — diagnosing faults and triggering background remediation before users ever notice an issue.

Challenges

  • Network incidents were reactive — engineers only discovered issues after users reported them, often 20–40 minutes into an outage.
  • Repetitive remediation tasks (interface resets, BGP session restarts, DHCP lease flushes) consumed hours of engineer time weekly on toil that could be automated.
  • No single pane of glass for network health — data was fragmented across vendor dashboards with no correlated intelligence.

Solutions

  • Built a real-time network analytics dashboard aggregating SNMP, syslog, and flow data into a unified interface with live topology visualization.
  • Integrated an AI layer that continuously analyzes health signals — flagging anomalies, predicting degradation patterns, and surfacing a plain-English status summary of network condition.
  • Deployed Ansible remediation playbooks covering the most common fault classes — interface bounces, routing protocol resets, DNS/DHCP recovery, and switch config drift correction.
  • Used Playwright for synthetic end-to-end connectivity probes running on a scheduled loop, validating critical paths and triggering playbooks the moment a probe fails.
  • Built an alerting pipeline that pages engineers only when automation fails to resolve an issue, eliminating noise from self-correcting events.

Outcomes

  • 87% of detected network incidents resolved autonomously before user impact — no ticket opened, no engineer paged.
  • Mean time to remediation dropped from an average of 34 minutes to under 4 minutes for automated fault classes.
  • Engineer toil from repetitive remediation tasks reduced by over 70%, freeing the team for higher-value infrastructure work.
  • AI health summaries provided instant situational awareness, reducing time spent correlating logs during incidents by 65%.