Network Operations

Self-Healing Network Monitor – Ansible, Playwright & AI Analytics

87% of incidents auto-remediated

Engineered an autonomous network monitoring platform that combines Ansible remediation playbooks, Playwright-driven synthetic testing, and an AI layer for real-time health analysis. The platform diagnoses faults and triggers background remediation before users notice an issue.

Challenges

Incident response was reactive: engineers discovered issues only after users reported them, often 20–40 minutes into an outage.
Repetitive remediation tasks (interface resets, BGP session restarts, DHCP lease flushes) consumed hours of engineer time every week, all of it toil that could be automated.
There was no single pane of glass for network health: data was fragmented across vendor dashboards, with nothing correlating signals between them.

Solutions

Built a real-time network analytics dashboard aggregating SNMP, syslog, and flow data into a unified interface with live topology visualization.
Integrated an AI layer that continuously analyzes health signals: flagging anomalies, predicting degradation patterns, and surfacing a plain-English summary of network condition.
Deployed Ansible remediation playbooks covering the most common fault classes: interface bounces, routing protocol resets, DNS/DHCP recovery, and switch config drift correction.
Used Playwright for synthetic end-to-end connectivity probes running on a scheduled loop, validating critical paths and triggering playbooks the moment a probe fails.
Built an alerting pipeline that pages engineers only when automation fails to resolve an issue, eliminating noise from self-correcting events.
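
The probe, remediate, and page-only-on-failure flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the platform's actual code: `Probe`, `handle_probe`, the `interface_bounce` key, and the fake check and fix functions are all hypothetical names. In a real deployment the check would presumably wrap a Playwright page load and the remediation would invoke an Ansible playbook.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Probe:
    name: str                   # hypothetical label for a critical path
    check: Callable[[], bool]   # returns True when the path is healthy
    fault_class: str            # key used to look up a remediation


def handle_probe(
    probe: Probe,
    remediations: Dict[str, Callable[[], None]],
    page: Callable[[str], None],
) -> str:
    """Run one probe; remediate on failure; page only if that does not help."""
    if probe.check():
        return "healthy"

    remediate = remediations.get(probe.fault_class)
    if remediate is not None:
        # Assumption: in production this step would run a playbook, e.g. via
        # subprocess.run(["ansible-playbook", "<playbook path>"]).
        remediate()
        if probe.check():       # re-probe to confirm the fix worked
            return "auto_remediated"

    # Automation was missing or did not resolve the fault: escalate to a human.
    page(f"{probe.name}: automation did not resolve {probe.fault_class}")
    return "escalated"


# Simulated run: the path starts down and the fake fix brings it back up.
state = {"up": False}


def fake_check() -> bool:
    return state["up"]


def fake_fix() -> None:
    state["up"] = True


pages = []
probe = Probe("login-page", fake_check, "interface_bounce")
result = handle_probe(probe, {"interface_bounce": fake_fix}, pages.append)
print(result, pages)  # auto_remediated []
```

Re-running the probe after remediation is what keeps self-correcting events out of the paging channel: an engineer is notified only when the second check still fails or no playbook exists for the fault class.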

Outcomes

87% of detected network incidents were resolved autonomously before user impact, with no ticket opened and no engineer paged.
Mean time to remediation dropped from 34 minutes to under 4 minutes for automated fault classes.
Engineer toil from repetitive remediation tasks fell by over 70%, freeing the team for higher-value infrastructure work.
AI health summaries provided instant situational awareness, reducing time spent correlating logs during incidents by 65%.
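
To put the remediation-time figure in the same terms as the other percentages, the relative reduction can be worked out directly. The helper below is a hypothetical illustration, and it uses 4 minutes as the upper bound for "under 4 minutes", so the true reduction is at least the value shown.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# 34 minutes before automation, at most 4 minutes after.
mttr_reduction = percent_reduction(34, 4)
print(f"{mttr_reduction:.1f}%")  # 88.2%
```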
