When providing network services to large and globally dispersed enterprises, two of your major blind spots (among others) are the inability to trace routing paths fast enough and the inability to trace routing paths at all.
Say customer “IMPORTANT” raises a trouble ticket reporting that its employees at branch “SOMEPLACE” cannot reach a critical application hosted at data center “FARAWAY.” Your first few troubleshooting steps include, but are not limited to, asking where the problematic SOMEPLACE branch is, finding the PE router for the branch, logging in to it, and running a traceroute from the PE router to the unreachable FARAWAY data center.
In a perfect world
“Traceroute” helps you quickly trace the path from SOMEPLACE to FARAWAY, with hop-by-hop information on every device along the path, and pinpoints the link that has high latency or the device that is dropping packets. You quickly log in to that device, find the root cause, fix it, and close the ticket with a resolution and a root cause analysis for customer IMPORTANT. But…
Easy on paper, but from the CLI?
Every network engineer worth his or her salt loves the CLI, even when it is only to run two commands: traceroute and ping. In the real world, when you run a traceroute from the source PE router to a destination, you may run into multiple roadblocks.
To begin with, in most networks traceroute shows you the interface IP addresses along the path rather than the router names, unless of course you want to do another Star Wars Traceroute or put in the effort to add rDNS records for every IP address on every device along every path in your network. Implausible?
If you end up with nothing but IP addresses, it’s time to open your spreadsheet, find the device, and log in to check for errors, packet drops, latency, memory and CPU utilization, and every other factor that can cause connectivity issues. Once you are done with the first router along the path, log in to the next router, repeat the steps, and continue until you find the offending link and device. Once you do, you may want to check the remaining routers along the path for other error conditions, so that you resolve the service issue on the first attempt and not after multiple tickets. But while you perform this tedious diagnosis, you may already have violated your MTTR SLA with that large customer IMPORTANT. Even if you didn’t, repeated routing issues can tarnish your reputation as a service provider.
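To make the spreadsheet step concrete, here is a minimal sketch of the lookup you end up doing by hand: matching each traceroute hop to a device in an inventory export. The CSV layout, the device names, and the helper functions `load_inventory` and `annotate_hops` are all hypothetical, invented for illustration; your own inventory will look different.

```python
import csv
import io
import ipaddress

# Hypothetical inventory export: one row per router interface.
INVENTORY_CSV = """ip,device,interface
10.0.0.1,pe-someplace-01,ge-0/0/1
10.0.1.2,p-core-01,xe-1/0/0
10.0.2.2,pe-faraway-01,ge-0/0/3
"""


def load_inventory(text):
    """Build a lookup table from interface IP to (device, interface)."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        ipaddress.ip_address(row["ip"]): (row["device"], row["interface"])
        for row in reader
    }


def annotate_hops(hop_ips, inventory):
    """Attach a device and interface name to each traceroute hop IP."""
    annotated = []
    for hop in hop_ips:
        device, interface = inventory.get(
            ipaddress.ip_address(hop), ("unknown", "unknown")
        )
        annotated.append((hop, device, interface))
    return annotated


inventory = load_inventory(INVENTORY_CSV)
# Hop IPs as they might be copied from traceroute output (made-up values).
hops = ["10.0.0.1", "10.0.1.2", "192.0.2.9", "10.0.2.2"]
for ip, device, interface in annotate_hops(hops, inventory):
    print(f"{ip:<12} {device:<18} {interface}")
```

Even with the lookup scripted, every hop still means a separate login and a separate round of checks, and any hop missing from the inventory comes back as "unknown".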
Now imagine that customer IMPORTANT BANK demands to know why their branch “SLEEPLESS” had intermittent connectivity issues on Friday night between 7:00 p.m. and 11:00 p.m. that caused transactions to be aborted, frustrating their customers. Running a traceroute shows you the current state of connectivity and traffic path, but there are no options to go back in time and find out how or why the path changed.
Keep calm and watch a movie… of your network
What’s missing in this scenario is a database of recorded routing events that can be interrogated for back-in-time forensic analysis, especially of intermittent, hard-to-find issues like the one reported by IMPORTANT BANK. With such a database, network engineers could answer questions like: What path was taken by traffic from SLEEPLESS to the FARAWAY data center? Did the path change? If so, when, and how often? What alternate path did the traffic take, and did the performance of the new path violate SLAs?
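As a rough illustration of what interrogating such a database could look like, the sketch below keeps timestamped path records in memory and answers two of those questions: which path was in use at a given moment, and when the path changed within a window. The `PathRecord` type, the functions, the router names, and the timestamps are assumptions made up for this example; they do not describe how any particular product stores or queries routing events.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class PathRecord:
    """One recorded path between two sites (hypothetical schema)."""
    timestamp: datetime
    source: str
    destination: str
    hops: tuple


def path_at(records, source, destination, when):
    """Return the most recent record at or before `when`, or None."""
    candidates = [
        r for r in records
        if r.source == source and r.destination == destination
        and r.timestamp <= when
    ]
    return max(candidates, key=lambda r: r.timestamp, default=None)


def path_changes(records, source, destination, start, end):
    """Return (timestamp, old_hops, new_hops) for each change in [start, end]."""
    relevant = sorted(
        (r for r in records
         if r.source == source and r.destination == destination),
        key=lambda r: r.timestamp,
    )
    changes = []
    previous = None
    for record in relevant:
        changed = previous is not None and record.hops != previous.hops
        if changed and start <= record.timestamp <= end:
            changes.append((record.timestamp, previous.hops, record.hops))
        previous = record
    return changes


# Illustrative data only: two alternating paths on one evening.
PATH_A = ("pe-sleepless", "p-core-01", "pe-faraway")
PATH_B = ("pe-sleepless", "p-core-02", "p-core-03", "pe-faraway")
RECORDS = [
    PathRecord(datetime(2024, 3, 1, 18, 0), "SLEEPLESS", "FARAWAY", PATH_A),
    PathRecord(datetime(2024, 3, 1, 19, 42), "SLEEPLESS", "FARAWAY", PATH_B),
    PathRecord(datetime(2024, 3, 1, 20, 5), "SLEEPLESS", "FARAWAY", PATH_A),
    PathRecord(datetime(2024, 3, 1, 22, 17), "SLEEPLESS", "FARAWAY", PATH_B),
]

changes = path_changes(
    RECORDS, "SLEEPLESS", "FARAWAY",
    datetime(2024, 3, 1, 19, 0), datetime(2024, 3, 1, 23, 0),
)
for when, old, new in changes:
    print(f"{when:%H:%M}  {' > '.join(old)}  ->  {' > '.join(new)}")
```

With records like these, "did the path change, when, and how often?" becomes a query over stored history rather than a guess based on whatever traceroute shows you today.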
There are tools that can overlay traceroute results in a map view and others that collect routing information periodically to create a topology map. However, these fail to capture service-impacting events that occur between collection cycles.
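A toy simulation shows why the collection interval matters. The timeline, the path labels, and the 15-minute polling interval below are invented for illustration: the path flaps twice for three minutes each, and a poller that samples every 15 minutes never notices.

```python
from bisect import bisect_right

# Hypothetical timeline of path changes: (minute offset, active path).
EVENTS = [
    (0, "via-core-01"),
    (17, "via-core-02"),
    (20, "via-core-01"),
    (41, "via-core-02"),
    (44, "via-core-01"),
]


def path_at_minute(events, minute):
    """Return the path active at `minute` (events sorted, first at minute 0)."""
    times = [t for t, _ in events]
    index = bisect_right(times, minute) - 1
    return events[index][1]


def polled_view(events, interval, duration):
    """Sample the active path every `interval` minutes, as a poller would."""
    return [
        (minute, path_at_minute(events, minute))
        for minute in range(0, duration + 1, interval)
    ]


def changes_seen(samples):
    """Count how many consecutive samples differ."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a[1] != b[1])


polls = polled_view(EVENTS, interval=15, duration=60)
print(f"polling saw {changes_seen(polls)} changes; "
      f"the event stream recorded {len(EVENTS) - 1}")
# prints: polling saw 0 changes; the event stream recorded 4
```

Shortening the interval narrows the gap but never closes it; only recording each event as it happens captures every change.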
The only way to have full visibility into historical performance is to capture path information in real time with no human intervention, record it, and store it for playback when you need to go back in time to troubleshoot an issue. We believe that our Route Explorer is the only product that can do this (we’d love to know if you think we’re wrong). If you are not convinced, take a look at this video to find out how its path search and network DVR capabilities can speed the resolution of your IMPORTANT customer issues.