Packet Design’s products are used by many of the world’s largest service providers and network operators, which gives us the opportunity to see how they operate and to witness the sort of issues they deal with in everyday operations. We thought we would share some of these stories periodically through our Life in the Control Plane blog series. After all, life in the control plane is never dull!
When organizations need stable, scalable and secure connectivity between geographically dispersed locations, the solution they turn to is an MPLS Layer 3 VPN offered by their communications service provider (CSP). The CSP delivers the MPLS VPN service to different customers over its existing backbone IP network. Each customer uses the service with its own policies for addressing, routing and security, and its traffic remains isolated from that of other organizations sharing the service.
By using a Layer 3 VPN, the customer achieves significant cost savings because the capex and opex associated with running and managing a network are shared by all customers of the service, and the operational complexities are handled by the CSP. At the same time, the CSP can increase VPN revenues by offering value-added services, such as Quality of Service (QoS) guarantees via traffic engineering.
Enterprise customers primarily depend on the MPLS VPN for connectivity between locations, including the datacenter, critical business applications and remote offices, and downtime can severely impact business continuity. So, they often have stringent service level requirements with financial penalties for the CSP in the event of SLA violations. This means that the CSP must be proactive by identifying and triaging issues before customers complain, avoid prolonged outages as well as repeated intermittent service interruptions, notify customers immediately of known issues that will affect their service, and provide ongoing communications to keep customers apprised of problem status.
Assuring Layer 3 MPLS VPN Performance
In large service provider networks, troubleshooting network and routing issues can take a long time, with the threat of SLA violations looming throughout. In the normal course of troubleshooting a connectivity issue, the network team needs to trace the path of traffic from source to destination, determine the performance of every device hop along the path and then identify the root cause of the issue. When connectivity issues are intermittent, with paths changing and reconverging, troubleshooting is even more difficult.
A Tier 1 CSP customer told us that, before they acquired our products, over seventy percent of the approximately 20,000 trouble tickets opened each month were closed with no resolution. Customers would call to complain about connectivity issues to remote locations or services, but by the time the CSP’s network staff began diagnosis, the problem had disappeared and the network was behaving normally. The CSP had no means to go back and analyze past routing path behavior to find the root cause. Sometimes, to retain customer goodwill, the CSP would pay a penalty even though there was no evidence that the fault was theirs. In other cases where the root cause could be traced, the problems were often found (after several hours of tedious diagnosis) to have been caused by customer changes to their CE device.
Route Explorer, the foundation of the Packet Design Explorer Suite, can simplify and accelerate the troubleshooting of L3 VPN issues. (As explained elsewhere, Route Explorer establishes adjacencies with other routers in the network to receive and collect IGP/BGP routing messages in real time. Using this data, Route Explorer builds and maintains a topology map for the entire network, including overlay services, and provides visibility into routing events. It also provides a DVR-like capability to ‘rewind and play back’ network behavior at a specified point in time.) The link failure use case below shows how Route Explorer enhances visibility into customer VPN services and can accelerate diagnosis of connectivity issues.
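To make the ‘rewind’ idea concrete, here is a minimal Python sketch of a timestamped event log that can rebuild the set of live links as of any past moment. It is not Route Explorer’s implementation or API; the LinkEvent and RoutingEventLog names, the router names and the timestamps are all hypothetical, and real IGP/BGP messages carry far more state than a simple up/down flag.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkEvent:
    """Hypothetical, simplified routing event: a link came up or went down."""
    timestamp: float          # seconds on an arbitrary clock
    link: tuple               # (router_a, router_b), treated as undirected
    up: bool
    cost: int = 1             # IGP metric, only meaningful when up is True


class RoutingEventLog:
    """Stores events in time order so past topologies can be rebuilt."""

    def __init__(self):
        self._events = []

    def record(self, event):
        # Insert after any existing events with the same timestamp.
        times = [e.timestamp for e in self._events]
        self._events.insert(bisect_right(times, event.timestamp), event)

    def snapshot_at(self, when):
        """Replay every event up to and including `when`."""
        links = {}
        for event in self._events:
            if event.timestamp > when:
                break
            key = frozenset(event.link)
            if event.up:
                links[key] = event.cost
            else:
                links.pop(key, None)
        return links


log = RoutingEventLog()
log.record(LinkEvent(100.0, ("PE1", "P1"), True, 10))
log.record(LinkEvent(100.0, ("P1", "PE2"), True, 10))
log.record(LinkEvent(250.0, ("P1", "PE2"), False))      # link fails
log.record(LinkEvent(400.0, ("P1", "PE2"), True, 10))   # link recovers

before = log.snapshot_at(200.0)
during = log.snapshot_at(300.0)
after = log.snapshot_at(500.0)
```

Because the log is append-only and ordered, a failure that has already healed by the time an engineer looks at the network (as in the ticket scenario above) is still visible in the snapshot taken for the time of the complaint.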
When a connectivity issue is reported by an L3 VPN customer, the network team can analyze the routing path from source to destination across the backbone network. The playback feature allows them to select the time when the issue was reported and replay the events for the path from the source to the destination of the service. This shows the links that the path took in the network, which of the links failed, and the new path that traffic took to reach the destination. Drilling into a path shows the contextual performance of all the participating devices and links.
When the replay shows a path change in the backbone, the network team can determine whether it was the cause of the customer’s L3 VPN problem. By using the global search feature they can highlight the specific customer’s Layer 3 VPN service for analysis. The mini-map screenshot below shows a customer’s (TSLA) VPN service with all the attached PE devices.
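The comparison described above (old path, failed link, new path) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the product’s algorithm: the two topologies are hand-written dictionaries with hypothetical router names and metrics, and a plain Dijkstra search stands in for whatever path computation the routers actually perform.

```python
import heapq


def shortest_path(links, src, dst):
    """Dijkstra over undirected links given as {(a, b): cost}."""
    adjacency = {}
    for (a, b), cost in links.items():
        adjacency.setdefault(a, []).append((b, cost))
        adjacency.setdefault(b, []).append((a, cost))

    queue = [(0, src, [src])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == dst:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, cost in adjacency.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (dist + cost, neighbor, path + [neighbor]))
    return None  # destination unreachable


def path_links(path):
    """Return the set of undirected links a path traverses."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


# Hypothetical topology before and after the P1-PE2 link fails.
topology_before = {
    ("PE1", "P1"): 10, ("P1", "PE2"): 10,
    ("PE1", "P2"): 20, ("P2", "PE2"): 20,
}
topology_after = {
    link: cost for link, cost in topology_before.items()
    if link != ("P1", "PE2")
}

old_path = shortest_path(topology_before, "PE1", "PE2")
new_path = shortest_path(topology_after, "PE1", "PE2")

# Links the old path used that no longer exist after the event.
surviving = {frozenset(link) for link in topology_after}
failed_on_path = path_links(old_path) - surviving

print("old path:", " -> ".join(old_path))
print("new path:", " -> ".join(new_path))
print("failed links on old path:", [sorted(link) for link in failed_on_path])
```

Diffing the two paths in this way is what lets a team tie a customer complaint at a given time to a specific backbone event, or rule the backbone out.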
The map also enables the network team to search within the TSLA VPN for the path to a prefix. They can analyze the problem path and see the performance of the routes, devices and links attached to that VPN service and used by the path.
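Searching for the path to a prefix inside one customer’s VPN amounts to a lookup scoped to that VPN’s own routing table, which is also why two customers can use overlapping address space without conflict. A rough Python sketch follows, assuming a hypothetical in-memory table keyed by VPN name; the prefixes, the PE names and the second customer ("OTHER") are invented for illustration, and only the TSLA name comes from the example above.

```python
import ipaddress

# Hypothetical per-VPN routing tables: prefix -> egress PE.
vpn_routes = {
    "TSLA": {"10.1.0.0/16": "PE1", "10.1.5.0/24": "PE3"},
    "OTHER": {"10.1.0.0/16": "PE2"},   # same prefix, different customer
}


def lookup(vpn, address):
    """Longest-prefix match for `address` within a single VPN's table."""
    addr = ipaddress.ip_address(address)
    best = None
    for prefix, egress_pe in vpn_routes.get(vpn, {}).items():
        network = ipaddress.ip_network(prefix)
        if addr in network:
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, egress_pe)
    return best  # (matching network, egress PE) or None


print(lookup("TSLA", "10.1.5.7"))    # matches the more specific /24
print(lookup("OTHER", "10.1.5.7"))   # same address, different VPN and PE
```

Once the egress PE for the prefix is known, the backbone path from the ingress PE to that egress PE is the one to examine for failures or performance problems.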
Hopefully, this common use case gives you an idea of the power of route analytics and how path-aware performance analytics can help CSPs assure the performance of their L3 MPLS VPNs and increase customer satisfaction. If you would like to learn more, please read this customer case study and watch a short Route Explorer demo video of these capabilities.