Troubleshooting BGP Disruptions in a Large IP Network
BGP Anomaly Detection in an ISP
Jian Wu (U. Michigan), Z. Morley Mao (U. Michigan), Jennifer Rexford (Princeton), Jia Wang (AT&T Labs)
http://www.cs.princeton.edu/~jrex/papers/nsdi05-jian.pdf

Goal
- Identify important anomalies
  - Lost reachability
  - Persistent flapping
  - Large traffic shifts
- Contributions
  - Build a tool to identify a small number of important routing disruptions from a large volume of raw BGP updates in real time
  - Use the tool to characterize routing disruptions in an operational network

Capturing Routing Changes
[Figure: a BGP monitor connected over iBGP to the border routers (BR) of the network, with eBGP sessions at the edge and a CPE device. Labels: "Updates", "Best routes". Caption: large operational network (8/16/2004 to 10/10/2004)]

Challenges
- Large volume of BGP updates
  - Millions daily, very bursty
  - Too much for an operator to manage
- Different than root-cause analysis
  - Identify changes and their effects
  - Focus on actionable events
  - Diagnose causes only in/near the AS

System Architecture
[Figure: processing pipeline fed by BGP updates from the border routers (BR) and by Netflow data]
- BGP updates (10^6) -> BGP Update Grouping -> events (10^5) and persistent flapping prefixes (10^1)
- Events -> Event Classification -> typed events
- Typed events -> Event Correlation -> clusters (10^3) and frequent flapping prefixes (10^1)
- Clusters + Netflow data -> Traffic Impact Prediction -> large disruptions (10^1)

Grouping BGP Updates into Events
Challenge: a single routing change
- leads to multiple update messages
- affects routing decisions at multiple routers
[Figure: BGP updates from the border routers (BR) -> BGP Update Grouping -> events and persistent flapping prefixes]
Solution:
- Group all updates for a prefix with inter-arrival time < 70 seconds
- Flag prefixes with changes lasting > 10 minutes

Grouping Thresholds
Based on data analysis and our understanding of BGP
- Event timeout: 70 seconds
  - 2 * MRAI timer + 10 seconds
  - 98% of inter-arrival times < 70 seconds
- Convergence timeout: 10 minutes
  - BGP usually converges within minutes
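
The grouping rule above can be sketched in a few lines. This is a minimal offline illustration, not the authors' real-time tool: the function name `group_updates` and the `(timestamp, prefix)` input format are assumptions made for the example, and only the two thresholds come from the slides.

```python
from collections import defaultdict

EVENT_TIMEOUT = 70         # seconds; 2 * MRAI + 10 (70 s for a 30 s MRAI timer)
CONVERGENCE_TIMEOUT = 600  # seconds; the 10-minute convergence timeout


def group_updates(updates):
    """Group (timestamp_seconds, prefix) updates into per-prefix events.

    Returns (events, flapping). events is a list of (prefix, start, end)
    tuples; flapping is the set of prefixes that have at least one event
    lasting longer than CONVERGENCE_TIMEOUT.
    """
    by_prefix = defaultdict(list)
    for ts, prefix in updates:
        by_prefix[prefix].append(ts)

    events = []
    for prefix, stamps in by_prefix.items():
        stamps.sort()
        start = last = stamps[0]
        for ts in stamps[1:]:
            if ts - last < EVENT_TIMEOUT:
                last = ts                      # same event continues
            else:
                events.append((prefix, start, last))
                start = last = ts              # gap too long: new event
        events.append((prefix, start, last))

    flapping = {p for p, start, end in events if end - start > CONVERGENCE_TIMEOUT}
    return events, flapping


# Hypothetical input: three updates for one prefix.
events, flapping = group_updates(
    [(0, "10.0.0.0/8"), (30, "10.0.0.0/8"), (200, "10.0.0.0/8")]
)
print(events)    # [('10.0.0.0/8', 0, 30), ('10.0.0.0/8', 200, 200)]
print(flapping)  # set()
```

A prefix that keeps receiving updates less than 70 seconds apart for more than 10 minutes never closes its event, which is how it ends up in the flagged set.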
99.9% of events last less than 10 minutes, consistent with the 10-minute convergence timeout.

Persistent Flapping Prefixes
Surprising finding: 15.2% of updates were caused by persistent flapping prefixes, even though flap damping was enabled!
Causes of persistent flapping
- Conservative damping parameters (78.6%)
- Protocol oscillations due to MED (18.3%)
- Unstable interface or BGP session (3.0%)

Example: Unstable eBGP Session
[Figure: ISP with router labels AE, BE, CE, DE; a peer; a customer; prefix p]
- Flap damping parameters are session-based
- Damping is not implemented for iBGP sessions

Event Classification
Challenge: major concerns in network management
- Changes in reachability
- Heavy load of routing messages on the routers
- Changes in the flow of traffic through the network
[Figure: events -> Event Classification -> typed events]
Solution: classify events by the severity of their impact

Event Category: No Disruption
[Figure: ISP with router labels AE, BE, CE, DE, EE; AS1 and AS2; prefix p; label "No Traffic Shift"]
No Disruption: each of the border routers has no traffic shift. (50.3%)

Event Category: Internal Disruption
[Figure: same topology; label "Internal Traffic Shift"]
Internal Disruption: all of the traffic shifts are internal traffic shifts. (15.6%)

Event Category: Single External Disruption
[Figure: same topology; label "External Traffic Shift"]
Single External Disruption: only one of the traffic shifts is an external traffic shift. (20.7%)

Statistics on Event Classification
Category                        Events   Updates
No Disruption                   50.3%    48.6%
Internal Disruption             15.6%     3.4%
Single External Disruption      20.7%     7.9%
Multiple External Disruption     7.4%    18.2%
Loss/Gain of Reachability        6.0%    21.9%
- The first 3 categories vary significantly from day to day
- Updates per event depend on the type of event and the number of affected routers

Event Correlation
Challenge: a single routing change affects multiple destination prefixes
[Figure: typed events -> Event Correlation -> clusters]
Solution: group events of the same type that occur close in time

eBGP Session Reset
- Caused most single external disruption events
- Check if the number of prefixes using that session as the best route changes dramatically
[Figure: number of prefixes vs. time, with session failure and session recovery marked]
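
The session-reset check can be illustrated with a small sketch. The slides do not say what counts as a "dramatic" change, so the `change_fraction` threshold, the function name, and the sampled `(timestamp, prefix_count)` input are all assumptions chosen for illustration.

```python
def detect_session_resets(counts, change_fraction=0.5):
    """Flag sharp changes in how many prefixes use a session as best route.

    counts: list of (timestamp, prefix_count) samples in time order.
    change_fraction: hypothetical threshold; a relative change at least
    this large between consecutive samples is treated as "dramatic".
    Returns a list of (timestamp, kind), kind being "failure" or "recovery".
    """
    changes = []
    for (_, prev), (ts, cur) in zip(counts, counts[1:]):
        base = max(prev, cur)
        if base == 0:
            continue                       # session carried no prefixes
        if (prev - cur) / base >= change_fraction:
            changes.append((ts, "failure"))
        elif (cur - prev) / base >= change_fraction:
            changes.append((ts, "recovery"))
    return changes


# Hypothetical samples: the count collapses at t=120 and returns at t=240.
samples = [(0, 1000), (60, 1000), (120, 20), (180, 25), (240, 990)]
print(detect_session_resets(samples))
# [(120, 'failure'), (240, 'recovery')]
```

Normalizing by the larger of the two counts keeps the relative change between 0 and 1, so one threshold covers both drops and recoveries.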
Validation with router Syslog reports (95%)

Hot-Potato Changes
[Figure: ISP with router labels AE, BE, CE; numeric labels 11, 9, 10; prefix P]
- Hot-potato routing = route to closest egress point
- Caused internal disruption events
- Validation with OSPF measurement (95%)
[Teixeira et al., SIGMETRICS 04]

Traffic Impact Prediction
Challenge: routing changes have different impacts on the network, depending on the popularity of the destinations
[Figure: clusters + Netflow data from the border routers (BR) -> Traffic Impact Prediction -> large disruptions]
Solution: weigh each cluster by traffic volume
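
As a rough sketch of the weighting step, a cluster's weight can be taken as the sum of the traffic shares of its prefixes. The function names, the dictionary inputs, and the `threshold` cutoff for a "large" disruption are hypothetical; per-prefix byte counts are assumed to have been aggregated from flow records beforehand.

```python
def cluster_traffic_weights(clusters, prefix_bytes):
    """Weigh each cluster by the fraction of total traffic it covers.

    clusters: dict mapping cluster id -> iterable of prefixes.
    prefix_bytes: dict mapping prefix -> measured byte count.
    Returns a dict mapping cluster id -> weight in [0, 1].
    """
    total = sum(prefix_bytes.values())
    if total == 0:
        return {cid: 0.0 for cid in clusters}
    return {
        cid: sum(prefix_bytes.get(p, 0) for p in set(prefixes)) / total
        for cid, prefixes in clusters.items()
    }


def large_disruptions(weights, threshold=0.05):
    """Return cluster ids whose weight meets the (hypothetical) threshold,
    heaviest first."""
    heavy = [cid for cid, w in weights.items() if w >= threshold]
    return sorted(heavy, key=lambda cid: weights[cid], reverse=True)


# Hypothetical data: one popular prefix dominates the traffic.
prefix_bytes = {"a": 900, "b": 90, "c": 10}
clusters = {"session-reset": ["a", "b"], "minor": ["c"]}
weights = cluster_traffic_weights(clusters, prefix_bytes)
print(large_disruptions(weights))  # ['session-reset']
```

Because traffic is heavily skewed toward a few popular prefixes, most clusters get a negligible weight and only a handful need operator attention.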

Traffic Impact Prediction
- Traffic weight
  - Per-prefix measurement from Netflow
  - 10% of prefixes account for 90% of traffic
- Traffic weight of a cluster
  - Sum of the traffic weights of its prefixes
  - A few clusters have large traffic weight
  - Mostly session resets & hot-potato changes

Performance Evaluation
- Memory
  - Static memory: current routes, 600 MB
  - Dynamic memory: clusters, 300 MB
- Speed
  - 99% of 1-second intervals of updates can be processed within 1 second
  - Occasional execution lag
  - Every 70-second interval of updates can be processed within 70 seconds
- Measurements were based on a 900 MHz CPU

Conclusion
- BGP anomaly detection
- Peer should advertise prefixes at all peering points, with the same AS path length
- Allows the AS to do hot-potato routing
- Using iBGP feeds from the border routers
- Some inference tricks to identify inconsistencies
- Results of the study
  - http://www.nanog.org/mtg-0410/feamster.html
  - http://www.cs.princeton.edu/~jrex/papers/imc04.pdf