When I talk to network operators about routing security, the question of risk often comes up.
Quite a few well-known and analyzed incidents, like the YouTube prefix hijack or the China Telecom traffic detour, clearly demonstrate vulnerabilities in the routing system that can be exploited and that present a real threat. But while everyone agrees there are vulnerabilities, the real questions are (1) What are the frequency and probability of security incidents, and (2) What are their operational and economic ramifications?
Not answering these questions means ignoring or accepting the risks by default, and that is the dominant trend among ISPs. Having no answers also indicates low operator awareness and, consequently, a lack of motivation to do something about it.
A better grasp of what’s happening in Internet routing would inform operators’ decisions regarding proactive and reactive measures they can deploy to mitigate risk.
To address this issue, the Internet Society organized a routing resiliency measurements workshop on 2-3 November 2012, inviting researchers and operators to share their experience and data and try to find answers to several important questions:
- What level of attack has there been in the past? To what extent do security incidents happen but go unnoticed, or get dealt with inside a single network, possibly causing collateral damage?
- Are the number and impact of service disruptions and malicious activity stable, increasing, or decreasing?
- Can we understand why, and track it collectively?
The workshop was divided into three main sections:
- Measurement methodology and frameworks: We looked at different methodologies, their limitations, and available data sets used for the analysis of suspicious events related to inter-domain routing in the Internet.
- Research analysis and operational data: Participants presented and discussed analysis of data related to routing resilience coming from both researchers and operational experience.
- Metrics and long-term monitoring: This discussion focused on which metrics could usefully represent routing resiliency in the Internet, both to inform operators’ actions and to facilitate long-term monitoring and trend analysis.
We just published a report on the workshop that documents the main points of discussion, the main conclusions, and forward-looking suggestions.
One of the conclusions was that not knowing the operator’s real routing policy makes it difficult to separate legitimate changes from attacks. Several approaches presented at the workshop allow the number of false positives to be minimized, but do not provide an answer to how much goes under the radar.
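To make the point about routing policy concrete, here is a minimal Python sketch. It is not one of the approaches presented at the workshop. It assumes a hypothetical `policy` table mapping each prefix to the origin ASNs its operator considers legitimate, and a hypothetical helper `classify_origin_change`. The prefixes and AS numbers are reserved documentation values, not real networks.

```python
from typing import Dict, Optional, Set

# Hypothetical policy table: prefix -> origin ASNs the operator authorizes.
# In practice this information is often unavailable to an outside observer.
Policy = Dict[str, Set[int]]


def classify_origin_change(prefix: str, new_origin: int,
                           policy: Optional[Policy]) -> str:
    """Label an observed origin change for a prefix.

    Returns "expected" or "suspicious" when the policy for the prefix is
    known, and "unknown" when it is not (the case the workshop identified
    as hard to judge).
    """
    if policy is None or prefix not in policy:
        return "unknown"
    if new_origin in policy[prefix]:
        return "expected"
    return "suspicious"


# Illustrative data only (documentation prefixes and ASNs).
policy: Policy = {"192.0.2.0/24": {64496, 64497}}

observations = [
    ("192.0.2.0/24", 64497),     # backup origin listed in the policy
    ("192.0.2.0/24", 64511),     # origin not listed in the policy
    ("198.51.100.0/24", 64500),  # no policy known for this prefix
]

labels = [classify_origin_change(p, asn, policy) for p, asn in observations]
```

With the policy known, the first two changes can be told apart. The third cannot be labeled either way, which is the share of events that may go under the radar.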
We also observed that many operators do not specifically track routing security incidents, making it difficult to collect meaningful operational statistics. Such statistics are a missing piece that could facilitate risk assessment and help measure the effectiveness of mitigation techniques.
The lack of well-defined, actionable metrics and the lack of a common vocabulary are two of the main limitations for consistent long-term monitoring and trend analysis. It is difficult to say how the system evolves, or whether it is getting better or worse.
Did the workshop have answers to all of these difficult issues? Well, no, but it provided a good starting point for moving forward based on a common understanding of the challenge.