Disaster Recovery – planning for the real world

Earlier this week, thousands of Vodafone customers around the south of England lost mobile service (calls, SMS, data). The cause? Theft of equipment at one of Vodafone’s operations centres in Basingstoke.

Vodafone provided some details about the service loss, but did not explain why something as commonplace as an equipment theft caused such a widespread outage. It appears that Vodafone has, or had, a single point of failure (often referred to as ‘SPOF’) in its infrastructure. That is surprising, since a SPOF is usually one of the first aspects of a complex system to be identified and removed or mitigated.

Engineering for fault tolerance (FT) is a huge subject, but broadly speaking, there are several ways in which a SPOF can be avoided (or its effects reduced) in communication systems:

  • Local redundancy – additional functionally identical components located alongside each other. This typically protects against individual component loss or error.
  • Distributed redundancy – additional functionally identical components located at a separate geographical location. This typically protects against ‘whole-location’ loss or errors (such as theft or fire).
  • Data replication – typically used with distributed redundancy, ensuring that data is not bound to a single location.
  • Diversity of implementation – additional functionally equivalent but differently implemented components. Typically used to mitigate unknown or known errors (the same error is unlikely to occur in separate implementations).
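
To put rough numbers on the difference between local and distributed redundancy, here is a minimal sketch in Python. It assumes that failures are independent and that any one surviving copy is enough to keep the service up. The function names and the availability figures are invented for illustration and say nothing about Vodafone's actual network.

```python
def parallel_availability(availability: float, copies: int) -> float:
    """Availability of `copies` independent redundant units, where any one suffices."""
    return 1.0 - (1.0 - availability) ** copies


def local_redundancy(component: float, copies: int, site: float) -> float:
    """All copies share one location, so the location itself is still a SPOF."""
    return site * parallel_availability(component, copies)


def distributed_redundancy(component: float, copies_per_site: int,
                           site: float, sites: int) -> float:
    """The same local setup repeated at several independent locations."""
    one_site = local_redundancy(component, copies_per_site, site)
    return parallel_availability(one_site, sites)


# Illustrative figures only (assumptions, not measurements):
# each component is up 99% of the time, each site 99.9% of the time.
local = local_redundancy(component=0.99, copies=2, site=0.999)
distributed = distributed_redundancy(component=0.99, copies_per_site=2,
                                     site=0.999, sites=2)
print(f"local only:  {local:.7f}")
print(f"distributed: {distributed:.7f}")
```

Under these assumptions, adding more copies at one location can never lift availability above that of the location itself, whereas a second location removes that ceiling.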

Let’s assume for the sake of argument that Vodafone had previously identified the SPOF in the data centre and installed multiple, functionally equivalent components (‘local redundancy’). This would protect the system against a failure of one or more components at that location. However, it appears that Vodafone did not have ‘distributed redundancy’, that is, key components in different geographic locations. If it had implemented such a scheme as part of its disaster recovery plan, then the theft of equipment at a single location should not have caused such major service loss.
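
The same argument can be sketched as a toy model in Python. The component and site names below are hypothetical, and the model simply treats the service as available while at least one component survives. It is an illustration of the reasoning, not a description of how Vodafone's systems work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str   # hypothetical component identifier
    site: str   # hypothetical location identifier


def service_available(components, lost_sites=frozenset(),
                      failed_components=frozenset()):
    """Service stays up if any component is healthy and its site still exists."""
    return any(
        c.site not in lost_sites and c.name not in failed_components
        for c in components
    )


# Local redundancy: two equivalent components side by side at one site.
local_only = [Component("switch-a", "site-1"), Component("switch-b", "site-1")]

# Distributed redundancy: the second component lives at another site.
distributed = [Component("switch-a", "site-1"), Component("switch-b", "site-2")]

# A single component failure is survivable in both layouts.
print(service_available(local_only, failed_components={"switch-a"}))

# Losing the whole site (theft, fire) is only survivable when distributed.
print(service_available(local_only, lost_sites={"site-1"}))
print(service_available(distributed, lost_sites={"site-1"}))
```

In this model, local redundancy rides out the loss of one component but not the loss of the site, which is the failure mode the theft represents.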

The incident raises questions about Vodafone’s core infrastructure and its ability to handle fairly basic failure modes.
