So… About that Rogers Outage…
This may be an evolving post as new information comes out or as I rethink the arguments presented below.
Friday, July 8, 2022 was a very bad day for Rogers Communications Inc. and its customers. Absolutely every service Rogers provides was unavailable. Internet, Ignite TV, and Cellular all went down. Strangely, and this is important, leased private networks that businesses rely on to connect their facilities also went down. This wasn’t an “Internet Outage”; this was a core infrastructure outage that took down every service that relies on that core. Which happened to include Internet, Ignite TV, Cellular, private leased “dark fiber” and probably other services too.
In other words, calling this core infrastructure outage an “Internet Outage” is like calling the damage from a hurricane a “Housing Outage.”
So what went wrong? We don’t yet know, and we may never know. Rogers may never know for sure.
I’ve been in IT managing and deploying systems since before dial-up Internet. Over the decades I’ve had many an event where the “root cause” was unknowable. Often we run into a “black box” running some sort of code that we cannot penetrate to understand what went wrong. We send in inputs, we gather the outputs, and nothing we try clarifies the issue. To talk to the random person in Nebraska who maintains that one critical piece, we have to know he exists in the first place.
We never do.
The modern digital world is complexity layered on complexity, with buried dependencies we are unaware of. It is so vast that what we think are simple systems are actually complex designs that no one person entirely comprehends on their own. It takes a team of experts, each comprehending their own portion, their own area of expertise, and in between them is where the cracks in understanding form. Those blocks in the image above each have their own layers of dependency. It’s dependencies in three dimensions.
This is why a “root cause” can be impossible to know. There are so many parts, some we are not even aware of, that all of the possible interactions are incalculable. The complexity can reach the scale of weather forecasting. Climate and weather are so complex that even with all of the modelling and the experts, we still cannot predict with absolute certainty where it will rain each day. We can’t say why it rained in North York but not in Brampton.
So yes, we can predict that things will go sideways with digital infrastructure. All of us who work in the field can say that with 100% confidence. What we cannot say is where and when and how it will happen. Well, “everywhere and anytime,” but that isn’t really helpful.
With that complexity come complex reasons for why things happen, too. Multiple factors all had to line up simultaneously for that specific event to happen. The chances of it happening in the first place were one in a billion. The chances of it happening again are essentially zero, because one of the factors has since been removed (equipment replaced, software upgraded). It just means there is a whole new set of one-in-a-billion chains of events waiting to happen instead.
From my own experience, I’ll provide a very simple example. I had a common hardware device installed in a few locations, all running the same firmware version. Call it version 10.1. There were bugs that were causing issues. The vendor released a fix, version 10.2. I had hardware in the lab I could play with, and units in production.
I test in the lab, so the lab equipment goes from 10.1 to 10.2. The fix doesn’t work, so production stays at 10.1. The vendor releases 10.3. I test 10.3, and it fixes the issue. The lab is now at 10.3 and production is still at 10.1.
I upgrade production to 10.3, and things get worse. Not only is the problem not resolved, I’m facing new issues. What was not disclosed to me was that 10.3, unlike every other firmware upgrade before it, had a hidden dependency: 10.2 had to be installed first. I had to revert to buggy 10.1, install buggy 10.2, and then install the fixed 10.3. In this particular instance, I had to install two versions of bad code to get to the one version of good code. The brain has a hard time with that logic. Once 10.4 and later were released, it was perfectly safe to go from 10.1 to 10.4 or later in one upgrade. But getting to 10.3 required 10.2 (which itself sat on top of 10.1).
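To make that kind of hidden dependency concrete, here is a minimal sketch in Python. The version data and the upgrade_path helper are made up for illustration; none of this reflects the vendor’s actual tooling. It simply shows what it looks like when “to reach 10.3 you must pass through 10.2” is written down somewhere a machine can check:

```python
# Hypothetical prerequisite map: target version -> version that must be
# installed immediately before it. Versions with no entry can be
# installed directly from any older release.
PREREQUISITES = {
    "10.3": "10.2",   # the hidden dependency that bit me
    "10.2": "10.1",
}

def upgrade_path(current: str, target: str) -> list[str]:
    """Return the ordered list of versions to install to reach `target`."""
    path = [target]
    # Walk backwards through required intermediate versions until we reach
    # one that can be installed directly on top of `current`.
    while path[0] in PREREQUISITES and PREREQUISITES[path[0]] != current:
        path.insert(0, PREREQUISITES[path[0]])
    return path

# Going to 10.3 from 10.1 forces the buggy 10.2 step first...
print(upgrade_path("10.1", "10.3"))   # ['10.2', '10.3']
# ...while 10.4 (no prerequisite entry) is a single jump.
print(upgrade_path("10.1", "10.4"))   # ['10.4']
```

Had the prerequisite been encoded once like this, nobody would have had to discover it the hard way in production.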
That was one simple piece of equipment. One time. In one particular circumstance. As more of that hardware was deployed, it was always 10.4 or newer. The problem was never experienced again.
The Rogers network is at a scale of complexity I have never experienced, nor, quite frankly, do I ever want to. The fact that it is public infrastructure means that testing upgrades and fixes is really difficult, because customer networks and the public Internet are intertwined with their systems. Rogers doesn’t have “the entire global Internet” in a lab to play with. There is no Internet B to connect to. By that very nature, testing is limited. Every upgrade is, in the end, “click and pray.”
And as in my example, where I was dependent on a vendor releasing code for their hardware, Rogers is dependent on its vendors too. The customer relies on Rogers; Rogers relies on Ericsson and Cisco, among others. What we saw as a “Rogers outage,” Rogers themselves likely saw as an “Ericsson outage.”
And just as I was dependent on that vendor not throwing surprises at me, Rogers is dependent on their vendors to do the same.
This is what leads into my pet theory about what happened. Rogers is aggressively pursuing a 5G modernization intended to realize the full promise of 5G interconnectedness in Canada. Because of this, they are even more dependent on a single vendor getting things right. Rogers is out on the bleeding edge of new communications technology.
And that leads to a conundrum. Every technology has bugs, and new technology often has more bugs than usual. For every bug fix released for any product (how many updates has your smartphone had?), some customer usually had to experience that bug first. Bugs are found in production every day, and fixes for those bugs are then published for other customers to install so they can avoid tripping over them. The bug fix you received exists because someone else experienced the bug first.
It really, really sucks to be the customer to find a major bug in a vendor’s code. Been there, done that, have the scars and the nightmares to prove it.
I think that because Rogers is among a very small group of customers worldwide building infrastructure on 5G hardware from Ericsson, they are much more likely to find the catastrophic bugs.
And that leads to an uncomfortable and unanswerable question. Which is the smarter strategy? Being first to market with new innovations? Or sitting and waiting while your competitors get there first, then getting in once the innovative technology is no longer leading edge? Do you lead the race to the finish, or do you try to catch up from behind?
Fortunately, in my role, I never have to answer that question.
Would more competition avoid this issue?
No. And yes. And perhaps. How’s that for a concrete answer?
Canada has a low-density problem outside of the big downtown areas. The further out you go, the more often entire communities are solely dependent on one telco. The market isn’t big enough even for the current telcos to compete with each other everywhere. And even where there is competition for the customer-facing service, the actual end-to-end network may rely on services leased from the dominant telco in the area.
In Ontario, Telus, Bell and Rogers are the three major cellular providers. Telus is heavily dependent on Bell cellular towers. (Out in Western Canada, Bell is dependent on Telus; they have a shared-services agreement with each other. Rogers is the only truly national cellular carrier in Canada with its own towers coast to coast.)
But because Telus doesn’t have their own fiber in Ontario, they probably have leased fiber from Rogers to connect some Bell cellular towers to the Telus core network. In Ontario, Telus is dependent on both Bell and Rogers being up. When Rogers went down, Telus cellular service was degraded in some areas.
More competition doesn’t ensure more dedicated infrastructure. It actually ensures the reverse: more interconnected dependencies, as telcos share costs while competing to get their brand on your device. So a big outage at one telco can take down the interconnected competing services as well. A Rogers outage ends up impacting Telus customers.
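As a back-of-the-napkin illustration of that ripple effect, here is another small Python sketch. The dependency map is invented and follows my speculation above about leased fiber, not anyone’s real topology; it just models each service as depending on underlying pieces of infrastructure and shows what a single failure takes with it:

```python
# Hypothetical dependency map: service -> infrastructure it needs to be fully up.
# Illustrative only, not the carriers' real network topology.
DEPENDS_ON = {
    "telus_cellular_ontario": {"bell_towers", "rogers_leased_fiber", "telus_core"},
    "rogers_cellular":        {"rogers_towers", "rogers_core"},
    "rogers_leased_fiber":    {"rogers_core"},
}

def is_degraded(service: str, failed: set[str]) -> bool:
    """A service is degraded if it, or anything it depends on, has failed."""
    if service in failed:
        return True
    return any(is_degraded(dep, failed) for dep in DEPENDS_ON.get(service, ()))

# Take out the Rogers core and see what else wobbles.
failed = {"rogers_core"}
for svc in DEPENDS_ON:
    print(svc, "degraded" if is_degraded(svc, failed) else "ok")
# telus_cellular_ontario degraded   <- via the leased Rogers fiber
# rogers_cellular degraded
# rogers_leased_fiber degraded
```

The interesting line is the Telus one: it degrades without anything labelled “Telus” failing at all.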
Would a government regulated infrastructure monopoly for the “backbone” solve the issue?
LOL.
Sorry. No, it would not. That would just swap out a private single point of failure and replace it with a publicly owned single point of failure (or a government-regulated private-monopoly single point of failure). It’s still major infrastructure managed by a single entity with sole responsibility for keeping it up. And that single entity is still dependent on hardware vendors not releasing bug-riddled code. We’ve changed the dress code, but the dance is still to the same tune.
We’re still relying on some random person in Nebraska maintaining that one piece of code.
“CRTC gives Rogers 10 days to explain cause of July 8 outage”
Yeah. A hard deadline really helps. Good job, regulator. It’s good to look like a “Pointy-Haired Boss.”
Full disclosure: I have no confidence that the Canadian Government Parliamentary Committee investigating this outage will be anything other than a bumbling shambles of performative politics, with the uninformed lecturing the informed.
After all, it was the Canadian Government that gave us the Phoenix pay system disaster, a disaster that might have been avoided had the testing it skipped actually been done.