Golborne Vintage Radio

Full Version: Internet crash today 19/7/24
It was bound to happen, and it will be worse in the future.
Seems odd that they pushed it out nationwide and worldwide all at once. Would have thought the design should have allowed gradual, staged installation.
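A minimal sketch, in Python, of what that kind of gradual, ring-based rollout could look like; the ring fractions and the deploy/healthy callbacks are illustrative assumptions, not anything CrowdStrike actually uses.

Code:
import random

def plan_rings(hosts, fractions=(0.01, 0.10, 0.50, 1.0)):
    """Split the fleet into progressively larger deployment rings."""
    shuffled = random.sample(hosts, len(hosts))
    rings, start = [], 0
    for frac in fractions:
        end = max(start + 1, int(len(shuffled) * frac))
        rings.append(shuffled[start:end])
        start = end
    return rings

def roll_out(update, hosts, deploy, healthy):
    """Deploy ring by ring; halt if any ring reports problems."""
    for i, ring in enumerate(plan_rings(hosts)):
        for host in ring:
            deploy(host, update)
        if not all(healthy(host) for host in ring):
            return f"halted at ring {i}; nothing beyond it was touched"
    return "rolled out to the full fleet"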

Ironic that a system made to prevent the very thing that happened ended up causing it.

Gary
The Register reports give some technical insight: https://www.theregister.com/2024/07/19/a...pdate_mess
Let's be clear: This is NOT an "Internet crash".

None of the internet's infrastructure or performance is affected.

What has happened is that CrowdStrike, a supplier of security software mainly to corporates, issued a faulty patch which causes some users' machines to enter a "death loop" of continual reboots.

i.e. this is an end-user infrastructure issue, not in any way an internet issue.

Oh, and the only reason that Windows machines are affected has nothing to do with Windows or Microsoft: it's only that the faulty AV product is one used on Windows machines, and like all AV products, regardless of operating system, it requires very low-level access, hence the damage that "getting it wrong" can cause. It would probably be worse if it were a Unix AV product, as most of the websites in the world run on Unix servers.
(19-07-2024, 02:08 PM)PerdioPal Wrote: [ -> ]It was bound to happen, it will be worse in the future.

Yes, I wrote a book years ago. It was going to be totally apocalyptic, with hundreds of millions dying, but I made it more light-hearted.
"No Silver Lining" never blames Windows (or any OS) at all;
the problem is a more subtle management issue.

See also:
… stupid about using GPS as simply an alternative to buying a stable oscillator or clock. It should only be used for navigation, not for mobile base station, fibre head, DAB and DTT timing. It's a single point of failure, whether by jamming (denial of service), solar flare or operator mismanagement. It's a small one-off capital equipment saving that is absolutely stupid.
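For illustration only, a rough Python sketch of the alternative: GPS disciplines a local oscillator, which carries on in holdover when GPS is jammed or lost, instead of GPS being the sole clock. The class name and drift figure are assumptions for the example, not real equipment specs.

Code:
import time

class HoldoverClock:
    """Local oscillator steered by GPS when available, free-running otherwise."""

    def __init__(self, drift_ppb=5.0):
        self.offset = 0.0           # last correction of local clock vs. GPS (seconds)
        self.last_lock = None       # monotonic time of the last valid GPS fix
        self.drift_ppb = drift_ppb  # assumed oscillator drift, parts per billion

    def discipline(self, gps_time):
        """Call while GPS is locked: steer the local clock to the GPS reference."""
        now = time.monotonic()
        self.offset = gps_time - now
        self.last_lock = now

    def now(self):
        """Return (time estimate, uncertainty); still usable with GPS jammed."""
        local = time.monotonic()
        estimate = local + self.offset
        if self.last_lock is None:
            return estimate, float("inf")
        holdover_error = (local - self.last_lock) * self.drift_ppb * 1e-9
        return estimate, holdover_error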

A cyber security guy writes:
Quote:I think what this has shown is that an enterprise cannot have a single threat-agent dependency.
We need to ensure that we are dual threaded, with the ability to pivot quickly, and that time to resolution is quicker.

But what is also bad is the potato monoculture, like in the 19th century: too many people using Cloudflare, or the same cloud services, or the ISPs and so-called "cloud" providers all relying on the same kind of thing.

Automatic updates are bad.
Written in 2017
Quote:“Why then do you say the Cloud is like potatoes?”
“Let’s take a step back,” insisted Kate. “What happens when banks, mobile phone billing, shops’ real time transactions are all outsourced to the Cloud and there is an automatic patch, or wrong anti-virus definition or a cyber attack and all or most of the Cloud fails?”
“That can’t happen,” said Jim.
“Really?” said Kate. “There have been major outages on all of them. Admittedly usually only a few hours, sometimes longer. Some in the past that were very bad weren’t noticed by the public because the amount outsourced to the Cloud wasn’t critical. Isn’t it going to get worse now that so much that should be in house has been outsourced because it seems cheaper to the accountants and managers who don’t factor the risks?”
“I see what you mean,” said Jim. “So the canaries are servers that are not up to date so as to be vulnerable?”
“Oddly, no,” said Kate. “There are two kinds: one is instances at each of the major cloud providers; the others are fully up to date in Genie-Sys data centres with a variety of edge routers such as Cisco, Juniper and Huawei, as a problem with the so-called Cloud is also the OS in the routers. Even Microsoft is using Linux-based routers.”
“Ah, I get the potatoes,” said Jim. “The providers have many things in common. Can Louise explain all this to the students?”
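As a concrete (and entirely hypothetical) illustration of the canary idea in that excerpt, in Python: the URLs are made up, and "all providers down at once" is taken as the signal of a systemic Cloud problem rather than a local one.

Code:
import urllib.request

CANARIES = {
    "provider_a": "https://canary-a.example.com/health",
    "provider_b": "https://canary-b.example.com/health",
    "provider_c": "https://canary-c.example.com/health",
}

def probe(url, timeout=5):
    """Return True if the canary instance answers its health check."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def check_canaries(canaries=CANARIES):
    results = {name: probe(url) for name, url in canaries.items()}
    failed = [name for name, ok in results.items() if not ok]
    if failed and len(failed) == len(results):
        return "all canaries down - treat as a systemic Cloud event"
    if failed:
        return "isolated failure at: " + ", ".join(failed)
    return "all canaries healthy"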

later
Quote:“We were doing that anyway before our analysts suggested at the beginning of September that the so-called Cloud had reached a tipping point. Failure was deemed to be inevitable, with more severe results than in previous years due to more vital infrastructure and core services being outsourced to it.”
“Amazingly just over two months later it did,” said Jackie. “So is the lesson that the major cloud providers need to do things your way?”
“No. What we did was a temporary fudge, a bodge. People who signed purchase orders, or boards that decided to outsource core business functions, should be sued by their shareholders for negligence. It makes no sense. It’s not even saving money in the long term, or it only saves money because it’s not as robust a solution as is needed. In the future, we will only accept contracts for non-essential hosting. We will be recommending suppliers that can furnish in-house resilient computer solutions. Bank transactions, retail transactions, stock control, major online sales and banking, critical government services, mobile billing etc. should never ever be outsourced to third parties. Those are critical core business functions.”
“So in summary?”
“The Cloud for any critical or core business function is inherently a failure, because everything outsourced fails at the same time. It doesn’t matter how much more reliable it might be than an in-house solution, or how much cheaper. Besides, cloud providers need to make a profit, so in reality it can be no cheaper and no more reliable. Many are selling below cost to build their customer base.”
Jackie said nothing, so to fill the silence, Louise continued.
“Any decent analysis shows it can’t be fixed. I think now I want to go home for a while, maybe have a real ale or cider in an Evesham pub with my mum and dad.”
(19-07-2024, 04:03 PM)Nick Wrote: [ -> ]Let's be clear: This is NOT an "Internet crash".

None of the internet's infrastructure or performance is affected.

What has happened is that CrowdStrike, a supplier of security software mainly to corporates, issued a faulty patch which causes some users' machines to enter a "death loop" of continual reboots.

i.e. this is an end-user infrastructure issue, not in any way an internet issue.

Oh, and the only reason that Windows machines are affected has nothing to do with Windows or Microsoft: it's only that the faulty AV product is one used on Windows machines, and like all AV products, regardless of operating system, it requires very low-level access, hence the damage that "getting it wrong" can cause. It would probably be worse if it were a Unix AV product, as most of the websites in the world run on Unix servers.

Agree with all of that. TOTALLY! It's NEVER really an OS issue. Mostly a management issue.

But there will be a Friday-evening bodged patch/update due to artificial deadlines. You only need two different ones that affect, say, servers and edge routers, and it WILL fall like a house of cards. Then all sorts of things that shouldn't be affected will fail due to outsourcing. It might be a glitch of several hours, or very bad indeed.
Why didn't CrowdStrike do extensive tests on Windows PCs, the same as those used by their customers, to emulate what happens when an update is about to be released?
(20-07-2024, 05:26 AM)Doodlebug Wrote: [ -> ]Why didn't CrowdStrike do extensive tests on Windows PCs, the same as those used by their customers, to emulate what happens when an update is about to be released?

They would have done, but their change and release control processes were inadequate.

It's also mainly servers rather than PCs that were affected.
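A minimal sketch of the sort of release gate the question implies: the update goes onto an in-house pool of machines mirroring the customer estate, and publication is refused unless every one reboots and comes back healthy. The callback names here are placeholders, not any vendor's real tooling.

Code:
def release_gate(update, test_pool, apply_update, reboot_and_wait, is_healthy, publish):
    """Apply the update to a test pool and only publish if every machine survives."""
    for machine in test_pool:
        apply_update(machine, update)
        if not reboot_and_wait(machine, timeout=600):
            return f"BLOCKED: {machine} did not come back after reboot"
        if not is_healthy(machine):
            return f"BLOCKED: {machine} unhealthy after the update"
    publish(update)  # reached only if the whole test pool survived
    return "released"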
Looking at reports on The Register, it's PCs as much as servers. Since the software provides endpoint security it will, of necessity, be running on ordinary PCs and laptops.

The phrase that comes to mind is "single point of failure".
The critical infrastructure is the servers.

As in all such systems, there are risks: it's up to those responsible to mitigate as many of the risks as possible (within their remit/budget) and then to quantify those risks remaining. These in turn have to be signed off and accepted by senior management.
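Purely as an illustration of quantifying and signing off residual risks, a toy risk register in Python; the scoring scale, fields and example entries are assumptions, not any particular standard.

Code:
from dataclasses import dataclass

@dataclass
class Risk:
    description: str
    likelihood: int        # 1 (rare) .. 5 (almost certain)
    impact: int            # 1 (minor) .. 5 (catastrophic)
    mitigation: str = ""
    signed_off_by: str = ""

    @property
    def score(self):
        return self.likelihood * self.impact

register = [
    Risk("Whole fleet depends on a single AV/EDR vendor", 3, 5,
         mitigation="Stagger updates; keep offline recovery media"),
    Risk("Loss of primary data centre", 2, 5,
         mitigation="Warm standby site, tested quarterly"),
]

# Residual risks, highest score first, for senior management to accept or reject.
for risk in sorted(register, key=lambda r: r.score, reverse=True):
    status = risk.signed_off_by or "awaiting sign-off"
    print(f"{risk.score:>2}  {risk.description}  [{status}]")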

This process is a fundamental tenet of DR & BC (Disaster Recovery and Business Continuity) planning. DR is a different situation from BC - they are related but very different things.

In this example, DR is how you get your infrastructure back on air; BC is how you keep the business operational in the interim.

I spent a lot of my career as a CTO for financial institutions and hedge funds with DR&BC central to my responsibilities. It's not an easy job, as designing the processes needed requires a deep and fundamental knowledge of how the organisation works, plus continual refinement and testing. These are "living" processes - there is no "end" to a DR&BC project. There is also a full spectrum of how minor or severe an individual incident may be, from both a DR and a BC perspective: from "the CFO is seriously ill" (BC) or "there's a gas leak next door and we have to evacuate the building" to "the office has burnt down" or "a bomb has gone off" - we were in the West End of London and both the gas leak and bomb scenarios happened, as did an 11 kV underground substation fire coupled with a failure of a backup generator.

In the few times over 35 years I've had to execute (part of) such a plan, there have always been unexpected, left-field events that have had to be handled in real time - you can't plan for everything, so you have to be flexible and very quick thinking. You need a good team, good resources, authority and backing from the executive board, as hard decisions may have to be made.

I had to plan for dirty bombs, pandemics, hacking, infrastructure failures, comms failures (roadworks breaking fibres etc.), utility failures, strikes, fire, theft, heatwaves, floods, commercial risk (upstream supplier failures) etc. Plans have to be workable and regularly rehearsed so everyone is aware of what to do when called upon. It's very, very hard to do well.