In the early hours of Friday, July 19th, reports began to arrive that many Microsoft Windows systems were mysteriously failing to boot. Airlines were affected, with ticketing systems unable to function and flights being canceled, leaving passengers stranded. Banks were not able to perform some functions. Payrolls were missed and even news organizations were not able to present the latest news (which, ironically, was often about the mysterious Microsoft issues). Even 911 emergency calling was affected in some states.
The “Blue Screen of Death” (BSOD) was seemingly everywhere. Within a short time some were calling it the “Largest IT Outage in History” and saying that it caused “global chaos”.
It was soon determined that code from a security company called “CrowdStrike”, delivered to some Microsoft customers as an automatic update, had failed, not allowing the systems to boot.
Even though the problem was diagnosed quickly and a patch was created and distributed, the damage was done and some systems remained down. Estimates of 18 billion dollars in damages were being made as early as noon on July 19th. Eventually these were adjusted to “only” five to ten billion dollars, but obviously the real costs will be a long time in coming.
How could this happen? Who was responsible? Who will pay for the damages? What could have been done to prevent it?
All of these questions will be answered.
CrowdStrike makes code that tries to prevent attacks on Microsoft (and other) systems. It reaches into the kernel of the system and looks at many features such as opening files to see if there is some pattern that could be a hacker or other malware. If there is, it takes immediate action to shut down the application.
The problem came because a part of the program that was looking for data, when no data existed, tried to use a NULL pointer, and the operating system killed the program. If CrowdStrike had been a “user-level” program it would have had less of an impact, but because CrowdStrike needed to run in the same privileged mode as the operating system, CrowdStrike crashed and took down the whole system.
As the system started to reboot, CrowdStrike once again encountered the same problem and crashed again. And again. And again.
The solution was to distribute a patch to the CrowdStrike program that would fix the NULL pointer issue, but to deliver that patch the Microsoft system had to be up and running so the CrowdStrike patch could be downloaded and applied…and of course the system was not running…it had crashed. Catch-22.
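The difference between a user-level crash and a kernel-level crash can be illustrated with a loose analogy in Python. This is a minimal sketch, not CrowdStrike's code: `None` stands in for the NULL pointer, the two "update" strings are hypothetical, and a child process stands in for a user-level program. The point is only that when the faulty code runs in its own process, the failure kills that process and nothing else.

```python
import subprocess
import sys

# Hypothetical stand-ins for a content update. In the faulty one, None plays
# the role of the NULL pointer: indexing it raises TypeError in the child.
FAULTY_UPDATE = "data = None\nprint(data['pattern'])"
GOOD_UPDATE = "data = {'pattern': 'ok'}\nprint(data['pattern'])"


def run_isolated(code: str) -> bool:
    """Run code in a separate user-level process; return True if it exited cleanly."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.returncode == 0


if __name__ == "__main__":
    for name, code in [("good", GOOD_UPDATE), ("faulty", FAULTY_UPDATE)]:
        if run_isolated(code):
            print(f"{name} update: ran cleanly")
        else:
            print(f"{name} update: crashed, but only the child process died")
    print("parent process is still running")
```

Code running in the kernel has no such parent to fall back on, which is why the same class of mistake there takes the whole machine down.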
It should be pointed out that CrowdStrike is not made by Microsoft, nor is it installed on all of the (estimated) 1.4 billion computers world-wide that run Microsoft software. In fact Microsoft estimates that “only” 8 million computers were affected by the outage. If this is true it affected only about 0.6% of the Microsoft systems.
It was just that those 8 million were very visible computers. Those eight million were responsible for some very expensive transactions by very upset people. And the affected systems included those of news organizations.
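The share of affected machines can be checked with a few lines of arithmetic. The sketch below uses only the two estimates quoted above; the function and variable names are illustrative.

```python
# Both figures are the estimates quoted in the text, not measured values.
AFFECTED_SYSTEMS = 8_000_000            # computers Microsoft estimates were hit
WINDOWS_INSTALL_BASE = 1_400_000_000    # estimated computers running Microsoft software


def affected_share(affected: int, total: int) -> float:
    """Return the affected part of the install base as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * affected / total


share = affected_share(AFFECTED_SYSTEMS, WINDOWS_INSTALL_BASE)
print(f"{share:.2f}% of systems affected")
```

Eight million out of 1.4 billion works out to a little under 0.6%, well below one percent of the installed base.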
Microsoft is not directly the cause of this outage. Microsoft was simply the channel to the eight million FORMERLY VERY USEFUL commercial systems that CrowdStrike was trying to protect. However Microsoft’s monolithic architecture and delivery system helped CrowdStrike deliver the fatal blow.
You can think of the Microsoft ecosystem like the banana industry. Once there were many types of bananas, but as the banana industry started to grow in the United States, the banana producing companies went to Latin America and started to standardize on bananas that were easy to ship, had a slow ripening rate which allowed them to be shipped and ripen before they went bad. They standardized on a particular variety of banana and everything was fine…until the fungus hit.
Then in a very short time that banana variety was wiped out and the industry frantically looked for a fungus-resistant banana that had most of the characteristics they wanted. They found that banana in an English hothouse, the Cavendish banana.
They started growing the Cavendish and now it represents a huge amount of the bananas being shipped and eaten in the USA.
Then another fungus hit….and the Cavendish banana is in the process of being wiped out.
Fortunately there are over 500 other different types of bananas and we also have gene modification that could help change bananas to what we really need. Of course that would be hard to sell as “GMO free”…..
So Microsoft Windows is a bit like bananas…open to a fungus or virus.
“We have met the enemy and he is us” was a famous quote of Pogo in the comic strip by Walt Kelly. It represents the final culprit in this play and one that has mostly gone unmentioned…the end customer. To be precise, it is the systems administrator, whether it be a highly paid employee of a company that hosts hundreds of PCs, or the simple store owner with a couple of Point of Sale terminals running Microsoft code.
In either case, the end user allowed untested code or data to be downloaded and installed into their environment, and it brought their systems to their collective knees. Caveat Emptor!
Even worse, they are systems that, for all intents and purposes, become a black hole into which the end-user’s company pours money, hoping to get functionality out of the monolithic Microsoft system until things go bananas.
There are a couple of other actors in this play, mentioned once or twice then forgotten in the discourse: Apple and Linux.
Mentioned by the press as being impervious to the BSOD were Apple products and GNU/Linux (and to be complete Android and BSD). “Impervious” is too strong a word, but there are often major differences in their environments that make them more “resistant”, just like some strains of bananas are more resistant to blight than others.
Apple is a well-crafted, well-controlled system distribution. It is also monolithic, but in smaller numbers and well vetted by Apple engineers who control everything about the operating system and its permissions and inclusions from the top down. Apple also controls the applications in their official store and would (in my humble opinion) not let something run as a system-level application without a huge amount of testing and control.
A monolithic system of hardware and software with everything written by one entity is one of the easiest to control and get right, in my opinion. Of course it is also one of the least flexible in options, since many options are limited to those available only from the vendor.
In addition, the basis of Apple’s OS is a variant of FreeBSD, an offshoot of the original Berkeley Software Distribution of a Unix-like system. BSD is a majorly different design from Microsoft’s system, and has different ways of providing anything like CrowdStrike functionality…ways that might allow the same results of security, but would not bring down the whole system.
NetBSD in particular adds another factor in running on several different architectures. This diversity would repel malware attacks that depend on one architecture, such as the Intel/AMD ISA.
GNU/Linux is a complete re-implementation of a Unix-like system with freely available source code and (like some of the BSDs) runs on a range of hardware architectures. However unlike any of the above, the term “Linux Operating System” is a bit of a misnomer. What is generally called “Linux” and can be argued to more properly be called “GNU/Linux” is actually made up of separately made parts: A kernel called “Linux”; a set of libraries, tools and compilers called “GNU”; a method of delivering these parts to the user called a “package manager”, and a method of installing all of this to a system. In some cases you have a graphical subsystem that uses additional libraries to create a “desktop”, and remarkably there are two major (and many minor) ones out there, KDE and GNOME.
In addition to all of this, Microsoft typically has two major file-systems: NTFS and FAT. Linux has many. This is important since much malware may depend on how the data is laid out on the disk.
In addition to all of the diversity in “GNU/Linux” you also have “distributions” that most people know by their project or company name such as Debian, Slackware, Red Hat, SUSE, Fedora, Rocky Linux, etc. etc. These projects take all of the items mentioned before and add several more things, whether it be systems administration programs, database programs, etc.
These distribution binaries are built by the individual distributions. Even if they use the same source code, there can be differences in the binaries created due to differences in compilers and optimization flags on those compilers.
Distributions may use the same kernel as one another, but the kernel sources often come from different releases and are compiled and distributed at different times. Therefore even a company using “GNU/Linux” throughout, if it is using different distributions, is typically being updated and patched at different times. GNU/Linux is not monolithic.
One other difference between GNU/Linux and the other operating systems is that it is licensed under the General Public License (GPL) from the Free Software Foundation. This REQUIRES that all binary distributions of GNU code make available the source code and the methods of building it to the end user customer who receives the binaries. This means the end user customers can see how the software works and either fix the software themselves or hire a third party to fix the software.
I have left Android for last as it is made up of a Linux kernel and other software from an operating system vendor called Alphabet (Google). Its code may be updated directly from Google via the Internet or updated from OEMs that use the Android operating system for their phones, tablets, netbooks and other devices.
From a multi-vendor distribution system viewpoint, Android is less open to the type of update situation that hit Microsoft/CrowdStrike.
If you are looking for someone to blame for the recent outages, the blame falls squarely on three sets of shoulders:
CrowdStrike: for not testing and vetting the code properly
Microsoft: for not creating a more robust environment, one that would not allow a third-party program to bring down their whole system
The end user: for having an upgrade strategy that allows new code to go directly to production with no ability to test and stop bad code, or to recover from bad code that has been accepted. When I say “end user”, I am not talking about the home or small business laptop/desktop user. I am talking about the airlines, banks, news organizations, etc., exactly the “end users” we heard about on the news. The ones that thought their systems were so valuable that they bought the licenses for CrowdStrike, but did not spend the money or time to protect themselves from this single point of failure. THOSE END USERS.
First of all never have an automatic direct path from a vendor to your working environment. Treat your vendor with at least as much suspicion as you would any other distributor of malware. New versions of the operating system and new patches should first be installed in what is called a “sandbox”, a system that is not used for anything but testing new updates to the OS. If absolutely necessary this system could be used for other things, but NOTHING that is mission critical…like PAYROLL. After the changes have been vetted in the sandbox environment, then they can be deployed to other systems.
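The sandbox rule can be expressed as a simple promotion gate. The following is a minimal sketch under stated assumptions: the `SandboxResult` record, the `may_promote` function, the host names and the 24-hour soak period are all hypothetical, chosen only to show the shape of the check. It is not any vendor's actual deployment tooling.

```python
from dataclasses import dataclass


@dataclass
class SandboxResult:
    """Outcome of applying an update to one sandbox (test-only) host."""
    host: str
    booted: bool          # did the host come back up after the update?
    hours_healthy: float  # hours it has run cleanly since the update


def may_promote(results: list[SandboxResult], min_soak_hours: float = 24.0) -> bool:
    """Allow promotion to production only if every sandbox host passed."""
    if not results:
        return False  # no sandbox evidence at all: never promote blind
    return all(r.booted and r.hours_healthy >= min_soak_hours for r in results)


# Illustrative data: one host failed to boot, so the update is held back.
bad_update = [
    SandboxResult("sandbox-1", booted=True, hours_healthy=30.0),
    SandboxResult("sandbox-2", booted=False, hours_healthy=0.0),
]
good_update = [
    SandboxResult("sandbox-1", booted=True, hours_healthy=30.0),
    SandboxResult("sandbox-2", booted=True, hours_healthy=26.5),
]

print("bad update may go to production:", may_promote(bad_update))
print("good update may go to production:", may_promote(good_update))
```

The important design choice is the default: with no sandbox results, the answer is "do not deploy".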
Second, have a roll-back plan. Make sure you can roll back your entire system to at least one previous version. This should be as simple and documented a procedure as possible. After all, when your production system has been destroyed is when your administrators are most under pressure and most likely to make mistakes.
Most, if not all, of the Linux Distributions I have used (and I have used a lot) have three characteristics that help with this:
Boot to single-user mode
This allows a system to boot partway. It does not go through any “start up” routines, has no file systems mounted other than a small “root” partition, and is a minimal system for doing anything else. In a properly designed system an application like “CrowdStrike” would not be started at this phase. There are enough tools available to look at system logs (which would tell you what happened in past boots) and stop rogue programs like CrowdStrike from starting.
Boot previous kernel
Every time a kernel is changed on a Linux system a backup is made in the root directory and is available for booting. If a mistake has been made in the kernel itself and that keeps the system from booting, the previous kernel is available for booting to allow the process to continue. No waiting for a vendor to send out a patch
Boot a “live” system
Most Linux Distributions allow the end user to make a “live” system that runs off the installation media (CD or Flash drive) and out of RAM. It does not need any of the resources that exist on the affected system, and therefore can run no matter what happened to the down system. This allows the systems administrator to have a completely functioning system with all the tools necessary to investigate and repair the down system(s).
For all of the diversity in “Linux” systems, the administration of these diverse systems can be understood and implemented by Open Source tools that are easily modified by the end user teams.
Here comes the great part….the end user will pay for it.
I will say at this point that I Am Not A Lawyer (IANAL), but I have worked for enough system and software companies, and written enough licenses and warranties, to know that the basic software warranty is worthless. It is worth less than the paper it is written on, and for the most part it is not even written on paper.
I also spent four years working for the largest multi-line insurance company in the world, so I know a little about insurance, actuaries, etc.
Finally I had four terms of business law in university (albeit fifty years ago) and I understand a bit about suing companies large and small.
There is a simple reason for useless warranties. The software vendor’s lawyers will not let them write a warranty for their software that has any usefulness whatsoever. The software vendor can not know how large your company is, or what you are doing, therefore they can not warrant that their software will not do something that will destroy your company. I remember a time when software could literally make a hardware monitor burst into flames and could actually burn down your building. How can the software company (or any company) write a warranty to cover that?
So you have the three things which will not allow your lawyer to have a good warranty on the software. The software vendor does not know how you would be using the software. The software vendor does not know the limits of how much damage could be done on a single system. The software vendor does not know how many systems you are talking about so they can not know the limits of their liability.
If you really want someone to cover the damages to your business from failing software, go to a multi-line insurance company (a big one or an underwriter group) and get an insurance plan that will cover your needs. The insurance company, if they take this business, will do a study of the size of the possible loss, the probability that this will happen, will tell you how much you will pay in insurance premiums and will work with you to reduce those premiums to a reasonable amount. They will tell you what “best practices” will help you limit these types of losses. Once you have those premium charges you can then determine the REAL Total Cost of Ownership (TCO) for your computers.
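The effect of premiums on Total Cost of Ownership can be sketched with simple arithmetic. This is only a rough illustration: real actuarial studies are far more involved, and every number below (the probability, the loss, the premium, the base cost) is hypothetical and chosen purely to show the calculation.

```python
def expected_annual_loss(probability_per_year: float, loss_if_it_happens: float) -> float:
    """Probability of the event in a year times the size of the loss."""
    if not 0.0 <= probability_per_year <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability_per_year * loss_if_it_happens


def real_tco(base_cost: float, annual_premium: float, years: int) -> float:
    """Purchase and running costs plus the insurance premiums over the period."""
    return base_cost + annual_premium * years


# Hypothetical figures, for illustration only.
probability = 0.02            # assumed chance per year of a crippling outage
outage_loss = 5_000_000.0     # assumed loss if that outage happens
base_cost = 1_000_000.0       # assumed hardware, licenses and staff over five years
annual_premium = 120_000.0    # assumed premium quoted by the insurer

print(f"expected annual loss: {expected_annual_loss(probability, outage_loss):,.0f}")
print(f"TCO ignoring the risk: {base_cost:,.0f}")
print(f"TCO including premiums: {real_tco(base_cost, annual_premium, 5):,.0f}")
```

A premium higher than the bare expected loss is normal, since the insurer has to cover its own costs and uncertainty; following the insurer's “best practices” is how the customer brings that premium down.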
There are companies that write “cyber insurance” (which is different than the all-inclusive disaster insurance I am talking about), and you can read about them here:
CrowdStrike will most probably hurt these cyber insurance companies, which only means that future premiums will rise, because if disaster happens once it will happen again. This is one of “Hall’s Laws”. Murphy has nothing on me.
Otherwise your calculations of TCO are pretending that CrowdStrike (or something like it) will never happen. And we see how badly that worked.
So we follow the money:
Airlines tell the passengers “sorry, it is not our fault” (this is technically not true for the reasons given above)
Banks tell the customers “Sorry, we will get your payroll done as soon as possible”
911 calls not handled…..well, sorry
End users pay….no one else.
Except CrowdStrike’s stock price is down by 30%. Of course that may eventually recover…but the price drop and the bad press have to hurt. The CEO of CrowdStrike (who was the CTO of McAfee when it made a similar mistake) may have to fall on his sword, but that does not pay for the damage, which at the writing of this paper is between five and ten billion dollars.
Here comes the advertisement. Here is where you pay the price for reading a free white-paper. Here is where I gouge you and get action out of you.
The cost of fixing this is to train your systems administration people better. Then give them the time to investigate better ways of preventing company-wide failures. Have them practice system-wide failures and how to recover from them.
Have them read books and articles (some of them online and gratis) about how to protect your systems using sandboxes, firewalls, alternate architectures and operating system diversity to run your business.
Use virtual machines, containers and other modern-day methods of isolating troublesome interruptions of your business.
Look for operating systems that allow fast recovery and are designed for resilience.
I could go on for hours talking about good systems administration and how it should be seen as an asset and not as an expense, but that is just me using my fifty+ years of experience in the computer industry to make a recommendation.
Treat your systems administrators with respect every day, not just on “Systems Administrator’s Day”, the last Friday in July every year. Prevention is your best insurance.
Finally, look to the Linux Professional Institute (lpi.org), of which I am proud to be Board Chair Emeritus after nine years of being a volunteer Board Chair, for testing your systems administrators.
There you go…that was the advertisement. Probably a lot less painful than CrowdStrike.
Carpe Diem!