Business And Industry Has A New Swear Word: Outage

So on this beautifully sunny Saturday, I receive news that British Airways has suffered a “major systems outage”. Lots of flights are cancelled and passengers are venting on Twitter. There is nothing unusual in people venting their frustration at British Airways on Twitter. Even I have done it in the past!

There are two main points here. Firstly, it’s British Airways. They never seem to me to be able to handle an issue correctly, and they seem to have difficulty behaving in a good and timely fashion towards customers. The chaos always seems to me to be exacerbated. I have had over 15 years’ experience of this, be it extreme snow or even swipe cards! I remember the swipe card incident at Heathrow Airport particularly well, as they managed to leave me stranded in a godforsaken hellhole for 4 days! I do feel sorry for the passengers on this one as it is not their fault. I hope that they all manage to get on and go about their business; they do not need this disruption. An airport, in my opinion, is one of the worst places to be stuck, especially if you are airside. I have not seen enough information about this one to make any sort of technical judgement.

The second point is not particularly aimed at British Airways, as it is an industry-wide issue. You see, all over business and industry not enough budget ever seems to be given to IT, and a make-do-and-mend approach ends up being followed despite the compliance requirements and standards that have to be met. In my 18 years in the IT trade I saw this every day. Some organisations are better than others. The other month there was an issue with Amazon Web Services that took out a lot of big services, in some cases for more than 24 hours. This is not acceptable. Thing is, these outages are getting larger, more far-reaching and more disruptive. We had a cyber attack a few weeks ago that severely hit large parts of the NHS and some other organisations too, some of them for more than 24 hours.

Cyber attack, system failure or human error, an outage is an outage. One of my systems suffered an outage the other day. I had everything recovered in 10 minutes, but an outage is an outage. You see, I had an instant recovery plan that is tried and tested. If I didn’t have these plans in place it could have been worse.

I have always felt that not enough attention is paid to disaster recovery, business continuity and failover. In essence, this means being able to swap systems over in the event of a major problem to try and avert it or stem the damage to the business. The problem is that business continuity and disaster recovery cost money. At a senior management level there seems to be an attitude of “it will never happen to us, we do not need to spend the money”. It does happen. Systems fail all the time. It’s a fact.

The answer is to pay more attention and give more credence to business continuity and failover. An issue will happen, and beefing up these components will stem any damage to a business and indeed stem the levels of customer dissent and dissatisfaction, especially when Twitter is so readily available as a complaint channel! I hope business and industry will look at these outages and the effect they have on customers and take some “preventative medicine”. Outages are getting bigger and even more disruptive, such is the way everyone now relies on technology.
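
To make the idea of failover, swapping systems over when one fails, a little more concrete, here is a minimal sketch in Python. Everything named in it (`call_with_failover`, `primary`, `standby`, `AllSystemsFailed`) is hypothetical and made up for illustration; it is not a description of any real recovery plan, just one simple way the principle could look in code.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllSystemsFailed(Exception):
    """Raised when the primary and every standby system have failed."""


def call_with_failover(systems: Sequence[Callable[[], T]]) -> T:
    """Try each system in order and return the first successful result."""
    errors: List[Exception] = []
    for system in systems:
        try:
            return system()
        except Exception as exc:
            # Any failure triggers the swap over to the next system in line.
            errors.append(exc)
    raise AllSystemsFailed(errors)


def primary() -> str:
    # Hypothetical primary system that is currently suffering an outage.
    raise ConnectionError("primary is down")


def standby() -> str:
    # Hypothetical standby system that takes over when the primary fails.
    return "served by standby"


if __name__ == "__main__":
    print(call_with_failover([primary, standby]))
```

A real setup would also need health checks, data replication and regular testing of the standby; the sketch only shows the swap itself.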

Amazon AWS S3 Failure

So according to this article, the Amazon AWS S3 platform is having some problems. Some of the largest websites in the world have been affected by an outage. Amazon is saying it is an increase in error rates as opposed to an outage. Some sites and services are missing data, some are offline and some are running slower than I can walk 🙁

This is not a good thing for businesses that need to have control over their web services. Disaster recovery and business continuity seem to be non-existent. This is a fail. When I was designing parts of networks and systems we would always factor in disaster recovery. AWS looks good but it can also rack up to be very, very expensive indeed.

What is even more worrying is that Amazon AWS S3 is seen as being too big to fail, the thinking being “my business won’t have any problems because it is with Amazon”. This episode shows that Amazon, along with anything or anyone else, is not too big to fail. No one thing, person, business or infrastructure is immune to failure. Simple.

This is one of the biggest IT failures I have ever seen, and I have seen more than a few in my time in the technology industry. Some were small, but I have also seen some spectacular failures. IT outages and system failures happen every hour of every day of every week of every year. IT specialists are trained to handle and repair them, or even to prevent them from happening by carrying out preventative maintenance. These things happen and should be prevented where possible.

Amazon will surely take quite a hit on this. I can imagine certain CTOs are sitting at their desks now, reviewing their relationships with Amazon and other similar providers to ensure they don’t get hit by something like this, either again if they have been hit this time or, for that matter, at any time in the future.

One thing is for sure: this story will not go away for a while, and its impact will be felt in varying degrees by so many, including users who will be unhappy about being unable to access essential websites and apps. Amazon clearly need to look at a redesign or upgrade.

I would also like to assure Virtual Office users that they are unaffected by this, as I do not use AWS, preferring my trusty HP servers and infrastructure.