American's computer problem now fixed


DJofSD
04-16-2013, 05:56 PM
http://www.bloomberg.com/news/2013-04-16/american-airlines-u-s-flights-grounded-by-computer-malfunction.html

These things are reported but there are never any details -- at least not to the level that would satisfy me.

I have absolutely no idea what the problem was, but for whatever reason this one has the feel of a corrupted database to me.

Dave Schwartz
04-16-2013, 07:11 PM
The airline system can be unbelievably fragile.

In 1998 I was doing a consulting gig for Reno Air, a small airline that was eventually purchased by American. My job there was to be available in case anything broke.

One day I was poking around as I was learning the system, and realized that the main databases were all completely accessible by anyone with "root" access. I went to the network admin and told him.

He told me a story. He said that the previous week he had been to a seminar on NetAdmin and was told that it was a bad practice to have more than one person with full "root" access at any given time although there always must be a procedure to recover root access/passwords, etc. (Last thing a corporation wants is to find themselves locked out of their own car.)

LOL - So, he brings up the network list of who is logged in... there were 15 different roots logged in at the moment!

He said that he had been tracking it and found that the "average" was about 12! (BTW, I had root access and my Unix/Linux knowledge and experience is VERY weak.)

I said, "Correct me if I am wrong but could not an aggravated employee take the system down simply by deleting some files or folders?"

He said, "Well, the planes wouldn't fall out of the sky but one could certainly cause ALL FLIGHTS to be cancelled until a backup could be restored."

The further punch line arrived a few days later when it was discovered that the nightly backups had been corrupt for about 6 months!

BTW, the NetAdmin was absolutely powerless to do anything about this because the people at the top were not willing to take on the issue. To its last day, Reno Air was just a single deletion from being all but wiped out!

DJofSD
04-16-2013, 07:47 PM
Part of your story reminded me of an incident in my career.

I, along with a couple of coworkers, was approached one day by a national level manager for an urgent work assignment. No details other than we'd be onsite at a client's location for two weeks at a whack. The team members would rotate in and out with a total duration of the project of about 4 to 6 months.

I was single and looking for some fun (yes, I do have a weird sense of fun -- I love things like D/R drills). I volunteered and found myself at the top of the list.

After making my arrangements, I was on my way to Indy in a couple of days.

Showed up at the client's office, got escorted in and in a short time found that same manager was already there in the middle of some kind of problem. Jumped in and helped fix that little crisis, then got introduced around, was shown the desk our team would be using, and had some meetings with the team that normally worked there.

Lunch, few more discussions with some of the other techies until more or less it's the end of the day, I'm still not sure exactly what my duties entail. About that time the manager shows up. Dinner at the hotel, let's go.

Some work related talk over dinner but nothing really specific, and certainly nothing to explain what the heck is going on. The level of urgency we saw when the team was recruited was not matched by the first day's experiences on site.

Afterwards, I joined the manager in the bar for a few after dinner drinks. When I felt it was a good time, I asked: Bob, what the heck was the emergency and what exactly do you need me to do while I'm here for the next 9 days?

He told me.

What it all came down to was another techie type had screwed up royally. It seems the client had some RYO software for performing backups of the source code repository for all of their production applications. One of his duties was to make sure the backups were taken.

Well, the screw up was he didn't. And, as you can guess, they got corrupted.

But it gets better.

They didn't find out they were corrupted until there was a need to back out a fairly substantial set of changes and to restore a set of apps to a prior known "good" level. They couldn't do it. And the entire set of backups was bad. They were in a sh*tload of hurt.

It turned out the regular backups were producing an error message, but it was either not caught or just ignored. If the initial problem with the homegrown backup program had been caught early, then recovery would have been easy and no real loss of data would have occurred.

When the manager of the client's division was told all of the details, he justifiably got livid and demanded in no uncertain terms that my coworker be removed from the account immediately, which he was.
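A minimal sketch of how a backup error can be made hard to ignore, assuming the backup is a command-line program that signals failure with a nonzero exit code. The names `run_backup` and `BackupFailed` and the example command path are hypothetical; nothing here reflects the client's actual RYO software.

```python
import subprocess


class BackupFailed(Exception):
    """Raised when the backup command reports an error."""


def run_backup(cmd):
    """Run a backup command and refuse to let a failure pass silently.

    `cmd` is a hypothetical argument list, for example
    ["/usr/local/bin/backup", "--all"].
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the exit code and error output so someone has to look at it.
        raise BackupFailed(
            f"backup exited with code {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout
```

The point of raising an exception rather than printing a message is that the calling job stops, so a failed backup cannot quietly repeat night after night.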

So now I understood what had happened and the team I was a part of was to back fill the position until a permanent replacement could be found.

Dave Schwartz
04-16-2013, 09:32 PM
That is what my limited experience in the corporate world would indicate as well.

Since we're on this topic, I will share more of the Reno Air backup fiasco.

Backups to any business vary in importance. They vary from "no big deal" to "life altering" on the day that they actually need them. LOL

The "brass" at Reno Air decided to invest in a top-notch backup system to replace the one they had previously (which only cost about $80,000). So, they bought a newer version of the same hardware. As I recall, the cost was about $300k.

Picture a cabinet that looks like one of those games that you see outside Walmart with the prizes inside and a joystick to grab them with a little crane.

What you did was load up a bunch of tapes in the box, and the little crane would go grab a tape whenever it needed a new one. This system ran backups from about midnight to 4am and used 3 tapes per day.

Once a week the NetAdmin would open the box (it was not locked that I am aware of) and transfer the last 7 days of tapes to a storage cabinet. (More on that later.)

Now, most organizations cycle their tapes. Perhaps they have 7 sets of tapes, and every Monday they take the Sunday tapes and archive them. Then on the 1st of the month they take the last weekly tapes and archive those, and put the other tapes back into circulation.
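The rotation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: one tape set per weekday, the Sunday set archived each Monday, and the latest weekly copy promoted on the 1st of the month. The function name `tape_actions` and the wording of each step are made up for the example.

```python
from datetime import date

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def tape_actions(today):
    """Return the rotation steps for `today` under a simple 7-set scheme."""
    # Assumption: each weekday has its own tape set, reused every week.
    actions = [f"write tonight's backup to the {DAYS[today.weekday()]} set"]
    if today.weekday() == 0:  # Monday
        actions.append("archive the Sunday set as the weekly copy")
    if today.day == 1:
        actions.append("archive the latest weekly copy as the monthly copy")
        actions.append("return the older weekly copies to circulation")
    return actions


for step in tape_actions(date(2013, 4, 1)):
    print(step)
```

Under a scheme like this the number of tapes held stays roughly constant, because only the weekly and monthly copies leave circulation.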

Not at Reno Air. You know those cabinets that house giant 3-ring binders of policies, rules, etc.? They had one of those and put the tapes in there every week.

They never recycled anything. Here is where it starts to get good.

When that cabinet was full, they looked for another cabinet! At 21 tapes per week, that cabinet would fill up every few months. Not to worry... there is always another cabinet available.

(BTW, these tapes were like $35 each as I recall. So, that is like $100 per day for backup tapes.)

Reno Air was in existence for around 6-7 years. With all these backups, pretty soon a new problem arose: where do we put the cabinets? (Never mind that some of the tapes were from the old system and probably couldn't be read by the new system anyway.)

One day, I was talking to one of the VPs, and I was telling him a story about a bad experience I had with a backup more than 20 years before (literally). I had entered a year of trainer data - took me 1,600 hours - when someone broke into my and took my computer systems. (Floppy disks, not hard disk.)

Not to worry - I had the data all backed up in my safety deposit box. That was when I got a lesson in magnetic media. Do you know what happens when you put magnetic media in a METAL box?

The box itself corrupts the data!

The VP says, "What about our backups?" At which point I came to understand what all of those METAL cabinets spread around the bullpen were for. He panicked when I explained that the longer the media is exposed to the metal, the greater the risk.

A team was assembled to check the veracity of the tapes. Logically (and to their credit) they were most concerned with the more recent tapes. Turned out that those tapes were fine.

Except for one thing... There was only one folder on each tape! The backups were simply not being done because the "jobs" had never been programmed! They immediately grabbed the tapes from the previous night and discovered the same thing on them: one folder and all its sub-folders rather than a system-wide backup.
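A check along these lines would have caught the one-folder tapes early. It is a minimal sketch that assumes the backup is available as a tar archive, which is an assumption for illustration only; the format Reno Air's system actually wrote is not stated. The function name `missing_top_level` and the folder names in the usage note are hypothetical.

```python
import tarfile


def missing_top_level(archive_path, expected):
    """Return the expected top-level folders that are absent from the archive."""
    with tarfile.open(archive_path) as tar:
        names = tar.getnames()
    found = set()
    for name in names:
        # Drop empty and "." components so "./home/x" counts as "home".
        parts = [p for p in name.split("/") if p not in ("", ".")]
        if parts:
            found.add(parts[0])
    return set(expected) - found
```

Called as `missing_top_level("backup.tar", ["home", "etc", "var"])`, it returns the set of folders that never made it into the archive, and an empty set means every expected folder is present. It says nothing about whether the files inside are readable, so it complements a restore test rather than replacing one.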


Welcome to corporate America.