US Aviation System Meltdown Tied to Corrupted Digital File
Source: Bloomberg
A corrupted computer file led to the breakdown of an air-safety system that prompted flights to be grounded across the US, according to people familiar with the preliminary findings.
The glitch that affected the Notice to Air Missions, or Notam, system on Wednesday also caused a failure in a related backup system, said the people, who asked not to be identified discussing the ongoing investigation. They cautioned that the information was not final.
The computer system that shares the notices to pilots, airlines and other users began developing problems late Tuesday night and had to be completely taken down in the early hours of Wednesday. The Federal Aviation Administration temporarily halted domestic departures, leading to thousands of flight delays.
Technology workers tried to activate a backup system and it initially seemed to function, but the same or a similar corrupted file caused problems there as well, said one of the people. A halt to all flights across the country is extremely rare and has only been done a handful of times, such as after the Sept. 11, 2001, terrorist attacks.
Read more: https://www.bloomberg.com/news/articles/2023-01-11/us-aviation-system-meltdown-related-to-corrupted-digital-file
iluvtennis
(19,871 posts)this needs not only a hot backup, but a backup that is one rev level back from the official version.
It seems like they have the hot backup which is why the same corrupt file was in place on the backup.
Would love to understand why testing/quality control didn't catch the corrupt file before it was rolled to the production system.
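A rough Python sketch of what that could look like: check the incoming file against a known-good SHA-256 before it goes live, and keep the outgoing version as a one-rev-back copy. The function names, the `.prev` suffix, and the idea that a digest ships with each release are all assumptions for illustration, not anything known about the FAA setup.

```python
import hashlib
import os
import shutil


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def promote(candidate, live, expected_digest):
    """Install candidate as live only if its digest matches.

    The current live file is first copied to live + ".prev", a backup
    one revision behind, so a bad release can still be rolled back.
    """
    if sha256_of(candidate) != expected_digest:
        return False  # refuse to roll a corrupt file into production
    if os.path.exists(live):
        shutil.copy2(live, live + ".prev")
    shutil.copy2(candidate, live)
    return True
```

A digest check only catches corruption that happens after the digest was computed; a file that was already bad when it was signed off would still pass.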
BumRushDaShow
(129,440 posts)it could have been a failing drive that ended up with enough errors to finally trigger a more obvious problem but the errors were already starting to corrupt the backups.
iluvtennis
(19,871 posts)Last edited Wed Jan 11, 2023, 08:59 PM
to a fellow tech nerd.
BumRushDaShow
(129,440 posts)that mentioned that they apparently noticed the problem yesterday and decided to reboot early this morning, which normally takes 90 minutes including its checks. But they found it was taking too long after that normal boot time to push the info out, and that is when they decided to do the ground stop to troubleshoot.
And yup - tech nerd here hoping my NAS holds up with my backups!
NullTuples
(6,017 posts)
BumRushDaShow
(129,440 posts)also went down during that time - https://www.democraticunderground.com/10143017876#post6
James48
(4,440 posts)The software is now maintained by the contractor, Lockheed Martin, which is interested in the greatest profit, not the highest quality. The FAA hasn't actually had control of the software for years.
Lithos
(26,404 posts)if they are using a Hot-hot HA mode (which is now pushing about 15-20 years old in obsolescence), then a simple sync would have replicated this.
And the file could have been the result of a bad write somewhere caused by a transient side effect (think partial disk failure) which was not caught.
L-
RobinA
(9,894 posts)but I know nothing about IT. I would think there would be redundancy on top of redundancy, but mental health is my game, so what do I know. A little gift to Southwest from the computer gods.
Major Nikon
(36,827 posts)You have static source code and dynamic data. If the corruption is in the source code it's easy enough to roll back. If it's in the dynamic data it's not that simple. NOTAM data is constantly changing as new ones are added and older ones are cancelled. So you can't just roll back to archived data because critical information not included in the backup would be lost. Imagine a closed runway NOTAM at an airport which was lost because they rolled back to a version that didn't include it.
In this case the data loss would be significant. They knew they had a problem hours before and waited until a low traffic period to perform the reset. In that time there would be countless NOTAMs added and removed.
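To make the rollback problem concrete, here is a small Python sketch. It assumes a design where every add and cancel is also written to a journal, so an older snapshot can be brought forward by replaying what happened since. That is one common approach, not a description of how the real NOTAM system stores its data, and the IDs and notice text are invented.

```python
def apply_event(active, event):
    """Apply one journal entry: ("add", id, text) or ("cancel", id)."""
    if event[0] == "add":
        active[event[1]] = event[2]
    elif event[0] == "cancel":
        active.pop(event[1], None)


def restore(snapshot, journal):
    """Rebuild current state from an older snapshot plus later events."""
    active = dict(snapshot)
    for event in journal:
        apply_event(active, event)
    return active


# Snapshot taken before the trouble started (hypothetical data).
snapshot = {"N1": "Taxiway B closed"}
# Changes recorded after the snapshot was taken.
journal = [
    ("add", "N2", "Runway 09/27 closed"),
    ("cancel", "N1"),
]

rolled_back = dict(snapshot)            # plain rollback: N2 is missing
recovered = restore(snapshot, journal)  # rollback plus journal replay
print(rolled_back)
print(recovered)
```

The plain rollback still shows the cancelled taxiway notice and has no record of the runway closure. Replay only helps if the journal itself is intact, which is exactly what is in doubt when the corruption is in the dynamic data.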
Evolve Dammit
(16,763 posts)
bluevoter4life
(788 posts)I'm ATC and some of our computers are still using a DOS-based operating system. Some of our equipment is so old, they are starting to have problems finding replacement parts. Our radar system is several generations behind the rest of the developed world, and we still use paper flight strips.
iluvtennis
(19,871 posts)still crunching business processes.
NullTuples
(6,017 posts)Over the years the city moved from IBM mainframe for that functionality to Windows based COBOL, but his program didn't change. Then they wrapped it in something Java-esque to present the data it exposed to web users. He moved cross-country, then retired. But he's heard from friends that his module is still running, inside several layers of wrappers, because the source is long gone & it's not worth it to reverse engineer (it's for a single, narrow function that's due to be replaced any day now...for the last 20 years). At this point it's just a black box compiled executable that will be in use until 32-bit is retired.
Lithos
(26,404 posts)Sounds like they rehosted the code into an emulator more than likely running in the cloud. Other strategies include taking the old COBOL and C code, compiling it to some intermediate model and then converting it to a more modern architecture with the goal of removing the "COBOL" flavorings. It gives code which is more approachable for today's developers to maintain.
Though frankly you can create extremely unmaintainable Java by over-using Dependency Injection and overly complicated OOP.
L-
Evolve Dammit
(16,763 posts)won't bite us in the ass at some point. I give up.
Major Nikon
(36,827 posts)Nowhere else in the world will you find an ATC system that moves as many airplanes anywhere near as efficiently and as safely as in the US.
Newer technology isn't always better. If a system works reliably, does the job intended, and is sustainable, replacing it with a system that happens to be newer doesn't always result in any improvement and could be a step backwards.
As far as radar systems go, ADS-B supplemented data is the way of the future and the US has a far better system than anywhere else in the world and it has far better growth potential.
rickford66
(5,528 posts)There was a manual copy needed in the process. Pretty straightforward. Well, the system crashed and I finally had to eyeball the two files line by line. This was on a customer's proprietary S/W with no DIFF command. Thanks Murphy.
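For anyone stuck in the same spot today with Python on hand, the standard library's `difflib` does the eyeballing. A minimal sketch, where `changed_lines` is just a made-up helper name:

```python
import difflib


def changed_lines(old_text, new_text):
    """Return only the lines that differ between two texts.

    Lines prefixed "- " exist only in old_text, "+ " only in new_text.
    """
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


for line in changed_lines("alpha\nbeta\ngamma", "alpha\nBETA\ngamma"):
    print(line)
```

Unchanged lines and `ndiff`'s "? " hint lines are filtered out, so what is left is just the pairs that need a human look.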
BumRushDaShow
(129,440 posts)running a local weather station data capturing and formatting web program that was written in Python and I know editing that can be a bear... And although I always keep a backup of the previous config files before editing, simple little misspellings or misplaced brackets can torpedo the program.
rickford66
(5,528 posts)Third shifts in cold computer rooms are not missed.
I should mention the system I was working on was a Falcon 50 simulator and the Notice To Airmen (NOTAM) was one of the updates I had to debug.
BumRushDaShow
(129,440 posts)My dad was a programmer for the VA (before they became a department) from the '50s - '70s programming COBOL (for veterans' checks). He used to bring his punch card decks home and we used to play with the mag tape write-protect tabs that he also would hand us.
Never had the patience to do programming but had a PASCAL course in college and did just rudimentary other languages as they came out, for hobby stuff to at least be able to customize the configs.
I can imagine trying to debug something like that though when your backup is flaky.
rickford66
(5,528 posts)Oh what heaven when I got to use a terminal with a line editor.
BumRushDaShow
(129,440 posts)the "tree-killing" commenced with the printouts!
rickford66
(5,528 posts)
AllaN01Bear
(18,384 posts)
regnaD kciN
(26,045 posts)I should mention the system I was working on was a Falcon 50 simulator and the Notice To Airmen (NOTAM) was one of the updates I had to debug.
...Notice to Air Missions?
I just noticed that change in terminology today. While I have no problem with adopting gender-neutral language, it amuses me to no end that "NOTAM" is such a familiar acronym that they had to stretch to find a new name that would fit the acronym (never heard of civilian flights being called "air missions" before), instead of just coming up with a logical name, like "Notice to Pilots," and then creating a new acronym from that.
rickford66
(5,528 posts)We would have loved to have Airwomen around. The only women we ever saw were stewardesses at the various airline training centers. They did occasionally stop by for a tour, but I heard rumors of pilots bringing in GF's for a simulated mile high flight.
chowder66
(9,080 posts)Canada's air traffic system suffered a similar outage to the one that occurred in the US for a brief period on Wednesday.
US air travel was badly disrupted by the failure of the Federal Aviation Administration's Notice to Air Missions system (NOTAM) overnight on Tuesday, forcing a full ground stop of domestic aviation on Wednesday morning.
Nav Canada, the Canadian national air navigation service provider, released a statement just after 12.30pm as US airlines struggled to resume normal service.
https://www.independent.co.uk/news/world/americas/canada-flights-system-outage-grounded-b2260500.html
BumRushDaShow
(129,440 posts)were transferred between systems as I expect these systems interoperate through some special data pipe for obvious cross-border continuity purposes.
AllaN01Bear
(18,384 posts)
LudwigPastorius
(9,170 posts)would they tell us?
I'm guessing, no.
regnaD kciN
(26,045 posts)It seems unbelievable that the FAA's system is so fragile that a single accidentally-corrupted file could shut down the entire airspace, without the ability to restore a backup and continue as usual.
OTOH, we haven't been hearing much about Russian cybersabotage recently, have we?
Aussie105
(5,432 posts)Or Elon Musk messing about.
More likely though, this was a file error that didn't rear its ugly head until well after it became part of the backups that were kept.
Find the error in your working file, look at the stored time sequence of backups and find the same error.
It happens.
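The backup hunt described above can be sketched in a few lines of Python. This assumes the backups are listed oldest to newest and that there is some check that can tell good data from bad; both function names and the sample data are made up.

```python
def first_bad_backup(backups, is_corrupt):
    """Return the name of the oldest backup that fails the check.

    backups: list of (name, data) pairs ordered oldest to newest.
    is_corrupt: caller-supplied check returning True for bad data.
    Returns None if every backup passes.
    """
    for name, data in backups:
        if is_corrupt(data):
            return name
    return None


def last_good_backup(backups, is_corrupt):
    """Return the newest backup taken before the first bad one."""
    good = None
    for name, data in backups:
        if is_corrupt(data):
            break
        good = name
    return good


# Hypothetical backup history; a stray NUL byte stands in for the error.
history = [("mon", b"ok"), ("tue", b"ok\x00"), ("wed", b"ok\x00")]
has_nul = lambda data: b"\x00" in data
print(first_bad_backup(history, has_nul))
print(last_good_backup(history, has_nul))
```

The hard part in practice is the check itself: if nobody knows what the corruption looks like until the system falls over, every backup taken in the meantime quietly inherits it.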