Full Version: MGS in Trouble
Unmanned Spaceflight.com > Mars & Missions > Past and Future > Mars Global Surveyor
Zvezdichko
Any news regarding the images of HRSC?
tuvas
I just found out that the MRO star tracker successfully imaged Odyssey, so there's still a chance that it could image MGS. If it manages to successfully image MGS, then HiRISE will follow suit.
ustrax
QUOTE (elakdawalla @ Nov 28 2006, 06:45 PM) *
However, he said it was a different part of the panel that was reporting problems this time.


What will be the problems tormenting Anahita generation about space exploration?!... rolleyes.gif
How I would love to be around to be concerned... smile.gif
jaredGalen
Anyone have more details on this?? MGS possibly seen, perhaps tumbling?

http://www.livescience.com/blogs/author/leonarddavid
tuvas
QUOTE (jaredGalen @ Dec 20 2006, 11:37 AM) *
Anyone have more details on this?? MGS possibly seen, perhaps tumbling?

http://www.livescience.com/blogs/author/leonarddavid


Haven't heard a thing, if it is tumbling though, I imagine it might be a good thing... That's just a guess, that's all folks, I don't know anything about spacecraft flight...
elakdawalla
A tumbling spacecraft is an out-of-control spacecraft. sad.gif

--Emily
djellison
Well - tumbling would suggest ACS failure - and ACS failure (for whatever reason) would almost certainly mean a spacecraft that was not power positive - and MC's words on that were that unless it was power positive, it was good-night Mr Bond.

But - conversely - SOHO was found to be tumbling back when it played truant for 6 months - and was saved (even from a point when all its fuel had frozen solid).

However - I think MGS is probably lost - I just hope that HiRISE can get a picture of MGS...even if it's a hard sequence to get right, and a bandwidth whore (black should compress really well smile.gif) - I think it's an image we would all want to see (and one that would make it into the mainstream press as well).

Doug
tuvas
QUOTE (elakdawalla @ Dec 20 2006, 11:49 AM) *
A tumbling spacecraft is an out-of-control spacecraft. sad.gif

--Emily


Well, I guess I've learned a thing or two... I thought tumbling would lead to a power-positive system, at least half power, which would do something. Hmmm. Would be quite interesting to see with HiRISE, to see if it moves during the picture...
djellison
Problem is - you could have tumbling in an axis that had zero power - or full power...but more likely somewhere in between. Call it an average of half power. But that half power is over half the orbit - so you never have a charged battery going into eclipse, and always come out with it flattened. Then, if you don't get a good enough angle on the arrays next time around - you end up with a cold spacecraft, a flat battery, and that's when you have actual failures of the hardware required to keep things under control.
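
The energy-balance argument above can be put into rough numbers. This is only a minimal sketch: the array output, load, battery capacity and the one-hour sunlit/eclipse split are made-up illustrative values, not MGS figures, and `simulate_battery` is a hypothetical helper.

```python
def simulate_battery(orbits, array_power_w, load_w, capacity_wh,
                     sunlit_h=1.0, eclipse_h=1.0):
    """Return the battery energy (Wh) left at the end of each eclipse.

    All inputs are illustrative assumptions, not real spacecraft values.
    """
    charge = float(capacity_wh)  # start with a full battery
    history = []
    for _ in range(orbits):
        # Sunlit part of the orbit: arrays carry the load and any
        # surplus (or deficit) goes into (or comes out of) the battery.
        charge += (array_power_w - load_w) * sunlit_h
        charge = min(max(charge, 0.0), float(capacity_wh))
        # Eclipse: the battery alone carries the load.
        charge = max(charge - load_w * eclipse_h, 0.0)
        history.append(charge)
    return history


# Hypothetical spacecraft: 250 W load, 500 Wh battery.
pointed = simulate_battery(10, array_power_w=600, load_w=250, capacity_wh=500)
tumbling = simulate_battery(10, array_power_w=300, load_w=250, capacity_wh=500)

print("arrays pointed:", pointed[-1], "Wh left after the last eclipse")
print("tumbling (half power):", tumbling[-1], "Wh left after the last eclipse")
```

With these assumed numbers the pointed case settles at a steady level every orbit, while the half-power case drains a little more each eclipse until the battery is flat, which is the situation described above.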

Depending on what sort of orientation it is in and what that spin is like - it may well be that after a few months - everything rotates around the sun to give the old girl enough power to wake up and perhaps tell us what's wrong...but it would be a long shot.

Doug
Zvezdichko
That's pretty strange. We had contact on 5th November which means that the batteries were charged... that probably means that ... we don't have zero power. So we could still have a chance...
gpurcell
From NasaWatch's LiveBlog on MEPAG

"We think that failure that a software load we sent up in June of last year was the cause. This software tried to synch up two flight processors. Two addresses were incorrect - two memory addresses were over written. As the geometry evolved. We drove the arrays against a hard stop and the spacecraft went into safe mode. The radiator for the battery pointed at the sun, the temperature went up, and battery failed. But this should be treated as preliminary."

If true, a sad end to a magnificent mission.

http://www.nasawatch.com/archives/2007/01/..._meeting_i.html
PhilCo126
NASA still didn't announce an official " RIP MGS " ... correct?
AlexBlackwell
QUOTE (gpurcell @ Jan 9 2007, 06:31 AM) *
If true, a sad end to a magnificent mission.

There are a couple of other words that I might use in addition to "sad."
Sunspot
Oh Dear............... human error
Bob Shaw
QUOTE (Sunspot @ Jan 10 2007, 07:27 PM) *
Oh Dear............... human error



It sounds like 'safe mode' wasn't. The whole idea of safe mode is that the spacecraft gets a breathing space while the humans look at the spaghetti code, but in this instance safe was not the word. The real question to me isn't so much why some commands were wrongly written, as why the lifeboat had a hole in it.

MGS was a fine old bird! Better to go out in action than simply to be switched off because of budget pressures.


Bob Shaw
AlexBlackwell
QUOTE (Bob Shaw @ Jan 10 2007, 11:37 AM) *
It sounds like 'safe mode' wasn't. The whole idea of safe mode is that the spacecraft gets a breathing space while the humans look at the spaghetti code, but in this instance safe was not the word. The real question to me isn't so much why some commands were wrongly written, as why the lifeboat had a hole in it.

Admittedly, there is a dearth of details available, but assuming there is a failure review report, I'd be interested to see how something like this wasn't caught in the testbed.
AlexBlackwell
Panel Will Study Mars Global Surveyor Events
NASA/JPL
January 10, 2007
Lorne Ipsum
More accurately, it was a parameter upload error (somewhat similar to the error that killed one of the Viking landers). I'm in the process of writing up a blog post to explain, should be up later tonight / early Thursday...

Lorne
Geek Counterpoint
Zvezdichko
Space agencies more like miles not meters *crash* sad.gif

It seems that faulty software has doomed more than half of Mars spacecraft.
Firstly it was Viking ( bad antenna positioning )
Secondly it was Mars Climate Orbiter
Thirdly it was Mars Polar Lander and these spurious signals.
I'm not counting Phobos Spacecraft...
I don't know why it's always software. We almost lost Spirit three years ago...
It really makes me sad.
odave
It's the nature of the software beast. I'm a software engineer at an industrial robotics company, and I've been told by a mechanical guy that he detests software because you can't see it or measure it. To him, it's black magic that can fail for no apparent reason. Code can be hideously complex, and even if you test the snot out of it before deploying, it seems like there's always one oddball set of circumstances or sequence that nobody even dreamed of encountering that happens almost immediately (and usually to your most important and sensitive customer smile.gif )

I've encountered unexpected memory address overwrites in our stuff, and they don't always show up during testing for a variety of reasons. It may be that the test cases didn't create a situation where the corrupted memory was accessed, or that the data that was written to the wrong memory locations is benign at the time of the test. I would assume that JPL's testing is much more rigorous than ours since the stakes are so much higher, but it's really hard and time/fund consuming to test for absolutely everything. So yes, a sad end for MGS if this was the case, but hopefully they can learn from it and improve the testing process.
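
A toy illustration of why an overwrite like that can sail through testing. This is a sketch only: the memory layout, parameter names and values are invented for the example and have nothing to do with the real MGS memory map.

```python
# Toy fixed-address memory: each parameter lives at a hard-coded address.
# Layout, names and values are invented for illustration.
MEMORY_MAP = {
    "hga_gimbal_limit": 0x10,
    "array_soft_stop": 0x11,   # only read in a rarely used contingency mode
    "heater_setpoint": 0x12,
}


def new_memory():
    mem = [0] * 32
    mem[MEMORY_MAP["hga_gimbal_limit"]] = 90
    mem[MEMORY_MAP["array_soft_stop"]] = 170
    mem[MEMORY_MAP["heater_setpoint"]] = 5
    return mem


def upload(mem, start, values):
    """Unchecked block write, like a raw parameter upload."""
    for offset, value in enumerate(values):
        mem[start + offset] = value


def nominal_ok(mem):
    """Everyday test case: never reads array_soft_stop."""
    return (mem[MEMORY_MAP["hga_gimbal_limit"]] == 85
            and mem[MEMORY_MAP["heater_setpoint"]] == 5)


def contingency_ok(mem):
    """Only exercised when the rarely used mode is actually entered."""
    return mem[MEMORY_MAP["array_soft_stop"]] == 170


mem = new_memory()
# Intended: set hga_gimbal_limit to 85. The upload is one word too long,
# so it also clobbers the neighbouring array_soft_stop.
upload(mem, MEMORY_MAP["hga_gimbal_limit"], [85, 0])

print("nominal test passes:", nominal_ok(mem))
print("contingency test passes:", contingency_ok(mem))
```

The nominal test passes because nothing in it ever reads the clobbered word; the damage only shows up when the contingency path runs, possibly months later.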
Zvezdichko
A little offtopic but...
I'm very concerned about the future and Phoenix. I don't see how spurious signals could be avoided. We have two successful landers ( Viking 1&2 ), and one failure ( MPL ). Actually, we don't know the exact reason for the failure ( for both MPL and MGS ), this is just a likely scenario.
Any news on the latest December attempt with HiRISE? ESA said that they (may) have detected a tumbling MGS?
djellison
5 landers..V1, V2, MPF, MERA, MERB - all used radar.

Doug
ugordan
One also has to consider that it's software, not hardware, that actually "thinks" for the spacecraft. Hardware processors, as complex as they may be, have straightforward instruction sets and architecture that can be tested pretty well (though remember that Pentium bug years ago...). Processors are dumb pieces of electronics that expect to be told what to do. Software is what makes the thing "tick" and it's vastly more complex than what is essentially a state machine and a powerful calculator. Were you to develop a processor that did all the thinking by itself, it'd still be bugged because it was designed by humans. Complex tasks mean complex things might happen. They might not always be what you expect. You expect and hope they will be, you can test the hell out of the system, but there are always gremlins hiding somewhere. You can't test everything; remember: even test cases are created by humans!
Zvezdichko
I have some information about processors on spacecraft. The statement, however, describes not a processor failure but overheating of the batteries (which means the death of a spacecraft).
The previous statement of a tumbling spacecraft could mean at least two things. The spacecraft has lost control after overheating. Or after problems with the solar panel we had improper turn-over of the spacecraft.
Am I right? ( just trying to guess)
mcaplinger
QUOTE (Zvezdichko @ Jan 11 2007, 06:26 AM) *
It seems that faulty software has doomed more than half of Mars spacecraft.

The VL1 and MCO cases are not what I would call software faults. In the Viking case, ground controllers commanded things with raw memory writes instead of a higher level command protocol, and they inadvertently wrote into the wrong locations. You could argue that they should have had better software, but the software they did have was working as it was supposed to -- it was operator error. The MCO loss was more a process problem, stimulated by a simple calculation error. Nor is the MPL failure a pure software error -- it was a miscommunication between hardware and software design. Of your examples, only the Spirit flash anomaly was what I would call a pure software error, and it was recoverable via other software.
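
To make the distinction between raw memory writes and a higher level command protocol concrete, here is a minimal sketch. It is not how Viking or any real spacecraft was commanded; the parameter table, addresses, limits and function names are all hypothetical.

```python
class CommandError(ValueError):
    """Raised when a high-level command fails validation."""


# Hypothetical parameter table: name -> (address, minimum, maximum).
PARAMETERS = {
    "battery_charge_rate": (0x40, 0, 10),
    "heater_setpoint": (0x41, -20, 40),
}


def raw_memory_write(memory, address, value):
    """Does exactly what it is told, including writing to the wrong place."""
    memory[address] = value


def set_parameter(memory, name, value):
    """Higher-level command: the operator names the parameter, and the
    address lookup and range check happen in software."""
    if name not in PARAMETERS:
        raise CommandError(f"unknown parameter: {name}")
    address, low, high = PARAMETERS[name]
    if not low <= value <= high:
        raise CommandError(f"{name}={value} outside [{low}, {high}]")
    memory[address] = value


memory = {}
set_parameter(memory, "battery_charge_rate", 3)   # validated write
raw_memory_write(memory, 0x99, 3)                 # typo'd address, accepted silently

try:
    set_parameter(memory, "battery_charge_rat", 3)  # typo'd name is rejected
except CommandError as err:
    print("rejected:", err)
```

In both cases the software "works as it is supposed to"; the difference is only in how much operator error each interface can absorb before it reaches memory.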

I can't discuss the MGS failure because unlike some other people on this forum, I was too straightforward in my choice of user name and can't speak anonymously rolleyes.gif
djellison
Some sort of software/commanding problem caused...

A bad attitude which heated up the battery radiator which caused...

Battery failure which caused...

Loss of vehicle, as I understand it so far.
Littlebit
QUOTE (mcaplinger @ Jan 11 2007, 08:41 AM) *
The MCO loss was more a process problem, stimulated by a simple calculation error...

Another English to Metric conversion? rolleyes.gif

Given the sheer number of human interactions with the MGS, it is an extraordinary accomplishment of the MGS team to have kept the ball in the air this long. In a way it is like playing Tetris: No matter how great you are, the result of every mission without a firm time-line will be a failure of some sort, usually human...no matter how super human the effort:)

I hope they will be candid and timely in providing a detailed description of the failure and the lessons learned. Knowing the reason that an un-timed mission failed is one more mission success.
lastof7
I feel for the MGS software team if it turns out to not be a direct software issue. It's understandable that NASA management and the public want to know as quickly as possible the root cause of a fault, but, having experienced similar situations, it's painful to see headlines like "Faulty Software May Have Doomed Mars Orbiter" before we have a definitive answer. Unfortunately, it may be too late to correct the impressions that have been made if it is a parameter issue or something of that nature.
climber
It's a well-known fact that the majority of car accidents occur on the roads you know best. You can call that statistics or lack of concentration. The longer a mission goes, the more likely a human error will occur. I'm just amazed how long the Voyagers have flown, it'll be 30 years this year.
To the software people: we back you guys, habit is a bad thing and people only remember your failures. We just CAN'T fly without you.
tedstryk
QUOTE (Littlebit @ Jan 11 2007, 05:00 PM) *
Another English to Metric conversion? rolleyes.gif

Given the sheer number of human interactions with the MGS, it is an extraordinary accomplishment of the MGS team to have kept the ball in the air this long. In a way it is like playing Tetris: No matter how great you are, the result of every mission without a firm time-line will be a failure of some sort, usually human...no matter how super human the effort:)

I hope they will be candid and timely in providing a detailed description of the failure and the lessons learned. Knowing the reason that an un-timed mission failed is one more mission success.


They may not. They will certainly look into scenarios of what might have happened, but since contact was lost and there was only limited contact between November 2 and November 6 (which, I believe, was the last day they picked any signal out), it may be hard to isolate the cause of failure.

I will also say that human error can have a magnified effect on extended missions, which are usually funded at much lower levels than primary missions, stretching staffing to the bone.
Lorne Ipsum
Gang,

This might help explain things a bit:

http://geekcounterpoint.net/files/GC052B.html

Lorne
climber
QUOTE (Lorne Ipsum @ Jan 12 2007, 01:30 AM) *
Gang,

This might help explain things a bit:
Lorne

Lorne, you have a way of explaining rocket science, I've never seen before!
I've learnt a lot of things...and that seems SO simple to understand.
Thanks so much...
nprev
Absolutely superb & highly educational analysis, Lorne; thank you VERY much! smile.gif

The bottom line is that many parts of this read exactly like every aircraft accident report I've ever read: there is always a chain of events that increases unknowns and ultimately leads the entire system (including the human element) into an uncontrollable situation with basically unpredictable, often undesirable outcomes.

I sure hope that the MGS software team member(s) involved near the end don't feel too bad; they shouldn't. Aside from the brilliant performance of the spacecraft that vastly exceeded all reasonable dreams before launch, complex systemic failures just plain happen. They seem to be an inevitable feature of the Universe, and I'm sure that the mathematics of chaos theory could easily prove this.
Lorne Ipsum
nprev & climber,

Thanks -- glad you liked the writeup! I'm with you -- hopefully the poor guy at the bottom of the totem pole doesn't get beat up too severely over this (he'll be reliving it for the rest of his life anyway).

I've worked mission ops for old spacecraft with static memory maps before, and I remember how we ALWAYS got paranoid whenever we did parameter updates. When push comes to shove, the fact that a mistake like this could go unnoticed for months says there's a bad process being followed (or a good one not being followed) somewhere. Hopefully the review board can come up with some lessons that can be applied to more modern architectures.

Lorne
stevesliva
Thanks for the extremely informative blog!

I am unsure about one thing though. Towards the end, in "the spark that lit the fire," you do not mention when MGS was switched back to SCP-1. Was this part of the safe mode? Or had it already been transitioned back to SCP-1? And if the transition was an intentional switch back, I tend to agree that at least some process should have caught the bad parm upload. (ie a comparison of the two memories) But if the switch back was a result of a safing event before the SCP-1 memory repair was fully verified, well, that just sucks but is less faultworthy.
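
A comparison of two memories could look roughly like the sketch below. The image contents and the `diff_images` helper are hypothetical; real flight memory dumps and the tooling around them are certainly more involved.

```python
def diff_images(primary, backup, watched=None):
    """Return (address, primary_value, backup_value) for every mismatch.

    `watched` optionally limits the check to a list of addresses,
    e.g. the region touched by a parameter upload.
    """
    if len(primary) != len(backup):
        raise ValueError("memory images differ in length")
    addresses = range(len(primary)) if watched is None else watched
    return [(a, primary[a], backup[a])
            for a in addresses
            if primary[a] != backup[a]]


# Two tiny made-up memory dumps that should be identical after a sync.
scp1_image = bytes([0x10, 0x20, 0x30, 0x40])
other_image = bytes([0x10, 0x2F, 0x30, 0x41])

for address, a, b in diff_images(scp1_image, other_image):
    print(f"address {address:#04x}: {a:#04x} != {b:#04x}")
```

A check like this only helps if it runs before the second processor is ever put in charge, which is the process question raised above.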
helvick
Lorne - superb write up, one of the best bits of reporting on spacecraft ops I've ever come across. Any chance you're available to help the BBC out as they seem to be in need of a major quality control overhaul at the moment?
edstrick
Viking's case involved a thrown-together set of people from the disbanded engineering and software team. VL1 was on an automatic "eternal" mission that was hopefully not going to require any further commanding ever. They were trying to salvage or extend the mission by uploading battery conditioning commands as the battery started to show similar problems to the VL2 batteries that killed that lander's operations.

Note that the Magellan Venus radar mapper mission was nearly lost early on due to a high-lethality interrupt handling error that could send the computer essentially into runaway crashes. They finally "trapped" the error when the ground duplicate test system did an interrupt fault and crashed while full diagnostic info was available.

I'm deeply unhappy with trusting in software driven "safe modes", preferring that the spacecraft be able to fall back into an ultimate nearly lobotomized mechanical safe mode. Remember, Pioneers 10 and 11 never had a software problem, never rebooted, never crashed. No computers. All the way beyond Pluto on direct commands (except for sequencer stored commands for midcourse maneuvers).

I'm also deeply unhappy with spacecraft inside of Jupiter's orbit that do not have essentially 100% omnidirectional coverage with low data rate omni-antennas. We nearly lost the ability to command Mariner 10 when it was being stabilized in a drifting roll mode and it rolled into a null in the receiving antenna pattern shortly before the third Mercury encounter. We also had problems with Magellan getting into nearly communication-unable attitudes during one or more of its computer crash crises. You really want to get 8 bits/second telemetry as long as a spacecraft has power and live command decoding circuits, and the ability to send 1 bit/second commands.
Guido
QUOTE (Lorne Ipsum @ Jan 12 2007, 01:30 AM) *
Gang,

This might help explain things a bit:

http://geekcounterpoint.net/files/GC052B.html

Lorne

Link sends me to "Episode 52 -- The Antikythera Mechanism"

How do I get to the right one? Joining that forum?
PhilHorzempa
QUOTE (Lorne Ipsum @ Jan 11 2007, 08:30 PM) *
Gang,

This might help explain things a bit:

http://geekcounterpoint.net/files/GC052B.html

Lorne



Lorne,

Have "they" gotten to you? It appears that the excellent post that you wrote concerning MGS and its software in January is now "disappeared." In fact, except for TPS' mention of it in their weblog, and this UMSF thread, there is no hint that that article ever existed. This is truly bizarre.
What's up Lorne?


Another Phil
Littlebit
Somewhere - but I cannot find where - I read Lorne's scenario was not likely to be correct. This may be why the article was pulled, which is too bad, because it was a very good description of MGS era computer systems.

In any case, it will be disappointing if yet another 'successful mission unplanned ending' investigation is kept under wraps.
PhilHorzempa
Lorne,

Please let us at UMSF know what happened to your MGS software article on Geek Counterpoint. It was an excellent presentation and helped a lot of us who may not know software as well as you, but are technically informed enough to comprehend the issues.

If there are questions as to whether this is what really ended the MGS mission, then please consider re-posting an edited version of the article that omits that conclusion. It was fascinating to catch this glimpse into a crucial aspect of unmanned exploration. As I believe someone else has already said, our robot explorers do exactly what we tell them. The unfortunate thing is that sometimes we don't realize what we have told them.


Another Phil
elakdawalla
The preliminary report is out, and it sounds like what Lorne described.
Here's the report:
http://www.nasa.gov/pdf/174244main_mgs_whi...er_20070413.pdf

--Emily

QUOTE
MEDIA RELATIONS OFFICE
JET PROPULSION LABORATORY

NEWS RELEASE: 2007-040 April 13, 2007

REPORT REVEALS LIKELY CAUSES OF MARS SPACECRAFT LOSS

WASHINGTON - After studying Mars four times as long as originally planned, NASA's Mars Global Surveyor orbiter appears to have succumbed to battery failure caused by a complex sequence of events involving the onboard computer memory and ground commands.

The causes were released today in a preliminary report by an internal review board. The board was formed to look more in-depth into why NASA's Mars Global Surveyor went silent in November 2006 and recommend any processes or procedures that could increase safety for other spacecraft.

Mars Global Surveyor last communicated with Earth on Nov. 2, 2006. Within 11 hours, depleted batteries likely left the spacecraft unable to control its orientation.

"The loss of the spacecraft was the result of a series of events linked to a computer error made five months before the likely battery failure," said board Chairperson Dolly Perkins, deputy director-technical of NASA Goddard Space Flight Center, Greenbelt, Md.

On Nov. 2, after the spacecraft was ordered to perform a routine adjustment of its solar panels, the spacecraft reported a series of alarms, but indicated that it had stabilized. That was its final transmission. Subsequently, the spacecraft reoriented to an angle that exposed one of two batteries carried on the spacecraft to direct sunlight. This caused the battery to overheat and ultimately led to the depletion of both batteries. Incorrect antenna pointing prevented the orbiter from telling controllers its status, and its programmed safety response did not include making sure the spacecraft orientation was thermally safe.

The board also concluded that the Mars Global Surveyor team followed existing procedures, but that procedures were insufficient to catch the errors that occurred. The board is finalizing recommendations to apply to other missions, such as conducting more thorough reviews of all non-routine changes to stored data before they are uploaded and to evaluate spacecraft contingency modes for risks of overheating.

"We are making an end-to-end review of all our missions to be sure that we apply the lessons learned from Mars Global Surveyor to all our ongoing missions," said Fuk Li, Mars Exploration Program manager at NASA's Jet Propulsion Laboratory, Pasadena, Calif.

EDITORS NOTE:

NASA will hold a media teleconference today at noon PDT (3 p.m. EDT), to discuss the report.

Audio of the teleconference will stream live at: http://www.nasa.gov/newsaudio
djellison
I hope someone asks what the projected remaining on-orbit lifespan of the spacecraft was before it went awol - that tells us the true value of the loss really.

(And guess who got in with the first question - a great one about orientation...nice one ESL smile.gif - I hope you can manage a trademark timeline of events to break it all down )

Damn - I missed the last 5 minutes.

Doug
elakdawalla
I've now posted a story on the review board report.

http://planetary.org/news/2007/0413_Human_...s_Together.html

--Emily
brellis
QUOTE (elakdawalla @ Apr 13 2007, 02:57 PM) *
I've now posted a story on the review board report.

--Emily


Thanks for the thorough reporting. One of the questions lingering in my head about the more advanced computers onboard the unmanned orbiters launched in the last decade has been operating system maintenance. Most of us here on earth now have to deal with OS updates, compatibility, etc., and most of us by now have experienced a fatal error on a home computer at some point. I have several Macs and PC's, each of which has a different combination of repair engines and potential OS crises waiting - like your very appropriate analogy - like a hammer to fall.

In my experience troubleshooting my 'puters, I try to assume that by the time I'm in trouble with one of my machines, it's not the result of only one problem. Usually a few problems have coagulated into a destructive condition. Your article describes an unfortunate sequence of missteps that could have been avoided with a Disk Repair program of some kind.

--Brad
nprev
An absolutely classic 'chain of mistakes/events' scenario, all too familiar from aircraft accident accounts. Excellent reporting, Emily, and thanks!

There are indeed many lessons to be learned here. The main one is that configuration control is an imperative. Two different groups should never have been responsible for maintaining identical spacecraft software-driven bus functions; that's inviting disaster right there.
helvick
I hope Lorne will re-instate his analysis now too - my recall of the article was that it was fundamentally correct and his explanation of the challenges involved in the "simple" day to day management of MGS systems was enlightening.
edstrick
There is a real need for a computer controlled spacecraft to be able to declare "utter dire emergency" and nearly lobotomize itself, switch to a hopefully nearly bulletproof safety control system and safe itself. There's an increasingly long list of lost, nearly lost, and compromised missions where vehicles couldn't properly safemode (Magellan's computer system crashes, NEAR's pre-orbit-insertion burn screwup at Eros, etc.).

Pioneer Jupiter missions never had a computer crash and safemode emergency EVER... (no computer)... The missions were done entirely by direct ground command except for turn and burn stored commands in a sequencer for midcourse maneuvers.
MarsIsImportant
It seems to me that part of the solution is they need to redefine what safe mode is. What they thought was safe mode was actually self-destruct mode. ...Of course, I understand that it's not quite that simple.
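
One small piece of "redefining safe mode" is checking that a candidate safe attitude keeps the Sun off the battery radiator, since the report says the safety response did not verify that the orientation was thermally safe. The sketch below shows the geometry only; the vectors, the 90 degree threshold and the function names are assumptions for illustration, not anything from the MGS design.

```python
import math


def sun_incidence_deg(surface_normal, sun_direction):
    """Angle in degrees between a surface normal and the direction to the Sun."""
    dot = sum(n * s for n, s in zip(surface_normal, sun_direction))
    norm = (math.sqrt(sum(n * n for n in surface_normal))
            * math.sqrt(sum(s * s for s in sun_direction)))
    # Clamp to guard against rounding just outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def attitude_is_thermally_safe(radiator_normal, sun_direction, min_angle_deg=90.0):
    """Treat the attitude as safe when the Sun is at or behind the radiator's
    horizon, i.e. the incidence angle is at least `min_angle_deg` (assumed)."""
    return sun_incidence_deg(radiator_normal, sun_direction) >= min_angle_deg


# Hypothetical body-frame vectors.
radiator_normal = (0.0, 0.0, 1.0)
print(attitude_is_thermally_safe(radiator_normal, (0.0, 0.0, 1.0)))   # Sun straight onto the radiator
print(attitude_is_thermally_safe(radiator_normal, (1.0, 0.0, -1.0)))  # Sun behind the radiator
```

As the post says, the real problem is not that simple: a real safe mode also has to trade this constraint against power, communications and whatever else is articulating at the time.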
mcaplinger
QUOTE (edstrick @ Apr 14 2007, 03:19 AM) *
Pioneer Jupiter missions never had a computer crash and safemode emergency EVER...

The fact that those spacecraft had no need to maintain attitude to the Sun (RTG-powered) and had no articulation makes the problem a lot simpler, doesn't it? Given the complexities of having two separately articulated solar panels, need for battery charge management, an articulated HGA, being in a low orbit with no sun half the time, etc, MGS's safe mode design drivers were vastly more complicated. To think that the way out of these problems is to have a "simpler" safe mode is naive. MGS was lost via a long chain of unlikely errors, any subset of which would have left things OK. We just got unlucky. With 20-20 hindsight, the problems seem rather obvious, as such problems usually do.