Unmanned Spaceflight.com > MGS in Trouble

Zvezdichko

Dec 13 2006, 06:52 PM

Any news regarding the images of HRSC?

tuvas

Dec 15 2006, 06:27 AM

I just found out that the MRO star tracker successfully imaged Odyssey, so there's still a chance that it could image MGS. If it manages to successfully image MGS, then HiRISE will follow suit.

ustrax

Dec 16 2006, 06:37 PM

QUOTE (elakdawalla @ Nov 28 2006, 06:45 PM)

However, he said it was a different part of the panel that was reporting problems this time.

What problems will be tormenting Anahita's generation when it comes to space exploration?!...

How I would love to be around to be concerned...

jaredGalen

Dec 20 2006, 06:37 PM

Anyone have more details on this?? MGS possibly seen, perhaps tumbling?

http://www.livescience.com/blogs/author/leonarddavid

tuvas

Dec 20 2006, 06:46 PM

QUOTE (jaredGalen @ Dec 20 2006, 11:37 AM)

Anyone have more details on this?? MGS possibly seen, perhaps tumbling?

http://www.livescience.com/blogs/author/leonarddavid

Haven't heard a thing, if it is tumbling though, I imagine it might be a good thing... That's just a guess, that's all folks, I don't know anything about spacecraft flight...

elakdawalla

Dec 20 2006, 06:49 PM

A tumbling spacecraft is an out-of-control spacecraft.

--Emily

djellison

Dec 20 2006, 06:52 PM

Well - tumbling would suggest ACS failure - and ACS failure (for whatever reason) would almost certainly mean a spacecraft that was not power positive - and MC's words on that were that unless it was power positive, it was good-night Mr Bond.

But - conversely - SOHO was found to be tumbling back when it played truant for 6 months - and was saved ( even from a point when all its fuel had frozen solid ).

However - I think MGS is probably lost - I just hope that HiRISE can get a picture of MGS...even if it's a hard sequence to get right, and a bandwidth whore ( black should compress really well ) - I think it's an image we would all want to see (and one that would make it into the mainstream press as well )

Doug

tuvas

Dec 20 2006, 06:58 PM

QUOTE (elakdawalla @ Dec 20 2006, 11:49 AM)

A tumbling spacecraft is an out-of-control spacecraft.

--Emily

Well, I guess I've learned a thing or two... I thought tumbling would lead to a power-positive system, at least half power, which would do something. Hmmm. Would be quite interesting to see with HiRISE, to see if it moves during the picture...

djellison

Dec 20 2006, 07:16 PM

Problem is - you could have tumbling in an axis that had zero power - or full power...but more likely somewhere in between. Call it an average of half power. But that half power is over half the orbit - so you never have a charged battery going into eclipse, and always come out with it flattened. Then, if you don't get a good enough angle on the arrays next time around - you end up with a cold spacecraft, a flat battery, and that's when you have actual failures of the hardware required to keep things under control.
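
The energy bookkeeping in that argument can be sketched in a few lines of Python. This is only an illustration: every number below (array output, load, orbit length, sunlit share, battery size) is a made-up placeholder rather than an MGS figure, and the function name `battery_after_orbits` is hypothetical. Each orbit is lumped into a single step, so the model ignores the detail that eclipse loads must come entirely from the battery.

```python
def battery_after_orbits(array_fraction, orbits=10, *,
                         full_array_w=1000.0,  # assumed array output at ideal pointing
                         load_w=400.0,         # assumed constant spacecraft load
                         orbit_h=2.0,          # assumed orbit period in hours
                         sunlit_share=0.5,     # assumed fraction of the orbit in sunlight
                         capacity_wh=600.0,    # assumed battery capacity
                         start_wh=600.0):      # assumed starting charge
    """Return the battery energy (Wh) at the end of each orbit.

    array_fraction is the share of full array output the attitude allows:
    1.0 for ideal pointing, about 0.5 for the "average" tumble described above.
    """
    history = []
    energy = start_wh
    for _ in range(orbits):
        generated = full_array_w * array_fraction * orbit_h * sunlit_share
        consumed = load_w * orbit_h
        # The battery can neither exceed its capacity nor go below empty.
        energy = min(capacity_wh, max(0.0, energy + generated - consumed))
        history.append(energy)
    return history


if __name__ == "__main__":
    print("ideal pointing:", battery_after_orbits(1.0, orbits=4))
    print("half power:    ", battery_after_orbits(0.5, orbits=4))
```

With these placeholder values the ideally pointed case stays fully charged, while the half-power case runs the battery flat within a couple of orbits, which is the "never power positive" situation in miniature.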

Depending on what sort of orientation it is in and what that spin is like - it may well be that after a few months - everything rotates around the sun to give the old girl enough power to wake up and perhaps tell us what's wrong...but it would be a long shot.

Doug

Zvezdichko

Dec 21 2006, 08:31 PM

That's pretty strange. We had contact on 5th November which means that the batteries were charged... that probably means that ... we don't have zero power. So we could still have a chance...

gpurcell

Jan 9 2007, 04:31 PM

From NasaWatch's LiveBlog on MEPAG

"We think that failure that a software load we sent up in June of last year was the cause. This software tried to synch up two flight processors. Two addresses were incorrect - two memory addresses were over written. As the geometry evolved. We drove the arrays against a hard stop and the spacecraft went into safe mode. The radiator for the battery pointed at the sun, the temperature went up, and battery failed. But this should be treated as preliminary."

If true, a sad end to a magnificent mission.

http://www.nasawatch.com/archives/2007/01/..._meeting_i.html

PhilCo126

Jan 9 2007, 06:46 PM

NASA still didn't announce an official " RIP MGS " ... correct?

AlexBlackwell

Jan 10 2007, 06:20 PM

QUOTE (gpurcell @ Jan 9 2007, 06:31 AM)

If true, a sad end to a magnificent mission.

There are a couple of other words that I might use in addition to "sad."

Sunspot

Jan 10 2007, 07:27 PM

Oh Dear............... human error

Bob Shaw

Jan 10 2007, 09:37 PM

QUOTE (Sunspot @ Jan 10 2007, 07:27 PM)

Oh Dear............... human error

It sounds like 'safe mode' wasn't. The whole idea of safe mode is that the spacecraft gets a breathing space while the humans look at the spaghetti code, but in this instance safe was not the word. The real question to me isn't so much why some commands were wrongly written, as why the lifeboat had a hole in it.

MGS was a fine old bird! Better to go out in action than simply to be switched off because of budget pressures.

Bob Shaw

AlexBlackwell

Jan 10 2007, 09:39 PM

QUOTE (Bob Shaw @ Jan 10 2007, 11:37 AM)

Admittedly, there is a dearth of details available, but assuming there is a failure review report, I'd be interested to see how something like this wasn't caught in the testbed.

AlexBlackwell

Jan 10 2007, 10:07 PM

Panel Will Study Mars Global Surveyor Events
NASA/JPL
January 10, 2007

Lorne Ipsum

Jan 10 2007, 11:24 PM

More accurately, it was a parameter upload error (somewhat similar to the error that killed one of the Viking landers). I'm in the process of writing up a blog post to explain, should be up later tonight / early Thursday...

Lorne
Geek Counterpoint

Zvezdichko

Jan 11 2007, 02:26 PM

Space agencies more like miles not meters *crash*

It seems that faulty software has doomed more than half of Mars spacecraft.
Firstly it was Viking ( bad antenna positioning )
Secondly it was Mars Climate Orbiter
Thirdly it was Mars Polar Lander and these spurious signals.
I'm not counting Phobos Spacecraft...
I don't know why it's always software. We almost lost Spirit three years ago...
It really makes me sad.

odave

Jan 11 2007, 03:10 PM

It's the nature of the software beast. I'm a software engineer at an industrial robotics company, and I've been told by a mechanical guy that he detests software because you can't see it or measure it. To him, it's black magic that can fail for no apparent reason. Code can be hideously complex, and even if you test the snot out of it before deploying, it seems like there's always one oddball set of circumstances or sequence that nobody even dreamed of encountering that happens almost immediately (and usually to your most important and sensitive customer)

I've encountered unexpected memory address overwrites in our stuff, and they don't always show up during testing for a variety of reasons. It may be that the test cases didn't create a situation where the corrupted memory was accessed, or that the data that was written to the wrong memory locations is benign at the time of the test. I would assume that JPL's testing is much more rigorous than ours since the stakes are so much higher, but it's really hard and time/fund consuming to test for absolutely everything. So yes, a sad end for MGS if this was the case, but hopefully they can learn from it and improve the testing process.
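
A toy Python illustration of how a misaddressed write can slip past a test that never touches the corrupted location. Everything here is invented for the example: the three-slot memory map, the parameter names, and the off-by-one-slot bug. None of it reflects the actual MGS flight software.

```python
import struct

# Hypothetical static memory map: parameter name -> byte offset of a 4-byte float.
MEMORY_MAP = {"gain": 0, "soft_stop_deg": 4, "antenna_limit_deg": 8}


def write_param(memory, address, value):
    """Write a little-endian 32-bit float at a raw byte offset."""
    struct.pack_into("<f", memory, address, value)


def read_param(memory, name):
    """Read a named parameter back through the memory map."""
    return struct.unpack_from("<f", memory, MEMORY_MAP[name])[0]


def fresh_memory():
    """Build a 12-byte image holding known-good values for all three parameters."""
    memory = bytearray(12)
    write_param(memory, MEMORY_MAP["gain"], 1.0)
    write_param(memory, MEMORY_MAP["soft_stop_deg"], 90.0)
    write_param(memory, MEMORY_MAP["antenna_limit_deg"], 45.0)
    return memory


def upload_new_gain(memory, value):
    """Buggy upload: meant to update "gain" at offset 0, but writes one slot too far."""
    write_param(memory, 4, value)


def gain_check_passes(memory):
    """A narrow test that only confirms the gain is still a positive number."""
    return read_param(memory, "gain") > 0


if __name__ == "__main__":
    memory = fresh_memory()
    upload_new_gain(memory, 2.0)
    print("gain check passes:", gain_check_passes(memory))
    print("soft_stop_deg is now:", read_param(memory, "soft_stop_deg"))
```

The narrow check still passes because the old gain is untouched and plausible, while `soft_stop_deg` silently holds the wrong value until something finally reads it.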

Zvezdichko

Jan 11 2007, 03:15 PM

A little offtopic but...
I'm very concerned about the future and Phoenix. I don't see how spurious signals could be avoided. We have two successful landers ( Viking 1&2 ), and one failure ( MPL ). Actually, we don't know the exact reason for the failure ( for both MPL and MGS ), this is just a likely scenario.
Any news on the latest December attempt with HiRISE? ESA said that they ( may ) have detected a tumbling MGS?

djellison

Jan 11 2007, 03:20 PM

5 landers..V1, V2, MPF, MERA, MERB - all used radar.

Doug

ugordan

Jan 11 2007, 03:22 PM

One also has to consider that it's software, not hardware that actually "thinks" for the spacecraft. Hardware processors, as complex as they may be, have straightforward instruction sets and architecture that can be tested pretty well (though remember that Pentium bug years ago...). Processors are dumb pieces of electronics that expect to be told what to do. Software is what makes the thing "tick" and it's vastly more complex than what is essentially a state machine and a powerful calculator. Were you to develop a processor that did all the thinking by itself, it'd still be bugged because it was designed by humans. Complex tasks mean complex things might happen. They might not always be what you expect. You expect and hope they will be, you can test the hell out of the system, but there are always gremlins hiding somewhere. You can't test everything; remember: even test cases are created by humans!

Zvezdichko

Jan 11 2007, 03:29 PM

I have some information about processors on spacecraft. The statement however is not a processor failure, but overheating of the batteries ( which means death of a spacecraft ).
The previous statement of a tumbling spacecraft could mean at least two things. The spacecraft has lost control after overheating. Or after problems with the solar panel we had improper turn-over of the spacecraft.
Am I right? ( just trying to guess)

mcaplinger

Jan 11 2007, 03:41 PM

QUOTE (Zvezdichko @ Jan 11 2007, 06:26 AM)

It seems that faulty software has doomed more than half of Mars spacecraft.

The VL1 and MCO cases are not what I would call software faults. In the Viking case, ground controllers commanded things with raw memory writes instead of a higher level command protocol, and they inadvertently wrote into the wrong locations. You could argue that they should have had better software, but the software they did have was working as it was supposed to -- it was operator error. The MCO loss was more a process problem, stimulated by a simple calculation error. Nor is the MPL failure a pure software error -- it was a miscommunication between hardware and software design. Of your examples, only the Spirit flash anomaly was what I would call a pure software error, and it was recoverable via other software.
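
The contrast between raw memory writes and a higher level command protocol can be sketched roughly as follows. This is a generic Python illustration, not Viking's or any real spacecraft's command system: the parameter table, names, limits, and both functions are hypothetical.

```python
import struct

# Hypothetical parameter table: name -> (byte offset, minimum, maximum).
PARAMS = {
    "heater_setpoint_c": (0, -20.0, 40.0),
    "charge_rate_a": (4, 0.0, 5.0),
}


def raw_write(memory, address, value):
    """Raw memory write: trusts the operator to supply the right address."""
    struct.pack_into("<f", memory, address, value)


def set_parameter(memory, name, value):
    """Named command: looks up the address itself and rejects out-of-range values."""
    if name not in PARAMS:
        raise KeyError(f"unknown parameter: {name}")
    address, low, high = PARAMS[name]
    if not low <= value <= high:
        raise ValueError(f"{name}={value} outside [{low}, {high}]")
    raw_write(memory, address, value)


if __name__ == "__main__":
    memory = bytearray(8)
    set_parameter(memory, "charge_rate_a", 2.5)
    print(struct.unpack_from("<f", memory, 4)[0])
```

With `raw_write` alone, nothing stops an operator from passing the wrong offset. The named layer removes that one class of mistake, although it cannot catch a wrong but in-range value.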

I can't discuss the MGS failure because unlike some other people on this forum, I was too straightforward in my choice of user name and can't speak anonymously

djellison

Jan 11 2007, 03:41 PM

Some sort of software/commanding problem caused...

A bad attitude which heated up the battery radiator which caused...

Battery failure which caused...

Loss of vehicle, as I understand it so far.

Littlebit

Jan 11 2007, 05:00 PM

QUOTE (mcaplinger @ Jan 11 2007, 08:41 AM)

The MCO loss was more a process problem, stimulated by a simple calculation error...

Another English to Metric conversion?

Given the sheer number of human interactions with the MGS, it is an extraordinary accomplishment of the MGS team to have kept the ball in the air this long. In a way it is like playing Tetris: No matter how great you are, the result of every mission without a firm time-line will be a failure of some sort, usually human...no matter how superhuman the effort:)

I hope they will be candid and timely in providing a detailed description of the failure and the lessons learned. Knowing the reason that an un-timed mission failed is one more mission success.

lastof7

Jan 11 2007, 05:14 PM

I feel for the MGS software team if it turns out to not be a direct software issue. It's understandable that NASA management and the public want to know as quickly as possible the root cause of a fault, but, having experienced similar situations, it's painful to see headlines like "Faulty Software May Have Doomed Mars Orbiter" before we have a definitive answer. Unfortunately, it may be too late to correct the impressions that have been made if it is a parameter issue or something of that nature.

climber

Jan 11 2007, 07:25 PM

It's a well-known fact that the majority of car accidents occur on the roads you know best. You can call that statistics or lack of concentration. The longer a mission goes, the more likely a human error will occur. I'm just amazed how long the Voyagers have flown, it'll be 30 years this year.
To the software people: we back you guys, habit is a bad thing and people only remember your failures. We just CAN'T fly without you.

tedstryk

Jan 11 2007, 10:47 PM

QUOTE (Littlebit @ Jan 11 2007, 05:00 PM)

Another English to Metric conversion?

They may not. They will certainly look into scenarios of what might have happened, but since contact was lost and there was only limited contact between November 2 and November 6 (which, I believe, was the last day they picked any signal out), it may be hard to isolate the cause of failure.

I will also say that human error can have a magnified effect on extended missions, which are usually funded at much lower levels than primary missions, stretching staffing to the bone.

Lorne Ipsum

Jan 12 2007, 12:30 AM

Gang,

This might help explain things a bit:

http://geekcounterpoint.net/files/GC052B.html

Lorne

climber

Jan 12 2007, 12:52 AM

QUOTE (Lorne Ipsum @ Jan 12 2007, 01:30 AM)

Gang,

This might help explain things a bit:
Lorne

Lorne, you have a way of explaining rocket science, I've never seen before!
I've learnt a lot of things...and that seems SO simple to understand.
Thanks so much...

nprev

Jan 12 2007, 12:55 AM

Absolutely superb & highly educational analysis, Lorne; thank you VERY much!

The bottom line is that many parts of this read exactly like every aircraft accident report I've ever read: there is always a chain of events that increases unknowns and ultimately leads the entire system (including the human element) into an uncontrollable situation with basically unpredictable, often undesirable outcomes.

I sure hope that the MGS software team member(s) involved near the end don't feel too bad; they shouldn't. Aside from the brilliant performance of the spacecraft that vastly exceeded all reasonable dreams before launch, complex systemic failures just plain happen. They seem to be an inevitable feature of the Universe, and I'm sure that the mathematics of chaos theory could easily prove this.

Lorne Ipsum

Jan 12 2007, 03:16 AM

nprev & climber,

Thanks -- glad you liked the writeup! I'm with you -- hopefully the poor guy at the bottom of the totem pole doesn't get beat up too severely over this (he'll be reliving it for the rest of his life anyway).

I've worked mission ops for old spacecraft with static memory maps before, and I remember how we ALWAYS got paranoid whenever we did parameter updates. When push comes to shove, the fact that a mistake like this could go unnoticed for months says there's a bad process being followed (or a good one not being followed) somewhere. Hopefully the review board can come up with some lessons that can be applied to more modern architectures.

Lorne

stevesliva

Jan 12 2007, 03:56 AM

Thanks for the extremely informative blog!

I am unsure about one thing though. Towards the end, in "the spark that lit the fire," you do not mention when MGS was switched back to SCP-1. Was this part of the safe mode? Or had it already been transitioned back to SCP-1? And if the transition was an intentional switch back, I tend to agree that at least some process should have caught the bad parm upload (i.e. a comparison of the two memories). But if the switch back was a result of a safing event before the SCP-1 memory repair was fully verified, well, that just sucks but is less faultworthy.
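
A "comparison of the two memories" could look something like the Python sketch below. It assumes, purely for illustration, that both processors' memory images are available on the ground as equal-length byte strings. The function name `diff_memory_images` is hypothetical, and nothing here describes the tools the MGS team actually had.

```python
def diff_memory_images(primary, backup):
    """Return (offset, primary_byte, backup_byte) for every byte that differs."""
    if len(primary) != len(backup):
        raise ValueError("memory images differ in length")
    return [
        (offset, a, b)
        for offset, (a, b) in enumerate(zip(primary, backup))
        if a != b
    ]


if __name__ == "__main__":
    scp1_image = bytes([0x10, 0x20, 0x30, 0x40])  # made-up image contents
    scp2_image = bytes([0x10, 0x99, 0x30, 0x77])  # made-up image contents
    for offset, a, b in diff_memory_images(scp1_image, scp2_image):
        print(f"offset {offset}: primary=0x{a:02x} backup=0x{b:02x}")
```

An empty result means the images match byte for byte. Any reported offsets still have to be mapped back to named parameters by a person or a table, since some differences between the two processors may be legitimate.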

helvick

Jan 12 2007, 10:19 AM

Lorne - superb write up, one of the best bits of reporting on spacecraft ops I've ever come across. Any chance you're available to help the BBC out as they seem to be in need of a major quality control overhaul at the moment?

edstrick

Jan 12 2007, 11:45 AM

Viking's case involved a thrown-together set of people from the disbanded engineering and software team. VL1 was on an automatic "eternal" mission that was hopefully not going to require any further commanding ever. They were trying to salvage or extend the mission by uploading battery conditioning commands as the battery started to show similar problems to the VL2 batteries that killed that lander's operations.

Note that the Magellan Venus radar mapper mission was nearly lost early on due to a high-lethality interrupt handling error that could send the computer essentially into runaway crashes. They finally "trapped" the error when the ground duplicate test system did an interrupt fault and crashed while full diagnostic info was available.

I'm deeply unhappy with trusting in software driven "safe modes", preferring that the spacecraft be able to fall back into an ultimate nearly lobotomized mechanical safe mode. Remember, Pioneers 10 and 11 never had a software problem, never rebooted, never crashed. No computers. All the way beyond Pluto on direct commands (except for sequencer stored commands for midcourse maneuvers).

I'm also deeply unhappy with spacecraft inside of Jupiter's orbit that do not have essentially 100% omnidirectional coverage with low data rate omni-antennas. We nearly lost the ability to command Mariner 10 when it was being stabilized in a drifting roll mode and it rolled into a null in the receiving antenna pattern shortly before the third Mercury encounter. We also had problems with Magellan getting into nearly communication-unable attitudes during one or more of its computer crash crises. You really want to get 8 bits/second telemetry as long as a spacecraft has power and live command decoding circuits, and the ability to send 1 bit/second commands.

Guido

Feb 13 2007, 01:42 PM

QUOTE (Lorne Ipsum @ Jan 12 2007, 01:30 AM)

Gang,

This might help explain things a bit:

http://geekcounterpoint.net/files/GC052B.html

Lorne

Link sends me to "Episode 52 -- The Antikythera Mechanism"

How do I get to the right one? Joining that forum?

PhilHorzempa

Feb 15 2007, 10:39 PM

QUOTE (Lorne Ipsum @ Jan 11 2007, 08:30 PM)

Gang,

This might help explain things a bit:

http://geekcounterpoint.net/files/GC052B.html

Lorne

Lorne,

Have "they" gotten to you? It appears that the excellent post that
you wrote concerning MGS and its software in January is now
"disappeared." In fact, except for TPS' mention of it in their weblog,
and this UMSF thread, there is no hint that that article ever existed.
This is truly bizarre.
What's up Lorne?

Another Phil

Littlebit

Feb 16 2007, 03:37 PM

Somewhere - but I cannot find where - I read Lorne's scenario was not likely to be correct. This may be why the article was pulled, which is too bad, because it was a very good description of MGS era computer systems.

In any case, it will be disappointing if yet another 'successful mission unplanned ending' investigation is kept under wraps.

PhilHorzempa

Feb 16 2007, 05:46 PM

Lorne,

Please let us at UMSF know what happened to your
MGS software article on Geek Counterpoint. It was
an excellent presentation and helped a lot of us who
may not know software as well as you, but are technically
informed enough to comprehend the issues.

If there are questions as to whether this is what really
ended the MGS mission, then please consider re-posting
an edited version of the article that omits that conclusion.
It was fascinating to catch this glimpse into a crucial
aspect of unmanned exploration. As I believe someone
else has already said, our robot explorers do exactly what
we tell them. The unfortunate thing is that sometimes
we don't realize what we have told them.

Another Phil

elakdawalla

Apr 13 2007, 03:44 PM

The preliminary report is out, and it sounds like what Lorne described.
Here's the report:
http://www.nasa.gov/pdf/174244main_mgs_whi...er_20070413.pdf

--Emily

QUOTE

MEDIA RELATIONS OFFICE
JET PROPULSION LABORATORY

NEWS RELEASE: 2007-040 April 13, 2007

REPORT REVEALS LIKELY CAUSES OF MARS SPACECRAFT LOSS

WASHINGTON - After studying Mars four times as long as originally planned, NASA's Mars Global Surveyor orbiter appears to have succumbed to battery failure caused by a complex sequence of events involving the onboard computer memory and ground commands.

The causes were released today in a preliminary report by an internal review board. The board was formed to look more in-depth into why NASA's Mars Global Surveyor went silent in November 2006 and recommend any processes or procedures that could increase safety for other spacecraft.

Mars Global Surveyor last communicated with Earth on Nov. 2, 2006. Within 11 hours, depleted batteries likely left the spacecraft unable to control its orientation.

"The loss of the spacecraft was the result of a series of events linked to a computer error made five months before the likely battery failure," said board Chairperson Dolly Perkins, deputy director-technical of NASA Goddard Space Flight Center, Greenbelt, Md.

On Nov. 2, after the spacecraft was ordered to perform a routine adjustment of its solar panels, the spacecraft reported a series of alarms, but indicated that it had stabilized. That was its final transmission. Subsequently, the spacecraft reoriented to an angle that exposed one of two batteries carried on the spacecraft to direct sunlight. This caused the battery to overheat and ultimately led to the depletion of both batteries. Incorrect antenna pointing prevented the orbiter from telling controllers its status, and its programmed safety response did not include making sure the spacecraft orientation was thermally safe.

The board also concluded that the Mars Global Surveyor team followed existing procedures, but that procedures were insufficient to catch the errors that occurred. The board is finalizing recommendations to apply to other missions, such as conducting more thorough reviews of all non-routine changes to stored data before they are uploaded and to evaluate spacecraft contingency modes for risks of overheating.

"We are making an end-to-end review of all our missions to be sure that we apply the lessons learned from Mars Global Surveyor to all our ongoing missions," said Fuk Li, Mars Exploration Program manager at NASA's Jet Propulsion Laboratory, Pasadena, Calif.

EDITORS NOTE:

NASA will hold a media teleconference today at noon PDT (3 p.m. EDT), to discuss the report.

Audio of the teleconference will stream live at: http://www.nasa.gov/newsaudio

djellison

Apr 13 2007, 06:59 PM

I hope someone asks what the projected remaining on-orbit lifespan of the spacecraft was before it went awol - that tells us the true value of the loss really.

(And guess who got in with the first question - a great one about orientation...nice one ESL - I hope you can manage a trademark timeline of events to break it all down )

Damn - I missed the last 5 minutes.

Doug

elakdawalla

Apr 13 2007, 09:57 PM

I've now posted a story on the review board report.

http://planetary.org/news/2007/0413_Human_...s_Together.html

--Emily

brellis

Apr 14 2007, 03:26 AM

QUOTE (elakdawalla @ Apr 13 2007, 02:57 PM)

I've now posted a story on the review board report.

--Emily

Thanks for the thorough reporting. One of the questions lingering in my head about the more advanced computers onboard the unmanned orbiters launched in the last decade has been operating system maintenance. Most of us here on earth now have to deal with OS updates, compatibility, etc., and most of us by now have experienced a fatal error on a home computer at some point. I have several Macs and PC's, each of which has a different combination of repair engines and potential OS crises waiting - like your very appropriate analogy - like a hammer to fall.

In my experience troubleshooting my 'puters, I try to assume that by the time I'm in trouble with one of my machines, it's not the result of only one problem. Usually a few problems have coagulated into a destructive condition. Your article describes an unfortunate sequence of missteps that could have been avoided with a Disk Repair program of some kind.

--Brad

nprev

Apr 14 2007, 03:43 AM

An absolutely classic 'chain of mistakes/events' scenario, all too familiar from aircraft accident accounts. Excellent reporting, Emily, and thanks!

There are indeed many lessons to be learned here. The main one is that configuration control is an imperative. Two different groups should never have been responsible for maintaining identical spacecraft software-driven bus functions; that's inviting disaster right there.

helvick

Apr 14 2007, 08:59 AM

I hope Lorne will re-instate his analysis now too - my recall of the article was that it was fundamentally correct and his explanation of the challenges involved in the "simple" day to day management of MGS systems was enlightening.

edstrick

Apr 14 2007, 10:19 AM

There is a real need for a computer controlled spacecraft to be able to declare "utter dire emergency" and nearly lobotomize itself, switch to a hopefully nearly bulletproof safety control system and safe itself. There's an increasingly long list of lost, nearly lost, and compromised missions where vehicles couldn't properly safemode (Magellan's computer system crashes, NEAR's pre-orbit-insertion burn screwup at Eros, etc.).

Pioneer Jupiter missions never had a computer crash and safemode emergency EVER... (no computer)... The missions were done entirely by direct ground command except for turn and burn stored commands in a sequencer for midcourse maneuvers.

MarsIsImportant

Apr 14 2007, 12:29 PM

It seems to me that part of the solution is they need to redefine what safe mode is. What they thought was safe mode was actually self-destruct mode. ...Of course, I understand that it's not quite that simple.

mcaplinger

Apr 14 2007, 02:35 PM

QUOTE (edstrick @ Apr 14 2007, 03:19 AM)

Pioneer Jupiter missions never had a computer crash and safemode emergency EVER...

The fact that those spacecraft had no need to maintain attitude to the Sun (RTG-powered) and had no articulation makes the problem a lot simpler, doesn't it? Given the complexities of having two separately articulated solar panels, need for battery charge management, an articulated HGA, being in a low orbit with no sun half the time, etc, MGS's safe mode design drivers were vastly more complicated. To think that the way out of these problems is to have a "simpler" safe mode is naive. MGS was lost via a long chain of unlikely errors, any subset of which would have left things OK. We just got unlucky. With 20-20 hindsight, the problems seem rather obvious, as such problems usually do.
