NASA rover reboots twice over Easter weekend |
Apr 14 2009, 12:47 AM
Post
#1
|
|
Martian Photographer Group: Members Posts: 352 Joined: 3-March 05 Member No.: 183 |
|
|
|
Apr 14 2009, 06:04 AM
Post
#2
|
|
Member Group: Members Posts: 184 Joined: 2-March 06 Member No.: 692 |
One hopes that Spirit isn't seeing some wear and tear from the computer doing very many reboots at the start of its mission. Didn't Spirit just have some computer issues before the latest software upload?
And how does this affect Oppy? Does it stand down to see if a common software bug could affect it? Brian |
|
|
Apr 14 2009, 07:36 AM
Post
#3
|
|
Senior Member Group: Moderator Posts: 4279 Joined: 19-April 05 From: .br at .es Member No.: 253 |
> One hopes that Spirit isn't seeing some wear and tear from the computer doing very many reboots at the start of its mission.
Huh? I can't see how reboots could cause wear and tear on the computer. Perhaps the opposite: wear and tear on the computer causing reboots.
> Didn't Spirit just have some computer issues before the latest software upload?
See here: Unexpected Behavior
Edited:
> And how does this affect Oppy? Does it stand down to see if a common software bug could affect it?
Just checked today's imaging plan for Opportunity and it has all the signs of a driving sol. |
|
|
Apr 14 2009, 11:55 AM
Post
#4
|
|
Interplanetary Dumpster Diver Group: Admin Posts: 4404 Joined: 17-February 04 From: Powell, TN Member No.: 33 |
I don't think the early boots would have done permanent damage to the computer. Brian, shouldn't you be looking on the bright side of life?
-------------------- |
|
|
Apr 14 2009, 12:34 PM
Post
#5
|
|
Solar System Cartographer Group: Members Posts: 10197 Joined: 5-April 05 From: Canada Member No.: 227 |
"Brian, shouldn't you be looking on the bright side of life? "
Good one, Ted
Phil
--------------------
... because the Solar System ain't gonna map itself.
Also to be found posting similar content on https://mastodon.social/@PhilStooke
Maps for download (free, PD): https://upload.wikimedia.org/wikipedia/comm...Cartography.pdf
NOTE: everything created by me which I post on UMSF is considered to be in the public domain (NOT CC, public domain) |
|
|
Apr 14 2009, 02:06 PM
Post
#6
|
|
Member Group: Admin Posts: 976 Joined: 29-September 06 From: Pasadena, CA - USA Member No.: 1200 |
> One hopes that Spirit isn't seeing some wear and tear from the computer doing very many reboots at the start of its mission. Didn't Spirit just have some computer issues before the latest software upload? And how does this affect Oppy? Does it stand down to see if a common software bug could affect it? Brian
I am on vacation this week (Spring Break with my kids in the PNW), so I do not know what's going on at Gusev. I know Opportunity is driving (forwards!!).
Related to computer booting: I don't think that adds wear and tear. I know of a company that built an empire around computer rebooting.
Paolo
--------------------
Disclaimer: all opinions, ideas and information included here are my own, and should not be intended to represent opinion or policy of my employer.
|
|
|
Apr 14 2009, 03:59 PM
Post
#7
|
|
Dublin Correspondent Group: Admin Posts: 1799 Joined: 28-March 05 From: Celbridge, Ireland Member No.: 220 |
I don't think reboots should affect much, but Flash memory does degrade with use. It takes a while, but we are running into fairly large data volumes over the lifetime of the rovers. I'm pretty sure that the type of Flash memory used in the MERs is good for around 100k write cycles per cell, but five years with a few tens of GB of data throughput in the relatively harsh environment of the Martian surface might be enough to start seeing more frequent transient errors if there were any significant "hotspot" on the Flash drive getting a lot more write activity than the average.
However, I suspect that if this were the root cause, Opportunity would be more likely to exhibit the problem, as I'm pretty sure she has delivered more data. And given the use of deep sleep mode, any wear related to the boot process should also hit Opportunity sooner than Spirit, since the former has made much more use of it, IIRC.
Here's hoping it was just some freak occurrence of cosmic ray hits. |
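For a rough feel of the numbers in the post above, here is a back-of-envelope sketch in Python. Only the 100k write-cycle figure and "a few tens of GB" come from the post; the 256 MB flash size, the 30 GB total, the 64 KB hotspot block, and its 1% share of writes are illustrative assumptions, not mission data.

```python
MB = 1024 ** 2
GB = 1024 ** 3

def average_cycles(total_bytes_written, flash_bytes):
    """Average write cycles per cell if writes are spread evenly over the flash."""
    return total_bytes_written / flash_bytes

def hotspot_cycles(total_bytes_written, hotspot_bytes, hotspot_share):
    """Write cycles seen by a small region that absorbs `hotspot_share` of all writes."""
    return total_bytes_written * hotspot_share / hotspot_bytes

# All values below are assumptions chosen for illustration only.
FLASH_BYTES = 256 * MB      # assumed flash capacity
TOTAL_WRITTEN = 30 * GB     # "a few tens of GB" over the mission
HOTSPOT_BYTES = 64 * 1024   # hypothetical frequently rewritten block
HOTSPOT_SHARE = 0.01        # hypothetical: 1% of all writes land on that block
RATED_CYCLES = 100_000      # figure quoted in the post

avg = average_cycles(TOTAL_WRITTEN, FLASH_BYTES)
hot = hotspot_cycles(TOTAL_WRITTEN, HOTSPOT_BYTES, HOTSPOT_SHARE)

print(f"average cycles per cell: {avg:.0f}")
print(f"hotspot cycles:          {hot:.0f}")
print(f"hotspot fraction of rated life: {hot / RATED_CYCLES:.1%}")
```

Under these assumed numbers the evenly spread case is nowhere near the rated life, so any wear concern rests on how concentrated the writes are. Changing `HOTSPOT_SHARE` or `HOTSPOT_BYTES` shows how quickly that picture shifts.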
|
|
Apr 14 2009, 04:41 PM
Post
#8
|
|
Senior Member Group: Members Posts: 1585 Joined: 14-October 05 From: Vermont Member No.: 530 |
> I don't think reboots should affect much but Flash memory does degrade with use. It takes a while but we are running into fairly large data volumes for the lifetime of the rovers. I'm pretty sure that the type of Flash memory used in the MERs is good for around 100k write cycles per cell but five years with a few tens of GB of data throughput in the relatively harsh environment of the Martian surface might be enough to start seeing more frequent transient errors if there was any significant "hotspot" on the Flash drive that was getting a lot more write activity than the average.
Even if the memory doesn't use the algorithms that balance write cycling (and flash architecture usually needs to balance only by sectors or pages, or whatever the minimum memory chunk is that can be erased before reprogramming, not by individual bit), it's worth bearing in mind that, like a rover with a 90-day guarantee, each individual flash cell has a 100k (or more) guarantee, but the average flash cell will achieve far more than that.
And if there is overhead in the ECC, a single bad bit isn't going to kill the word. |
|
|
Apr 14 2009, 05:28 PM
Post
#9
|
|
Senior Member Group: Members Posts: 3648 Joined: 1-October 05 From: Croatia Member No.: 523 |
> And if there is overhead in the ECC, a single bad bit isn't going to kill the word.
Slightly related to this, while browsing through the recent Cassini PDS release info, I noticed they detected a bad spot in one of its SSRs causing double-bit errors (so they're not caught) and various kinds of corruption in the ISS images. They were planning on developing a SW patch to avoid the bad segment.
Memory corruption sucks, doesn't it?
-------------------- |
|
|
Apr 14 2009, 07:06 PM
Post
#10
|
|
Senior Member Group: Members Posts: 1585 Joined: 14-October 05 From: Vermont Member No.: 530 |
> Memory corruption sucks, doesn't it?
Yeah, there's nothing worse than field returns in my business. Hard to bring 'em back from space, though.
Sure it's not merely uncorrectable? I'd guess double-bit errors are detectable. (But not necessarily.) It is a good point, though, that once you have an always-bad bit, your overhead *is* shot, and your transient errors will be uncorrectable in all likelihood. |
|
|
Apr 16 2009, 04:31 PM
Post
#11
|
|
Member Group: Members Posts: 214 Joined: 30-December 05 Member No.: 628 |
> ...it's worth bearing in mind that, like a rover with a 90-day guarantee, each individual flash cell has a 100k (or more) guarantee, but the average flash cell will achieve far more than that. And if there is overhead in the ECC, a single bad bit isn't going to kill the word.
I don't know whether to interpret the 100K cycles "guarantee" as a minimum, mean, median, or even a modal value. But surely the second moment must be important in this sort of problem. If the sigma for expected failure is wide enough around, say, 100K, and if a particular programming operation "samples" from, say, 20KB (WAG) worth of cells, the chance of the program crashing must become significant long before the average cell accumulates 100K read/writes.
(I take it ECC refers to some kind of error correction, which probably can catch and correct the early failures if they are rare enough.) |
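The "second moment" argument in the post above can be made concrete with a small Python sketch. It assumes, purely for illustration, that cell lifetimes are independent and normally distributed, that each cell stores one bit, and that there is no ECC. The mean, sigma, and 20 KB region size are hypothetical values, not measured flash data.

```python
from math import expm1, log1p
from statistics import NormalDist

def p_any_cell_failed(cycles, mean_life, sigma, n_cells):
    """Probability that at least one of n_cells has worn out after `cycles` writes.

    Assumes independent cells with normally distributed lifetimes
    (an illustrative model, not a real flash wear model).
    """
    p_one = NormalDist(mu=mean_life, sigma=sigma).cdf(cycles)
    # 1 - (1 - p_one) ** n_cells, written to stay accurate for tiny p_one
    return -expm1(n_cells * log1p(-p_one))

MEAN_LIFE = 100_000        # hypothetical mean cycles to failure
SIGMA = 20_000             # hypothetical spread
N_CELLS = 20 * 1024 * 8    # 20 KB region, assuming one bit per cell

for cycles in (10_000, 20_000, 30_000):
    single = p_any_cell_failed(cycles, MEAN_LIFE, SIGMA, 1)
    region = p_any_cell_failed(cycles, MEAN_LIFE, SIGMA, N_CELLS)
    print(f"{cycles:>6} cycles: one cell {single:.2e}, any of {N_CELLS} cells {region:.3f}")
```

With these assumed parameters a single cell is very unlikely to have failed at a fifth of the mean life, yet the chance that some cell in the region has failed is already high. That is the effect the post describes, and it is also why the shape of the distribution's tail matters more than its center.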
|
|
Apr 16 2009, 06:57 PM
Post
#12
|
|
Senior Member Group: Members Posts: 1585 Joined: 14-October 05 From: Vermont Member No.: 530 |
If the program is merely reading from those cells, it's not an issue. Just writing. So you could use flash as instruction memory that you might update a few times in a mission, and you can use it as a storage repository for photos. Even if you filled the flash every sol, we're not at 2000 yet. What you cannot use it as is RAM: a scratchpad for doing calculation.
Yes, the ECC is there (if it's there) to correct errors in the memory word. For 128 bits, you might write 16 extra syndrome bits that algorithmically would allow you to correct a single bit in the 128 that is wrong. To my knowledge, the ECC isn't there to correct the hard errors that come with exceeding write cycling; it's there to correct for errors that just happen on occasion, in fantastically mind-boggling, flash-specific ways. But it would help cover up hard errors.
To guarantee 100K cycles, you have to bear in mind that, yes, you might be making this guarantee for over 16 billion cells on a 16Gb chip. So if your guarantee for your typical statistical cell meets that to even 10 sigma (or whatever one in a billion cells not meeting the spec would mean), you're still going to get fails on that chip.
What they do to spec 100K would be a combination of test (throw out entire bad chips), redundant cells and repair (find the bad bits and fix them... how you find suspect bits without destroying a chip is top secret), and the aforementioned ECC if your process engineers can't totally solve this particular problem.
And yeah, you might still get a cell in an iPod somewhere that goes bad before its time, but the stats guys are trying really hard to ensure that that is extremely rare by eliminating the tail of the distribution. My point was that the actual center of the distribution is still going to be somewhere far, far above 100K to make this guarantee.
Just delivering a memory chip that works from Day 0 is a similar game of stats... even if your process engineers deliver a process where only one in a million cells is failing a spec, every single 1Gb chip would have on average 1000 bad cells! So after manufacturing, there is a lot of test to be done to fix things and eliminate those fliers. At the same time, there are 900 million cells that greatly exceed the spec. |
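To illustrate how extra bits let you locate and fix a single wrong bit in a word, here is a minimal Python sketch of a textbook Hamming single-error-correcting code. It is not the ECC used on the rovers or in any particular flash part, which this thread does not specify. For 128 data bits the textbook minimum is 8 check bits; real designs often spend more (for example to also detect double-bit errors), so a larger figure such as the 16 mentioned above is plausible as a design choice.

```python
def check_bits_needed(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (Hamming bound for single-error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r

def encode(data):
    """Return a Hamming codeword (list of 0/1) for the given list of data bits.

    Positions are numbered from 1; power-of-two positions hold check bits.
    """
    r = check_bits_needed(len(data))
    n = len(data) + r
    code = [0] * (n + 1)              # index 0 unused
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):           # not a power of two: data position
            code[pos] = next(bits)
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos
    for i in range(r):                # set check bits so the overall syndrome is zero
        if syndrome & (1 << i):
            code[1 << i] = 1
    return code[1:]

def correct(codeword):
    """Return (corrected codeword, syndrome). A nonzero syndrome is the 1-based
    position of a single flipped bit. Two flipped bits are NOT handled correctly."""
    syndrome = 0
    for pos, bit in enumerate(codeword, start=1):
        if bit:
            syndrome ^= pos
    fixed = list(codeword)
    if 0 < syndrome <= len(fixed):
        fixed[syndrome - 1] ^= 1
    return fixed, syndrome

data = [i % 2 for i in range(128)]    # arbitrary 128-bit test word
word = encode(data)
damaged = list(word)
damaged[41] ^= 1                      # flip one bit (1-based position 42)
repaired, where = correct(damaged)
print(len(word), where, repaired == word)
```

Note that this plain code silently miscorrects when two bits are wrong, which is in line with the earlier remarks in this thread about double-bit errors and about a permanently bad bit using up the correction margin.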
|
|
Apr 17 2009, 03:34 AM
Post
#13
|
|
Senior Member Group: Members Posts: 4252 Joined: 17-January 05 Member No.: 152 |
A few details on the Spirit anomalies in the new update.
|
|
|
Apr 18 2009, 08:10 PM
Post
#14
|
|
Newbie Group: Members Posts: 13 Joined: 6-April 09 Member No.: 4720 |
Guarantee is not a term you'd use for a Martian rover. It's a business term. The chips would probably have something like a mean time between failures, or a mean time before failure, rating. Age can be a factor, since almost all mechanical failures are from thermal cycles. And of course you have random failures that are as likely on day one as on day five thousand.
I'd guess they could map out physical bit failures in memory, but don't really know if that was included. |
|
|
Apr 19 2009, 01:03 AM
Post
#15
|
|
Member Group: Members Posts: 753 Joined: 23-October 04 From: Greensboro, NC USA Member No.: 103 |
> A few details on the Spirit anomalies in the new update.
Can someone please explain in clearer English this extract from the above-referenced update: "no sol number for Spirit corresponded to April 2, 2009, using the criterion of the date in Los Angeles at local solar noon on Mars"?
Thanks, Jonathan
--------------------
Jonathan Ward
Manning the LCC at http://www.apollolaunchcontrol.com |
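One way to read the quoted sentence: a sol lasts about 24 h 39 m 35 s, so local solar noon on Mars falls roughly 40 minutes later on the Earth clock each sol. If every sol is labelled with the Earth calendar date at that noon, the label eventually jumps over a whole date. The Python sketch below shows the effect with a hypothetical starting noon; it does not use Spirit's real timeline and it ignores time zones and daylight saving.

```python
from datetime import datetime, timedelta

# Mean length of a Martian solar day (sol) in Earth seconds.
SOL = timedelta(seconds=88775.244)

def skipped_earth_dates(first_noon, n_sols):
    """Earth calendar dates that contain no Mars local noon.

    `first_noon` is the Earth-clock time of Mars local solar noon on the first
    sol (hypothetical here); later noons are spaced exactly one sol apart.
    """
    noons = [first_noon + i * SOL for i in range(n_sols)]
    labelled = {t.date() for t in noons}
    skipped = []
    day = noons[0].date()
    while day <= noons[-1].date():
        if day not in labelled:
            skipped.append(day)
        day += timedelta(days=1)
    return skipped

# Hypothetical start: Mars noon coincides with 12:00 on 1 January 2009.
start = datetime(2009, 1, 1, 12, 0)
for d in skipped_earth_dates(start, 100):
    print("no sol labelled with", d)
```

With this made-up start, a date is skipped about once every 36 to 37 sols, which is the same kind of gap the update describes for April 2, 2009.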
|
|