Unmanned Spaceflight.com > Sol 22 anomaly


elakdawalla

Jun 18 2008, 09:39 PM

Today's press release from the Phoenix mission contained the following nugget of information:

QUOTE

Newly planned science activities will resume no earlier than Sol 24 as engineers look into how the spacecraft is handling larger than expected amounts of data.

This sounded alarming, and immediately brought the very scary Spirit sol 18 anomaly to mind. (That anomaly, in brief, had to do with too many files being kept in flash memory, which resulted in Spirit descending into a cycle of continuous reboots that might, if not stopped, have depleted the batteries and killed the rover within a day or two. Through heroic efforts Spirit was recovered and obviously returned to perfect health.)

I requested an interview with someone from JPL and am happy to say I just got a call from Barry Goldstein. I'm copying here the entire text of what he said to me. I will be blogging this but am wondering if someone here could help expand a bit on the business about APIDs (Application Process Identifiers) and what part they play in an operating system. I started off by asking for more detail on the problem, and for him to compare and contrast with Spirit Sol 18.

QUOTE ('Barry Goldstein')

When the anomaly happened with Spirit, we lost communication. We never lost communication or control of the vehicle here. It's quite different. On Spirit we had a file management problem that ran amok.

What happened was, at one of the downlinks on sol 22, the engineering housekeeping data was being looked at by the spacecraft team. And they noticed one of the APIDs for a housekeeping data packet, which is normally generated only one to three times every time we do an uplink, was generated 45,000 times. It was a surprise, to say the least. And the reaction of the team was, the obvious which was concern about why the heck did this happen, and the other issue was we were concerned about two things. One, since the APID priority for this data type was very high, would it starve out any of the science data from being saved overnight because it's now so large? And the resolution of that it turned out, yes indeed, it was that large, and we ended up losing very low priority science data from sol 22. But the scientists are not at all concerned about that. The second concern we had yesterday was, we had a restriction on the amount of time it takes for the spacecraft to boot. I can't remember the total value but it's over 60 seconds. If it doesn't boot within a certain amount of time, it will reset and then eventually go over to the B side (it's block redundant, unlike MER). The reason we were concerned is that this data structure, now which is huge because of these 45,000 blocks, it has to pull that out of the flash as part of the boot process. And so we were concerned it would take too long and therefore it would side-swap. So we took some emergency action last night, and I'm happy to say we got the uplinks in due to the following things. Number one, we updated the priority of that APID such that it will restrict the amount of that data type to be saved in flash. Second thing we did is we lost science operations on sol 23. Third thing we did is up the priority of the downlink of that data structure that we generated so often so that we could retrieve what we have so it could help us diagnose the problem. The current state of the spacecraft is as follows.

We have the data down, we have the spacecraft under control, we have the size of the file system in control such that we're no longer worried about the size of the file system growing and keeping us from booting appropriately. The second thing is, the only restriction we put on science activity for sol 24, which the science team is planning right now, is that they can't save the data to the flash because we want to keep the flash small, we don't want this thing to eat us alive. So what the team is doing now is planning sol 24. However, there's a little paradox here. Because we were in this anomalous state, we requested and received a bunch of contingency passes from MRO and Odyssey. So what ends up happening is we told the science team you can do whatever you want, because the only thing we are worried about was flash, we just are not going to save it to flash when we turn off. And we then told them we have all these passes. So as it turns out, what the science team is planning is the most data-rich sol we've had to date, because we have all these extra passes. I was joking with Peter that he should pray for these things more often because he gets more data.

{What other kind of memory is there besides flash?} We execute out of RAM, and every time we turn the vehicle off to save power at night, charge the batteries, we save off the critical data structures which include this file system with the telemetry that has not been marked as received on the ground. And that's what really ate our lunch is the saving of this to the flash. We ran out of room in the flash and that's what caused them to lose the science data, which was low priority. And then it's the time it takes to read it out of flash and get it down on the ground.

{What's generating all these APIDs?} We have a suspect, and I'd prefer not to go into a lot of detail, but the suspect has to do with the packet counter number for each of the packets that are stored. It's been less than 24 hours so I'd like to let the team get a chance to look at this and analyze it completely. At this point it's our prime suspect but that doesn't necessarily mean it will pan out.

Even though we have had this anomaly, the vehicle is under control. We lost a sol of operations, because when this occurred we stopped the uplink for that sol. We have the vehicle under control, we understand the problem, we don't know the root cause, but we've taken preventive measures to make sure it's still functional without risking a problem.

It's much less scary {than Spirit sol 18} but I'll feel a lot better when we know exactly what's going on. All these things are scary to one degree or another. I'd rather have this problem though; not hearing from a vehicle is disconcerting.

--Emily

jmjawors

Jun 18 2008, 09:50 PM

That little blurb caught my eye as well. Thanks for following up on it.

Afraid I can't help with the APIDs, though.

climber

Jun 18 2008, 10:06 PM

An item from AW&ST, June 9th, Craig Covault, page 35: "thirty days of the Phoenix 90-day mission are planned as lander "down days" when primary sampling or other science commands will be disrupted by relay difficulties". I understand that this means Phoenix can do unplanned observations.
It's not the case described by Emily & Barry but I guess there is some room for some issues to show up and still be ok within the 90 sols.

mcaplinger

Jun 18 2008, 10:35 PM

QUOTE (elakdawalla @ Jun 18 2008, 01:39 PM)

I... am wondering if someone here could help expand a bit on the business about APIDs (Application Process Identifiers) and what part they play in an operating system.

From http://mars.jpl.nasa.gov/MPF/nasa/pipfaq.html -- this is old Pathfinder data, but the general concept is the same.

How is data acquired by the spacecraft stored in the central computer and prioritized for return to Earth?

Answer:

All Pathfinder downlink data are packetized and assigned to APID ("Application Process IDentifier") queues, from which data is downlinked in FIFO (first in first out) order. The packetization and assignment takes place immediately as a result of execution of commands which acquire data -- separate commands to packetize and enqueue the data are not used. APIDs can be configured as rings, where old data is overlayed, or as queues, where new data is rejected if the size limit is exceeded. APIDs are identified by both name and number (0-42). Specific data formats are permanently assigned to specific classes of queues. For instance, one queue is for rover health data, and only data of that format can be assigned to it. There are multiple queues for IMP image data. Any IMP image can be assigned to any of these IMP queues, and within limits, the assignment is negotiable.

Within a single downlink session, APIDs are prioritized according to a two-dimensional priority matrix called a DPT ("Downlink Priority Table"). The DPT structure is used to make sure the most important data gets in the front of the downlink stream, regardless of when it is acquired, with the proviso that downlink from individual queues is FIFO. In the DPT, APIDs can be assigned to completely override others in priority (ie, completely prevent other APIDs from getting any downlink so long as any data is left in the higher priority APID), or they can be assigned to share a priority level on a percentage-of-bits basis.

Different downlink sessions can be governed by different DPTs, and within limits, the DPT organization is negotiable.

It is not possible to reorder packets within the queue, nor is it possible to move data packets between queues. Data packets can be deleted from the front of the queue up to a commanded time (the time when they were acquired) or by specific packet number at any point in the queue. It is not currently possible to delete specific data reliably from the center of a queue, but further study could mitigate this problem.
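The ring-versus-queue distinction and the strict-priority case of the DPT described in the FAQ above can be illustrated with a small sketch. This is only a toy model, not flight software: the class `ApidBuffer`, the function `downlink_session`, the capacities and the APID names "health" and "images" are all hypothetical, and the percentage-of-bits sharing mode is not modelled.

```python
from collections import deque


class ApidBuffer:
    """Hypothetical model of one APID buffer holding at most `capacity` packets."""

    def __init__(self, capacity, ring):
        self.capacity = capacity
        self.ring = ring            # True: ring (overwrite oldest); False: queue (reject new)
        self.packets = deque()

    def enqueue(self, packet):
        """Store a packet; return False if it was rejected."""
        if len(self.packets) >= self.capacity:
            if not self.ring:
                return False        # queue mode: new data is rejected when full
            self.packets.popleft()  # ring mode: the oldest packet is overwritten
        self.packets.append(packet)
        return True


def downlink_session(buffers, dpt_order, budget):
    """Send up to `budget` packets, draining APIDs in strict priority order.

    Within each APID the order is FIFO, as the FAQ describes.
    """
    sent = []
    for name in dpt_order:
        queue = buffers[name].packets
        while queue and len(sent) < budget:
            sent.append((name, queue.popleft()))
    return sent


health = ApidBuffer(capacity=3, ring=True)
images = ApidBuffer(capacity=3, ring=False)
for n in range(5):
    health.enqueue(n)   # ends up holding 2, 3, 4
    images.enqueue(n)   # ends up holding 0, 1, 2; packets 3 and 4 are rejected

sent = downlink_session({"health": health, "images": images},
                        ["health", "images"], budget=4)
print(sent)
```

With a budget of four packets, the higher-priority "health" buffer is emptied before any "images" packet is sent, which is the "completely override" behaviour the FAQ mentions.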

helvick

Jun 18 2008, 10:38 PM

Emily,

My understanding is that each spacecraft [telemetry?] function that generates engineering data has an associated APID and at each data point that that particular function samples whatever it is looking at it generates a data packet (which is stored in CCSDS format or something close to that) that includes the APID and an incremented counter in its CCSDS packet header. CCSDS packets also have a priority level associated with them that is also included in the header - that priority level is used to indicate which packets should "win" out in cases where resource contention arises during relay.

The situation that they describe sounds to me as if some telemetry function is returning far more data packets of a high priority level than expected and the sheer volume of those data packets (45,000 at maybe 1Kbyte per packet would chew up 45Mbyte of storage for example) was filling up available data storage. The spacecraft's operating system responds to that by deleting lower priority packets already present if necessary in order to save the packets that it believes are more important.

Given that they clearly know the APID they must know what spacecraft function is generating the packets - that doesn't necessarily mean that the root cause is obvious but it sounds to me that this should be easier to get to the bottom of than the Spirit Sol 18 problem.

The above is mostly speculative but I got some hints from here.

jekbradbury

Jun 18 2008, 10:45 PM

Why is nobody looking on the bright side? We get more science and more images than ever before on sol 24, and we miss only one sol of data (hopefully).

Deimos

Jun 19 2008, 12:04 AM

The current problem is within an engineering APID, 40. The APID structure and use is similar to MPF, but there are some interesting nuances. There are also more APIDs for both engineering and science (SSI has 13). Flash size motivated the science team to ask for more APIDs. There is a downlink priority table (DPT) in use on any given downlink. There is also a nighttime priority table that is used when saving to flash. Higher priority APIDs get saved first, lower ones may not make it. So APIDs map a 2-D space: How urgent is it that we get the data soon? How important is it that we never lose the data?

You can imagine a few kinds of data. An image of the dig we just did may be key for the next planning cycle, thus it has to be high on the DPT. But, if we somehow didn't get it (think electra), we could as easily reacquire it as save it overnight. So maybe it is low on the NPT (in practice, this specific example tends to be high in both). Or a RAC image of a sample in the scoop just before we deliver that sample: you may not need that image to plan the next sol, but you can never take it again. So, high in the NPT, maybe not in the DPT. A TEGA or WCL run ends up being very high in the NPT; they may also be high in the DPT if, for example, a follow-on TEGA ramp is desired the next sol. And, many things are not urgent and can also be redone. An image of some rock several meters away: if it falls out of flash, just take the picture again. So, for every product generated, a decision has to be made on both urgency and the need to save the data--then APIDs are assigned.

In strategic planning, the data that is neither urgent nor critical to save to flash (especially SSI_LOW) has gotten the nickname "red-shirt" data, and is always vulnerable to loss in the event of even minor problems. Actually we've only lost it a few times though.

A further complication is "sent" data. If the data were specifically for tactical planning, you could treat it as "fire and forget". If the data is a TEGA bake, you cannot. What if the data are lost in transmission and need resending? Thus, the most important "sent" data trumps the least important unsent data (the red-shirts) when saving to flash.

And just when you thought I'd be out of further complications ... what if we could use MRO to get an extra 30-40 Mb of data? But, what if we knew there was a larger risk of losing that data compared to the (now normal) ODY passes? You want to take and send the extra data; but you cannot afford to send urgent or critical data the risky way. Send in the red-shirts. So sometimes the lower priority stuff comes down at 2 PM (Mars time) while the more urgent stuff waits until 4 PM (and the first ODY pass).

How close a resemblance does this bear to MER file management? Well, just about none. There are files ... they are managed ... that's about it. Actually, on MER files are managed, on PHX APIDs are managed. MER data have priorities that can be dynamically reassigned (as opposed to moving whole APIDs around) and do not use APIDs for prioritization. MER priorities are for both saving and downlink, and MER is managed to avoid most "auto-deletes" when there is more data than flash. Phoenix cannot be managed that way, since we usually have more downlink in a sol than flash capacity, before even worrying about sent data that needs protecting.
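The nighttime save described in this post (higher-priority APIDs are written to flash first, lower ones may not make it) can be sketched roughly as follows. Everything here is an illustrative assumption: the function `save_to_flash`, the APID names other than SSI_LOW, the ranks, the sizes and the flash capacity are made up, and the real flight software may well handle a product that does not fit differently.

```python
def save_to_flash(products, npt_rank, flash_capacity):
    """Pick which products survive the night.

    products: list of (name, apid, size) tuples in acquisition order.
    npt_rank: dict mapping APID name to rank (0 is saved first).
    Returns (saved, lost) lists of product names.
    """
    saved, lost = [], []
    free = flash_capacity
    # sorted() is stable, so products within one APID keep acquisition order.
    for name, apid, size in sorted(products, key=lambda p: npt_rank[p[1]]):
        if size <= free:
            saved.append(name)
            free -= size
        else:
            lost.append(name)   # "red-shirt" data falls out of flash
    return saved, lost


# Hypothetical night: an engineering APID has ballooned and sits at the top of the table.
npt_rank = {"ENG_HOUSEKEEPING": 0, "TEGA": 1, "SSI_LOW": 2}
products = [
    ("distant_rock_image", "SSI_LOW", 10),
    ("tega_bake", "TEGA", 20),
    ("housekeeping_flood", "ENG_HOUSEKEEPING", 80),
]
saved, lost = save_to_flash(products, npt_rank, flash_capacity=100)
print(saved, lost)
```

In this toy run the oversized housekeeping product and the TEGA data fill the flash, and only the low-priority SSI_LOW image is lost, which mirrors the outcome reported for sol 22.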

ugordan

Jun 19 2008, 07:34 AM

QUOTE (Deimos @ Jun 19 2008, 02:04 AM)

In strategic planning, the data that is neither urgent nor critical to save to flash (especially SSI_LOW) has gotten the nickname "red-shirt" data, and is always vulnerable to loss in the event of even minor problems.

I don't suppose "red-shirt" is a Star Trek reference?

Thanks for the detailed explanation, Mark. I was wondering what the "None of the above" comment for sol 23 was as well as that tidbit from the press release.

ElkGroveDan

Jun 19 2008, 01:33 PM

Good catch Gordan. More on red shirts here.

MahFL

Jun 19 2008, 01:57 PM

I loved the series Star Trek; you always waited with anticipation for when the person in the red shirt was going to die.

Cargo Cult

Jun 19 2008, 03:18 PM

QUOTE (Deimos @ Jun 19 2008, 02:04 AM)

How close a resemblance does this bear to MER file management? Well, just about none. There are files ... they are managed ... that's about it. Actually, on MER files are managed, on PHX APIDs are managed.

Out of (somewhat nerdy) interest, which operating system (if any) is Phoenix running? I'm sure I read an article somewhere about it being something other than VxWorks as used by the rovers, but I can't remember what exactly it was.

(For everyone else, there's an interesting article here about Spirit's problems - essentially the number of files on flash grew to require more memory than the filesystem module could allocate, forcing the system to reboot, only to try to mount that filesystem again...)

I had a weird sense of mental inversion last night, where Phoenix and friends stopped being space probes with computers inside them, to being computers with space probes built around them. All my laptop asks is - can it go to Mars too? ;-)

PaulM

Jun 19 2008, 04:25 PM

QUOTE (Cargo Cult @ Jun 19 2008, 04:18 PM)

MCAPLINGER very kindly gave me the following link about 2 weeks ago:

http://www.klabs.org/richcontent/MemoryCon...irit_mishap.htm

This article confirms that VxWorks is used by the Mars Rovers and also includes a link to the following more complete description of the MER flash problem:

http://www.klabs.org/mapld04/presentations..._costello_s.ppt

One thing that did surprise me when reading this article is that the Mars Rovers depend so much upon linked lists residing in heap memory. This is because RAM in spaceborne microprocessors is very susceptible to Single Event Upsets (SEU). Perhaps the 25 MHz RAD6000 Power PC in MER is less susceptible to SEUs than most microprocessors flown in space?

Single Event Upsets are explained here:

http://en.wikipedia.org/wiki/Single_event_upset

mcaplinger

Jun 19 2008, 04:51 PM

QUOTE (Cargo Cult @ Jun 19 2008, 08:18 AM)

Google is your friend. http://blogs.windriver.com/deliman/2008/05...ou-watch-i.html confirms that Phoenix uses VxWorks 5.2.

As for SEUs, the RAD6000 is not very subject to SEUs: http://www.baesystems.com/BAEProd/groups/p..._eis_sfrwre.pdf says 7.4e-10 errors/bit-day in 90% worst-case GEO. Of course, each system costs about a million dollars IIRC.

hendric

Jun 19 2008, 07:34 PM

QUOTE (Deimos @ Jun 18 2008, 06:04 PM)

The current problem is within an engineering APID, 40. The APID structure and use is similar to MPF, but there are some interesting nuances.

Let me see if I can translate this. You have a desk with multiple outboxes (APIDs). They are stacked one on top of another, with the top of the stack getting sent out first. The boxes are different sizes. Messages are put into the outboxes at the end of each day, depending on their future usefulness - RAC scoop image is not useful now, but priceless, so goes into a big bottom box. Panorama images of next worksite are useful now, but not priceless, so goes into a smaller top box; if the top box is already full, no big, just take another picture tomorrow. Now, when it comes time to send the stuff out, it makes sense to send the stuff needed tomorrow by a reliable sender, so it can be deleted sooner. The not-useful-now-but-priceless stuff can be sent by the (potentially! Don't want to offend any MRO telecoms people) less reliable sender, to be deleted once it has been acknowledged as received. On top of these are engineering data, mixed in with their own priority levels. Solar panel output is a need-to-know-tomorrow-but-dump-if-necessary data type, while temperature level is probably a not-real-important-but-keep-for-later data type.

So someone was stuffing APID 40 with data, 45,000 times. (I would agree that it's probably obvious who is stuffing the APID with data, and that someone is staying up nights trying to figure out why it's happening.) Meanwhile, any other data sharing that APID can be moved to point to another APID, and that APID's size limited.

Just seat-of-my-pants guessing, someone left a debug message on, and forgot to disable it in the flight software. Maybe something that triggers every 30 seconds, starting about 5 days ago from the 45000. Perhaps there is a thread checking the TEGA door every 30 seconds and within the packets there is a message:

0x43256432: TEGA door #1 did not deploy fully?

I jest, only because I have been there!

glennwsmith

Jun 19 2008, 10:45 PM

Hendric, your idea of a debug message as the cause of the anomaly is an excellent guess. We programmers have all been there. But it's one thing for debug messages to be scrolling harmlessly across a CRT screen, and quite another for them to be piling up as strings in flash memory!

lastof7

Jun 19 2008, 11:17 PM

QUOTE (mcaplinger @ Jun 19 2008, 12:51 PM)

Google is your friend. http://blogs.windriver.com/deliman/2008/05...ou-watch-i.html confirms that Phoenix uses VxWorks 5.2.

As a side note, a good example of what can sometimes make spacecraft software difficult. 5.2 was released around '95, I think? The RAD6000 can go up to at least 5.3.1, but VxWorks is now up to 6.6 or so. Newer boards such as the RAD750, the LEON3, etc. reach into the 6.x range, but you're still usually a few revs (along with the corresponding features and bug fixes) behind.

helvick

Jun 19 2008, 11:34 PM

45000 data items is a lot of data. Now you can generate that very quickly even on a slow [my terrestrial terms] cpu on Phoenix if something goes very badly wrong but let's assume for a moment that it is just something that is polled regularly that is responding all the time with an interesting data item rather than an expected "nothing to see here, move along" message.
~45k = 1 item every 30 seconds for 15 sols. What happened ~ 15 sols prior to the event being discovered on Sol 22? Sol 7 was the first dig IIRC , could this be something to do with that?

Alternatively if it was 1 item every second or so it would have started ~12 hours prior to the anomaly being properly noticed. That could mean that one of the Atmosphere experiments that was being carried out on Sol 21 triggered something - I'm reminded of the problem that led to Spirit getting stuck on the side of the Columbia Hills way back around Sol 300 or so - a navigation bug IIRC that had something to do with excessive tilt (or am I badly mis-remembering things).
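The two back-of-the-envelope figures above are easy to check. The short calculation below assumes a sol of roughly 88,775 seconds (24 h 39 min 35 s) and takes the 45,000 packet count from the interview; the two generation rates are just the guesses made in this post, and `elapsed_seconds` is a hypothetical helper.

```python
SOL_SECONDS = 88_775   # approximate length of one Mars solar day, in seconds
PACKETS = 45_000       # packet count quoted in the interview


def elapsed_seconds(packets, period_s):
    """Time needed to generate `packets` items at one item every `period_s` seconds."""
    return packets * period_s


# Guess 1: one packet every 30 seconds, expressed in sols.
sols_at_30s = elapsed_seconds(PACKETS, 30) / SOL_SECONDS

# Guess 2: one packet every second, expressed in hours.
hours_at_1s = elapsed_seconds(PACKETS, 1) / 3600

print(f"{sols_at_30s:.1f} sols at one packet per 30 s")
print(f"{hours_at_1s:.1f} hours at one packet per second")
```

Both results agree with the rough figures quoted: a little over 15 sols in the first case and 12.5 hours in the second.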

Glad to see that the recovery process has led to lots of science data being returned, nice silver cloud that.

And finally let me add my name to the list of those whose debugging code has gone haywire - in my case I ended up sending myself about 22K e-mail messages in a few minutes when I failed to realise that the machine I installed the alert on did not support the "sleep" command by default.

mcaplinger

Jun 20 2008, 03:25 AM

QUOTE (lastof7 @ Jun 19 2008, 03:17 PM)

...you're still usually a few revs (along with the corresponding features and bug fixes) behind.

The core set of VxWorks functionality is so small that I don't know that we're missing that much. Sometimes I'd be happier if they didn't keep "upgrading" things.

lastof7

Jun 20 2008, 03:53 AM

It's mostly the bug fixes along with the fact that it's difficult to find support (or people that are still using) versions of VxWorks that are that old. Although I have to say I miss Tornado (but, then again, I still use vi so what do I know!).

EDITED: Removed what was upon reflection an unfair comment about DOS file system.

Greg Hullender

Jun 20 2008, 03:28 PM

QUOTE (lastof7 @ Jun 19 2008, 08:53 PM)

It's mostly the bug fixes, particularly earlier versions of the DOS file system, along with the fact that it's difficult to find support (or people that are still using) versions of VxWorks that are that old.

Just so no one is confused, when you say "DOS" you don't mean the old Microsoft product (nor the even older IBM product). This is a product of Wind River, a well-known maker of "real-time" operating system software. Roughly, real-time just means any OS operation has a guaranteed time to completion. Things like space probes need that, but consumer products don't. It's hard to design, hard to build, and hard to program. (A friend of mine using a different real-time product once told me he'd decided it was called "real-time" because "you have a real time getting it to do anything!")

Microsoft never made a real-time operating system -- not even Windows CE, which I helped build. So whatever is wrong on Mars, it wasn't my fault! :-)

Seriously, I doubt this is Wind River's fault either. The accidentally-turned-on-debug-statement theory sounds plausible to me.

--Greg

Airbag

Jun 20 2008, 05:07 PM

One model of SUV has a badge on the back that states "real-time 4WD"; that always makes me laugh...

Airbag

lastof7

Jun 20 2008, 05:59 PM

QUOTE (Greg Hullender @ Jun 20 2008, 11:28 AM)

Seriously, I doubt this is Wind River's fault either. The accidentally-turned-on-debug-statement theory sounds plausible to me.

--Greg

Yup, I agree that it wasn't Wind River's fault nor did I intend to imply that. My point was more for the benefit of readers who may not know that many of these spacecraft run with older operating systems and older hardware. Software has bugs whether it be written by the application developer or by a vendor, it's a fact of life, and later versions of software tend to correct bugs in previous versions (with the hope of not introducing more bugs--which I've done more than once).

Reed

Jun 20 2008, 08:42 PM

QUOTE (Greg Hullender @ Jun 20 2008, 07:28 AM)

Drifting into OT computer trivia, the filesystem in question was FAT, which originated with MS/PC DOS and is referred to in vxWorks as the "DOS filesystem". Various flavors of FAT are very popular for all kinds of embedded applications.

A description of the spirit anomaly can be found here: http://trs-new.jpl.nasa.gov/dspace/bitstre...1/1/04-3354.pdf

Greg Hullender

Jun 21 2008, 11:24 PM

QUOTE (Reed @ Jun 20 2008, 12:42 PM)

the filesystem in question was FAT, which originated with MS/PC DOS

Interesting -- I wouldn't have guessed that Wind River would reverse-engineer the old FAT filesystem, but it makes a lot of sense when you think about it. That means you could pop their disks into an old floppy-disk drive and read them with plain old MS-DOS.

These days, I guess you'd want to use a USB drive or equivalent. Anyone know if that's actually the case?

--Greg (please tell me it doesn't just "text" you the data!)

imipak

Jun 22 2008, 07:57 PM

No need for mad RE skillz; just license it from MS.

PaulM

Jun 23 2008, 11:45 AM

QUOTE (lastof7 @ Jun 20 2008, 04:53 AM)

I talked to someone who used the VxWorks 5.2 flash based (DOS compatible) file system 10 years ago in a flight data recorder. They told me that they found a number of problems with that release of the file system and eventually obtained the source code to make some fixes themselves. They fixed file defragmentation code and speeded up file deletion code.

One thing that they found, and this is the reason for my post, was that the file system slowed down considerably when there were lots of files in any one directory. I can therefore see why storing 45,000 files in one directory might make Phoenix's software run very slowly. I imagine that NASA must currently be deleting files in Phoenix's file system, much as they did in Spirit's flash in order to work around the Spirit Sol 18 flash file system anomaly.

I can see an argument for moving on beyond VxWorks 5.2 to take advantage of a more robust flash file system in future NASA missions such as the MSL rover, although as was suggested software upgrades can bring their own problems.

elakdawalla

Jun 24 2008, 05:08 PM

An update from Barry Goldstein that I understand a little bit less than the first update. Discuss!

QUOTE ('Barry Goldstein')

It was a problem we'd identified a while ago and we were starting to work a fix for it. It was associated with when we saved when we go to sleep at night, the way we save the packet sequence numbers in the file system and what's supposed to happen is we're supposed to mask off the lower 12 bits, and what happened was we had identified that and had started working a patch to fix this, we knew the symptom, when it happened it would generate duplicate packet sequence numbers. We knew the system could operate that way but we were worried about what would happen, all the permutations. So what happened on sol 22, we actually had one of those issues occur where we basically generated duplicate sequence numbers. It just so happened that morning when we uploaded the sequence for that morning we included those same packet deletes, we do that every morning. And we deleted just enough packets such that because of the other problem we ended up having the file system configured where there were two consecutive packets with the same ID. If we hadn't sent up that exact number of packet deletes this wouldn't have happened. When we did that, we had an unintended consequence. It normally shouldn't happen, if we had corrected the masking issue it would not have happened, but when we ended up with two packets with the same sequence number, our team went to work looking at it, we found a bug in the code that generates packets that if that happens, you end up getting into an infinite loop generating the same packet ID. So as you recall we generated over 45,000 packets with the same sequence number, so because of the first bug we generated a condition where the second bug was exposed.

So the bottom line is, yesterday we completed the patch for the first problem and we uplinked (I believe) the patch to the system to get rid of the first bug. And we're going to have a discussion today to see if we're ready now to release the use of the flash back to the science team, because we've now eliminated the source of the problem. The consequence is still there until we finish the other patch, but it shouldn't happen now, so we'll have a discussion and make a decision on whether we want to release that or wait another couple of sols until we get the second patch uploaded.

--Emily

PaulM

Jun 24 2008, 06:29 PM

QUOTE (elakdawalla @ Jun 24 2008, 06:08 PM)

An update from Barry Goldstein that I understand a little bit less than the first update. Discuss!

--Emily

Masking the lowest 12 bits of the packet sequence number would cause all except the lowest 12 bits of the packet sequence number to be thrown away. The packet sequence numbers 0 and 4096 would both generate a new packet sequence number of 0. This is because decimal 4096 is binary one followed by twelve binary zeros. As a result of the masking operation, the one would be thrown away.

It might reasonably take 22 sols for Phoenix to transmit its first 4096 packets. After 22 sols following the "masking" operation, Phoenix would allocate packet 4096 a packet sequence number of 0 which would generate the first duplicated packet sequence number.
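The reading of "masking" given in this post (keep only the lowest 12 bits, discard the rest) can be shown in a few lines. This is only an illustration of that interpretation; the interview wording could also mean clearing those bits, and the names `MASK_12`, `masked` and `first_duplicate` are hypothetical rather than anything from the flight code.

```python
MASK_12 = 0xFFF  # binary 1111 1111 1111: keeps only the lowest 12 bits


def masked(seq):
    """Return the sequence number with everything above bit 11 thrown away."""
    return seq & MASK_12


def first_duplicate(sequence_numbers):
    """Return the first pair of distinct counters that collide after masking."""
    seen = {}
    for seq in sequence_numbers:
        key = masked(seq)
        if key in seen:
            return seen[key], seq
        seen[key] = seq
    return None


print(masked(0), masked(4095), masked(4096), masked(4097))
print(first_duplicate(range(5000)))
```

Counters 0 through 4095 are unchanged by the mask, while 4096 wraps back to 0, so the first collision is between packets 0 and 4096, as described above.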

I find fragments of information about space software problems both interesting and frustrating. It said on twitter that Phoenix's software is not Open Source. From my point of view I would like lander software to be Open Source. I am sure that there would be benefits to both NASA and ESA if Mars Rover software development was turned into an Open Source project. I think that EDL software might be the only software that needs to be classified.

djellison

Jun 24 2008, 06:40 PM

QUOTE (PaulM @ Jun 24 2008, 07:29 PM)

. From my point of view I would like lander software to be Open Source.

http://phoenix.lpl.arizona.edu/blogsPost.php?bID=51

http://phoenix.lpl.arizona.edu/blogsPost.php?bID=42
specifically : "Also, the MET team is not allowed access to commands that interface directly with the lander. This means they actually don't have access to the MET_ON and MET_OFF commands! Because those are the ones that interact directly with the lander!"

Not only is the lander software a highly lucrative commercial product of Wind River, much of the lander software falls under ITAR, to an obstructive degree.

Doug

mcaplinger

Jun 25 2008, 01:51 AM

QUOTE (djellison @ Jun 24 2008, 10:40 AM)

Not only is the lander software a highly lucrative commercial product of Wind River...

Strictly speaking, it isn't. The operating system is VxWorks. The stuff that runs the mission is essentially an application that runs on top of VxWorks and is not encumbered by Wind River (as far as I know -- IANAL.)

If somebody wants to write an open-source version of VxWorks, that'd be swell, and not all that hard, since it's a very simple system that basically only provides basic interrupt handling, a task model and intertask communication primitives. But even if that happened, I wouldn't count on the spacecraft-specific code being made available.

I don't think we need to get into a debate about the virtues of open source versus the alternatives on this forum.

AFAIK, MSL isn't using the DOS filesystem. For the cameras, I wrote my own filesystem (the cameras don't use an OS, the software runs on the bare metal.)

PaulM

Jun 25 2008, 11:31 AM

QUOTE (mcaplinger @ Jun 25 2008, 02:51 AM)

AFAIK, MSL isn't using the DOS filesystem. For the cameras, I wrote my own filesystem (the cameras don't use an OS, the software runs on the bare metal.)

I think that it is very sensible to replace the complex DOS compatible flash filesystem used in MER and Phoenix with multiple simpler flash file systems in MSL because this means that in MSL a single flash problem will not result in the loss of all data stored in flash. Presumably in MSL if 45,000 engineering data files were created by mistake then there would be no risk of losing photos because they will be stored in a separate flash chip on a different processor board.

I presume that MSL has more flash memory than MER which overcomes the need to use a single flash file store which would use flash as efficiently as possible? Am I correct in thinking that MER only uses one microprocessor but MSL uses many? I would be interested in finding out what processor is used by the MSL camera system?

I understand that each MER rover goes into safe mode around once each year because two VxWorks tasks write to a rover orientation record at the same time. I can understand that this bug has not been fixed in MER because it only occurs very infrequently. What I would be interested in finding out is whether this bug has been fixed in the MSL software build which I understand is derived from the software build loaded into MER about two years ago?

mcaplinger

Jul 11 2008, 03:50 AM

Presumably people who are interested in flight software read slashdot, but if not, this article was featured today: http://news.oreilly.com/2008/07/the-softwa...the-mars-p.html

nprev

Jul 11 2008, 04:22 AM

Interesting article; thanks for posting it, Mike!

Good question about the processors as well, oDoug. I'm always curious to see how qualification testing for improved hardware is progressing because it takes a REALLY long time. Most C-17 avionics boxes use 80386 processors to this day just because of that fact; space qualification must be an order of magnitude harder to achieve.
