QUOTE (mcaplinger @ Dec 5 2014, 10:29 AM)
If you had bought a commercial flash card in 2002 and used it on Earth as heavily as the flash on MER has been used, it would most likely be non-functional by now.
Testing how the flash degrades over time and validating one's flash management code when, say, half the flash blocks have failed, is quite a job. I wouldn't have done it for the short MER mission, so it's great that it's worked as well as it has.
That said, I'd be curious to know the root cause of the problems so they can be mitigated in the future if possible.
I'm not sure about that. Write frequency per physical wordline might still be higher in Earth use, even with seemingly lower overall utilization. And if it were simply a high-utilization issue, I'd expect it to be holding up surprisingly well, because the whole flash architecture is geared toward tolerating a statistical tail of a few failing bits and continuing to function, and that sort of design leaves a lot of real margin. I just don't expect the flash to be the crappiest chip on the board if being old is the only hurdle to jump.
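Purely to illustrate that point, a toy back-of-envelope (every number below is invented; none of it reflects actual MER hardware or usage) showing why "utilization" alone doesn't tell you per-wordline stress:

CODE
/* Toy arithmetic only: all figures are invented for illustration and say
 * nothing about the actual MER flash, its size, or its usage.             */
#include <stdio.h>

int main(void) {
    double total_writes = 1.0e9;   /* hypothetical lifetime sector writes     */
    double blocks       = 2048.0;  /* hypothetical erase blocks in the device */
    double hot_fraction = 0.05;    /* fraction of blocks a poor leveler favors*/

    /* Perfect wear leveling spreads wear evenly; a hot spot concentrates it. */
    printf("cycles per block, perfect leveling: %.0f\n", total_writes / blocks);
    printf("cycles per block, 5%% hot spot:      %.0f\n",
           total_writes / (blocks * hot_fraction));
    return 0;
}

Same total number of writes, a 20x difference in per-block stress, which is why I'd want to see per-wordline numbers before blaming raw wear.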
I once participated in what was basically a design review of a flash memory targeted at a 130 nm RF process, and the whole system was designed around the assumption that the flash cells alone are terribly unreliable, in that they fail far too often. They over-test, anneal, repair, and throw out bad cells found at test (they may have had extra wordlines in each sector and an entire extra sector to repair with), then throw ECC on top of memories that test perfect, assuming that some outlier single bits will continue to fail throughout the life of the memory, so the ECC *should* be able to repair everything statistically expected over the spec life of the part.
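For anyone who hasn't poked at that layer: the ECC piece is conceptually just a single-error-correcting code over each stored word. Here's a minimal Hamming-style sketch (purely illustrative; it's not the code layout that part used, and real flash ECC works over much wider words and often corrects multiple bits):

CODE
/* Minimal single-error-correcting (Hamming) sketch: 8 data bits + 4 parity
 * bits in a 12-bit codeword. The principle is what matters: an outlier
 * single-bit failure is repaired transparently on read.                   */
#include <stdint.h>
#include <stdio.h>

/* Data bits live at the non-power-of-two codeword positions (1-based). */
static const int data_pos[8] = {3, 5, 6, 7, 9, 10, 11, 12};

static uint16_t encode(uint8_t data) {
    uint16_t cw = 0;
    for (int i = 0; i < 8; i++)
        if (data & (1u << i)) cw |= 1u << (data_pos[i] - 1);
    /* Parity bit p sits at position 2^p and covers every position whose
     * index has that bit set; set it so each covered group has even parity. */
    for (int p = 0; p < 4; p++) {
        int parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if ((pos & (1 << p)) && (cw & (1u << (pos - 1)))) parity ^= 1;
        if (parity) cw |= 1u << ((1 << p) - 1);
    }
    return cw;
}

static uint8_t decode(uint16_t cw) {
    /* XOR of the positions of all set bits is 0 for a clean codeword and
     * equals the position of the flipped bit for any single-bit error.    */
    int syndrome = 0;
    for (int pos = 1; pos <= 12; pos++)
        if (cw & (1u << (pos - 1))) syndrome ^= pos;
    if (syndrome >= 1 && syndrome <= 12)
        cw ^= 1u << (syndrome - 1);        /* correct the single bad bit   */
    uint8_t data = 0;
    for (int i = 0; i < 8; i++)
        if (cw & (1u << (data_pos[i] - 1))) data |= (uint8_t)(1u << i);
    return data;
}

int main(void) {
    uint16_t stored = encode(0xA7);
    stored ^= 1u << 6;                     /* one cell quietly fails        */
    printf("read back 0x%02X (wrote 0xA7)\n", decode(stored));
    return 0;
}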
And for what it's worth, the physical addresses of the sectors have no correspondence to the logical addresses of those sectors, even after test. Internally there's an entire spare sector so that any sector can be marked bad, and beyond that the logical-to-physical mapping ping-pongs around under some sort of write-balancing scheme right inside the memory, on top of whatever the firmware *thinks* it's doing... and I've always wondered how well the software write balancing understands what the circuit designers put in. So if they're trying to debug some sort of problem with a physical address (and they are) and have to trace through at least two tiers of logical-to-physical address obfuscation... yuck. That said, there's enough overhead in the reliability design of the actual memory cells that if this truly is a bad-address problem, there should still be a fat part of the distribution working fine.
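To make the remapping tiers concrete, here's a toy sketch of the kind of thing I mean: a logical-to-physical map, a sector marked bad at test and spared out, and a least-worn-first rewrite policy that makes a heavily rewritten logical sector wander around physically. Every size and policy here is made up; this is not MER's flash management or any real part's internal controller.

CODE
/* Toy sketch of two remapping tiers: spare-sector repair plus a simple
 * least-worn-first rewrite policy. All sizes and policies are invented.   */
#include <stdio.h>

#define LOGICAL_SECTORS  8
#define PHYSICAL_SECTORS 10              /* a couple of internal spares     */

static int l2p[LOGICAL_SECTORS];            /* logical -> physical mapping  */
static int erase_count[PHYSICAL_SECTORS];   /* wear per physical sector     */
static int bad[PHYSICAL_SECTORS];           /* sectors thrown out at test   */
static int in_use[PHYSICAL_SECTORS];        /* currently mapped sectors     */

/* Pick the least-worn physical sector that is neither bad nor mapped. */
static int pick_least_worn(void) {
    int best = -1;
    for (int p = 0; p < PHYSICAL_SECTORS; p++)
        if (!bad[p] && !in_use[p] &&
            (best < 0 || erase_count[p] < erase_count[best]))
            best = p;
    return best;                             /* -1 means nothing left       */
}

/* Rewriting a logical sector frees its old home and lands the new copy in
 * the least-worn free sector, so hot logical addresses wander physically.  */
static int rewrite_sector(int logical) {
    int old = l2p[logical];
    in_use[old] = 0;
    int next = pick_least_worn();
    if (next < 0) { in_use[old] = 1; return -1; }
    in_use[next] = 1;
    erase_count[next]++;
    l2p[logical] = next;
    return next;
}

int main(void) {
    for (int l = 0; l < LOGICAL_SECTORS; l++) { l2p[l] = l; in_use[l] = 1; }

    bad[3] = 1;                 /* pretend physical sector 3 failed at test */
    in_use[3] = 0;
    rewrite_sector(3);          /* repair: logical 3 moves onto a spare     */

    for (int i = 0; i < 6; i++) /* hammer one logical sector                */
        rewrite_sector(0);

    for (int l = 0; l < LOGICAL_SECTORS; l++)
        printf("logical %d -> physical %d (erases %d)\n",
               l, l2p[l], erase_count[l2p[l]]);
    return 0;
}

The point being: by the time an address reaches the outside world it has already been bounced around at least once inside the chip, before the firmware's own mapping even enters the picture.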
So *I* guess they're probably dealing with something other than a bad-address problem, and that's a pity, because the bits are still working but you can't write or read them thanks to a problem in the pipe. I think some people jump to the conclusion that because it's flash, and flash cells truly have a limited life, this has to be that sort of issue. My comment is that if it truly were that sort of issue, the memory shouldn't be as crippled as recently described. It should be in the tail of the distribution of single-bit failures, not wholesale unusability, even more than 10 years into the chip's life.
The flip side is that the radiation environment simply shifted the bell curve, and they're now into the fat part of the bit-failure distribution that the chip's designers would have expected sometime in the far future, and the overhead is all gone. *That* would be important to know thoroughly.