Mike Deliman is the Mars guy at his office. He belongs to an elite fraternity of coders at Alameda, California-based Wind River Systems who wrote the operating system that runs, among other things, the two Mars Exploration Rovers, Spirit and Opportunity. Deliman also leads the team that wrote the operating system for the Mars Odyssey satellite and the previous Mars lander, Pathfinder, and its Sojourner rover.
In Deliman's office, connected to his Sun workstation by a couple of small wires, is a replica of the computer on board each of the rovers. Looking more like a beat-up Samsonite suitcase than the sophisticated brain of interplanetary probes, the device is the result of thousands of coder-hours and millions of dollars in development. Only six exist, and two are on Mars.
We talked to Deliman two days after NASA announced the Spirit rover had resumed normal operation. He discussed the challenges of writing software for a computer that is too far away for a technician to visit if anything goes wrong, the differences between his OS and the ones that run desktop PCs, and how he and NASA technicians revived the ailing Spirit.
Mars Rover Review
PC World: What can you tell us about what went wrong with Spirit?
Deliman: The short, sweet version is we ran out of RAM. When that happens, the system will suspend the task that's asking for more RAM when it runs out, and that causes a cascade effect.
There are hundreds of tasks running simultaneously on the rovers. One of the tasks has the job of writing to a board across the bus, and that board's whole purpose in life is to make sure the computer doesn't wedge (get stuck in a tight loop). If it doesn't get written to periodically, it resets the bus.
PCW: Is that what's called a dead man's switch?
Deliman: In this case it's a watchdog timer. The task that's supposed to write to it, to tell it to hold off, didn't get there in time, so it said, "Oh, computer's wedged. Reset." And the system brings itself back up, spawns these hundreds of tasks that have their things that they need to do, and reinitializes the file system.
The file system caches a lot of information in RAM to make things run faster, so it started rebuilding its structures in RAM and allocating memory, and it got to a point where it just ran out of memory again, and it happened all over.
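The watchdog behavior Deliman describes can be sketched in a few lines. This is an illustrative simulation only, not rover or VxWorks code; the class name `WatchdogTimer`, the tick-based clock, and the timeout value are all assumptions made for the example. A task "kicks" the timer on every tick while it is running; once the task is suspended, the kicks stop and the timer fires a reset each time the timeout elapses.

```python
class WatchdogTimer:
    """Simulated watchdog: fires a reset unless kicked within timeout_ticks."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.ticks_since_kick = 0
        self.resets = 0

    def kick(self):
        # The periodic "hold off" write from the monitoring task.
        self.ticks_since_kick = 0

    def tick(self):
        # Advance the simulated clock by one tick; return True if a reset fired.
        self.ticks_since_kick += 1
        if self.ticks_since_kick >= self.timeout_ticks:
            self.resets += 1
            self.ticks_since_kick = 0
            return True
        return False


def run(ticks, timeout_ticks, suspended_from=None):
    """Run a kicker task for `ticks` ticks; it stops kicking at `suspended_from`."""
    dog = WatchdogTimer(timeout_ticks)
    for t in range(ticks):
        task_running = suspended_from is None or t < suspended_from
        if task_running:
            dog.kick()
        dog.tick()
    return dog.resets


healthy_resets = run(ticks=20, timeout_ticks=5)
wedged_resets = run(ticks=20, timeout_ticks=5, suspended_from=10)
print(healthy_resets, wedged_resets)
```

In this simplified model the suspended task never recovers, so the timer keeps firing, which loosely mirrors the repeated resets described above.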
PCW: So that first day when it rebooted a bunch of times, that was what was going on.
Deliman: Yep. We have the ability to tell the watchdog timer, "let things go to sleep; it'll wake up tomorrow." When we were stuck in this loop, we weren't able to tell the watchdog timer "let it go to sleep."
We finally got to a point where the batteries were low enough to trigger a low battery alarm. When that happens, the rover actually comes up in a slightly different mode, one where we were able to get diagnostic information out. And that gave us the information that we needed to start looking into what caused the original failure.
PCW: How many times did Spirit reboot before the low battery alarm went off?
Deliman: The number that I heard was over 60 times.
PCW: We all heard about the reboots, including some reports that it was a flash memory problem. Why was flash memory accused of being the source?
Deliman: The flash bank is used to hold the entire file system: all the data, all the directories, all the files. The software that runs the file system caches some of the entries in RAM. In addition, the scientific data, photographs, and other files also get cached in RAM as they're being collected.
We only have 32MB of RAM for the operating system and applications. So, between the amount of files and directories and FAT information it was caching, some application asked for one block of RAM too many, and that was it. The rover is designed to rebuild this cache when the file system reinitializes, so it got into a loop.
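A rough sketch of why the reboots repeated: if the cache is rebuilt from the same flash contents on every boot, a boot that runs out of RAM once will run out of RAM every time. The 32MB figure comes from the interview; the base usage, the per-entry cache cost, and the function names below are hypothetical values chosen only to make the example run.

```python
RAM_BUDGET_BYTES = 32 * 1024 * 1024   # from the interview: 32MB of RAM
BASE_USAGE_BYTES = 20 * 1024 * 1024   # assumption: OS plus applications
BYTES_PER_CACHED_ENTRY = 4 * 1024     # assumption: cache cost per file or directory


def boot(entries_in_flash):
    """Rebuild the in-RAM file system cache; return True if it fits in RAM."""
    needed = BASE_USAGE_BYTES + entries_in_flash * BYTES_PER_CACHED_ENTRY
    return needed <= RAM_BUDGET_BYTES


def reboots_until_stable(entries_in_flash, max_reboots=10):
    """Count resets until a boot succeeds or max_reboots is reached."""
    reboots = 0
    while not boot(entries_in_flash) and reboots < max_reboots:
        # Nothing in flash changed, so the next boot fails the same way.
        reboots += 1
    return reboots


print(reboots_until_stable(3000))  # fits: no reboots
print(reboots_until_stable(4000))  # never fits: hits the max_reboots cap
```

With these made-up numbers the cache has room for 3,072 entries, so 3,000 entries boot cleanly while 4,000 entries loop until the cap.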
PCW: Why so little?
Deliman: RAM that actually can survive in space--an environment that's more harsh than your microwave oven--is tested, and it's abused, and it's very, very, very expensive.
The board I have has a 5MHz CPU with 32MB of RAM total, and it would probably cost around $300,000. The one on the rovers has a 20MHz CPU. It's a radiation-hardened RS 6000; we call it the Rad 6K. It's very similar to the IBM RS 6000 workstation from 1990.
PCW: 1990? That's old.
Deliman: Yeah. It takes a while to go through the process of radiation-hardening CPUs and memory and putting the boards all together and testing them and making sure they'll actually stand up to space.
PCW: What does radiation do to a non-hardened computer?
Deliman: While the rovers were en route to Mars last November, a big solar flare, the biggest ever recorded, bombarded the rovers with thousands of protons per square centimeter, hitting the surface of the flight shells and the solar panels and such. A high-energy proton will actually bore a hole right through silicon, almost like it's getting sandblasted, and then interact with something, and become a different form of radiation. It could hit a wire and actually cause a transient charge, causing fake signals within the computers.
Back in 1997, when we were waiting for Pathfinder to finish its flight to Mars, there was a big set of solar flares. One company had just launched a multi-billion dollar communications satellite into high Earth orbit. They were relying on the Earth's magnetic field to be stable and protect the satellite from solar flares, but when that flare hit, the magnetic field actually collapsed--part of how the Earth protects us--and exposed that brand-new satellite to the harsh flare conditions. It fried the satellite. They lost it completely.
PCW: Do you know what the scientists were doing when the rover malfunctioned?
Deliman: Some of the things that they do are characterizing current draw for various motors at different times of day and night. They were doing that when the first failure happened, changing current supply for a motor that flips a mirror that protects the thermal emission spectrometer, the mini-TES. It could be that they opened a new file to store a new set of results for this mirror operation, and caching that file may have been the straw that broke the camel's back.
PCW: Did you get the opportunity to test the operating system and all of the various inputs that would be coming into the rover before the thing landed?
Deliman: JPL ran tests for over nine days in the lab, but this happened on sol 18, after 18 Martian days on the surface.
PCW: It's double the testing time.
Deliman: Yes, and testing time is expensive: You take systems away from the engineers, who are working on getting the stuff actually built to fly, to run the tests. These boards cost a third of a million dollars, so you can't have ten dedicated to testing and ten of them dedicated to development. You've got two of them in flight and maybe another four of them on the ground, being used for testing and development.
PCW: So now that you know what caused the problem, did you update the rover's software to keep it from overloading its RAM?
Deliman: We removed a bunch of old files from flash memory. That allowed the system to re-initialize and not exhaust its RAM. So what we've done is basically determine an operational constraint: Here's how many files and directories we can have in our cache before there's a problem.
Now I'm working on characterizing the exact amounts of memory that are used for various situations, so we can characterize the exact nature of the problem: This many directories, this many files, this kind of a structure, this big a [file allocation table] will take up exactly this much memory.
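The kind of characterization Deliman describes amounts to a memory model: so many bytes per directory, per file, and per file allocation table entry, checked against a budget. The sketch below shows one way such a model could look. Every number and name in it (`CacheCosts`, the per-item byte costs, the 1MB budget) is invented for illustration and is not a measured rover figure.

```python
from dataclasses import dataclass


@dataclass
class CacheCosts:
    """Hypothetical RAM cost, in bytes, of caching each kind of item."""
    per_directory: int
    per_file: int
    per_fat_entry: int


def cache_bytes(costs, directories, files, fat_entries):
    """Estimate total cache size with a simple linear model."""
    return (directories * costs.per_directory
            + files * costs.per_file
            + fat_entries * costs.per_fat_entry)


def fits(costs, directories, files, fat_entries, budget_bytes):
    """Check a proposed file system layout against the operational constraint."""
    return cache_bytes(costs, directories, files, fat_entries) <= budget_bytes


costs = CacheCosts(per_directory=512, per_file=64, per_fat_entry=4)
budget = 1024 * 1024  # assumption: 1MB set aside for the cache

print(cache_bytes(costs, 100, 10000, 65536))
print(fits(costs, 100, 10000, 65536, budget))
print(fits(costs, 100, 12000, 65536, budget))
```

A real characterization would presumably replace the linear model with measured values, but the check against a fixed budget would have the same shape.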
PCW: What's a real-time operating system, and how is it different from what runs a desktop computer?
Deliman: PCs use what's called a process model, where you have groups of processes running. Some are related to each other. They have protected memory spaces so nobody can walk over their little chunk of memory.
Linux, UNIX, Windows, they all like to have a big gob of virtual memory to swap stuff in and out, and take their time about handling things. They can store things in virtual memory on the hard disk, and bring it back as they need it.
On the rovers, everything runs in RAM, all the time, because it's faster. A real-time OS has to react to things as they happen in real life; there's no putting off the moment. You have to keep everything as small as possible, so it's fast enough and can all fit into the RAM you have. Our entire operating system fits in 4MB of RAM. Nothing gets swapped to a hard disk; it would be too slow.
Spacecraft have to be able to react to things as they occur. If something happens that the system needs to react to right now, the code has to already be there. Time is of the essence. It's like a pacemaker or an antilock brake system; you can't swap out the next heartbeat.
And, in fact, our software goes into antilock brake systems and pacemakers. We also write the software that controls transmissions; flight navigation systems in commercial aircraft; rockets; digital cameras; medical scanners that have to be able to turn on and off sources of radiation, for instance, at a moment's notice; high-energy colliders; Internet routers and switches; and telecom equipment. You don't want your telephone to pause for five minutes every time you pick it up.
PCW: Why aren't more operating systems built like this?
Deliman: There's no way you could have all of Windows and all your applications in RAM. Windows has to push stuff off into virtual memory on the hard drive, and that slows things down.
PCW: So a desktop operating system couldn't handle the performance requirements?
Deliman: Correct. Our operating system, VxWorks, hails from the old days, when you had a 68000-based processor, and a megabyte of RAM that cost $10,000, and you had to make the most out of what you had.
PCW: Clean code.
Deliman: Clean was the big thing. It comes back from the days when you didn't have malloc and free.
PCW: Malloc and free?
Deliman: Malloc is a function call that allows you to allocate RAM from the computer system, and free gives it back. You used to have to design the memory you were using into your application.
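To illustrate the contrast between allocating memory on demand and designing it in up front, here is a minimal sketch of a fixed-size buffer pool. It is written in Python purely for readability; real embedded code would typically be in C, and the `FixedPool` class and its methods are hypothetical names, not part of VxWorks or any rover software. All buffers are created once at start-up, so running out is an explicit, predictable condition rather than a surprise deep inside some task.

```python
class FixedPool:
    """A pool of buffers created once at start-up; nothing is allocated later."""

    def __init__(self, block_count, block_size):
        self._blocks = [bytearray(block_size) for _ in range(block_count)]
        self._free = list(range(block_count))

    def acquire(self):
        # Return (index, buffer), or None when every block is in use.
        if not self._free:
            return None
        index = self._free.pop()
        return index, self._blocks[index]

    def release(self, index):
        # Hand a block back to the pool so another caller can reuse it.
        self._free.append(index)

    @property
    def free_count(self):
        return len(self._free)


pool = FixedPool(block_count=2, block_size=16)
first = pool.acquire()
second = pool.acquire()
third = pool.acquire()          # pool is exhausted at this point
pool.release(first[0])
print(third, pool.free_count)
```

The caller has to handle the `None` case itself, which is the point: the memory limit is part of the design instead of something discovered at run time.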
We're in the Spitzer Space Telescope, also called Space Infrared Telescope Facility. We're involved with the Stardust Comet Probe bringing back bits of Comet Wild 2. And there's a satellite in orbit called SORCE sampling the solar wind. There's a lot of other stuff up there that's flying VxWorks. We're in Mars Odyssey 2001, one of the orbiters.
PCW: What can you tell us about your test bed?
Deliman: This pretty much featureless thing with a power light on it is where the RAD 6000 computer lives, with the flash file system board. This is an Ethernet board that is used only in tests; there's no Ethernet in space.
PCW: There's no Ethernet in space?
Deliman: There's no World Wide Web in space. Wires don't reach that far.