Re: process checkpointing

Harvey J. Stein (hjstein@bfr.co.il)
16 Dec 1998 14:51:19 +0200


Oren Laadan <orenl@cs.huji.ac.il> writes:
> Chris Arguin <carguin@iname.com> writes:
> > Guest <guest@manjak.knm.org.pl> writes:
> > > are there any patches to support process checkpointing on x86 ?
> >
> > I'm currently working on this, but I have virtually nothing written yet.
> > Since you mention it, I'll take this opportunity to seek input from the
> > rest of the community.
>
> Make sure you don't reinvent the wheel...
>
> For example, take a look at:
>
> http://warp.dcs.st-and.ac.uk/warp/systems/checkpoint/
>
> Also, check out the CoCheck project and Condor at:
>
> http://wwwbode.informatik.tu-muenchen.de/Par/tools/Projects/CoCheck.html
> http://www.cs.wisc.edu/condor/
>
> I believe some of these already work on Linux, or would probably require
> only a small porting effort.
>
> BTW, checkpointing of a single process (unlike a bunch of collaborating
> processes) can evidently be done (see above) entirely in user-space.

It can be, but there's a substantial problem with mmap. You have to
make sure that everything that was mmapped gets remapped at the same
address when the process is restarted. Condor does this by dumping
all mmapped segments into the checkpoint and mapping them back in
from that dump on restart. Mmaps can come from malloc (depending on
the implementation), but the bigger headache is that shared libs are
mmapped in, meaning that a) all shared libs used by the process are
dumped, and b) after restart, the process is effectively statically
linked, since it's now using its own private copies of the shared
libs.
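
To make the dump-and-restore idea concrete, here is a rough sketch of
what it amounts to for a single segment. This is not Condor's code
(I haven't seen it); save_segment() and restore_segment() are names I
made up, and I'm assuming a Linux-style mmap with MAP_ANONYMOUS and
MAP_FIXED. Note that MAP_FIXED silently replaces whatever is already
mapped in that range, so real restart code has to know the range is
free.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes starting at addr to the file
 * at path. Returns 0 on success, -1 on failure. */
int save_segment(const void *addr, size_t len, const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    const char *p = addr;
    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }
    return close(fd);
}

/* Hypothetical helper: recreate an anonymous mapping at exactly addr
 * and refill it from the dump at path. Returns addr on success,
 * MAP_FAILED on failure. */
void *restore_segment(void *addr, size_t len, const char *path)
{
    assert(len > 0);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return MAP_FAILED;

    /* MAP_FIXED forces the kernel to place the mapping at addr. */
    void *p = mmap(addr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return MAP_FAILED;
    }

    size_t done = 0;
    while (done < len) {
        ssize_t n = read(fd, (char *)p + done, len - done);
        if (n <= 0) {           /* short dump or read error */
            close(fd);
            munmap(p, len);
            return MAP_FAILED;
        }
        done += (size_t)n;
    }
    close(fd);
    return p;
}
```

A real checkpointer would do this for every mapped segment and would
also restore the original protection bits; the sketch always maps
read/write.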

It'd be nice if you could remap the original libs instead of dumping
them. That could substantially speed up dumping and restarting, and
reduce the memory used by the restarted binaries (they'd share the
libs' pages again), especially if there are lots of them. The Condor
people argued against this, but their argument only applies when you
want to migrate checkpointed apps across machines (systems might have
different versions of the libs, or might have them in different
locations, etc.). It doesn't really apply if you only want to
checkpoint and later restart on the same machine.
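
Remapping the original file instead would look something like the
sketch below. Again, this is only an illustration under assumptions:
struct file_mapping and remap_file_mapping() are hypothetical, and it
presumes the checkpoint recorded the path, file offset, address,
length and protection of each file-backed mapping, and that the file
at that path hasn't changed since the checkpoint was taken.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical checkpoint record for one file-backed mapping. */
struct file_mapping {
    void       *addr;    /* address the segment was mapped at */
    size_t      len;     /* length of the mapping in bytes */
    int         prot;    /* PROT_* flags in effect at checkpoint */
    off_t       offset;  /* page-aligned offset into the file */
    const char *path;    /* file that backed the mapping */
};

/* Map the original file back in at the recorded address instead of
 * restoring a dumped copy. Returns the address or MAP_FAILED. */
void *remap_file_mapping(const struct file_mapping *m)
{
    assert(m != NULL && m->path != NULL);

    int fd = open(m->path, O_RDONLY);
    if (fd < 0)
        return MAP_FAILED;

    /* MAP_FIXED puts the mapping at m->addr, replacing anything
     * already mapped there. */
    void *p = mmap(m->addr, m->len, m->prot,
                   MAP_PRIVATE | MAP_FIXED, fd, m->offset);

    close(fd);  /* the mapping stays valid after the fd is closed */
    return p;
}
```

Pages that the process had modified (a lib's data segment, for
example) would still have to be dumped and written back over the
fresh mapping; only the untouched pages come straight from the file.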

Another problem with the Condor stuff is that they don't distribute
the source code (at least the last time I checked).

-- 
Harvey J. Stein
BFM Financial Research
hjstein@bfr.co.il
