Adding checkpointing API to Linux kernel

Werner Krebs (werner.krebs@yale.edu)
Mon, 18 Jan 1999 15:55:02 -0500


I like your comments, and am forwarding them to the linux-kernel mailing
list.

I'm the maintainer of GNU Queue, a distributed batch and interactive job
load-balancing system. The homepage for GNU Queue is
http://bioinfo.mbb.yale.edu/~wkrebs/queue.html (frequently updated) and
http://www.gnu.org/software/queue (official).

Andy Glew writes about the need for an NT-like (gasp!) API in the
GNU/Linux that would all trapping of OS system calls without recompiling
code. (For use in checkpoint migration facilities, which we are
considering adding to GNU Queue.) Recompilation is necessary to use
these facilities with commercial apps. The scary thing is it appears to
be easier in NT than in GNU/Linux (see Andy's comments). Given NT's
other "features" (lack of true simultaneous multiuser capaibilities), it
would see that GNU/Linux and UNIX are otherwise ideal systems for this
type of distributed application, so this is not a good thing, IMHO.

Could anyone develop such an API?

Andy Glew (glew@cs.wisc.edu)
> I'm not in the Condor group. I'm just a user. I recommend that you contact
> the Condor folk directly, off their web page.
>
> By the way, rumour has it that the Condor source code has been "sold"
> to the NCSA - the National Center for Supercomputer Applications, you know,
> the people who paid for Mosaic. Since the NCSA is not a company, maybe
> they can be approached wrt public use.
>
> Other comments:
>
> I frequently use SPSS, and less frequently SAS, so I know of what you speak.
> You probably saw that Condor has the option of running "vanilla UNIX" jobs,
> without the remote UNIX system calls, and without checkpointing, without
> recompilation. On UNIX, checkpointing requires recompilation. My jobs
> usually run using AFS filesystem access on the local machine, and checkpointing.
>
> Recent Condor work has involved binary editting to insert the Condor code,
> which would five you a way of getting checkpoints and migration without recompiling.
> A student presented a class project in this last year.
>
> Most annoyingly, it looks like the Condor facilities can be obtained most easily
> without recompilation for the ongoing Condor port of NT. It turns out that Windows
> has an API to allow editing of any interface between modules that is exported as a
> DLL interface. So transparency is easier on NT than on UNIX. Providing
> such an API for LINUX might be enough to help jumpstart LINUX applications.
>
>
>
> Werner Krebs wrote:
>
> > Yes, I only recently looked into Condor (I thought it was commercial).
> >
> > With the source code being restricted, it is certainly not ``free'' software in the usual sense of the term.
> >
> > Queue and Condor have different origins. Queue was originally written to migrate SAS and Splus jobs, where checkpoint migration isn't an option. These are large statistical jobs
> which not only use fork and read and write the same files, but create huge temporary files that they need to re-read over and over again. You need a large scratch disk on each
> machine for these to run efficiently, so you simply can't trap all system calls and send them over the network, you need to have both fast local disk space ('/usr/tmp') and global
> disk space (NFS or AFS filesystem) that are distinguished by pathnames. An efficient, reliable, cached network filesystem is also a plus to reduce network traffic as these jobs are
> both I/O and CPU bound at different times.
> >
> > This, in turn, means that once the program is running on machine, migrating it to another machine is basically impossible.
> >
> > However, it turns out Queue is similar to an expensive commercial product, LSF, which apparently uses old Condor code (the network host advertisement code). We're planning to
> reimplement something like this (probably radically different as I have to even read the Condor manual) using a PostgreSQL server to take information from clients. But, it would be
> much simpler if we could existing free code and go along with an existing standard. (Adding old checkpoint migration from Condor to Queue wouldn't be bad, either.)
> >
> > At one point, Condor published its source code and allowed free re-distribution of it; I was wondering it would be possible for Queue to use some of that code. We're GNU, so to
> actually distribute the code (rather than just the hooks), RMS would probably want the code to be licensed under the GPL license, which is a much stricter license than what it
> currently is licensed under (no use of the code in commercial software products without consent of Copyright holders).
> >
> > Since I notice you're at Wisconsin, I was wondering if something like this could be arranged (using an old, widely published version of the Condor code in Queue?) The Condor folks
> seem to have removed all their old source code from the net.
> >
> > Thanks.

[... snip for brevity]

>
> --
> ---
> Andy "Krazy" Glew, glew@cs.wisc.edu, UW Madison and Intel.
> DISCLAIMER: private posting, not representative of university or Intel.
> Please respond by email in addition to replying to newsgroup.
>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/