Re: Some questions about linux kernel.

From: Jesse Pollard (pollard@cats-chateau.net)
Date: Sun Mar 19 2000 - 16:50:07 EST


On Sun, 19 Mar 2000, James Sutherland wrote:
>On Sun, 19 Mar 2000 13:00:03 -0600, you wrote:
>>On Sat, 18 Mar 2000, Horst von Brand wrote:
>>>Jesse Pollard <pollard@cats-chateau.net> said:
>>>
>>>[...]
>>>
>>>> It makes it better. The specific process causing the problem is now
>>>> identified.
>>>
>>>The only thing you identify this way is the first unsuspecting victim which
>>>happens to step into your trap. The process itself may very well be
>>>innocent, it just finds itself in a position where making a legitimate
>>>request at the wrong moment gets it shot. It depends on all the other stuff
>>>the system is doing at the time: Running NS on an idle, 32Mb machine will
>>>work; if it is also compiling a kernel and rebuilding XFree86 ATM,
>>>something will have to give. Who is at fault? NS isn't, it is doing its
>>>job. Neither are the make, sh, or gcc and whatnot processes doing anything
>>>illegal. Note that all this is completely unrelated to user quotas and
>>>such: User quotas just shift the problem to the individual users, some of
>>>whose innocent processes get killed because of the other things they are
>>>doing. It can be argued that the user is being held responsible this way,
>>>and this is in a sense true. But the process in itself is doing nothing
>>>wrong.
>>
>>Except that the user IS exceeding what was made available via quotas.
>
>Yes - so what? All you are doing is lowering this limit. Someone uses
>more memory than they should, so one or more of their processes dies.
>Nothing whatsoever to do with overcommit.
>
>>>This is one of the reasons why OOM handling is hard: Very often there _is_
>>>no single culprit that can be blamed and clearly deserves to be shot, there
>>>is just a bunch of innocent people some of whom you have to kill or else
>>>all are dead.
>>
>>This is why resource allocation and quotas are important in multiuser systems.
>
>So it would be really nice if we HAD those quotas...
>
>>>You also try to protect the system against malicious (or runaway)
>>>processes. To protect against a runaway memory leak (or dumb mallocbomb)
>>>is rather easy, essentially you shoot the largest/fastest growing process.
>>>Protection against malicious processes is much, much harder, as they _will_
>>>take advantage of blind spots in the detection scheme.
>>
>>Unfortunately, that may not be the right one. The "mallocbomb" may not be
>>growing very fast - but the web server might be as compared to the bomb.
>>Or it might be telnetd/login starting a new session. Or cron starting
>>a new job. Or inetd starting a new ftpd. Or init trying to restart a
>>getty on the console which just logged out...
>
>Occasional friendly fire is inevitable in emergency situations. I'd
>still rather have one specific process killed (and logged) than have
>random process deaths.
>
>>>Then there is the fact that this is (or should be, at least for a
>>>well-speced machine) a very rare event, so very little (if any) effort
>>>during normal operation is warranted. This includes the OS and machine, and
>>>even much more importantly the sysadmin.
>>
>>It cannot be spec'ed without having something to measure against. This is
>>the other place resource quotas come in.
>>
>>This can be compared to disk quotas. Nobody will really fault the use
>>of disk quotas.
>>1. How many file systems are overcommitted?
>Totally different type of "overcommit".

Not completely - it is a difference of degree. Here the unit is users
instead of processes. Users request a certain amount of disk space, and
the quota controls prevent each user from exceeding it.
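
To make the disk analogy concrete, here is a minimal sketch of quota
enforcement on an overcommitted filesystem. The Filesystem class, the user
names, and the sizes (in MB) are all hypothetical and only illustrate the
accounting; this is not any real quota API.

```python
# Toy model of disk quotas. All names and numbers are illustrative.
class Filesystem:
    def __init__(self, capacity, quotas):
        self.capacity = capacity                  # real usable space
        self.quotas = dict(quotas)                # per-user limit
        self.used = {user: 0 for user in quotas}  # per-user usage

    def is_overcommitted(self):
        # Overcommitted: the sum of all quotas exceeds the real capacity.
        return sum(self.quotas.values()) > self.capacity

    def write(self, user, size):
        if self.used[user] + size > self.quotas[user]:
            return "quota exceeded"    # the user went past their own limit
        if sum(self.used.values()) + size > self.capacity:
            return "filesystem full"   # nobody exceeded a quota, yet it fails
        self.used[user] += size
        return "ok"


fs = Filesystem(capacity=100, quotas={"alice": 60, "bob": 60})
results = [
    fs.write("alice", 55),  # within quota and capacity
    fs.write("bob", 50),    # within bob's quota, but the disk is out of space
    fs.write("alice", 10),  # would push alice past her own quota
]
print(fs.is_overcommitted(), results)
```

The second write is the interesting one: bob is inside his quota and still
fails, which can only happen because the quotas were overcommitted.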

>>2. Whose decision was that?
>The FS designers, in this sense.

NOPE - the management that directed it to be overcommitted, if quota is
supported.

>
>>3. Who detects the failure?
>Whoever gets their FS hosed by optimistic code.

Almost - whoever gets their data lost because there is no space. If quotas
are available then it is the user. If quotas are not available then the
OS gets blamed.

>
>>4. Who gets the blame when it occurs?
>The programmer.

NOPE - the administration, if it is overcommitted; the OS, if quotas are
not available.

>>If filesystems are not overcommitted, then the total of all legal users'
>>quotas will equal the size of available data space (meta-data included as
>>"data" here). An overcommitted file system is one where the total of
>>all users' quotas exceeds the size of the filesystem.
>Totally different use of "overcommit".
>
>>Filesystem failure doesn't occur until enough users use up the space.
>>No single user may have even reached their limit. But random write
>>failures occur, databases may become corrupt, and a lot of blame
>>is going to be assigned.
>The type of overcommit we are talking about (i.e. VM), the closest
>analogy is a sparse file. A process requests a 100Mb file, and gets a
>100Mb sparse file.
>
>>Answers with quota controls:
>> #2: Overcommitting is a management choice.
>> #3: the first user that fills the filesystem without using up their quota
>> #4: Management.
>>
>>Without quotas being available:
>> #2: the analyst that selected the system
>> #3: the first user that fills the filesystem
>> #4: the operating system
>>
>>Do I use quotas? YES.
>>Do I have overcommit? YES - management said to do so.
>>Does the system or analyst get blamed for failure? NO.
>>
>>I need the same controls on Linux servers that I have on other servers.
>>
>>I can be a lot more lax on restrictions where the system is dedicated
>>to a single user than I can where it is a multiuser server.
>
>You are misunderstanding what we mean by "overcommit". I am talking
>about allocating a process memory when it USES it, rather than when it
>REQUESTS it. This has nothing to do with how (or indeed if) I set any
>quota facilities I may (or on Linux, may NOT) have available.

NOPE - if the process requests it and is granted it, then it should have
access to it. It is not up to the system to say "here it is, but don't use
that part of it, because I really didn't give it".

We are talking about the sum of all concurrent requests, and the system
aborting a process when part of what it was already granted turns out not
to be available.
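
The difference between the two policies can be shown with a toy accounting
model. This is only a sketch under stated assumptions: the Memory class, the
page counts, and the method names are made up for illustration and do not
describe how the Linux VM is actually implemented.

```python
# Toy model of memory accounting. All names and numbers are illustrative.
class Memory:
    def __init__(self, total_pages, overcommit):
        self.total = total_pages
        self.overcommit = overcommit
        self.reserved = 0  # pages promised to processes at request time
        self.touched = 0   # pages actually backed because they were used

    def request(self, pages):
        # Strict accounting refuses a promise it could not keep.
        if not self.overcommit and self.reserved + pages > self.total:
            return False
        self.reserved += pages
        return True

    def touch(self, pages):
        # With overcommit, the failure only shows up here, at use time,
        # long after the request was reported as granted.
        if self.touched + pages > self.total:
            raise MemoryError("granted memory is not actually available")
        self.touched += pages


strict = Memory(100, overcommit=False)
strict_grants = [strict.request(80), strict.request(80)]  # second is refused

lazy = Memory(100, overcommit=True)
lazy_grants = [lazy.request(80), lazy.request(80)]        # both are "granted"
lazy.touch(80)                                            # first user is fine
try:
    lazy.touch(80)                                        # second user is not
    late_failure = False
except MemoryError:
    late_failure = True

print(strict_grants, lazy_grants, late_failure)
```

In the strict case the second requester gets an error it can handle at
request time. In the overcommit case both requests succeed and the failure
lands on whichever process happens to use its memory second.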

-------------------------------------------------------------------------
Jesse I Pollard, II
Email: pollard@cats-chateau.net

Any opinions expressed are solely my own.
