> On Thu, 30 Nov 2006 11:05:32 -0500
> Wendy Cheng <wcheng@xxxxxxxxxx> wrote:
>
>> The idea is, instead of unconditionally dropping every buffer associated
>> with the particular mount point (which defeats the purpose of page
>> caching), the base kernel exports the "drop_pagecache_sb()" call so that
>> the page cache can be trimmed. More importantly, the call is changed to
>> offer the choice of not purging buffers at random, but only those that
>> appear to be unused (i_state is NULL and i_count is zero). This will
>> encourage filesystems to respond proactively to VM memory shortage if
>> they choose to do so.
>
> argh.

I read this as: "It is ok to give system admin(s) a command (which is what
this drop_pagecache_sb() call is all about) to drop the page cache. It is,
however, not ok to give filesystem developer(s) this very same function to
trim their own page cache if the filesystems choose to do so"?
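For reference, the existing fs/drop_caches.c loop simply invalidates every
mapping on the superblock; the selective variant described above would look
roughly like the sketch below. This is only a sketch against the 2.6-era
structure (global inode_lock, invalidate_mapping_pages()), not the actual
patch:

#include <linux/fs.h>
#include <linux/mm.h>

/*
 * Sketch only (not the actual patch): a drop_pagecache_sb() that skips
 * everything except apparently-unused inodes (i_state == 0, i_count == 0),
 * assuming the 2.6-era fs/drop_caches.c layout and the global inode_lock.
 */
static void drop_pagecache_sb(struct super_block *sb)
{
	struct inode *inode;

	spin_lock(&inode_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		/* dirty, under I/O, or still referenced - leave it alone */
		if (inode->i_state || atomic_read(&inode->i_count))
			continue;
		invalidate_mapping_pages(inode->i_mapping, 0, -1);
	}
	spin_unlock(&inode_lock);
}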
> In Linux a filesystem is a dumb layer which sits between the VFS and the
> I/O layer and provides dumb services such as reading/writing inodes,
> reading/writing directory entries, mapping pagecache offsets to disk
> blocks, etc.  (This model is to varying degrees incorrect for every
> post-ext2 filesystem, but that's the way it is).

The Linux kernel, particularly the VFS layer, is starting to show signs of
inadequacy as the software components built on top of it keep growing. I have
doubts that it can keep up with this complexity under a development policy
like the one you just described ("a filesystem is a dumb layer"?). Aren't the
DIO_xxx_LOCKING flags inside __blockdev_direct_IO() a perfect example of why
trying to do too many things inside the VFS layer, for so many filesystems,
is a bad idea? By the way, since we're on this subject, could we discuss the
vfs rename call a little (or I can start another discussion thread)?
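(For anyone who has not run into those flags: abbreviated, and from memory of
the 2.6-era include/linux/fs.h, the single generic helper has to understand
three different locking contracts, which is the kind of per-filesystem
special-casing being pointed at here.)

/* Abbreviated sketch, from memory of the 2.6-era declarations. */
enum {
	DIO_LOCKING = 1,	/* helper takes/drops the inode locks itself */
	DIO_NO_LOCKING,		/* no locking at all (e.g. raw block devices) */
	DIO_OWN_LOCKING,	/* filesystem handles all of its own locking */
};

ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
			     struct block_device *bdev, const struct iovec *iov,
			     loff_t offset, unsigned long nr_segs,
			     get_block_t get_block, dio_iodone_t end_io,
			     int dio_lock_type);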
>> Cluster locks are expensive because:
>> [...]

From our end (cluster locks are expensive - that's why we cache them), one of
our kernel daemons will invoke this newly exported call based on a set of
pre-defined tunables. That is then followed by lock-reclaim logic that trims
the locks by checking the page cache associated with the inode each cluster
lock was created for. If nothing is attached to the inode (based on the
i_mapping->nrpages count), we know it is a good candidate for trimming and we
drop the lock right away (instead of waiting until the end of the VFS inode
life cycle).
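In pseudo-C, the reclaim pass amounts to something like the sketch below.
struct cluster_lock, lock_cache_list and drop_cluster_lock() are hypothetical
stand-ins for the real GFS/DLM structures; only drop_pagecache_sb() and
i_mapping->nrpages come from the proposal above:

#include <linux/fs.h>
#include <linux/list.h>

/* Hypothetical sketch of the daemon's reclaim pass; cluster_lock,
 * lock_cache_list and drop_cluster_lock() are stand-ins, not real
 * GFS/DLM interfaces. */
struct cluster_lock {
	struct list_head list;
	struct inode *inode;		/* inode this cluster lock was created for */
};

static LIST_HEAD(lock_cache_list);	/* all cached cluster locks */
static void drop_cluster_lock(struct cluster_lock *cl);	/* hypothetical */

static void reclaim_idle_locks(struct super_block *sb)
{
	struct cluster_lock *cl, *next;

	/* step 1: let the VFS trim page cache for unused inodes */
	drop_pagecache_sb(sb);

	/* step 2: drop locks whose inodes no longer cache anything */
	list_for_each_entry_safe(cl, next, &lock_cache_list, list) {
		if (cl->inode && cl->inode->i_mapping->nrpages == 0) {
			list_del(&cl->list);
			drop_cluster_lock(cl);	/* hand the lock back early */
		}
	}
}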
> Again, I don't understand why you're tying the lifetime of these locks to
> the VFS inode reclaim mechanisms.  Seems odd.
>
> If you want to put an upper bound on the number of in-core locks, why not
> string them on a list and throw away the old ones when the upper bound is
> reached?

Don't get me wrong. DLM *has* a tunable to set the maximum lock count. We do
drop locks, but to drop the *right* locks we need a little help from the VFS
layer; otherwise the latency requirement is difficult to manage.
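For comparison, the scheme suggested above is essentially a capped LRU.
Reusing the hypothetical struct cluster_lock and drop_cluster_lock() from the
earlier sketch (again, hypothetical names, not the DLM tunable interface), it
would look roughly like this:

static LIST_HEAD(lock_lru);		/* most recently used locks at the head */
static unsigned long nr_locks;
static unsigned long max_locks = 100000;	/* the suggested upper bound */

static void cluster_lock_touch(struct cluster_lock *cl)
{
	list_move(&cl->list, &lock_lru);
}

static void cluster_lock_insert(struct cluster_lock *cl)
{
	list_add(&cl->list, &lock_lru);
	if (++nr_locks > max_locks) {
		/* Evict the oldest lock.  The objection above is that
		 * "oldest" is not necessarily "right": that lock may still
		 * be backing cached pages, hence the nrpages check. */
		struct cluster_lock *old =
			list_entry(lock_lru.prev, struct cluster_lock, list);
		list_del(&old->list);
		drop_cluster_lock(old);		/* hypothetical, as before */
		nr_locks--;
	}
}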
> Did you look at improving that lock-lookup algorithm, btw?  Core kernel has
> no problem maintaining millions of cached VFS objects - is there any reason
> why your lock lookup cannot be similarly efficient?

Don't be so confident. I did see complaints from ext3-based mail servers in
the past - when the storage grew large enough, people had to explicitly
umount the filesystem from time to time to get their performance back. I
don't recall the details at the moment, though.