Re: [EXT4 set 6][PATCH 1/1]Export jbd stats through procfs

From: Mingming Cao
Date: Mon Jul 16 2007 - 04:24:47 EST


On Tue, 2007-07-10 at 19:31 -0700, Andrew Morton wrote:
> On Sun, 01 Jul 2007 03:38:10 -0400 Mingming Cao <cmm@xxxxxxxxxx> wrote:
>
> > [PATCH] jbd2 stats through procfs
> >
> > The patch below updates the jbd stats patch to 2.6.20/jbd2.
> > The initial patch was posted by Alex Tomas in December 2005
> > (http://marc.info/?l=linux-ext4&m=113538565128617&w=2).
> > It provides statistics via procfs such as transaction lifetime and size.
> >
> > [ This probably should be rewritten to use debugfs? -- Ted]
> >
>
> I suppose that given that we're creating a spot in debugfs for the jbd2
> debug code, yes, this also should be moved over.
>
> But the jbd2 debug debugfs entries were kernel-wide whereas this is
> per-superblock. I think.
>

I take this comment as we still keep the stats info in proc fs...

> That email from Alex contains pretty important information. I suggest that
> it be added to the changelog after accuracy-checking. Addition to
> Documentation/filesystems/ext4.txt would be good.
>
Done.

I hope Alex can help address the rest of comments.

> > --
> >
> > Index: linux-2.6.22-rc4/include/linux/jbd2.h
> > ===================================================================
> > --- linux-2.6.22-rc4.orig/include/linux/jbd2.h 2007-06-11 17:28:17.000000000 -0700
> > +++ linux-2.6.22-rc4/include/linux/jbd2.h 2007-06-13 10:45:21.000000000 -0700
> > @@ -408,6 +408,16 @@
> > };
> >
> >
> > +/*
> > + * Some stats for checkpoint phase
> > + */
> > +struct transaction_chp_stats_s {
> > + unsigned long cs_chp_time;
> > + unsigned long cs_forced_to_close;
> > + unsigned long cs_written;
> > + unsigned long cs_dropped;
> > +};
>
> It would be nice to document what units all these fields are in. Jiffies,
> I assume.
>
> > /* The transaction_t type is the guts of the journaling mechanism. It
> > * tracks a compound transaction through its various states:
> > *
> > @@ -543,6 +553,21 @@
> > spinlock_t t_handle_lock;
> >
> > /*
> > + * Longest time some handle had to wait for running transaction
> > + */
> > + unsigned long t_max_wait;
> > +
> > + /*
> > + * When transaction started
> > + */
> > + unsigned long t_start;
> > +
> > + /*
> > + * Checkpointing stats [j_checkpoint_sem]
> > + */
> > + struct transaction_chp_stats_s t_chp_stats;
> > +
> > + /*
> > * Number of outstanding updates running on this transaction
> > * [t_handle_lock]
> > */
> > @@ -573,6 +598,57 @@
> >
> > };
> >
> > +struct transaction_run_stats_s {
> > + unsigned long rs_wait;
> > + unsigned long rs_running;
> > + unsigned long rs_locked;
> > + unsigned long rs_flushing;
> > + unsigned long rs_logging;
> > +
> > + unsigned long rs_handle_count;
> > + unsigned long rs_blocks;
> > + unsigned long rs_blocks_logged;
> > +};
> > +
> > +struct transaction_stats_s
> > +{
> > + int ts_type;
> > + unsigned long ts_tid;
> > + union {
> > + struct transaction_run_stats_s run;
> > + struct transaction_chp_stats_s chp;
> > + } u;
> > +};
> > +
> > +#define JBD2_STATS_RUN 1
> > +#define JBD2_STATS_CHECKPOINT 2
> > +
> > +#define ts_wait u.run.rs_wait
> > +#define ts_running u.run.rs_running
> > +#define ts_locked u.run.rs_locked
> > +#define ts_flushing u.run.rs_flushing
> > +#define ts_logging u.run.rs_logging
> > +#define ts_handle_count u.run.rs_handle_count
> > +#define ts_blocks u.run.rs_blocks
> > +#define ts_blocks_logged u.run.rs_blocks_logged
> > +
> > +#define ts_chp_time u.chp.cs_chp_time
> > +#define ts_forced_to_close u.chp.cs_forced_to_close
> > +#define ts_written u.chp.cs_written
> > +#define ts_dropped u.chp.cs_dropped
>
> That's a bit sleazy. We can drop the "u" from 'struct transaction_stats_s'
> and make it an anonymous union, then open-code foo.run.rs_wait everywhere.
>
> But that's a bit sleazy too, because the reader of the code would not know
> that a write to foo.run.rs_wait will stomp on the value of
> foo.chp.cs_chp_time.
>
> So to minimize reader confusion it would be best I think to just open-code
> the full u.run.rs_wait at all code-sites.
>
> The macros above are the worst possible choice: they hide information from
> the code-reader just to save the code-writer a bit of typing. But we very
> much want to optimise for code-readers, not for code-writers.
>
> > +#define CURRENT_MSECS (jiffies_to_msecs(jiffies))
>
> hm, that isn't something which should be in an ext4 header file. And it
> shouldn't be in UPPER CASE and it shouldn't pretend to be a constant (or a
> global scalar).
>
> IOW: yuk.
>
> How's about raising a separate, standalone patch which creates a new
> kernel-wide coded-in-C function such as
>
> unsigned long jiffies_in_msecs(void);
>
> ? (That's assuming we don't already have one. Most likely we have seven
> of them hiding in various dark corners).
>
> > +static inline unsigned int
> > +jbd2_time_diff(unsigned int start, unsigned int end)
> > +{
> > + if (unlikely(start > end))
> > + end = end + (~0UL - start);
> > + else
> > + end -= start;
> > + return end;
> > +}
>
> I don't know what this does and I am disinclined to reverse-engineer its
> uncommentedness. But it looks like something which should be in some
> kernel-wide header file?
>
> > /**
> > * struct journal_s - The journal_s type is the concrete type associated with
> > * journal_t.
> > @@ -634,6 +710,12 @@
> > * @j_wbufsize: maximum number of buffer_heads allowed in j_wbuf, the
> > * number that will fit in j_blocksize
> > * @j_last_sync_writer: most recent pid which did a synchronous write
> > + * @j_history: Buffer storing the transactions statistics history
> > + * @j_history_max: Maximum number of transactions in the statistics history
> > + * @j_history_cur: Current number of transactions in the statistics history
> > + * @j_history_lock: Protect the transactions statistics history
> > + * @j_proc_entry: procfs entry for the jbd statistics directory
> > + * @j_stats: Overall statistics
> > * @j_private: An opaque pointer to fs-private information.
> > */
> >
> > @@ -826,6 +908,16 @@
> > pid_t j_last_sync_writer;
> >
> > /*
> > + * Journal statistics
> > + */
> > + struct transaction_stats_s *j_history;
> > + int j_history_max;
> > + int j_history_cur;
> > + spinlock_t j_history_lock;
> > + struct proc_dir_entry *j_proc_entry;
> > + struct transaction_stats_s j_stats;
>
> hm, we seem to be documenting the struct journal_s fields at two different
> places, possibly with duplication.
>
> Documenting the fields at the top in kernedoc-style drives me wild,
> frankly. We really really do want to be able to see the description of
> each field (which is vital information for understanding the code!) right
> there at the definition site.
>
> But at least it's documented somewhere.
>
> > + /*
> > * An opaque pointer to fs-private information. ext3 puts its
> > * superblock pointer here
> > */
> > Index: linux-2.6.22-rc4/fs/jbd2/transaction.c
> > ===================================================================
> > --- linux-2.6.22-rc4.orig/fs/jbd2/transaction.c 2007-06-11 17:22:14.000000000 -0700
> > +++ linux-2.6.22-rc4/fs/jbd2/transaction.c 2007-06-13 10:47:56.000000000 -0700
> > @@ -59,6 +59,8 @@
> >
> > J_ASSERT(journal->j_running_transaction == NULL);
> > journal->j_running_transaction = transaction;
> > + transaction->t_max_wait = 0;
> > + transaction->t_start = CURRENT_MSECS;
>
> Ah, it's in milliseconds!
>
> Good, albeit somewhat inefficient. Usually we'd accumulate jiffies and
> convert to/from milliseconds at the user interface boundary.
>
> > --- linux-2.6.22-rc4.orig/fs/jbd2/checkpoint.c 2007-06-11 17:22:14.000000000 -0700
> > +++ linux-2.6.22-rc4/fs/jbd2/checkpoint.c 2007-06-13 10:48:44.000000000 -0700
> > @@ -232,7 +232,8 @@
> > * Called under jbd_lock_bh_state(jh2bh(jh)), and drops it
> > */
> > static int __process_buffer(journal_t *journal, struct journal_head *jh,
> > - struct buffer_head **bhs, int *batch_count)
> > + struct buffer_head **bhs, int *batch_count,
> > + transaction_t *transaction)
> > {
> > struct buffer_head *bh = jh2bh(jh);
> > int ret = 0;
> > @@ -250,6 +251,7 @@
> > transaction_t *t = jh->b_transaction;
> > tid_t tid = t->t_tid;
> >
> > + transaction->t_chp_stats.cs_forced_to_close++;
> > spin_unlock(&journal->j_list_lock);
> > jbd_unlock_bh_state(bh);
> > jbd2_log_start_commit(journal, tid);
> > @@ -279,6 +281,7 @@
> > bhs[*batch_count] = bh;
> > __buffer_relink_io(jh);
> > jbd_unlock_bh_state(bh);
> > + transaction->t_chp_stats.cs_written++;
> > (*batch_count)++;
> > if (*batch_count == NR_BATCH) {
> > spin_unlock(&journal->j_list_lock);
> > @@ -322,6 +325,8 @@
> > if (!journal->j_checkpoint_transactions)
> > goto out;
> > transaction = journal->j_checkpoint_transactions;
> > + if (transaction->t_chp_stats.cs_chp_time == 0)
> > + transaction->t_chp_stats.cs_chp_time = CURRENT_MSECS;
>
> Oh dear, I have no clue what cs_chp_time represents. Number of
> milliseconds for this transaction, I guess. In which case we must be
> either a) triggering stats collection via some user interface of b)
> averaging these on an ongoing basis or c) something else.
>
> Let me read on...
>
> > this_tid = transaction->t_tid;
> > restart:
> > /*
> > @@ -346,7 +351,8 @@
> > retry = 1;
> > break;
> > }
> > - retry = __process_buffer(journal, jh, bhs,&batch_count);
> > + retry = __process_buffer(journal, jh, bhs, &batch_count,
> > + transaction);
> > if (!retry && lock_need_resched(&journal->j_list_lock)){
> > spin_unlock(&journal->j_list_lock);
> > retry = 1;
> > @@ -668,6 +674,8 @@
> >
> > void __jbd2_journal_drop_transaction(journal_t *journal, transaction_t *transaction)
> > {
> > + struct transaction_stats_s stats;
> > +
> > assert_spin_locked(&journal->j_list_lock);
> > if (transaction->t_cpnext) {
> > transaction->t_cpnext->t_cpprev = transaction->t_cpprev;
> > @@ -693,5 +701,26 @@
> > J_ASSERT(journal->j_running_transaction != transaction);
> >
> > jbd_debug(1, "Dropping transaction %d, all done\n", transaction->t_tid);
> > +
> > + /*
> > + * File the transaction for history
> > + */
> > + if (transaction->t_chp_stats.cs_written != 0 ||
> > + transaction->t_chp_stats.cs_chp_time != 0) {
> > + stats.ts_type = JBD2_STATS_CHECKPOINT;
> > + stats.ts_tid = transaction->t_tid;
> > + stats.u.chp = transaction->t_chp_stats;
> > + if (stats.ts_chp_time)
> > + stats.ts_chp_time =
> > + jbd2_time_diff(stats.ts_chp_time,
> > + CURRENT_MSECS);
> > + spin_lock(&journal->j_history_lock);
> > + memcpy(journal->j_history + journal->j_history_cur, &stats,
> > + sizeof(stats));
> > + if (++journal->j_history_cur == journal->j_history_max)
> > + journal->j_history_cur = 0;
> > + spin_unlock(&journal->j_history_lock);
> > + }
> > +
>
> Ah, it's a per-journal ringbuffer. Fair enough.
>
> > +struct jbd2_stats_proc_session {
> > + journal_t *journal;
> > + struct transaction_stats_s *stats;
> > + int start;
> > + int max;
> > +};
> > +
> > +static void *jbd2_history_skip_empty(struct jbd2_stats_proc_session *s,
> > + struct transaction_stats_s *ts,
> > + int first)
> > +{
> > + if (ts == s->stats + s->max)
> > + ts = s->stats;
> > + if (!first && ts == s->stats + s->start)
> > + return NULL;
> > + while (ts->ts_type == 0) {
> > + ts++;
> > + if (ts == s->stats + s->max)
> > + ts = s->stats;
> > + if (ts == s->stats + s->start)
> > + return NULL;
> > + }
> > + return ts;
> > +
> > +}
> > +
> > +static void *jbd2_seq_history_start(struct seq_file *seq, loff_t *pos)
> > +{
> > + struct jbd2_stats_proc_session *s = seq->private;
> > + struct transaction_stats_s *ts;
> > + int l = *pos;
> > +
> > + if (l == 0)
> > + return SEQ_START_TOKEN;
> > + ts = jbd2_history_skip_empty(s, s->stats + s->start, 1);
> > + if (!ts)
> > + return NULL;
> > + l--;
> > + while (l) {
> > + ts = jbd2_history_skip_empty(s, ++ts, 0);
> > + if (!ts)
> > + break;
> > + l--;
> > + }
> > + return ts;
> > +}
> > +
> > +static void *jbd2_seq_history_next(struct seq_file *seq, void *v, loff_t *pos)
> > +{
> > + struct jbd2_stats_proc_session *s = seq->private;
> > + struct transaction_stats_s *ts = v;
> > +
> > + ++*pos;
> > + if (v == SEQ_START_TOKEN)
> > + return jbd2_history_skip_empty(s, s->stats + s->start, 1);
> > + else
> > + return jbd2_history_skip_empty(s, ++ts, 0);
> > +}
> > +
> > +static int jbd2_seq_history_show(struct seq_file *seq, void *v)
> > +{
> > + struct transaction_stats_s *ts = v;
> > + if (v == SEQ_START_TOKEN) {
> > + seq_printf(seq, "%-4s %-5s %-5s %-5s %-5s %-5s %-5s %-6s %-5s "
> > + "%-5s %-5s %-5s %-5s %-5s\n", "R/C", "tid",
> > + "wait", "run", "lock", "flush", "log", "hndls",
> > + "block", "inlog", "ctime", "write", "drop",
> > + "close");
> > + return 0;
> > + }
> > + if (ts->ts_type == JBD2_STATS_RUN)
> > + seq_printf(seq, "%-4s %-5lu %-5lu %-5lu %-5lu %-5lu %-5lu "
> > + "%-6lu %-5lu %-5lu\n", "R", ts->ts_tid,
> > + ts->ts_wait, ts->ts_running, ts->ts_locked,
> > + ts->ts_flushing, ts->ts_logging,
> > + ts->ts_handle_count, ts->ts_blocks,
> > + ts->ts_blocks_logged);
> > + else if (ts->ts_type == JBD2_STATS_CHECKPOINT)
> > + seq_printf(seq, "%-4s %-5lu %48s %-5lu %-5lu %-5lu %-5lu\n",
> > + "C", ts->ts_tid, " ", ts->ts_chp_time,
> > + ts->ts_written, ts->ts_dropped,
> > + ts->ts_forced_to_close);
> > + else
> > + J_ASSERT(0);
> > + return 0;
> > +}
> > +
> > +static void jbd2_seq_history_stop(struct seq_file *seq, void *v)
> > +{
> > +}
> > +
> > +static struct seq_operations jbd2_seq_history_ops = {
> > + .start = jbd2_seq_history_start,
> > + .next = jbd2_seq_history_next,
> > + .stop = jbd2_seq_history_stop,
> > + .show = jbd2_seq_history_show,
> > +};
> > +
> > +static int jbd2_seq_history_open(struct inode *inode, struct file *file)
> > +{
> > + journal_t *journal = PDE(inode)->data;
> > + struct jbd2_stats_proc_session *s;
> > + int rc, size;
> > +
> > + s = kmalloc(sizeof(*s), GFP_KERNEL);
> > + if (s == NULL)
> > + return -EIO;
>
> ENOMEM.
>
> > + size = sizeof(struct transaction_stats_s) * journal->j_history_max;
> > + s->stats = kmalloc(size, GFP_KERNEL);
> > + if (s == NULL) {
> > + kfree(s);
> > + return -EIO;
>
> ENOMEM.
>
> > + }
> > + spin_lock(&journal->j_history_lock);
> > + memcpy(s->stats, journal->j_history, size);
> > + s->max = journal->j_history_max;
> > + s->start = journal->j_history_cur % s->max;
> > + spin_unlock(&journal->j_history_lock);
> > +
> > + rc = seq_open(file, &jbd2_seq_history_ops);
> > + if (rc == 0) {
> > + struct seq_file *m = (struct seq_file *)file->private_data;
>
> unneeded cast of void*
>
> > + m->private = s;
> > + } else {
> > + kfree(s->stats);
> > + kfree(s);
> > + }
> > + return rc;
> > +
> > +}
> > +
> > +static int jbd2_seq_history_release(struct inode *inode, struct file *file)
> > +{
> > + struct seq_file *seq = (struct seq_file *)file->private_data;
>
> unneeded cast of void*
>
> > + struct jbd2_stats_proc_session *s = seq->private;
> > + kfree(s->stats);
> > + kfree(s);
> > + return seq_release(inode, file);
> > +}
> > +
> >
> > ...
> >
> > +static int jbd2_seq_info_open(struct inode *inode, struct file *file)
> > +{
> > + journal_t *journal = PDE(inode)->data;
> > + struct jbd2_stats_proc_session *s;
> > + int rc, size;
> > +
> > + s = kmalloc(sizeof(*s), GFP_KERNEL);
> > + if (s == NULL)
> > + return -EIO;
>
> ENOMEM
>
> > + size = sizeof(struct transaction_stats_s);
> > + s->stats = kmalloc(size, GFP_KERNEL);
> > + if (s == NULL) {
> > + kfree(s);
> > + return -EIO;
>
> ENOMEM
>
> > + }
> > + spin_lock(&journal->j_history_lock);
> > + memcpy(s->stats, &journal->j_stats, size);
> > + s->journal = journal;
> > + spin_unlock(&journal->j_history_lock);
> > +
> > + rc = seq_open(file, &jbd2_seq_info_ops);
> > + if (rc == 0) {
> > + struct seq_file *m = (struct seq_file *)file->private_data;
> > + m->private = s;
> > + } else {
> > + kfree(s->stats);
> > + kfree(s);
> > + }
> > + return rc;
> > +
> > +}
> > +
> > +static int jbd2_seq_info_release(struct inode *inode, struct file *file)
> > +{
> > + struct seq_file *seq = (struct seq_file *)file->private_data;
>
> Unneeded cast
>
> > + struct jbd2_stats_proc_session *s = seq->private;
> > + kfree(s->stats);
> > + kfree(s);
> > + return seq_release(inode, file);
> > +}
>
> We would usually put a blanck line betwee end-of-locals and start-of-code.
>
> > +static struct file_operations jbd2_seq_info_fops = {
> > + .owner = THIS_MODULE,
> > + .open = jbd2_seq_info_open,
> > + .read = seq_read,
> > + .llseek = seq_lseek,
> > + .release = jbd2_seq_info_release,
> > +};
> > +
> > +static struct proc_dir_entry *proc_jbd2_stats = NULL;
>
> Unneeded (and wasteful) initialisation (checkpatch should catch this)
>
> > +static void jbd2_stats_proc_init(journal_t *journal)
> > +{
> > + char name[64];
>
> This should have size BDEVNAME_SIZE.
>
> > + snprintf(name, sizeof(name) - 1, "%s", bdevname(journal->j_dev, name));
> > + journal->j_proc_entry = proc_mkdir(name, proc_jbd2_stats);
> > + if (journal->j_proc_entry) {
> > + struct proc_dir_entry *p;
> > + p = create_proc_entry("history", S_IRUGO,
> > + journal->j_proc_entry);
> > + if (p) {
> > + p->proc_fops = &jbd2_seq_history_fops;
> > + p->data = journal;
> > + p = create_proc_entry("info", S_IRUGO,
> > + journal->j_proc_entry);
> > + if (p) {
> > + p->proc_fops = &jbd2_seq_info_fops;
> > + p->data = journal;
> > + }
> > + }
> > + }
> > +}
> > +
> > +static void jbd2_stats_proc_exit(journal_t *journal)
> > +{
> > + char name[64];
>
> ditto
>
> > + snprintf(name, sizeof(name) - 1, "%s", bdevname(journal->j_dev, name));
> > + remove_proc_entry("info", journal->j_proc_entry);
> > + remove_proc_entry("history", journal->j_proc_entry);
> > + remove_proc_entry(name, proc_jbd2_stats);
> > +}
> > +
> > +static void journal_init_stats(journal_t *journal)
> > +{
> > + int size;
> > +
> > + if (proc_jbd2_stats == NULL)
> > + return;
> > +
> > + journal->j_history_max = 100;
> > + size = sizeof(struct transaction_stats_s) * journal->j_history_max;
> > + journal->j_history = kmalloc(size, GFP_KERNEL);
> > + if (journal->j_history == NULL) {
> > + journal->j_history_max = 0;
> > + return;
> > + }
> > + memset(journal->j_history, 0, size);
> > + spin_lock_init(&journal->j_history_lock);
> > +}
>
> So the "100" is not alterable at present although the code is pretty much
> set up to do this (code which alters the size of the history buffer will
> need to take j_history_lock).
>
> > + journal_init_stats(journal);
>
> This is all a heck of a lot of new code. Should we plan to make it
> Kconfigurable?
>
> > +#if defined(CONFIG_PROC_FS)
>
> #ifdef CONFIG_PROC_FS
>
> would be more typical
>
> > +#define JBD2_STATS_PROC_NAME "fs/jbd2"
>
> hm, now what's going one here. Maybe I misiunderstood, and the stats are
> not per-superblock at all. That would be a shame.
>
> > +static void __init jbd2_create_jbd_stats_proc_entry(void)
> > +{
> > + proc_jbd2_stats = proc_mkdir(JBD2_STATS_PROC_NAME, NULL);
> > +}
> > +
> > +static void __exit jbd2_remove_jbd_stats_proc_entry(void)
> > +{
> > + if (proc_jbd2_stats)
> > + remove_proc_entry(JBD2_STATS_PROC_NAME, NULL);
> > +}
> > +
> > +#else
> > +
> > +#define jbd2_create_jbd_stats_proc_entry() do {} while (0)
> > +#define jbd2_remove_jbd_stats_proc_entry() do {} while (0)
>
> These could be coded in C. Remember: only use macros for things which
> _cannot_ be coded in C.
>
>
> > +#endif
> > +
> > struct kmem_cache *jbd2_handle_cache;
> >
> > static int __init journal_init_handle_cache(void)
> > @@ -2046,6 +2376,7 @@
> > if (ret != 0)
> > jbd2_journal_destroy_caches();
> > jbd2_create_debugfs_entry();
> > + jbd2_create_jbd_stats_proc_entry();
> > return ret;
> > }
> >
> > @@ -2057,6 +2388,7 @@
> > printk(KERN_EMERG "JBD: leaked %d journal_heads!\n", n);
> > #endif
> > jbd2_remove_debugfs_entry();
> > + jbd2_remove_jbd_stats_proc_entry();
> > jbd2_journal_destroy_caches();
> > }
>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/