Re: Enhanced profiling support (was Re: vm lock contention reduction)

From: Karim Yaghmour (karim@opersys.com)
Date: Wed Jul 10 2002 - 23:47:28 EST


Andrea Arcangeli wrote:
> btw, on the topic, with our semaphores there's no way to handle priority
> inversion with SCHED_RR tasks; if there's more than one task running at
> RT priority we may run into starvation of RT tasks too, in the same way.
>
> No starvation can happen of course if all tasks in the system belong
> to the same scheduler policy (nice levels can have effects but they're
> not indefinite delays).
>
> The fix Ingo used for SCHED_IDLE is to have a special call to the
> scheduler while returning to userspace, i.e. in the only place where we
> know the kernel isn't holding any lock. But while that only generates
> some minor unexpected cpu load with SCHED_IDLE, generalizing the hack
> so that all tasks scheduling inside the kernel run with RT priority
> isn't going to provide a nice/fair behaviour (some task could in fact
> run way too much if it's very system-hungry, in particular with
> -preempt, which could again generate starvation of userspace, even if
> no longer because of kernel locks). Maybe I'm overlooking something
> simple, but I guess it's not going to be easy to fix. For the
> semaphores it isn't too bad, since they could learn to raise the
> priority of the holder when needed, but for any hand-rolled semaphore
> (for example spinlock based) it would require some major auditing.

I wasn't aware of this particular problem, but I can certainly see LTT
being helpful in understanding the actual interactions. The custom
event creation API is rather simple to use if you would like to
instrument some of the semaphore code (the Sys V IPC code is already
instrumented at a basic level):
trace_create_event()
trace_create_owned_event()
trace_destroy_event()
trace_std_formatted_event()
trace_raw_event()

Here's the complete description:
int trace_create_event(char* pm_event_type,
                       char* pm_event_desc,
                       int pm_format_type,
                       char* pm_format_data)
        pm_event_type is a short string describing the type of event.
        pm_event_desc is required if you are using
        trace_std_formatted_event(); more on this below.
        pm_format_type and pm_format_data are required if you are
        using trace_raw_event(). The function returns a unique
        custom event ID.

int trace_create_owned_event(char* pm_event_type,
                             char* pm_event_desc,
                             int pm_format_type,
                             char* pm_format_data,
                             pid_t pm_owner_pid)
        Same as trace_create_event(), except that all events created
        using this call are deleted once the process with PID
        pm_owner_pid exits. Not really useful in kernel space, but
        essential for providing user-space events.

void trace_destroy_event(int pm_event_id)
        Destroys the event with the given pm_event_id.

int trace_std_formatted_event(int pm_event_id, ...)
        Instead of scattering a slew of printk's all around the code,
        pm_event_desc is filled with a printk-like format string at
        event creation time and the actual parameters used to fill that
        string are passed to trace_std_formatted_event(). Be aware
        that this function uses va_start/vsprintf/va_end. The
        resulting string is logged in the trace as such and shows up
        much like a printk would in a log, except that you can place
        this in paths where printk's can't go and you can be sure
        that events logged this way are delivered even in high
        throughput situations (provided the trace buffer size is adequate).
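
        For example, here's a minimal sketch (untested; the
        linux/trace.h include and the 0/NULL format arguments are my
        assumptions, see the attached module for the real usage) of
        logging a formatted event from an instrumented path:

#include <linux/module.h>
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/trace.h>        /* LTT custom event API (assumed header) */

static int sem_wait_event = -1;

static int __init sem_trace_init(void)
{
        /* Format type/data are only needed by trace_raw_event(),
           so pass 0/NULL for a printk-style event. */
        sem_wait_event = trace_create_event("sem wait",
                                            "pid %d waited %ld jiffies",
                                            0, NULL);
        return 0;
}

static void __exit sem_trace_cleanup(void)
{
        trace_destroy_event(sem_wait_event);
}

/* Called from the instrumented semaphore path: */
void log_sem_wait(long delay)
{
        trace_std_formatted_event(sem_wait_event, current->pid, delay);
}

module_init(sem_trace_init);
module_exit(sem_trace_cleanup);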

int trace_raw_event(int pm_event_id,
                    int pm_event_size,
                    void* pm_event_data)
        This is the easiest way to log tons of binary data. Just give it
        the size of the data (pm_event_size) and a pointer to it
        (pm_event_data) and it gets written to the trace. This is what
        DProbes uses. The binary data is then formatted in userspace.
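
        A raw event then boils down to handing over a struct. Again
        just a sketch: the 0/NULL creation arguments below are
        placeholders for whatever format type/data the LTT header
        actually defines for raw events.

#include <linux/sched.h>
#include <linux/trace.h>        /* assumed header, as above */

struct sem_wait_info {
        int  pid;
        long delay;
};

static int sem_raw_event = -1;

/* Meant to be called from the module's init function. */
void sem_raw_trace_setup(void)
{
        /* 0/NULL stand in for the real format type/data constants,
           check the LTT header for the ones raw events expect. */
        sem_raw_event = trace_create_event("sem wait raw", NULL, 0, NULL);
}

/* In the instrumented path: */
void log_sem_wait_raw(long delay)
{
        struct sem_wait_info info;

        info.pid   = current->pid;
        info.delay = delay;
        trace_raw_event(sem_raw_event, sizeof(info), &info);
}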

All events logged using this API can easily be extracted using LibLTT's
API. There is no requirement to use LTT's own visualization tool,
although you can still use it to see your own custom events.

As with the other traced events, each custom event gets timestamped
(do_gettimeofday) and placed in order of occurrence in the trace buffer.

I've attached an example module that uses the custom event API. See
the Examples directory of the LTT package for an example custom trace
reader which uses LibLTT.

One clear advantage of this API is that you can avoid using any
"#ifdef TRACE" or "#ifdef DEBUG". If the tracing module isn't loaded
or if the trace daemon isn't running, then nothing gets logged.

Cheers,

Karim

===================================================
                 Karim Yaghmour
               karim@opersys.com
      Embedded and Real-Time Linux Expert
===================================================

