Re: [PATCH] locking/rtmutex: Fix task->pi_waiters integrity

From: Henry Wu
Date: Fri Jul 14 2023 - 13:13:39 EST


Hi, all.

Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote on Sat, Jul 8, 2023 at 00:10:
>
> Henry reported that rt_mutex_adjust_prio_check() has an ordering
> problem and puts the lie to the comment in [7]. Sharing the sort key
> between lock->waiters and owner->pi_waiters *does* create problems,
> since unlike what the comment claims, holding [L] is insufficient.
>
> Notably, consider:
>
>         A
>        / \
>       M1  M2
>       |    |
>       B    C
>
> That is, task A owns both M1 and M2, B and C block on them. In this
> case a concurrent chain walk (B & C) will modify their resp. sort keys
> in [7] while holding M1->wait_lock and M2->wait_lock. So holding [L]
> is meaningless, they're different Ls.
>
> This then gives rise to a race condition between [7] and [11], where
> the requeue of pi_waiters will observe an inconsistent tree order.
>
>   B                                  C
>
>   (holds M1->wait_lock,              (holds M2->wait_lock,
>    holds B->pi_lock)                  holds A->pi_lock)
>
>   [7]
>   waiter_update_prio();
>   ...
>   [8]
>   raw_spin_unlock(B->pi_lock);
>   ...
>                                      [10]
>                                      raw_spin_lock(A->pi_lock);
>
>                                      [11]
>                                      rt_mutex_enqueue_pi();
>                                      // observes inconsistent A->pi_waiters
>                                      // tree order
>
> Fixing this means either extending the range of the owner lock from
> [10-13] to [6-13], with the immediate problem that this means [6-8]
> hold both blocked and owner locks, or duplicating the sort key.
>
> Since the locking in chain walk is horrible enough without having to
> consider pi_lock nesting rules, duplicate the sort key instead.
>
> By giving each tree their own sort key, the above race becomes
> harmless, if C sees B at the old location, then B will correct things
> (if they need correcting) when it walks up the chain and reaches A.
>
> Fixes: fb00aca47440 ("rtmutex: Turn the plist into an rb-tree")
> Reported-by: Henry Wu <triangletrap12@xxxxxxxxx>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> ---
> kernel/locking/rtmutex.c | 170 ++++++++++++++++++++++++++--------------
> kernel/locking/rtmutex_api.c | 2
> kernel/locking/rtmutex_common.h | 47 ++++++++---
> kernel/locking/ww_mutex.h | 12 +-
> 4 files changed, 155 insertions(+), 76 deletions(-)
>

Sorry for the late reply; I overlooked your patch.

I tested your patch on a 6-core machine running kernel 6.5.0-rc1. It
works well: my test program could not find any race on the patched
kernel.

I ran it against a non-debug kernel for almost 3 hours and against a
debug kernel for 6 hours. I checked dmesg and there was no irregular
log in either case. I also verified that the test program still
reproduces the race we discussed previously on kernel 6.3.12.

You can find my test program in the attachment; the test output is
below. Feel free to add my Tested-by to the commit message.

Thanks

Henry

For non-debug kernel:

$ sudo cat /sys/kernel/debug/sched/preempt
none voluntary (full)
$ sudo ./a.out
pid: 23845
Before: prio: 20
PID LWP TTY TIME CMD
23845 23845 pts/0 00:00:00 a.out
23845 23846 pts/0 00:00:00 waiter-0
23845 23847 pts/0 00:00:00 waiter-1
23845 23848 pts/0 00:00:00 waiter-2
23845 23849 pts/0 00:00:00 waiter-3
23845 23850 pts/0 00:00:00 waiter-4
23845 23851 pts/0 00:00:00 waiter-5
23845 23852 pts/0 00:00:00 waiter-6
23845 23853 pts/0 00:00:00 waiter-7
23845 23854 pts/0 00:00:00 waiter-8
23845 23855 pts/0 00:00:00 waiter-9
23845 23856 pts/0 00:00:00 waiter-10
23845 23857 pts/0 00:00:00 waiter-11
23845 23858 pts/0 00:00:00 waiter-12
23845 23859 pts/0 00:00:00 waiter-13
23845 23860 pts/0 00:00:00 waiter-14
23845 23861 pts/0 00:00:00 waiter-15
23845 23862 pts/0 00:00:00 waiter-16
23845 23863 pts/0 00:00:00 waiter-17
23845 23864 pts/0 00:00:00 waiter-18
23845 23865 pts/0 00:00:00 waiter-19
23845 23866 pts/0 00:00:00 changer-0
23845 23867 pts/0 00:00:00 changer-1
23845 23868 pts/0 00:00:00 changer-2
23845 23869 pts/0 00:00:00 changer-3
23845 23870 pts/0 00:00:00 changer-4
23845 23871 pts/0 00:00:00 changer-5
23845 23872 pts/0 00:00:00 changer-6
23845 23873 pts/0 00:00:00 changer-7
23845 23874 pts/0 00:00:00 changer-8
23845 23875 pts/0 00:00:00 changer-9
23845 23876 pts/0 00:00:00 changer-10
23845 23877 pts/0 00:00:00 changer-11
23845 23878 pts/0 00:00:00 changer-12
23845 23879 pts/0 00:00:00 changer-13
23845 23880 pts/0 00:00:00 changer-14
23845 23881 pts/0 00:00:00 changer-15
23845 23882 pts/0 00:00:00 changer-16
23845 23883 pts/0 00:00:00 changer-17
23845 23884 pts/0 00:00:00 changer-18
23845 23885 pts/0 00:00:00 changer-19
Prio: -21, elapsed: 2.0 minutes.
Prio: -21, elapsed: 4.0 minutes.
Prio: -21, elapsed: 6.0 minutes.
.............
Prio: -21, elapsed: 166.0 minutes.
Prio: -21, elapsed: 168.0 minutes.
Prio: -21, elapsed: 170.0 minutes.
Prio: -21, elapsed: 172.0 minutes.
Prio: -21, elapsed: 174.0 minutes.

For debug kernel:

$ uname -a
Linux fedora 6.5.0-0.rc1.20230713gteb26cbb1.213.rtmutex.fc38.x86_64+debug
#1 SMP PREEMPT_DYNAMIC Fri Jul 14 14:49:04 CST 2023 x86_64 GNU/Linux
$ sudo cat /sys/kernel/debug/sched/preempt
none voluntary (full)
$ gcc -Wall -g ./pi_650_rc1.c
$ sudo ./a.out
pid: 3141
Before: prio: 20
PID LWP TTY TIME CMD
3141 3141 pts/0 00:00:00 a.out
3141 3142 pts/0 00:00:00 waiter-0
3141 3143 pts/0 00:00:00 waiter-1
3141 3144 pts/0 00:00:00 waiter-2
3141 3145 pts/0 00:00:00 waiter-3
3141 3146 pts/0 00:00:00 waiter-4
3141 3147 pts/0 00:00:00 waiter-5
3141 3148 pts/0 00:00:00 waiter-6
3141 3149 pts/0 00:00:00 waiter-7
3141 3150 pts/0 00:00:00 waiter-8
3141 3151 pts/0 00:00:00 waiter-9
3141 3152 pts/0 00:00:00 waiter-10
3141 3153 pts/0 00:00:00 waiter-11
3141 3154 pts/0 00:00:00 waiter-12
3141 3155 pts/0 00:00:00 waiter-13
3141 3156 pts/0 00:00:00 waiter-14
3141 3157 pts/0 00:00:00 waiter-15
3141 3158 pts/0 00:00:00 waiter-16
3141 3159 pts/0 00:00:00 waiter-17
3141 3160 pts/0 00:00:00 waiter-18
3141 3161 pts/0 00:00:00 waiter-19
3141 3162 pts/0 00:00:00 changer-0
3141 3163 pts/0 00:00:00 changer-1
3141 3164 pts/0 00:00:00 changer-2
3141 3165 pts/0 00:00:00 changer-3
3141 3166 pts/0 00:00:00 changer-4
3141 3167 pts/0 00:00:00 changer-5
3141 3168 pts/0 00:00:00 changer-6
3141 3169 pts/0 00:00:00 changer-7
3141 3170 pts/0 00:00:00 changer-8
3141 3171 pts/0 00:00:00 changer-9
3141 3172 pts/0 00:00:00 changer-10
3141 3173 pts/0 00:00:00 changer-11
3141 3174 pts/0 00:00:00 changer-12
3141 3175 pts/0 00:00:00 changer-13
3141 3176 pts/0 00:00:00 changer-14
3141 3177 pts/0 00:00:00 changer-15
3141 3178 pts/0 00:00:00 changer-16
3141 3179 pts/0 00:00:00 changer-17
3141 3180 pts/0 00:00:00 changer-18
3141 3181 pts/0 00:00:00 changer-19
Prio: -21, elapsed: 2.0 minutes.
Prio: -21, elapsed: 4.0 minutes.
Prio: -21, elapsed: 6.0 minutes.
Prio: -21, elapsed: 8.0 minutes.
Prio: -21, elapsed: 10.0 minutes.
Prio: -21, elapsed: 12.0 minutes.
Prio: -21, elapsed: 14.0 minutes.
....................
Prio: -21, elapsed: 388.0 minutes.
Prio: -21, elapsed: 390.0 minutes.
Prio: -21, elapsed: 392.0 minutes.
Prio: -21, elapsed: 394.0 minutes.
Prio: -21, elapsed: 396.0 minutes.
^C
$
#define _GNU_SOURCE

#include <assert.h>
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <sched.h>
#include <errno.h>
#include <time.h>
#include <sys/syscall.h>

#define PCOUNT 20
#define EXPECTED_PRIO -21
// Copied from related header.
struct sched_attr {
	uint32_t size;

	uint32_t sched_policy;
	uint64_t sched_flags;

	/* SCHED_NORMAL, SCHED_BATCH */
	int32_t sched_nice;

	/* SCHED_FIFO, SCHED_RR */
	uint32_t sched_priority;

	/* SCHED_DEADLINE */
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;

	/* Utilization hints */
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

static pthread_mutex_t mutexes[PCOUNT];

static void init_mutex(pthread_mutex_t *mutex) {
	pthread_mutexattr_t attr;
	int ret;

	ret = pthread_mutexattr_init(&attr);
	assert(!ret);
	ret = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	assert(!ret);

	ret = pthread_mutex_init(mutex, &attr);
	assert(!ret);

	pthread_mutexattr_destroy(&attr);
}

enum thread_type {
	WAITER,
	PRIO_CHANGER,
};

static struct thread_spec {
	enum thread_type type;
	int name_index;
	pthread_barrier_t *barrier;

	// Used by waiter.
	pthread_mutex_t *wait_mutex;
	pid_t pid_save;

	// Used by prio changer.
	pid_t *waiter_pid;
	uint32_t prio;
} thread_specs[PCOUNT * 2];

static pthread_t threads[PCOUNT * 2];
static int thread_count = 0;

static int barrier_wait(pthread_barrier_t *barrier) {
	if (!barrier) {
		return 0;
	}

	int ret = pthread_barrier_wait(barrier);
	assert(!ret || ret == PTHREAD_BARRIER_SERIAL_THREAD);

	return ret;
}

static int sched_getattr(pid_t pid, struct sched_attr *attr, int flags) {
	return syscall(SYS_sched_getattr, pid, attr, sizeof(*attr), flags);
}

static int sched_setattr(pid_t pid, struct sched_attr *attr, int flags) {
	return syscall(SYS_sched_setattr, pid, attr, flags);
}

static void *prio_change_loop(struct thread_spec *spec) {
	struct sched_attr attr;
	int ret;

	for (;;) {
		barrier_wait(spec->barrier);

		ret = sched_getattr(*spec->waiter_pid, &attr, 0);
		if (ret < 0) {
			perror("sched_getattr");
			assert(0);
		}

		attr.sched_policy = SCHED_RR;
		attr.sched_priority = spec->prio;

		ret = sched_setattr(*spec->waiter_pid, &attr, 0);
		if (ret < 0) {
			perror("sched_setattr");
			exit(1);
		}

		barrier_wait(spec->barrier);
	}
}

static void *thread(void *arg) {
	struct thread_spec *spec = arg;
	char name[64];
	int ret;

	snprintf(name, sizeof(name), "%s-%d",
		 spec->type == WAITER ? "waiter" : "changer",
		 spec->name_index);
	ret = pthread_setname_np(pthread_self(), name);
	assert(!ret);

	switch (spec->type) {
	case WAITER:
		spec->pid_save = gettid();

		barrier_wait(spec->barrier);

		ret = pthread_mutex_lock(spec->wait_mutex);
		assert(!ret);
		break;
	case PRIO_CHANGER:
		prio_change_loop(spec);
		break;
	default:
		assert(0);
		break;
	}

	return NULL;
}

static void create_thread(struct thread_spec *spec) {
	int ret;

	ret = pthread_create(&threads[thread_count++], NULL, thread, spec);
	assert(!ret);
}

static int get_prio(int need_print) {
	FILE *file = fopen("/proc/self/stat", "r");
	assert(file);

	char comm[64];
	int tmp, prio, ret;
	ret = fscanf(file, "%d %s %c %d %d %d %d %d %d %d %d %d %d %d %d %d %d %d",
		     &tmp, comm, comm, &tmp, &tmp, &tmp, &tmp, &tmp, &tmp,
		     &tmp, &tmp, &tmp, &tmp, &tmp, &tmp, &tmp, &tmp, &prio);
	assert(ret == 18);

	if (need_print) {
		printf("prio: %d\n", prio);
	}

	fclose(file);

	return prio;
}

static void print_threads(pid_t process_pid) {
	char cmd[128];
	snprintf(cmd, sizeof(cmd), "ps -L -p %d", process_pid);
	system(cmd);
}

int main(void) {
	pid_t pid;
	int ret;

	pid = getpid();
	printf("pid: %d\n", pid);

	for (int i = 0; i < sizeof(mutexes) / sizeof(mutexes[0]); ++i) {
		init_mutex(&mutexes[i]);
		ret = pthread_mutex_lock(&mutexes[i]);
		assert(!ret);
	}

	pthread_barrier_t barrier;
	ret = pthread_barrier_init(&barrier, NULL, PCOUNT + 1);
	assert(!ret);

	for (int i = 0; i < PCOUNT; ++i) {
		struct thread_spec *spec = &thread_specs[i];

		spec->type = WAITER;
		spec->name_index = i;
		spec->wait_mutex = &mutexes[i];
		spec->barrier = &barrier;

		create_thread(spec);
	}

	// Wait for filling of pid_save.
	barrier_wait(&barrier);

	for (int i = 0; i < PCOUNT; ++i) {
		struct thread_spec *spec = &thread_specs[i + PCOUNT];

		spec->type = PRIO_CHANGER;
		spec->name_index = i;
		spec->prio = 99;
		spec->barrier = &barrier;
		spec->waiter_pid = &thread_specs[i].pid_save;

		create_thread(spec);
	}

	printf("Before: ");
	get_prio(1);
	print_threads(pid);

	time_t ts, start_ts;
	srandom(time(&ts));
	start_ts = ts;
	int iter = 0;
	for (;;) {
		++iter;
		for (int i = 0; i < PCOUNT; ++i) {
			thread_specs[i + PCOUNT].prio = (iter % 2) ? i + 1 : PCOUNT - i;
		}

		for (int i = 0; i < PCOUNT; ++i) {
			int pos = random() % PCOUNT;
			int tmp = thread_specs[pos + PCOUNT].prio;
			thread_specs[pos + PCOUNT].prio = thread_specs[PCOUNT].prio;
			thread_specs[PCOUNT].prio = tmp;
		}

		barrier_wait(&barrier);
		barrier_wait(&barrier);

		//system("echo iteration > /sys/kernel/debug/tracing/trace_marker");

		for (int try = 0; ; ++try) {
			int prio = get_prio(0);
			time_t now = time(NULL);
			if (now - ts >= 120) {
				printf("Prio: %d, elapsed: %.1f minutes.\n",
				       prio, (double)(now - start_ts) / 60);
				ts = now;
			}

			if (prio == EXPECTED_PRIO) {
				// Try a new iteration.
				break;
			}

			if (try >= 10) {
				print_threads(pid);
				printf("found race, hang...\n");
				while (1) {
					sleep(3600);
				}
			}
		}
	}

	for (int i = 0; i < thread_count; ++i) {
		pthread_join(threads[i], NULL);
	}

	return 0;
}