Re: [PATCH 00/13] scsi: Support LUN/target based error handle

From: haowenchao (C)
Date: Tue Aug 15 2023 - 10:09:15 EST


On 2023/7/24 7:44, Wenchao Hao wrote:
The original error handler sets the host to the recovery state while it
performs error recovery, so none of the LUNs that share the host can
handle I/O. This is unacceptable for systems that deploy many LUNs
behind one HBA.

This patchset introduces support for LUN/target based error handling;
drivers can choose whether to implement it. They can implement LUN
based, target based, or both kinds of error handling, according to their
own error handling strategy. The first patch defines the framework,
which abstracts three key operations: adding an error command, waking up
the error handler, and blocking I/O while an error command is added and
being recovered. Drivers should implement these three callbacks and
register them with the SCSI mid-layer.

Besides the basic framework, this patchset also adds a basic LUN/target
based error handling strategy.

LUN based EH tries check sense, start unit and LUN reset. If these
steps cannot recover all error commands, it falls back to further
recovery: target based (if implemented) or host based error handling.

Target based EH works the same way: it tries check sense, start unit,
LUN reset and target reset. If these steps cannot recover all error
commands, it falls back to host based error handling.
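
The fallback order described in the two paragraphs above can be modelled
with a small script. This is only a toy model and touches no kernel
interface: eh_chain and recover are hypothetical names, the Y/N arguments
stand for "LUN based EH set up" and "target based EH set up", and each
level is treated as one pass/fail step instead of the individual check
sense, start unit and reset steps.

```bash
#!/bin/bash
# Toy model of the fallback order; all names here are hypothetical.

# eh_chain <lun_eh> <target_eh>: print the handlers in fallback order.
eh_chain() {
    local chain=""
    [ "$1" = Y ] && chain="lun "
    [ "$2" = Y ] && chain="${chain}target "
    echo "${chain}host"
}

# recover <lun_eh> <target_eh> [failing handlers...]: walk the chain and
# print the first handler that is not listed as failing.
recover() {
    local lun_eh=$1 target_eh=$2 handler
    shift 2
    for handler in $(eh_chain "$lun_eh" "$target_eh"); do
        case " $* " in
            *" $handler "*) continue ;;  # this level could not recover all commands
        esac
        echo "$handler"
        return 0
    done
    echo "none"
    return 1
}

recover Y Y lun    # LUN EH fails, target EH takes over: prints "target"
recover Y N lun    # no target EH set up: prints "host"
recover N Y        # nothing fails: prints "target"
```

Host based recovery is always the last entry in the chain, so a driver
that sets up neither level behaves as before.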

This patchset was tested with scsi_debug, which supports single LUN
error injection. The scsi_debug patches are here:

https://lore.kernel.org/linux-scsi/20230723234105.1628982-1-haowenchao2@xxxxxxxxxx/T/#t


I tested this patch set with scsi_debug in the following scenarios; see
the attachments for my test script and result logs.

LUN based error handle only (lun_eh=Y target_eh=N):

+-----------+---------+-------------------------------------------------------+
| lun reset | TUR     | Desired result                                        |
+-----------+---------+-------------------------------------------------------+
| success   | success | retry or finish with EIO(may offline disk)            |
+-----------+---------+-------------------------------------------------------+
| success   | fail    | fallback to host recovery, retry or finish with       |
|           |         | EIO(may offline disk)                                 |
+-----------+---------+-------------------------------------------------------+
| fail      | NA      | fallback to host recovery, retry or finish with       |
|           |         | EIO(may offline disk)                                 |
+-----------+---------+-------------------------------------------------------+

Target based error handle only (lun_eh=N target_eh=Y):

+-----------+---------+--------------+---------+------------------------------+
| lun reset | TUR     | target reset | TUR     | Desired result               |
+-----------+---------+--------------+---------+------------------------------+
| success   | success | NA           | NA      | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| success   | fail    | success      | success | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | success | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | fail    | fallback to host recovery,   |
|           |         |              |         | retry or finish with EIO(may |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | fail         | NA      | fallback to host recovery,   |
|           |         |              |         | retry or finish with EIO(may |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+

Both LUN and target based error handle (lun_eh=Y target_eh=Y):

+-----------+---------+--------------+---------+------------------------------+
| lun reset | TUR     | target reset | TUR     | Desired result               |
+-----------+---------+--------------+---------+------------------------------+
| success   | success | NA           | NA      | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| success   | fail    | success      | success | LUN recovery falls back to   |
|           |         |              |         | target recovery, retry or    |
|           |         |              |         | finish with EIO(may offline  |
|           |         |              |         | disk)                        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | success | LUN recovery falls back to   |
|           |         |              |         | target recovery, retry or    |
|           |         |              |         | finish with EIO(may offline  |
|           |         |              |         | disk)                        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | fail    | LUN recovery falls back to   |
|           |         |              |         | target recovery, then falls  |
|           |         |              |         | back to host recovery, retry |
|           |         |              |         | or finish with EIO(may       |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | fail         | NA      | LUN recovery falls back to   |
|           |         |              |         | target recovery, then falls  |
|           |         |              |         | back to host recovery, retry |
|           |         |              |         | or finish with EIO(may       |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+


Wenchao Hao (13):
scsi: Define basic framework for driver LUN/target based error handle
scsi:scsi_error: Move complete variable eh_action from shost to sdevice
scsi:scsi_error: Check if to do reset in scsi_try_xxx_reset
scsi:scsi_error: Add helper scsi_eh_sdev_stu to do START_UNIT
scsi:scsi_error: Add helper scsi_eh_sdev_reset to do lun reset
scsi:scsi_error: Add flags to mark error handle steps has done
scsi:scsi_error: Define helper to perform LUN based error handle
scsi:scsi_error: Add LUN based error handler based previous helper
scsi:core: increase/decrease target_busy without check can_queue
scsi:scsi_error: Define helper to perform target based error handle
scsi:scsi_error: Add target based error handler based previous helper
scsi:scsi_debug: Add param to control if setup LUN based error handle
scsi:scsi_debug: Add param to control if setup target based error handle

drivers/scsi/scsi_debug.c | 19 +
drivers/scsi/scsi_error.c | 705 ++++++++++++++++++++++++++++++++++---
drivers/scsi/scsi_lib.c | 23 +-
drivers/scsi/scsi_priv.h | 20 ++
include/scsi/scsi_device.h | 97 +++++
include/scsi/scsi_eh.h | 4 +
include/scsi/scsi_host.h | 2 -
7 files changed, 813 insertions(+), 57 deletions(-)

Attachment: logs.tar.gz
Description: GNU Zip compressed data

#!/bin/bash

scsi_debug=/mnt/mainline/drivers/scsi/scsi_debug.ko

function clear_error()
{
    error=$1
    tmpfile=$$_clear
    # skip the header line, keep type and opcode of every injected rule
    cat $error | grep -v Type | awk '{print $1,$3}' > $tmpfile
    # write "- <type> <opcode>" back to drop each rule
    while read -r line; do echo "- $line" > $error; done < $tmpfile
    rm -rf $tmpfile

    echo 0 > /sys/kernel/debug/scsi_debug/target$target_id/fail_reset
}

function lun_test_sense1()
{
    echo "LUN reset success, TUR success"

    # inject a timeout for the write command (opcode 0x2a)
    echo "0 -10 0x2a " > ${error}
    # make the abort of the write command fail
    echo "3 -1 0x2a " > ${error}

    dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
    echo $(cat /sys/block/$disk/device/state)

    clear_error $error
    echo running > /sys/block/$disk/device/state
}

function lun_test_sense2()
{
    echo "LUN reset success, TUR failed"

    # inject a timeout for the write command (opcode 0x2a)
    echo "0 -10 0x2a " > ${error}
    # make the abort of the write command fail
    echo "3 -1 0x2a " > ${error}
    # inject a timeout for the TUR command (opcode 0x0)
    echo "0 -1 0x0 " > ${error}

    dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
    echo $(cat /sys/block/$disk/device/state)

    clear_error $error
    echo running > /sys/block/$disk/device/state
}

function lun_test_sense3()
{
    echo "LUN reset failed, fallback to target reset success"

    # inject a timeout for the write command (opcode 0x2a)
    echo "0 -10 0x2a " > ${error}
    # make the abort of the write command fail
    echo "3 -1 0x2a " > ${error}
    # make the LUN reset fail
    echo "4 -1 0xff" > ${error}

    dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
    echo $(cat /sys/block/$disk/device/state)

    clear_error $error
    echo running > /sys/block/$disk/device/state
}

function target_test_sense1()
{
    echo "LUN reset success, TUR success"

    # inject a timeout for the write command (opcode 0x2a)
    echo "0 -10 0x2a " > ${error}
    # make the abort of the write command fail
    echo "3 -1 0x2a " > ${error}

    dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
    echo $(cat /sys/block/$disk/device/state)

    clear_error $error
    echo running > /sys/block/$disk/device/state
}

function target_test_sense2()
{
    echo "LUN reset success, TUR failed, target reset success, TUR success"

    # inject a timeout for the write command (opcode 0x2a)
    echo "0 -10 0x2a " > ${error}
    # make the abort of the write command fail
    echo "3 -1 0x2a " > ${error}
    # inject a timeout for the TUR command (opcode 0x0)
    echo "0 -1 0x0 " > ${error}

    dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
    echo $(cat /sys/block/$disk/device/state)

    clear_error $error
    echo running > /sys/block/$disk/device/state
}

function target_test_sense3()
{
    echo "LUN reset failed, target reset success, TUR success"

    # inject a timeout for the write command (opcode 0x2a)
    echo "0 -10 0x2a " > ${error}
    # make the abort of the write command fail
    echo "3 -1 0x2a " > ${error}
    # make the LUN reset fail
    echo "4 -1 0xff" > ${error}

    dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
    echo $(cat /sys/block/$disk/device/state)

    clear_error $error
    echo running > /sys/block/$disk/device/state
}

function target_test_sense4()
{
    echo "LUN reset failed, target reset success, TUR failed"

    # inject a timeout for the write command (opcode 0x2a)
    echo "0 -10 0x2a " > ${error}
    # make the abort of the write command fail
    echo "3 -1 0x2a " > ${error}
    # make the LUN reset fail
    echo "4 -1 0xff" > ${error}
    # inject a timeout for the TUR command (opcode 0x0)
    echo "0 -1 0x0 " > ${error}

    dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
    echo $(cat /sys/block/$disk/device/state)

    clear_error $error
    echo running > /sys/block/$disk/device/state
}

function target_test_sense5()
{
    echo "LUN reset failed, target reset failed, fallback to host recovery"

    # inject a timeout for the write command (opcode 0x2a)
    echo "0 -10 0x2a " > ${error}
    # make the abort of the write command fail
    echo "3 -1 0x2a " > ${error}
    # make the LUN reset fail
    echo "4 -1 0xff" > ${error}
    # make the target reset fail
    echo 1 > /sys/kernel/debug/scsi_debug/target$target_id/fail_reset

    dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
    echo $(cat /sys/block/$disk/device/state)

    clear_error $error
    echo running > /sys/block/$disk/device/state
}

scsi_logging_level -s --error 4 > /dev/null 2>&1

insmod $scsi_debug lun_eh=Y target_eh=N
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $1}')
scsi_id=${str#*\[}
scsi_id=${scsi_id%\]*}
error=/sys/kernel/debug/scsi_debug/$scsi_id/error
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $6}')
disk=$(basename $str)
target_id=${scsi_id%\:*}
echo none > /sys/block/$disk/queue/scheduler
echo 1 > /sys/block/$disk/device/timeout
echo 1 > /sys/block/$disk/device/eh_timeout

for((loop=1;loop<=3;loop++))
do
    time=$(date "+%Y-%m-%d-%H-%M-%S")
    since=$(date "+%Y-%m-%d %H:%M:%S")
    lun_test_sense$loop
    sleep 3
    until=$(date "+%Y-%m-%d %H:%M:%S")
    # -p also creates the top-level logs directory on the first run
    mkdir -p logs/lun_sense$loop
    journalctl --since="$since" --until="$until" > logs/lun_sense$loop/$time.log
done
rmmod scsi_debug

insmod $scsi_debug lun_eh=N target_eh=Y
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $1}')
scsi_id=${str#*\[}
scsi_id=${scsi_id%\]*}
error=/sys/kernel/debug/scsi_debug/$scsi_id/error
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $6}')
disk=$(basename $str)
# refresh target_id in case the host number changed after the reload
target_id=${scsi_id%\:*}
echo none > /sys/block/$disk/queue/scheduler
echo 1 > /sys/block/$disk/device/timeout
echo 1 > /sys/block/$disk/device/eh_timeout
for((loop=1;loop<=5;loop++))
do
    time=$(date "+%Y-%m-%d-%H-%M-%S")
    since=$(date "+%Y-%m-%d %H:%M:%S")
    target_test_sense$loop
    sleep 3
    until=$(date "+%Y-%m-%d %H:%M:%S")
    # -p also creates the top-level logs directory if it is missing
    mkdir -p logs/target_sense$loop
    journalctl --since="$since" --until="$until" > logs/target_sense$loop/$time.log
done
rmmod scsi_debug

insmod $scsi_debug lun_eh=Y target_eh=Y
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $1}')
scsi_id=${str#*\[}
scsi_id=${scsi_id%\]*}
error=/sys/kernel/debug/scsi_debug/$scsi_id/error
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $6}')
disk=$(basename $str)
# refresh target_id in case the host number changed after the reload
target_id=${scsi_id%\:*}
echo none > /sys/block/$disk/queue/scheduler
echo 1 > /sys/block/$disk/device/timeout
echo 1 > /sys/block/$disk/device/eh_timeout
for((loop=1;loop<=5;loop++))
do
    time=$(date "+%Y-%m-%d-%H-%M-%S")
    since=$(date "+%Y-%m-%d %H:%M:%S")
    target_test_sense$loop
    sleep 3
    until=$(date "+%Y-%m-%d %H:%M:%S")
    # -p also creates the top-level logs directory if it is missing
    mkdir -p logs/lun_target_sense$loop
    journalctl --since="$since" --until="$until" > logs/lun_target_sense$loop/$time.log
done
rmmod scsi_debug
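
For readers unfamiliar with the parameter expansions used in the setup
lines of the script above, here is a standalone sketch of how scsi_id,
target_id and disk are derived. The lsscsi line is a made-up sample
(an assumption, not taken from the test logs); real host numbers and
device names will differ.

```bash
#!/bin/bash
# Hypothetical lsscsi output line for a scsi_debug disk.
line='[2:0:0:0]    disk    Linux    scsi_debug       0191  /dev/sdb'

str=$(echo "$line" | awk '{print $1}')   # first column: "[2:0:0:0]"
scsi_id=${str#*\[}                        # strip everything up to the "["
scsi_id=${scsi_id%\]*}                    # strip the "]" and what follows
target_id=${scsi_id%\:*}                  # drop the last field (the LUN)
disk=$(basename "$(echo "$line" | awk '{print $6}')")   # sixth column: device node

echo "scsi_id=$scsi_id target_id=$target_id disk=$disk"
```

With this sample line the script prints
"scsi_id=2:0:0:0 target_id=2:0:0 disk=sdb", which is the form the debugfs
paths under /sys/kernel/debug/scsi_debug/ in the test script expect.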