Sunday, February 28, 2021

Re: OpenBSD 6.8 - softraid issue: "uvm_fault(0xffffffff821f5490, 0x40, 0, 1) -> e"

# OpenBSD 6.8 RAID5 configuration with three 1TB "Samsung SSD PRO 860" drives


sysctl hw.disknames

disklabel sd1
disklabel -E sd1
disklabel -E sd2
disklabel -E sd3

bioctl -c 5 -l sd1a,sd2a,sd3a softraid0
disklabel -E sd4

newfs sd4a

obsdarc# mkdir /arc-3xssd
obsdarc# mount /dev/sd4a /arc-3xssd/
obsdarc# df -h | grep 3xssd
/dev/sd4a 1.8T 8.0K 1.8T 0% /arc-3xssd





# ------------------------------------------------------------------------------
dd if=/dev/urandom of=/arc-3xssd/1GB-urandom.bin bs=1M count=1024

# Error messages

uvm_fault(0xffffffff821ede50, 0x40, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at sr_validate_io+0x44: cmpl $0,0x40(%r9)
ddb{4}>


# ------------------------------------------------------------------------------
obsdarc# disklabel sd1

# /dev/rsd1c:
type: SCSI
disk: SCSI disk
label: Samsung SSD 860
duid: cb0d589d6d25894e
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 255
sectors/cylinder: 16065
cylinders: 124519
total sectors: 2000409264
boundstart: 0
boundend: 2000409264
drivedata: 0

16 partitions:
#                size           offset  fstype [fsize bsize cpg]
  a:       2000409264                0    RAID
  c:       2000409264                0  unused

...


# ------------------------------------------------------------------------------
obsdarc# disklabel sd4

# /dev/rsd4c:
type: SCSI
disk: SCSI disk
label: SR RAID 5
duid: 2f9692cd2e3a048f
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 255
sectors/cylinder: 16065
cylinders: 249039
total sectors: 4000817408
boundstart: 0
boundend: 4000817408
drivedata: 0

16 partitions:
#                size           offset  fstype [fsize bsize cpg]
  a:       4000817408                0  4.2BSD   8192 65536 52270
  c:       4000817408                0  unused
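The volume size is consistent with RAID5 over three chunks: usable capacity is (n-1) times the per-chunk size, minus a little overhead. A quick check against the disklabel outputs above (the 1120-sector difference is assumed to be softraid metadata and rounding):

```shell
chunk=2000409264                  # RAID partition size per disk (sectors, from disklabel sd1)
usable=$(( 2 * chunk ))           # RAID5 with 3 disks: 2 data chunks
echo "$usable"                    # 4000818528 sectors
echo $(( usable - 4000817408 ))   # 1120 sectors of overhead vs. disklabel sd4
```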


# ------------------------------------------------------------------------------
obsdarc# dd if=/dev/urandom of=/arc-3xssd/1GB-urandom.bin bs=1M count=1024


Hi Karel,

Thank you very much for your feedback and hints.
I have already opened a bug report for this issue; however, I am not
able to provide the output of the "trace" and "ps" commands from the
ddb{4}> or ddb{2}> prompt, as the crashed system is frozen: I cannot
type anything, and typing blind produces no visible output.

In another email to misc (quoted further below) I described some more
tests.
I still have to check how to compile a kernel with debug support and
install it on the OpenBSD 6.8 box for further investigation.
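For reference, the usual procedure for building a kernel with debug symbols on OpenBSD is roughly the following (a sketch based on the standard kernel-build steps; DEBUG.MP is an arbitrary name for the custom config):

```shell
# as root, with the matching kernel sources in /usr/src/sys
cd /usr/src/sys/arch/amd64/conf
cp GENERIC.MP DEBUG.MP
echo 'makeoptions DEBUG="-g"' >> DEBUG.MP   # build with debug symbols
config DEBUG.MP
cd ../compile/DEBUG.MP
make obj
make
make install    # the previous kernel is saved as /obsd
```

With a debug kernel, the sr_validate_io+0x44 offset in the panic can be resolved to a source line.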

Kind regards
Mark


# --- copy of the previous email to misc

Thank you very much for your feedback, suggestions and hints.

Indeed, yesterday I saw one read error and one write error related to
the Samsung PRO SSDs before another OS crash (I ran several further
tests writing big files to the RAID5 with the "dd" and "cat" commands).
Today I installed three new 1TB Samsung PRO 860 SSD drives in a third
box (again an ASUS mainboard with an AMD FX CPU and 16GB of ECC RAM)
and set up RAID5 as described in the attached file.

And again a similar error after dd (with slightly different values):
# ---
dd if=/dev/urandom of=/arc-3xssd/1GB-urandom.bin bs=1M count=1024

# Error messages

uvm_fault(0xffffffff821ede50, 0x40, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at      sr_validate_io+0x44:    cmpl     $0,0x40(%r9)
ddb{4}>

The error happens on the RAID5 level (there is no encryption).

In the test case above I used 30cm SATA 3G cables (the Samsung PRO 860
drives and the SATA controller are 6G), as I did not have 6G SATA
cables available.
I ran the original tests with 6G SATA cables.

For some reason the "ddb{4}>" prompt is frozen, so I am not able to
type anything at the ddb input prompt on the console (and typing
"trace" or "ps" blind produces no visible output).
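One way to capture the panic output when the local console is unresponsive is to move the console (and with it the ddb prompt) to a serial port and have ddb log to the kernel message buffer; a sketch, assuming a com0 serial port and the documented boot.conf and ddb sysctl knobs:

```
# /etc/boot.conf -- send the console (and the ddb prompt) to com0
stty com0 115200
set tty com0

# /etc/sysctl.conf -- enter ddb on panic and log ddb output to the
# kernel message buffer (readable after reboot)
ddb.panic=1
ddb.log=1
```

With a null-modem cable and a second machine running cu(1), the "trace" and "ps" output can then be captured even if the local keyboard is dead.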

I still have some older Samsung PRO 850 SSDs somewhere, so I will try
to test the RAID5 configuration with them as well.
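The kind of testing described above can be scripted so the first failing write size is obvious; a sketch (TARGET is an assumption, point it at the mounted RAID5 filesystem):

```shell
TARGET=${TARGET:-/arc-3xssd}    # assumed mount point of the RAID5 volume
for mb in 256 1024 4096 10240; do
    echo "writing ${mb}MB of random data..."
    dd if=/dev/urandom of="$TARGET/${mb}MB-urandom.bin" bs=1M count="$mb" || break
    sync                        # flush before the next, larger write
done
```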



On 28.02.21 19:55, Karel Gardas wrote:
>
> Hi,
>
> compile the kernel with debug enabled so you will get the line number
> from the crash. See what's there. Go through the git/cvs logs and see
> if anybody did anything with a global mutex over sata/sr raid. Read
> the code. One possibility is that you are hitting a bug which has been
> there since raid5 was added to obsd and nobody has tested with that
> number of SSDs, so you are in a unique position to hunt this bug down.
> Congratulations and good luck!
>
> Karel
>
> On 2/28/21 3:05 AM, Mark Schneider wrote:
>> Hi again,
>>
>> I have repeated the softraid tests using six 1TB Samsung 3G SATA
>> HDDs as RAID5, and I do not face the OS crash that occurs when using
>> SSDs in the RAID5.
>> Details of the RAID5 setup are in the attached file.
>>
>> It looks like using SSD drives in RAID5 leads for some reason to the
>> OpenBSD 6.8 crash. The Samsung 512GB PRO 860 SSDs have a 6G SATA
>> interface (which is different from the tested HDDs).
>>
>> NB: Using those SSDs as RAID6 on Debian Linux (buster - mdadm /
>> cryptoLUKS) does not cause any issues.
>> There are also no issues using those SSDs in a RAID on FreeBSD
>> (TrueNAS).
>>
>> Kind regards
>> Mark
>>
>>
>> On 27.02.21 04:30, Mark Schneider wrote:
>>> Hi,
>>>
>>>
>>> I face a system crash on OpenBSD 6.8 when trying to write big files
>>> (around 10GB) to a softraid RAID5 drive.
>>>
>>> I can reproduce the error (tested on two different systems with
>>> OpenBSD 6.8 installed on an SSD drive or a USB stick). The RAID5
>>> drive itself consists of six Samsung PRO 860 512GB SSDs.
>>>
>>> In short:
>>>
>>> bioctl -c 5 -l sd0a,sd1a,sd2a,sd3a,sd4a,sd5a softraid0
>>>
>>> obsdssdarc# disklabel sd7
>>> # /dev/rsd7c:
>>> type: SCSI
>>> disk: SCSI disk
>>> label: SR RAID 5
>>> duid: a50fb9a25bf07243
>>> flags:
>>> bytes/sector: 512
>>> sectors/track: 255
>>> tracks/cylinder: 511
>>> sectors/cylinder: 130305
>>> cylinders: 38379
>>> total sectors: 5001073280
>>> boundstart: 0
>>> boundend: 5001073280
>>> drivedata: 0
>>>
>>> 16 partitions:
>>> #                size           offset  fstype [fsize bsize cpg]
>>>   a:       5001073280                0  4.2BSD   8192 65536 52270
>>>   c:       5001073280                0  unused
>>>
>>> #
>>> --------------------------------------------------------------------------------
>>>
>>> obsdssdarc# time dd if=/dev/urandom of=/arc-ssd/1GB-urandom.bin
>>> bs=1M count=1024
>>> 1024+0 records in
>>> 1024+0 records out
>>> 1073741824 bytes transferred in 8.120 secs (132218264 bytes/sec)
>>>     0m08.13s real     0m00.00s user     0m08.14s system
>>>
>>> # Working as expected
>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>
>>>
>>> obsdssdarc# time dd if=/dev/urandom of=/arc-ssd/10GB-urandom.bin
>>> bs=10M count=1024
>>>
>>> # Error messages
>>>
>>> uvm_fault(0xffffffff821f5490, 0x40, 0, 1) -> e
>>> kernel: page fault trap, code=0
>>> Stopped at      sr_validate_io+0x44:    cmpl     $0,0x40(%r9)
>>> ddb{2}>
>>>
>>> # Crashing OpenBSD 6.8
>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>
>>>
>>> # After reboot:
>>>
>>> obsdssdarc# mount /dev/sd7a /arc-ssd/
>>> mount_ffs: /dev/sd7a on /arc-ssd: Device not configured
>>>
>>> obsdssdarc# grep sd7 /var/run/dmesg.boot
>>> softraid0: trying to bring up sd7 degraded
>>> softraid0: sd7 was not shutdown properly
>>> softraid0: sd7 is offline, will not be brought online
>>>
>>>
>>> More details in the attached files. Thanks a lot in advance for
>>> your feedback.
>>>
>>>
>>> Kind regards
>>>
>>> Mark
>>>
>>
