OpenBSD Mail Box: what tools exist to help a beginner debug a hung syscall?

Heyya, misc@,

I'm very new to using OpenBSD for anything more than what's 'on the tin'
(DNS RR, Router, etc), and have got myself stuck and have tried during
nights and weekends for about a week to try and unwedge myself here,
unsuccessfully. I'm hoping someone can help point me to a OpenBSD kernel
for newbies guide/help of some sort.

I've emailed a few lists, but so far everyone either looked the other
way quickly[0], didn't know, or didn't have time to help me out (fair
enough!)

I've been trying to take a library[1] I use on my Linux boxen, and coax
it into working on OpenBSD[2], and have been able to get a compiled .so
that looks good, with the exception of the USB transport. Given the
history of related libraries having similar-sounding[3] issues[4] with
libusb1, I suspect the issue isn't in the library I'm trying to port,
rather, in libusb1's OpenBSD support, or in (*shudder*) the kernel.
Either that or this is a common libusb 'gotcha' that everyone eventually
finds and patches that presents itself on OpenBSD by always locking up.

To get my library working on OpenBSD, I've had to use -snapshot (it
requires waitid(2)), and to debug the system (which is now entirely
just for testing this singular problem), I've built a -current kernel.
Specifically, it's mostly based on the source tree that the Git mirror
knows as e3f6ba90cc00f3d7457f857a0fd00f2b435bc0ec (Wed, Mar 29th,
2023).[5]

Everything below is the state as I understand it while the program is
locked up:

[Split-cut to: userland]

When reading from the device, my 3-line test program (3-line is
disengenious here since it's calling into a library that calls into a
library that eventually talks to the OS) eventually hangs while invoking
'libusb_submit_transfer', which is, specifically, hung on a read(2)
against a ugen device (specifically, the read in '_sync_gen_transfer' in
the libusb OpenBSD implementation). A pthread in the background is
continuing to exercise the 'libusb_handle_events' endpoint, to no avail.

I asked the upstream list about this[6], no reply yet.

[Fade to: kernelspace]

[Scene one: read(2) syscall]

The kernelspace end of the read(2) has wound up in `usbd_transfer`,
roundabout line 406 of usbdi.c, waiting for the xfer on the other end of
the endpoint's pipe to be complete. This intrepid syscall earnestly
stands by, waiting for the day when it'll be called back.

[Scene two: ugen driver state]

Everything is using USB3/xhci. That's the only bus on the system, and
this device communicates using USB3 in this case.

The ugen device (other USB devices are not a factor as far as I can
tell, and will continue to operate, even in the locked state,
interestingly) looks something like:

ugen device unit 1:
- endpoint 2: dir=0 (OUT) -- NULL pipe
- endpoint 4: dir=0 (OUT) -- NULL pipe
- endpoint 6: dir=1 (IN) -- NULL pipe
- endpoint 8: dir=1 (IN) -- pipe alloc'd

The one related endpoint, which is endpoint 8 here, has a refcount of 1.
It's running (not aborting).

Here's the output of some USB_DEBUG knobs:

usbd_dump_device: dev=0xffff800001291500
bus=0xffff800000132000 default_pipe=0xffff8000001bd000
address=3 config=1 depth=1 speed=4 self_powered=0 power=2 langid=1033
pipe=0xffff8000008b6000

usbd_dump_endpoint: endp=0xffff8000012d6ee0
edesc=0xffff8000001b785d refcnt=1
bEndpointAddress=0x88
(usbd_dump_pipe:)
running=1 aborting=0
intrxfer=0x0, repeat=0, interval=-1

Since this doesn't give me a lot of insight into the xfer state, I wrote
a small stub and threw it into a dark corner where I could let it be,
and have the following from it:

pipe=0xffff8000008b6000
xfer=0xfffffd827f59daf0 status=1 done=0 length=1024 flags=0x16 timeout=0

My understanding is that this means that endpoint 8 has an active pipe
that is running (not aborting), with an in-flight xfer which is
IN_PROGRESS, with the (USBD_SYNCHRONOUS | USBD_SHORT_XFER_OK |
USBD_CATCH) flags set [0x02 | 0x04 | 0x10 - as seen in usbdi.h].

Even while waiting a signficant (in human-time) length of time, the xfer
remains in this state, and is never marked as done until I C-c the
process, and everything cleans up on close(2)/EINTR

[Cut to: author]

Right, thank you for sticking around. I'm looking for some help for what
steps to take next on trying to isolate what may be blocking the xfer
from being processed -- and trying to figure out what userland knobs are
invoking kernelspace in a way that's causing it. Since other libraries
are working OK, I have to assume this isn't actually a kernelspace bug,
but I'd feel a lot better root causing it.

During the wedged state, xhci messages don't come through after the
transfer was submitted (I have xhcidebug, as well as other
{usb,ugen,*}debug set high), and I've very quickly got myself lost past
this point. This makes me suspect some sort of inturupt is maybe
being suspended while a call is in progress; I just have no idea where
to start with that theory, much less trick 'ddb' into prying that
information from RAM.

I've been able to set up the serial console, and get 'ddb' working well,
but I am having a hard time using it without poking myself on the sharp
bits. Is there a good way to explore the wedged system that isn't using
ddb off a serial line? Looks like 'ddb' can 'boot crash'; is there a
good workflow documented there?

As I mentioned above, my gut was pushing me look into checking if the
USB intrurupts are inhibited somehow, as well as checking for lock
contention (although, all the kernelspace locks I've instrumented in the
specific files i'm in are properly locked/unlocked during the program's
runtime). Failing that we're onto plans G, H, I and J.

Right, so the ask:

0) Is there a good place to have this conversation? I don't see
topical "usb subsystem interest group" mailing list(s) where this
may be less tedious to most of the readers. I tried ports@[7], but
I don't think that list was right either. I feel like a nunsense
posting on all these lists right now.

1) How would I go about checking into some of the above? Does that
sound like an interesting idea? Am I wasting my time? What tools
exist that could help me out here?

2) Does anyone have other good ideas for what could be to blame here?
Similar bugs that you may know about?

3) Is there a guide to the OpenBSD kernel or something? I hadn't so
much as built an OpenBSD kernel until last Sunday, so I know very
little about how this is all working, much less conventions or
internal architecture of a BSD kernel. FWIW it's been very pleasent
compared to building and running a modified Linux kernel.

Fondly,
paultag

[0]: <insert gif of grandpa simpson walking out of the bar from
The Simpsons episode "Bart After Dark">
[1]: libuhd
[2]: https://github.com/EttusResearch/uhd/pull/668
[3]: https://marc.info/?l=libusb-devel&m=144466221830349&w=2 - although
interestingly the version in ports doesn't have any patches; I suspect
something got fixed in libhackrf, libusb1 or the kernel to fix that
during the last decade-ish. I confirmed libhackrf functions
normally on my system.
[4]: comms/rtl-sdr has a few patches, such as patch-src_rtl_fm_c and their
ilk that comprehensively removes async code from the binaries -- although
this, too, works unpatched (from upstream sources) now on -current,
with the exception of a few patches I sent in for CMake[4.1], which I don't
think will get merged at this rate. I confirmed librtlsdr workes
normally on my system.
[4.1]: https://lists.osmocom.org/hyperkitty/list/osmocom-sdr@lists.osmocom.org/thread/NOIG7LNP3VXM4JQYVVVXJ4TZR6IDQCQJ/
[5]:
"Mostly" in this case means that I toggled a few options, and added
a bunch of DPRINTFN and an ioctl stub for me to extract kernelspace
state without launching ddb and starting over again. I pop my
patches off when they get to be too much now and again.

The specific options as of the writing of this email:

```
option MULTIPROCESSOR
option MP_LOCKDEBUG
option WITNESS
option UGEN_DEBUG
option USB_DEBUG
option XHCI_DEBUG
```
[6]: https://sourceforge.net/p/libusb/mailman/message/37796867/
[7]: https://marc.info/?t=167993667800007&r=1&w=2

--
:wq

OpenBSD Mail Box

Friday, March 31, 2023

what tools exist to help a beginner debug a hung syscall?

No comments:

Post a Comment