Tuesday, May 01, 2018

Re: Troubleshooting rl instability on OpenBSD 6.1

On 2018-05-01, Stuart Longland <stuartl@longlandclan.id.au> wrote:
>> So, what are you after? A magic, secret sysctl, "sysctl
>> rl.work.properly=1" ? Nope, no such thing.
>
> How about rl.tellmewhatsgoingon=1? :-) You know, maybe log something to
> the kernel log when a packet is received, when a packet is transmitted,
> etc. Something lower-level than pflogd.
>
> I was hoping there might be some flag to enable some debug logging to
> troubleshoot it further. Like the sort of debug logging that might've
> been used to develop the driver in the first place.

It would likely have been a combination of temporary printf's, ddb,
tcpdump (directly on the machine, and/or on another machine connected
directly). This of course would have been 20 years ago during the
port to OpenBSD, and earlier when it was originally written for FreeBSD.
For something originally written as open-source there's not a lot of
benefit to keeping that sort of debug code around after initial
development because it's unlikely to align with bugs found later.

> Stuart Henderson provided some commands that could be useful in tracking
> this, and I've got those noted down. Catching the problem has been the
> tricky bit as it's intermittent, which is always the hardest type of
> problem to solve. I do intend to look into this.

The question is how much time it's worth investing into this. At this
point, the NIC chip could be hanging (i.e. bug in the chip), there could
be a hardware fault specific to this individual device, it could be a
driver bug, some switch bug (perhaps it's not good at handling devices
which don't support flow control), or it could be something we haven't
thought of at all. And whatever it is might only be triggered under
certain load conditions or with certain data patterns in the ethernet
frames.

My gut feeling is that 6.3 probably won't help; the most significant
changes between 6.1 and 6.3 that might affect this are related to
networking on multiprocessor systems (this one is single-CPU and
running GENERIC so less likely to be affected by that work, though not
impossible). On the other hand it is probably one of the lowest-effort
things to try.

> As it happens, the industrial PC was a freebie and may have developed
> faults that are not yet diagnosed, so it may be that the answer is
> replace it with something new.
>
> OpenBSD has been a good platform for this, so I'd be looking to replace
> with something that can run this OS.

If you go down this route, see if the PCEngines APU2 is a good fit
for your needs. There are other industrial-type systems but in this case
many developers have the exact same device, so a lot more people are
invested in fixing problems found with it. Low cost but it's a nicely
made board and I always like supporting companies who publish things
like https://www.pcengines.ch/schema/apu2c.pdf even if I'm not going
to do board-level repair myself :)

>> But it boils down to this: if you want help on OpenBSD, you play by the
>> rules and run either -current or at least a supported release (and if
>> you contend it's an OS issue, you verify it still exists in -current!).
>> If you don't need OpenBSD help...this isn't the place. And if you can
>> say with certainty, "everything is the same", you will have no trouble
>> adding debugging info and figure out your own problem.
>
> I'll be clear on this: I am not looking for a backported fix to OpenBSD 6.1.
>
> We don't even know what needs fixing yet. There might be an edge case
> that's tripping something up. That edge case might be "your PCI bus is
> stuffed because of vibration from too many Superhornets flying overhead"
> (the box came from a RAAF base). It might be that "in rare cases,
> register bit X doesn't get set when event Y occurs, do Z as a
> workaround". This is unknown.
>
> What I was trying to determine was:
> 1. whether there are known issues with this combination of
> hardware/software? (e.g. maybe it's known that some 100-BaseT Ethernet
> chip does not play nice with 1000-BaseT switches?)

I don't think there will be enough readers with similar hardware who might
respond here that you'd learn about any specific issues. rl(4) is no longer
particularly common (around 1% of dmesg submissions since 2015), and it's
even less common on machines used for routing; it's mostly just seen on old
i386 laptops/desktops and mips64el machines.

> 2. whether there are additional debugging flags, commands or tools that
> might help debug whether a given Ethernet frame was received, acted
> upon, or replied to
>
> Option 2 would then lead to "what causes this state", and ultimately, a
> fix in -current. If I want the backport to 6.1, guess what, I have the
> code, it's within my power to backport that myself if that's what I
> truly want.

Oh, one practical thing you could try. "ifconfig rl0 down; ifconfig rl0 up"
might clear the problem when it occurs. If so, perhaps you can detect it
happening (e.g. pings failing) from a cronjob and do that as a workaround.
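
A minimal sketch of that cron-driven workaround, under some loud assumptions:
the target address 192.0.2.1, the rl-watchdog name and install path, and the
three-ping threshold are placeholders for illustration, not anything from this
thread. Pick a host that should always answer on that link. The script only
acts when called with "run", so the functions can be sourced and exercised
without touching the NIC.

```shell
#!/bin/sh
# Hypothetical watchdog for the "ifconfig down; ifconfig up" workaround.
# IFACE and TARGET are assumptions; override them from the environment.
IFACE="${IFACE:-rl0}"
TARGET="${TARGET:-192.0.2.1}"

# Succeeds if at least one of three pings gets a reply
# (-c count and -w maxwait as in OpenBSD ping(8)).
link_ok() {
    ping -c 3 -w 2 "$TARGET" >/dev/null 2>&1
}

# Bounce the interface and note it in syslog, so the number of
# occurrences can be counted later.
bounce() {
    logger -t rl-watchdog "no reply from $TARGET, bouncing $IFACE"
    ifconfig "$IFACE" down
    sleep 1
    ifconfig "$IFACE" up
}

# Bounce only when the link check fails.
watchdog() {
    link_ok || bounce
}

# Example entry for root's crontab (path is hypothetical):
#   */5 * * * * /usr/local/sbin/rl-watchdog run
if [ "${1:-}" = "run" ]; then
    watchdog
fi
```

The syslog entries also give a rough record of how often the hang occurs,
which may help correlate it with load or time of day.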
