Monday, April 30, 2018

Re: Troubleshooting rl instability on OpenBSD 6.1

On 01/05/18 11:10, Nick Holland wrote:
> Here's the thing. There are rules to the game with every OS. With
> OpenBSD, you have to stay up to date -- the support tail is only
> about a year long, and that is really only security issues.

I accept this, no problem with that whatsoever; you can't support a
particular release indefinitely. If that were important, I'd be paying
megabucks for some "enterprise" solution.

> So, what are you after? A magic, secret sysctl, "sysctl
> rl.work.properly=1" ? Nope, no such thing.

How about rl.tellmewhatsgoingon=1? :-) You know, maybe log something to
the kernel log when a packet is received, when a packet is transmitted,
etc. Something lower-level than pflogd.

I was hoping there might be some flag to enable some debug logging to
troubleshoot it further. Like the sort of debug logging that might've
been used to develop the driver in the first place.

Stuart Henderson provided some commands that could be useful in tracking
this, and I've got those noted down. Catching the problem has been the
tricky bit as it's intermittent, which is always the hardest type of
problem to solve. I do intend to look into this.
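
One way to catch an intermittent fault like this is to poll the interface
counters and record a timestamp whenever an error counter moves. What
follows is only a minimal sketch of that idea in Python, not anything
taken from this thread: the `sample` callable, the counter names
(`ifail`, `ofail`) and the `watch` helper are all hypothetical, and how the
counters are actually collected on the box is left to the caller.

```python
import time
from typing import Callable, Dict, Iterator, Tuple

# Hypothetical shape: counter name -> cumulative value, e.g. {"ifail": 3}.
Counters = Dict[str, int]


def counter_deltas(prev: Counters, curr: Counters) -> Counters:
    """Return only the counters whose value changed between two samples."""
    return {
        name: curr[name] - prev[name]
        for name in curr
        if name in prev and curr[name] != prev[name]
    }


def watch(sample: Callable[[], Counters],
          watched: Tuple[str, ...],
          interval: float = 5.0,
          rounds: int = 0) -> Iterator[Tuple[float, Counters]]:
    """Poll sample() and yield (timestamp, deltas) when a watched counter moves.

    sample() is an assumption: any function returning the current counters.
    rounds=0 means poll forever.
    """
    prev = sample()
    done = 0
    while rounds == 0 or done < rounds:
        time.sleep(interval)
        curr = sample()
        deltas = counter_deltas(prev, curr)
        # Only report when one of the counters we care about has changed.
        if any(name in deltas for name in watched):
            yield time.time(), deltas
        prev = curr
        done += 1
```

Each yielded timestamp can then be lined up against the switch log and any
packet captures taken around the same time.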

I've also turned up the logging on the switch. So far, no log messages
have been reported by said switch.

> Sorry. A patch to fix it?
> Not going to happen against 6.1, 6.2, or even 6.3, most likely. -current
> is where development happens, only security issues and maybe some
> behavior regressions are ever pushed back to old releases...not
> operational improvements, new features, or new hw support.

A patch against -current would be fine. I'd then know to work around
the issue until 6.4 comes out, at which point the problem would be truly
fixed. As discovered earlier, the source file that builds rl has not
changed since before 6.1 was released.

Essentially, I am already running the -current version of rl. What
isn't -current is everything else. That could be a factor too.

> Now, rl chips were considered the worst pieces of network junk around
> until the ARM systems started sprouting networking chips. Don't get me
> wrong, I've used a lot of them, and had pretty good luck with them, but
> a lot of people I respect and who know better than me hate the #$%^ things.

Yeah, not my favourite chip either… but it's what Advantech picked,
probably on price. Unfortunately, unless I want to break out the solder
re-work station and bodge on a new chip, it's what I'm stuck with.

> You say a couple things that catch my eye -- 1) 6.1 is over a year old,
> and you say you have been battling the problem for a month. So
> something changed. That's hinting hw, not sw. (typically. Or the load
> changed. or something). 2) you say you had "similar" problems with
> another OS. Similar to what, I'm not sure, but that sounds like you
> have a HW problem. Keep in mind, when it comes to networks, it's not
> just the computer -- the wire and the switch are also all suspect.

I accept this too. I was asking in case there was a known issue or if
there was a way to confirm absolutely where the problem might lie.

It's a question of "here's a problem; what avenues have I got available
for debugging it". Not, "here's a problem, I demand you fix it
immediately!" There's a *big* difference.

As it happens, the industrial PC was a freebie and may have developed
faults that are not yet diagnosed, so it may be that the answer is to
replace it with something new.

OpenBSD has been a good platform for this, so I'd be looking to replace
it with something that can run this OS.

> But it boils down to this: if you want help on OpenBSD, you play by the
> rules and run either -current or at least a supported release (and if
> you contend it's an OS issue, you verify it still exists in -current!).
> If you don't need OpenBSD help...this isn't the place. And if you can
> say with certainty, "everything is the same", you will have no trouble
> adding debugging info and figure out your own problem.

I'll be clear on this: I am not looking for a backported fix to OpenBSD 6.1.

We don't even know what needs fixing yet. There might be an edge case
that's tripping something up. That edge case might be "your PCI bus is
stuffed because of vibration from too many Superhornets flying overhead"
(the box came from a RAAF base). It might be that "in rare cases,
register bit X doesn't get set when event Y occurs, do Z as a
workaround". This is unknown.

What I was trying to determine was:
1. whether there are known issues with this combination of
hardware/software (e.g. maybe it's known that some 100BASE-T Ethernet
chip does not play nice with 1000BASE-T switches?)
2. whether there are additional debugging flags, commands or tools that
might help debug whether a given Ethernet frame was received, acted
upon, or replied to
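
To illustrate the second point: given a list of timestamped request and
reply events taken from a packet capture, the requests that never got an
answer can be picked out mechanically. This is a rough sketch only; the
`Event` record, its fields and the `unanswered` function are hypothetical
names, and turning a real capture into such records is not shown.

```python
from typing import Dict, Iterable, List, NamedTuple


class Event(NamedTuple):
    """Hypothetical record for one captured frame of interest."""
    timestamp: float  # seconds, from whatever clock the capture used
    kind: str         # "request" or "reply"
    ident: int        # key pairing a reply with its request, e.g. a sequence number


def unanswered(events: Iterable[Event], timeout: float = 1.0) -> List[Event]:
    """Return requests with no matching reply within `timeout` seconds."""
    pending: Dict[int, Event] = {}
    missing: List[Event] = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.kind == "request":
            # A repeated ident before any reply means the earlier request went unanswered.
            if ev.ident in pending:
                missing.append(pending[ev.ident])
            pending[ev.ident] = ev
        elif ev.kind == "reply":
            req = pending.pop(ev.ident, None)
            # A reply that arrives too late is treated the same as no reply.
            if req is not None and ev.timestamp - req.timestamp > timeout:
                missing.append(req)
    # Anything still pending at the end never received a reply at all.
    missing.extend(pending.values())
    return sorted(missing, key=lambda e: e.timestamp)
```

A missing reply on its own does not say whether the request was lost on
receive or the reply was lost on transmit; comparing captures taken at both
ends of the link narrows that down.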

Point 2 would then lead to "what causes this state", and ultimately, a
fix in -current. If I want the backport to 6.1, guess what, I have the
code, it's within my power to backport that myself if that's what I
truly want.

Regards,
--
Stuart Longland (aka Redhatter, VK4MSL)

I haven't lost my mind...
...it's backed up on a tape somewhere.
