Wednesday, July 31, 2019

Re: How to debug hanging machines / proc: table is full

On Mon, Jul 29, 2019 at 01:20:58PM +0000, Stuart Henderson wrote:
> On 2019-07-29, Raimo Niskanen <raimo+openbsd@erix.ericsson.se> wrote:
> > A new hang, I tried to investigate:
> >
> > At July 19 the last log entry from my 'ps' log was from 14:55, which is
> > also the time on the 'systat vmstat' screen when it froze. Then the machine
> > hums along but just after midnight at 00:42:01 the first "/bsd: process:
> > table is full" entry appears. That message repeats until I rebooted it
> > today at July 29 10:48.
> >
> > I had a terminal with top running. It was still updating. It showed about
> > 98% sys and 2% spin on one of 4 CPUs, the others 100% idle. Then (after
> > the process table had gotten full) it had 1282 idle processes and 1 on
> > processor, which was 'top' itself.
> > Memory: Real: 456M/1819M act/tot Free: 14G Cache: 676M Swap: 0K/16G.
> >
> > I had 8 shells under tmux ready for debugging. 'ls' worked.
> > 'systat' on one hung. 'top' on another failed with "cannot fork".
> > 'exec ps ajxww' printed two lines with /sbin/init and /sbin/slaac
> > and then hung. 'exec reboot' did not succeed. Neither did a short power
> > button, that at least caused a printout "stopping daemon nginx(failed)",
> > but got no further. I had to do a hard power off.
> >
> > My theory now is that our daily tests right before 14:55 started a process
> > (this process is the top 'top' process with 10:14 execution time) that
> > triggers a lock or other contention problem in the kernel which causes
> > one CPU to spin in the system, and blocks processes from dying.
> > About 10 hours later the process table gets full.
> >
> > Any, ANY ideas of how to proceed would be appreciated!
> >
> > Best Regards
>
> Did you notice any odd waitchan's (WAIT in top output)?
>
> Maybe set ddb.console=1 in sysctl.conf and reboot (if not already
> set), then try to break into DDB during a hang and see how things look
> in ps there. (Test breaking into DDB before a hang first so you know
> that you can do it .. you can just "c" to continue).
>
> There might also be clues in things like "sh malloc" or "sh all pools".
>
> Perhaps you could also get clues from running a kernel built with
> 'option WITNESS', you may get some messages in dmesg, or it adds commands
> to ddb like "show locks", "show all locks", "show witness" (see ddb(4) for
> details).
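
For the waitchan question above, a minimal sketch of how a periodic 'ps' log could be summarised by wait channel. It assumes ps(1) accepts -o with the pid, wchan and command keywords and prints "-" when a process has no wait channel; the helper names tally_wchan and snapshot are made up for illustration and are not part of any existing tool.

```python
import subprocess
from collections import Counter


def tally_wchan(ps_output):
    """Count processes per wait channel in `ps -axo pid,wchan,command` text."""
    counts = Counter()
    lines = ps_output.splitlines()
    for line in lines[1:]:  # skip the header line
        # Split into pid, wchan and the rest (the command may contain spaces).
        fields = line.split(None, 2)
        if len(fields) < 2:
            continue  # blank or truncated line
        counts[fields[1]] += 1
    return counts


def snapshot():
    """Run ps once and return its output as text.

    Assumption: the local ps(1) supports these -o keywords.
    """
    result = subprocess.run(
        ["ps", "-axo", "pid,wchan,command"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling tally_wchan(snapshot()).most_common(10) from a cron job or a loop and appending the result to a log would show whether one wait channel starts to dominate before the hang. Once fork fails this cannot run any more, so it only helps for the hours leading up to the full process table.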

I have enabled Witness, it went so-so. We'll see what it catches.

I downloaded 6.5 amd64 src.tar.gz and sys.tar.gz, unpacked them,
applied all patches for stable 001-006 and built a kernel with:
include "arch/amd64/conf/GENERIC"
option MULTIPROCESSOR
option MP_LOCKDEBUG
option WITNESS

Then I activated in /etc/sysctl.conf:
ddb.console=1
kern.witness.locktrace=1
kern.witness.watch=3
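
After the reboot it may be worth confirming that these values actually took effect, since the kern.witness.* variables only exist on a kernel built with WITNESS. A minimal sketch, assuming `sysctl -n name` prints just the current value; the function names are hypothetical:

```python
import subprocess


def parse_sysctl_conf(text):
    """Return {name: value} from sysctl.conf-style 'name=value' lines."""
    wanted = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        wanted[name.strip()] = value.strip()
    return wanted


def read_sysctl(name):
    """Read one live value. Assumption: `sysctl -n` prints only the value."""
    result = subprocess.run(
        ["sysctl", "-n", name],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def mismatches(wanted, reader=read_sysctl):
    """Return {name: (wanted, actual)} for every setting that differs."""
    out = {}
    for name, value in wanted.items():
        actual = reader(name)
        if actual != value:
            out[name] = (value, actual)
    return out
```

Feeding it the contents of /etc/sysctl.conf, as in mismatches(parse_sysctl_conf(open("/etc/sysctl.conf").read())), should return an empty dict when everything was applied.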

For fun, I pressed Ctrl+Alt+Esc at the console, got a ddb> prompt and typed
"show witness". It printed lots of info, I scrolled down to the end, but
during the printout there was a UVM fault:

Spin locks:
/usr/src/sys/....
:
bla bla bla
:
uvm_fault(0xffffffff81e03b50, 0xffff800022368360, 0, 1) -> e
kernel: page fault trap, code=0
Faulted in DDB: continuing...

Then I typed "cont" and it panicked.
If anybody wants details, I took a picture.

Have I combined too many debugging options, or is this sh*t that happens?

Nevertheless, now the machine is running again, with Witness...

I'll be back.


>
> Can you provoke a hang by running this process manually?

--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB
