Thursday, May 23, 2024

Re: advice debugging lockups with swap-thrashing symptoms?

On Thu, May 23, 2024 at 08:00:37AM GMT, Nick Holland wrote:
>On 5/23/24 03:18, Stuart Henderson wrote:
>>On 2024-05-22, James Cook <falsifian@falsifian.org> wrote:
>>>One of my OpenBSD boxes sometimes gets in a weird locked-up or
>>>almost-locked-up state. I'm wondering what I can do to debug it
>>>further next time it happens.
>>...
>>>I would also expect the cache number to be much higher. E.g. on
>>>this occasion, I was running "git annex fsck", which reads plenty
>>>of data from disk.
>>
>>Heavy filesystem access can result in this sort of thing. I used to
>>have unpacked ports source on one of my machines for grepping over;
>>the machine was pretty much unusable for anything else while that was
>>running.
>>
>>Might be worth trying some noatime mount flags if you don't already have
>>them, at least then you can avoid turning some reads into writes.
>>
>
>Definitely a possibility. Long time ago, I think I asked about the
>possibility of a "disknice" to throttle disk access on individual
>tasks. TedU@ came through for me with something that definitely solved
>my problem, and I use it from time to time since -- basically, it just
>suspends a particular program occasionally, which lets other programs
>have a chance to get disk access. I saved it (and made a tiny update
>that is needed now) and put it here:
>
>https://holland-consulting.net/scripts/disknice.html
>
>
>Also...
>I've seen disks "fail" where they get super-slow. The failure mode
>seems to be difficulty reading data...but after enough retries, it
>succeeds, resetting the retry counter back to zero, and then the next
>read encounters the same problem. You may be able to hear lots of
>activity on the drive with little obvious progress. I'm not convinced
>this is your problem, but ... something to consider.
>
>Nick.

Thanks for the pointers. disknice sounds useful. However I am
skeptical that this can be explained away as a normal consequence
of intense filesystem access, for a few reasons.

1. In the past, even the mouse pointer has frozen. (I'm 95% sure
of this from memory. Will note it more carefully next time this
happens.) Surely that shouldn't depend on disk access? See also
tmux/xterm updating very slowly; does that depend on the filesystem?

2. The low 165M cache number makes me suspicious. With 14G free
and plenty of data being read, shouldn't that grow? E.g. right now
it's at 11G (and I'm running git annex fsck like I was before; I
have a lot of data to fsck). I believe I've seen similar small cache
numbers in the past.

3. The git annex fsck was running on a different hard disk. (Normally
it sits in a cupboard; I've hooked it up temporarily.) Swap, /, /home
etc are all on a different SSD. I am running the same thing now
(different disk) and perceive no impact on performance. That's not
to say there wasn't intense access to the SSD, though; Firefox is
a suspect here.

Nonetheless, if I can't make any other progress, I'll look into
noatime and/or disknice. (I really wish I could reliably reproduce
this, but unfortunately it just happens every few days or weeks
with no apparent pattern other than the system being under some
load when it happens.)
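
For reference, the idea behind disknice as Nick describes it above (suspend a program occasionally so that others get a chance at the disk) could be sketched roughly as follows. This is not the script at the URL above, which I have not reproduced here. The function name `throttle`, its parameters, and the one-second default intervals are assumptions made purely for illustration.

```python
import os
import signal
import time


def throttle(pid, run_secs=1.0, pause_secs=1.0, cycles=None,
             sleep=time.sleep, kill=os.kill):
    """Alternately let process `pid` run and suspend it.

    Hypothetical helper, not the real disknice. `cycles=None` means loop
    until the process goes away. `sleep` and `kill` are injectable so the
    loop can be exercised without touching a real process.
    Returns the number of completed run/pause cycles.
    """
    done = 0
    try:
        while cycles is None or done < cycles:
            sleep(run_secs)                 # let the target run for a while
            kill(pid, signal.SIGSTOP)       # suspend it
            try:
                sleep(pause_secs)           # give other programs the disk
            finally:
                kill(pid, signal.SIGCONT)   # resume it, even if interrupted
            done += 1
    except ProcessLookupError:
        # The target has exited; nothing left to throttle.
        pass
    return done
```

Calling `throttle(12345)` (with 12345 standing in for the PID of the disk-heavy job) would loop until that process exits. Whether one second on and one second off is a sensible ratio is a guess and would need tuning for the workload.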

(I'll note one other thing, just in case: I also experience random
crashes and restarts with this machine that seem to be hardware-related.
Very different from what I'm describing here; has even happened
during BIOS POST, and with no disks inside the machine. I just
mention it because it opens the possibility that unreliable hardware
is involved, in case that changes things.)

--
James
