Monday, March 29, 2021

Re: The case of the phantom reboot

On 3/29/21 5:28 AM, Nick Holland wrote:
> On 3/28/21 12:13 PM, David Newman wrote:
>> On 3/28/21 4:58 AM, Kristjan Komloši wrote:
>>
>>> On 3/27/21 10:27 PM, David Newman wrote:
>>>> OpenBSD 6.8 GENERIC#5 i386
>>>>
>>>> One of my systems rebooted at 03:01 local time today. I've seen kernel
>>>> panics and bad hardware but I've never seen OpenBSD "just reboot" by
>>>> itself, ever.
>
> OpenBSD, not usually.  Hardware OpenBSD is running on? Sure.
>
>>>> There's no cron job that would do this. last(1) is no help; it shows
>>>> the reboot command but not the shutdown that preceded it:
>>>>
>>>> root@ns ~ 4# last -f /var/log/wtmp.0
>>>> reboot    ~                                 Sat Mar 27 03:01
>>>> root      ttyp0    192.168.0.132            Wed Mar 24 11:23 - 11:23  (00:00)
>>>>
>>>> wtmp.0 begins Wed Mar 24 11:23 2021
>>>> root@ns ~ 5# last -f /var/log/wtmp.1
>>>> root      ttyp0    192.168.0.132            Tue Mar 16 21:30 - 21:30  (00:00)
>>>> root      ttyp0    75.82.86.131             Tue Mar 16 13:14 - 21:30  (08:15)
>>>> root      ttyp0    75.82.86.131             Sun Mar 14 21:20 - 21:29  (00:08)
>>>> root      ttyp0    75.82.86.131             Sat Mar 13 17:42 - 21:13  (03:31)
>>>>
>>>> The date gaps seem odd. I've ssh'd into this system multiple times
>>>> between March 16-27. I don't see other signs of trouble in /var/log.
>>>>
>>>> I could use some help in looking for evidence of foul play, or "just" a
>>>> hardware or software problem.
>>>>
>>>> Thanks in advance for further troubleshooting clues.
>>>>
>>>> dn
>>>>
>>> What kind of a machine is it running on? I remember having reboot
>>> problems on certain HP and Supermicro servers with hardware watchdogs.
>>
>> This is a 10+-year-old Dell 1U server with a 2-GHz Celeron 440, part of
>> a pair running CARP. Aside from having to replace spinning disks with
>> SSDs a couple of years ago, they've been rock solid.
>
> basic machine, worked for a long time, then starts giving problems, almost
> certainly a hw problem unless you can tie the problem to a recent upgrade.
> And that's not terribly likely on "basic" hardware.
>
> Every broken device started out "rock solid" ... until it isn't.  That's
> the definition of "Broken".
>
>> I too have seen issues with Supermicros but that's with other OSs. I've
>> never had a spontaneous reboot, on this system, and am concerned from
>> the wtmp stuff above that this *may* have been triggered externally. I
>> could use some clues in other things to check. Thanks.
>
> As Stuart pointed out, that comes from the boot process, not the shutdown.
>
> If you are really curious, you could put a serial console on it and wait
> for the next event.  PROBABLY won't see much, however.
>
> Believe me, I'm all in favor of recycling computers -- in fact, as I
> often tell skeptical employers, I'd rather have two ten year old systems
> than one brand new system with a service contract, but computers don't
> last as long as they used to, and curiously, some big-name servers seem
> to sometimes have a shorter life than some desktops.  A ten year old
> computer that does the job reliably is good, but not an expectation.

I hope it is "just" a hardware problem. These ancient machines don't owe
me anything. If anything they've been a testament to how well OpenBSD
just works, year in, year out.

Until I can swap in a replacement (the unit in question is in a colo in
another state), I may try Stuart's suggestion of enabling accounting.
The only concern I have about an external actor is that there seem to be
some missing entries in wtmp, but I don't know enough about init or wtmp
to rule out a hardware glitch.
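
To put a number on the wtmp gaps, the start times in the last(1) output can be
compared mechanically. What follows is only a rough Python sketch, not anything
from OpenBSD itself: the names SAMPLE, STAMP, start_times and find_gaps are
made up for illustration, the year has to be supplied by hand because last(1)
does not print it on each record, and the three-day threshold is an arbitrary
assumption. The sample records are the ones quoted above.

```python
import re
from datetime import datetime, timedelta

# Records copied from the last(1) output quoted earlier in this thread.
SAMPLE = """\
reboot    ~                                 Sat Mar 27 03:01
root      ttyp0    192.168.0.132            Wed Mar 24 11:23 - 11:23  (00:00)
root      ttyp0    192.168.0.132            Tue Mar 16 21:30 - 21:30  (00:00)
root      ttyp0    75.82.86.131             Tue Mar 16 13:14 - 21:30  (08:15)
root      ttyp0    75.82.86.131             Sun Mar 14 21:20 - 21:29  (00:08)
root      ttyp0    75.82.86.131             Sat Mar 13 17:42 - 21:13  (03:31)
"""

# Matches the start stamp of a record, e.g. "Sat Mar 27 03:01".
STAMP = re.compile(
    r"\b(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun) ([A-Z][a-z]{2}) +(\d{1,2}) (\d{2}):(\d{2})"
)


def start_times(text, year):
    """Return the sorted start time of every record found in text."""
    times = []
    for line in text.splitlines():
        match = STAMP.search(line)
        if match:
            month, day, hour, minute = match.groups()
            times.append(
                datetime.strptime(
                    f"{year} {month} {day} {hour}:{minute}", "%Y %b %d %H:%M"
                )
            )
    return sorted(times)


def find_gaps(times, threshold):
    """Return (earlier, later) pairs of consecutive records further apart than threshold."""
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > threshold]


if __name__ == "__main__":
    for earlier, later in find_gaps(start_times(SAMPLE, 2021), timedelta(days=3)):
        print(f"no records between {earlier} and {later} ({later - earlier})")
```

A gap found this way only shows where wtmp has no records; it cannot say
whether entries were removed, never written, or lost in log rotation, so it
would still need checking against other logs.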

Someone else suggested a battery problem, which seems plausible for a
unit this old.

Appreciate all the feedback -- many thanks.

dn
