Hi All,
Just had to recover one of my boxes.
I'm looking for a steer or any advice as to what might be going on. I returned to a console which had dropped into ddb. The host is running -current and was last updated on 23 Feb. The log entries I've managed to pull from it that seem pertinent are:
Feb 24 04:05:00 fw0 vnstatd[58770]: Error: Commit transaction to database failed (10): disk I/O error
Feb 24 04:08:06 fw0 /bsd: ahci0: NCQ errored slot 0 is idle (70003000 active)
Feb 24 04:09:10 fw0 /bsd: ahci0: attempting to idle device
Feb 24 04:09:10 fw0 /bsd: ahci0: stopping the port, softreset slot 31 was still active.
Feb 24 04:09:10 fw0 /bsd: ahci0: failed to soft reset device
Feb 24 04:09:10 fw0 /bsd: ahci0: couldn't recover NCQ error, failing all outstanding commands.
Feb 24 04:09:10 fw0 /bsd: ahci0: log page read failed, slot 31 was still active.
Feb 24 04:09:10 fw0 /bsd: ahci0: stopping the port, softreset slot 31 was still active.
Feb 24 04:09:10 fw0 /bsd: ahci0: attempting to idle device
Feb 24 04:09:10 fw0 /bsd: ahci0: stopping the port, softreset slot 31 was still active.
Feb 24 04:09:10 fw0 /bsd: ahci0: failed to soft reset device
Feb 24 04:09:10 fw0 /bsd: ahci0: couldn't recover NCQ error, failing all outstanding commands.
Feb 24 04:09:10 fw0 pflogd[90658]: Logging suspended: fwrite: Input/output error
Feb 24 04:10:00 fw0 vnstatd[58770]: Error: Exec step failed (11: database disk image is malformed): "update hour set rx=rx+0, tx=tx+1500 where interface=4 and date=strftime('%Y-%m-%d %H:00:00', datetime(1645675500, 'unixepoch'), 'localtime')"
Feb 24 04:10:00 fw0 vnstatd[58770]: Error: Fatal database error detected, exiting.
Feb 24 04:10:10 fw0 /bsd: ahci0: NCQ errored slot 29 is idle (00002000 active)
Feb 24 04:10:10 fw0 /bsd: ahci0: attempting to idle device
Feb 24 04:10:10 fw0 /bsd: ahci0: stopping the port, softreset slot 31 was still active.
Feb 24 04:10:10 fw0 /bsd: ahci0: failed to soft reset device
Feb 24 04:10:10 fw0 /bsd: ahci0: couldn't recover NCQ error, failing all outstanding commands.
Feb 24 04:11:12 fw0 /bsd: ahci0: attempting to idle device
Feb 24 04:11:12 fw0 /bsd: ahci0: stopping the port, softreset slot 31 was still active.
Feb 24 04:11:12 fw0 /bsd: ahci0: failed to soft reset device
Feb 24 04:11:12 fw0 /bsd: ahci0: NCQ errored slot 8 is idle (00000200 active)
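For anyone reading the log: the hex value in parentheses on the "NCQ errored slot" lines looks like a bitmask of the command slots that were still outstanding, one bit per slot. That reading is an assumption on my part, not something the log states. A minimal Python sketch to decode the three values above (the helper name `active_slots` is made up for illustration):

```python
def active_slots(mask_hex: str) -> list[int]:
    """Return the slot numbers whose bits are set in a 32-bit hex mask.

    Assumption: bit N of the mask corresponds to command slot N.
    """
    mask = int(mask_hex, 16)
    return [slot for slot in range(32) if mask & (1 << slot)]


# The three masks copied from the log lines above.
for mask in ("70003000", "00002000", "00000200"):
    print(mask, active_slots(mask))
```

Under that assumption the masks decode to slots 12, 13, 28, 29, 30, then slot 13, then slot 9. In each case the slot reported as errored (0, 29, 8) is not in the active set, which would be consistent with the "is idle" wording.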
The box should have been quiet at this time, with no heavy load expected.
Running fsck on the filesystem didn't end well for me: it resulted in a slew of NCQ error messages and lost data. The partitions that I didn't run fsck against kept all their data. I've since wiped and restored all the filesystem partitions.
I've also replaced the SATA cable, but I'm wondering if anyone can shed light on what might have happened. The disk (an SSD) is only 30 days old and seems to be OK after restoring a backup onto it.
Disk Info:
Model Family: Phison Driven SSDs
Device Model: KINGSTON SA400S37240G
Thanks,
Simon.