Heho,
I am running a small setup, where recently the border router VMs of a user caused prolonged and consistent low-bandwidth (2-3mb/s) yet high-IOPS disk utilization on the virtualization nodes (longer writeup at [1]).
With a bit of digging, we figured out that this was caused by rpki-client, mostly due to the nature of /var/cache/rpki-client being 'lots of small files' (~230k), which are subsequently opened and closed during validation, with atime updates probably doing the rest of the damage. This led to rpki-client running for ~30-60 minutes, sometimes dying due to exceeding 3600 seconds of runtime. The problem becoming so pronounced may also relate to the RPKI blow-up due to some recent experiments (I cannot find a fitting link right now, though; I recall Cloudflare suffering from some DB bloat in their validators because of that).
I ultimately resorted to giving an mfs on /var/cache/rpki-client a try. This worked surprisingly well: it (naturally) removed all disk I/O, and improved the rpki-client runtime from ~30 min to ~16 min (the CPUs aren't the freshest, so this is fine, I guess). Of course, the trade-off here is a full sync after every reboot.
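In case somebody wants to check whether their own cache looks similar, here is a minimal sketch that counts the regular files below a directory and sums up their sizes. The function name cache_stats is just something I made up, and the default path assumes the stock /var/cache/rpki-client location; reading it will probably need root. Note that this sums apparent file sizes only, so it is a lower bound for sizing an mfs, not the real requirement (inodes and fragment overhead come on top with this many small files).

```python
#!/usr/bin/env python3
"""Count regular files and sum their sizes below a directory."""
import os
import stat
import sys


def cache_stats(root):
    """Return (file_count, total_bytes) for regular files below root."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                # lstat so that symlinks are not followed
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                # File vanished mid-walk (rpki-client may be running); skip it.
                continue
            if stat.S_ISREG(st.st_mode):
                count += 1
                total += st.st_size
    return count, total


if __name__ == "__main__":
    # Assumed default path; pass another directory as the first argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "/var/cache/rpki-client"
    count, total = cache_stats(root)
    mean = total / count if count else 0
    print(f"{count} files, {total / 2**20:.1f} MiB total, {mean:.0f} bytes mean size")
```

A high file count combined with a small mean size is the pattern that hurt here: the total amount of data is modest, but every file costs its own metadata operations.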
I reckon that this is mostly an artifact of spinning disks being used for storage in the virt environment, but it makes me wonder whether it would make sense to note this in the man page. I would like to hear some opinions, though, before actually suggesting the change or typing up a fitting section.
With best regards,
Tobias
[1] https://doing-stupid-things.as59645.net/networking/bgp/nsfp/2022/07/31/making-it-ping-part-5.html