Wednesday, May 31, 2017

Optimizing for PF with high number of states and searches

Hi all,

We have an oldish (2013) but well-spec'd pair of servers (active-backup),
running OpenBSD 6.0 and PF.
The only difference between the server hardware is that the primary has two
physical processors, the secondary has one.

This primary firewall is worked pretty hard (see pfctl -si below) and of
late it seems to struggle when load increases.
If we fail over to the secondary we see jitter/dropped packets on priority
traffic.
If we reload PF we often see jitter (testing with world-ping) and drops of
icmp (which has prio 7) which seem to settle if we reload PF again.
If I make a change to the queue config, I need to reboot.

Problems seem to coincide with spikes in congestion. Congestion is usually
approx 0.7/s; if it rises much above 1/s we see problems.

Most of the CPU cores aren't used much: two of the 8 cores average about
40%, and one went up to 75% when I had a problem with a ruleset.

I am trying to get hard figures rather than a 'feeling'. The stats listed
here seem high when there are problems (see the vmstat output at the bottom;
when things are relatively quiet, context switching and interrupts are <3000).

context-switching > 15,000
interrupts > 14,000
Searches > 500,000/s
net.inet.ip.ifq.len is usually < 100 (I've seen it at >700 briefly). This
seems to suggest that changing net.inet.ip.ifq.maxlen may not make a
difference.
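
One way to get hard figures for short spikes is to take two `pfctl -si` snapshots a known number of seconds apart and compute per-interval rates from the Total column. A minimal sketch in Python, assuming counter lines shaped like the `pfctl -si` output further down; `parse_counters` and `interval_rates` are hypothetical helper names, and the snapshot values are made up for illustration:

```python
import re

# Matches lines such as "  congestion   14523   0.4/s" or "current entries 1205635".
# Header lines ("Status:", "State Table", "Counters") start with a capital and are skipped.
LINE = re.compile(r"^\s*([a-z][a-z\- ]*?)\s+(\d+)(?:\s+[\d.]+/s)?\s*$")

def parse_counters(text):
    """Return {name: total} parsed from pfctl -si style text."""
    counters = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters

def interval_rates(before, after, seconds):
    """Per-second change of each counter between two snapshots."""
    return {name: (after[name] - before[name]) / seconds
            for name in after if name in before}

# Illustrative snapshots taken 10 seconds apart (not real measurements).
t0 = parse_counters("searches 1000000 500.0/s\ncongestion 100 0.4/s")
t1 = parse_counters("searches 6000000 510.0/s\ncongestion 130 0.4/s")
rates = interval_rates(t0, t1, 10)
print(rates)
```

In practice the two texts would come from running `pfctl -si` twice. Note that "current entries" is a gauge rather than a counter, so its "rate" is just the net change in states.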

FWIW the ruleset as loaded is around 1300 lines when displayed with pfctl
-vvsr

I am looking for ways to optimize performance and would appreciate any
suggestions as to what to try and what stats to look at.
The alternative is to buy new hardware, but I need to be convinced that a
faster processor will make a big difference.

1) I am thinking of trying higher values of net.inet.ip.ifq.maxlen,
currently 2048. I tried 2500, didn't see much difference but suspect I can
go quite a bit higher. Does this setting require a reboot and am I right in
thinking this may help congestion, lower interrupts and context-switching?
2) Memory use is low according to collectd/snmp graphs. We have plenty; can
we utilise it more?
3) Is an upgrade to OpenBSD 6.1 likely to make a significant difference?
4) We log all dropped traffic to pflog0, will disk I/O be a problem?

Sorry for vagueness, thanks in advance.
Kevin.



Possibly useful output and spec below:

Hardware:

2 x Quad Core Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz, 3600.54 MHz
OpenBSD 6.0 GENERIC.MP#2 amd64


NICs
Inside: type ix, 10Gbps, e.g. ix1 at pci5 dev 0 function 1 "Intel 82599" rev 0x01
Outside and pfsync: type em, 1Gbps, e.g. em1 at pci2 dev 0 function 1 "Intel I350" rev 0x01

Of the 8 cores, two average about 40% utilisation. One of them peaked at
about 75% when struggling.
Memory = 64GB


[LIVE]root@ar1300:~# pfctl -si
Status: Enabled for 0 days 09:04:42 Debug: err

State Table                          Total             Rate
  current entries                  1205635
  searches                     16678281544       510320.1/s
  inserts                        157481830         4818.6/s
  removals                       156276195         4781.7/s
Counters
  match                          149125447         4562.9/s
  bad-offset                             0            0.0/s
  fragment                               0            0.0/s
  short                               3395            0.1/s
  normalize                            296            0.0/s
  memory                                 0            0.0/s
  bad-timestamp                          0            0.0/s
  congestion                         14523            0.4/s
  ip-option                              0            0.0/s
  proto-cksum                            0            0.0/s
  state-mismatch                    103949            3.2/s
  state-insert                       10397            0.3/s
  state-limit                            0            0.0/s
  src-limit                              0            0.0/s
  synproxy                               0            0.0/s
  translate                              0            0.0/s
  no-route                               0            0.0/s
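
The Rate column in this output matches Total divided by the seconds PF has been enabled, which suggests it is an average over the whole 9 hours rather than a current rate, so short spikes would be smoothed out. A quick check in Python, using only the figures from this output (`uptime_seconds` is a hypothetical helper name):

```python
import re

def uptime_seconds(status_line):
    """Parse 'Enabled for D days HH:MM:SS' into seconds."""
    m = re.search(r"for (\d+) days (\d+):(\d+):(\d+)", status_line)
    days, hours, minutes, seconds = map(int, m.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

uptime = uptime_seconds("Status: Enabled for 0 days 09:04:42 Debug: err")
totals = {"searches": 16678281544, "inserts": 157481830, "congestion": 14523}

# Total / uptime should reproduce the Rate column if it is a lifetime average.
for name, total in totals.items():
    print(f"{name}: {total / uptime:.1f}/s")
```

The computed values agree with the reported 510320.1/s, 4818.6/s and 0.4/s.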

[LIVE]root@ar1300:~# vmstat -si
4096 bytes per page
16257397 pages managed
16007891 pages free
13720 pages active
4146 pages inactive
0 pages being paged out
16 pages wired
2000987 pages zeroed
4 pages reserved for pagedaemon
6 pages reserved for kernel
16830030 swap pages
0 swap pages in use
0 total anon's in system
0 free anon's
119821710 page faults
119081179 traps
474725184 interrupts
510456927 cpu context switches
255355 fpu context switches
3224063 software interrupts
317594717 syscalls
0 pagein operations
329258 forks
640 forks where vmspace is shared
37 kernel map entries
51141141 zeroed page hits
2222 zeroed page misses
0 number of times the pagedaemon woke up
0 revolutions of the clock hand
0 pages freed by pagedaemon
0 pages scanned by pagedaemon
0 pages reactivated by pagedaemon
0 busy pages found by pagedaemon
14089163 total name lookups
cache hits (90% pos + 9% neg) system 0% per-directory
deletions 0%, falsehits 0%, toolong 0%
0 select collisions
interrupt                total     rate
irq0/clock            26133597      798
irq0/ipi               1582624       48
irq144/acpi0                 2        0
irq113/em0            21289414      650
irq114/em1           232449860     7106
irq116/ix1           220960739     6754
irq101/ehci0                51        0
irq104/ehci1                55        0
irq105/ahci0             25073        0
Total                502441415    15360
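
To see where the interrupt load comes from, each source's share of the total rate can be worked out from this table. A small sketch in Python using only the rates shown here (the dictionary keys are shortened labels chosen for illustration):

```python
# Interrupt rates (per second) copied from the table; zero-rate sources omitted.
rates = {"clock": 798, "ipi": 48, "em0": 650, "em1": 7106, "ix1": 6754}
total = 15360  # "Total" row of the table

# Print each source's share of the total, largest first.
shares = {name: 100 * rate / total for name, rate in rates.items()}
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.1f}%")
```

On these figures em1 and ix1 together account for roughly 90% of all interrupts.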


[LIVE]root@ar1300:~# vmstat
 procs      memory            page           disks       traps            cpu
 r b w    avm      fre  flt re pi po fr sr sd0 sd1   int   sys    cs us sy id
 1 1 0  54960 64032024 3662  0  0  0  0  0   0   0 14513  9707 15606  0  9 91


[LIVE]root@ar1300:~# vmstat
 procs      memory            page           disks       traps            cpu
 r b w    avm      fre  flt re pi po fr sr sd0 sd1   int   sys    cs us sy id
 3 2 0  53392 63829852 3601  0  0  0  0  0   0   0  1903  2971  2430  0 10 90




[LIVE]root@ar1300:~# sysctl net.inet.ip.ifq.maxlen
net.inet.ip.ifq.maxlen=2048
[LIVE]root@ar1300:~# sysctl net.inet.ip.ifq.len
net.inet.ip.ifq.len=0
[LIVE]root@ar1300:~# sysctl net.inet.ip.ifq.drops
net.inet.ip.ifq.drops=66419




Regards,

*Kevin Gee*
