Wednesday, August 30, 2023

Re: pf state-table-induced instability

Dear David,

Thank you very much for your detailed answer! Now I have an explanation
for things that had seemed rather strange. :-)

However, I have some further questions. Let me explain what I am doing
now so that you can see the background more clearly.

I have recently enabled siitperf to use multiple IP addresses. (Siitperf
is an IPv4, IPv6, SIIT, and stateful NAT64/NAT44 benchmarking tool
implementing the measurements of RFC 2544, RFC 8219, and this draft:
https://datatracker.ietf.org/doc/html/draft-ietf-bmwg-benchmarking-stateful .)

Currently I want to test (and demonstrate) the difference this
improvement has made. I have already covered the stateless case by
measuring the IPv4 and IPv6 packet forwarding performance of OpenBSD
using
1) the very same test frames, following the test frame format defined in
the appendix of RFC 2544,
2) only pseudorandom port numbers, as required by RFC 4814 (this
resulted in no performance improvement compared to case 1),
3) pseudorandom IP addresses from specified ranges (this resulted in a
significant performance improvement compared to case 1), and
4) both pseudorandom IP addresses and port numbers (same results as in
case 3; see the small sketch below).

Many thanks to OpenBSD developers for enabling multi-core IP packet
forwarding!

https://www.openbsd.org/plus72.html says: "Activated parallel IP
forwarding, starting 4 softnet tasks but limiting the usage to the
number of CPUs."

It is not a fundamental issue, but it seems to me that during my tests
not only four but five CPU cores were used by IP packet forwarding:

load averages:  1.34,  0.35,  0.12                  dut.cntrg 20:10:15
36 processes: 35 idle, 1 on processor               up 1 days 02:16:56
CPU00 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  6.1% intr, 93.7% idle
CPU01 states:  0.0% user,  0.0% nice, 55.8% sys,  7.2% spin,  5.2% intr, 31.9% idle
CPU02 states:  0.0% user,  0.0% nice, 53.6% sys,  8.0% spin,  6.2% intr, 32.1% idle
CPU03 states:  0.0% user,  0.0% nice, 48.3% sys,  7.2% spin,  6.2% intr, 38.3% idle
CPU04 states:  0.0% user,  0.0% nice, 44.2% sys,  9.7% spin,  6.3% intr, 39.8% idle
CPU05 states:  0.0% user,  0.0% nice, 33.5% sys,  5.8% spin,  6.4% intr, 54.3% idle
CPU06 states:  0.0% user,  0.0% nice,  3.2% sys,  0.2% spin,  7.2% intr, 89.4% idle
CPU07 states:  0.0% user,  0.0% nice,  0.0% sys,  0.8% spin,  6.0% intr, 93.2% idle
CPU08 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  5.4% intr, 94.4% idle
CPU09 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  7.2% intr, 92.6% idle
CPU10 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  8.9% intr, 90.9% idle
CPU11 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  7.6% intr, 92.2% idle
CPU12 states:  0.0% user,  0.0% nice,  0.0% sys,  0.0% spin,  8.6% intr, 91.4% idle
CPU13 states:  0.0% user,  0.0% nice,  0.0% sys,  0.4% spin,  6.1% intr, 93.5% idle
CPU14 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  6.4% intr, 93.4% idle
CPU15 states:  0.0% user,  0.0% nice,  0.0% sys,  0.4% spin,  4.8% intr, 94.8% idle
Memory: Real: 34M/2041M act/tot Free: 122G Cache: 825M Swap: 0K/256M

The above output of the "top" command shows a significant system load on
the CPU cores from CPU01 to CPU05.

*Has the number of softnet tasks been increased from 4 to 5?*
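
(I guess I could check this myself by listing the kernel threads, e.g.
with something like "ps -axk | grep softnet", but I have not verified
this command, so please correct me if there is a better way to see the
number of softnet tasks.)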

What is more crucial for me, however, are the stateful NAT64
measurements with PF.

My stateful NAT64 measurements are as follows.

1. The maximum connection establishment rate test uses a binary search
to find the highest rate at which all connections can be established
through the stateful NAT64 gateway when every test frame creates a new
connection.

2. The throughput test also uses a binary search to find the highest
rate (called the throughput) at which all test frames are forwarded by
the stateful NAT64 gateway using bidirectional traffic. (All test frames
belong to an already existing connection. This test requires loading the
connections into the connection tracking table of the stateful NAT64
gateway in a preliminary step, using a rate safely lower than the one
determined by the maximum connection establishment rate test.)

Both tests need to be repeated multiple times to obtain statistically
reliable results.
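
To make the procedure concrete, here is a simplified sketch of the
binary search (illustration only; rate_is_lossless() is a hypothetical
placeholder for one elementary measurement step, it is not a real
siitperf function):

def rate_is_lossless(rate_fps: int) -> bool:
    """Run one trial at rate_fps; return True if no test frame was lost."""
    raise NotImplementedError   # placeholder for the real measurement

def highest_lossless_rate(low_fps: int, high_fps: int,
                          precision_fps: int = 1000) -> int:
    """Binary search for the highest rate that still passes the trial."""
    best = low_fps
    while high_fps - low_fps > precision_fps:
        mid = (low_fps + high_fps) // 2
        if rate_is_lossless(mid):
            best = mid
            low_fps = mid    # no loss: try a higher rate
        else:
            high_fps = mid   # frame loss: try a lower rate
    return best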

As for the seemingly deteriorating performance of PF, I now understand
from your explanation that the "pfctl -F states" command does not
immediately delete the content of the connection tracking table; it only
marks the states as expired.

*Is there any way to completely delete its entire content?*

(E.g., under Linux, I can delete the connection tracking table of
iptables or Jool by deleting the appropriate kernel module.)
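
(For illustration: with iptables, the "conntrack -F" command from
conntrack-tools can also flush the connection tracking table, and with
Jool, removing its kernel module with "modprobe -r jool" and then
loading it again with the appropriate parameters clears its state. I
mention these only as examples of the approach; the exact commands and
parameters of course depend on the setup.)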

Of course, I can delete it by rebooting the server. However, I currently
use a Dell PowerEdge R730 server, and its complete reboot (including
stopping OpenBSD, initializing the hardware, booting OpenBSD, and some
spare time) takes 5 minutes. This is far too much overhead if I have to
do it between every single elementary step (that is, every step of the
binary search), since the steps themselves take on the order of 1
minute. :-(

(As a compromise, I currently reboot the OpenBSD server after finishing
each binary search.)

Thank you very much in advance for any further advice!

Best regards,

Gábor

On 8/29/2023 12:01 AM, David Gwynne wrote:
> On Mon, Aug 28, 2023 at 01:46:32PM +0200, Gabor LENCSE wrote:
>> Hi Lyndon,
>>
>> Sorry for my late reply. Please see my answers inline.
>>
>> On 8/24/2023 11:13 PM, Lyndon Nerenberg (VE7TFX/VE6BBM) wrote:
>>> Gabor LENCSE writes:
>>>
>>>> If you are interested, you can find the results in Tables 18 - 20 of
>>>> this (open access) paper:https://doi.org/10.1016/j.comcom.2023.08.009
>>> Thanks for the pointer -- that's a very interesting paper.
>>>
>>> After giving it a quick read through, one thing immediately jumps
>>> out. The paper mentions (section A.4) a boost in performance after
>>> increasing the state table size limit. I haven't looked at the
>>> relevant code, so I'm guessing here, but this is a classic indicator
>>> of a hashing algorithm falling apart when the table gets close to
>>> full. Could it be that simple? I need to go digging into the pf
>>> code for a closer look.
>> Beware, I wrote it about iptables and not PF!
>>
>> As for iptables, it is really so simple. I have done a deeper analysis of
>> iptables performance as the function of its hash table size. It is
>> documented in another (open access) paper:
>> http://doi.org/10.36244/ICJ.2023.1.6
>>
>> However, I am not familiar with the internals of the other two tested
>> stateful NAT64 implementations, Jool and OpenBSD PF. I have no idea, what
>> kind of data structures they use for storing the connections.
> openbsd uses a red-black tree to look up states. packets are parsed into
> a key that looks up states by address family, ips, ipproto, ports, etc,
> to find the relevant state. if a state isnt found, it falls through to
> ruleset evaluation, which is notionally a linked list, but has been
> optimised.
>
>>> You also describe how the performance degrades over time. This
>>> exactly matches the behaviour we see. Could the fix be as simple
>>> as cranking 'set limit states' up to, say, two million? There is
>>> one way to find out ... :-)
>> As you could see, the highest number of connections was 40M, and the limit
>> of the states was set to 1000M. It worked well for me then with the PF of
>> OpenBSD 7.1.
>>
>> It would be interesting to find the root cause of the phenomenon, why the
>> performance of PF seems to deteriorate with time. E.g., somehow the internal
>> data structures of PF become "polluted" if many connections are established
>> and then deleted?
> my first guess is that you're starting to fight against the pf state
> purge processing. pf tries to scan the entire state table every 10
> seconds (by default) looking for expired states it can remove. this scan
> process runs every second, but it tries to cover the whole state table
> by 10 seconds. the more states you have the more time this takes, and
> this increases linearly with the number of states you have.
>
> until relatively recently (post 7.2), the scan and gc processing
> effectively stopped the world. at work we run with about 2 million
> states during business hours, and i was seeing the gc processing take up
> approx 70ms a second, during which packet processing didnt really
> happen.
>
> now the scan can happen without blocking pf packet processing. it still
> takes cpu time, so there is a point that processing packets and scanning
> for states will fight each other for time, but at least they're not
> fighting each other for locks now.
>
>> However, I have deleted the content of the state table after each elementary
>> measurement step using the "pfctl -F states" command. (I am sorry, this
>> command is missing from the paper, but it is there in my saved "del-pf"
>> file!)
>>
>> Perhaps PF developers could advise us, if the deletion of the states
>> generate a fresh state table or not.
> it marks the states as expired, and then the purge scan is able to take
> them and actually free them.
>
>> Could anyone help us in this question?
>>
>> Best regards,
>>
>> Gábor
>>
>>> --lyndon
