Thursday, August 31, 2023

Re: pf state-table-induced instability

Dear David,

Thank you very much for all the new information!

I have kept only those parts that I want to react to.

>> It is not a fundamental issue, but it seems to me that during my tests not
>> only four but five CPU cores were used by IP packet forwarding:
> the packet processing is done in kernel threads (task queues are built
> on threads), and those threads could be scheduled on any cpu. the
> pf purge processing runs in yet another thread.
>
> iirc, the scheduler scans down the list of cpus looking for an idle
> one when it needs to run stuff, except to avoid cpu0 if possible.
> this is why you see most of the system time on cpus 1 to 5.

Yes, I can confirm that whenever I observed it, CPU00 was not used by
system tasks.

However, I remembered that PF was disabled during my stateless tests, so
I do not think its purge could be what used CPU05. Now I have repeated
the experiment, first disabling PF as follows:

dut# pfctl -d
pf disabled
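
(For completeness, the disabled state can also be double-checked with
"pfctl -si", whose Status line reports whether PF is enabled or
disabled; the grep below is only illustrative:

dut# pfctl -si | grep Status
)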

And I can still see FIVE CPU cores used by system tasks:

load averages:  0.69,  0.29,  0.13                             dut.cntrg 14:41:06
36 processes: 35 idle, 1 on processor                          up 0 days 00:03:46
CPU00 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  8.1% intr, 91.7% idle
CPU01 states:  0.0% user,  0.0% nice, 61.1% sys,  9.5% spin,  9.5% intr, 19.8% idle
CPU02 states:  0.0% user,  0.0% nice, 62.8% sys, 10.9% spin,  8.5% intr, 17.8% idle
CPU03 states:  0.0% user,  0.0% nice, 54.7% sys,  9.1% spin, 10.1% intr, 26.0% idle
CPU04 states:  0.0% user,  0.0% nice, 62.7% sys, 10.2% spin,  9.8% intr, 17.4% idle
CPU05 states:  0.0% user,  0.0% nice, 51.7% sys,  9.1% spin,  7.6% intr, 31.6% idle
CPU06 states:  0.2% user,  0.0% nice,  2.8% sys,  0.8% spin, 10.0% intr, 86.1% idle
CPU07 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  7.2% intr, 92.6% idle
CPU08 states:  0.0% user,  0.0% nice,  0.0% sys,  0.0% spin,  8.4% intr, 91.6% idle
CPU09 states:  0.0% user,  0.0% nice,  0.0% sys,  0.0% spin,  9.2% intr, 90.8% idle
CPU10 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin, 10.8% intr, 89.0% idle
CPU11 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  9.2% intr, 90.6% idle
CPU12 states:  0.0% user,  0.0% nice,  0.2% sys,  0.8% spin,  9.2% intr, 89.8% idle
CPU13 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  7.2% intr, 92.6% idle
CPU14 states:  0.0% user,  0.0% nice,  0.0% sys,  0.8% spin,  9.8% intr, 89.4% idle
CPU15 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  7.8% intr, 92.0% idle
Memory: Real: 34M/1546M act/tot Free: 122G Cache: 807M Swap: 0K/256M

I suspect that top shows an average (over a time window of a few
seconds), and that perhaps one of the cores from CPU01 to CPU04 is
sometimes skipped (e.g. because it was used by the "top" command
itself?); this is why I can see system load on CPU05. (There is even a
small amount of system load on CPU06.)
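
(One way to reduce this averaging effect would be to run top with a
shorter refresh interval and watch whether the set of busy cores really
changes between updates; the 1-second delay below is just an example
value:

dut# top -s 1
)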


>> *Is there any way to completely delete its entire content?*
> hrm.
>
> so i just read the code again. "pfctl -F states" goes through the whole
> state table and unlinks the states from the red-black trees used for
> packet processing, and then marks them as unlinked so the purge process
> can immediately claim them as soon as they're scanned. this means that
> in terms of packet processing the tree is empty. the memory (which is
> what the state limit applies to) won't be reclaimed until the purge
> processing takes them.
>
> if you just wait 10 or so seconds after "pfctl -F states" then both the
> tree and state limits should be back to 0. you can watch pfctl -si,
> "systat pf", or the pfstate row in "systat pool" to confirm.
>
> you can change the scan interval with "set timeout interval" in pf.conf
> from 10s. no one fiddles with that though, so i'd put it back between
> runs to be representative of real world performance.
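
(If I understand correctly, after "pfctl -F states" something like the
following should confirm that both the tree and the state count are back
to zero, assuming the default 10 s scan interval is in effect; the wait
time and the grep pattern are only illustrative:

dut# pfctl -F states
dut# sleep 15
dut# pfctl -si | grep "current entries"
dut# systat pf

Alternatively, the pfstate row of "systat pool" shows when the memory
has been reclaimed, as you wrote.)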

I usually wait 10 s between consecutive steps of the binary search in my
measurements to give the system a chance to relax (trying to ensure that
the steps are independent measurements). However, the timeout interval
of PF was set to 1 hour (using "set timeout interval 3600"). You may
ask: why?

To have well-defined performance metrics, and to define repeatable and
reproducible measurements, we use the following tests:
- maximum connection establishment rate (during this test, every test
frame results in a new connection)
- throughput with bidirectional traffic as required by RFC 2544 (during
this test, no test frame results in a new connection, nor does any
connection time out -- a sufficiently high timeout can guarantee this;
see the pf.conf example after this list)
- connection tear-down performance (first loading N connections, then
deleting all of them in a single step and measuring the execution time
of the deletion: connection tear-down rate = N / deletion time of the N
connections)
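
(Concretely, the high timeout is a single line in pf.conf; reverting to
the default 10 s between runs just means removing that line and
reloading the ruleset. The path and the check commands below are only
illustrative:

dut# grep timeout /etc/pf.conf
set timeout interval 3600
dut# pfctl -f /etc/pf.conf
dut# pfctl -s timeouts | grep interval
)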

It is a good question how well the above performance metrics can
represent the "real world" performance of a stateful NAT64 implementation!

If you are interested (and have time), I would be happy to work together
with you in this area. We could publish a joint paper, etc. Please let
me know if you are open to that.

The focus of my current measurements is only to test and demonstrate
whether using multiple IP addresses in stateless (RFC 2544) and stateful
benchmarking measurements makes a difference. And it definitely seems
to. :-)

(Now I am trying to carry out the missing measurements and finish my
paper about the extension of siitperf with pseudorandom IP addresses ASAP.)

Best regards,

Gábor
