Thursday, August 31, 2023

Re: pf state-table-induced instability

On Thu, Aug 31, 2023 at 04:10:06PM +0200, Gabor LENCSE wrote:
> Dear David,
>
> Thank you very much for all the new information!
>
> I keep only those parts that I want to react to.
>
> > > It is not a fundamental issue, but it seems to me that during my tests not
> > > only four but five CPU cores were used by IP packet forwarding:
> > the packet processing is done in kernel threads (task queues are built
> > on threads), and those threads could be scheduled on any cpu. the
> > pf purge processing runs in yet another thread.
> >
> > iirc, the scheduler scans down the list of cpus looking for an idle
> > one when it needs to run stuff, except it tries to avoid cpu0 if possible.
> > this is why you see most of the system time on cpus 1 to 5.
>
> Yes, I can confirm that whenever I looked, CPU00 was not used by the
> system tasks.
>
> However, I remembered that PF was disabled during my stateless tests, so I
> think its purge could not have been what used CPU05. Now I repeated the
> experiment, first disabling PF as follows:

disabling pf means it doesn't get run for packets in the network stack.
however, once the state purge processing is started it just keeps
running. if you have zero states, there won't be much to process though.
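
if you want to double check that the state table really is empty while
pf is disabled, the counters (not just the tree) can be read from
userland. something like this should do it (just a sketch, the exact
output format varies a little between releases):

  dut# pfctl -si | grep -i entries    # "current entries" should be 0
  dut# pfctl -ss | wc -l              # lists states, one per line; 0 means none left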

there will be other things running in the system that could account for
the "extra" cpu utilisation.

> dut# pfctl -d
> pf disabled
>
> And I can still see FIVE CPU cores used by system tasks:

the network stack runs in these threads. pf is just one part of the
network stack.
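
if you want to see which kernel threads are actually using that cpu
time, top can show system processes and their threads. the forwarding
work should show up against the softnet taskq threads (the thread
naming here is from memory, so treat it as an assumption):

  dut# top -SH -s 1
  # -S shows system (kernel) processes, -H shows their threads;
  # the softnet* threads are the network stack task queues, pf included.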

>
> load averages:  0.69,  0.29,  0.13                    dut.cntrg 14:41:06
> 36 processes: 35 idle, 1 on processor                 up 0 days 00:03:46
> CPU00 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  8.1% intr, 91.7% idle
> CPU01 states:  0.0% user,  0.0% nice, 61.1% sys,  9.5% spin,  9.5% intr, 19.8% idle
> CPU02 states:  0.0% user,  0.0% nice, 62.8% sys, 10.9% spin,  8.5% intr, 17.8% idle
> CPU03 states:  0.0% user,  0.0% nice, 54.7% sys,  9.1% spin, 10.1% intr, 26.0% idle
> CPU04 states:  0.0% user,  0.0% nice, 62.7% sys, 10.2% spin,  9.8% intr, 17.4% idle
> CPU05 states:  0.0% user,  0.0% nice, 51.7% sys,  9.1% spin,  7.6% intr, 31.6% idle
> CPU06 states:  0.2% user,  0.0% nice,  2.8% sys,  0.8% spin, 10.0% intr, 86.1% idle
> CPU07 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  7.2% intr, 92.6% idle
> CPU08 states:  0.0% user,  0.0% nice,  0.0% sys,  0.0% spin,  8.4% intr, 91.6% idle
> CPU09 states:  0.0% user,  0.0% nice,  0.0% sys,  0.0% spin,  9.2% intr, 90.8% idle
> CPU10 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin, 10.8% intr, 89.0% idle
> CPU11 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  9.2% intr, 90.6% idle
> CPU12 states:  0.0% user,  0.0% nice,  0.2% sys,  0.8% spin,  9.2% intr, 89.8% idle
> CPU13 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  7.2% intr, 92.6% idle
> CPU14 states:  0.0% user,  0.0% nice,  0.0% sys,  0.8% spin,  9.8% intr, 89.4% idle
> CPU15 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  7.8% intr, 92.0% idle
> Memory: Real: 34M/1546M act/tot Free: 122G Cache: 807M Swap: 0K/256M
>
> I suspect that top shows an average (over a time window of a few seconds) and
> that perhaps one of the cores from CPU01 to CPU04 is skipped (e.g. because it
> was used by the "top" command?), which is why I can see system load on CPU05.
> (There is even some low amount of system load on CPU06.)
>
>
> > > *Is there any way to completely delete its entire content?*
> > hrm.
> >
> > so i just read the code again. "pfctl -F states" goes through the whole
> > state table and unlinks the states from the red-black trees used for
> > packet processing, and then marks them as unlinked so the purge process
> > can immediately claim them as soon as they're scanned. this means that
> > in terms of packet processing the tree is empty. the memory (which is
> > what the state limit applies to) won't be reclaimed until the purge
> > processing takes them.
> >
> > if you just wait 10 or so seconds after "pfctl -F states" then both the
> > tree and state limits should be back to 0. you can watch "pfctl -si",
> > "systat pf", or the pfstate row in "systat pool" to confirm.
> >
> > you can change the scan interval from its default of 10s with "set timeout
> > interval" in pf.conf. no one fiddles with that though, so i'd put it back
> > between runs to be representative of real world performance.
>
> I usually wait 10s between the consecutive steps of the binary search of my
> measurements to give the system a chance to relax (trying to ensure that the
> steps are independent measurements). However, the timeout interval of PF was
> set to 1 hour (using "set timeout interval 3600"). You may ask, why?
>
> To have some well defined performance metrics, and to define repeatable and
> reproducible measurements, we use the following tests:
> - maximum connection establishment rate (during this test all test frames
> result in a new connection)
> - throughput with bidirectional traffic as required by RFC 2544 (during this
> test no test frames result in a new connection, nor does any connection time
> out -- a sufficiently high timeout can guarantee this)
> - connection tear down performance (first loading N connections and then
> deleting all connections in a single step and measuring the execution time of
> the deletion: connection tear down rate = N / deletion time of N connections)

the above sounds like you're trying to prevent connections being torn
down inside pf. however, "set timeout interval" only affects how
active the pf state purge processing is. it basically tunes how many
states will be scanned every second, but doesn't change how idle a state
has to be before it can become a candidate for purging.
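
if the goal is to stop states expiring during a run, the per protocol
timeouts are the knobs that matter, not the purge interval. a rough
pf.conf sketch (the values here are just an illustration, not a
recommendation):

  set timeout interval 10            # purge scan interval, leave at the default
  set timeout udp.single 3600        # udp state where only one side has sent packets
  set timeout udp.multiple 3600      # udp state where both sides have sent packets
  set timeout tcp.established 86400  # idle lifetime of an established tcp state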

> It is a good question how well the above performance metrics can represent
> the "real world" performance of a stateful NAT64 implementation!
>
> If you are interested (and have time) I would be happy to work together with
> you in this area. We could publish a joint paper, etc. Please let me know
> if you are open to that.

i can have a look, just send me an email.

generally, openbsd aims to work out of the box and not require tuning.

> The focus of my current measurements is only to test and demonstrate whether
> using multiple IP addresses in stateless (RFC 2544) and stateful
> benchmarking measurements makes a difference. And it definitely seems to be
> the case. :-)

allowing nat to use multiple addresses opens up a much larger "tuple
space", which makes it more likely that a free entry in that space can
be found quickly. i haven't looked at that code for a while though, so
i'm almost guessing as to the reason for the performance difference
there.
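
for what it's worth, the NAT64 side of this in pf is the af-to
translation, and giving it a pool of addresses rather than a single one
is what opens up that tuple space. a rough sketch (interface and
addresses are made up, and i haven't checked this against the current
pf.conf grammar, so treat it as an assumption):

  # single translation address: the source port is the only free variable
  pass in on em0 inet6 from any to 64:ff9b::/96 af-to inet from 198.51.100.1
  # a prefix of translation addresses: pf can pick both an address and a port
  pass in on em0 inet6 from any to 64:ff9b::/96 af-to inet from 198.51.100.0/28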

> (Now, I try to carry out the missing measurements and finish my paper about
> the extension of siitperf with pseudorandom IP addresses ASAP.)
>
> Best regards,
>
> Gábor
