On Wed, Aug 30, 2023 at 09:54:45AM +0200, Gabor LENCSE wrote:
> Dear David,
>
> Thank you very much for your detailed answer! Now I have got the explanation
> for seemingly rather strange things. :-)
>
> However, I have some further questions. Let me explain what I do now so that
> you can more clearly see the background.
>
> I have recently enabled siitperf to use multiple IP addresses. (Siitperf is
> an IPv4, IPv6, SIIT, and stateful NAT64/NAT44 benchmarking tool
> implementing the measurements of RFC 2544, RFC 8219, and this draft:
> https://datatracker.ietf.org/doc/html/draft-ietf-bmwg-benchmarking-stateful
> .)
>
> Currently I want to test (and demonstrate) the difference this improvement
> has made. I have already covered the stateless case by measuring the IPv4
> and IPv6 packet forwarding performance of OpenBSD using
> 1) the very same test frames following the test frame format defined in the
> appendix of RFC 2544
> 2) using only pseudorandom port numbers required by RFC 4814 (resulted in no
> performance improvement compared to case 1)
> 3) using pseudorandom IP addresses from specified ranges (resulted in
> significant performance improvement compared to case 1)
> 4) using both pseudorandom IP addresses and port numbers (same results as in
> case 3)
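The four traffic variants above can be sketched with a tiny per-frame field generator. This is a hypothetical illustration of the idea, not siitperf's actual code; all names and ranges here are made up for the example.

```python
import random

# Sketch of the four test cases: a per-frame generator that optionally
# varies IP addresses and/or port numbers. Ranges are illustrative only.
def frame_fields(ip_base, ip_range_size, vary_ips, vary_ports):
    """Return (source IP as int, source port, destination port) for one frame."""
    ip = ip_base + (random.randrange(ip_range_size) if vary_ips else 0)
    # RFC 4814 asks for pseudorandom port numbers; otherwise use fixed
    # values as in the RFC 2544 appendix test frame format.
    sport = random.randrange(1024, 65536) if vary_ports else 50000
    dport = random.randrange(1, 50000) if vary_ports else 80
    return ip, sport, dport
```

Case 1 corresponds to `vary_ips=False, vary_ports=False`, case 2 to `False, True`, case 3 to `True, False`, and case 4 to `True, True`.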
>
> Many thanks to OpenBSD developers for enabling multi-core IP packet
> forwarding!
>
> https://www.openbsd.org/plus72.html says: "Activated parallel IP forwarding,
> starting 4 softnet tasks but limiting the usage to the number of CPUs."
>
> It is not a fundamental issue, but it seems to me that during my tests not
> only four but five CPU cores were used by IP packet forwarding:
the packet processing is done in kernel threads (task queues are built
on threads), and those threads could be scheduled on any cpu. the
pf purge processing runs in yet another thread.
iirc, the scheduler scans down the list of cpus looking for an idle
one when it needs to run stuff, except it avoids cpu0 if possible.
this is why you see most of the system time on cpus 1 to 5.
>
> load averages:  1.34,  0.35,  0.12                        dut.cntrg 20:10:15
> 36 processes: 35 idle, 1 on processor                     up 1 days 02:16:56
> CPU00 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  6.1% intr, 93.7% idle
> CPU01 states:  0.0% user,  0.0% nice, 55.8% sys,  7.2% spin,  5.2% intr, 31.9% idle
> CPU02 states:  0.0% user,  0.0% nice, 53.6% sys,  8.0% spin,  6.2% intr, 32.1% idle
> CPU03 states:  0.0% user,  0.0% nice, 48.3% sys,  7.2% spin,  6.2% intr, 38.3% idle
> CPU04 states:  0.0% user,  0.0% nice, 44.2% sys,  9.7% spin,  6.3% intr, 39.8% idle
> CPU05 states:  0.0% user,  0.0% nice, 33.5% sys,  5.8% spin,  6.4% intr, 54.3% idle
> CPU06 states:  0.0% user,  0.0% nice,  3.2% sys,  0.2% spin,  7.2% intr, 89.4% idle
> CPU07 states:  0.0% user,  0.0% nice,  0.0% sys,  0.8% spin,  6.0% intr, 93.2% idle
> CPU08 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  5.4% intr, 94.4% idle
> CPU09 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  7.2% intr, 92.6% idle
> CPU10 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  8.9% intr, 90.9% idle
> CPU11 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  7.6% intr, 92.2% idle
> CPU12 states:  0.0% user,  0.0% nice,  0.0% sys,  0.0% spin,  8.6% intr, 91.4% idle
> CPU13 states:  0.0% user,  0.0% nice,  0.0% sys,  0.4% spin,  6.1% intr, 93.5% idle
> CPU14 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  6.4% intr, 93.4% idle
> CPU15 states:  0.0% user,  0.0% nice,  0.0% sys,  0.4% spin,  4.8% intr, 94.8% idle
> Memory: Real: 34M/2041M act/tot Free: 122G Cache: 825M Swap: 0K/256M
>
> The above output of the "top" command shows significant system load on CPU
> cores from CPU01 to CPU05.
>
> *Has the number of softnet tasks been increased from 4 to 5?*
no :)
> What is more crucial for me are the stateful NAT64 measurements with
> PF.
>
> My stateful NAT64 measurements are as follows.
>
> 1. Maximum connection establishment rate test uses a binary search to find
> the highest rate, at which all connections can be established through the
> stateful NAT64 gateway when all test frames create a new connection.
>
> 2. Throughput test also uses a binary search to find the highest rate
> (called throughput) at which all test frames are forwarded by the stateful
> NAT64 gateway using bidirectional traffic. (All test frames belong to an
> already existing connection. This test requires loading the connections into
> the connection tracking table of the stateful NAT64 gateway in a previous
> step, at a rate safely lower than the one determined by the maximum
> connection establishment rate test.)
>
> And both tests need to be repeated multiple times to acquire statistically
> reliable results.
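The binary search used by both tests can be sketched as follows. This is a generic illustration of the procedure described above, not siitperf's actual implementation; the rate bounds and precision are made-up parameters.

```python
def binary_search_rate(trial, r_min, r_max, precision):
    """Find the highest rate (frames/s) at which trial(rate) still passes,
    i.e. all frames are forwarded / all connections are established."""
    best = 0
    while r_max - r_min > precision:
        rate = (r_min + r_max) // 2
        if trial(rate):          # lossless at this rate: search higher
            best, r_min = rate, rate
        else:                    # loss observed: search lower
            r_max = rate
    return best
```

Each call to `trial` is one elementary measurement step, which is why the per-step overhead (such as resetting the DUT) dominates the total test duration.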
>
> As for the explanation of the seemingly deteriorating performance of PF, now
> I understand from your explanation that the "pfctl -F states" command does
> not delete the content of the connection tracking table.
>
> *Is there any way to completely delete its entire content?*
hrm.
so i just read the code again. "pfctl -F states" goes through the whole
state table and unlinks the states from the red-black trees used for
packet processing, and then marks them as unlinked so the purge process
can immediately claim them as soon as they're scanned. this means that
in terms of packet processing the tree is empty. the memory (which is
what the state limit applies to) won't be reclaimed until the purge
processing takes them.
if you just wait 10 or so seconds after "pfctl -F states" then both the
tree and state limits should be back to 0. you can watch pfctl -si,
"systat pf", or the pfstate row in "systat pool" to confirm.
you can change the scan interval from its 10s default with "set timeout
interval" in pf.conf. no one fiddles with that though, so i'd put it back
between runs to be representative of real world performance.
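For reference, the knob mentioned above in pf.conf form (10 is the default value, written out here only to make it explicit):

```
# pf.conf: purge scan interval in seconds; pf tries to walk the whole
# state table once per this many seconds (default 10)
set timeout interval 10
```

After "pfctl -F states", waiting one full interval and then checking the state count in "pfctl -si" (or "systat pf") should show it back at 0.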
> (E.g., under Linux, I can delete the connection tracking table of iptables
> or Jool by deleting the appropriate kernel module.)
i can look at making pfctl -F states free the memory up too, but i have
this massive todo list already :(
> Of course, I can delete it by rebooting the server. However, currently I use
> a Dell PowerEdge R730 server, and its complete reboot (including stopping
> OpenBSD, initialization of the hardware, booting OpenBSD and some spare
> time) takes 5 minutes. This is way too long an overhead if I need to do it
> between every single elementary step (that is, each step of the binary
> search), which takes on the order of 1 minute. :-(
5 minutes of VALUE ADDING. pretty sure dell thinks you should be
grateful for all the amazing work they're doing before you get to the
boot loader.
>
> (Currently I use the compromise that I reboot the OpenBSD server after
> finishing each binary search.)
>
> Thank you very much for all your further advice in advance!
>
> Best regards,
>
> Gábor
>
> On 8/29/2023 12:01 AM, David Gwynne wrote:
> > On Mon, Aug 28, 2023 at 01:46:32PM +0200, Gabor LENCSE wrote:
> > > Hi Lyndon,
> > >
> > > Sorry for my late reply. Please see my answers inline.
> > >
> > > On 8/24/2023 11:13 PM, Lyndon Nerenberg (VE7TFX/VE6BBM) wrote:
> > > > Gabor LENCSE writes:
> > > >
> > > > > If you are interested, you can find the results in Tables 18 - 20 of
> > > > > this (open access) paper:https://doi.org/10.1016/j.comcom.2023.08.009
> > > > Thanks for the pointer -- that's a very interesting paper.
> > > >
> > > > After giving it a quick read through, one thing immediately jumps
> > > > out. The paper mentions (section A.4) a boost in performance after
> > > > increasing the state table size limit. Not having looked at the
> > > > relevant code, so I'm guessing here, but this is a classic indicator
> > > > of a hashing algorithm falling apart when the table gets close to
> > > > full. Could it be that simple? I need to go digging into the pf
> > > > code for a closer look.
> > > Beware, I wrote it about iptables and not PF!
> > >
> > > As for iptables, it is really so simple. I have done a deeper analysis of
> > > iptables performance as the function of its hash table size. It is
> > > documented in another (open access) paper:
> > > http://doi.org/10.36244/ICJ.2023.1.6
> > >
> > > However, I am not familiar with the internals of the other two tested
> > > stateful NAT64 implementations, Jool and OpenBSD PF. I have no idea, what
> > > kind of data structures they use for storing the connections.
> > openbsd uses a red-black tree to look up states. packets are parsed into
> > a key that looks up states by address family, ips, ipproto, ports, etc,
> > to find the relevant state. if a state isnt found, it falls through to
> > ruleset evaluation, which is notionally a linked list, but has been
> > optimised.
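A minimal sketch of that lookup flow, with a plain dict standing in for the kernel's red-black tree (the real tree gives O(log n) lookups and ordered scans; everything here is illustrative, none of the names are pf's):

```python
states = {}   # stand-in for pf's red-black tree of states
ruleset = []  # notionally a linked list of rules, walked on a miss

def state_key(pkt):
    # the same kind of fields pf parses a packet into:
    # address family, ip protocol, addresses, ports
    return (pkt["af"], pkt["proto"], pkt["src"], pkt["sport"],
            pkt["dst"], pkt["dport"])

def lookup(pkt):
    st = states.get(state_key(pkt))
    if st is not None:
        return st                        # fast path: state found
    for rule in ruleset:                 # slow path: ruleset evaluation
        if rule["match"](pkt):
            st = {"rule": rule}
            states[state_key(pkt)] = st  # later packets hit the fast path
            return st
    return None                          # no state, no matching rule
```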
> >
> > > > You also describe how the performance degrades over time. This
> > > > exactly matches the behaviour we see. Could the fix be as simple
> > > > as cranking 'set limit states' up to, say, two milltion? There is
> > > > one way to find out ... :-)
> > > As you could see, the highest number of connections was 40M, and the limit
> > > of the states was set to 1000M. It worked well for me then with the PF of
> > > OpenBSD 7.1.
> > >
> > > It would be interesting to find the root cause of the phenomenon, why the
> > > performance of PF seems to deteriorate with time. E.g., somehow the internal
> > > data structures of PF become "polluted" if many connections are established
> > > and then deleted?
> > my first guess is that you're starting to fight against the pf state
> > purge processing. pf tries to scan the entire state table every 10
> > seconds (by default) looking for expired states it can remove. this scan
> > process runs every second, but it tries to cover the whole state table
> > by 10 seconds. the more states you have the more time this takes, and
> > this increases linearly with the number of states you have.
> >
> > until relatively recently (post 7.2), the scan and gc processing
> > effectively stopped the world. at work we run with about 2 million
> > states during business hours, and i was seeing the gc processing take up
> > approx 70ms a second, during which packet processing didnt really
> > happen.
> >
> > now the scan can happen without blocking pf packet processing. it still
> > takes cpu time, so there is a point that processing packets and scanning
> > for states will fight each other for time, but at least they're not
> > fighting each other for locks now.
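The scan budget described above amounts to simple arithmetic, sketched here as a back-of-envelope helper (illustrative only):

```python
# pf's purge scan runs every second but aims to cover the whole state
# table every `interval` seconds, so each one-second run has to visit
# roughly total_states / interval entries.
def states_per_scan_run(total_states, interval=10):
    return total_states // interval
```

With the ~2 million states mentioned above and the default 10s interval, each one-second run visits about 200,000 states, and that per-run work grows linearly with the table size.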
> >
> > > However, I have deleted the content of the state table after each elementary
> > > measurement step using the "pfctl -F states" command. (I am sorry, this
> > > command is missing from the paper, but it is there in my saved "del-pf"
> > > file!)
> > >
> > > Perhaps PF developers could advise us, if the deletion of the states
> > > generate a fresh state table or not.
> > it marks the states as expired, and then the purge scan is able to take
> > them and actually free them.
> >
> > > Could anyone help us in this question?
> > >
> > > Best regards,
> > >
> > > Gábor
> > >
> > >
> > >
> > >
> > > > --lyndon