Monday, May 04, 2020

OSPF lsa_check issue

Hi,

Following on from the OSPF issue we were seeing in 5.8, we have built a
Vagrant lab with a complete replica of our production network in order to
test our config against 6.6 (latest syspatch applied) and run through a
number of scenarios.

All in all it has gone well, and other than some minor config
enhancements, everything is fundamentally working.

The original issue we had was routes not being advertised beyond the DR
in situations such as a network blip or a restart of the ospfd process on
another router/firewall.

Since moving to 6.6 we have been able to recreate the same situation we
have had in production; we do this by running "rcctl restart ospfd" on the
DR, typically a few times.
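
For reference, the repro is nothing more elaborate than the following
(the log path is the default daemon facility log on OpenBSD):

  # on the DR, repeat a few times
  rcctl restart ospfd

  # on the other routers, watch for the symptom
  tail -f /var/log/daemon | grep lsa_check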

Eventually other routers start logging as follows:

May 4 15:44:19 va-l1-tun ospfd[75371]: lsa_check: bad age
May 4 15:44:19 va-l1-tun ospfd[75371]: lsa_check: bad age
May 4 15:44:24 va-l1-br-02 ospfd[27625]: lsa_check: bad age
May 4 15:44:24 va-l1-br-02 ospfd[27625]: lsa_check: bad age
May 4 15:44:24 va-l1-tun ospfd[75371]: lsa_check: bad age

If we run a capture with "tcpdump -i vio0 -s 1500 -w /tmp/ospf.pcap proto
ospf", we can see the OSPF hello packets in full in Wireshark, but the LS
Update packets are fragmented, so we cannot see the full detail of what is
being passed from the relevant neighbor.
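
If it helps, our plan for getting the full LS Updates out of the capture
is to read it back with IPv4 reassembly forced on, so the fragments are
decoded as whole OSPF packets; something along these lines, assuming
tshark's ip.defragment preference does what we expect:

  tshark -r /tmp/ospf.pcap -o ip.defragment:TRUE -Y ospf -V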

We have tried to increase the verbosity of logging using "ospfctl log
verbose", but we are still unsure which LSA is the incorrect one.

The only way we have found to stop these logs from appearing is to run
"rcctl restart ospfd" on various boxes until the logging stops.

What we are hoping for help with is diagnosing exactly which LSA is
triggering the "lsa_check: bad age" message, and understanding whether the
condition should, in effect, clear itself.

We have looked at the source code, but do not fully understand the flow
beyond the check itself in lsa_check.
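
Our reading, which may well be wrong, is that the check is essentially
the LS age sanity test from RFC 2328, i.e. a received LSA whose age field
claims to be beyond MaxAge gets rejected. Roughly this shape, with the
constant taken from the RFC rather than quoted from the ospfd source:

  #include <stdint.h>

  #define MAX_AGE 3600    /* MaxAge, RFC 2328 */

  /* reject a received LSA whose age claims to be past MaxAge */
  static int
  lsa_age_ok(uint16_t age)
  {
          return (age <= MAX_AGE);
  }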

We are wondering if there is something fundamentally wrong with our
config, but it is pretty simple: effectively a set of connected routers in
a single area, with one of the hops having a backup path across the
internet via a GRE tunnel. At most we are ever 3 hops between a source and
a destination.
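
For illustration, the config on each box looks roughly like the following
(router-id, interface names and metrics here are placeholders, not our
real values):

  # /etc/ospfd.conf
  router-id 10.0.0.1

  area 0.0.0.0 {
          interface vio0 {
                  metric 10
          }
          # backup path across the internet via the GRE tunnel
          interface gre0 {
                  metric 100
          }
  }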

We have also on occasion seen "seq num mismatch, bad flags" messages, but
these have appeared to clear themselves.

Thanks
