Improving reliability for DNS Resources • Firezone Blog (2024)

tl;dr: Upgrade your Gateway(s) to 1.1.0 soon to improve reliability for DNS Resources.

In our How DNS works in Firezone post, we covered how DNS Resources are resolved and routed reliably even when the IPs they resolve to collide. The system described there works well for the vast majority of our users across many kinds of networks.

But, as it turns out, not all networks are well-behaved (surprise!). Certain networks in particular can cause issues with DNS Resources, causing them to time out or fail to resolve after a period of time.

This post describes why that happens, how we're resolving it, and the steps you can take to upgrade.

The case of the NAT reset

The issue was first discovered about a month ago during our internal dogfood testing sessions. We noticed that after some time (typically 30 minutes to a few hours), DNS Resources would become unresponsive and require the application to issue another DNS query to perform the hole-punching dance and re-establish connectivity.

This is odd behavior -- tunnels are designed to be kept alive indefinitely with a periodic keep-alive sent from Client to Gateway.

When tunnels drop

There are two obvious reasons why a tunnel might drop and need to be re-established:

  • The Client experienced a change in network connectivity (e.g. switching Wi-Fi networks), or
  • The Gateway experienced a change in network connectivity (e.g. restarted by an admin)

A third, less obvious reason is that the network between the Client and Gateway is misbehaving.

Google Cloud NAT

We dogfood Firezone internally across a variety of network conditions for both Client and Gateway. After some investigation, we discovered a curious pattern: the DNS Resource reliability issue only occurred for our Gateways running in Google Cloud.

After running an overnight soak test, we discovered that the issue happened at regular intervals. Precisely every 30 minutes, the WireGuard tunnel would drop, and connectivity to the DNS Resource would be lost. Since new tunnels for DNS Resources are established only at the time of resolution, the application (ping in our case) would lose connectivity until it was restarted.

Google doesn't publish details on the session lifetimes for their NAT Gateways, so we can't be sure if the problem is related to GCP or another router close to GCP's datacenters (if you happen to know, please email us!).

But the goal of this post isn't to pick on Google -- some enterprise routers behave similarly, under the guise of so-called "security" features, so the issue could occur in other networks as well.

The solution

The solution is a simple, yet subtle one: instead of establishing the tunnel for a DNS Resource at the time of resolution, we now wait until we see the first packet for the Resource before performing the hole-punching dance to set up the tunnel.

The stub resolver maintains a mapping from the IPs it hands out to DNS Resources, so we know at the packet level which DNS Resource a packet is for, even long after the query has been resolved.

If the tunnel fails, the very next packet from the application will establish it again, avoiding the need for another query (which the application may not make) and thus avoiding the reliability issues detailed above.
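The flow described above can be sketched roughly as follows. This is a minimal, hypothetical model in Python, not Firezone's actual implementation: the names `StubResolver`, `TunnelManager`, `handle_packet`, and `mark_dropped`, as well as the `100.96.0.0/11` placeholder range, are assumptions made purely for illustration.

```python
import ipaddress
from typing import Dict, Optional, Set


class StubResolver:
    """Hands out placeholder IPs and remembers which DNS Resource each one maps to."""

    def __init__(self, proxy_range: str = "100.96.0.0/11") -> None:
        # The range is an illustrative assumption, not taken from the post.
        self._free = iter(ipaddress.ip_network(proxy_range).hosts())
        self._ip_by_name: Dict[str, ipaddress.IPv4Address] = {}
        self._name_by_ip: Dict[ipaddress.IPv4Address, str] = {}

    def resolve(self, name: str) -> ipaddress.IPv4Address:
        # Answer immediately; no tunnel is set up at resolution time.
        if name not in self._ip_by_name:
            ip = next(self._free)
            self._ip_by_name[name] = ip
            self._name_by_ip[ip] = name
        return self._ip_by_name[name]

    def resource_for(self, ip) -> Optional[str]:
        # Packet-level lookup: which DNS Resource is this destination IP for?
        return self._name_by_ip.get(ip)


class TunnelManager:
    """Sets up a tunnel lazily, on the first packet seen for a DNS Resource."""

    def __init__(self, resolver: StubResolver) -> None:
        self._resolver = resolver
        self._live: Set[str] = set()
        self.setups = 0  # counts tunnel setups, for demonstration only

    def mark_dropped(self, name: str) -> None:
        # Stand-in for detecting a dead tunnel (e.g. after a NAT session reset).
        self._live.discard(name)

    def handle_packet(self, dst_ip) -> bool:
        name = self._resolver.resource_for(dst_ip)
        if name is None:
            return False  # destination is not a known DNS Resource
        if name not in self._live:
            self.setups += 1  # stand-in for the hole-punching dance
            self._live.add(name)
        return True


resolver = StubResolver()
tunnels = TunnelManager(resolver)
ip = resolver.resolve("db.internal.example")   # resolution alone sets up nothing
tunnels.handle_packet(ip)                      # first packet: tunnel is established
tunnels.mark_dropped("db.internal.example")    # tunnel drops mid-session
tunnels.handle_packet(ip)                      # next packet re-establishes it, no new DNS query
```

The point of the sketch is that recovery is driven by traffic rather than by DNS: as long as the IP-to-Resource mapping outlives the query, any later packet carries enough information to rebuild the tunnel.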

NAT64 comes for free

One interesting edge case we hit implementing the above solution is that we don't know the actual IP of the DNS Resource until the tunnel to the Gateway is established, at which point the Gateway resolves it.

Since the stub resolver now immediately returns a dummy IP when asked to do so, it could return an IPv4 address for a Resource that has only AAAA records defined, or vice versa. If the application chooses IPv4 to connect to the Resource, packets would arrive at the Gateway and suddenly need to be translated to IPv6.

So we added a NAT64 implementation to Gateways in 1.1.0 that handles this edge case (and others!) on-the-fly, with no configuration required. That means your workforce can now seamlessly connect to IPv6-only Resources even if they're on IPv4-only networks!
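To make the address-family mismatch concrete, here is a small sketch of the decision a Gateway faces once it has resolved the real addresses. It is an illustration under assumptions, not the Gateway's actual logic: the function name `pick_target` and the "prefer the same family, otherwise translate" policy are hypothetical.

```python
import ipaddress
from typing import List, Optional, Tuple, Union

IPAddress = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]


def pick_target(
    proxy_ip: IPAddress, resolved: List[IPAddress]
) -> Optional[Tuple[IPAddress, bool]]:
    """Return (real_ip, needs_translation) for a packet sent to proxy_ip.

    Returns None when the Resource resolved to no addresses at all.
    """
    # Prefer a real address of the same family as the packet the Client sent.
    same_family = [ip for ip in resolved if ip.version == proxy_ip.version]
    if same_family:
        return same_family[0], False
    # Only the other family is available: the packet must be translated
    # between IPv4 and IPv6 before it can be forwarded.
    if resolved:
        return resolved[0], True
    return None


# The Client was handed an IPv4 placeholder, but the Resource only has AAAA records.
proxy = ipaddress.ip_address("100.96.0.1")
aaaa_only = [ipaddress.ip_address("2001:db8::10")]
target, needs_translation = pick_target(proxy, aaaa_only)
```

In the example, `needs_translation` is true because the only real address is IPv6 while the Client's packet is IPv4, which is exactly the case the NAT64 implementation exists to cover.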

How to upgrade

We released Gateway version 1.1.0 yesterday, which includes the change. This version is compatible with Client versions 1.0.x and 1.1.x. However, Client versions 1.1.x will not be compatible with Gateway versions 1.0.x.
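The compatibility rules above can be written down as a small check. This is only a sketch of the stated matrix for the 1.0.x and 1.1.x lines; the helper names `release_line` and `is_compatible` are hypothetical, and the 1.0.x Client with 1.0.x Gateway case is assumed to be compatible because it is the pre-upgrade status quo.

```python
from typing import Tuple


def release_line(version: str) -> Tuple[int, int]:
    """Reduce a version like '1.1.0' to its (major, minor) release line."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def is_compatible(client: str, gateway: str) -> bool:
    """Compatibility for the 1.0.x and 1.1.x lines only, as described in this post."""
    c, g = release_line(client), release_line(gateway)
    known = {(1, 0), (1, 1)}
    if c not in known or g not in known:
        raise ValueError("only 1.0.x and 1.1.x are covered by this sketch")
    # Within these two lines, a Client works as long as it is not newer
    # than the Gateway it connects to.
    return c <= g


upgrade_gateway_first = is_compatible("1.0.5", "1.1.0")   # old Client, new Gateway
upgrade_client_first = is_compatible("1.1.0", "1.0.5")    # new Client, old Gateway
```

Here `upgrade_gateway_first` is true and `upgrade_client_first` is false, which is why Gateways should be upgraded before the 1.1.0 Clients reach end users.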

To give admins time to upgrade their Gateways, we are waiting to release the 1.1.0 Clients until Thursday, June 27th. We recommend upgrading your Gateways to 1.1.0 as soon as possible to avoid any service disruptions caused by end users upgrading their Clients prematurely.

Upgrading Gateway(s) usually takes only a couple of minutes -- read the docs to see how.

Conclusion

That's all for now. If you have questions or hit issues, contact us via one of the means listed here.
