When first starting out with Azure Enterprise Landing Zone (ELZ) deployments many years ago, one area that took a little poking about to understand was how best to architect private DNS within Azure.

One article that really helped me at the time was this one from Microsoft’s Adam Stuart, which I’d strongly recommend having a read of (in addition to this one of course 😉).

In this article, we are going to explore why we need private DNS in the first place, the different ways of architecting the solution, and some of the real-world pros and cons of each approach based on experience in having deployed both!


Why Private DNS?

Private DNS zones within Azure do what they say on the tin! Azure Private DNS is a PaaS service that allows you to host private DNS entries for any zone you like - regardless of whether you own the domain or not.

These zones are then linked to one or more virtual networks. Once a private zone is linked to a virtual network, resources within that network can query the Azure DNS service IP (168.63.129.16) and resolve the entries.

Private zones can be linked directly to spoke virtual networks, but more commonly they are linked to a central shared virtual network that contains Domain Controllers, standalone DNS servers or Azure DNS Private Resolvers. The DNS settings for all spoke virtual networks then point to these central IPs (be it VMs or DNS private resolvers).

Azure Private DNS Architecture

There are probably two main uses for private DNS zones:

  1. You wish to utilise a private zone for your cloud resolution, without relying on the default Microsoft-provided domain.
  2. You are utilising private endpoints for connectivity to your resources.

The second use-case is the primary one we are going to be focussing on in this article.


Private Endpoints - A Quick Primer

A private endpoint is essentially a network interface card (NIC) that is placed in one of your virtual networks. It is used to connect to a PaaS resource privately, without going over the public Internet.

Private Endpoint Example

Utilising private endpoints for connectivity to your resources relies heavily on DNS. When private endpoints are enabled on a resource, a CNAME shim gets added into the name resolution chain. This is shown in the example dig query below. As you can see, the first response for the query of mikeguypepdemo.blob.core.windows.net is mikeguypepdemo.privatelink.blob.core.windows.net.

Private DNS CNAME Resolution

If you’re on the public Internet, nothing will change. The query will still continue to resolve to the public IP address (as in the example above), which means you can use private endpoints and public access at the same time if required.

However, if you’re on the private network, and your DNS request ends up resolving to the correct private IP (somehow), then you will connect directly over the private network to the virtual NIC instead (assuming firewalls and NSGs allow you of course).
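To make the CNAME shim a little more concrete, here is a minimal toy model of the two resolution paths in Python. It makes no real DNS queries. The storage stamp name and the public IP (taken from the 203.0.113.0/24 documentation range) are made up for illustration, and the private IP is the UK South example address used later in this article.

```python
# Toy model of the CNAME chain; no real DNS queries are made.
# The stamp name and public IP below are hypothetical placeholders.
PRIVATE_ZONE = "privatelink.blob.core.windows.net"

PUBLIC_RECORDS = {
    "mikeguypepdemo.blob.core.windows.net":
        ("CNAME", "mikeguypepdemo.privatelink.blob.core.windows.net"),
    "mikeguypepdemo.privatelink.blob.core.windows.net":
        ("CNAME", "blob.example-stamp.store.core.windows.net"),
    "blob.example-stamp.store.core.windows.net": ("A", "203.0.113.10"),
}

# Records held in the private zone (fully qualified here for simplicity).
PRIVATE_ZONE_RECORDS = {
    "mikeguypepdemo.privatelink.blob.core.windows.net": ("A", "10.1.1.4"),
}


def resolve(name, private_zone_linked, max_hops=10):
    """Follow CNAMEs until an A record is found; return (chain, ip)."""
    chain = [name]
    for _ in range(max_hops):
        # Names inside the private zone are answered by that zone only
        # when it is linked to the network the query comes from.
        if private_zone_linked and name.endswith("." + PRIVATE_ZONE):
            record = PRIVATE_ZONE_RECORDS.get(name)
        else:
            record = PUBLIC_RECORDS.get(name)
        if record is None:
            raise LookupError(f"no record for {name}")
        record_type, value = record
        if record_type == "A":
            return chain, value
        name = value
        chain.append(name)
    raise LookupError("CNAME chain too long")


if __name__ == "__main__":
    print(resolve("mikeguypepdemo.blob.core.windows.net", private_zone_linked=False))
    print(resolve("mikeguypepdemo.blob.core.windows.net", private_zone_linked=True))
```

The same name is queried in both cases; only the second hop differs depending on whether the privatelink zone is visible to the client.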

This resolution doesn’t have to be carried out via an Azure Private DNS zone; you could just manage the records manually yourself. But a private DNS zone is the most common approach, and private endpoints and private DNS zones have some automatic integrations.


Private DNS Zones - A Global Resource

Azure Private DNS zones are themselves a global resource. This means that regardless of which region we create them in (from a resource group perspective), the data will be replicated globally without us needing to do anything - it is not tied to a specific region.

So, from a pure resilience point of view, duplicating zones in multiple regions is unnecessary.

Note: I’ve always had a slight niggling doubt here with regards to resource groups. If the resource group is in a region that is having an outage, my worry is that we may not be able to update the resources within that resource group - and therefore the DNS records. Perhaps someone at Microsoft can put my mind at ease here 🙂!

If private DNS zones are a global resource, why are we bothering discussing options regarding multi-region architecture?


Resource Failover Requirements

Let’s first think about a resource that doesn’t natively support geo-redundancy (at the time of writing, anyway!) - an Azure Kubernetes Service cluster. It sits within a single region, and if that region goes down, the service is effectively lost. So, from a private endpoint perspective, a single private endpoint in the same region is sufficient. We don’t need a private endpoint in other regions (though we may of course choose to add one for other reasons).

But what about something that does support geo-redundancy, like an Azure Storage Account? Let’s say we created the storage account in UK South with geo-redundancy enabled but only created a private endpoint in UK South.

Whilst the data may be replicated to another region, if there was an outage affecting the entire UK South region (including networking), then we would have no way of privately accessing the data from our secondary region. We’d be reliant on the private endpoint in UK South which is inaccessible.

With that in mind, it then becomes necessary to add an additional private endpoint in the failover region, so that we can access the data in the event of a failure. This looks a little something like this during normal operation (i.e. the account hasn’t failed over yet)…

Multi-region Private Endpoint Architecture


Resource Failover - What Happens?

Sticking with the storage account (and blob) example. When you enable geo-redundancy, Microsoft will replicate your data in the background to your secondary region. In the event of a failure (or customer-initiated failover), you will continue to connect to the regular DNS name, but the magic of Microsoft software-defined networking will deliver it to the data in the secondary region.

Multi-region Private Endpoint Failover

A potential area of confusion here is if you enable geo-redundancy with read access. This setting allows you to perform read operations within the secondary region under normal conditions, by using a different endpoint. This endpoint gets its own DNS name and sub-resource for private endpoints, so requires an additional network interface card (NIC).

  • Primary Endpoint - mikeguypepdemo.blob.core.windows.net
  • Secondary Endpoint - mikeguypepdemo-secondary.blob.core.windows.net

Multi-region Private Endpoint with Read Access
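The naming pattern above is simple enough to capture in a couple of lines. This is just a small sketch that derives the two hostnames from the account name, following the pattern in the list above; the function name is my own.

```python
def blob_endpoints(account):
    """Return the primary and read-access secondary blob hostnames."""
    return {
        "primary": f"{account}.blob.core.windows.net",
        "secondary": f"{account}-secondary.blob.core.windows.net",
    }


if __name__ == "__main__":
    for role, hostname in blob_endpoints("mikeguypepdemo").items():
        print(f"{role}: {hostname}")
```

Each of these names needs its own record (and its own private endpoint NIC) if you want both to resolve privately.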

This doesn’t change anything in terms of failover behaviour though. In the event of a failure or customer-initiated failover, Microsoft actually takes care of swapping DNS names around for you. As the docs put it:

During the failover process, DNS (Domain Name System) entries for your storage account service endpoints are automatically updated such that the secondary region’s endpoints become the new primary endpoints. Once the unplanned failover is complete, clients can begin writing to the new primary endpoints.

A failover scenario effectively looks as follows (though the “read” endpoint in the failed region is likely inaccessible)…

Multi-region Private Endpoint Read Access Failover


Azure Private DNS - Global Zones

Now that we have a bit of an idea of what is going on with DNS, what is the problem with a single global private DNS resource per zone? Well, it is that each record in that zone only points to one private endpoint IP at a time.

Using our earlier example, under normal conditions we want mikeguypepdemo.blob.core.windows.net to resolve to 10.1.1.4 - the private endpoint in UK South. However, under failure conditions, we want it to resolve to 10.2.2.4.

This means that in the event of a failure or customer-initiated failover, we would need to go and update our private DNS zone record to point to a different IP. Whilst this is not the end of the world, and it can certainly be automated, it is still a bit of a pain, especially if you have multiple geo-redundant services to fail over and have got the added stress of a significant disaster looming over you!
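As a rough idea of what that automation has to work out, here is a minimal sketch that compares the current zone records against a list of primary/secondary private endpoint pairs and returns the changes needed. The data structures and names are hypothetical; only the two example IPs come from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointPair:
    record_name: str    # relative record name inside the privatelink zone
    primary_ip: str     # private endpoint IP in the primary region
    secondary_ip: str   # private endpoint IP in the failover region


def plan_failover(zone_records, pairs):
    """Return {record_name: (old_ip, new_ip)} for records still on the primary IP."""
    changes = {}
    for pair in pairs:
        current = zone_records.get(pair.record_name)
        if current == pair.primary_ip:
            changes[pair.record_name] = (current, pair.secondary_ip)
    return changes


if __name__ == "__main__":
    zone = {"mikeguypepdemo": "10.1.1.4"}
    pairs = [EndpointPair("mikeguypepdemo", "10.1.1.4", "10.2.2.4")]
    print(plan_failover(zone, pairs))
```

Records that have no secondary endpoint, or that have already been moved, are simply left out of the plan.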

A single DNS zone also means under normal conditions, anything outside of the primary region is going to be accessing the primary resource over VNet peerings rather than a local private endpoint resource (even if it exists). Obviously where read access options are available (like we looked at above), this isn’t really an issue, but worth being aware of in case it is not supported or not implemented. The traffic path could be sub-optimal and be traversing through multiple hub firewalls.


Azure Private DNS - Regional Zones

The answer to all of our problems has got to be the use of regional zones then! Right? If only it was quite that straightforward.

Let’s have a look at what the architecture might look like with regional zones. We will keep the read-only endpoints out of the equation to keep things simpler.

Regional Azure Private DNS Architecture

As you can see, because each region’s DNS is independent, queries resolve to the local IP, and traffic (shown by the red arrows) takes the most efficient route.

What about during a failure scenario? Unlike the single zone approach, no changes are required to DNS records. The secondary region already has the correct IP address and continues to resolve as normal - the magic of Microsoft connects it to the secondary data once failed over.

Regional Azure Private DNS Architecture - Failure Scenario
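The behaviour can be modelled with a very small sketch: one record set per region, each holding its own local private endpoint IP. The region keys are my own labels, and I am assuming 10.2.2.4 is the UK West private endpoint from the earlier example.

```python
# One independent copy of the zone per region (toy data, not real DNS).
REGIONAL_ZONES = {
    "uksouth": {"mikeguypepdemo": "10.1.1.4"},
    "ukwest": {"mikeguypepdemo": "10.2.2.4"},  # assumed failover-region endpoint
}


def resolve_in_region(region, record_name, failed_regions=frozenset()):
    """Resolve a record using the zone linked to the client's own region."""
    if region in failed_regions:
        raise ConnectionError(f"{region} is unavailable")
    return REGIONAL_ZONES[region][record_name]


if __name__ == "__main__":
    # Normal operation: each region resolves to its local private endpoint.
    print(resolve_in_region("uksouth", "mikeguypepdemo"))
    print(resolve_in_region("ukwest", "mikeguypepdemo"))
    # UK South outage: UK West keeps resolving with no record changes.
    print(resolve_in_region("ukwest", "mikeguypepdemo", failed_regions={"uksouth"}))
```

Nothing in the UK West zone changes between the normal and failure cases, which is the whole appeal of this approach.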

What are the problems with this approach?


Azure Private DNS - Regional Zones - The Problems

Whilst the multiple-zone approach prevents the need to update DNS records in the event of a failure, it does come with some significant drawbacks. Let’s look at some.

Management and Complexity

Having to maintain a zone per-region adds additional management overhead, complexity and confusion. One or two regions aren’t so bad, but if you are scaling to a lot more, it becomes tedious.

Infrastructure as Code tools such as Terraform can clearly help alleviate this, but it is still not ideal.

Cost

Private DNS zones, like most things in Azure, aren’t free! Every time you add more, you’re adding additional cost.

On-Premises Forwarding and Sub-Optimal Paths

In order for on-premises devices to access the Azure resources, they too are going to need to be able to resolve the DNS names. This means our on-premises DNS servers are going to need conditional forwarding rules to send the domains (such as privatelink.blob.core.windows.net) to the Azure DNS servers (regardless of what they are - VMs or DNS private resolvers).

But wait? We have two sets of DNS servers (UK South and UK West in our example) - which do we use?

Whichever server ends up responding to our on-prem query will end up returning IPs for its local region’s private endpoint. So, we may end up accessing a UK South storage account via a UK West private endpoint (or vice versa). This may not be the end of the world, but it is not the most efficient path either. This will be made worse if the regions are much further apart.

This challenge is illustrated in simple terms in the diagram below, with the red arrows indicating the flow of traffic, which has likely arrived into the hub by way of ExpressRoute, VPN or SD-WAN.

Suboptimal Routing Example

For Microsoft Active Directory-integrated DNS, conditional forwarding rules (with multiple destination servers) are evaluated in order, with the subsequent servers only being used if no response is received. A timeout value is used to define how quickly to move on to the subsequent DNS servers (note that any kind of response - even an NXDOMAIN - is considered a response).

What does this mean in practice? It means that a single region is always going to end up being used to access resources, unless you have a more intelligent way of distributing DNS forwarding rules.

Note - I was working with a client recently who had this kind of setup. When the UK South region was up, everything worked fine. However, when they disabled it (via central firewalls) for DR testing, DNS stopped working. They could manually query the UK West resolvers from the domain controllers, so traffic flow and firewall connectivity were working as expected.

A few packet captures later, and we’d worked out what the problem was: a timeout mismatch. The conditional forwarders were set to a 3 second timeout, but the default client timeout was 2 seconds. Clients were timing out before the server had a chance to fail over to the next forwarder in the list. A simple tweak of the conditional forwarder timeout to 1 second fixed the issue.
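The arithmetic behind that fix is easy to sketch. This is a deliberately simplified model (it ignores client retries, caching and real response-time variation, and the 0.05 second response time is an assumption) that tries forwarders in order and checks whether an answer arrives before the client gives up.

```python
def time_to_answer(forwarders_up, forwarder_timeout_s, response_time_s=0.05):
    """Seconds until the first reachable forwarder answers, or None if none are up."""
    elapsed = 0.0
    for is_up in forwarders_up:
        if is_up:
            return elapsed + response_time_s
        # An unreachable forwarder costs a full timeout before the next is tried.
        elapsed += forwarder_timeout_s
    return None


def client_gets_answer(forwarders_up, forwarder_timeout_s, client_timeout_s):
    """True if an answer arrives before the client's own timeout expires."""
    answer_time = time_to_answer(forwarders_up, forwarder_timeout_s)
    return answer_time is not None and answer_time < client_timeout_s


if __name__ == "__main__":
    uk_south_down = [False, True]  # first forwarder unreachable, second healthy
    print(client_gets_answer(uk_south_down, forwarder_timeout_s=3, client_timeout_s=2))
    print(client_gets_answer(uk_south_down, forwarder_timeout_s=1, client_timeout_s=2))
```

With a 3 second forwarder timeout the answer arrives after the 2 second client timeout, so the lookup fails; with a 1 second forwarder timeout it arrives in time.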

Every Zone Requires a DNS Entry

Given that a DNS request could be sent to a DNS service in any region (based on our conditional forwarding rules), we have to account for that when configuring our private DNS zone records.

Even if a service is ONLY available in one region (say UK South), we still have to ensure that a DNS entry exists in the other regions’ zones. Otherwise, a transient DNS failure towards the primary region may mean we can’t connect, even though there is a valid network path. This means we either have to add DNS entries manually to each zone or deploy additional private endpoints (with automatic integration) even though they aren’t necessarily required.
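A check for this kind of drift is straightforward to write. The sketch below is a minimal example, assuming you have already exported each regional zone's records into a dictionary (how you export them is up to you); the record names are hypothetical.

```python
def missing_records(regional_zones):
    """Return {region: sorted record names that exist in another region but not here}."""
    all_names = set()
    for records in regional_zones.values():
        all_names.update(records)

    gaps = {}
    for region, records in regional_zones.items():
        missing = sorted(all_names - set(records))
        if missing:
            gaps[region] = missing
    return gaps


if __name__ == "__main__":
    zones = {
        "uksouth": {"mikeguypepdemo": "10.1.1.4", "akscluster": "10.1.1.10"},
        "ukwest": {"mikeguypepdemo": "10.2.2.4"},
    }
    print(missing_records(zones))
```

Running something like this in a pipeline would flag the single-region AKS record that never made it into the UK West zone.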

The latter generally works ok, but I do a lot of my work through Terraform. Certain resources (such as Kubernetes clusters) try to be helpful by creating their own private endpoints, which causes some pain here. I’ve also found (again, with AKS), that even when adding the secondary private endpoint to the other region, the name wasn’t correctly added to DNS for some reason, and I had to still manually add it as an additional IaC resource (unlike regular private endpoints).

Again, doing this across two regions - not so bad. Three, four, five… the complexity and overheads grow.


What Would I Choose?

The requirements would vary based on how important things like seamless failover are to the environment, which is why a solid design phase is always important for any work like this. That being said, as a general rule I’d probably opt for a single global deployment having done both.

The complexity and overheads added by having to duplicate everything in each zone outweigh the benefits in my opinion. Creating some “disaster recovery” scripts to automate IP changes in a failover would be pretty straightforward to do, and it keeps the environment simpler.
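As a rough illustration of such a script, the sketch below builds the Azure CLI commands to swap an A record from the old IP to the new one. It only builds the argument lists and does not execute anything. The resource group name is hypothetical, and the new record is added before the old one is removed so that the record set is never left empty.

```python
def failover_commands(resource_group, zone, record_name, old_ip, new_ip):
    """Build az CLI argument lists to repoint an A record from old_ip to new_ip."""
    base = ["az", "network", "private-dns", "record-set", "a"]
    target = [
        "--resource-group", resource_group,
        "--zone-name", zone,
        "--record-set-name", record_name,
    ]
    return [
        # Add the failover-region IP first so the record set is never empty.
        base + ["add-record"] + target + ["--ipv4-address", new_ip],
        base + ["remove-record"] + target + ["--ipv4-address", old_ip],
    ]


if __name__ == "__main__":
    commands = failover_commands(
        "rg-dns-example",  # hypothetical resource group
        "privatelink.blob.core.windows.net",
        "mikeguypepdemo",
        old_ip="10.1.1.4",
        new_ip="10.2.2.4",
    )
    for command in commands:
        print(" ".join(command))
```

Each list could then be passed to subprocess.run(command, check=True) from a machine with the Azure CLI installed and logged in, once you have tested it against a non-production zone.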

Simple, in my opinion, is always preferred.


Conclusion

This is one of those areas of Azure architecture where you feel a little “damned if you do, damned if you don’t”.

The multiple-zone approach is a great solution, but the complexity and overheads added are a pain. The simplicity of a single global architecture is much easier to swallow but can result in sub-optimal traffic paths and the need for additional steps during failures, which is not ideal.

I’d strongly recommend reading the article from Adam Stuart mentioned at the beginning of this article (which you can find here), as well as the Microsoft docs. Lab it out, understand it, and make an informed decision based on your own requirements.

As always, feel free to drop me a message if you have any questions!