Microsoft added two more Azure reference architectures (RAs) this month to the ones already published: one for SAP and one for SharePoint. The one for SharePoint is quite extensive; it is a SharePoint (SP) 2016 high-availability (HA) farm using a SQL Server Always On cluster.
They also made the ARM template available on GitHub so you can easily deploy it to Azure.
I was asked by several customers if the SP design was good enough for production and decided to share what I would change about it publicly.
As an architect, I often get my work criticized, questioned, even fibbed about, and most of the time I simply smile and nod. Hopefully the team at Microsoft that put this together does the same when they read this.
SharePoint Reference Architecture for Azure
Figure 1 below is the diagram included with the reference architecture for the SP 2016 farm. Notice that the gateway subnet is the entry and exit point for all traffic. The design could also have used an ExpressRoute connection, with an ExpressRoute gateway in place of the VPN gateway.
Side note: if you want the Visio version of the diagram, I created one to match the design and put it on GitHub. The ones marked V2 are my designs.
Some things of interest
- There are two read-only domain controllers running in Azure to authenticate the SharePoint services, because this is not yet supported in Azure AD Domain Services. These have a trust with the on-premises forest.
- This architecture requires 38 cores minimum.
- This architecture uses SQL Server VMs instead of Azure SQL Database as SP 2016 does not support it.
- Always On Availability Groups are used with SQL Server for HA and recovery.
Figure 1 SharePoint 2016 in Azure Diagram
SharePoint Reference Architecture for Azure v2
The thing that most stood out to me was that there is no DMZ with network virtual appliances (NVAs) inspecting traffic. I have customers that run this way today: they use ExpressRoute and feel that it is adequate, so they do not inspect traffic until it gets to their data center.
Debates about security are a lot of fun and often based only on hypotheticals, which are difficult to disprove until they are no longer hypothetical, and I will just leave it at that.
If you haven’t read it, this article is a great glimpse into some of the security, monitoring, and protection in Azure: “How Azure Security Center aids in detecting good applications being used maliciously,” by Sajva Halverson, Cloud Security Investigations & Intelligence.
I would not recommend skipping traffic inspection at the perimeter of Azure to any customer running servers in the cloud that contain data, so perimeter inspection is the main thing I would add to this design.
And to be fair, Microsoft kind of suggests this as an option, but not in any real detail.
So how would I improve things?
Shown in Figure 2 is an example of what I would use for securing data between your on-premises data center and Azure, using two DMZs to protect a typical three-tier workload.
There are two private DMZs through which all inbound and outbound traffic must pass. Within the DMZs are two NVAs in an availability set inspecting all traffic, and each DMZ has a network security group (NSG) to control traffic as well. The load balancer in front of the NVAs also plays a role, because it ignores traffic on ports that have no load-balancing rule.
If there were a requirement to allow traffic from the internet, there would be two public DMZs used in the same way.
Figure 2 Architecture Diagram for Azure DMZ
Securing all traffic
This section describes best practices for implementing a secure hybrid network that extends an on-premises network to Azure.
This reference architecture implements a DMZ between an on-premises network and an Azure virtual network.
The DMZ includes highly available network virtual appliances (NVAs) that implement security functionality such as firewalls and packet inspection. All outgoing traffic from the VNet is force-tunneled to the on-premises network through a VPN or ExpressRoute connection, where it can be audited again.
Typical scenarios where you would make use of this force-tunneled design include:
- Hybrid apps, where parts of the workload run in Azure and other parts run on-premises.
- Regulatory compliance and auditing.
- Preventing information disclosure.
- Auditing of application communications.
- Internal requirements.
You can also decide per subnet whether force tunneling is enabled, even within the same VNet.
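To make the per-subnet choice concrete, here is a minimal sketch in Python. It is a simplified model, not Azure's API: the `Subnet` class, the subnet names, and the address prefixes are all made up for illustration. It assumes force tunneling is expressed as a default route (0.0.0.0/0) whose next hop is the virtual network gateway.

```python
from dataclasses import dataclass, field

DEFAULT_ROUTE = "0.0.0.0/0"

@dataclass
class Subnet:
    # Hypothetical model of a subnet and its user-defined routes.
    name: str
    prefix: str
    # Maps an address prefix to a next-hop type; empty means system routes only.
    routes: dict = field(default_factory=dict)

def enable_force_tunneling(subnet):
    """Point the subnet's default route at the VNet gateway instead of the internet."""
    subnet.routes[DEFAULT_ROUTE] = "VirtualNetworkGateway"

def is_force_tunneled(subnet):
    """True when the subnet's default route leaves through the VNet gateway."""
    return subnet.routes.get(DEFAULT_ROUTE) == "VirtualNetworkGateway"

# Two subnets in the same VNet: only the data tier is force-tunneled here.
web = Subnet("web-tier", "10.0.1.0/24")
data = Subnet("data-tier", "10.0.3.0/24")
enable_force_tunneling(data)

for subnet in (web, data):
    print(subnet.name, "force-tunneled:", is_force_tunneled(subnet))
```

The point of the sketch is only that the setting lives on the subnet's routes, so two subnets in one VNet can behave differently.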
DMZs for isolation of traffic
In Figure 2, there are two private DMZs, one for inbound, one for outbound, and all traffic passes through them.
Also within these two DMZs are two NVAs in an availability set. I will cover the NVAs in more depth later, but NVA is a generic term for appliances that can perform many different functions.
Figure 3 shows the two DMZs broken out of the standard three-tier reference architecture.
Figure 3 Private and Public DMZs in Azure
Recommendations for DMZs
- Include an inbound and an outbound DMZ, placed in front of all other workloads within the VNet.
- If you allow inbound traffic directly from the internet, build separate public DMZs with their own NVAs.
- Place the NVAs in an availability set.
- On the load balancer, configure rules only for the protocols and ports you intend to accept, so that traffic sent to any other port is dropped. For example, allow only HTTP on port 80 and only HTTPS on port 443.
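The load balancer recommendation can be sketched as a lookup table. This is a toy model in Python, not a real load balancer configuration; the rule table and the pool names are hypothetical.

```python
# Hypothetical load-balancing rules: only these (protocol, port) pairs are forwarded.
LB_RULES = {
    ("tcp", 80): "http-backend-pool",
    ("tcp", 443): "https-backend-pool",
}

def forward(protocol, port):
    """Return the backend pool for a packet, or None when no rule matches.

    With no matching rule the packet is not forwarded, so the NVAs
    behind the load balancer never see it.
    """
    return LB_RULES.get((protocol.lower(), port))

print(forward("TCP", 443))   # matches the HTTPS rule
print(forward("TCP", 3389))  # no rule for RDP, so nothing is forwarded
```

Anything you have not explicitly listed falls through to "not forwarded", which is the behavior you want from a perimeter device.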
As mentioned earlier, NVA is a generic term that covers a number of different functions, but essentially an NVA provides a service for managing and monitoring network traffic.
Recommendations for NVAs
Implement your NVAs on VMs with the following functionality:
- Traffic is routed using IP forwarding on the NVA network interfaces (NICs).
- Traffic is permitted to pass through the NVA only if it is appropriate to do so. Inbound traffic arrives on one network interface and outbound traffic on another.
- The NVAs can only be configured from the management subnet.
- The VMs for the NVA are placed in an availability set behind a load balancer.
- Include a layer-7 NVA to terminate application connections at the NVA level and maintain affinity with the backend tiers. This guarantees symmetric connectivity, in which response traffic from the backend tiers returns through the NVA.
- Do not use the same set of NVAs for traffic inbound from the internet and traffic inbound from your data centers. (In case you missed this the first 3 times I said it.) Using the same NVAs for both types of traffic introduces a security risk by not providing a secure perimeter between the two types of traffic.
Routing all on-premises user requests through the NVA
The user-defined route (UDR) in the gateway subnet blocks all user requests other than those received from on-premises. The UDR passes allowed requests to the NVAs in the private DMZ subnet, and these requests are passed on to the application if they are allowed by the NVA rules.
You can add other routes to the UDR, but make sure they don’t inadvertently bypass the NVAs or block administrative traffic intended for the management subnet.
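One way to catch a route that bypasses the NVAs is to check which route actually wins for a given destination. Azure selects routes by longest prefix match, so a more specific route added later silently overrides a broader one. The following Python sketch models that check; the route table, addresses, and next-hop labels are all assumptions made for illustration.

```python
import ipaddress

# Hypothetical route table for the gateway subnet: (address prefix, next hop).
NVA = "nva-load-balancer"
ROUTES = [
    ("10.0.1.0/24", NVA),  # web tier
    ("10.0.2.0/24", NVA),  # business tier
    ("10.0.3.0/24", NVA),  # data tier
]

def next_hop(routes, destination):
    """Pick the most specific route that contains the destination address."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in routes
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    # Longest prefix wins.
    return max(matches, key=lambda match: match[0].prefixlen)[1]

def bypasses_nva(routes, destination):
    """True when traffic to the destination would not be sent to the NVAs."""
    return next_hop(routes, destination) != NVA

# A /28 route added later is more specific than the /24 and takes over.
risky = ROUTES + [("10.0.3.0/28", "VnetLocal")]
print(bypasses_nva(ROUTES, "10.0.3.5"))
print(bypasses_nva(risky, "10.0.3.5"))
```

Running a check like this against every tier's address range before applying a route change is a cheap way to avoid opening a hole around the NVAs.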
As mentioned in the list of recommendations above, the load balancer in front of the NVAs also acts as a security device by ignoring traffic on ports that are not open in the load balancing rules.
Another option to consider is connecting multiple NVAs in series (see Figure 4), with each NVA performing a specialized security task. This allows each security function to be managed on a per-NVA basis.
Figure 4 NVAs in Series
Recommendations for NSGs
- Block all traffic not originating from the on-premises network using an NSG rule for the inbound NVA subnet. This matters because an Azure VPN gateway exposes a public IP address for the connection to the on-premises network.
- Create NSGs for each subnet to provide a second level of protection against inbound traffic bypassing an incorrectly configured or disabled NVA.
- Use NSGs to block or pass traffic between application tiers.
These are very basic recommendations for NSGs and are specific to the architecture recommended here. Additional consideration should be given to NSGs in the context of the entire infrastructure.
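As a rough illustration of the first recommendation, the Python sketch below evaluates NSG-style rules in priority order, where the lowest number is checked first and the first match decides. It is a simplified model: the on-premises address range, the rule set, and the deny-by-default fallback are assumptions, and real NSGs also carry default rules that are not modeled here.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    priority: int         # lower number is evaluated first
    source: str           # CIDR prefix the rule applies to
    port: Optional[int]   # None means any port
    allow: bool

# Hypothetical on-premises range and rules for the inbound NVA subnet.
ON_PREM = "192.168.0.0/16"
INBOUND_NVA_RULES = [
    Rule(100, ON_PREM, 443, True),
    Rule(4096, "0.0.0.0/0", None, False),  # explicit deny for everything else
]

def is_allowed(rules, source_ip, port):
    """Return the decision of the first matching rule, in priority order."""
    src = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        in_source = src in ipaddress.ip_network(rule.source)
        port_ok = rule.port is None or rule.port == port
        if in_source and port_ok:
            return rule.allow
    # Assumption for this sketch: no match means deny.
    return False

print(is_allowed(INBOUND_NVA_RULES, "192.168.10.4", 443))
print(is_allowed(INBOUND_NVA_RULES, "203.0.113.7", 443))
```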
- The first role is typically your cloud administrator.
- The second role should be for your security administrators. These are the team members who build, configure, test, and deploy the NVAs, NSGs, and the various resources that manage the security of your cloud.
- It makes sense to add a SharePoint administrator or DevOps role as well, allowing them to administer the infrastructure for SharePoint and the application components and to restart the VMs.
Depending on your SQL security policies, management of the SQL Server Always On cluster may require a fourth custom role.
Recommendations for RBAC
- When granting permissions, use the principle of least privilege.
- Log all administrative operations and perform regular audits to ensure any configuration changes were planned.
- Create a Cloud Administrator role.
- Create a Security Administrators role.
- Create a DevOps role.
You can further extend the security of your resources by using RBAC with your resource groups.
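The three roles could be sketched as lists of allowed operations, with a helper that checks whether a role permits a given operation. This is a Python illustration of least privilege, not the role definitions from the reference architecture: the operation lists are my own assumptions, and a real custom role would need a fuller set of actions.

```python
from fnmatch import fnmatchcase

# Illustrative action lists only; each role gets just what its job needs.
ROLES = {
    "Cloud Administrator": [
        "Microsoft.Network/virtualNetworks/*",
        "Microsoft.Network/virtualNetworkGateways/*",
    ],
    "Security Administrators": [
        "Microsoft.Network/networkSecurityGroups/*",
        "Microsoft.Network/routeTables/*",
    ],
    "DevOps": [
        "Microsoft.Compute/virtualMachines/read",
        "Microsoft.Compute/virtualMachines/start/action",
        "Microsoft.Compute/virtualMachines/restart/action",
    ],
}

def can(role, operation):
    """True when any of the role's action patterns matches the operation.

    Matching is case-insensitive and '*' is treated as a wildcard.
    """
    op = operation.lower()
    return any(fnmatchcase(op, pattern.lower()) for pattern in ROLES[role])

print(can("DevOps", "Microsoft.Compute/virtualMachines/restart/action"))
print(can("DevOps", "Microsoft.Compute/virtualMachines/delete"))
```

Note that the DevOps role here can restart a VM but cannot delete one or touch the NSGs, which is the separation the recommendations are after.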
Using Resource Groups
Recommendations for Resource Groups
- Create a resource group containing the subnets (excluding the VMs), NSGs, and the gateway resources for connecting to the on-premises network. Assign the cloud administrator role to this resource group.
- Create a resource group containing the VMs for the NVAs (including the load balancer), the jumpbox and other management VMs, and the UDR for the gateway subnet that forces all traffic through the NVAs. Assign the security administrator role to this resource group.
- Create a separate resource group for each application or workload tier, containing that tier’s load balancer and VMs. These resource groups shouldn’t include the subnets for each tier. Assign the DevOps role to these resource groups.
See Naming Conventions in Azure for the proper naming of resource groups.
Azure provides a feature called locks that prevents users from accidentally deleting or modifying critical resources. There are two lock levels: CanNotDelete, which lets authorized users read and modify a resource but not delete it, and ReadOnly, which lets them read a resource but not delete or update it. Only an Owner or User Access Administrator can create and delete locks.
Because locks apply to operations at the management plane, they don’t prevent resources from performing their own functions. Changes are prohibited, but operations are not.
It is important to understand the impact of setting locks and to select the correct lock level. A ReadOnly lock can block operations that look like reads but are carried out as write requests at the management plane; for example, listing the keys of a storage account is handled as a POST request, so a ReadOnly lock blocks it. Be sure you understand the restrictions and have taken the time to test the lock before deploying it to a production cloud.
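The difference between the two lock levels, CanNotDelete and ReadOnly, can be modeled in a few lines of Python. The lock level names are Azure's; the mapping from HTTP method to read, write, or delete is a simplification I am assuming for the sketch.

```python
def is_blocked(lock_level, http_method):
    """Model how a lock affects a management-plane request.

    Assumption for this sketch: GET is a read, DELETE is a delete,
    and every other method (PUT, PATCH, POST) counts as a write.
    """
    method = http_method.upper()
    if lock_level is None or method == "GET":
        return False
    if lock_level == "CanNotDelete":
        return method == "DELETE"
    if lock_level == "ReadOnly":
        return True
    raise ValueError(f"unknown lock level: {lock_level}")

for level in ("CanNotDelete", "ReadOnly"):
    for method in ("GET", "PUT", "POST", "DELETE"):
        print(level, method, "blocked" if is_blocked(level, method) else "allowed")
```

The POST row under ReadOnly is the one that tends to surprise people, which is why testing a lock before production matters.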
Recommendations for Locks
- Use locks on the resource groups recommended earlier in this guide to prevent accidental deletions of, and attacks on, critical services.
- Develop a policy for applying locks to new resources deployed to Azure.
- Use locks to prevent changes to the network in Azure.
JIT and the Management Subnet
Just-in-time (JIT) VM access allows users who have previously been given rights to a VM to request access to that VM for a limited period of time. Once that time is up, the user can no longer access the VM, and Azure uses NSGs to block traffic to the VM’s ports.
While this is in public preview, planning to incorporate it should begin now. This feature lowers the risk of a VM being successfully attacked and compromised. That matters most for the VM in the management subnet, which has direct access to all subnets: if it is compromised, an attacker could gain access to all Azure resources.
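The JIT behavior described above comes down to a time-boxed grant. Here is a minimal Python model of that idea; the `JitAccess` class, the user names, and the three-hour cap are hypothetical and do not represent the Azure Security Center API.

```python
from datetime import datetime, timedelta

class JitAccess:
    """Toy model of just-in-time access to one VM's management port."""

    def __init__(self, eligible_users, max_duration=timedelta(hours=3)):
        # Only users who already hold rights to the VM may request access.
        self.eligible_users = set(eligible_users)
        self.max_duration = max_duration  # assumed cap for this sketch
        self.grants = {}  # user -> time at which access expires

    def request(self, user, duration, now):
        """Open access for the user, capped at the maximum duration."""
        if user not in self.eligible_users:
            raise PermissionError(f"{user} has no rights to this VM")
        self.grants[user] = now + min(duration, self.max_duration)

    def port_open(self, user, now):
        """True only while the user's grant has not yet expired."""
        expiry = self.grants.get(user)
        return expiry is not None and now < expiry

start = datetime(2017, 10, 1, 9, 0)
jit = JitAccess({"alice"})
jit.request("alice", timedelta(hours=1), start)
print(jit.port_open("alice", start + timedelta(minutes=30)))
print(jit.port_open("alice", start + timedelta(hours=2)))
```

Outside the granted window the port is simply closed, so a stolen credential is only useful for as long as a grant is live.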
Additional important notes on the Management subnet and its Jumpbox.
Recommendations for JIT and the Management Subnet
- Only allow execution of management tasks on the Jumpbox.
- Do not add another VM to the subnet.
- Do not add an appliance to the management subnet.
- Do not create a public IP for the Jumpbox. Traffic to the Jumpbox should come through the DMZ and NVAs never directly from the internet.
- Create NSG rules so the management subnet only responds to requests from the allowed route.
Recommendations for Internet Traffic
- Force-tunnel all outbound Internet traffic through your on-premises network using the site-to-site VPN tunnel or ExpressRoute, and route to the Internet using network address translation (NAT). This prevents accidental leakage of any confidential information stored in your data tier and allows inspection and auditing of all outgoing traffic.
- Do not block all Internet traffic. Doing so will break VM diagnostic logging, VM extension downloads, Azure diagnostics on storage accounts, OMS, and other PaaS services.
- Once this is in place, verify that outbound Internet traffic is being force-tunneled correctly.
In the second article tomorrow, I will cover how to use Azure Event Grid and Azure Logic Apps for monitoring your NVAs as part of your plan to secure resources in Azure. The final article will cover how to modify the current reference architecture to match the design I have suggested and deploy it to Azure with a single PowerShell script.