Saturday, October 22, 2022

Problems with Benchmarking Between Test and Production

I struck an issue with a client a little while back that resulted in a lot of learnings about testing and benchmarking.

This client had a major data processing run to undertake - the sort of run which involved an army of people and took the system down for an entire weekend. To prepare for this we did several runs in a test system to check the data, the process and the timing. So far so good. All went well in testing, the date was set, and on the Friday night we completed our pre-flight tasks, fired up the data load, and headed off to bed smug in the knowledge that we had properly planned and prepared. 

Up bright and early on Saturday morning, I logged in to check progress and was hit with a sinking feeling in my stomach. The task I had expected to have finished was nowhere near complete. Thankfully this part of the load could be safely restarted and would continue from where it left off, so a restart, a quick check that things looked OK, and then time for a coffee and a de-stress.

A few checks later it was obvious the task was slowing down again, so after a nervous communication to stakeholders I started investigating why it was going so slowly, and why it was so different to our testing.

Hardware

For our testing we'd made sure to use up-to-date backups from production, so the first thought was hardware. The client runs an AWS environment with SQL Server running on EC2 instances. They use r-type instances as the databases are generally pretty small and running as much in memory as possible usually gives good results, though in this case we were seeing quite a significant disk IO slowdown on production.

Here was where we found the first difference. At first glance the instances looked the same, but on a closer look production was an r5a vs an r5 in test. The difference? At first these just appeared to be a different processor (AMD vs Intel), but on closer inspection there was also a significant difference in EBS (disk) throughput - up to 4,750 Mbps for the r5 vs 2,880 Mbps for the r5a. These servers had been built before I started with the client, so I didn't know why the particular types had been chosen or if issues had been encountered before. A quick chat with the (very helpful) Infrastructure team and the instance type was changed, then the load started again.

This time I was watching the load closely and sure enough, after a couple of hours I could see tempdb was getting hammered and disk latency was significantly elevated. I'd struck this before in AWS, and the answer was to put tempdb on local storage. This meant another instance type change, this time to an r5d, which comes with local NVMe instance storage. So we changed the instance, moved tempdb to the new local drive and we were underway again.
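
For reference, the tempdb move itself is just a file relocation followed by a service restart. A minimal sketch, assuming the new local volume is mounted as T: with a tempdb folder already created, the default tempdb logical file names, and the SqlServer PowerShell module installed:

# Point the tempdb files at the local instance storage; the change takes effect at the next service start
$moveTempdb = @"
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = 'T:\tempdb\tempdb.mdf');
ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'T:\tempdb\templog.ldf');
"@
Invoke-Sqlcmd -ServerInstance "." -Query $moveTempdb

# Restart the database engine so the new file locations are picked up
Restart-Service -Name MSSQLSERVER -Force

One thing to keep in mind with this approach is that instance store volumes are ephemeral, so creating the tempdb folder usually needs to be scripted to run at boot.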

By this time it was getting late, so a quick check that things were humming along then off to bed.

Code? Versions? What Else?

By the next morning it was obvious that things weren't going great, and it looked like we were going to need to pull the plug and reschedule the outage once we had identified and fixed the issue. In the remaining time I decided to run an Extended Events session to capture what was happening, and dig around a bit more on the servers.

Two things showed up. First, it looked like we had a severe case of parameter sniffing, and second, we were running different editions of SQL Server. Test was running Developer Edition, which is essentially a free version of Enterprise Edition for development use. Production was Standard Edition. (I don't want to get into the rights and wrongs of running Developer Edition on test instances, but let's just say that new test builds now run Standard Edition, the same as production.)

Now I wasn't going to change production to Developer or Enterprise Edition just for this one load, and investigating the potential parameter sniffing was going to take a while. So we cancelled the load, rolled back the databases and prepared for the hard questions sure to be coming on Monday morning.

Fixes, and the Joy of Clones

Come Monday morning I had 2 things in my favour: a very understanding client, and a Cloud environment. Though there was no doubt that it was my job to get this sorted out before the coming weekend. No pressure.

First course of action was another talk with the awesome Infra guys, who happily stood up a clone of production, isolated from core infrastructure such as Active Directory. Now I had an exact copy of production to start investigating and trialling loads on, meaning I could start a load and tweak my Extended Events trace to capture exactly the data I needed.

Now the codebase for this load is legacy to the max. Name a SQL anti-pattern or bad practice and you'll probably find it implemented somewhere in this code. Stored procs running cursors that call stored procs in other databases that run more cursors, tempdb used for permanent tables, procs that rely on session temp tables not created in the proc, and on, and on. Oh, and did I mention the main developer who had taken over maintaining this had left? And any changes he'd made had just been tacked onto the existing code as he was, understandably, too scared to change much of the original code. Such fun.

Analysing event traces can be tricky and tedious. I like to grab start and end statements which tends to result in quite large traces, but sorting by things like duration, logical reads and cpu lets you narrow down where the problem area is. I know you can query the trace, but I've never had much success with this approach, or maybe I just prefer the visual elements.
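
For what it's worth, the sort of session I set up is sketched below. The session name, target path and event list are illustrative rather than exactly what was used here, and on a busy server a duration or database filter is worth adding, since statement-level events generate a lot of data (again this assumes the SqlServer PowerShell module):

$createTrace = @"
CREATE EVENT SESSION [LoadTrace] ON SERVER
    -- start and end events for statements inside procs, plus ad hoc statement completions
    ADD EVENT sqlserver.sp_statement_starting(ACTION(sqlserver.sql_text, sqlserver.database_name)),
    ADD EVENT sqlserver.sp_statement_completed(ACTION(sqlserver.sql_text, sqlserver.database_name)),
    ADD EVENT sqlserver.sql_statement_completed(ACTION(sqlserver.sql_text, sqlserver.database_name))
    ADD TARGET package0.event_file(SET filename = N'T:\Traces\LoadTrace.xel')
    WITH (MAX_DISPATCH_LATENCY = 5 SECONDS, EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS);
ALTER EVENT SESSION [LoadTrace] ON SERVER STATE = START;
"@
Invoke-Sqlcmd -ServerInstance "." -Query $createTrace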

In this case, I noticed that one stored procedure had a large variation in duration times from call to call. This particular stored procedure was called from within a cursor in a separate stored procedure, and was located in a different database to the one where the load was happening. This is where adding in all the extra events paid off, and I was able to narrow it down to the exact statement that had the most variation. 

Rewriting the code wasn't an option - it would've taken longer than a week just to work out the logic that was intended, so instead I added an OPTION(RECOMPILE) clause to the guilty statement. This looked promising, so it was a restore, apply the fix, and start another load to check, and thankfully the full load finished in only 4 hours.
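
To show the shape of the change (every object and parameter name below is invented; the real proc was far messier), the fix amounted to appending the hint to the one problem statement:

# Hypothetical illustration only: re-deploy the proc with OPTION (RECOMPILE) on the offending statement
$fix = @"
ALTER PROCEDURE dbo.usp_GetBatchRows @CustomerId int
AS
BEGIN
    -- This statement's plan varied wildly depending on the sniffed parameter value;
    -- OPTION (RECOMPILE) makes the optimiser build a fresh plan on every execution
    SELECT r.RowId, r.Amount
    FROM dbo.BatchRows AS r
    WHERE r.CustomerId = @CustomerId
    OPTION (RECOMPILE);
END;
"@
Invoke-Sqlcmd -ServerInstance "." -Database "LegacyLoadDB" -Query $fix

Recompiling on every call costs some CPU, but in this case it was a far better trade than reusing a badly sniffed plan.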

The production load process kicked off on the new go-live weekend, and this time the load finished in just under 4 hours, the fastest time in living memory.

Learnings

So, what were the learnings?

  1. If doing benchmarking, make sure the test and production systems are identical. For this client we now use cloned environments from production so we can be absolutely sure that everything is the same
  2. There is a big difference between Developer (Enterprise) and Standard editions of SQL Server internally - more so than just the added features you tend to associate with Enterprise. This is another good reason not to use Developer Edition in test, and to be careful when using it in dev work
  3. Parameter sniffing is real, and it can really hurt you. In this case fixing it was the biggest reason for the process dropping from over 48 hours down to 4
  4. Extended Events can be a lifesaver

A huge benefit during this was having supportive people in other areas of the business, all looking to get the same outcome. Without doubt, if I hadn't had the support of the Cloud Infrastructure team to make instance changes and stand up clone environments then this would've been a lot more difficult. While some of this was due to the people themselves, it's also a good reminder as a DBA to not be a difficult ass. Being able to properly look after a SQL Server environment while also maintaining good relations with the Infrastructure, Security, Storage, and Architecture teams, as well as the other parts of the business, takes a fair amount of work, and the occasional tongue biting, but pays back many times over when times get tough. And at some point, times will get tough.


Saturday, October 15, 2022

Group Managed Service Accounts

Choosing the service accounts to run the SQL Server services under is something to consider when doing an install. Of late I've taken to using domain accounts called Group Managed Service Accounts (gMSAs). These combine the security advantages of the local virtual accounts with the resource and management advantages of regular Active Directory accounts, in that they:
  • have a system managed password
  • undergo regular password changes with no downtime
  • can be granted access to network resources, e.g. file shares, other SQL Servers, etc.
  • can be added to domain level group policy objects

There are some disadvantages, though these are only really encountered in the setup phase, with some particular requirements and extra steps needed to get them up and running. 

Through the rest of this post I'll outline how to create these accounts and the specific requirements and actions needed to get them up and running, including a few gotchas that caught me out in the early stages of using them.

Step 1 - Creating the AD account

First step is to create the account in Active Directory, which would usually be done by one of the AD Admins, but there are some specific requirements for the setup which they'll need to consider.

The basic steps are:

  1. Create a group that will have permission to use the service account (see the example after this list)
  2. Create a new account as a group managed service account with:
    1. the name of the AD group containing the computer accounts that are allowed to use the service account
    2. the accepted Kerberos encryption types. This is very important for environments where the default allowed ciphers have been changed, usually by removing the RC4 cipher (which is a good idea)
  3. Grant permission for the service account to self-register SPNs
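
Step 1 is just a regular AD security group with the relevant SQL Server computer accounts as members. As a sketch, with purely illustrative names:

# Group of computers that will be allowed to retrieve the gMSA password
New-ADGroup -Name "SQLServerGMSAHosts" -GroupScope Global -GroupCategory Security
# Add the SQL Server's computer account to the group
Add-ADGroupMember -Identity "SQLServerGMSAHosts" -Members (Get-ADComputer "SQLSERVER01")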

The account creation (step 2) can then be done in one PowerShell command, with a second command (step 3) to grant SPN registration. To create the account, use the format:

New-ADServiceAccount -Name <service account name> -DNSHostName <fully qualified service account name> -KerberosEncryptionType AES128, AES256 -PrincipalsAllowedToRetrieveManagedPassword <AD group name>
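
E.g., with purely illustrative names (the group being the one created in the step 1 sketch above):

New-ADServiceAccount -Name gmsaSQLSvc -DNSHostName gmsaSQLSvc.example.com -KerberosEncryptionType AES128, AES256 -PrincipalsAllowedToRetrieveManagedPassword SQLServerGMSAHosts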
 
Note that if the KerberosEncryptionType values don't match what's configured in your environment then the account will still be created properly, but it will cause issues when trying to configure it on the server. If you know that your environment is using all cipher types then this parameter can be left out, but the majority of environments I've worked in lately are disabling the RC4 cipher.
 
Service Principal Name (SPN) registration of the service and account is vital if you want to have Kerberos authentication ... and you do want Kerberos working. SPN registration can be done when the account is created, but it's better to give the account permission to self-register SPNs, particularly if the account will be used across multiple SQL Servers. SQL Server will then register the correct SPNs on service start. Note that this is only needed for the service account that runs the SQL Server service, not for the agent or other services.

The command to allow the account to self-register SPNs is (use the correct CN name for the service):

dsacls "CN=<service account cn>" /G "SELF:RPWP;servicePrincipalName"

E.g.

dsacls "CN=SQLgMSAtestSQLServerService,CN=Managed Service Accounts,DC=example,DC=com" /G "SELF:RPWP;servicePrincipalName"

For more information on the New-ADServiceAccount cmdlet, check out https://learn.microsoft.com/en-us/powershell/module/activedirectory/new-adserviceaccount
 

Step 2 - Configuring the Server

Once the account is created there's a bit of configuration required on the SQL Server.

Prerequisites

Prior to starting configuration we need to have the PowerShell Active Directory cmdlets installed. This can be done through Server Manager, or from an elevated PowerShell session (right click, Run as Administrator) with the command

Install-WindowsFeature RSAT-AD-PowerShell

We also need the outbound firewall/security group port from the SQL Server to the Active Directory Web Services open. This is port 9389 unless specifically changed.
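
A quick way to confirm connectivity from the SQL Server itself (substitute the name of one of your domain controllers) is:

Test-NetConnection -ComputerName dc01.example.com -Port 9389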

And lastly, the server needs to be a member of the group that has permission to retrieve the managed password. Before the group membership can take effect, either the server should be rebooted or the system Kerberos tickets cleared using the command
klist -li 0x3e7 purge

Installing the Service Accounts

Next, we need to install the service accounts on the server. To do this we'll need to log on to the server as an administrator, then run the following command in an elevated PowerShell session:

Install-ADServiceAccount <accountname> 

We can test the service account install by running the following:

Test-ADServiceAccount <accountname>

Configure the Services

Once the service account is installed, the SQL Server service can be configured to use it as the log on account. This is done in the service properties in SQL Server Configuration Manager by entering the account with a $ sign on the end (e.g. DOMAIN\accountname$) and leaving the password blank. Note: always use SQL Server Configuration Manager to do this rather than the standard Windows Services msc, as the SQL Server tool does some extra bits in the background that the standard Windows one doesn't.

Click OK and we're done. The service will restart if it's already running, and we'll see the password fields automatically filled in.

Extra Configuration

Using SQL Server Configuration Manager to make the services change means that the service accounts will automatically get added into the correct groups and SQL Server roles, but if any custom permissions had been granted to the previous service accounts, such as file system permissions, then these will need to be updated as well.

As a side note, granting Windows file system permissions is better done against the local NT Service\MSSQLSERVER and NT Service\SQLSERVERAGENT virtual accounts, as the new service accounts automatically get included in these, so future service account changes don't mean having to revisit those permissions.
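
For example, granting modify rights on a backup folder to the engine's virtual account might look something like this (the path is purely illustrative):

icacls "E:\SQLBackups" /grant "NT Service\MSSQLSERVER:(OI)(CI)M"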

Troubleshooting

  • Error trying to contact the server when trying to install the service account. This is likely due to the AD Web Services port being blocked. Try running a simple command to get your AD user details:
    Get-ADUser <your username>

  • Test-ADServiceAccount cmdlet returns false, and/or trying to start the service with the gMSA account gives a password error. This is because the computer isn't part of the group that has access to retrieve the password for the specified gMSA, or the computer hasn't picked up the membership yet. The checks below can help confirm this.
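
To check the membership from the server (the group name here is illustrative), and then force it to be picked up without a reboot:

Get-ADGroupMember -Identity SQLServerGMSAHosts | Select-Object Name
klist -li 0x3e7 purge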
