Saturday, October 22, 2022

Problems with Benchmarking Between Test and Production

I struck an issue with a client a little while back that resulted in a lot of learnings about testing and benchmarking.

This client had a major data processing run to undertake - the sort of run which involved an army of people and took the system down for an entire weekend. To prepare for this we did several runs in a test system to check the data, the process and the timing. So far so good. All went well in testing, the date was set, and on the Friday night we completed our pre-flight tasks, fired up the data load, and headed off to bed smug in the knowledge that we had properly planned and prepared. 

Up bright and early on Saturday morning, logged in to check progress and I was hit with a sinking feeling in my stomach. The task I had expected to have finished was nowhere near complete. Thankfully this part of the load could be safely restarted and would continue from where it left off, so a restart, quick check that things looked ok and then time for a coffee and de-stress.

A few checks later it was obvious the task was slowing down again, so after a nervous communication to stakeholders I started investigating why this was going so slowly, and why it was so different from our testing.

Hardware

For our testing we'd made sure to use up-to-date backups from production, so the first thought was hardware. The client runs an AWS environment with SQL Server on EC2 instances. They use r-type (memory optimised) instances as the databases are generally pretty small and running as much in memory as possible usually gives good results, though in this case we were seeing a quite significant disk IO slowdown in production.

Here was where we found the first difference. At first glance the instances looked the same, but on a closer look production was an r5a vs an r5 in test. The difference? At first glance these just appeared to use a different processor (AMD vs Intel), but on closer inspection there was also a significant difference in EBS (disk) throughput - up to 4,750 Mbps for the r5 vs 2,880 Mbps for the r5a. These servers had been built before I started with the client, so I didn't know why the particular types had been chosen or if issues had been encountered before. A quick chat with the (very helpful) Infrastructure team and the instance type was changed, then the load was started again.

This time I was watching the load closely, and sure enough, after a couple of hours I could see tempdb was getting hammered and disk latency was significantly elevated. I'd struck this before in AWS, and the answer was to put tempdb on local instance storage. This meant another instance type change, this time to an r5d. So we changed the instance, moved tempdb to the new local drive, and we were underway again.
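
For anyone wanting to do the same, moving tempdb is just a case of pointing its files at the new drive and restarting the SQL Server service. A minimal sketch, assuming the default tempdev/templog logical names and the instance store volume mounted as T: (your drive letter, paths and file count will differ):

    -- Point the tempdb files at the local (instance store) drive.
    -- Check sys.master_files for your own logical names and file layout.
    ALTER DATABASE tempdb
    MODIFY FILE (NAME = tempdev, FILENAME = N'T:\TempDB\tempdb.mdf');

    ALTER DATABASE tempdb
    MODIFY FILE (NAME = templog, FILENAME = N'T:\TempDB\templog.ldf');

    -- tempdb is recreated on the new path the next time the service restarts.

One thing to watch with instance store volumes is that they're ephemeral - the drive and folder need to exist before SQL Server starts, or the service won't come up.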

By this time it was getting late, so a quick check that things were humming along then off to bed.

Code? Versions? What Else?

By the next morning it was obvious that things weren't going great, and it looked like we were going to need to pull the plug and reschedule the outage once we had identified and fixed the issue. In the remaining time I decided to run an Extended Events session to capture what was happening, and dig around a bit more on the servers.
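
The session itself was nothing fancy. Something along these lines is roughly what I use - a sketch only, with a made-up session name and file path - capturing statement start and completed events for both ad hoc batches and stored procs, along with the statement text:

    CREATE EVENT SESSION [LoadTrace] ON SERVER
    ADD EVENT sqlserver.sp_statement_starting(
        ACTION (sqlserver.database_name, sqlserver.sql_text)),
    ADD EVENT sqlserver.sp_statement_completed(
        ACTION (sqlserver.database_name, sqlserver.sql_text)),
    ADD EVENT sqlserver.sql_statement_starting(
        ACTION (sqlserver.database_name, sqlserver.sql_text)),
    ADD EVENT sqlserver.sql_statement_completed(
        ACTION (sqlserver.database_name, sqlserver.sql_text))
    ADD TARGET package0.event_file(SET filename = N'D:\Traces\LoadTrace.xel', max_file_size = (512))
    WITH (EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS, MAX_DISPATCH_LATENCY = 30 SECONDS);

    ALTER EVENT SESSION [LoadTrace] ON SERVER STATE = START;

Be warned, statement-level events like these are verbose, so keep an eye on the size of the trace files.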

Two things showed up. First, it looked like we had a severe case of parameter sniffing; second, we were running different editions of SQL Server. Test was running Developer Edition, which is essentially a free version of Enterprise Edition licensed for development use. Production was Standard Edition. (I don't want to get into the rights and wrongs of running Developer Edition on test instances, but let's just say that new test builds now run Standard Edition, the same as production.)
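
Checking the edition and version is a one-liner, and it's worth adding to any pre-flight checklist on both systems:

    -- Confirm edition, version and patch level match between test and production
    SELECT SERVERPROPERTY('Edition')        AS Edition,
           SERVERPROPERTY('ProductVersion') AS ProductVersion,
           SERVERPROPERTY('ProductLevel')   AS ProductLevel;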

Now I wasn't going to change production to Developer or Enterprise Edition just for this one load, and investigating the potential parameter sniffing was going to take a while. So we cancelled the load, rolled back the databases and prepared for the hard questions sure to be coming on Monday morning.

Fixes, and the Joy of Clones

Come Monday morning I had two things in my favour: a very understanding client, and a cloud environment. There was no doubt, though, that it was my job to get this sorted out before the coming weekend. No pressure.

The first course of action was another talk with the awesome Infra guys, who happily stood up a clone of production, isolated from core infrastructure such as Active Directory. Now I had an exact copy of production for investigating and trialling loads, meaning I could start a load and tweak my Extended Events trace to capture exactly the data I needed.

Now the codebase for this load is legacy to the max. Name a SQL anti-pattern or bad practice and you'll probably find it implemented somewhere in this code. Stored procs running cursors that call stored procs in other databases that run more cursors, tempdb used for permanent tables, procs that rely on session temp tables not created in the proc, and on, and on. Oh, and did I mention the main developer who had taken over maintaining this had left? And any changes he'd made had just been tacked onto the existing code as he was, understandably, too scared to change much of the original code. Such fun.

Analysing event traces can be tricky and tedious. I like to capture both statement start and completed events, which tends to result in quite large traces, but sorting by things like duration, logical reads and CPU lets you narrow down where the problem area is. I know you can query the trace, but I've never had much success with this approach, or maybe I just prefer the visual elements.
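
If you do want to query the trace rather than use the viewer, the gist is reading the .xel files with sys.fn_xe_file_target_read_file and shredding the XML - something like this rough sketch, with the file path matching whatever was set in the session definition:

    -- Read the trace files and pull out the fields worth sorting on
    SELECT
        x.e.value('(event/@name)[1]', 'varchar(60)')                            AS event_name,
        x.e.value('(event/data[@name="duration"]/value)[1]', 'bigint')          AS duration_us,
        x.e.value('(event/data[@name="cpu_time"]/value)[1]', 'bigint')          AS cpu_time_us,
        x.e.value('(event/data[@name="logical_reads"]/value)[1]', 'bigint')     AS logical_reads,
        x.e.value('(event/action[@name="sql_text"]/value)[1]', 'nvarchar(max)') AS sql_text
    FROM sys.fn_xe_file_target_read_file(N'D:\Traces\LoadTrace*.xel', NULL, NULL, NULL) AS f
    CROSS APPLY (SELECT CAST(f.event_data AS XML)) AS x(e)
    ORDER BY duration_us DESC;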

In this case, I noticed that one stored procedure had a large variation in duration times from call to call. This particular stored procedure was called from within a cursor in a separate stored procedure, and was located in a different database to the one where the load was happening. This is where adding in all the extra events paid off, and I was able to narrow it down to the exact statement that had the most variation. 

Rewriting the code wasn't an option - it would've taken longer than a week just to work out the intended logic - so instead I added an OPTION (RECOMPILE) hint to the guilty statement. This looked promising, so it was a case of restore, apply the fix, and start another load to check; thankfully the full load finished in only 4 hours.
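
For the record, the fix was nothing more sophisticated than tacking a hint onto the end of the statement. The table and parameter names below are made up - the real statement is buried deep in that legacy code - but the shape was roughly this:

    -- Recompile this statement on every execution so the plan always suits
    -- the current parameter values, at the cost of a compile each time
    UPDATE dbo.LoadStaging
    SET    ProcessedFlag = 1
    WHERE  BatchId = @BatchId
    OPTION (RECOMPILE);

OPTION (RECOMPILE) trades a little CPU on every call for a plan that fits the values actually being passed in, which is exactly the trade you want when the same statement runs over wildly different row counts.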

The production load process kicked off on the new go-live weekend, and this time the load finished in just under 4 hours - the fastest time in living memory.

Learnings

So, what were the learnings?

  1. If you're benchmarking, make sure the test and production systems are identical. For this client we now use environments cloned from production so we can be absolutely sure that everything is the same
  2. There is a big difference internally between Developer (Enterprise) and Standard editions of SQL Server - more than just the added features you tend to associate with Enterprise. This is another good reason not to use Developer Edition in test, and to be careful when using it in dev work
  3. Parameter sniffing is real, and it can really hurt you. In this case fixing it was the biggest reason the process dropped from over 48 hours down to 4
  4. Extended Events can be a life saver

A huge benefit during this was having supportive people in other areas of the business, all looking to get the same outcome. Without doubt, if I hadn't had the support of the Cloud Infrastructure team to make instance changes and stand up clone environments then this would've been a lot more difficult. While some of this was due to the people themselves, it's also a good reminder as a DBA to not be a difficult ass. Being able to properly look after a SQL Server environment while also maintaining good relations with the Infrastructure, Security, Storage, and Architecture teams, as well as the other parts of the business, takes a fair amount of work, and the occasional tongue biting, but pays back many times over when times get tough. And at some point, times will get tough.

