21 Apr 2024

Replace a SQL WHILE Loop with Set Based Logic in T-SQL

 

Problem

Ask most T-SQL pros about their thoughts on WHILE loops or cursors, and you'll get a rant about the pitfalls and advice to avoid them. This advice is great, but hard to follow in practice when row-by-row operations plague your environment. When a DEV team creates an app, a WHILE loop might be their go-to solution if they come from a C# or C++ background. It's like the old saying that everything looks like a nail when your only tool is a hammer. WHILE loops work until they don't. Performance is painful when you start working with larger datasets, and the iterations keep growing. What can you do when your WHILE loop takes longer to process? Try your best to avoid them.


Solution

This article explores a few techniques to replace a pesky WHILE loop that was fast when it was new but slowed over time. I pieced most of these steps together from other articles on MSSQLTips.com. I'm not only a contributor but also a reader. While creating this solution, I went down a few dead ends only to learn what wouldn't work. In the end, some advice from a fellow author saved the day. After finishing this article, I hope you can use it to help solve a similar problem.

WHILE Loops

What exactly is a WHILE loop used for? In a WHILE loop, you set a condition and repeat an action until satisfying that condition. For example, you might declare a variable, set the value to a specific count based on the number of rows, and then perform a statement until all rows are processed. I've included a simple example below.

DECLARE @count INT = 10;

WHILE (@count > 0)
BEGIN

    PRINT @count;

    SET @count = @count - 1;

END;

Results:

10
9
8
7
6
5
4
3
2
1

Developers often use WHILE loops when they insert data and need to retrieve new identity values of a table to perform another operation. Sometimes, you can work around this with the OUTPUT clause. In this article, Daniel Hutmacher outlines using the OUTPUT clause of a MERGE as a solution. I've suggested this method dozens of times. Another area where WHILE loops pop up is manipulating strings.

Secret Codes

For a moment, imagine two tables. The first one is called SecretCodes (pictured below), and this table only contains two columns, an integer, and a huge string (VARCHAR(MAX)).

SecretCodes table

Inside the string column lives a sequence of numbers and codes separated by a single space. I've included an illustration below. The length of the code column ranges from 2,000 to 100,000 characters, hence the need for a VARCHAR(MAX) data type.

List of Secret Codes

The image above highlights three strings that begin with #A, padded zeros, and finally, a number. This string is always nine characters long. The pattern varies for codes inserted between every integer, but it works for our example.

This code is where our second table comes into play. Our CodeReference table contains two columns: one is a list of codes (#A0000), and the other is the corresponding integer values. In the screenshot below, code #A0008198 (minus the # sign) has a value of 55662.

CodeReference Table

An important point to remember is that the CodeValue column changes every few days. Also, the business might decide to add new codes or delete existing ones.

In this example, we use the WHILE loop to update the SecretCode table and replace all the codes with their corresponding value from the CodeReference table. Let's put together a dataset and look at this in action.

Building Our Dataset

As mentioned previously, our end goal is to create two tables. The first holds the encrypted codes, and the other contains the actual referenced values. To reach our final destination, I'll first create a staging table.

USE [master];
GO

IF DATABASEPROPERTYEX('SecretCodeDemo', 'Version') IS NOT NULL
BEGIN
    ALTER DATABASE SecretCodeDemo SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DROP DATABASE SecretCodeDemo;
END;
GO

CREATE DATABASE SecretCodeDemo;
GO

ALTER DATABASE SecretCodeDemo SET RECOVERY SIMPLE;
GO

USE SecretCodeDemo;
GO

CREATE TABLE dbo.StageSecretCodes
(
    Id INT NOT NULL,
    RowNumber INT NOT NULL IDENTITY(1, 1),
    Code NVARCHAR(MAX) NOT NULL,
    CodeRef NVARCHAR(255) NULL,
    OriginalCode INT NULL,
);

-- I want the numbers to be distinct.
WITH BaseData
AS (SELECT TOP 25000
           ABS(CHECKSUM(NEWID()) % 60000) + 1 AS Code
    FROM sys.all_columns AS n1
        CROSS JOIN sys.all_columns AS n2),
     BaseDataDistinct
AS (SELECT DISTINCT TOP 20000
           Code
    FROM BaseData)
INSERT INTO dbo.StageSecretCodes
(
    Id,
    Code,
    OriginalCode
)
SELECT 1,
       Code,
       Code
FROM BaseDataDistinct;

UPDATE dbo.StageSecretCodes
SET CodeRef = CONCAT('#A', FORMAT(RowNumber, '0000000')),
    Code = CONCAT('#A', FORMAT(RowNumber, '0000000'))
WHERE RowNumber % 2 = 0;
GO

Populating Our Tables

Since we created and populated the staging table with values, let's create the two main tables. We'll focus on them for the rest of the article.

CREATE TABLE dbo.CodeReference
(
    Id INT NOT NULL,
    CodeReference VARCHAR(250) NOT NULL,
    CodeValue INT NOT NULL
);

INSERT INTO dbo.CodeReference
(
    Id,
    CodeReference,
    CodeValue
)
SELECT Id,
       REPLACE(CodeRef, '#', '') AS CodeReference,
       OriginalCode
FROM dbo.StageSecretCodes
WHERE CodeRef IS NOT NULL;

CREATE TABLE dbo.SecretCodes
(
    Id INT NOT NULL,
    Code NVARCHAR(MAX) NOT NULL
);

CREATE NONCLUSTERED INDEX IX_CodeReference_CodeValue
ON dbo.CodeReference (
CodeReference,
CodeValue);

INSERT INTO dbo.SecretCodes
(
    Id,
    Code
)
SELECT Id,
       STRING_AGG(CAST(Code AS VARCHAR(MAX)), ' ')
FROM dbo.StageSecretCodes
GROUP BY Id;
GO

CHECKPOINT;

SELECT *
FROM dbo.SecretCodes;
SELECT *
FROM dbo.CodeReference;
GO

The last two statements should return similar result sets, as seen in the screenshot below. Since we are generating random numbers, yours will be different.

Query Results

Our CodeReference table contained less than 50 records in the past, yet now it holds over 10,000.

Creating a Backup

Before we proceed, let's execute the code below to create a copy_only backup. Update the file path to something that exists on your system, or you will get an error.

BACKUP DATABASE [SecretCodeDemo]
TO  DISK = N'C:\code\MSSQLTips\SQLFiles\SecretCodeDemo.bak'
WITH COPY_ONLY,
     NOFORMAT,
     INIT,
     NAME = N'SecretCodeDemo-Full Database Backup',
     COMPRESSION,
     STATS = 10;
GO

The WHILE Loop Version

I've included the code for the WHILE loop below. Running the process takes about two minutes on my system. With time and effort, you might optimize the code and improve performance. At one point, the CodeReference table and the number of codes were tiny, so everything seemed fine.

DECLARE @CodeLocator INT;
DECLARE @Code VARCHAR(25);

SELECT @CodeLocator = CHARINDEX('#', Code)
FROM dbo.SecretCodes;

SET NOCOUNT ON;

WHILE (@CodeLocator > 0)
BEGIN

    SELECT @Code = SUBSTRING(Code, @CodeLocator, 9)
    FROM dbo.SecretCodes;

    UPDATE c
    SET c.Code = REPLACE(c.Code, @Code, r.CodeValue)
    FROM dbo.SecretCodes c
        INNER JOIN dbo.CodeReference r
            ON r.CodeReference = REPLACE(@Code, '#', '');

    SELECT @CodeLocator = CHARINDEX('#', Code)
    FROM dbo.SecretCodes;

END;
GO

SQL performs 10,000 iterations for the code above before calling it quits for only one row in the SecretCodes table. What if there were dozens of rows? Let's find a better way.

Restore the Backup

Before we move on, let's run the code below to restore our database to work with the same dataset.

USE [master];
ALTER DATABASE [SecretCodeDemo]
SET SINGLE_USER
WITH ROLLBACK IMMEDIATE;
RESTORE DATABASE [SecretCodeDemo]
FROM DISK = N'C:\code\MSSQLTips\SQLFiles\SecretCodeDemo.bak'
WITH FILE = 1,
     NOUNLOAD,
     REPLACE,
     STATS = 10;
ALTER DATABASE [SecretCodeDemo] SET MULTI_USER;
USE [SecretCodeDemo];
GO

Splitting the String

The first step to move beyond a WHILE loop is splitting the single code string into separate rows. When each value and code is an individual row, we'll perform a set-based update statement and move past the iterative process. Now, onto a function with a lot of potential but fails in the end.

STRING_SPLIT

SQL Server 2016 introduced the STRING_SPLIT table-valued function, finally allowing a built-in method for splitting a string into rows. In the example below, the delimiter is a space.

SELECT value FROM STRING_SPLIT('Luke I am your father', ' ');

Results:

value
---------------------
Luke
I
am
your
father

The STRING_SPLIT() function was the first method I tried. However, after testing and reading Aaron Bertrand's posts, I noticed a huge problem. Unless you're on SQL Server 2022 or an Azure flavor, there's no way to assign an order to the individual rows. I'm on SQL Server 2019. We need to put the strings back together in the same order they were initially. If they could be random, then STRING_SPLIT() might work. Let's not give up hope yet.

OPENJSON

The next table-valued function I tested was OPENJSON, which converts a JSON input into a relational format. Microsoft also introduced this function in SQL Server 2016. I borrowed the idea of splitting strings using OPENJSON from another article by Aaron.

SELECT seq = [Key],
       Value
FROM OPENJSON(N'["' + REPLACE('Luke I am your father', ' ', N'","') + N'"]') AS x;

Results:

seq       value
------    ------
0         Luke
1         I
2         am
3         your
4         father

There you go, OPENJSON adds the order reference. Also, we need to wrap the OPENJSON inside another table-valued function for our purposes. The primary change I made to Aaron's original script was specifying a VARCHAR(MAX) for one of the input parameters. Run the code below to create the function before moving on.

CREATE OR ALTER FUNCTION dbo.SplitOrdered_JSON
(
    @List NVARCHAR(MAX),
    @Delimiter NCHAR(1)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
     (SELECT seq = [Key],
             Value
      FROM OPENJSON(N'["' + REPLACE(@List, @Delimiter, N'","') + N'"]') AS x);
GO

Putting The String Back Together

Below is all the code I used to pull the string apart, update it, and reassemble it. The CTE names represent the actions taking place.

;WITH Split_Data
AS (SELECT s.Id,
           CAST(sp.seq AS INT) AS seq,
           sp.value
    FROM dbo.SecretCodes s
        CROSS APPLY dbo.SplitOrdered_JSON(s.Code, N' ') sp ),
      Replace_Codes
AS (SELECT sd.Id,
           CASE
               WHEN c.CodeReference IS NULL THEN
                   sd.value
               ELSE
                   c.CodeValue
           END AS Code,
           sd.seq
    FROM Split_Data sd
        LEFT JOIN dbo.CodeReference c
            ON c.CodeReference = REPLACE(sd.value, '#', '')),
      Reassemble_Data
AS (SELECT rc.Id,
           STRING_AGG(CAST(rc.Code AS VARCHAR(MAX)), ' ')WITHIN GROUP(ORDER BY rc.seq) AS Code
    FROM Replace_Codes rc
    GROUP BY rc.Id)
UPDATE dbo.SecretCodes
SET dbo.SecretCodes.Code = cc.Code
FROM Reassemble_Data cc
    INNER JOIN dbo.SecretCodes sc
        ON sc.Id = cc.Id;
GO

SELECT *
FROM dbo.SecretCodes;
GO

Let's look at each of the CTEs individually before the final update.

Split_Data

The first CTE uses the table-valued function dbo.SplitOrdered_JSON, which we created above, to split our string code into rows. Also, I'm casting the seq column as an integer because it's a character by default.

SELECT s.Id,
           CAST(sp.seq AS INT) AS seq,
           sp.value
    FROM #SecretCodes s
        CROSS APPLY dbo.SplitOrdered_JSON(s.Code, N' ') sp )

Replace_Codes

The next one replaces the code (#A000) with its value from the CodeReference table. Note: The following two code blocks will fail if executed in isolation.

SELECT sd.Id,
           CASE
               WHEN c.CodeReference IS NULL THEN
                   sd.value
               ELSE
                   c.CodeValue
           END AS Code,
           sd.seq
    FROM Split_Data sd
        LEFT JOIN #CodeReference c
            ON c.CodeReference = REPLACE(sd.value, '#', '')

Reassemble_Data

We are getting close to the finish line, and it's time to reassemble our string with the STRING_AGG function. This function concatenates the values of a string expression. Remember to add the WITHIN GROUP clause to order it.

SELECT rc.Id,
           STRING_AGG(CAST(rc.Code AS VARCHAR(MAX)), ' ')WITHIN GROUP(ORDER BY rc.seq) AS Code
    FROM ReplaceCodes rc
    GROUP BY rc.Id

Once our string is in the proper format, we perform the UPDATE. The screenshot below illustrates the Code column after the statement completes.

Secret Codes Cracked

Running the code above takes less than a second on my system. I marked this one as a win in my book.

Wrapping Up

The catalyst for me to write this article was the pain this problem caused, and in the end, the solution was easier than I thought. I'm glad experts like Aaron Bertrand and Jeff Moden took the time to test and write about these topics. If you can think of other ways to optimize the code above, please post it in the comments.

Key Points

  • WHILE loops work until they don't. If you're confident your data set will stay small or is a one-time thing, then using them isn't much of a concern. I err on the side of tables growing unless they're fixed in depth and hold data like statuses or countries.
  • As Aaron Bertrand wrote, it's a shame Microsoft doesn't expand the STRING_SPLIT ordering to versions of SQL Server pre-2022; at least 2019 or 2017 would make many people happy.
  • The next time you review a pull request and see a WHILE loop, let the developer know your concerns. Often, code gets into the repo, and then a year later, someone else stumbles onto the script and reuses it, but now the data has grown.

1 comment:

SQL Server Copilot AI

 Here are some of the latest tips and advanced concepts for Microsoft SQL Server (MS SQL) in 2024: 1. SQL Server Copilot AI Microsoft has in...