26 Aug 2024

Many Uses of SQL CASE Expression in a JOIN Clause

 

Problem

Have you ever heard: "Never place a CASE expression in a JOIN clause?" I know I have. It's one of those best practices handed down by the SQL elders of old. I'm all for following guidelines, but sometimes I like to know why I'm doing something. Not just because someone wrote a blog article 20 years ago and it's the first one that shows up in search results. Stay tuned if you're like me, and don't follow advice blindly.


Solution

In this tutorial, we'll look at the many uses of a CASE expression. Additionally, why would you use a CASE expression in a JOIN clause? Next, I'll explore a common alternative that may or may not provide better performance. Finally, we'll answer the questions like "Should you never use a CASE expression in a JOIN clause?" My answer might surprise you.

Exploring SQL CASE

You're missing out if you've never used a CASE expression. It's the next command to learn after SELECT *. You might describe it as the MacGyver of T-SQL. Microsoft defines CASE as an expression that evaluates a list of conditions and returns one of the multiple possible result expressions. Said another way: Tell SQL that if this condition is true, then return this other thing.

SELECT CASE
           WHEN 1 = 1 THEN
               1
           WHEN 2 = 2 THEN
               2
           WHEN 3 = 3 THEN
               3
           ELSE
               0
       END AS [My Favorite Number];

Before moving on, try and answer the questions in the example above. What results will SQL Server return and why?

If you feel extra motivation, you can nest CASE expressions for added control. I'm not a fan of doing this if it makes code harder to follow. I would rather repeat certain parts. I know there is an adage of DRY ("Don't Repeat Yourself."), which, for the most part, I agree with.

Below is an example of a nested CASE expression. Doesn't that look lovely? If I put more effort in, I could clean it up. However, I use Redgate SQL Prompt to format all my code, including the example below.

SELECT CASE
           WHEN 1 = 1 THEN
               CASE
                   WHEN 2 = 2 THEN
                       CASE
                           WHEN 3 = 3 THEN
                               3
                           ELSE
                               0
                       END
                   ELSE
                       0
               END
           ELSE
               0
       END AS [My Second Favorite Number];

There are a few other places where you see CASE expressions; the first is in a WHERE clause:

SELECT Column3
FROM dbo.SpecialTable
WHERE CASE
          WHEN Column3 = 'Hello' THEN
              1
          WHEN Column3 = 'Goodbye' THEN
              2
          ELSE
              0
      END > 0;

And finally, in the example below, I've added it to the JOIN clause:

SELECT t1.Column1,
       t1.Column2,
       t2.Column3
FROM Table1 t1
    INNER JOIN SpecialTable t2
        ON CASE
               WHEN t2.Column3 = 'Hello' THEN
                   1
               WHEN t2.Column3 = 'Goodbye' THEN
                   2
               ELSE
                   0
           END = t1.JoinColumn;

As I mentioned above, people might frown on the second more than the first. I've seen it written on several "you should never do" lists.

Build Our Dataset

Let's create a dataset where using a CASE expression in a JOIN makes sense. I'll add two tables: Person and Sales.

CREATE TABLE dbo.Person
(
    Id INT NOT NULL,
    FirstName NVARCHAR(200) NOT NULL,
    LastName NVARCHAR(200) NOT NULL,
     IsActive BIT NOT NULL
    CONSTRAINT PK_Person_Id
        PRIMARY KEY CLUSTERED (Id)
);
CREATE TABLE dbo.Sales
(
    Id INT IDENTITY(1, 1) NOT NULL,
    SalesDate DATE NOT NULL,
    OrderCode NVARCHAR(3) NOT NULL,
    Quantity INT NOT NULL,
    Price DECIMAL(6, 0) NOT NULL,
    SalesPersonId INT NOT NULL,
    AccountManagerId INT NULL,
    CONSTRAINT PK_Sales_Id
        PRIMARY KEY CLUSTERED (Id),
    CONSTRAINT FK_Sales_SalesPerson
        FOREIGN KEY (SalesPersonId)
        REFERENCES dbo.Person (Id),
    CONSTRAINT FK_Sales_AccountManager
        FOREIGN KEY (AccountManagerId)
        REFERENCES dbo.Person (Id)
);
GO

Now, let's add some data to each of them. For the Person table, we'll insert 20 rows. For the Sales table, we'll insert 100,000 rows. I want at least one table to be on the larger side.

INSERT INTO dbo.Person
(
    Id,
    FirstName,
    LastName,
    IsActive
)
VALUES
(1, 'John', 'Smith', 1),
(2, 'Sarah', 'Johnson', 1),
(3, 'Michael', 'Williams', 1),
(4, 'Emily', 'Brown', 1),
(5, 'David', 'Jones', 1),
(6, 'Emma', 'Garcia', 1),
(7, 'Daniel', 'Martinez', 1),
(8, 'Olivia', 'Davis', 1),
(9, 'Matthew', 'Rodriguez', 1),
(10, 'Ava', 'Miller', 1),
(11, 'Christopher', 'Gonzalez', 1),
(12, 'Sophia', 'Wilson', 1),
(13, 'Andrew', 'Anderson', 1),
(14, 'Isabella', 'Thomas', 1),
(15, 'Joshua', 'Jackson', 1),
(16, 'Mia', 'White', 1),
(17, 'William', 'Harris', 1),
(18, 'Charlotte', 'Clark', 1),
(19, 'Ethan', 'Lewis', 1),
(20, 'Amelia', 'Lee', 1);



WITH SalesData AS (
    SELECT TOP 100000
        DATEADD(DAY, RAND(CHECKSUM(NEWID())) * (1 + DATEDIFF(DAY, '2023-04-01', '04-30-2023')), '2023-04-01') AS SalesDate,
        CASE
            WHEN (ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) % 3 = 0 THEN 'ABC'
            WHEN (ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) % 3 = 1 THEN 'DEF'
            ELSE 'GHI'
        END AS OrderCode,
        ABS(CHECKSUM(NEWID()) % 100) + 1 AS Quantity,
        ABS(CHECKSUM(NEWID()) % 1000) + 100 AS Price,
        ABS(CHECKSUM(NEWID()) % 10) + 1 AS SalesPersonId,
        CASE
            WHEN (ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) % 100 = 0 THEN NULL
            ELSE ABS(CHECKSUM(NEWID()) % 10) + 11
        END AS AccountManagerId
      FROM sys.all_columns cl1
    CROSS JOIN sys.all_columns cl2
)
-- Insert the data into the dbo.Sales table
INSERT INTO dbo.Sales
(
    SalesDate,
    OrderCode,
    Quantity,
    Price,
    SalesPersonId,
    AccountManagerId
)
SELECT SalesDate,
       OrderCode,
       Quantity,
       Price,
       SalesPersonId,
       AccountManagerId
FROM SalesData;

We always populate the SalesPersonId in the Sales table. The AccountManagerId column is sometimes populated. You can see that the SalesPersonId and AccountManagerId are foreign keys back to the Person table.

Why Use a CASE Expression in a JOIN?

Why would you use a CASE in a JOIN clause in the first place? One reason is if you have some complex join logic you're trying to implement, essentially joining a table based on multiple conditions. For example, we want to join our Person and Sales tables primarily on the SalesPersonId. However, when the OrderCode has a specific value, we want to join on the AccountManagerId.

SELECT CONCAT(   p.LastName,
                 ', ',
                 p.FirstName
             ) AS PersonName,
       SUM(s.Price) AS TotalSales,
       s.SalesDate
FROM dbo.Sales s
    INNER JOIN dbo.Person p
        ON p.Id = CASE
                      WHEN s.OrderCode = 'ABC'
                           AND s.AccountManagerId IS NOT NULL THEN
                          s.AccountManagerId
                      ELSE
                          s.SalesPersonId
                  END
GROUP BY s.SalesDate,
         p.FirstName,
         p.LastName;

Another reason for using CASE is to handle NULLs returned from one of the tables. Of course, this is when an ISNULL() doesn't provide the results you're looking for.

SELECT CONCAT(   p.LastName,
                 ', ',
                 p.FirstName
             ) AS PersonName,
       SUM(s.Price) AS TotalSales,
       s.SalesDate
FROM dbo.Sales s
    INNER JOIN dbo.Person p
        ON p.Id = CASE
                      WHEN s.AccountManagerId IS NULL THEN
                          1
                      WHEN s.AccountManagerId IS NULL
                           AND s.SalesDate > '2023-04-05' THEN
                          2
                      ELSE
                          s.SalesPersonId
                  END
GROUP BY s.SalesDate,
         p.FirstName,
         p.LastName;

I'm sure there are other examples, but these two seem to come up the most.

Alternatives to CASE

We have yet to look at the performance of CASE versus another method. However, how can we accomplish the same results differently by not using a CASE expression? The one I see recommended the most is using a UNION ALL operator. That's the one we'll compare against.

Below is how you would write a statement to avoid using the CASE. The UNION ALL creates two datasets that match the join criteria and then combines them.

SELECT CONCAT(   p.LastName,
                 ', ',
                 p.FirstName
             ) AS PersonName,
       SUM(s1.Price) AS TotalSales,
       s1.SalesDate
FROM dbo.Sales s1
    INNER JOIN dbo.Person p
        ON p.Id = s1.SalesPersonId
WHERE s1.OrderCode <> 'ABC'
      OR s1.AccountManagerId IS NULL
GROUP BY s1.SalesDate,
         p.FirstName,
         p.LastName
UNION ALL
SELECT CONCAT(   p.LastName,
                 ', ',
                 p.FirstName
             ) AS PersonName,
       SUM(s2.Price) AS TotalSales,
       s2.SalesDate
FROM dbo.Sales s2
    INNER JOIN dbo.Person p
        ON p.Id = s2.AccountManagerId
WHERE s2.OrderCode = 'ABC'
      AND s2.AccountManagerId IS NOT NULL
GROUP BY s2.SalesDate,
         p.FirstName,
         p.LastName;

One thing to remember is that the above method goes against the DRY philosophy. Not that I advocate following the philosophy to a T.

Comparing Performance

Now it's time to compare both methods and see if one performs better. Before we do that, I'm going to create three indexes. The first is mainly for the statement with the CASE expression, and the other two are for the UNION ALL.

CREATE NONCLUSTERED INDEX IX_SalesPerson_AccountManager_OrderCode
ON dbo.Sales (
                 SalesPersonId,
                 AccountManagerId,
                 OrderCode
             )
INCLUDE (
            Price,
            SalesDate
        );


CREATE NONCLUSTERED INDEX IX_AccountManager_OrderCode
ON dbo.Sales (
                 AccountManagerId,
                 OrderCode
             )
INCLUDE (
            Price,
            SalesDate
        );

CREATE NONCLUSTERED INDEX IX_SalesPerson_OrderCode
ON dbo.Sales (
                 SalesPersonId,
                 OrderCode
             )
INCLUDE (
            Price,
            SalesDate
        );
GO

First, turn STATISTICS IO on and enable the actual execution plan. Then execute both statements in the same batch and see what the performance markers say.

SET STATISTICS IO ON;

SELECT        CONCAT(   p.LastName,
                 ', ',
                 p.FirstName
             ) AS PersonName,
       SUM(s.Price) AS TotalSales,
       s.SalesDate
FROM dbo.Sales s
    INNER JOIN dbo.Person p
        ON p.Id = CASE
                      WHEN s.OrderCode = 'ABC'
                           AND s.AccountManagerId IS NOT NULL THEN
                          s.AccountManagerId
                      ELSE
                          s.SalesPersonId
                  END
GROUP BY s.SalesDate,
         p.FirstName,
           p.LastName;

/* Second method using a UNION ALL operator*/
SELECT CONCAT(   p.LastName,                 ', ',
                 p.FirstName
             ) AS PersonName,
       SUM(s1.Price) AS TotalSales,
       s1.SalesDate
FROM dbo.Sales s1
    INNER JOIN dbo.Person p
        ON p.Id = s1.SalesPersonId
WHERE s1.OrderCode <> 'ABC'
      OR s1.AccountManagerId IS NULL
GROUP BY s1.SalesDate,
         p.FirstName,
         p.LastName
UNION ALL
SELECT CONCAT(   p.LastName,
                 ', ',
                 p.FirstName
             ) AS PersonName,
       SUM(s2.Price) AS TotalSales,
       s2.SalesDate
FROM dbo.Sales s2
    INNER JOIN dbo.Person p
        ON p.Id = s2.AccountManagerId
WHERE s2.OrderCode = 'ABC'
      AND s2.AccountManagerId IS NOT NULL
GROUP BY s2.SalesDate,
         p.FirstName,
           p.LastName;

SET STATISTICS IO OFF;
GO
Statistics IO results. Page Reads.
Execution Plan from both queries in one batch.

You can see from the images above that the UNION ALL makes up more than 50% of the batch, along with additional page reads on both tables. Ultimately, the difference isn't going to be life-changing. However, if someone says using a CASE expression in a JOIN is a lousy practice, ask them why. I don't advocate using a CASE expression in the join for every query. But there are instances where it comes in handy.

I'm curious. Do you ever use a CASE expression in a JOIN clause? Please let me know in the comments below.

Key Points

  • You can use a CASE expression in almost any part of a SQL statement, including the WHERE and JOIN.
  • Given the example, the CASE expression performed better in this tutorial than the UNION ALL. However, this isn't an absolute rule.
  • If someone says adding a CASE expression to a JOIN clause is a bad practice, ask them to explain why.

24 Aug 2024

Why is there no standard implementation of SQL?

 SQL (Structured Query Language) is a standardized language for managing and querying relational databases, but there is no single, universally implemented version of SQL across all database management systems (DBMS). The reasons for this lack of a standard implementation are rooted in the history, flexibility, and competitive nature of the database industry:

1. Evolution of SQL Standards

  • SQL has been standardized by organizations such as ANSI (American National Standards Institute) and ISO (International Organization for Standardization), with various versions like SQL-86, SQL-89, SQL-92, SQL:1999, and so on. However, these standards are broad and provide a foundation for SQL syntax and features, allowing vendors to implement and extend the standard in ways that meet their specific needs.

2. Vendor Differentiation

  • Database vendors (e.g., Oracle, Microsoft, IBM, PostgreSQL, MySQL) differentiate their products by offering unique features, optimizations, and extensions to the standard SQL language. These extensions help vendors provide additional functionality, such as proprietary functions, enhanced performance, or specialized data types that are not covered by the SQL standard.

3. Performance Optimization

  • Different DBMS platforms are optimized for different use cases and hardware environments. To achieve the best performance, vendors often implement their own versions of certain SQL features or add proprietary extensions. These optimizations are designed to take advantage of specific architectural strengths of their database engines.

4. Historical Legacy

  • SQL has been around since the 1970s, and different database systems were developed independently over time. As a result, they evolved with different features, syntax, and functionalities before SQL standards were formalized. Even after standardization, many vendors continued to support legacy features and syntax to maintain backward compatibility with older applications.

5. Flexibility and Adaptability

  • The SQL standard is intentionally flexible, allowing database vendors to innovate and address new requirements as they emerge in the industry. This flexibility leads to differences in implementation, as vendors adapt the language to suit different types of workloads, data models, and application needs.

6. Open Source vs. Commercial Databases

  • Open-source databases like PostgreSQL and MySQL may follow different development philosophies and community-driven enhancements compared to commercial databases like Oracle or SQL Server. These differences contribute to variations in SQL implementations across platforms.

7. Complexity of Standardization

  • The SQL language is vast and complex, covering a wide range of functionalities from basic data retrieval to advanced features like recursive queries, window functions, and transaction management. Achieving complete standardization across all these features is challenging, especially as new features are continually being developed and added by different vendors.

8. Community and Ecosystem

  • Different SQL implementations have grown their own ecosystems of tools, libraries, and communities. These ecosystems contribute to the differences in how SQL is implemented and used across different platforms.

9. Compliance vs. Extensions

  • While many vendors aim to be compliant with the SQL standard, they also offer proprietary extensions that are not part of the standard. These extensions can offer powerful features but also contribute to the lack of a fully standardized SQL implementation across all platforms.

Implications

  • Portability: Applications written for one DBMS may need modifications to work on another due to differences in SQL syntax and features.
  • Learning Curve: Developers and DBAs must be familiar with the specific dialect of SQL used by the DBMS they are working with, which can increase the learning curve.
  • Choice of DBMS: When choosing a DBMS, organizations must consider not just compliance with SQL standards, but also the unique features and extensions offered by each platform.

In summary, the lack of a standard implementation of SQL across all platforms is a result of historical development, vendor competition, and the need for flexibility and innovation in addressing diverse data management requirements.

SQL Server Copilot AI

 Here are some of the latest tips and advanced concepts for Microsoft SQL Server (MS SQL) in 2024: 1. SQL Server Copilot AI Microsoft has in...