Showing posts with label Index Management. Show all posts
Showing posts with label Index Management. Show all posts

12 Dec 2025

Best SQL Query to Find Duplicates – Step by Step Guide

 inding duplicate rows is one of the most common tasks in SQL, especially when working with real-world data like customer records, invoices, transactions, or log data.

In this article, you will learn the best SQL queries to identify duplicates, how to remove them, and how to prevent them in the future.

This is a beginner-friendly and practical tutorial you can use in SQL Server, MySQL, Oracle, PostgreSQL, and even BigQuery.


🔍 Why Finding Duplicates Is Important

Duplicate rows can create multiple problems:

  • Wrong totals in reports

  • Incorrect customer counts

  • Issues in billing

  • Slow performance

  • Errors in downstream systems (Power BI, ETL, APIs)

Cleaning duplicates is one of the first tasks in data quality improvement.


1. Find Duplicates Using GROUP BY (Best & Easiest Method)

This is the most common and universal method.

Query:

SELECT column1, column2, COUNT(*) AS duplicate_count FROM table_name GROUP BY column1, column2 HAVING COUNT(*) > 1;

✔ When to use:

  • When you want to know which values are duplicated

  • Works in SQL Server, MySQL, Oracle, PostgreSQL, BigQuery


📌 Example: Find Duplicate Email IDs

Suppose you have a table:

The email raj@gmail.com is repeated.

Query:

SELECT email, COUNT(*) AS duplicate_count FROM users GROUP BY email HAVING COUNT(*) > 1;

Output:

emailduplicate_count
 
        raj@gmail.com

🔥 3. Find Duplicates Using ROW_NUMBER() (Best for Deleting)

This is the most powerful method.
It uses window functions and gives a row number to each duplicate group.

Query:

WITH cte AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn FROM users ) SELECT * FROM cte WHERE rn > 1;

✔ When to use:

  • When you need to delete only duplicates and keep 1 record

  • When you want to mark duplicates


🗑 4. Delete Duplicate Rows (Keep Only One)

Using the same CTE:

WITH cte AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn FROM users ) DELETE FROM cte WHERE rn > 1;

This deletes all duplicate rows but keeps the first row.


🧠 5. Find Duplicates by Date or Conditional Columns

Example: Find duplicates only for rows created today.

SELECT name, email, created_date, COUNT(*) FROM users WHERE created_date = CURRENT_DATE GROUP BY name, email, created_date HAVING COUNT(*) > 1;

This is useful for ETL loads and daily imports.


🛑 6. Prevent Duplicates Using Constraints

Best long-term solution.

In SQL Server / MySQL / PostgreSQL:

ALTER TABLE users ADD CONSTRAINT uq_email UNIQUE (email);

In Oracle:

ALTER TABLE users ADD CONSTRAINT uq_email UNIQUE (email);

After this, the database will automatically block duplicate values.


📸 Screenshots You Should Add (To Increase SEO + AdSense Earnings)

You can add these screenshots:

  • SSMS window showing query

  • Duplicate rows in a sample table

  • Execution result window

(If you want, I can generate clean screenshots for you.)


FAQ – Frequently Asked Questions

1. What is the fastest way to find duplicates in SQL?

Using GROUP BY + HAVING COUNT(*) > 1 is the simplest and fastest method.


2. Does ROW_NUMBER() find duplicates?

Yes. ROW_NUMBER assigns sequence numbers to rows allowing you to identify duplicate rows easily.


3. Will deleting duplicates improve performance?

Yes.
Removing duplicates improves:

  • Query performance

  • Joins

  • Aggregations

  • ETL workflows


4. How to avoid duplicates permanently?

Use a UNIQUE constraint or PRIMARY KEY on important columns like email, mobile, employee code, etc.


5. Does this work in all SQL databases?

Yes — all examples work in:

  • SQL Server

  • MySQL

  • PostgreSQL

  • Oracle

  • BigQuery


⭐ Final Thoughts

Finding and removing duplicates is a critical part of database maintenance.
By using the queries shown above—GROUP BY, ROW_NUMBER, and unique constraints—you can quickly identify and eliminate duplicate data efficiently.

If you want more SQL tutorials, feel free to browse my other posts or ask in the comments!


👉 Want more articles like this for your blog?

I can write complete SEO-optimized posts on:

  • SQL Server

  • MySQL

  • BigQuery

  • Oracle

  • DBeaver

  • Power BI

  • ASP.NET

  • Data analysis topics

Just tell me your next topic!

7 Dec 2025

FULL PYTHON + MS SQL SCRIPT (BEST PRACTICE)

 

FULL PYTHON + MS SQL SCRIPT (BEST PRACTICE)

import pyodbc


# -----------------------------------------

# SQL CONNECTION (EDIT THIS PART)

# -----------------------------------------

def get_connection():

    try:

        conn = pyodbc.connect(

            "DRIVER={ODBC Driver 17 for SQL Server};"

            "SERVER=YOUR_SERVER_NAME;"      # e.g. DESKTOP-123\SQLEXPRESS

            "DATABASE=YOUR_DATABASE_NAME;"

            "UID=YOUR_USERNAME;"

            "PWD=YOUR_PASSWORD;",

            autocommit=False                # Manual commit = safer

        )

        return conn

    except Exception as ex:

        print("Database connection failed:", ex)

        return None



# -----------------------------------------

# INSERT RECORD

# -----------------------------------------

def insert_employee(emp_id, name, salary):

    conn = get_connection()

    if not conn:

        return

    try:

        cursor = conn.cursor()

        query = """

            INSERT INTO Employees (EmpID, Name, Salary)

            VALUES (?, ?, ?)

        """

        cursor.execute(query, (emp_id, name, salary))

        conn.commit()

        print("Record inserted successfully!")

    except Exception as ex:

        conn.rollback()

        print("Insert failed:", ex)

    finally:

        conn.close()



# -----------------------------------------

# UPDATE RECORD

# -----------------------------------------

def update_employee(emp_id, new_salary):

    conn = get_connection()

    if not conn:

        return

    try:

        cursor = conn.cursor()

        query = """

            UPDATE Employees

            SET Salary = ?

            WHERE EmpID = ?

        """

        cursor.execute(query, (new_salary, emp_id))

        conn.commit()

        print("Record updated successfully!")

    except Exception as ex:

        conn.rollback()

        print("Update failed:", ex)

    finally:

        conn.close()



# -----------------------------------------

# DELETE RECORD

# -----------------------------------------

def delete_employee(emp_id):

    conn = get_connection()

    if not conn:

        return

    try:

        cursor = conn.cursor()

        query = "DELETE FROM Employees WHERE EmpID = ?"

        cursor.execute(query, (emp_id,))

        conn.commit()

        print("Record deleted successfully!")

    except Exception as ex:

        conn.rollback()

        print("Delete failed:", ex)

    finally:

        conn.close()



# -----------------------------------------

# SELECT ALL RECORDS

# -----------------------------------------

def get_all_employees():

    conn = get_connection()

    if not conn:

        return

    try:

        cursor = conn.cursor()

        cursor.execute("SELECT EmpID, Name, Salary FROM Employees")


        print("\n--- Employee List ---")

        for row in cursor.fetchall():

            print(row.EmpID, row.Name, row.Salary)


    except Exception as ex:

        print("Select failed:", ex)

    finally:

        conn.close()



# -----------------------------------------

# CALL STORED PROCEDURE

# -----------------------------------------

def call_stored_procedure():

    """

    Example Stored Procedure:

    CREATE PROCEDURE GetEmployees

    AS

    BEGIN

        SELECT * FROM Employees

    END

    """

    conn = get_connection()

    if not conn:

        return

    try:

        cursor = conn.cursor()

        cursor.execute("{CALL GetEmployees}")   # or EXEC GetEmployees

        rows = cursor.fetchall()


        print("\n--- Stored Procedure Result ---")

        for row in rows:

            print(row.EmpID, row.Name, row.Salary)


    except Exception as ex:

        print("Stored procedure failed:", ex)

    finally:

        conn.close()



# -----------------------------------------

# MAIN TESTING

# -----------------------------------------

if __name__ == "__main__":

    insert_employee(1, "Vijay", 50000)

    update_employee(1, 60000)

    get_all_employees()

    delete_employee(1)

    get_all_employees()

    call_stored_procedure()

🚀 How to Use

  1. Update:

SERVER= DATABASE= UID= PWD=
  1. Create sample table in SQL:

CREATE TABLE Employees ( EmpID INT PRIMARY KEY, Name VARCHAR(100), Salary DECIMAL(18,2) );
  1. Run the Python file:

python mssql_python_demo.py

4 Dec 2025

SQL Server 2025 introduces vector data types and functions

 SQL Server 2025 introduces vector data types and functions, enabling native AI-driven similarity searches and machine learning tasks directly in SQL queries without external tools.​


Key AI Features

Vector data types store embeddings as optimized binary JSON arrays for efficient operations like distance calculations (e.g., VECTOR_DISTANCE, VECTOR_NORM). New vector indexes support approximate nearest neighbor searches, accelerating AI workloads. External AI model management allows invoking REST endpoints for embeddings via sp_invoke_external_rest_endpoint.​


Developer Enhancements

Regular expressions functions like REGEXP_LIKE and REGEXP_REPLACE enable pattern matching and text manipulation in queries. Fuzzy string matching with EDIT_DISTANCE_SIMILARITY computes similarity scores for approximate searches. Change event streaming captures real-time DML changes to Azure Event Hubs in JSON or Avro formats.​


Performance Improvements

Optimized locking reduces blocking and memory use, while tempdb governance prevents space exhaustion from runaway jobs. Cardinality estimation feedback for expressions adapts query plans across executions, and optional parameter plan optimization (OPPO) handles varying parameters dynamically. These align with 2025 trends like AI integration and query tuning for modern data stacks

3 Dec 2025

Beaver's functionality

 The errors you're encountering are due to differences in the SQL functions supported by Oracle and SQL Server. The LEFT and ISNULL functions are specific to SQL Server, and Oracle does not recognize them, which is why you're getting the ORA-00904: "LEFT": invalid identifier error.


Solutions:

Use Oracle-Compatible Functions:


For LEFT, use Oracle's SUBSTR function:


SELECT COLA, SUBSTR(COLB, 1, 10) FROM TABLEA;

For ISNULL, use Oracle's NVL function:


SELECT COLA, NVL(COLB, 'default_value') FROM TABLEA;

Transform Data in SQL Server:


If you can migrate the data without transformations, you can use the SQL Server-specific functions like LEFT and ISNULL after the data has been moved.

DBeaver Script Configuration:


If you want to write cross-platform scripts, you can use DBeaver's functionality to define SQL scripts for each database type, ensuring the correct functions are used for each.

If you need to run these functions as part of the migration process, you'll have to use Oracle-compatible syntax until the data is in SQL Server.

21 Aug 2024

Removing duplicate rows from a large table with millions of records

 Removing duplicate rows from a large table with millions of records can be challenging, but it can be done efficiently using SQL. Here's a step-by-step guide to remove duplicates from a table in SQL Server:

1. Identify Duplicates:

First, you'll need to identify what constitutes a "duplicate." Typically, this means all columns (except for the primary key or a unique identifier) are the same.

2. Create a Backup:

Before making any changes, it's good practice to create a backup of your table.

SELECT * INTO YourTable_Backup FROM YourTable;

3. Remove Duplicates Using a CTE (Common Table Expression):

The most common and efficient way to remove duplicates is to use a CTE combined with the ROW_NUMBER() function. Here's how you can do it:

WITH CTE AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY Column1, Column2, Column3 ORDER BY (SELECT NULL)) AS RN FROM YourTable ) DELETE FROM CTE WHERE RN > 1;
  • Explanation:
    • The ROW_NUMBER() function assigns a unique sequential integer to rows within a partition of a result set, starting at 1 for the first row in each partition.
    • The PARTITION BY clause divides the result set into partitions to which the ROW_NUMBER() function is applied.
    • The ORDER BY (SELECT NULL) clause simply ensures that the rows are arbitrarily ordered within each partition.
    • Rows with RN > 1 are considered duplicates and are deleted.

4. Verify the Results:

After running the deletion, it's essential to verify that duplicates have been removed and that the table integrity is maintained.

SELECT Column1, Column2, Column3, COUNT(*) FROM YourTable GROUP BY Column1, Column2, Column3 HAVING COUNT(*) > 1;

This query should return no results if all duplicates have been successfully removed.

5. Rebuild Indexes and Update Statistics:

After removing duplicates, it's a good idea to rebuild any indexes on the table and update the statistics.

ALTER INDEX ALL ON YourTable REBUILD; UPDATE STATISTICS YourTable;

6. Considerations for Large Tables:

For very large tables (like your 260 million rows), you might need to remove duplicates in smaller batches to avoid long transaction times or transaction log growth. This can be done using a loop or by deleting rows in chunks.

Example:

WHILE EXISTS (SELECT 1 FROM CTE WHERE RN > 1) BEGIN DELETE TOP (10000) FROM CTE WHERE RN > 1; END

7. Monitor and Optimize:

Monitor the process and optimize as needed, particularly in a production environment where long-running queries could impact performance.

Conclusion:

By following these steps, you should be able to efficiently remove duplicate rows from your large table in SQL Server. The key is to use a method that minimizes the impact on your server and ensures the integrity of your data.

Partitioning a table in SQL Server that has a primary key referenced by foreign keys in other tables

 Partitioning a table in SQL Server that has a primary key referenced by foreign keys in other tables can indeed be challenging due to the constraints involved. Your approach of dropping and recreating constraints is a common solution, but here are a few considerations that might simplify the process or make it more efficient:

1. Use SWITCH Instead of Dropping and Recreating Constraints:

If you are partitioning for performance reasons, consider using the SWITCH statement. You can create a new partitioned table with the desired schema, and then use ALTER TABLE SWITCH to move data between tables. This avoids dropping foreign key constraints but requires more setup and might need the table to be empty when switching.

  • Step-by-Step:
    1. Create a new partitioned table with the same schema as the original table.
    2. Use INSERT INTO ... SELECT to move data to the new table.
    3. Use ALTER TABLE SWITCH to switch the tables.
    4. Drop the old table and rename the new one.

2. Temporarily Disable Constraints:

SQL Server allows you to disable foreign key constraints temporarily, which might help in avoiding the need to drop them.

  • Step-by-Step:

    1. Disable the foreign key constraints on the dependent tables.
    2. Drop the primary key constraint and clustered index.
    3. Partition the table and recreate the primary key constraint on the partitioned table.
    4. Re-enable the foreign key constraints.
  • Example:

    ALTER TABLE [DependentTable] NOCHECK CONSTRAINT [FK_Name]; -- Drop and recreate the primary key and clustered index ALTER TABLE [DependentTable] CHECK CONSTRAINT [FK_Name];

3. Using Schema Modification with Minimal Downtime:

If downtime is a concern, consider using techniques like online index creation and schema modification that might reduce the impact on the application.

  • Online Index Creation: SQL Server Enterprise Edition supports creating and rebuilding indexes online, which might reduce the impact during partitioning.
  • Schema Modifications: You could stage the new partitioned table while keeping the original table intact, then switch over with minimal downtime.

4. Consideration for SQL Server Version:

If you're using SQL Server 2016 or later, take advantage of improvements in partitioning and index creation features, like support for more efficient operations on partitioned tables.

5. Using a Maintenance Window:

Since the process involves significant changes, performing this operation during a maintenance window might be the best option, even if it means temporarily disabling or dropping constraints.

6. Documentation and Backup:

Document each step carefully and ensure you have a full backup before proceeding. This will help in case anything goes wrong during the process.

Conclusion:

Unfortunately, there isn’t a way to completely avoid the process of dropping and recreating constraints when partitioning a table that is heavily referenced by foreign keys. However, depending on your specific environment and requirements, the alternatives like SWITCH, temporarily disabling constraints, or using online operations might make the process smoother and less disruptive.

 

18 Jun 2024

Check all indexes in all table in Database in MS SQL Server

 USE YourDatabaseName; -- Replace with your actual database name

-- Query to retrieve index information SELECT TableName = t.name, IndexName = ind.name, IndexType = CASE ind.index_id WHEN 0 THEN 'Heap (No Clustered Index)' WHEN 1 THEN 'Clustered' ELSE 'Nonclustered' END, ColumnName = col.name, ColumnPosition = ic.key_ordinal FROM sys.indexes ind INNER JOIN sys.index_columns ic ON ind.object_id = ic.object_id AND ind.index_id = ic.index_id INNER JOIN sys.columns col ON ic.object_id = col.object_id AND ic.column_id = col.column_id INNER JOIN sys.tables t ON ind.object_id = t.object_id WHERE ind.type_desc <> 'Heap' -- Exclude heap tables (tables without clustered indexes) ORDER BY t.name, ind.name, ic.key_ordinal;

Explanation:

  1. USE YourDatabaseName: Replace YourDatabaseName with the name of your SQL Server database to switch to that database context.

  2. sys.indexes: This system view contains information about indexes in the database.

  3. sys.index_columns: This view provides details about the columns that are part of each index.

  4. sys.columns: This view gives information about columns in tables.

  5. sys.tables: This view contains information about tables in the database.

  6. SELECT: Retrieves the table name (t.name), index name (ind.name), index type (ind.index_id), column name (col.name), and column position (ic.key_ordinal).

  7. CASE statement: Checks the ind.index_id to determine the type of index (clustered, nonclustered, or heap).

  8. WHERE: Filters out heap tables (ind.type_desc <> 'Heap') because they don't have traditional indexes.

  9. ORDER BY: Orders the results by table name (t.name), index name (ind.name), and column position (ic.key_ordinal).

Notes:

  • Make sure to replace YourDatabaseName with the actual name of your database.
  • This query provides comprehensive information about all indexes in the database except for heap tables.
  • Running this query requires appropriate permissions to access the system views (sys.indexes, sys.index_columns, sys.columns, sys.tables).

By executing this query in SQL Server Management Studio (SSMS) or any other SQL query tool, you can obtain a detailed list of indexes and their configurations across all tables in your database.

SQL Server 2025 Express Edition Download, Install and Configure

  Problem I heard there is a new version of SQL Server Express Edition with the 2025 release. What is new in this edition? Is it hard to ins...