11 Jun 2024

Data Preprocessing in Data Mining

Data preprocessing is an important step in the data mining process. It refers to cleaning, transforming, and integrating data in order to make it ready for analysis. The goal of data preprocessing is to improve the quality of the data and to make it more suitable for the specific data mining task.

Some common steps in data preprocessing include:


Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as imputation, removal, and transformation.
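As a minimal sketch of two of these cleaning techniques, here is imputation by the attribute mean versus removal, in plain Python (the helper name and column layout are made up for illustration):

```python
import statistics

def impute_missing(values, strategy="mean"):
    """Handle None entries in a numeric column.

    strategy: "mean"   - fill each missing entry with the column mean
              "remove" - drop the missing entries entirely
    """
    present = [v for v in values if v is not None]
    if strategy == "remove":
        return present
    fill = statistics.mean(present)
    return [fill if v is None else v for v in values]

ages = [25, None, 31, 28, None, 40]
print(impute_missing(ages))            # None replaced by the column mean, 31
print(impute_missing(ages, "remove"))  # rows with None dropped
```

Mean imputation keeps the dataset size intact but can dampen variance; removal is safer statistically but only practical when few rows are affected.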

Data Integration: This involves combining data from multiple sources to create a unified dataset. Data integration can be challenging as it requires handling data with different formats, structures, and semantics. Techniques such as record linkage and data fusion can be used for data integration.

Data Transformation: This involves converting the data into a suitable format for analysis. Common techniques used in data transformation include normalization, standardization, and discretization. Normalization is used to scale the data to a common range, while standardization is used to transform the data to have zero mean and unit variance. Discretization is used to convert continuous data into discrete categories.

Data Reduction: This involves reducing the size of the dataset while preserving the important information. Data reduction can be achieved through techniques such as feature selection and feature extraction. Feature selection involves selecting a subset of relevant features from the dataset, while feature extraction involves transforming the data into a lower-dimensional space while preserving the important information.
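As one concrete filter-style feature selection method, columns with (near-)zero variance carry little information and can be dropped. A toy sketch in plain Python (the helper and threshold are illustrative, not a standard API):

```python
import statistics

def select_by_variance(rows, threshold=0.0):
    """Keep only the columns whose population variance exceeds `threshold`.

    rows: list of equal-length numeric feature vectors.
    Returns (kept_column_indices, reduced_rows).
    """
    columns = list(zip(*rows))
    kept = [i for i, col in enumerate(columns)
            if statistics.pvariance(col) > threshold]
    reduced = [[row[i] for i in kept] for row in rows]
    return kept, reduced

data = [[1.0, 5.0, 0.0],
        [1.0, 7.0, 0.0],
        [1.0, 6.0, 0.0]]
kept, reduced = select_by_variance(data)
# columns 0 and 2 are constant, so only column 1 survives
```

Feature extraction methods such as PCA instead build new low-dimensional features, at the cost of losing the original column meanings.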

Data Discretization: This involves dividing continuous data into discrete categories or intervals. Discretization is often used in data mining and machine learning algorithms that require categorical data. Discretization can be achieved through techniques such as equal width binning, equal frequency binning, and clustering.
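Equal width and equal frequency binning can be sketched as follows (a toy implementation for illustration, not a library API):

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign roughly the same number of values to each bin."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / k
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), k - 1)
    return bins

data = [4, 8, 15, 16, 23, 42]
print(equal_width_bins(data, 3))      # intervals of width (42-4)/3
print(equal_frequency_bins(data, 3))  # two values per bin
```

Equal width bins are easy to interpret but sensitive to outliers (here the value 42 stretches the range); equal frequency bins balance the counts instead.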

Data Normalization: This involves scaling the data to a common range, such as between 0 and 1 or -1 and 1. Normalization is often used to handle data with different units and scales. Common normalization techniques include min-max normalization, z-score normalization, and decimal scaling.
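A minimal sketch of these three normalization techniques in plain Python (helper names are illustrative):

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: rescale linearly to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def z_score(values):
    """Z-score normalization: zero mean, unit variance."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that maps all values into (-1, 1)."""
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

marks = [8, 10, 15, 20]
print(min_max(marks))         # [0.0, ..., 1.0]
print(decimal_scaling(marks)) # [0.08, 0.1, 0.15, 0.2]
```

Min-max is sensitive to outliers (a single extreme value compresses everything else), which is why z-score normalization is often preferred for noisy data.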

Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the analysis results. The specific steps involved in data preprocessing may vary depending on the nature of the data and the analysis goals.

By performing these steps, the data mining process becomes more efficient and the results become more accurate.

Preprocessing in Data Mining: 
Data preprocessing is a data mining technique that is used to transform raw data into a useful and efficient format. 

 


Steps Involved in Data Preprocessing: 

1. Data Cleaning: 
Raw data often has irrelevant and missing parts. Data cleaning is done to handle these problems. It involves handling missing data, noisy data, etc. 
 

  • (a). Missing Data: 
    This situation arises when some values are absent from the dataset. It can be handled in various ways. 
    Some of them are: 
    1. Ignore the tuples: 
      This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple. 
       
    2. Fill the Missing values: 
      There are various ways to do this task. You can fill the missing values manually, with the attribute mean, or with the most probable value. 
       
  • (b). Noisy Data: 
    Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways: 
    1. Binning Method: 
      This method works on sorted data in order to smooth it. The whole dataset is divided into segments of equal size, and each segment is handled separately. One can replace all data in a segment by its mean, or use the segment's boundary values to smooth the data. 
       
    2. Regression: 
      Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables). 
       
    3. Clustering: 
      This approach groups similar data into clusters. Outliers then either fall outside all clusters or may go undetected. 
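The binning method described above can be sketched as follows, assuming equal-size bins on sorted data and smoothing by bin means:

```python
import statistics

def smooth_by_bin_means(values, k):
    """Smooth noisy data: sort it, split it into k equal-size bins,
    and replace every value with the mean of its bin."""
    data = sorted(values)
    size = len(data) // k
    out = []
    for b in range(k):
        # the last bin absorbs any leftover values
        bin_vals = data[b * size:(b + 1) * size] if b < k - 1 else data[b * size:]
        mean = statistics.mean(bin_vals)
        out.extend([mean] * len(bin_vals))
    return out

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26]
print(smooth_by_bin_means(prices, 3))  # [7, 7, 7, 19, 19, 19, 25, 25, 25]
```

Smoothing by bin boundaries works the same way, except each value is replaced by whichever of the bin's minimum or maximum it is closer to.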

2. Data Transformation: 
This step transforms the data into forms appropriate for the mining process. It involves the following techniques: 

  1. Normalization: 
    It is done in order to scale the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0. 
     
  2. Attribute Selection: 
    In this strategy, new attributes are constructed from the given set of attributes to help the mining process. 
     
  3. Discretization: 
    This is done to replace the raw values of a numeric attribute with interval or conceptual labels. 
     
  4. Concept Hierarchy Generation: 
    Here attributes are converted from a lower level to a higher level in a hierarchy. For example, the attribute “city” can be converted to “country”. 

CTE in SQL

Common Table Expressions (CTEs) were introduced into standard SQL to simplify classes of queries for which a derived table was unsuitable. Introduced in SQL Server 2005, a common table expression is a temporary named result set that you can reference within a SELECT, INSERT, UPDATE, or DELETE statement. You can also use a CTE in a CREATE VIEW statement, as part of the view's SELECT query. In addition, as of SQL Server 2008, you can add a CTE to the MERGE statement. 

Why do We Need CTE in the SQL server?

A common table expression (CTE) is a powerful SQL construct that helps simplify queries. CTEs act as virtual tables (with records and columns) that are created during query execution, used by the query, and deleted after the query executes.

Using the CTE 

We can define CTEs by adding a WITH clause directly before the SELECT, INSERT, UPDATE, DELETE, or MERGE statement. The WITH clause can include one or more CTEs separated by commas. 

Syntax:

[ WITH <common_table_expression> [, ...n] ]

<common_table_expression> ::=

cte_name [ (column_name [, ...n]) ]

AS (cte_query)

Arguments

  1. Expression name: A valid identifier for the common table expression. The expression_name must be different from the names of other common table expressions defined in the same WITH clause, but it can be the same as the name of a base table or view. All references to expression_name in the query use the common table expression instead of the base object.
  2. Column name: Specifies a column name in the common table expression. Duplicate names within a single CTE definition are not allowed. The number of column names must match the number of columns in the result set of the CTE_query_definition. The list of column names is optional only if all resulting columns in the query definition have distinct names.
  3. CTE_query_definition: Specifies a SELECT statement whose result set populates the common table expression. The SELECT statement must meet the same requirements as for creating a view, except that a CTE cannot define another CTE. If multiple CTE query definitions are given, they must be combined with one of the set operators UNION ALL, UNION, EXCEPT, or INTERSECT. For more information, see the Remarks section and CREATE VIEW (Transact-SQL).
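As a quick, runnable illustration of the WITH-clause pattern above, the same CTE syntax can be exercised through Python's built-in sqlite3 module (used here only as a convenient stand-in for SQL Server; the table and column names are made up for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Region TEXT, Amount INT)")
conn.executemany("INSERT INTO Sales VALUES (?, ?)",
                 [("East", 100), ("East", 150), ("West", 200)])

# A CTE named RegionTotals with an explicit column list, referenced
# by the outer SELECT exactly as if it were a table.
rows = conn.execute("""
    WITH RegionTotals (Region, Total) AS (
        SELECT Region, SUM(Amount) FROM Sales GROUP BY Region
    )
    SELECT Region, Total FROM RegionTotals ORDER BY Total DESC
""").fetchall()
print(rows)  # [('East', 250), ('West', 200)]
```

The CTE exists only for the duration of the statement; it is not stored in the database like a view.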
     

Rules for Define and Use of Recursive Common Table Expressions
 

The following guidelines apply to defining recursive common table expressions:

  1. A recursive CTE definition must contain at least two CTE query definitions: an anchor member and a recursive member. You can define multiple anchor and recursive members; however, all anchor member query definitions must be placed before the first recursive member definition. All CTE query definitions are anchor members unless they reference the CTE itself.
  2. Anchor members must be combined with one of the following set operators: UNION ALL, UNION, INTERSECT, or EXCEPT. UNION ALL is the only set operator allowed between the last anchor member and the first recursive member, and when combining multiple recursive members. Anchor and recursive members must have the same number of columns.
  3. The data type of the recursive member’s columns must be the same as the data type of the corresponding column of the anchor member.
  4. The following items are not allowed in the CTE_query_definition of a recursive member:
      • SELECT DISTINCT
      • GROUP BY
      • PIVOT (for database compatibility level 110 and above; see Breaking Changes to Database Engine Features in SQL Server 2016)
      • HAVING
      • Scalar aggregation
      • TOP
      • LEFT, RIGHT, OUTER JOIN (INNER JOIN is allowed)
      • Subqueries
      • A hint applied to a recursive reference to a CTE inside the CTE_query_definition

Creating a Recursive Common Table Expression 

A recursive CTE is one that references itself within that CTE. The recursive CTE is useful when working with hierarchical data as the CTE continues to execute until the query returns the entire hierarchy. 

A typical example of hierarchical data is a table that includes a list of employees. For every employee, the table provides a reference to that person’s manager. That reference is itself an employee ID within the same table. You can use a recursive CTE to display the hierarchy of employee data. 

If a CTE is created incorrectly it can enter an infinite loop. To prevent this, the MAXRECURSION hint can be added in the OPTION clause of the primary SELECT, INSERT, UPDATE, DELETE, or MERGE statement. 

A table is created: 

CREATE TABLE Employees
(
EmployeeID int NOT NULL PRIMARY KEY,
FirstName varchar(50) NOT NULL,
LastName varchar(50) NOT NULL,
ManagerID int NULL
)

INSERT INTO Employees VALUES (1, 'Ken', 'Thompson', NULL)
INSERT INTO Employees VALUES (2, 'Terri', 'Ryan', 1)
INSERT INTO Employees VALUES (3, 'Robert', 'Durello', 1)
INSERT INTO Employees VALUES (4, 'Rob', 'Bailey', 2)
INSERT INTO Employees VALUES (5, 'Kent', 'Erickson', 2)
INSERT INTO Employees VALUES (6, 'Bill', 'Goldberg', 3)
INSERT INTO Employees VALUES (7, 'Ryan', 'Miller', 3)
INSERT INTO Employees VALUES (8, 'Dane', 'Mark', 5)
INSERT INTO Employees VALUES (9, 'Charles', 'Matthew', 6)
INSERT INTO Employees VALUES (10, 'Michael', 'Jhonson', 6)

After the Employees table is created and populated, the following SELECT statement, preceded by a WITH clause that defines a CTE named cteReports, returns the reporting hierarchy: 

WITH
cteReports (EmpID, FirstName, LastName, MgrID, EmpLevel)
AS
(
SELECT EmployeeID, FirstName, LastName, ManagerID, 1
FROM Employees
WHERE ManagerID IS NULL
UNION ALL
SELECT e.EmployeeID, e.FirstName, e.LastName, e.ManagerID,
r.EmpLevel + 1
FROM Employees e
INNER JOIN cteReports r
ON e.ManagerID = r.EmpID
)
SELECT
FirstName + ' ' + LastName AS FullName,
EmpLevel,
(SELECT FirstName + ' ' + LastName FROM Employees
WHERE EmployeeID = cteReports.MgrID) AS Manager
FROM cteReports
ORDER BY EmpLevel, MgrID
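As a quick end-to-end check, the same hierarchy query can be run on a subset of the rows above through Python's sqlite3 module (a stand-in for SQL Server here; note that SQLite spells the clause WITH RECURSIVE and, lacking MAXRECURSION, a LIMIT clause is the usual guard against runaway recursion):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY, FirstName TEXT,
    LastName TEXT, ManagerID INT)""")
conn.executemany("INSERT INTO Employees VALUES (?, ?, ?, ?)", [
    (1, 'Ken', 'Thompson', None), (2, 'Terri', 'Ryan', 1),
    (3, 'Robert', 'Durello', 1), (4, 'Rob', 'Bailey', 2)])

# Anchor member: the root employee (ManagerID IS NULL) at level 1.
# Recursive member: joins Employees back to the CTE, adding 1 per level.
rows = conn.execute("""
    WITH RECURSIVE cteReports (EmpID, FirstName, LastName, MgrID, EmpLevel) AS (
        SELECT EmployeeID, FirstName, LastName, ManagerID, 1
        FROM Employees WHERE ManagerID IS NULL
        UNION ALL
        SELECT e.EmployeeID, e.FirstName, e.LastName, e.ManagerID,
               r.EmpLevel + 1
        FROM Employees e JOIN cteReports r ON e.ManagerID = r.EmpID
    )
    SELECT FirstName || ' ' || LastName, EmpLevel
    FROM cteReports ORDER BY EmpLevel, EmpID
""").fetchall()
print(rows)
# [('Ken Thompson', 1), ('Terri Ryan', 2), ('Robert Durello', 2), ('Rob Bailey', 3)]
```

Each pass of the recursive member walks one level down the management chain; execution stops when the recursive member returns no new rows.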

Thus, CTEs can be useful when you need to generate temporary result sets that can be accessed in a SELECT, INSERT, UPDATE, DELETE, or MERGE statement.

Features and Limitations of Common Table Expressions in Azure

The current implementation of CTEs in Azure Synapse Analytics and Analytics Platform System (PDW) has the following features and limitations:

  1. A CTE can be specified in a SELECT statement.
  2. A CTE can be specified in a CREATE VIEW statement.
  3. A CTE can be specified in a CREATE TABLE AS SELECT (CTAS) statement.
  4. A CTE can be specified in a CREATE REMOTE TABLE AS SELECT (CRTAS) statement.
  5. A CTE can be specified in a CREATE EXTERNAL TABLE AS SELECT (CETAS) statement.


