banner



How To Clean Up An Incomplete Data Set In Sql

By:   |   Updated: 2022-04-13   |   Comments   |   Related: More > Data Cleansing for Validation


Trouble

As a database or business intelligence programmer I would like to learn the data wrangling steps in SQL particularly the methods used to impute missing values without having to learn any specialized languages such as Python or R.

Solution

The solution is to go hands on experience of imputing missing values using the native SQL language, but following the same basis used in Data Scientific discipline for information munging that utilize specialized languages like Python.

About Imputing Missing Values

It is very important to get some basic know-how of data wrangling and the term "Imputing Missing Values" in a Dataset from a data wrangling procedure.

Data wrangling

Data wrangling is merely a way to refine a Dataset by converting it from raw data to another grade for information analytics purposes past Data Analyst for improve decision making.

To go a better agreement about data wrangling (information exploration, data preparation, data cleaning, validating, enriching, automation, etc.) delight refer to the following tutorial: Learn basics of Data Wrangling in SQL to handle Invalid Values.

What is Imputing Missing Values?

Imputing missing values means replacing missing values with some meaningful data in a Dataset as part of data wrangling, which can exist very fourth dimension-consuming.

What are missing values?

A missing value is any value in a Dataset (such every bit a SQL database table) which has not been supplied or has been left uninitialized.

SQL Instance of missing value

Nada values correspond missing values in a SQL table which tin can pose serious problems for conveying out circuitous data assay so these missing values must be handled by using 1 of the methods applied in data wrangling.

Imputing Missing Values using Mean and Median Methods

In this walkthrough we are going to learn the following information wrangling approaches to impute (supplant) missing values:

  1. Using Drop Attribute Method
  2. Using Mean Method

Prerequisites

This tip assumes that the readers are well familiar with T-SQL and database concepts and the post-obit environment is setup:

  1. SQL Server local/remote instance is installed
  2. Database evolution and management tool like SQL Server Management Studio (SSMS) or Azure Data Studio is readily available to setup the database and run T-SQL scripts

Reference Tip

It is highly recommended to go through the previously discussed methods used in data wrangling by reading the post-obit tip: Learn basics of Data Wrangling in SQL to handle Invalid Values.

Setup Sample Dataset (Database) WatchesDWR2

The get-go thing is to setup a sample Dataset (SQL database) called WatchesDWR2, well-nigh watches, by running the post-obit script against the master database to create the data structures:

-- Create a new database called 'WatchesDWR2' -- Connect to the 'master' database to run this snippet Employ master Get  -- Create the new database if information technology does non be already IF Not EXISTS (     SELECT [name] FROM sys.databases WHERE [name] = North'WatchesDWR2' ) CREATE DATABASE WatchesDWR2 GO   -- Switch to the WatchesDWR2 database (created locally) Utilise WatchesDWR2   -- Creating a watch tabular array CREATE TABLE [dbo].[Lookout] (     [WatchId] [int] IDENTITY(ane,i) NOT NULL,     [Color] [varchar](20) NULL,     [Case Width] INT Naught,     [Case Depth] INT Null,     [Cost] [decimal](10, 2) NULL,     CONSTRAINT [PK_WatchType] PRIMARY KEY Amassed  (    [WatchId] ASC ) )   -- Insert rows into tabular array 'Spotter' in schema '[dbo]' INSERT INTO [dbo].[Watch]     ( -- Columns to insert data into     Color,[Case Width],[Case Depth], Toll     ) VALUES     (         'Black', 49, 20, 200.00 ),     (         'Bluish', 45, xviii, 150.00 ),     (         NULL, 50, xv, NULL ),     (         NULL, 55, fifteen, 170.50 ),     (         Nada, NULL, Zero, Aught ),     (         NULL, 52, xx, 300 )          

Dataset Cheque

Please have a quick await at the recently created sample database which is used as a Dataset for data wrangling by typing and executing the following T-SQL script against the WatchesDWR2 database:

-- View Watch tabular array  SELECT w.WatchId, w.Color, due west.[Case Width], due west.[Case Depth], w.Cost FROM [dbo].[Lookout man] west GO          

The output is as follows:

Watch Table used as Dataset

Please note that this output is generated by Azure Data Studio. You lot tin can use whatsoever other compliant database evolution and management tool such as SQL Server Management Studio (SSMS) to write and run your SQL scripts.

Data Wrangling by using Drop Attribute Method

The "Drop Attribute" information wrangling algorithm is used in ane of the post-obit use cases:

  1. At that place are many missing values (NULLs) of a column, but the columns itself is not of involvement from analysis point of view
  2. At that place may be no missing value of a column, only it is excluded from the assay nosotros are preparing the Dataset for
  3. There may exist very few missing values of cavalcade, merely dropping the field (cavalcade) is better than replacing those values

And then, in other words, this is the data wrangling approach in which the Data Scientist \ Data Wrangler decides to detach (remove) an attribute from the Dataset (such every bit a SQL tabular array equally a data source) on the ground of its rare or non-use for the upcoming information analysis.

For example, if we are preparing the Dataset for analyzing case width and instance depth of different watches sold to the customers recorded in the Dataset then Color column does non carry any weight to proceed further to the adjacent stages.

Considering the current Watch t tabular array, the following two strong reasons can be used to remove this column:

  1. It (Color) is not integral part of the analysis which requires case width and case depth of the watches
  2. It has a lot of NULLs which may exist very plush (in terms of attempt) to impute (replace) them with meaningful data

Equally a issue of the above reasons, we determine to drop this attribute (cavalcade) past using data wrangling drib attribute arroyo.

This method is (if done in SQL) applied by the help of the post-obit script (executed against the sample database):

-- Drib '[Color]' from table '[Sentry]' in schema '[dbo]' using Drop Attribute Information Wrangling Method Alter TABLE [dbo].[Watch]     DROP COLUMN [Color] GO          

At present check the refined Dataset (table):

-- View Watch Tabular array (Dataset) after applying Drop Attribute Information Wrangling Method SELECT w.WatchId, w.[Case Width], w.[Case Depth], w.Cost FROM [dbo].[Spotter] w GO          

The result fix is as follows:

Drop Attribute (Color) Data Wrangling Method has been applied to the Dataset (Watch Table)

We accept refined the Dataset by dropping the Color column as it was non needed for analysis and cost (in terms of endeavor) of imputing (replacing) its missing values (Nulls) was far more than than detaching it from the Dataset straight abroad.

Information Wrangling by Imputing Missing Values using Mean Method

Let us at present discuss another form of information wrangling by imputing (replacing) missing values with the assistance of Hateful Method to amend the data quality.  This method requires us to summate statistical mean value of the series of the dataset to impute (supersede) missing values.

The Mean value is defined as follows: SUM of elements of the series/Number of elements of the serial

In other words, this is average of a serial of elements to be used as a replacement for missing value(s).

However, please remember that this is applicable to numerical series only (having numbers) and it is but desirable when very modest percent of data is missing.

Mean Calculation

Let us employ the Hateful value method to impute the missing value in Case Width cavalcade by running the following script:

--Data Wrangling Hateful value method to impute the missing value in Case Width cavalcade SELECT SUM(due west.[Case Width]) Equally SumOfValues, COUNT(*) NumberOfValues, SUM(west.[Case Width])/COUNT(*) as Mean  FROM dbo.Watch west WHERE due west.[Case Width] is NOT NULL   --Imputing the missing value in Example Width Column with Mean: fifty UPDATE dbo.Lookout SET [Instance Width]=50 -- Mean Value (Average) WHERE [Case Width] IS Naught          

The output is as follows:

Calculating Mean Value of Case Width

Let u.s.a. take a await at the tabular array now:

-- View Watch Table (Dataset) after applying Mean Value Data Wrangling Method for Example Width SELECT westward.WatchId, w.[Case Width], w.[Instance Depth], westward.Toll FROM [dbo].[Lookout man] due west GO          

The updated Watch tabular array (Dataset) is shown below:

Imputing (replacing) missing value (Null) in the Case Width column by using Mean of the series of the elements

The Case Width cavalcade of the Sentry table (Dataset) has been successfully populated using the data wrangling Mean method by replacing Nulls with computed hateful value which is 50.

Let usa refine the Case Depth column as well by applying the same Mean Method for this column.

Please write and run the following script:

--Data Wrangling Mean value method to impute the missing value in Case Depth column SELECT SUM(w.[Instance Depth]) AS SumOfValues, COUNT(*) NumberOfValues, SUM(west.[Case Depth])/COUNT(*) as Mean FROM dbo.Watch w WHERE w.[Instance Depth] is Not NULL   --Imputing the missing value in Instance Depth Column with Mean: 50 UPDATE dbo.Sentinel SET [Example Depth]=17 -- Hateful Value (Average) WHERE [Case Depth] IS NULL          

The output is as follows:

Calculating Case Depth Mean Value

Let us view the table at present:

-- View Watch Table (Dataset) afterward applying Hateful Value Data Wrangling Method for Case Depth SELECT w.WatchId, w.[Case Width], west.[Example Depth], w.Price FROM [dbo].[Sentinel] w Get          

The results are shown below:

Imputing Missing Value in the Case Depth Column

We can clearly see that the Case Depth column's missing value has been successfully replaced with the mean value and tabular array does not accept the same number of Nulls that it had before.

Alternative way to calculate Mean

Please remember since Mean is simply like Average, so nosotros can as well use the SQL Boilerplate function AVG() to calculate hateful value as a portion of your validation rules in the real-world.

For example, if we have to summate the mean value of Cost column then we can just type the following script:

--Estimating Mean Value of Price Column using AVG functionSELECT SUM(W.Cost) AS SUM_PRICE, COUNT(W.PRICE) Every bit COUNT_PRICE,AVG(w.Price) as Mean_Price_Using_AVG  FROM dbo.Lookout man w          

The output can be seen below:

Calculating Mean value of the Price column using SQL Average function (AVG)

Next Steps
  • Delight reset the sample database (Dataset) and apply data wrangling mean value method by using Average function for both Case Width and Case Depth Columns and see if you get the same results or non.
  • Try imputing (replacing) missing values in the Price Column by using Mean Method.
  • Delight setup the sample database OfficeSuppliesSampleV2_Data referenced in this tip and try data wrangling techniques after replacing columns Quantity and Price with Nulls for whatsoever two orders (rows) and try imputing the missing values using data wrangling hateful value method mentioned in this tip.
  • Please create a sample database OfficeSuppliesSample by referring to this tip and supersede whatever 3 of the rows of both Stock and Toll column with NULLs in the Production tabular array to use data wrangling in the light of this tip.

Related Manufactures

Pop Articles

About the author

MSSQLTips author Haroon Ashraf

Haroon Ashraf'due south interests are Database-Centric Architectures and his expertise includes development, testing, implementation and migration along with Database Life Wheel Management (DLM).

View all my tips

Article Last Updated: 2022-04-13

Source: https://www.mssqltips.com/sqlservertip/6815/data-wrangling-sql-imputing-missing-values/

Posted by: quadeliandn.blogspot.com

0 Response to "How To Clean Up An Incomplete Data Set In Sql"

Post a Comment

Iklan Atas Artikel

Iklan Tengah Artikel 1

Iklan Tengah Artikel 2

Iklan Bawah Artikel