How To Clean Up An Incomplete Data Set In Sql
By: | Updated: 2022-04-13 | Comments | Related: More > Data Cleansing for Validation
Trouble
As a database or business intelligence programmer I would like to learn the data wrangling steps in SQL particularly the methods used to impute missing values without having to learn any specialized languages such as Python or R.
Solution
The solution is to go hands on experience of imputing missing values using the native SQL language, but following the same basis used in Data Scientific discipline for information munging that utilize specialized languages like Python.
About Imputing Missing Values
It is very important to get some basic know-how of data wrangling and the term "Imputing Missing Values" in a Dataset from a data wrangling procedure.
Data wrangling
Data wrangling is merely a way to refine a Dataset by converting it from raw data to another grade for information analytics purposes past Data Analyst for improve decision making.
To go a better agreement about data wrangling (information exploration, data preparation, data cleaning, validating, enriching, automation, etc.) delight refer to the following tutorial: Learn basics of Data Wrangling in SQL to handle Invalid Values.
What is Imputing Missing Values?
Imputing missing values means replacing missing values with some meaningful data in a Dataset as part of data wrangling, which can exist very fourth dimension-consuming.
What are missing values?
A missing value is any value in a Dataset (such every bit a SQL database table) which has not been supplied or has been left uninitialized.
SQL Instance of missing value
Nada values correspond missing values in a SQL table which tin can pose serious problems for conveying out circuitous data assay so these missing values must be handled by using 1 of the methods applied in data wrangling.
Imputing Missing Values using Mean and Median Methods
In this walkthrough we are going to learn the following information wrangling approaches to impute (supplant) missing values:
- Using Drop Attribute Method
- Using Mean Method
Prerequisites
This tip assumes that the readers are well familiar with T-SQL and database concepts and the post-obit environment is setup:
- SQL Server local/remote instance is installed
- Database evolution and management tool like SQL Server Management Studio (SSMS) or Azure Data Studio is readily available to setup the database and run T-SQL scripts
Reference Tip
It is highly recommended to go through the previously discussed methods used in data wrangling by reading the post-obit tip: Learn basics of Data Wrangling in SQL to handle Invalid Values.
Setup Sample Dataset (Database) WatchesDWR2
The get-go thing is to setup a sample Dataset (SQL database) called WatchesDWR2, well-nigh watches, by running the post-obit script against the master database to create the data structures:
-- Create a new database called 'WatchesDWR2' -- Connect to the 'master' database to run this snippet Employ master Get -- Create the new database if information technology does non be already IF Not EXISTS ( SELECT [name] FROM sys.databases WHERE [name] = North'WatchesDWR2' ) CREATE DATABASE WatchesDWR2 GO -- Switch to the WatchesDWR2 database (created locally) Utilise WatchesDWR2 -- Creating a watch tabular array CREATE TABLE [dbo].[Lookout] ( [WatchId] [int] IDENTITY(ane,i) NOT NULL, [Color] [varchar](20) NULL, [Case Width] INT Naught, [Case Depth] INT Null, [Cost] [decimal](10, 2) NULL, CONSTRAINT [PK_WatchType] PRIMARY KEY Amassed ( [WatchId] ASC ) ) -- Insert rows into tabular array 'Spotter' in schema '[dbo]' INSERT INTO [dbo].[Watch] ( -- Columns to insert data into Color,[Case Width],[Case Depth], Toll ) VALUES ( 'Black', 49, 20, 200.00 ), ( 'Bluish', 45, xviii, 150.00 ), ( NULL, 50, xv, NULL ), ( NULL, 55, fifteen, 170.50 ), ( Nada, NULL, Zero, Aught ), ( NULL, 52, xx, 300 )
Dataset Cheque
Please have a quick await at the recently created sample database which is used as a Dataset for data wrangling by typing and executing the following T-SQL script against the WatchesDWR2 database:
-- View Watch tabular array SELECT w.WatchId, w.Color, due west.[Case Width], due west.[Case Depth], w.Cost FROM [dbo].[Lookout man] west GO
The output is as follows:
Please note that this output is generated by Azure Data Studio. You lot tin can use whatsoever other compliant database evolution and management tool such as SQL Server Management Studio (SSMS) to write and run your SQL scripts.
Data Wrangling by using Drop Attribute Method
The "Drop Attribute" information wrangling algorithm is used in ane of the post-obit use cases:
- At that place are many missing values (NULLs) of a column, but the columns itself is not of involvement from analysis point of view
- At that place may be no missing value of a column, only it is excluded from the assay nosotros are preparing the Dataset for
- There may exist very few missing values of cavalcade, merely dropping the field (cavalcade) is better than replacing those values
And then, in other words, this is the data wrangling approach in which the Data Scientist \ Data Wrangler decides to detach (remove) an attribute from the Dataset (such every bit a SQL tabular array equally a data source) on the ground of its rare or non-use for the upcoming information analysis.
For example, if we are preparing the Dataset for analyzing case width and instance depth of different watches sold to the customers recorded in the Dataset then Color column does non carry any weight to proceed further to the adjacent stages.
Considering the current Watch t tabular array, the following two strong reasons can be used to remove this column:
- It (Color) is not integral part of the analysis which requires case width and case depth of the watches
- It has a lot of NULLs which may exist very plush (in terms of attempt) to impute (replace) them with meaningful data
Equally a issue of the above reasons, we determine to drop this attribute (cavalcade) past using data wrangling drib attribute arroyo.
This method is (if done in SQL) applied by the help of the post-obit script (executed against the sample database):
-- Drib '[Color]' from table '[Sentry]' in schema '[dbo]' using Drop Attribute Information Wrangling Method Alter TABLE [dbo].[Watch] DROP COLUMN [Color] GO
At present check the refined Dataset (table):
-- View Watch Tabular array (Dataset) after applying Drop Attribute Information Wrangling Method SELECT w.WatchId, w.[Case Width], w.[Case Depth], w.Cost FROM [dbo].[Spotter] w GO
The result fix is as follows:
We accept refined the Dataset by dropping the Color column as it was non needed for analysis and cost (in terms of endeavor) of imputing (replacing) its missing values (Nulls) was far more than than detaching it from the Dataset straight abroad.
Information Wrangling by Imputing Missing Values using Mean Method
Let us at present discuss another form of information wrangling by imputing (replacing) missing values with the assistance of Hateful Method to amend the data quality. This method requires us to summate statistical mean value of the series of the dataset to impute (supersede) missing values.
The Mean value is defined as follows: SUM of elements of the series/Number of elements of the serial
In other words, this is average of a serial of elements to be used as a replacement for missing value(s).
However, please remember that this is applicable to numerical series only (having numbers) and it is but desirable when very modest percent of data is missing.
Let us employ the Hateful value method to impute the missing value in Case Width cavalcade by running the following script:
--Data Wrangling Hateful value method to impute the missing value in Case Width cavalcade SELECT SUM(due west.[Case Width]) Equally SumOfValues, COUNT(*) NumberOfValues, SUM(west.[Case Width])/COUNT(*) as Mean FROM dbo.Watch west WHERE due west.[Case Width] is NOT NULL --Imputing the missing value in Example Width Column with Mean: fifty UPDATE dbo.Lookout SET [Instance Width]=50 -- Mean Value (Average) WHERE [Case Width] IS Naught
The output is as follows:
Let u.s.a. take a await at the tabular array now:
-- View Watch Table (Dataset) after applying Mean Value Data Wrangling Method for Example Width SELECT westward.WatchId, w.[Case Width], w.[Instance Depth], westward.Toll FROM [dbo].[Lookout man] due west GO
The updated Watch tabular array (Dataset) is shown below:
The Case Width cavalcade of the Sentry table (Dataset) has been successfully populated using the data wrangling Mean method by replacing Nulls with computed hateful value which is 50.
Let usa refine the Case Depth column as well by applying the same Mean Method for this column.
Please write and run the following script:
--Data Wrangling Mean value method to impute the missing value in Case Depth column SELECT SUM(w.[Instance Depth]) AS SumOfValues, COUNT(*) NumberOfValues, SUM(west.[Case Depth])/COUNT(*) as Mean FROM dbo.Watch w WHERE w.[Instance Depth] is Not NULL --Imputing the missing value in Instance Depth Column with Mean: 50 UPDATE dbo.Sentinel SET [Example Depth]=17 -- Hateful Value (Average) WHERE [Case Depth] IS NULL
The output is as follows:
Let us view the table at present:
-- View Watch Table (Dataset) afterward applying Hateful Value Data Wrangling Method for Case Depth SELECT w.WatchId, w.[Case Width], west.[Example Depth], w.Price FROM [dbo].[Sentinel] w Get
The results are shown below:
We can clearly see that the Case Depth column's missing value has been successfully replaced with the mean value and tabular array does not accept the same number of Nulls that it had before.
Alternative way to calculate Mean
Please remember since Mean is simply like Average, so nosotros can as well use the SQL Boilerplate function AVG() to calculate hateful value as a portion of your validation rules in the real-world.
For example, if we have to summate the mean value of Cost column then we can just type the following script:
--Estimating Mean Value of Price Column using AVG functionSELECT SUM(W.Cost) AS SUM_PRICE, COUNT(W.PRICE) Every bit COUNT_PRICE,AVG(w.Price) as Mean_Price_Using_AVG FROM dbo.Lookout man w
The output can be seen below:
Next Steps
- Delight reset the sample database (Dataset) and apply data wrangling mean value method by using Average function for both Case Width and Case Depth Columns and see if you get the same results or non.
- Try imputing (replacing) missing values in the Price Column by using Mean Method.
- Delight setup the sample database OfficeSuppliesSampleV2_Data referenced in this tip and try data wrangling techniques after replacing columns Quantity and Price with Nulls for whatsoever two orders (rows) and try imputing the missing values using data wrangling hateful value method mentioned in this tip.
- Please create a sample database OfficeSuppliesSample by referring to this tip and supersede whatever 3 of the rows of both Stock and Toll column with NULLs in the Production tabular array to use data wrangling in the light of this tip.
Related Manufactures
Pop Articles
About the author
Haroon Ashraf'due south interests are Database-Centric Architectures and his expertise includes development, testing, implementation and migration along with Database Life Wheel Management (DLM).
View all my tips
Article Last Updated: 2022-04-13
Source: https://www.mssqltips.com/sqlservertip/6815/data-wrangling-sql-imputing-missing-values/
Posted by: quadeliandn.blogspot.com

0 Response to "How To Clean Up An Incomplete Data Set In Sql"
Post a Comment