Data Cleaning Challenge

23 Oct 2021 ~10 min read

Synopsis

Data

It’s been a while since last time I do exercise in data analysis. In this exercise, I’m going to do data cleaning to a dataset that I got from one of Shashank Kalanithi’s video tutorial, someone how share how to do data analytics on youtube. This dataset is an anonymized version of a dataset He received from a client. To be honest, I don’t really know much about the dataset, so I’m just do what he does in different programming language. For any further information about the dataset, you can click link url below.

Processing Steps

Loading and Do Pre-processing to the Dataset
Making a new column for ID
Subsetting the Dataset
Merging all Subsetted data
Export the output
Additional Steps

1. Loading the Dataset

# import data
    data_import <- read.csv("data_cleaning_challenge.csv")

# Show a portion of the dataset
    head(data_import,12)

Row.Type	Iter.Number	Power1	Speed1	Speed2	Electricity	Effort	Weight	Torque	X
first name: Person	last name: Human	date: end of time							NA
									NA
Row Type	Iter Number	Power1	Speed1	Speed2	Electricity	Effort	Weight	Torque	NA
Iter	1	360	108	863	599	680	442	982	NA
Iter	2	684	508	613	241	249	758	639	NA
Iter	3	365	126	825	407	855	164	86	NA
Iter	4	764	594	304	718	278	674	774	NA
Iter	5	487	97	593	206	779	800	123	NA
Average	182	361	741	231	731	493	847	237	NA
Maximum	276	33	97	154	25	922	9	312	NA
Std.Dev.	523	1000	34	904	237	600	170	553	NA
Total	336	-	-	-	-	977	744	652	NA

# Check the description of the dataset
    str(data_import)

'data.frame':	76377 obs. of  11 variables:
 $ Row.Type   : Factor w/ 8 levels "","Average","first name: Person",..: 3 1 6 4 4 4 4 4 2 5 ...
 $ Iter.Number: Factor w/ 1003 levels "","1","10","100",..: 1003 1 1002 2 114 225 336 447 95 199 ...
 $ Power1     : Factor w/ 1004 levels "","-","1","10",..: 1003 1 1004 294 653 299 742 434 295 260 ...
 $ Speed1     : Factor w/ 1003 levels "","-","1","10",..: 1 1 1003 14 458 34 553 970 717 970 ...
 $ Speed2     : Factor w/ 1003 levels "","-","1","10",..: 1 1 1003 852 575 810 232 552 151 65 ...
 $ Electricity: Factor w/ 1003 levels "","-","1","10",..: 1 1 1003 558 162 346 691 123 706 171 ...
 $ Effort     : Factor w/ 1002 levels "","1","10","100",..: 1 1 1002 648 169 842 201 757 440 917 ...
 $ Weight     : Factor w/ 1002 levels "","1","10","100",..: 1 1 1002 384 734 75 641 782 833 891 ...
 $ Torque     : Factor w/ 1002 levels "","1","10","100",..: 1 1 1002 983 602 847 752 30 156 240 ...
 $ X          : logi  NA NA NA NA NA NA ...
 $ X.1        : Factor w/ 7 levels "","notes: 2BLOCK",..: 1 1 1 1 1 1 1 1 1 1 ...

As you can see, variable X and X.1 doesn’t really give us any information. So, I’m not going to use it.

# Subsetting the dataset
    data_import <- data_import[1:9]

2. Making a New Column for Person Index

The purpose of this step was to assign an ID to each person in the dataset. With these ID, we can differentiate each record and iteration for each person.

# Make a function for iterating person index
    iteration <- function(data){
        Iteration <- c()
        itr <- 0
        for (i in 1:nrow(data)){
            if (grepl("first name:", data$Row.Type[i]) == T){
                itr <- itr + 1
                Iteration <- c(Iteration, itr)
            } else{
                Iteration <- c(Iteration, Iteration[i-1])
            }}
        data <- cbind(data, Iteration)
        return(data)
    }
    
# add the iteration and overwrite the dataset
    data_import <- iteration(data_import)

# Check the total of person in the dataset
    max(data_import$Iteration)

5994

3. Subsetting the Dataset

# subset rows with first name, last name, and date.
    data_name <- subset(data_import, grepl("first name:", data_import$Row.Type), select = c(Row.Type, Iter.Number, Power1, Iteration))
    colnames(data_name) <- c("First.Name", "Last.Name", "Date", "Iteration")

# Cleaning data in data_name, deleting unnecessary string like 'first name:', 'last name:' and 'date:'
    data_name$First.Name <- gsub("first name:", "", data_name$First.Name)
    data_name$Last.Name <- gsub("last name:", "", data_name$Last.Name)
    data_name$Date <- gsub("date: ", "", data_name$Date)

# subset rows without name, column name and missing value
    no_name <- !(grepl("first name:", data_import$Row.Type)|
                   grepl('Row Type', data_import$Row.Type))
    data_no_name <- subset(data_import, no_name)
    data_no_name <- data_no_name[!(data_no_name$Row.Type == ""|
                                     is.na(data_no_name$Row.Type)),]

4. Merge the Dataset into One Object

## Join data
    output_data <- merge(x = data_name, y = data_no_name, by.x = "Iteration")

head(output_data,12)

Iteration	First.Name	Last.Name	Date	Row.Type	Iter.Number	Power1	Speed1	Speed2	Electricity	Effort	Weight	Torque	Iteration.1
1	Person	Human	end of time	Iter	1	360	108	863	599	680	442	982	1
1	Person	Human	end of time	Iter	2	684	508	613	241	249	758	639	1
1	Person	Human	end of time	Iter	3	365	126	825	407	855	164	86	1
1	Person	Human	end of time	Iter	4	764	594	304	718	278	674	774	1
1	Person	Human	end of time	Iter	5	487	97	593	206	779	800	123	1
1	Person	Human	end of time	Average	182	361	741	231	731	493	847	237	1
1	Person	Human	end of time	Maximum	276	33	97	154	25	922	9	312	1
1	Person	Human	end of time	Std.Dev.	523	1000	34	904	237	600	170	553	1
1	Person	Human	end of time	Total	336	-	-	-	-	977	744	652	1
2	Person	Human	end of time	Iter	1	702	494	311	492	456	370	150	2
2	Person	Human	end of time	Iter	2	929	82	838	421	154	346	227	2
2	Person	Human	end of time	Iter	3	763	402	344	951	139	295	285	2

dim(output_data)

46409

5. Export the Output

# Export Output Data
    writexl::write_xlsx(output_data, "cleaned.xlsx")

6. Additional Steps

Now that I reviewed my previous code and throroughly examine the dataset, I found that all numbers of this dataset are just a random number from 1-999. Although this dataset give us random numbers, this dataset can provide us a good dataset layout for solving a data cleaning problem.

Steps number 5 was the end of processing step Shashank Kalanithi showed in his video. By looking at the output that being showed in step 4, we can see that the output table isn’t entirely clean and tidy. While it’s true that the dataset was way a lot cleaner by looking for its observation, I found a lot of repetitive value in the output. So, in this Additional Step I want to add a little change to the output based on my own preference.

# Create a duplicate of output dataset
    data2 <- output_data

# Replace '-' value in total to NA and change column type into a correct one
    library(dplyr)

    data2 <- data2 %>% mutate(Power1 = as.numeric(gsub("-", NA, Power1)), 
                              Speed1 = as.numeric(gsub("-", NA, Speed1)), 
                              Speed2 = as.numeric(gsub("-", NA, Speed2)), 
                              Electricity = as.numeric(gsub("-", NA, Electricity)),
                              Iter.Number = as.numeric(as.character(Iter.Number)),
                              Effort = as.numeric(as.character(Effort)),
                              Weight = as.numeric(as.character(Weight)),
                              Torque = as.numeric(as.character(Torque)),
                              Row.Type = as.character(Row.Type)
                             )

# Create a function to extract and transform the main data to a tidier table
    Exportdata <- function(data, col, row, iter){
        if(row == "Iter"){
            temp <- with(data, data.frame(paste(Power1[Row.Type == row & Iteration == iter], collapse = ", "),
                                          paste(Speed1[Row.Type == row & Iteration == iter], collapse = ", "),
                                          paste(Speed2[Row.Type == row & Iteration == iter], collapse = ", "),
                                          paste(Electricity[Row.Type == row & Iteration == iter], collapse = ", "),
                                          paste(Effort[Row.Type == row & Iteration == iter], collapse = ", "),
                                          paste(Weight[Row.Type == row & Iteration == iter], collapse = ", "),
                                          paste(Torque[Row.Type == row & Iteration == iter], collapse = ", "),
                                          stringsAsFactors = FALSE))
        } else {
            temp <- with(data, data.frame(Power1[Row.Type == row & Iteration == iter],
                                        Speed1[Row.Type == row & Iteration == iter],
                                        Speed2[Row.Type == row & Iteration == iter],
                                        Electricity[Row.Type == row & Iteration == iter],
                                        Effort[Row.Type == row & Iteration == iter],
                                        Weight[Row.Type == row & Iteration == iter],
                                        Torque[Row.Type == row & Iteration == iter],
                                        stringsAsFactors = FALSE))
        }
        colnames(temp) <- col
        return(temp)
    }

# Check if the new function is working
    Exportdata(data = data2, col = c("Power1",'Speed1','Speed2', 'Electricity', 'Effort', 'Weight', 'Torque'), row = "Iter", iter = 1)

Power1	Speed1	Speed2	Electricity	Effort	Weight	Torque
360, 684, 365, 764, 487	108, 508, 126, 594, 97	863, 613, 825, 304, 593	599, 241, 407, 718, 206	680, 249, 855, 278, 779	442, 758, 164, 674, 800	982, 639, 86, 774, 123

# Transform all data into a new dataset
    outputIteration <- data.frame()
    for (i in 1:max(data2$Iteration)){
        temp <- with(data2, 
                     data.frame(ID = paste0('ID#', sprintf('%04d', i)),
                                firstName = unique(First.Name[Iteration == i]),
                                lastName = unique(Last.Name[Iteration == i]),
                                Date = unique(Date[Iteration == i]),
                                totalIteration = max(Iter.Number[Row.Type == 'Iter' & Iteration == i]),
                                Exportdata(data = data2, col = c("Power1",'Speed1','Speed2', 'Electricity', 'Effort', 'Weight', 'Torque'), row = "Iter", iter = i),
                                Exportdata(data = data2, col = c("Avg_Power1",'Avg_Speed1','Avg_Speed2', 'Avg_Electricity', 'Avg_Effort', 'Avg_Weight', 'Avg_Torque'), row = "Average", iter = i),
                                Exportdata(data = data2, col = c("Max_Power1",'Max_Speed1','Max_Speed2', 'Max_Electricity', 'Max_Effort', 'Max_Weight', 'Max_Torque'), row = "Maximum", iter = i),
                                Exportdata(data = data2, col = c("sd_Power1",'sd_Speed1','sd_Speed2', 'sd_Electricity', 'sd_Effort', 'sd_Weight', 'sd_Torque'), row = "Std.Dev.", iter = i),
                                Exportdata(data = data2, col = c("Total_Power1",'Total_Speed1','Total_Speed2', 'Total_Electricity', 'Total_Effort', 'Total_Weight', 'Total_Torque'), row = "Total", iter = i),
                                stringsAsFactors = FALSE
                               )
                    )
        outputIteration <- rbind(outputIteration, temp)
        }

# Remove column with NA
    outputIteration <- outputIteration %>% select_if(~ !any(is.na(.)))

# Show the last six rows, dimension, and class of each column in the dataset
    tail(outputIteration); dim(outputIteration); t(sapply(outputIteration, class))

	ID	firstName	lastName	Date	totalIteration	Power1	Speed1	Speed2	Electricity	Effort	...	sd_Power1	sd_Speed1	sd_Speed2	sd_Electricity	sd_Effort	sd_Weight	sd_Torque	Total_Effort	Total_Weight	Total_Torque
5989	ID#5989	Person	Human	end of time	3	530, 23, 448	650, 683, 273	146, 627, 482	267, 160, 748	982, 582, 186	...	458	479	777	397	678	967	560	161	598	124
5990	ID#5990	Person	Human	end of time	4	877, 575, 505, 139	262, 385, 939, 45	70, 907, 870, 345	681, 391, 88, 256	585, 10, 48, 95	...	188	532	54	26	32	3	470	445	7	945
5991	ID#5991	Person	Human	end of time	5	401, 665, 325, 402, 213	183, 138, 336, 181, 949	322, 153, 454, 734, 537	394, 812, 15, 575, 932	994, 359, 205, 487, 405	...	723	528	315	669	301	788	909	354	616	851
5992	ID#5992	Person	Human	end of time	5	434, 334, 571, 303, 170	985, 605, 187, 544, 930	479, 535, 584, 901, 119	299, 595, 28, 894, 455	75, 222, 743, 824, 588	...	586	798	212	998	133	709	178	813	224	950
5993	ID#5993	Person	Human	end of time	5	294, 661, 505, 663, 834	469, 750, 581, 8, 133	91, 489, 45, 684, 805	243, 519, 892, 597, 907	439, 314, 808, 109, 798	...	260	5	889	364	731	285	345	100	640	206
5994	ID#5994	Person	Human	end of time	6	51, 623, 624, 311, 137, 879	210, 2, 745, 267, 717, 73	277, 179, 663, 247, 778, 977	993, 400, 239, 277, 975, 680	894, 34, 623, 209, 207, 500	...	112	717	630	239	561	142	909	847	583	407

5994

ID	firstName	lastName	Date	totalIteration	Power1	Speed1	Speed2	Electricity	Effort	...	sd_Power1	sd_Speed1	sd_Speed2	sd_Electricity	sd_Effort	sd_Weight	sd_Torque	Total_Effort	Total_Weight	Total_Torque
character	character	character	character	numeric	character	character	character	character	character	...	numeric	numeric	numeric	numeric	numeric	numeric	numeric	numeric	numeric	numeric

# Export Output Data
    writexl::write_xlsx(outputIteration, "cleaned_and_tidy.xlsx")

Now look at that. Nice, clean, and tidy data. One row for one person.

The down side of this output are,.. Firstly, each person has its own totalIteration. So instead of making many column for each iteration, I decided to merge the iteration with paste() function and this method can affect the output. The data created by each person are supposed to be numeric. By merging all iteration data into one column, all iteration data will change to character class separated by comma. If you want to analyse this column, you had to do one more step to seperate each iteration using strsplit() function and transform it to numeric class. Secondly, I used for loop which makes the entire code run longer. In total, I need 5 minutes to execute this entire code. This is too long for me, personally. And I didn’t know the alternative yet. So it count as a down side.

Although, this final output wasn’t completely perfect. I satistied with it.