Synopsis
Data
It’s been a while since last time I do exercise in data analysis. In this exercise, I’m going to do data cleaning to a dataset that I got from one of Shashank Kalanithi’s video tutorial, someone how share how to do data analytics on youtube. This dataset is an anonymized version of a dataset He received from a client. To be honest, I don’t really know much about the dataset, so I’m just do what he does in different programming language. For any further information about the dataset, you can click link url below.
- Data source [2.7MB]
- Video Link
- Youtube Channel
Processing Steps
- Loading and Do Pre-processing to the Dataset
- Making a new column for ID
- Subsetting the Dataset
- Merging all Subsetted data
- Export the output
- Additional Steps
1. Loading the Dataset
# import data
data_import <- read.csv("data_cleaning_challenge.csv")
# Show a portion of the dataset
head(data_import,12)
Row.Type | Iter.Number | Power1 | Speed1 | Speed2 | Electricity | Effort | Weight | Torque | X | X.1 |
---|---|---|---|---|---|---|---|---|---|---|
first name: Person | last name: Human | date: end of time | NA | |||||||
NA | ||||||||||
Row Type | Iter Number | Power1 | Speed1 | Speed2 | Electricity | Effort | Weight | Torque | NA | |
Iter | 1 | 360 | 108 | 863 | 599 | 680 | 442 | 982 | NA | |
Iter | 2 | 684 | 508 | 613 | 241 | 249 | 758 | 639 | NA | |
Iter | 3 | 365 | 126 | 825 | 407 | 855 | 164 | 86 | NA | |
Iter | 4 | 764 | 594 | 304 | 718 | 278 | 674 | 774 | NA | |
Iter | 5 | 487 | 97 | 593 | 206 | 779 | 800 | 123 | NA | |
Average | 182 | 361 | 741 | 231 | 731 | 493 | 847 | 237 | NA | |
Maximum | 276 | 33 | 97 | 154 | 25 | 922 | 9 | 312 | NA | |
Std.Dev. | 523 | 1000 | 34 | 904 | 237 | 600 | 170 | 553 | NA | |
Total | 336 | - | - | - | - | 977 | 744 | 652 | NA |
# Check the description of the dataset
str(data_import)
'data.frame': 76377 obs. of 11 variables:
$ Row.Type : Factor w/ 8 levels "","Average","first name: Person",..: 3 1 6 4 4 4 4 4 2 5 ...
$ Iter.Number: Factor w/ 1003 levels "","1","10","100",..: 1003 1 1002 2 114 225 336 447 95 199 ...
$ Power1 : Factor w/ 1004 levels "","-","1","10",..: 1003 1 1004 294 653 299 742 434 295 260 ...
$ Speed1 : Factor w/ 1003 levels "","-","1","10",..: 1 1 1003 14 458 34 553 970 717 970 ...
$ Speed2 : Factor w/ 1003 levels "","-","1","10",..: 1 1 1003 852 575 810 232 552 151 65 ...
$ Electricity: Factor w/ 1003 levels "","-","1","10",..: 1 1 1003 558 162 346 691 123 706 171 ...
$ Effort : Factor w/ 1002 levels "","1","10","100",..: 1 1 1002 648 169 842 201 757 440 917 ...
$ Weight : Factor w/ 1002 levels "","1","10","100",..: 1 1 1002 384 734 75 641 782 833 891 ...
$ Torque : Factor w/ 1002 levels "","1","10","100",..: 1 1 1002 983 602 847 752 30 156 240 ...
$ X : logi NA NA NA NA NA NA ...
$ X.1 : Factor w/ 7 levels "","notes: 2BLOCK",..: 1 1 1 1 1 1 1 1 1 1 ...
As you can see, variable X
and X.1
doesn’t really give us any information. So, I’m not going to use it.
# Subsetting the dataset
data_import <- data_import[1:9]
2. Making a New Column for Person Index
The purpose of this step was to assign an ID to each person in the dataset. With these ID, we can differentiate each record and iteration for each person.
# Make a function for iterating person index
iteration <- function(data){
Iteration <- c()
itr <- 0
for (i in 1:nrow(data)){
if (grepl("first name:", data$Row.Type[i]) == T){
itr <- itr + 1
Iteration <- c(Iteration, itr)
} else{
Iteration <- c(Iteration, Iteration[i-1])
}}
data <- cbind(data, Iteration)
return(data)
}
# add the iteration and overwrite the dataset
data_import <- iteration(data_import)
# Check the total of person in the dataset
max(data_import$Iteration)
5994
3. Subsetting the Dataset
# subset rows with first name, last name, and date.
data_name <- subset(data_import, grepl("first name:", data_import$Row.Type), select = c(Row.Type, Iter.Number, Power1, Iteration))
colnames(data_name) <- c("First.Name", "Last.Name", "Date", "Iteration")
# Cleaning data in data_name, deleting unnecessary string like 'first name:', 'last name:' and 'date:'
data_name$First.Name <- gsub("first name:", "", data_name$First.Name)
data_name$Last.Name <- gsub("last name:", "", data_name$Last.Name)
data_name$Date <- gsub("date: ", "", data_name$Date)
# subset rows without name, column name and missing value
no_name <- !(grepl("first name:", data_import$Row.Type)|
grepl('Row Type', data_import$Row.Type))
data_no_name <- subset(data_import, no_name)
data_no_name <- data_no_name[!(data_no_name$Row.Type == ""|
is.na(data_no_name$Row.Type)),]
4. Merge the Dataset into One Object
## Join data
output_data <- merge(x = data_name, y = data_no_name, by.x = "Iteration")
head(output_data,12)
Iteration | First.Name | Last.Name | Date | Row.Type | Iter.Number | Power1 | Speed1 | Speed2 | Electricity | Effort | Weight | Torque | Iteration.1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Person | Human | end of time | Iter | 1 | 360 | 108 | 863 | 599 | 680 | 442 | 982 | 1 |
1 | Person | Human | end of time | Iter | 2 | 684 | 508 | 613 | 241 | 249 | 758 | 639 | 1 |
1 | Person | Human | end of time | Iter | 3 | 365 | 126 | 825 | 407 | 855 | 164 | 86 | 1 |
1 | Person | Human | end of time | Iter | 4 | 764 | 594 | 304 | 718 | 278 | 674 | 774 | 1 |
1 | Person | Human | end of time | Iter | 5 | 487 | 97 | 593 | 206 | 779 | 800 | 123 | 1 |
1 | Person | Human | end of time | Average | 182 | 361 | 741 | 231 | 731 | 493 | 847 | 237 | 1 |
1 | Person | Human | end of time | Maximum | 276 | 33 | 97 | 154 | 25 | 922 | 9 | 312 | 1 |
1 | Person | Human | end of time | Std.Dev. | 523 | 1000 | 34 | 904 | 237 | 600 | 170 | 553 | 1 |
1 | Person | Human | end of time | Total | 336 | - | - | - | - | 977 | 744 | 652 | 1 |
2 | Person | Human | end of time | Iter | 1 | 702 | 494 | 311 | 492 | 456 | 370 | 150 | 2 |
2 | Person | Human | end of time | Iter | 2 | 929 | 82 | 838 | 421 | 154 | 346 | 227 | 2 |
2 | Person | Human | end of time | Iter | 3 | 763 | 402 | 344 | 951 | 139 | 295 | 285 | 2 |
dim(output_data)
46409
14
5. Export the Output
# Export Output Data
writexl::write_xlsx(output_data, "cleaned.xlsx")
6. Additional Steps
Now that I reviewed my previous code and throroughly examine the dataset, I found that all numbers of this dataset are just a random number from 1-999. Although this dataset give us random numbers, this dataset can provide us a good dataset layout for solving a data cleaning problem.
Steps number 5 was the end of processing step Shashank Kalanithi showed in his video. By looking at the output that being showed in step 4, we can see that the output table isn’t entirely clean and tidy. While it’s true that the dataset was way a lot cleaner by looking for its observation
, I found a lot of repetitive value in the output. So, in this Additional Step I want to add a little change to the output based on my own preference.
# Create a duplicate of output dataset
data2 <- output_data
# Replace '-' value in total to NA and change column type into a correct one
library(dplyr)
data2 <- data2 %>% mutate(Power1 = as.numeric(gsub("-", NA, Power1)),
Speed1 = as.numeric(gsub("-", NA, Speed1)),
Speed2 = as.numeric(gsub("-", NA, Speed2)),
Electricity = as.numeric(gsub("-", NA, Electricity)),
Iter.Number = as.numeric(as.character(Iter.Number)),
Effort = as.numeric(as.character(Effort)),
Weight = as.numeric(as.character(Weight)),
Torque = as.numeric(as.character(Torque)),
Row.Type = as.character(Row.Type)
)
# Create a function to extract and transform the main data to a tidier table
Exportdata <- function(data, col, row, iter){
if(row == "Iter"){
temp <- with(data, data.frame(paste(Power1[Row.Type == row & Iteration == iter], collapse = ", "),
paste(Speed1[Row.Type == row & Iteration == iter], collapse = ", "),
paste(Speed2[Row.Type == row & Iteration == iter], collapse = ", "),
paste(Electricity[Row.Type == row & Iteration == iter], collapse = ", "),
paste(Effort[Row.Type == row & Iteration == iter], collapse = ", "),
paste(Weight[Row.Type == row & Iteration == iter], collapse = ", "),
paste(Torque[Row.Type == row & Iteration == iter], collapse = ", "),
stringsAsFactors = FALSE))
} else {
temp <- with(data, data.frame(Power1[Row.Type == row & Iteration == iter],
Speed1[Row.Type == row & Iteration == iter],
Speed2[Row.Type == row & Iteration == iter],
Electricity[Row.Type == row & Iteration == iter],
Effort[Row.Type == row & Iteration == iter],
Weight[Row.Type == row & Iteration == iter],
Torque[Row.Type == row & Iteration == iter],
stringsAsFactors = FALSE))
}
colnames(temp) <- col
return(temp)
}
# Check if the new function is working
Exportdata(data = data2, col = c("Power1",'Speed1','Speed2', 'Electricity', 'Effort', 'Weight', 'Torque'), row = "Iter", iter = 1)
Power1 | Speed1 | Speed2 | Electricity | Effort | Weight | Torque |
---|---|---|---|---|---|---|
360, 684, 365, 764, 487 | 108, 508, 126, 594, 97 | 863, 613, 825, 304, 593 | 599, 241, 407, 718, 206 | 680, 249, 855, 278, 779 | 442, 758, 164, 674, 800 | 982, 639, 86, 774, 123 |
# Transform all data into a new dataset
outputIteration <- data.frame()
for (i in 1:max(data2$Iteration)){
temp <- with(data2,
data.frame(ID = paste0('ID#', sprintf('%04d', i)),
firstName = unique(First.Name[Iteration == i]),
lastName = unique(Last.Name[Iteration == i]),
Date = unique(Date[Iteration == i]),
totalIteration = max(Iter.Number[Row.Type == 'Iter' & Iteration == i]),
Exportdata(data = data2, col = c("Power1",'Speed1','Speed2', 'Electricity', 'Effort', 'Weight', 'Torque'), row = "Iter", iter = i),
Exportdata(data = data2, col = c("Avg_Power1",'Avg_Speed1','Avg_Speed2', 'Avg_Electricity', 'Avg_Effort', 'Avg_Weight', 'Avg_Torque'), row = "Average", iter = i),
Exportdata(data = data2, col = c("Max_Power1",'Max_Speed1','Max_Speed2', 'Max_Electricity', 'Max_Effort', 'Max_Weight', 'Max_Torque'), row = "Maximum", iter = i),
Exportdata(data = data2, col = c("sd_Power1",'sd_Speed1','sd_Speed2', 'sd_Electricity', 'sd_Effort', 'sd_Weight', 'sd_Torque'), row = "Std.Dev.", iter = i),
Exportdata(data = data2, col = c("Total_Power1",'Total_Speed1','Total_Speed2', 'Total_Electricity', 'Total_Effort', 'Total_Weight', 'Total_Torque'), row = "Total", iter = i),
stringsAsFactors = FALSE
)
)
outputIteration <- rbind(outputIteration, temp)
}
# Remove column with NA
outputIteration <- outputIteration %>% select_if(~ !any(is.na(.)))
# Show the last six rows, dimension, and class of each column in the dataset
tail(outputIteration); dim(outputIteration); t(sapply(outputIteration, class))
ID | firstName | lastName | Date | totalIteration | Power1 | Speed1 | Speed2 | Electricity | Effort | ... | sd_Power1 | sd_Speed1 | sd_Speed2 | sd_Electricity | sd_Effort | sd_Weight | sd_Torque | Total_Effort | Total_Weight | Total_Torque | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5989 | ID#5989 | Person | Human | end of time | 3 | 530, 23, 448 | 650, 683, 273 | 146, 627, 482 | 267, 160, 748 | 982, 582, 186 | ... | 458 | 479 | 777 | 397 | 678 | 967 | 560 | 161 | 598 | 124 |
5990 | ID#5990 | Person | Human | end of time | 4 | 877, 575, 505, 139 | 262, 385, 939, 45 | 70, 907, 870, 345 | 681, 391, 88, 256 | 585, 10, 48, 95 | ... | 188 | 532 | 54 | 26 | 32 | 3 | 470 | 445 | 7 | 945 |
5991 | ID#5991 | Person | Human | end of time | 5 | 401, 665, 325, 402, 213 | 183, 138, 336, 181, 949 | 322, 153, 454, 734, 537 | 394, 812, 15, 575, 932 | 994, 359, 205, 487, 405 | ... | 723 | 528 | 315 | 669 | 301 | 788 | 909 | 354 | 616 | 851 |
5992 | ID#5992 | Person | Human | end of time | 5 | 434, 334, 571, 303, 170 | 985, 605, 187, 544, 930 | 479, 535, 584, 901, 119 | 299, 595, 28, 894, 455 | 75, 222, 743, 824, 588 | ... | 586 | 798 | 212 | 998 | 133 | 709 | 178 | 813 | 224 | 950 |
5993 | ID#5993 | Person | Human | end of time | 5 | 294, 661, 505, 663, 834 | 469, 750, 581, 8, 133 | 91, 489, 45, 684, 805 | 243, 519, 892, 597, 907 | 439, 314, 808, 109, 798 | ... | 260 | 5 | 889 | 364 | 731 | 285 | 345 | 100 | 640 | 206 |
5994 | ID#5994 | Person | Human | end of time | 6 | 51, 623, 624, 311, 137, 879 | 210, 2, 745, 267, 717, 73 | 277, 179, 663, 247, 778, 977 | 993, 400, 239, 277, 975, 680 | 894, 34, 623, 209, 207, 500 | ... | 112 | 717 | 630 | 239 | 561 | 142 | 909 | 847 | 583 | 407 |
5994
36
ID | firstName | lastName | Date | totalIteration | Power1 | Speed1 | Speed2 | Electricity | Effort | ... | sd_Power1 | sd_Speed1 | sd_Speed2 | sd_Electricity | sd_Effort | sd_Weight | sd_Torque | Total_Effort | Total_Weight | Total_Torque |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
character | character | character | character | numeric | character | character | character | character | character | ... | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric |
# Export Output Data
writexl::write_xlsx(outputIteration, "cleaned_and_tidy.xlsx")
Now look at that. Nice, clean, and tidy data. One row for one person.
The down side of this output are,.. Firstly, each person has its own totalIteration
. So instead of making many column for each iteration, I decided to merge the iteration with paste()
function and this method can affect the output. The data created by each person are supposed to be numeric
. By merging all iteration data into one column, all iteration data will change to character
class separated by comma
. If you want to analyse this column, you had to do one more step to seperate each iteration using strsplit()
function and transform it to numeric
class. Secondly, I used for loop
which makes the entire code run longer. In total, I need 5 minutes to execute this entire code. This is too long for me, personally. And I didn’t know the alternative yet. So it count as a down side.
Although, this final output wasn’t completely perfect. I satistied with it.