• Home
  • About
    • rdwin.github.io photo

      R, Dwi N

      'Anything that can go wrong, will go wrong'
      ~Murphy's Law

    • Learn More
    • Email
    • LinkedIn
    • Instagram
    • Github
  • Tags
  • Posts
  • Projects

Data Cleaning Challenge

23 Oct 2021   ~10 min read

Synopsis

Data

It’s been a while since last time I do exercise in data analysis. In this exercise, I’m going to do data cleaning to a dataset that I got from one of Shashank Kalanithi’s video tutorial, someone how share how to do data analytics on youtube. This dataset is an anonymized version of a dataset He received from a client. To be honest, I don’t really know much about the dataset, so I’m just do what he does in different programming language. For any further information about the dataset, you can click link url below.

  • Data source [2.7MB]
  • Video Link
  • Youtube Channel

Processing Steps

  1. Loading and Do Pre-processing to the Dataset
  2. Making a new column for ID
  3. Subsetting the Dataset
  4. Merging all Subsetted data
  5. Export the output
  6. Additional Steps

1. Loading the Dataset

# import data
    data_import <- read.csv("data_cleaning_challenge.csv")
# Show a portion of the dataset
    head(data_import,12)
Row.Type Iter.Number Power1 Speed1 Speed2 Electricity Effort Weight Torque X X.1
first name: Person last name: Human date: end of time NA
NA
Row Type Iter Number Power1 Speed1 Speed2 Electricity Effort Weight Torque NA
Iter 1 360 108 863 599 680 442 982 NA
Iter 2 684 508 613 241 249 758 639 NA
Iter 3 365 126 825 407 855 164 86 NA
Iter 4 764 594 304 718 278 674 774 NA
Iter 5 487 97 593 206 779 800 123 NA
Average 182 361 741 231 731 493 847 237 NA
Maximum 276 33 97 154 25 922 9 312 NA
Std.Dev. 523 1000 34 904 237 600 170 553 NA
Total 336 - - - - 977 744 652 NA
# Check the description of the dataset
    str(data_import)
'data.frame':	76377 obs. of  11 variables:
 $ Row.Type   : Factor w/ 8 levels "","Average","first name: Person",..: 3 1 6 4 4 4 4 4 2 5 ...
 $ Iter.Number: Factor w/ 1003 levels "","1","10","100",..: 1003 1 1002 2 114 225 336 447 95 199 ...
 $ Power1     : Factor w/ 1004 levels "","-","1","10",..: 1003 1 1004 294 653 299 742 434 295 260 ...
 $ Speed1     : Factor w/ 1003 levels "","-","1","10",..: 1 1 1003 14 458 34 553 970 717 970 ...
 $ Speed2     : Factor w/ 1003 levels "","-","1","10",..: 1 1 1003 852 575 810 232 552 151 65 ...
 $ Electricity: Factor w/ 1003 levels "","-","1","10",..: 1 1 1003 558 162 346 691 123 706 171 ...
 $ Effort     : Factor w/ 1002 levels "","1","10","100",..: 1 1 1002 648 169 842 201 757 440 917 ...
 $ Weight     : Factor w/ 1002 levels "","1","10","100",..: 1 1 1002 384 734 75 641 782 833 891 ...
 $ Torque     : Factor w/ 1002 levels "","1","10","100",..: 1 1 1002 983 602 847 752 30 156 240 ...
 $ X          : logi  NA NA NA NA NA NA ...
 $ X.1        : Factor w/ 7 levels "","notes: 2BLOCK",..: 1 1 1 1 1 1 1 1 1 1 ...

As you can see, variable X and X.1 doesn’t really give us any information. So, I’m not going to use it.

# Subsetting the dataset
    data_import <- data_import[1:9]

2. Making a New Column for Person Index

The purpose of this step was to assign an ID to each person in the dataset. With these ID, we can differentiate each record and iteration for each person.

# Make a function for iterating person index
    iteration <- function(data){
        Iteration <- c()
        itr <- 0
        for (i in 1:nrow(data)){
            if (grepl("first name:", data$Row.Type[i]) == T){
                itr <- itr + 1
                Iteration <- c(Iteration, itr)
            } else{
                Iteration <- c(Iteration, Iteration[i-1])
            }}
        data <- cbind(data, Iteration)
        return(data)
    }
    
# add the iteration and overwrite the dataset
    data_import <- iteration(data_import)
# Check the total of person in the dataset
    max(data_import$Iteration)

5994

3. Subsetting the Dataset

# subset rows with first name, last name, and date.
    data_name <- subset(data_import, grepl("first name:", data_import$Row.Type), select = c(Row.Type, Iter.Number, Power1, Iteration))
    colnames(data_name) <- c("First.Name", "Last.Name", "Date", "Iteration")
# Cleaning data in data_name, deleting unnecessary string like 'first name:', 'last name:' and 'date:'
    data_name$First.Name <- gsub("first name:", "", data_name$First.Name)
    data_name$Last.Name <- gsub("last name:", "", data_name$Last.Name)
    data_name$Date <- gsub("date: ", "", data_name$Date)
# subset rows without name, column name and missing value
    no_name <- !(grepl("first name:", data_import$Row.Type)|
                   grepl('Row Type', data_import$Row.Type))
    data_no_name <- subset(data_import, no_name)
    data_no_name <- data_no_name[!(data_no_name$Row.Type == ""|
                                     is.na(data_no_name$Row.Type)),]

4. Merge the Dataset into One Object

## Join data
    output_data <- merge(x = data_name, y = data_no_name, by.x = "Iteration")
head(output_data,12)
Iteration First.Name Last.Name Date Row.Type Iter.Number Power1 Speed1 Speed2 Electricity Effort Weight Torque Iteration.1
1 Person Human end of time Iter 1 360 108 863 599 680 442 982 1
1 Person Human end of time Iter 2 684 508 613 241 249 758 639 1
1 Person Human end of time Iter 3 365 126 825 407 855 164 86 1
1 Person Human end of time Iter 4 764 594 304 718 278 674 774 1
1 Person Human end of time Iter 5 487 97 593 206 779 800 123 1
1 Person Human end of time Average 182 361 741 231 731 493 847 237 1
1 Person Human end of time Maximum 276 33 97 154 25 922 9 312 1
1 Person Human end of time Std.Dev. 523 1000 34 904 237 600 170 553 1
1 Person Human end of time Total 336 - - - - 977 744 652 1
2 Person Human end of time Iter 1 702 494 311 492 456 370 150 2
2 Person Human end of time Iter 2 929 82 838 421 154 346 227 2
2 Person Human end of time Iter 3 763 402 344 951 139 295 285 2
dim(output_data)

46409

14

5. Export the Output

# Export Output Data
    writexl::write_xlsx(output_data, "cleaned.xlsx")

6. Additional Steps

Now that I reviewed my previous code and throroughly examine the dataset, I found that all numbers of this dataset are just a random number from 1-999. Although this dataset give us random numbers, this dataset can provide us a good dataset layout for solving a data cleaning problem.

Steps number 5 was the end of processing step Shashank Kalanithi showed in his video. By looking at the output that being showed in step 4, we can see that the output table isn’t entirely clean and tidy. While it’s true that the dataset was way a lot cleaner by looking for its observation, I found a lot of repetitive value in the output. So, in this Additional Step I want to add a little change to the output based on my own preference.

# Create a duplicate of output dataset
    data2 <- output_data
# Replace '-' value in total to NA and change column type into a correct one
    library(dplyr)

    data2 <- data2 %>% mutate(Power1 = as.numeric(gsub("-", NA, Power1)), 
                              Speed1 = as.numeric(gsub("-", NA, Speed1)), 
                              Speed2 = as.numeric(gsub("-", NA, Speed2)), 
                              Electricity = as.numeric(gsub("-", NA, Electricity)),
                              Iter.Number = as.numeric(as.character(Iter.Number)),
                              Effort = as.numeric(as.character(Effort)),
                              Weight = as.numeric(as.character(Weight)),
                              Torque = as.numeric(as.character(Torque)),
                              Row.Type = as.character(Row.Type)
                             )
# Create a function to extract and transform the main data to a tidier table
    Exportdata <- function(data, col, row, iter){
        if(row == "Iter"){
            temp <- with(data, data.frame(paste(Power1[Row.Type == row & Iteration == iter], collapse = ", "),
                                          paste(Speed1[Row.Type == row & Iteration == iter], collapse = ", "),
                                          paste(Speed2[Row.Type == row & Iteration == iter], collapse = ", "),
                                          paste(Electricity[Row.Type == row & Iteration == iter], collapse = ", "),
                                          paste(Effort[Row.Type == row & Iteration == iter], collapse = ", "),
                                          paste(Weight[Row.Type == row & Iteration == iter], collapse = ", "),
                                          paste(Torque[Row.Type == row & Iteration == iter], collapse = ", "),
                                          stringsAsFactors = FALSE))
        } else {
            temp <- with(data, data.frame(Power1[Row.Type == row & Iteration == iter],
                                        Speed1[Row.Type == row & Iteration == iter],
                                        Speed2[Row.Type == row & Iteration == iter],
                                        Electricity[Row.Type == row & Iteration == iter],
                                        Effort[Row.Type == row & Iteration == iter],
                                        Weight[Row.Type == row & Iteration == iter],
                                        Torque[Row.Type == row & Iteration == iter],
                                        stringsAsFactors = FALSE))
        }
        colnames(temp) <- col
        return(temp)
    }

# Check if the new function is working
    Exportdata(data = data2, col = c("Power1",'Speed1','Speed2', 'Electricity', 'Effort', 'Weight', 'Torque'), row = "Iter", iter = 1)
Power1 Speed1 Speed2 Electricity Effort Weight Torque
360, 684, 365, 764, 487 108, 508, 126, 594, 97 863, 613, 825, 304, 593 599, 241, 407, 718, 206 680, 249, 855, 278, 779 442, 758, 164, 674, 800 982, 639, 86, 774, 123
# Transform all data into a new dataset
    outputIteration <- data.frame()
    for (i in 1:max(data2$Iteration)){
        temp <- with(data2, 
                     data.frame(ID = paste0('ID#', sprintf('%04d', i)),
                                firstName = unique(First.Name[Iteration == i]),
                                lastName = unique(Last.Name[Iteration == i]),
                                Date = unique(Date[Iteration == i]),
                                totalIteration = max(Iter.Number[Row.Type == 'Iter' & Iteration == i]),
                                Exportdata(data = data2, col = c("Power1",'Speed1','Speed2', 'Electricity', 'Effort', 'Weight', 'Torque'), row = "Iter", iter = i),
                                Exportdata(data = data2, col = c("Avg_Power1",'Avg_Speed1','Avg_Speed2', 'Avg_Electricity', 'Avg_Effort', 'Avg_Weight', 'Avg_Torque'), row = "Average", iter = i),
                                Exportdata(data = data2, col = c("Max_Power1",'Max_Speed1','Max_Speed2', 'Max_Electricity', 'Max_Effort', 'Max_Weight', 'Max_Torque'), row = "Maximum", iter = i),
                                Exportdata(data = data2, col = c("sd_Power1",'sd_Speed1','sd_Speed2', 'sd_Electricity', 'sd_Effort', 'sd_Weight', 'sd_Torque'), row = "Std.Dev.", iter = i),
                                Exportdata(data = data2, col = c("Total_Power1",'Total_Speed1','Total_Speed2', 'Total_Electricity', 'Total_Effort', 'Total_Weight', 'Total_Torque'), row = "Total", iter = i),
                                stringsAsFactors = FALSE
                               )
                    )
        outputIteration <- rbind(outputIteration, temp)
        }
# Remove column with NA
    outputIteration <- outputIteration %>% select_if(~ !any(is.na(.)))

# Show the last six rows, dimension, and class of each column in the dataset
    tail(outputIteration); dim(outputIteration); t(sapply(outputIteration, class))
ID firstName lastName Date totalIteration Power1 Speed1 Speed2 Electricity Effort ... sd_Power1 sd_Speed1 sd_Speed2 sd_Electricity sd_Effort sd_Weight sd_Torque Total_Effort Total_Weight Total_Torque
5989 ID#5989 Person Human end of time 3 530, 23, 448 650, 683, 273 146, 627, 482 267, 160, 748 982, 582, 186 ... 458 479 777 397 678 967 560 161 598 124
5990 ID#5990 Person Human end of time 4 877, 575, 505, 139 262, 385, 939, 45 70, 907, 870, 345 681, 391, 88, 256 585, 10, 48, 95 ... 188 532 54 26 32 3 470 445 7 945
5991 ID#5991 Person Human end of time 5 401, 665, 325, 402, 213 183, 138, 336, 181, 949 322, 153, 454, 734, 537 394, 812, 15, 575, 932 994, 359, 205, 487, 405 ... 723 528 315 669 301 788 909 354 616 851
5992 ID#5992 Person Human end of time 5 434, 334, 571, 303, 170 985, 605, 187, 544, 930 479, 535, 584, 901, 119 299, 595, 28, 894, 455 75, 222, 743, 824, 588 ... 586 798 212 998 133 709 178 813 224 950
5993 ID#5993 Person Human end of time 5 294, 661, 505, 663, 834 469, 750, 581, 8, 133 91, 489, 45, 684, 805 243, 519, 892, 597, 907 439, 314, 808, 109, 798 ... 260 5 889 364 731 285 345 100 640 206
5994 ID#5994 Person Human end of time 6 51, 623, 624, 311, 137, 879 210, 2, 745, 267, 717, 73 277, 179, 663, 247, 778, 977 993, 400, 239, 277, 975, 680 894, 34, 623, 209, 207, 500 ... 112 717 630 239 561 142 909 847 583 407

5994

36

ID firstName lastName Date totalIteration Power1 Speed1 Speed2 Electricity Effort ... sd_Power1 sd_Speed1 sd_Speed2 sd_Electricity sd_Effort sd_Weight sd_Torque Total_Effort Total_Weight Total_Torque
character character character character numeric character character character character character ... numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric
# Export Output Data
    writexl::write_xlsx(outputIteration, "cleaned_and_tidy.xlsx")

Now look at that. Nice, clean, and tidy data. One row for one person.

The down side of this output are,.. Firstly, each person has its own totalIteration. So instead of making many column for each iteration, I decided to merge the iteration with paste() function and this method can affect the output. The data created by each person are supposed to be numeric. By merging all iteration data into one column, all iteration data will change to character class separated by comma. If you want to analyse this column, you had to do one more step to seperate each iteration using strsplit() function and transform it to numeric class. Secondly, I used for loop which makes the entire code run longer. In total, I need 5 minutes to execute this entire code. This is too long for me, personally. And I didn’t know the alternative yet. So it count as a down side.

Although, this final output wasn’t completely perfect. I satistied with it.

r programmingdata cleaning Share: Facebook twitter Copy link