Synopsis
Data
The experiments have been carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, we captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50Hz. The experiments have been video-recorded to label the data manually. The obtained dataset has been randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data and 30% for the test data.
The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap (128 readings/window). The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low frequency components, therefore a filter with 0.3 Hz cutoff frequency was used. From each window, a vector of features was obtained by calculating variables from the time and frequency domain.
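The window arithmetic in the paragraph above can be checked with a few lines of R. This is only an illustrative sketch: the variable names and the `window_starts` helper are assumptions made for this example and are not part of the original data-collection code.

```r
# Windowing parameters taken from the description above
sampling_rate_hz <- 50      # samples per second
window_seconds   <- 2.56    # window width in seconds
overlap_fraction <- 0.5     # 50% overlap between consecutive windows

# Readings per window: 2.56 s * 50 Hz = 128
window_size <- round(window_seconds * sampling_rate_hz)

# Step between window starts: half a window = 64 readings
step_size <- round(window_size * (1 - overlap_fraction))

# Hypothetical helper: start index of every full window in a signal of length n
window_starts <- function(n, size = window_size, step = step_size) {
  if (n < size) return(numeric(0))
  seq(1, n - size + 1, by = step)
}

# Example: 10 seconds of signal (500 readings)
starts <- window_starts(500)
print(window_size)
print(step_size)
print(starts)
```

Each start index marks a window covering `window_size` consecutive readings, so neighbouring windows share half of their samples.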
A full description is available at the site where the data was obtained: Here
The purpose of this project is to demonstrate how to collect, load, and clean a data set. The goal is to prepare tidy data that can be processed easily in later analysis.
- Data set [60MB]
Processing Steps
- Collect the data.
- Merge the training and test data into one data set.
- Extract the mean and standard deviation of each measurement.
- Replace the activity codes with descriptive activity labels.
- Change the column names to descriptive labels.
- Create an independent data set with the average of each variable for each activity and each subject.
1. Collecting the Data
# Download the data set (mode = 'wb' keeps the zip archive intact on Windows)
fileurl <- 'https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip'
download.file(fileurl, destfile = 'projectdataset.zip', mode = 'wb')
# Unzip the data set
unzip('./projectdataset.zip')
# Read the files
# Read training data
activity.train <- read.table('./UCI HAR Dataset/train/y_train.txt', header = FALSE)
feature.train <- read.table('./UCI HAR Dataset/train/X_train.txt', header = FALSE)
subject.train <- read.table('./UCI HAR Dataset/train/subject_train.txt', header = FALSE)
# Read test data
activity.test <- read.table('./UCI HAR Dataset/test/y_test.txt', header = FALSE)
feature.test <- read.table('./UCI HAR Dataset/test/X_test.txt', header = FALSE)
subject.test <- read.table('./UCI HAR Dataset/test/subject_test.txt', header = FALSE)
# Read activity labels
activity.label <- read.table('./UCI HAR Dataset/activity_labels.txt', header = FALSE)
# Read feature names
feature.names <- read.table('./UCI HAR Dataset/features.txt', header = FALSE)
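Because the archive is about 60MB, it may be worth skipping the download when the file is already on disk. The sketch below shows one way to do that; `fetch_once` and its `downloader` argument are hypothetical names introduced for this example, not part of the original script.

```r
# Hypothetical helper: download a file only if it is not already present.
# `downloader` is injectable so the logic can be exercised without network access.
fetch_once <- function(url, destfile, downloader = download.file) {
  if (file.exists(destfile)) {
    message("Using existing file: ", destfile)
    return(invisible(FALSE))   # FALSE = nothing was downloaded
  }
  # mode = "wb" keeps a binary archive intact on Windows
  downloader(url, destfile = destfile, mode = "wb")
  invisible(TRUE)              # TRUE = a download was performed
}

# Usage with the URL and file name from the code above (not run here):
# fetch_once(fileurl, 'projectdataset.zip')
# if (!dir.exists('UCI HAR Dataset')) unzip('projectdataset.zip')
```

Returning a logical flag lets the caller decide whether follow-up work, such as unzipping, is still needed.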
2. Merge Training and Test Data into One Data Set
# Assign variable names
names(activity.train) <- 'Activity'
names(feature.train) <- feature.names[,2]
names(subject.train) <- 'Subject'
names(activity.test) <- 'Activity'
names(feature.test) <- feature.names[,2]
names(subject.test) <- 'Subject'
names(activity.label) <- c('Activity', 'ActivityType')
# Merge all data frames into one set
train <- cbind(subject.train, activity.train, feature.train)
test <- cbind(subject.test, activity.test, feature.test)
data <- rbind(train, test)
print(paste("Observation: ", nrow(data),"Column: ", ncol(data)))
head(data[1:10])
[1] "Observation: 10299 Column: 563"
Subject | Activity | tBodyAcc-mean()-X | tBodyAcc-mean()-Y | tBodyAcc-mean()-Z | tBodyAcc-std()-X | tBodyAcc-std()-Y | tBodyAcc-std()-Z | tBodyAcc-mad()-X | tBodyAcc-mad()-Y |
---|---|---|---|---|---|---|---|---|---|
1 | 5 | 0.2885845 | -0.02029417 | -0.1329051 | -0.9952786 | -0.9831106 | -0.9135264 | -0.9951121 | -0.9831846 |
1 | 5 | 0.2784188 | -0.01641057 | -0.1235202 | -0.9982453 | -0.9753002 | -0.9603220 | -0.9988072 | -0.9749144 |
1 | 5 | 0.2796531 | -0.01946716 | -0.1134617 | -0.9953796 | -0.9671870 | -0.9789440 | -0.9965199 | -0.9636684 |
1 | 5 | 0.2791739 | -0.02620065 | -0.1232826 | -0.9960915 | -0.9834027 | -0.9906751 | -0.9970995 | -0.9827498 |
1 | 5 | 0.2766288 | -0.01656965 | -0.1153619 | -0.9981386 | -0.9808173 | -0.9904816 | -0.9983211 | -0.9796719 |
1 | 5 | 0.2771988 | -0.01009785 | -0.1051373 | -0.9973350 | -0.9904868 | -0.9954200 | -0.9976274 | -0.9902177 |
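To see why the merged table has the shape reported above, the same `cbind`/`rbind` pattern can be run on a toy example. All `toy.*` objects and their values are made up for illustration; only the pattern mirrors the code above.

```r
# Toy stand-ins for the real tables (values are invented for illustration)
toy.subject.train  <- data.frame(Subject = c(1, 1, 3))
toy.activity.train <- data.frame(Activity = c(5, 4, 1))
toy.feature.train  <- data.frame(f1 = c(0.1, 0.2, 0.3), f2 = c(-0.9, -0.8, -0.7))

toy.subject.test  <- data.frame(Subject = c(2, 2))
toy.activity.test <- data.frame(Activity = c(6, 2))
toy.feature.test  <- data.frame(f1 = c(0.4, 0.5), f2 = c(-0.6, -0.5))

# cbind() glues columns side by side, so the row counts must agree
toy.train <- cbind(toy.subject.train, toy.activity.train, toy.feature.train)
toy.test  <- cbind(toy.subject.test, toy.activity.test, toy.feature.test)

# rbind() stacks rows, so the column names must agree
stopifnot(identical(names(toy.train), names(toy.test)))
toy.data <- rbind(toy.train, toy.test)

# Rows add up (3 + 2); columns are Subject + Activity + features (1 + 1 + 2)
print(dim(toy.data))
```

In the real data the same logic gives one row per window across both partitions, and one column each for Subject and Activity plus one per feature.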
3. Extract the Mean and Standard Deviation Measurements
subset.feature <- feature.names$V2[grep("mean\\(\\)|std\\(\\)",feature.names$V2)]
subset.data <- c('Subject', 'Activity', as.character(subset.feature))
data <- subset(data, select = subset.data)
print(paste("Observation: ", nrow(data),"Column: ", ncol(data)))
head(data[1:10])
[1] "Observation: 10299 Column: 68"
Subject | Activity | tBodyAcc-mean()-X | tBodyAcc-mean()-Y | tBodyAcc-mean()-Z | tBodyAcc-std()-X | tBodyAcc-std()-Y | tBodyAcc-std()-Z | tGravityAcc-mean()-X | tGravityAcc-mean()-Y |
---|---|---|---|---|---|---|---|---|---|
1 | 5 | 0.2885845 | -0.02029417 | -0.1329051 | -0.9952786 | -0.9831106 | -0.9135264 | 0.9633961 | -0.1408397 |
1 | 5 | 0.2784188 | -0.01641057 | -0.1235202 | -0.9982453 | -0.9753002 | -0.9603220 | 0.9665611 | -0.1415513 |
1 | 5 | 0.2796531 | -0.01946716 | -0.1134617 | -0.9953796 | -0.9671870 | -0.9789440 | 0.9668781 | -0.1420098 |
1 | 5 | 0.2791739 | -0.02620065 | -0.1232826 | -0.9960915 | -0.9834027 | -0.9906751 | 0.9676152 | -0.1439765 |
1 | 5 | 0.2766288 | -0.01656965 | -0.1153619 | -0.9981386 | -0.9808173 | -0.9904816 | 0.9682244 | -0.1487502 |
1 | 5 | 0.2771988 | -0.01009785 | -0.1051373 | -0.9973350 | -0.9904868 | -0.9954200 | 0.9679482 | -0.1482100 |
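The escaped parentheses in the pattern matter: they restrict the match to names containing a literal `mean()` or `std()`. The short sketch below, run on a handful of sample feature names, shows what a looser pattern would pull in. The `toy.features` vector and the looser pattern are assumptions chosen for this illustration.

```r
# A few sample feature names, typed in by hand for illustration
toy.features <- c("tBodyAcc-mean()-X",
                  "tBodyAcc-std()-Y",
                  "fBodyAcc-meanFreq()-X",
                  "angle(tBodyAccMean,gravity)",
                  "tBodyAcc-mad()-X")

# Same pattern as above: a literal "mean()" or "std()"
strict <- grep("mean\\(\\)|std\\(\\)", toy.features, value = TRUE)
print(strict)

# A looser pattern also matches meanFreq() and the angle() feature
loose <- grep("[Mm]ean|std", toy.features, value = TRUE)
print(loose)
```

Whether the `meanFreq()` and `angle()` variables count as "mean measurements" is a judgment call; the strict pattern used in this project leaves them out.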
4. Change to Descriptive Activity Labels
# Replace each activity code (1 to 6) with its label from activity.label
for (x in 1:6) {
  data$Activity[as.character(data$Activity) == x] <- as.character(activity.label[x, 2])
}
print(paste("Observation: ", nrow(data),"Column: ", ncol(data)))
head(data[1:10])
[1] "Observation: 10299 Column: 68"
Subject | Activity | tBodyAcc-mean()-X | tBodyAcc-mean()-Y | tBodyAcc-mean()-Z | tBodyAcc-std()-X | tBodyAcc-std()-Y | tBodyAcc-std()-Z | tGravityAcc-mean()-X | tGravityAcc-mean()-Y |
---|---|---|---|---|---|---|---|---|---|
1 | STANDING | 0.2885845 | -0.02029417 | -0.1329051 | -0.9952786 | -0.9831106 | -0.9135264 | 0.9633961 | -0.1408397 |
1 | STANDING | 0.2784188 | -0.01641057 | -0.1235202 | -0.9982453 | -0.9753002 | -0.9603220 | 0.9665611 | -0.1415513 |
1 | STANDING | 0.2796531 | -0.01946716 | -0.1134617 | -0.9953796 | -0.9671870 | -0.9789440 | 0.9668781 | -0.1420098 |
1 | STANDING | 0.2791739 | -0.02620065 | -0.1232826 | -0.9960915 | -0.9834027 | -0.9906751 | 0.9676152 | -0.1439765 |
1 | STANDING | 0.2766288 | -0.01656965 | -0.1153619 | -0.9981386 | -0.9808173 | -0.9904816 | 0.9682244 | -0.1487502 |
1 | STANDING | 0.2771988 | -0.01009785 | -0.1051373 | -0.9973350 | -0.9904868 | -0.9954200 | 0.9679482 | -0.1482100 |
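The loop above can also be written as a single vectorised lookup with `match()`. The sketch below is an alternative, not the approach used in this project; `toy.label` is typed in by hand to mirror `activity.label`, and `toy.codes` is invented sample data.

```r
# Hand-typed lookup table mirroring activity.label (assumption for this sketch)
toy.label <- data.frame(
  Activity     = 1:6,
  ActivityType = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
                   "SITTING", "STANDING", "LAYING"),
  stringsAsFactors = FALSE
)

# Invented activity codes, as they would appear before relabelling
toy.codes <- c(5, 5, 1, 6, 3)

# match() returns the row of toy.label for each code; indexing picks the label
toy.activity <- toy.label$ActivityType[match(toy.codes, toy.label$Activity)]
print(toy.activity)
```

Like the loop, this yields a character vector, so a later `order()` on the activity column still sorts alphabetically.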
5. Change Column Names to Descriptive Labels
names(data) <- gsub('^t','Time', names(data))
names(data) <- gsub('^f', 'Frequency', names(data))
names(data) <- gsub('Acc', 'Accelerometer', names(data))
names(data) <- gsub('BodyBody', 'Body', names(data))
names(data) <- gsub('Gyro', 'Gyroscope', names(data))
names(data) <- gsub('Mag', 'Magnitude', names(data))
print(paste("Observation: ", nrow(data),"Column: ", ncol(data)))
head(data[1:10])
[1] "Observation: 10299 Column: 68"
Subject | Activity | TimeBodyAccelerometer-mean()-X | TimeBodyAccelerometer-mean()-Y | TimeBodyAccelerometer-mean()-Z | TimeBodyAccelerometer-std()-X | TimeBodyAccelerometer-std()-Y | TimeBodyAccelerometer-std()-Z | TimeGravityAccelerometer-mean()-X | TimeGravityAccelerometer-mean()-Y |
---|---|---|---|---|---|---|---|---|---|
1 | STANDING | 0.2885845 | -0.02029417 | -0.1329051 | -0.9952786 | -0.9831106 | -0.9135264 | 0.9633961 | -0.1408397 |
1 | STANDING | 0.2784188 | -0.01641057 | -0.1235202 | -0.9982453 | -0.9753002 | -0.9603220 | 0.9665611 | -0.1415513 |
1 | STANDING | 0.2796531 | -0.01946716 | -0.1134617 | -0.9953796 | -0.9671870 | -0.9789440 | 0.9668781 | -0.1420098 |
1 | STANDING | 0.2791739 | -0.02620065 | -0.1232826 | -0.9960915 | -0.9834027 | -0.9906751 | 0.9676152 | -0.1439765 |
1 | STANDING | 0.2766288 | -0.01656965 | -0.1153619 | -0.9981386 | -0.9808173 | -0.9904816 | 0.9682244 | -0.1487502 |
1 | STANDING | 0.2771988 | -0.01009785 | -0.1051373 | -0.9973350 | -0.9904868 | -0.9954200 | 0.9679482 | -0.1482100 |
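The effect of the renaming rules is easiest to see on a couple of names. The sketch below collects the same six substitutions in a named vector and applies them in the same order; `rename.rules`, `describe_names`, and `toy.columns` are hypothetical names introduced only for this example.

```r
# Replacement rules from the gsub() calls above, in the same order
rename.rules <- c("^t"       = "Time",
                  "^f"       = "Frequency",
                  "Acc"      = "Accelerometer",
                  "BodyBody" = "Body",
                  "Gyro"     = "Gyroscope",
                  "Mag"      = "Magnitude")

# Hypothetical helper: apply every rule in turn to a character vector
describe_names <- function(x, rules = rename.rules) {
  for (pattern in names(rules)) {
    x <- gsub(pattern, rules[[pattern]], x)
  }
  x
}

toy.columns <- c("Subject", "tBodyAcc-mean()-X", "fBodyBodyGyroMag-std()")
renamed <- describe_names(toy.columns)
print(renamed)
```

The `^` anchors keep the `t` and `f` rules from touching letters in the middle of a name, which is why `Subject` passes through unchanged.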
6. Create Independent Data Set with the Average of Each Variable for Each Activity and Each Subject
# Creates a second independent data set
data2 <- aggregate(.~ Subject + Activity, data, mean)
data2 <- data2[order(data2$Subject,data2$Activity), ]
print(paste("Observation: ", nrow(data2),"Column: ", ncol(data2)))
head(data2[1:10])
[1] "Observation: 180 Column: 68"
Row | Subject | Activity | TimeBodyAccelerometer-mean()-X | TimeBodyAccelerometer-mean()-Y | TimeBodyAccelerometer-mean()-Z | TimeBodyAccelerometer-std()-X | TimeBodyAccelerometer-std()-Y | TimeBodyAccelerometer-std()-Z | TimeGravityAccelerometer-mean()-X | TimeGravityAccelerometer-mean()-Y |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | LAYING | 0.2215982 | -0.040513953 | -0.1132036 | -0.92805647 | -0.836827406 | -0.82606140 | -0.2488818 | 0.7055498 |
31 | 1 | SITTING | 0.2612376 | -0.001308288 | -0.1045442 | -0.97722901 | -0.922618642 | -0.93958629 | 0.8315099 | 0.2044116 |
61 | 1 | STANDING | 0.2789176 | -0.016137590 | -0.1106018 | -0.99575990 | -0.973190056 | -0.97977588 | 0.9429520 | -0.2729838 |
91 | 1 | WALKING | 0.2773308 | -0.017383819 | -0.1111481 | -0.28374026 | 0.114461337 | -0.26002790 | 0.9352232 | -0.2821650 |
121 | 1 | WALKING_DOWNSTAIRS | 0.2891883 | -0.009918505 | -0.1075662 | 0.03003534 | -0.031935943 | -0.23043421 | 0.9318744 | -0.2666103 |
151 | 1 | WALKING_UPSTAIRS | 0.2554617 | -0.023953149 | -0.0973020 | -0.35470803 | -0.002320265 | -0.01947924 | 0.8933511 | -0.3621534 |
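The 180 rows reported above are the 30 subjects times the 6 activities, one averaged row per pair. A toy version of the same `aggregate()` call makes the grouping visible; the `toy` data frame and its values are invented for illustration.

```r
# Invented data: 2 subjects, 2 activities, 2 measurement columns
toy <- data.frame(
  Subject  = c(1, 1, 1, 2, 2, 2),
  Activity = c("SITTING", "SITTING", "WALKING", "SITTING", "WALKING", "WALKING"),
  x        = c(1, 3, 10, 2, 20, 40),
  y        = c(0, 2, 5, 4, 6, 8),
  stringsAsFactors = FALSE
)

# Average every remaining column within each Subject/Activity pair
toy.avg <- aggregate(. ~ Subject + Activity, toy, mean)

# Sort by subject, then activity, as in the code above
toy.avg <- toy.avg[order(toy.avg$Subject, toy.avg$Activity), ]
print(toy.avg)
```

The result has one row per Subject/Activity combination (here 2 x 2 = 4), and the row names still reflect the order in which `aggregate()` produced the groups, which explains the non-consecutive row labels in the table above.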
# Create txt file from this tidy data
write.table(data2, file = 'TidyData.txt', row.names = FALSE)
# E N D
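To read the tidy file back for later analysis, note that the column names contain `-` and `()`, which `read.table` rewrites by default. The sketch below writes and re-reads a small stand-in table in a temporary file; `toy.tidy`, `toy.file`, and `toy.back` are names invented for this example.

```r
# Small stand-in for the tidy data set (values invented for illustration)
toy.tidy <- data.frame(Subject = c(1, 2),
                       Activity = c("LAYING", "SITTING"),
                       stringsAsFactors = FALSE)
toy.tidy[["TimeBodyAccelerometer-mean()-X"]] <- c(0.25, 0.5)

# Write it the same way as above, to a temporary file
toy.file <- tempfile(fileext = ".txt")
write.table(toy.tidy, file = toy.file, row.names = FALSE)

# check.names = FALSE keeps "-" and "()" in the column names;
# the default would rewrite them into syntactically valid names
toy.back <- read.table(toy.file, header = TRUE, check.names = FALSE,
                       stringsAsFactors = FALSE)
print(names(toy.back))
```

The same call with `'TidyData.txt'` in place of `toy.file` should load the file produced by this project with its descriptive column names intact.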