Principal Component Analysis with R Programming

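Principal Component Analysis (PCA) re-expresses a dataset in terms of new variables called principal components. Each principal component is a linear combination of the original predictors in the dataset, obtained through an orthogonal transformation. PCA is a useful technique for exploratory data analysis (EDA) because it lets you better visualize the variation present in a dataset with many variables.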
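To make the phrase "linear combination" concrete, here is a minimal sketch that rebuilds the component scores by hand. It assumes the built-in mtcars data and the base R function prcomp(); the names X, pca, and scores_manual are illustrative only.

```r
# Standardize the variables (mean 0, standard deviation 1)
X <- scale(mtcars, center = TRUE, scale = TRUE)

# Fit PCA on the same data with the same centering and scaling
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Each column of scores is a linear combination of the standardized
# variables, with the weights (loadings) taken from pca$rotation
scores_manual <- X %*% pca$rotation

# The hand-computed scores match the ones prcomp() returns
all.equal(unname(scores_manual), unname(pca$x))

# The loading vectors are orthonormal, so this is the identity matrix
round(crossprod(pca$rotation), 10)[1:3, 1:3]
```

The columns of pca$rotation are the weights, and orthonormality of those columns is what makes the transformation orthogonal.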

R – Principal Component Analysis

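The first principal component (PC1) captures the largest share of the variance in the dataset; it points in the direction of greatest variability. The second principal component (PC2) captures as much of the remaining variance as possible while being uncorrelated with PC1, so the correlation between PC1 and PC2 is zero. All subsequent principal components follow the same idea: each captures as much of the remaining variance as it can without being correlated with any earlier component.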
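These two properties can be checked numerically. The sketch below assumes the built-in mtcars data and prcomp(); the names score_var and score_cor are illustrative.

```r
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Variance of each column of scores: PC1 has the largest, then PC2, and so on
score_var <- apply(pca$x, 2, var)
all(diff(score_var) <= 1e-12)

# Correlations between the scores: the off-diagonal entries are
# zero up to floating-point error
score_cor <- cor(pca$x)
max(abs(score_cor[upper.tri(score_cor)]))
```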

Dataset

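The mtcars dataset (Motor Trend Car Road Tests) contains fuel consumption and 10 aspects of automobile design and performance for 32 cars, for a total of 11 variables. It ships with R in the built-in datasets package, so no additional package is needed to use it.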

R

# mtcars ships with base R (datasets package),
# so no package needs to be installed
data(mtcars)

# Inspect the structure of the dataset
str(mtcars)

Output: str() reports a data frame with 32 observations of 11 numeric variables.

Principal Component Analysis in R on the mtcars Dataset

We perform principal component analysis on mtcars, which contains 32 car models and 11 variables.

R

# Load the data
data(mtcars)

# Apply PCA with prcomp()
# Center and scale the variables first: PCA is driven by variance,
# so variables on larger scales would otherwise dominate
my_pca <- prcomp(mtcars, center = TRUE,
                 scale. = TRUE, retx = TRUE)
names(my_pca)

# Summary
summary(my_pca)
my_pca

# View the principal component loadings
# my_pca$rotation[1:5, 1:4]
my_pca$rotation

# View the principal component scores
dim(my_pca$x)
my_pca$x

# Biplot of the first two principal components
# scale = 0 ensures that the arrows
# are scaled to represent the loadings
biplot(my_pca, main = "Biplot", scale = 0)

# Standard deviation of each principal component
my_pca$sdev

# Variance of each principal component
my_pca.var <- my_pca$sdev ^ 2
my_pca.var

# Proportion of variance for a scree plot
propve <- my_pca.var / sum(my_pca.var)
propve

# Plot the proportion of variance explained by each principal component
plot(propve, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b",
     main = "Scree Plot")

# Plot the cumulative proportion of variance explained
plot(cumsum(propve),
     xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

# Find the smallest number of principal components
# that together explain at least 90% of the variance
which(cumsum(propve) >= 0.9)[1]

# Predict disp using the first 4 principal components
# Build a training set from the response and the component scores
# (disp was also an input to the PCA above, so this is illustrative only)
train.data <- data.frame(disp = mtcars$disp, my_pca$x[, 1:4])

# Fit a decision tree
## Install (only needed once) and load the packages
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)

# method = "anova" fits a regression tree for a continuous response
rpart.model <- rpart(disp ~ .,
                     data = train.data, method = "anova")

rpart.plot(rpart.model)

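Output:

Biplot

The resulting principal components are plotted as a biplot. Setting scale = 0 ensures that the arrows are scaled to represent the loadings.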
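Variance Explained by Each Principal Component

The scree plot shows the proportion of variance explained by each principal component. The first two components account for most of the variance, and the proportion drops off sharply after them.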
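The variances behind the scree plot can be cross-checked against an eigendecomposition. The sketch below assumes the built-in mtcars data and standardized variables, in which case the component variances are the eigenvalues of the correlation matrix; the names eig and prop_var are illustrative.

```r
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Eigenvalues of the correlation matrix of the original variables
eig <- eigen(cor(mtcars))$values

# They coincide with the component variances (sdev squared)
all.equal(eig, pca$sdev^2)

# With standardized variables, the total variance equals
# the number of variables
sum(pca$sdev^2)
ncol(mtcars)

# Proportion of variance explained by each component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(prop_var, 3)
```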
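Cumulative Proportion of Variance

This plot shows the cumulative proportion of variance explained as principal components are added. The curve rises steeply over the first two components and flattens beyond them.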
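Decision Tree Model

A decision tree model is built with the ANOVA method to predict disp from the first four principal components. The fitted tree is then plotted with rpart.plot().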
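In the code above, disp is both the response and one of the variables fed into the PCA. As a hypothetical variation, not the original tutorial's setup, the sketch below leaves disp out of the PCA inputs before fitting the tree. It assumes the rpart package is available (it is a recommended package included in standard R distributions); the names predictors, pca_x, train, and fit are illustrative.

```r
library(rpart)

# Hypothetical variation: exclude the response (disp) from the PCA inputs
predictors <- mtcars[, setdiff(names(mtcars), "disp")]
pca_x <- prcomp(predictors, center = TRUE, scale. = TRUE)

# Training set: the response plus the first four component scores
train <- data.frame(disp = mtcars$disp, pca_x$x[, 1:4])

# Regression tree (method = "anova" is for a continuous response)
fit <- rpart(disp ~ ., data = train, method = "anova")

# In-sample root mean squared error; optimistic, since no data is held out
pred <- predict(fit, newdata = train)
rmse <- sqrt(mean((train$disp - pred)^2))
rmse
```

With only 32 observations, a held-out set or cross-validation would give a more honest error estimate than this in-sample figure.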