在 R 编程中生成词云
词云是一种数据可视化技术,用于表示文本数据,其中每个词的大小表示其频率或重要性。可以使用词云突出显示重要的文本数据点。词云广泛用于分析来自社交网络网站的数据。
为什么是词云?
应该使用词云来呈现文本数据的原因是:
- 词云增加了简单性和清晰度。最常用的关键词在词云中更突出
- 词云是一种有效的交流工具。它们易于理解、易于分享且具有影响力。
- 词云比表格数据更具视觉吸引力。
R中的实现
以下是在 R 编程中创建词云的步骤。
第 1 步:创建文本文件
将文本复制并粘贴到纯文本文件(例如:file.txt)中并保存文件。
第 2 步:安装和加载所需的软件包
Python3
# install the required packages
install.packages("tm") # for text mining
install.packages("SnowballC") # for text stemming
install.packages("wordcloud") # word-cloud generator
install.packages("RColorBrewer") # color palettes
# load the packages
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
Python3
text = readLines(file.choose())
Python3
# VectorSource() function
# creates a corpus of
# character vectors
docs = Corpus(VectorSource(text))
Python3
toSpace = content_transformer
(function (x, pattern)
gsub(pattern, " ", x))
docs1 = tm_map(docs, toSpace, "/")
docs1 = tm_map(docs, toSpace, "@")
docs1 = tm_map(docs, toSpace, "#")
Python3
# Convert the text to lower case
docs1 = tm_map(docs1,
content_transformer(tolower))
# Remove numbers
docs1 = tm_map(docs1, removeNumbers)
# Remove white spaces
docs1 = tm_map(docs1, stripWhitespace)
Python3
dtm = TermDocumentMatrix(docs)
m = as.matrix(dtm)
v = sort(rowSums(m), decreasing = TRUE)
d = data.frame(word = names(v), freq = v)
head(d, 10)
Python3
wordcloud(words = d$word,
freq = d$freq,
min.freq = 1,
max.words = 200,
random.order = FALSE,
rot.per = 0.35,
colors = brewer.pal(8, "Dark2"))
Python3
# R program to illustrate
# Generating word cloud
# Install the required packages
install.packages("tm") # for text mining
install.packages("SnowballC") # for text stemming
install.packages("wordcloud") # word-cloud generator
install.packages("RColorBrewer") # color palettes
# Load the packages
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
# To choose the text file
text = readLines(file.choose())
# VectorSource() function
# creates a corpus of
# character vectors
docs = Corpus(VectorSource(text))
# Text transformation
toSpace = content_transformer(
function (x, pattern)
gsub(pattern, " ", x))
docs1 = tm_map(docs, toSpace, "/")
docs1 = tm_map(docs, toSpace, "@")
docs1 = tm_map(docs, toSpace, "#")
strwrap(docs1)
# Cleaning the Text
docs1 = tm_map(docs1, content_transformer(tolower))
docs1 = tm_map(docs1, removeNumbers)
docs1 = tm_map(docs1, stripWhitespace)
# Build a term-document matrix
dtm = TermDocumentMatrix(docs)
m = as.matrix(dtm)
v = sort(rowSums(m),
decreasing = TRUE)
d = data.frame(word = names(v),
freq = v)
head(d, 10)
# Generate the Word cloud
wordcloud(words = d$word,
freq = d$freq,
min.freq = 1,
max.words = 200,
random.order = FALSE,
rot.per = 0.35,
colors = brewer.pal(8, "Dark2"))
第三步:文本挖掘
- 加载文本:
使用 text mining(tm) 包中的Corpus()函数加载文本。语料库是文档的列表。- 首先导入在步骤 1 中创建的文本文件:
要导入本地保存在计算机中的文件,请键入以下 R 代码。您将被要求以交互方式选择文本文件。Python3
text = readLines(file.choose())
- 将数据加载为语料库:
Python3
# VectorSource() function # creates a corpus of # character vectors docs = Corpus(VectorSource(text))
- 文字转换:
使用tm_map()函数执行转换以替换文本中的特殊字符,例如“@”、“#”、“/”。Python3
toSpace = content_transformer (function (x, pattern) gsub(pattern, " ", x)) docs1 = tm_map(docs, toSpace, "/") docs1 = tm_map(docs, toSpace, "@") docs1 = tm_map(docs, toSpace, "#")
- 首先导入在步骤 1 中创建的文本文件:
- 清理文本:
tm_map()函数用于删除不必要的空白,将文本转换为小写,删除常见的停用词。可以使用removeNumbers 删除数字。Python3
# Convert the text to lower case docs1 = tm_map(docs1, content_transformer(tolower)) # Remove numbers docs1 = tm_map(docs1, removeNumbers) # Remove white spaces docs1 = tm_map(docs1, stripWhitespace)
第 4 步:构建术语文档矩阵
文档矩阵是包含单词频率的表格。列名是单词,行名是文档。文本挖掘包中的函数TermDocumentMatrix()可以按如下方式使用。
Python3
dtm = TermDocumentMatrix(docs)
m = as.matrix(dtm)
v = sort(rowSums(m), decreasing = TRUE)
d = data.frame(word = names(v), freq = v)
head(d, 10)
第 5 步:生成词云
词的重要性可以用如下的词云来说明。
Python3
wordcloud(words = d$word,
freq = d$freq,
min.freq = 1,
max.words = 200,
random.order = FALSE,
rot.per = 0.35,
colors = brewer.pal(8, "Dark2"))
下面给出了 R 中词云的完整代码。
Python3
# R program to illustrate
# Generating word cloud
# Install the required packages
install.packages("tm") # for text mining
install.packages("SnowballC") # for text stemming
install.packages("wordcloud") # word-cloud generator
install.packages("RColorBrewer") # color palettes
# Load the packages
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
# To choose the text file
text = readLines(file.choose())
# VectorSource() function
# creates a corpus of
# character vectors
docs = Corpus(VectorSource(text))
# Text transformation
toSpace = content_transformer(
function (x, pattern)
gsub(pattern, " ", x))
docs1 = tm_map(docs, toSpace, "/")
docs1 = tm_map(docs, toSpace, "@")
docs1 = tm_map(docs, toSpace, "#")
strwrap(docs1)
# Cleaning the Text
docs1 = tm_map(docs1, content_transformer(tolower))
docs1 = tm_map(docs1, removeNumbers)
docs1 = tm_map(docs1, stripWhitespace)
# Build a term-document matrix
dtm = TermDocumentMatrix(docs)
m = as.matrix(dtm)
v = sort(rowSums(m),
decreasing = TRUE)
d = data.frame(word = names(v),
freq = v)
head(d, 10)
# Generate the Word cloud
wordcloud(words = d$word,
freq = d$freq,
min.freq = 1,
max.words = 200,
random.order = FALSE,
rot.per = 0.35,
colors = brewer.pal(8, "Dark2"))
输出:
词云的优势
- 分析客户和员工的反馈。
- 确定要定位的新 SEO 关键字。
- 词云是杀手级的可视化工具。他们以简单明了的格式呈现文本数据
- 词云是很好的交流工具。对于希望传达基本见解的任何人来说,它们都非常方便
词云的缺点
- 词云并不适合所有情况。
- 数据应针对上下文进行优化。
- 词云通常无法提供改进和发展业务所需的可行见解。