Working with Text in R

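R is a programming language for statistical computing, widely used by statisticians and data miners for data analysis and for developing statistical software. R and its libraries implement a wide range of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and machine learning algorithms.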

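Any value written inside a matching pair of single or double quotes is treated as a string in R. A string is a sequence of characters and can be stored in a variable. R displays strings in double quotes even when they were created with single quotes.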

Text can be processed in R in the following ways:

Using built-in types
Using the tidyverse packages
Using regular expressions with external packages
Using grep()

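Method 1: Using built-in types

In this method, we process text with R's built-in character type and base functions. A string is created by assigning a quoted value to a variable:

variable_name <- "String"

Example: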

R

# R program to demonstrate
# creation of a string
a <- "hello world"
print(a)

Output:

[1] "hello world"

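The following rules apply when writing strings:

The quotes at the start and end of a string must both be double quotes or both be single quotes. They cannot be mixed.
Double quotes can appear inside a string delimited by single quotes.
Single quotes can appear inside a string delimited by double quotes.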
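
The quoting rules above can be sketched as follows. The variable names are made up for illustration, and the escaped form in s5 goes slightly beyond the rules listed here: it shows the usual backslash escape for a quote of the same kind as the delimiters.

```r
# Valid: matching quotes at both ends.
s1 <- "double quoted"
s2 <- 'single quoted'

# Valid: one kind of quote nested inside the other.
s3 <- 'She said "hello"'
s4 <- "It's working"

# A quote of the same kind as the delimiters must be escaped with a backslash.
s5 <- "She said \"hello\""

print(identical(s3, s5))   # TRUE: both hold the same characters
cat(s3, s4, sep = "\n")    # cat() shows the raw text without escapes
```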

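String manipulation

String manipulation means processing a given string and using or changing its contents. R offers several functions for this, described below.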

String concatenation - paste() function: This function combines strings in R. It accepts any number of arguments.

Syntax: paste(..., sep = " ", collapse = NULL)

Parameters:

...: One or more R objects to combine; each is converted to a character vector.
sep: The separator placed between the arguments. It is optional and defaults to a single space.
collapse: An optional string used to join the elements of the result into a single string. The default, NULL, leaves the result as a vector.

Example:

R

# concatenate two strings
str1 <- "hello"
str2 <- "how are you?"
# collapse = NULL (the default) leaves the result as a vector
print(paste(str1, str2, sep = " ", collapse = NULL))

Output:

[1] "hello how are you?"

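The difference between sep and collapse only shows up when the arguments are vectors with more than one element. A minimal sketch, with made-up variable names, that also uses paste0(), base R's shorthand for an empty separator:

```r
parts <- c("a", "b", "c")

# sep goes between the arguments, element by element; the result is a vector.
pairs <- paste(parts, 1:3, sep = "-")
print(pairs)       # "a-1" "b-2" "c-3"

# collapse then joins that vector into one string.
joined <- paste(parts, 1:3, sep = "-", collapse = "+")
print(joined)      # "a-1+b-2+c-3"

# paste0() is shorthand for paste(..., sep = "").
ids <- paste0("x", 1:3)
print(ids)         # "x1" "x2" "x3"
```
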
Formatting numbers and strings - format() function: This function formats numbers and strings in a specified style.

Syntax: format(x, digits, nsmall, scientific, width, justify = c("left", "right", "centre", "none"))

Parameters:

x is the vector input.
digits is the total number of significant digits displayed.
nsmall is the minimum number of digits to the right of the decimal point.
scientific is set to TRUE to display scientific notation.
width is the minimum width of the result; shorter values are padded with blanks (at the beginning for numbers).
justify sets the alignment of strings: left, right, centre, or none.

Example:

R

# formatting numbers and strings

# Total number of digits displayed.
# Last digit rounded off.
result <- format(69.145656789, digits = 9)
print(result)

# Display numbers in scientific notation.
result <- format(c(3, 132.84521),
                 scientific = TRUE)
print(result)

# The minimum number of digits
# to the right of the decimal point.
result <- format(96.47, nsmall = 5)
print(result)

# format() returns a string even for numeric input.
result <- format(8)
print(result)

# Numbers are padded with blanks
# at the beginning to reach the width.
result <- format(67.7, width = 6)
print(result)

# Left-justify strings.
result <- format("Hello", width = 8,
                 justify = "left")
print(result)

Output:

[1] "69.1456568"
[1] "3.000000e+00" "1.328452e+02"
[1] "96.47000"
[1] "8"
[1] "  67.7"
[1] "Hello   "
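
When format() receives a vector, it pads every element to a common width, which is handy for lining up printed columns. A small sketch of that behaviour, with made-up values:

```r
values <- c(1, 10, 100)

# Numbers are right-aligned to the width of the widest element.
padded <- format(values)
print(padded)             # "  1" " 10" "100"

# Strings are left-justified by default, so padding goes on the right.
labels <- format(c("a", "bbb"))
print(labels)             # "a  " "bbb"

# justify = "right" moves the padding to the left of a string.
right <- format("Hello", width = 8, justify = "right")
print(right)              # "   Hello"
```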

Counting characters in a string - nchar() function: This function counts the characters in a string, including spaces.

Syntax: nchar(x)

Parameter:

x is the vector input here.

Example:

R

# to count the number of characters
# in the string
a <- nchar("hello world")
print(a)

Output:

[1] 11
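
Because x is a vector, nchar() returns one count per element. A short sketch, using a made-up vector, that also filters strings by length:

```r
fruits <- c("apple", "kiwi", "")

# One count per element; the empty string has zero characters.
counts <- nchar(fruits)
print(counts)                   # 5 4 0

# Keep only the strings longer than four characters.
long_fruits <- fruits[nchar(fruits) > 4]
print(long_fruits)              # "apple"
```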

Changing the case of a string - toupper() and tolower() functions: These functions convert a string to upper case or lower case.

Syntax: toupper(x) and tolower(x)

Parameter:

x is the vector input

Example:

R

# Changing to upper case.
a <- toupper("hello world")
print(a)

# Changing to lower case.
b <- tolower("HELLO WORLD")
print(b)

Output:

[1] "HELLO WORLD"
[1] "hello world"

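A common use of these functions is comparing strings without regard to case. The sketch below uses made-up values and also shows casefold(), a base R wrapper around the same two functions:

```r
typed  <- "Hello World"
stored <- "hello world"

print(typed == stored)                      # FALSE: comparison is case-sensitive
same <- tolower(typed) == tolower(stored)
print(same)                                 # TRUE

# casefold() wraps both functions; upper = TRUE selects upper case.
shout <- casefold(typed, upper = TRUE)
print(shout)                                # "HELLO WORLD"
```
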
Extracting part of a string - substring() function: This function extracts part of a string.

Syntax: substring(x, first, last)

Parameters:

x is the character vector input.
first is the position of the first character to be extracted.
last is the position of the last character to be extracted.

Example:

R

# Extract characters from the 1st to the 3rd position.
part <- substring("Programming", 1, 3)
print(part)

Output:

[1] "Pro"

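A few variations on the same idea, as a sketch with made-up variable names: omitting last, passing vectors of positions, and using substr() on the left of an assignment to overwrite characters.

```r
word <- "Programming"

# Omitting `last` extracts through the end of the string.
tail_part <- substring(word, 4)
print(tail_part)                  # "gramming"

# first and last are vectorized: one substring per pair of positions.
windows <- substring(word, 1:3, 3:5)
print(windows)                    # "Pro" "rog" "ogr"

# substr() on the left of an assignment replaces characters in place.
substr(word, 1, 3) <- "PRO"
print(word)                       # "PROgramming"
```
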
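Method 2: Using the tidyverse packages

In this method, we use the tidyverse, a collection of packages that covers the data science workflow from data exploration to data visualization. One of its packages, stringr, is designed for working with strings and provides many functions that simplify data cleaning and data preparation.

We use the following text in the examples: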

R

string <- c("WelcometoGeeksforgeeks!")

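Example 1: Detecting a string

In this example, we use str_detect() to check whether a string contains a pattern.

Syntax: str_detect(string, pattern)

Parameters:

string is the input character vector.
pattern is the text or regular expression to look for.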

R

library(tidyverse)
  
str_detect(string, "geeks")

Output:

[1] TRUE
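
For comparison, base R's grepl() performs the same kind of check without loading any package. A minimal sketch; the titles vector is made up for illustration:

```r
titles <- c("WelcometoGeeksforgeeks!", "Hello world", "geeks rule")

# grepl() returns one TRUE/FALSE per element; fixed = TRUE treats the
# pattern as literal text rather than a regular expression.
has_geeks <- grepl("geeks", titles, fixed = TRUE)
print(has_geeks)             # TRUE FALSE TRUE

# Logical results can index the original vector directly.
print(titles[has_geeks])
```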

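Example 2: Locating a string

In this example, we use str_locate() to find the start and end positions of the first match.

Syntax: str_locate(string, pattern)

Parameters:

string is the input character vector.
pattern is the text or regular expression to look for.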

R

library(tidyverse)
  
str_locate(string, "geeks")

Output:

     start end
[1,]    18  22
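
The same positions can be recovered in base R, sketched here as an alternative rather than as what str_locate() does internally. regexpr() reports the first match and gregexpr() reports all of them:

```r
string <- "WelcometoGeeksforgeeks!"

# regexpr() reports the first match; its "match.length" attribute gives the length.
first <- regexpr("geeks", string, fixed = TRUE)
start <- as.integer(first)
end   <- start + attr(first, "match.length") - 1
print(c(start, end))         # 18 22

# gregexpr() reports every match; ignore.case = TRUE also finds "Geeks".
all_starts <- as.integer(gregexpr("geeks", string, ignore.case = TRUE)[[1]])
print(all_starts)            # 10 18
```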

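Example 3: Extracting a string

In this example, we use str_extract() to extract the first match of a pattern.

Syntax: str_extract(string, pattern)

Parameters:

string is the input character vector.
pattern is the text or regular expression to look for.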

R

library(tidyverse)
  
str_extract(string, "for")

Output:

[1] "for"

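Example 4: Replacing a string

In this example, we use str_replace() to replace the first match of a pattern.

Syntax: str_replace(string, pattern, replacement)

Parameters:

string is the input character vector.
pattern is the text or regular expression to look for.
replacement is the text that replaces the match.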

R

library(tidyverse)
  
str_replace(string, "toGeeksforgeeks", " geeks")

Output:

[1] "Welcome geeks!"
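
Note that str_replace() changes only the first match; stringr provides str_replace_all() for every match. Base R draws the same line between sub() and gsub(), sketched below with a made-up sentence:

```r
text <- "geeks for geeks"

# sub() replaces only the first match, like str_replace().
first_only <- sub("geeks", "GEEKS", text)
print(first_only)            # "GEEKS for geeks"

# gsub() replaces every match, like str_replace_all().
every <- gsub("geeks", "GEEKS", text)
print(every)                 # "GEEKS for GEEKS"
```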

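Method 3: Using regular expressions with external packages

In this method, we use regular expressions together with functions from an external package such as stringr.

Example 1: Matching characters with a dot

A dot (.) in a regular expression matches any single character. The pattern "G..k" therefore matches a "G", followed by any two characters, followed by a "k".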

R

library(stringr)

string <- c("WelcometoGeeksforgeeks!")

str_extract_all(string, "G..k")

Output:

[[1]]
[1] "Geek"
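
As a package-free sketch of the same match, base R's gregexpr() and regmatches() can be combined. The second half illustrates a common pitfall: an unescaped dot also matches a literal period, so it must be escaped to match only that character.

```r
string <- "WelcometoGeeksforgeeks!"

# gregexpr() finds every match; regmatches() pulls out the matched text.
upper_hits <- regmatches(string, gregexpr("G..k", string))[[1]]
print(upper_hits)                          # "Geek"

# An unescaped dot matches any character, including a period.
loose <- grepl("a.c", c("abc", "a.c"))
print(loose)                               # TRUE TRUE

# An escaped dot (\\.) matches only a literal period.
strict <- grepl("a\\.c", c("abc", "a.c"))
print(strict)                              # FALSE TRUE
```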

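Example 2: Matching non-digit characters with \\D

\\D matches any single character that is not a digit. In the pattern "W\\D\\Dcome", each \\D matches one non-digit character.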

R

str_extract_all(string, "W\\D\\Dcome")

Output:

[[1]]
[1] "Welcome"
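
To make the meaning of \\D concrete, here is a small base R sketch with made-up inputs. It flags strings that contain a non-digit and strips non-digits from a string:

```r
inputs <- c("123", "12a", "abc")

# \\D finds strings containing at least one non-digit character.
has_non_digit <- grepl("\\D", inputs)
print(has_non_digit)         # FALSE TRUE TRUE

# Removing every non-digit keeps only the digits.
digits_only <- gsub("\\D", "", "tel: 555-0100")
print(digits_only)           # "5550100"
```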

Method 4: Using grep()

The grep() function returns the indices of the elements of a vector in which a pattern is found. If the pattern matches several elements, it returns a vector of all their indices. This is useful because it tells us not only whether the pattern occurs but also where in the vector it occurs.

Syntax: grep(pattern, x, ignore.case = FALSE)

Parameters:

pattern: A regular expression pattern.
x: The character vector to be searched.
ignore.case: Whether to ignore case in the search. It is optional and defaults to FALSE.

Example 1: Finding all elements that contain a specific word

R

words <- c("Hello", "hello", "hi", "hey")
grep("hey", words)

Output:

[1] 4

Example 2: Finding all elements that contain a pattern, ignoring case

R

words <- c("Hello", "hello", "hi", "hey")
grep("he", words, ignore.case = TRUE)

Output:

[1] 1 2 4
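
Two related options are worth knowing, sketched here with the same vector: value = TRUE makes grep() return the matching elements themselves, and grepl() returns a logical vector instead of indices.

```r
words <- c("Hello", "hello", "hi", "hey")

# value = TRUE returns the matching elements instead of their indices.
matches <- grep("he", words, ignore.case = TRUE, value = TRUE)
print(matches)               # "Hello" "hello" "hey"

# grepl() returns a logical vector with one entry per element.
flags <- grepl("he", words)
print(flags)                 # FALSE TRUE FALSE TRUE

# With no match, grep() returns an empty integer vector.
none <- grep("xyz", words)
print(length(none))          # 0
```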