How to remove repeated characters from the words of a given Pandas DataFrame using regular expressions?

Last modified: 2022-05-13 01:54:37.868000    Author: Mango

Prerequisite: Regular expressions in Python

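In this article, we will see how to use a regular expression to remove consecutively repeated characters from the words in a given column of a Pandas DataFrame.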

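Here we are looking for characters that repeat consecutively, so we build a pattern from the regular expression (\w)\1+. In this pattern, (\w) captures a single word character (a letter, digit, or underscore), and \1+ is a backreference that matches one or more further occurrences of that same captured character.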
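To make the pattern concrete, here is a minimal sketch that applies it to a few standalone strings. The sample strings and variable names are illustrative assumptions and are not part of the article's DataFrame.

```python
import re

# (\w) captures one word character; \1+ requires one or more
# further copies of that same character immediately after it
pattern = re.compile(r'(\w)\1+')

match = pattern.search('twiiter')
repeated_run = match.group(0)     # the whole repeated run: 'ii'
captured_char = match.group(1)    # the captured character: 'i'

# replacing each run with the captured character collapses it
collapsed = pattern.sub(r'\1', 'heyyy budddy')      # 'hey budy'

# \w also matches digits and underscores, not only letters
collapsed_other = pattern.sub(r'\1', 'a__b 1007')   # 'a_b 107'
```

Note that the pattern cannot tell a typo from a legitimate double letter, so a word such as "buddy" is collapsed as well.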

We pass our pattern to the re.sub() function of the re library.

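The 'sub' in the function name is short for substitute. re.sub(pattern, repl, string, count=0) searches the given string (the third argument) for the regular expression pattern and replaces each match with repl (the second argument), which can be either a string or a function that receives the match object. The optional count argument limits the number of replacements; the default value 0 replaces every match.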
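The short sketch below shows re.sub() with a function as repl and the effect of count. The helper name keep_first_char and the sample text are assumptions chosen for illustration.

```python
import re

def keep_first_char(match):
    # match.group(1) is the single character captured by (\w)
    return match.group(1)

text = 'sooo goood'

# default count=0: every repeated run is replaced
all_collapsed = re.sub(r'(\w)\1+', keep_first_char, text)           # 'so god'

# count=1: only the first repeated run is replaced
first_only = re.sub(r'(\w)\1+', keep_first_char, text, count=1)     # 'so goood'
```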

Now, let's create a DataFrame:

Python3
# importing required libraries
import pandas as pd
import re
 
# creating Dataframe with column
# as name and common_comments
df = pd.DataFrame(
  {
    'name' : ['Akash', 'Ayush', 'Diksha',
              'Priyanka', 'Radhika'],
     
    'common_comments' : ['hey buddy meet me today ',
                         'sorry bro i cant meet',
                         'hey akash i love geeksforgeeks',
                         'twiiter is the best way to comment',
                         'geeksforgeeks is good for learners']
    },
   
    columns = ['name', 'common_comments']
)
# printing Dataframe
df



Output:

Now, remove the consecutively repeated characters from the words in the common_comments column of the DataFrame.

Python3

# replacement callback for re.sub(): receives the match
# object for a run of repeated characters and returns
# only the first character of that run
def conti_rep_char(match):
    return match.group(1)

# collapse every run of consecutively repeated
# characters in sent_text using the callback rep
def check_unique_char(rep, sent_text):

    # (\w)\1+ matches a word character followed
    # by one or more copies of itself
    convert = re.sub(r'(\w)\1+',
                     rep,
                     sent_text)

    # returning the converted text
    return convert

df['modified_common_comments'] = df['common_comments'].apply(
    lambda x: check_unique_char(conti_rep_char, x))
# show Dataframe
df


Output:
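As an alternative to calling re.sub() through apply(), the same substitution can probably be written with the pandas string accessor. The sketch below assumes a pandas version in which Series.str.replace() accepts regex=True and a callable replacement; the Series name comments is hypothetical and holds three of the article's sample sentences.

```python
import pandas as pd

# three of the sample comments used in the article
comments = pd.Series(['hey buddy meet me today ',
                      'sorry bro i cant meet',
                      'twiiter is the best way to comment'])

# replace each run of a repeated word character
# with a single copy of that character
collapsed = comments.str.replace(r'(\w)\1+',
                                 lambda m: m.group(1),
                                 regex=True)

print(collapsed.tolist())
# ['hey budy met me today ', 'sory bro i cant met',
#  'twiter is the best way to coment']
```

As the result shows, correctly spelled words with double letters such as "buddy" and "meet" are shortened too, so this technique suits text where repeated letters are mostly noise.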