Python Pandas - Working with Text Data

Last modified: 2020-11-06 05:42:36    Author: Mango


In this chapter, we will discuss string operations using a basic Series/Index. In subsequent chapters, we will learn how to apply these string functions to a DataFrame.

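Pandas provides a set of string functions that make it easy to operate on string data. Most importantly, these functions ignore (or exclude) missing/NaN values: a missing entry is skipped rather than raising an error, and in most cases it stays missing in the result.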
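As a small illustration of how missing values pass through, the sketch below applies two string functions to a Series that contains NaN. The sample values are invented for this example.

```python
import numpy as np
import pandas as pd

# Sample data invented for this illustration
s = pd.Series(['Tom', np.nan, 'John'])

upper = s.str.upper()    # the NaN entry is skipped and stays missing
lengths = s.str.len()    # the length of the missing entry is NaN

print(upper)
print(lengths)
```

No explicit check for NaN is needed before calling the string functions.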

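Almost all of these methods mirror Python's built-in string methods (see https://docs.python.org/3/library/stdtypes.html#string-methods). They are reached through the .str accessor of a Series or Index, which applies the operation to each string element.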
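A minimal sketch of the accessor, using made-up values: the string methods are not defined on the Series itself but under .str, and the same accessor is available on an Index.

```python
import pandas as pd

# Values invented for this illustration
s = pd.Series(['Tom', 'John'])
idx = pd.Index(['Alpha', 'Beta'])

# A Series has no lower() of its own; the method lives under .str
print(hasattr(s, 'lower'))

lower_s = s.str.lower()       # returns a Series
lower_idx = idx.str.lower()   # returns an Index

print(lower_s.tolist())
print(list(lower_idx))
```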

Let us now see how each operation performs.

Sr.No.  Function             Description
1       lower()              Converts strings in the Series/Index to lower case.
2       upper()              Converts strings in the Series/Index to upper case.
3       len()                Computes the length of each string.
4       strip()              Strips whitespace (including newlines) from both sides of each string in the Series/Index.
5       split(' ')           Splits each string on the given pattern.
6       cat(sep=' ')         Concatenates the Series/Index elements with the given separator.
7       get_dummies()        Returns a DataFrame with one-hot encoded values.
8       contains(pattern)    Returns True for each element that contains the pattern, else False.
9       replace(a,b)         Replaces the value a with the value b.
10      repeat(value)        Repeats each element the specified number of times.
11      count(pattern)       Returns the number of occurrences of the pattern in each element.
12      startswith(pattern)  Returns True if the element in the Series/Index starts with the pattern.
13      endswith(pattern)    Returns True if the element in the Series/Index ends with the pattern.
14      find(pattern)        Returns the position of the first occurrence of the pattern, or -1 if it is not found.
15      findall(pattern)     Returns a list of all occurrences of the pattern.
16      swapcase()           Swaps the case (lower/upper).
17      islower()            Checks whether all characters in each string in the Series/Index are lower case. Returns Boolean.
18      isupper()            Checks whether all characters in each string in the Series/Index are upper case. Returns Boolean.
19      isnumeric()          Checks whether all characters in each string in the Series/Index are numeric. Returns Boolean.

Let us now create a Series and see how all the above functions work.

import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])

print(s)

Its output is as follows:

0             Tom
1    William Rick
2            John
3         Alber@t
4             NaN
5            1234
6      SteveSmith
dtype: object

lower()

import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])

print(s.str.lower())

Its output is as follows:

0             tom
1    william rick
2            john
3         alber@t
4             NaN
5            1234
6      stevesmith
dtype: object

upper()

import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])

print(s.str.upper())

Its output is as follows:

0             TOM
1    WILLIAM RICK
2            JOHN
3         ALBER@T
4             NaN
5            1234
6      STEVESMITH
dtype: object

len()

import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])
print(s.str.len())

Its output is as follows:

0     3.0
1    12.0
2     4.0
3     7.0
4     NaN
5     4.0
6    10.0
dtype: float64

strip()

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s)
print("After Stripping:")
print(s.str.strip())

Its output is as follows:

0            Tom
1   William Rick
2           John
3        Alber@t
dtype: object

After Stripping:
0            Tom
1   William Rick
2           John
3        Alber@t
dtype: object
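Because leading and trailing whitespace is hard to see in printed output, comparing string lengths before and after stripping makes the effect visible. The sketch below reuses the Series from this example; lstrip() and rstrip() are the one-sided variants.

```python
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

before = s.str.len()               # lengths including the extra spaces
after = s.str.strip().str.len()    # lengths once both sides are stripped

left_only = s.str.lstrip()    # removes leading whitespace only
right_only = s.str.rstrip()   # removes trailing whitespace only

print(before.tolist())   # [4, 13, 4, 7]
print(after.tolist())    # [3, 12, 4, 7]
```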

split(pattern)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s)
print("Split Pattern:")
print(s.str.split(' '))

Its output is as follows:

0            Tom
1   William Rick
2           John
3        Alber@t
dtype: object

Split Pattern:
0   [Tom, ]
1   [, William, Rick]
2   [John]
3   [Alber@t]
dtype: object
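Splitting on a single space leaves empty strings wherever a string starts or ends with a space. One way to avoid that, sketched below on the same Series, is to strip first and then split. The expand=True option, which spreads the pieces across DataFrame columns, is shown as an optional extra.

```python
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

# Strip first so that no empty pieces are produced
parts = s.str.strip().str.split(' ')

# expand=True returns a DataFrame with one column per piece;
# rows with fewer pieces are padded with missing values
table = s.str.strip().str.split(' ', expand=True)

print(parts.tolist())
print(table)
```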

cat(sep=pattern)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print(s.str.cat(sep='_'))

Its output is as follows:

Tom _ William Rick_John_Alber@t
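The following sketch, with sample values invented for illustration, shows how cat() treats missing values: they are left out of the joined string unless na_rep supplies a placeholder. Passing a second list-like concatenates element by element instead.

```python
import numpy as np
import pandas as pd

# Sample data invented for this illustration
s = pd.Series(['Tom', np.nan, 'John'])

joined = s.str.cat(sep='_')                   # NaN is skipped
joined_rep = s.str.cat(sep='_', na_rep='?')   # NaN is shown as '?'

# Element-wise concatenation with another list-like of the same length
pairwise = s.str.cat(['A', 'B', 'C'], sep='-')

print(joined)       # Tom_John
print(joined_rep)   # Tom_?_John
print(pairwise)
```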

get_dummies()

import pandas as pd
import numpy as np

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print(s.str.get_dummies())

Its output is as follows:

    William Rick  Alber@t  John  Tom
0              0        0     0     1
1              1        0     0     0
2              0        0     1     0
3              0        1     0     0
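get_dummies() can also split each string on a separator before encoding, which is useful when one cell holds several labels. The sketch below uses made-up labels and the default separator '|'.

```python
import pandas as pd

# Labels invented for this illustration; '|' is the default separator
tags = pd.Series(['a|b', 'a', 'a|c'])

dummies = tags.str.get_dummies(sep='|')

print(dummies)
```

Each distinct label becomes a column, and a row has 1 in every column whose label appears in that row's string.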

contains(pattern)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print(s.str.contains(' '))

Its output is as follows:

0   True
1   True
2   False
3   False
dtype: bool
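A common use of contains() is filtering rows. The sketch below assumes a Series with a missing value (invented for illustration) and passes na=False so that the missing entry counts as "no match", and regex=False so that the pattern is treated as plain text.

```python
import numpy as np
import pandas as pd

# Sample data invented for this illustration
s = pd.Series(['Tom ', ' William Rick', np.nan, 'Alber@t'])

# na=False makes the missing entry count as False instead of NaN
mask = s.str.contains(' ', regex=False, na=False)

# The Boolean mask can be used directly to select rows
filtered = s[mask]

print(mask.tolist())       # [True, True, False, False]
print(filtered.tolist())   # ['Tom ', ' William Rick']
```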

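replace(a,b)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s)
print("After replacing @ with $:")
# regex=False treats '@' as plain text rather than a regular expression
print(s.str.replace('@', '$', regex=False))

Its output is as follows: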

0   Tom
1   William Rick
2   John
3   Alber@t
dtype: object

After replacing @ with $:
0   Tom
1   William Rick
2   John
3   Alber$t
dtype: object
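replace() can treat its first argument either as plain text or as a regular expression, controlled by the regex parameter. Since the default has differed between pandas versions, the sketch below (with invented sample values) sets it explicitly in both calls.

```python
import pandas as pd

# Sample data invented for this illustration
s = pd.Series(['Alber@t', '1234'])

# Plain-text replacement of a single character
literal = s.str.replace('@', '$', regex=False)

# Regular-expression replacement: any run of digits becomes '#'
digits = s.str.replace(r'\d+', '#', regex=True)

print(literal.tolist())   # ['Alber$t', '1234']
print(digits.tolist())    # ['Alber@t', '#']
```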

repeat(value)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print(s.str.repeat(2))

Its output is as follows:

0                      Tom Tom 
1     William Rick William Rick
2                      JohnJohn
3                Alber@tAlber@t
dtype: object

count(pattern)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print("The number of 'm's in each string:")
print(s.str.count('m'))

Its output is as follows:

The number of 'm's in each string:
0    1
1    1
2    0
3    0
dtype: int64

startswith(pattern)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print("Strings that start with 'T':")
print(s.str.startswith('T'))

Its output is as follows:

Strings that start with 'T':
0  True
1  False
2  False
3  False
dtype: bool

endswith(pattern)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print("Strings that end with 't':")
print(s.str.endswith('t'))

Its output is as follows:

Strings that end with 't':
0  False
1  False
2  False
3  True
dtype: bool

find(pattern)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print(s.str.find('e'))

Its output is as follows:

0  -1
1  -1
2  -1
3   3
dtype: int64

"-1" indicates that there is no such pattern in the element.
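Since -1 is a sentinel rather than a real position, it is usually converted to a Boolean before further use. A minimal sketch on the same Series:

```python
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

pos = s.str.find('e')   # position of the first 'e', or -1
found = pos >= 0        # True only where the pattern occurs

print(pos.tolist())        # [-1, -1, -1, 3]
print(s[found].tolist())   # ['Alber@t']
```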

findall(pattern)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print(s.str.findall('e'))

Its output is as follows:

0 []
1 []
2 []
3 [e]
dtype: object

An empty list ([]) indicates that there is no such pattern in the element.

swapcase()

import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
print(s.str.swapcase())

Its output is as follows:

0  tOM
1  wILLIAM rICK
2  jOHN
3  aLBER@T
dtype: object

islower()

import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
print(s.str.islower())

Its output is as follows:

0  False
1  False
2  False
3  False
dtype: bool

isupper()

import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])

print(s.str.isupper())

Its output is as follows:

0  False
1  False
2  False
3  False
dtype: bool

isnumeric()

import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])

print(s.str.isnumeric())

Its output is as follows:

0  False
1  False
2  False
3  False
dtype: bool
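The three checks above all return False for the sample Series, so the sketch below uses a few different values (invented for illustration) that produce True as well. Note that a decimal point is not a numeric character, so '12.5' does not pass isnumeric().

```python
import pandas as pd

# Sample data invented for this illustration
s = pd.Series(['Tom', '1234', '12.5', 'tom'])

numeric = s.str.isnumeric()
lower = s.str.islower()

print(numeric.tolist())   # [False, True, False, False]
print(lower.tolist())     # [False, False, False, True]
```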