八、处理文本

原文：Working with Text

校验：Kitty Du

自豪地采用谷歌翻译

# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/08'))

大量的数据不是像 CSV 中的数字那样，而是像书籍、文档、博客文章和互联网评论中的自由文本。数字和分类数据通常是从物理现象中收集的，而文本数据是从人类的交流和表达中产生的。与大多数类型的数据一样，有许多处理文本的技术，需要多本书来详细解释。在本章中，我们将介绍这些技术中的一小部分，它们为处理文本提供各种有用的操作：Python 字符串操作和正则表达式。

8.1 python 字符串方法

# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/08'))

# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

python 为基本的字符串操作提供了多种方法。虽然这些方法简单，但是作为基础组合在一起形成了更加复杂的字符串操作。我们将在“处理文本：数据清洗”的通用用例中介绍 Python 的字符串方法。

清洗文本数据

数据通常来自几个不同的来源，每个来源都实现了自己的信息编码方式。在下面的示例中，我们用一个表记录一个县(County)所属的州(State)，另一个表记录该县的人口(Population)。

# HIDDEN
state = pd.DataFrame({
    'County': [
        'De Witt County',
        'Lac qui Parle County',
        'Lewis and Clark County',
        'St John the Baptist Parish',
    ],
    'State': [
        'IL',
        'MN',
        'MT',
        'LA',
    ]
})
population = pd.DataFrame({
    'County': [
        'DeWitt  ',
        'Lac Qui Parle',
        'Lewis & Clark',
        'St. John the Baptist',
    ],
    'Population': [
        '16,798',
        '8,067',
        '55,716',
        '43,044',
    ]
})

state

	County	State
0	De Witt County	IL(伊利诺伊州)
1	Lac qui Parle County	MN(明尼苏达州)
2	Lewis and Clark County	MT(蒙大拿州)
3	St John the Baptist Parish	LA(路易斯安那州)

population

	County	Population
0	DeWitt	16,798
1	Lac Qui Parle	8,067
2	Lewis & Clark	55,716
3	St. John the Baptist	43,044

我们当然希望使用County列连接state和population表。不幸的是，两张表中没有一个县的拼写相同。此示例说明了文本数据中存在以下常见问题：

大写：qui 对应 Qui
不同的标点符号习惯：St. 对应 St
缺少单词：在population表中缺少单词County/Parish
空白的使用：DeWitt 对应 De Witt
不同的缩写习惯：& 对应 and

字符串方法

python 的字符串方法允许我们着手解决这些问题。这些方法在所有 python 字符串上都被方便地定义，因此不需要导入其他模块。虽然有必要熟悉一下字符串方法的完整列表，但我们在下表中描述了一些最常用的方法。

方法	说明
`str[x:y]`	切片`str`，返回索引 x（包含）到 y（不包含）
`str.lower()`	返回字符串的副本，所有字母都转换为小写
`str.replace(a, b)`	用子字符串`b`替换`str`中子字符串`a`的所有实例
`str.split(a)`	返回在子字符串`a`处拆分的`str`子字符串
`str.strip()`	从`str`中删除前导空格和尾随空格

我们从state和population表中选择 St. John the Baptist parish 的字符串，并应用字符串方法去除大写、标点符号和county/parish的出现。

john1 = state.loc[3, 'County']
john2 = population.loc[3, 'County']

(john1
 .lower()
 .strip()
 .replace(' parish', '')
 .replace(' county', '')
 .replace('&', 'and')
 .replace('.', '')
 .replace(' ', '')
)

'stjohnthebaptist'

将同一组方法应用于john2，这样我们就能验证两个字符串现在是否相同。

(john2
 .lower()
 .strip()
 .replace(' parish', '')
 .replace(' county', '')
 .replace('&', 'and')
 .replace('.', '')
 .replace(' ', '')
)

'stjohnthebaptist'

满意的是，我们创建了一个名为clean_county的方法来规范化输入的 county。

def clean_county(county):
    return (county
            .lower()
            .strip()
            .replace(' county', '')
            .replace(' parish', '')
            .replace('&', 'and')
            .replace(' ', '')
            .replace('.', ''))

现在，我们可以验证clean_county方法为两个表中的所有的 county 生成匹配的 county：

([clean_county(county) for county in state['County']],
 [clean_county(county) for county in population['County']]
)

(['dewitt', 'lacquiparle', 'lewisandclark', 'stjohnthebaptist'],
 ['dewitt', 'lacquiparle', 'lewisandclark', 'stjohnthebaptist'])

因为两个表中的每个 county 都有相同的转换表示，所以我们可以使用转换后的 county 成功地连接两个表。

pandas 中的字符串方法

在上面的代码中，我们使用一个循环来转换每个 county。pandas的 Series 对象提供了一种将字符串方法应用于序列中每个项的方便方法。首先，state表中的 county 序列：

state['County']

0                De Witt County
1          Lac qui Parle County
2        Lewis and Clark County
3    St John the Baptist Parish
Name: County, dtype: object

pandas的 Series 的.str属性提供了和原生 Python 中相同的字符串方法。对.str属性调用方法会对序列中的每个项调用该方法。

state['County'].str.lower()

0                de witt county
1          lac qui parle county
2        lewis and clark county
3    st john the baptist parish
Name: County, dtype: object

这允许我们在不使用循环的情况下转换序列中的每个字符串。

(state['County']
 .str.lower()
 .str.strip()
 .str.replace(' parish', '')
 .str.replace(' county', '')
 .str.replace('&', 'and')
 .str.replace('.', '')
 .str.replace(' ', '')
)

0              dewitt
1         lacquiparle
2       lewisandclark
3    stjohnthebaptist
Name: County, dtype: object

我们将转换后的 county 保存回其原始表：

state['County'] = (state['County']
 .str.lower()
 .str.strip()
 .str.replace(' parish', '')
 .str.replace(' county', '')
 .str.replace('&', 'and')
 .str.replace('.', '')
 .str.replace(' ', '')
)

population['County'] = (population['County']
 .str.lower()
 .str.strip()
 .str.replace(' parish', '')
 .str.replace(' county', '')
 .str.replace('&', 'and')
 .str.replace('.', '')
 .str.replace(' ', '')
)

现在，这两个表包含了 county 的相同字符串表示：

state

	County	State
0	dewitt	IL
1	lacquiparle	MN
2	lewisandclark	MT
3	stjohnthebaptist	LA

population

	County	Population
0	dewitt	16,798
1	lacquiparle	8,067
2	lewisandclark	55,716
3	stjohnthebaptist	43,044

一旦 county 匹配，就很容易连接这些表了。

state.merge(population, on='County')

	County	State	Population
0	dewitt	IL	16,798
1	lacquiparle	MN	8,067
2	lewisandclark	MT	55,716
3	stjohnthebaptist	LA	43,044

摘要

python 的字符串方法形成了一组简单而有用的字符串操作。pandas的 Series 实现了相同的方法，将底层 python 方法应用于序列中的每个字符串。

您可以在这里找到关于 python 的string方法的完整文档。还可以在这里找到关于 pandas 的文档str方法的完整文档。

8.2 正则表达式

# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/08'))

在本节中，我们将介绍正则表达式，这是一个在字符串中指定模式的重要工具。

动机

在较大篇幅的文本中，许多有用的子字符串以特定的格式出现。例如，下面的句子包含了一个美国电话号码。

"give me a call, my number is 123-456-7890."

电话号码包含以下模式：

三个数字
然后是破折号
后面是三个数字
然后是破折号
后面是四个数字

如果有一段自由格式的文本，我们自然会希望检测和提取电话号码。我们还可能希望提取特定的电话号码片段，例如，通过提取区号，我们可以推断文本中提到的个人的位置。

为了检测字符串是否包含电话号码，我们可以尝试编写如下方法：

def is_phone_number(string):

    digits = '0123456789'

    def is_not_digit(token):
        return token not in digits 

    # Three numbers
    for i in range(3):
        if is_not_digit(string[i]):
            return False

    # Followed by a dash
    if string[3] != '-':
        return False

    # Followed by three numbers
    for i in range(4, 7):
        if is_not_digit(string[i]):
            return False

    # Followed by a dash    
    if string[7] != '-':
        return False

    # Followed by four numbers
    for i in range(8, 12):
        if is_not_digit(string[i]):
            return False

    return True

is_phone_number("382-384-3840")

True

is_phone_number("phone number")

False

上面的代码令人不快且冗长。比起手动循环字符串中的字符，我们可能更喜欢指定一个模式并命令 python 来匹配该模式。

正则表达式（通常缩写为regex）通过允许我们为字符串创建通用模式，方便地解决了这个确切的问题。使用正则表达式，我们可以在短短两行 python 中重新实现is_phone_number方法：

import re

def is_phone_number(string):
    regex = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
    return re.search(regex, string) is not None

is_phone_number("382-384-3840")

True

在上面的代码中，我们使用 regex[0-9]{3}-[0-9]{3}-[0-9]{4}来匹配电话号码。虽然乍一看很神秘，但幸运的是，正则表达式的语法要比 Python 语言本身简单得多；我们仅在本节中介绍几乎所有的语法。

我们还将介绍使用 regex 执行字符串操作的内置 python 模块re。

regex 语法

我们从正则表达式的语法开始。在 Python 中，正则表达式最常见的存储形式是原始字符串。原始字符串的行为类似于普通的 python 字符串，除了需要对反斜杠进行特殊处理。

例如，要将字符串hello \ world存储在普通的 python 字符串中，我们必须编写：

# Backslashes need to be escaped in normal Python strings
some_string = 'hello \\ world'
print(some_string)

hello \ world

使用原始字符串可以消除对反斜杠的转义：

# Note the `r` prefix on the string
some_raw_string = r'hello \ world'
print(some_raw_string)

hello \ world

因为反斜杠经常出现在正则表达式中，所以我们将在本节中对所有正则表达式使用原始字符串。

文字

正则表达式中的文字字符与字符本身匹配。例如，regexr"a"将与"Say! I like green eggs and ham!"中的任何"a"匹配。所有字母数字字符和大多数标点符号都是 regex 文字。

# HIDDEN
def show_regex_match(text, regex):
    """
    Prints the string with the regex match highlighted.
    """
    print(re.sub(f'({regex})', r'\033[1;30;43m\1\033[m', text))

# The show_regex_match method highlights all regex matches in the input string
regex = r"green"
show_regex_match("Say! I like green eggs and ham!", regex)

Say! I like green eggs and ham!

show_regex_match("Say! I like green eggs and ham!", r"a")

Say! I like green eggs and ham!

在上面的示例中，我们观察到正则表达式可以匹配出现在输入字符串中任何位置的模式。在 python 中，此行为的差异取决于匹配 regex 所的方法，有些方法仅在 regex 出现在字符串开头时返回匹配；有些方法在字符串的任何位置返回匹配。

还要注意，show_regex_match方法突出显示了输入字符串中所有出现的 regex。同样，这取决于使用的 python 方法，某些方法返回所有匹配项，而有些方法只返回第一个匹配项。

正则表达式区分大小写。在下面的示例中，regex 只匹配eggs中的小写s，而不匹配Say中的大写S。

show_regex_match("Say! I like green eggs and ham!", r"s")

Say! I like green eggs and ham!

通配符

有些字符在正则表达式中有特殊的含义。这些元字符允许正则表达式匹配各种模式。

在正则表达式中，句点字符.与除换行符以外的任何字符匹配。

show_regex_match("Call me at 382-384-3840.", r".all")

Call me at 382-384-3840.

为了只匹配句点文字字符，我们必须用反斜杠转义它：

show_regex_match("Call me at 382-384-3840.", r"\.")

Call me at 382-384-3840 .

通过使用句点字符来标记不同模式的各个部分，我们构造了一个 regex 来匹配电话号码。例如，我们可以将原始电话号码382-384-3840中的数字替换为.，将破折号保留为文字。得到的 regex 结果为 ...-...-....。

show_regex_match("Call me at 382-384-3840.", "...-...-....")

Call me at 382-384-3840.

但是，由于句点字符与所有字符匹配，下面的输入字符串将产生一个伪匹配。

show_regex_match("My truck is not-all-blue.", "...-...-....")

My truck is not-all-blue.

字符类

字符类匹配指定的字符集，允许我们创建限制比只有.字符更严格的匹配。要创建字符类，请将所需字符集括在括号[ ]中。

show_regex_match("I like your gray shirt.", "gr[ae]y")

I like your gray shirt.

show_regex_match("I like your grey shirt.", "gr[ae]y")

I like your grey shirt.

# Does not match; a character class only matches one character from a set
show_regex_match("I like your graey shirt.", "gr[ae]y")

I like your graey shirt.

# In this example, repeating the character class will match
show_regex_match("I like your graey shirt.", "gr[ae][ae]y")

I like your graey shirt.

在字符类中，.字符被视为文字，而不是通配符。

show_regex_match("I like your grey shirt.", "irt[.]")

I like your grey shirt.

对于常用的字符类，我们可以使用一些特殊的速记符号：

速记	意义
[0－9]	所有的数字
[a-z]	小写字母
[A-Z]	大写字母

show_regex_match("I like your gray shirt.", "y[a-z]y")

I like your gray shirt.

字符类允许我们为电话号码创建更具体的 regex。

# We replaced every `.` character in ...-...-.... with [0-9] to restrict
# matches to digits.
phone_regex = r'[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'
show_regex_match("Call me at 382-384-3840.", phone_regex)

Call me at 382-384-3840.

# Now we no longer match this string:
show_regex_match("My truck is not-all-blue.", phone_regex)

My truck is not-all-blue.

否定的字符类

否定的字符类匹配除类中字符以外的任何字符。要创建否定字符类，请将否定字符括在[^ ]中。

show_regex_match("The car parked in the garage.", r"[^c]ar")

The car parked in the garage.

量词

为了创建与电话号码匹配的 regex，我们编写了：

[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]

这匹配 3 位数字、一个破折号、3 位数字、一个破折号和 4 位数字。

量词允许我们匹配一个模式的多个连续出现。我们通过将数字放在大括号{ }中指定重复次数。

phone_regex = r'[0-9]{3}-[0-9]{3}-[0-9]{4}'
show_regex_match("Call me at 382-384-3840.", phone_regex)

Call me at 382-384-3840.

# No match
phone_regex = r'[0-9]{3}-[0-9]{3}-[0-9]{4}'
show_regex_match("Call me at 12-384-3840.", phone_regex)

Call me at 12-384-3840.

量词总是将字符或字符类修改为最左边。下表给出了量词的完整语法。

量词	意义
{m,n}	与前面的字符匹配 m ~ n 次。
{m}	与前面的字符刚好匹配 m 次。
{m,}	至少与前面的字符匹配 m 次。
{,n}	最多与前面的字符匹配 n 次。

速记量词

一些常用的量词有一个简写：

符号	量词	意义
*	{0,}	与前面的字符匹配 0 次或更多次
+	{1,}	与前面的字符匹配 1 次或更多次
?	{0,1}	与前面的字符匹配 0 或 1 次

在下面的示例中，我们使用*字符而不是{0,}。

# 3 a's
show_regex_match('He screamed "Aaaah!" as the cart took a plunge.', "Aa*h!")

He screamed "Aaaah!" as the cart took a plunge.

# Lots of a's
show_regex_match(
    'He screamed "Aaaaaaaaaaaaaaaaaaaah!" as the cart took a plunge.',
    "Aa*h!"
)

He screamed "Aaaaaaaaaaaaaaaaaaaah!" as the cart took a plunge.

# No lowercase a's
show_regex_match('He screamed "Ah!" as the cart took a plunge.', "Aa*h!")

He screamed "Ah!" as the cart took a plunge.

量词贪婪

量词总是返回尽可能长的匹配。这有时会导致令人惊讶的行为：

# We tried to match 311 and 911 but matched the ` and ` as well because
# `<311> and <911>` is the longest match possible for `<.+>`.
show_regex_match("Remember the numbers <311> and <911>", "<.+>")

Remember the numbers <311> and <911>

在许多情况下，使用更具体的字符类可以防止这些错误匹配：

show_regex_match("Remember the numbers <311> and <911>", "<[0-9]+>")

Remember the numbers <311> and <911>

固定

有时模式应该只在字符串的开头或结尾匹配。特殊字符^仅当模式出现在字符串的开头时才固定匹配 regex ；特殊字符$仅当模式出现在字符串的结尾时固定匹配 regex 。例如，regex well$ 只匹配字符串末尾的well。

show_regex_match('well, well, well', r"well$")

well, well, well

同时使用^和$需要 regex 匹配整个字符串。

phone_regex = r"^[0-9]{3}-[0-9]{3}-[0-9]{4}$"
show_regex_match('382-384-3840', phone_regex)

382-384-3840

# No match
show_regex_match('You can call me at 382-384-3840.', phone_regex)

You can call me at 382-384-3840.

转义元字符

在正则表达式中，所有 regex 元字符都具有特殊意义。为了将元字符匹配为文字，我们使用\字符对它们进行转义。

# `[` is a meta character and requires escaping
show_regex_match("Call me at [382-384-3840].", "\[")

Call me at [ 382-384-3840].

# `.` is a meta character and requires escaping
show_regex_match("Call me at [382-384-3840].", "\.")

Call me at [382-384-3840].

参考表

我们现在已经介绍了最重要的 regex 语法和元字符。为了更完整的参考，我们包括了下表。

元字符

此表包含大多数重要的 元字符 ，这有助于我们在字符串中指定要匹配的某些模式。

字符	说明	例子	匹配	不匹配
.	除\n 以外的任何字符	`...`	abc	ab abcd
[]	括号内的任何字符	`[cb.]ar`	car .ar	jar
[^ ]	除了括号内的任何字符	`[^b]ar`	car par	bar ar
*	≥0 次的前一个字符	`[pb]*ark`	bbark ark	dark
+	≥1 次的前一个字符	`[pb]+ark`	bbpark bark	dark ark
?	0 或 1 次的前一个字符	`s?he`	she he	the
{n}	刚好 n 次的前一个字符	`hello{3}`	hellooo	hello
\|	\|前或\|后的模式	`we\|[ui]s`	we us is	e s
\	转义下一个字符	`\[hi\]`	[hi]	hi
^	行首	`^ark`	ark two	dark
$	行尾	`ark$`	noahs ark	noahs arks

速记字符集

一些常用的字符集有速记。

说明	括号形式	速记
字母数字字符	`[a-zA-Z0-9]`	`\w`
不是字母数字字符	`[^a-zA-Z0-9]`	`\W`
数字	`[0-9]`	`\d`
不是数字	`[^0-9]`	`\D`
空白	`[\t\n\f\r\p{Z}]`	`\s`
不是空白	`[^\t\n\f\r\p{z}]`	`\S`

摘要

几乎所有的编程语言都有一个库，可以使用正则表达式来匹配模式，使它们无论使用哪种特定的语言都很有用。在本节中，我们介绍了 regex 语法和最有用的元字符。

8.3 Python 和 pandas 中的 Regex

# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/08'))

# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

在本节中，我们将介绍 python 内置的re模块中 regex 的用法。因为我们只介绍了一些最常用的方法，所以您也可以参考有关re模块的官方文档。

`re.search`

re.search(pattern, string)在string中的任意位置搜索 regexpattern的匹配项。如果找到模式，则返回一个 TruthyMatch 对象；如果没有，则返回None。

phone_re = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
text  = "Call me at 382-384-3840."
match = re.search(phone_re, text)
match

<_sre.SRE_Match object; span=(11, 23), match='382-384-3840'>

虽然返回的 match 对象有各种有用的属性，但我们最常用re.search来测试模式是否出现在字符串中。

if re.search(phone_re, text):
    print("Found a match!")

Found a match!

if re.search(phone_re, 'Hello world'):
    print("No match; this won't print")

另一个常用的方法re.match(pattern, string)的行为与re.search相同，但只检查string开头的匹配项，而不是字符串中任何位置的匹配项。

`re.findall`

我们使用re.findall(pattern, string)提取与 regex 匹配的子字符串。此方法返回string中所有匹配项的列表。

gmail_re = r'[a-zA-Z0-9]+@gmail\.com'
text = '''
From: email1@gmail.com
To: email2@yahoo.com and email3@gmail.com
'''
re.findall(gmail_re, text)

['email1@gmail.com', 'email3@gmail.com']

Regex 组

使用regex 组，我们通过将子模式括在括号( )中指定要从 regex 提取的子模式。当 regex 包含 regex 组时，re.findall返回包含子模式内容的元组列表。

例如，以下是我们熟悉的用 regex 从字符串中提取电话号码：

phone_re = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
text  = "Sam's number is 382-384-3840 and Mary's is 123-456-7890."
re.findall(phone_re, text)

['382-384-3840', '123-456-7890']

为了将一个电话号码的三位或四位组成部分分开，我们可以将每个数字组用括号括起来。

# Same regex with parentheses around the digit groups
phone_re = r"([0-9]{3})-([0-9]{3})-([0-9]{4})"
text  = "Sam's number is 382-384-3840 and Mary's is 123-456-7890."
re.findall(phone_re, text)

[('382', '384', '3840'), ('123', '456', '7890')]

正如所承诺的那样，re.findall返回包含匹配电话号码的各个组成部分的元组列表。

`re.sub`

re.sub(pattern, replacement, string)用replacement替换string中所有出现的pattern。此方法的行为类似于 python 字符串方法str.sub，但使用 regex 来匹配模式。

在下面的代码中，我们通过用破折号替换日期分隔符来将日期更改为通用格式。

messy_dates = '03/12/2018, 03.13.18, 03/14/2018, 03:15:2018'
regex = r'[/.:]'
re.sub(regex, '-', messy_dates)

'03-12-2018, 03-13-18, 03-14-2018, 03-15-2018'

`re.split`

re.split(pattern, string)在每次出现 regex pattern时分割输入的string。此方法的行为类似于 python 字符串方法str.split，但使用 regex 进行分割。

在下面的代码中，我们使用re.split将一本书目录的章节名称和它们的页码分开。

toc = '''
PLAYING PILGRIMS============3
A MERRY CHRISTMAS===========13
THE LAURENCE BOY============31
BURDENS=====================55
BEING NEIGHBORLY============76
'''.strip()

# First, split into individual lines
lines = re.split('\n', toc)
lines

['PLAYING PILGRIMS============3',
 'A MERRY CHRISTMAS===========13',
 'THE LAURENCE BOY============31',
 'BURDENS=====================55',
 'BEING NEIGHBORLY============76']

# Then, split into chapter title and page number
split_re = r'=+' # Matches any sequence of = characters
[re.split(split_re, line) for line in lines]

[['PLAYING PILGRIMS', '3'],
 ['A MERRY CHRISTMAS', '13'],
 ['THE LAURENCE BOY', '31'],
 ['BURDENS', '55'],
 ['BEING NEIGHBORLY', '76']]

Regex 和 pandas

回想一下，pandas Series 对象有一个.str属性，它支持使用 python 字符串方法进行字符串操作。很方便的是，.str属性还支持一些re模块的函数。我们演示了 regex 在pandas中的基本用法，完整的方法列表在有关字符串方法的pandas文档中。

我们在下面的 DataFrame 中保存了小说《小女人》(Little Women)前五句话的文本。我们可以使用pandas提供的字符串方法来提取每个句子中的口语对话。

# HIDDEN
text = '''
"Christmas won't be Christmas without any presents," grumbled Jo, lying on the rug.
"It's so dreadful to be poor!" sighed Meg, looking down at her old dress.
"I don't think it's fair for some girls to have plenty of pretty things, and other girls nothing at all," added little Amy, with an injured sniff.
"We've got Father and Mother, and each other," said Beth contentedly from her corner.
The four young faces on which the firelight shone brightened at the cheerful words, but darkened again as Jo said sadly, "We haven't got Father, and shall not have him for a long time."
'''.strip()
little = pd.DataFrame({
    'sentences': text.split('\n')
})

little

	sentences
0	"Christmas won't be Christmas without any pres...
1	"It's so dreadful to be poor!" sighed Meg, loo...
2	"I don't think it's fair for some girls to hav...
3	"We've got Father and Mother, and each other,"...
4	The four young faces on which the firelight sh...

由于口语对话位于双引号内，因此我们创建一个 regex，它捕获左双引号、除双引号外的任何字符序列和右双引号。

quote_re = r'"[^"]+"'
little['sentences'].str.findall(quote_re)

0    ["Christmas won't be Christmas without any pre...
1                     ["It's so dreadful to be poor!"]
2    ["I don't think it's fair for some girls to ha...
3     ["We've got Father and Mother, and each other,"]
4    ["We haven't got Father, and shall not have hi...
Name: sentences, dtype: object

由于Series.str.findall方法返回匹配项列表，pandas还提供Series.str.extract和Series.str.extractall方法将匹配项提取到 Series 或 DataFrame 中。这些方法要求 regex 至少包含一个 regex 组。

# Extract text within double quotes
quote_re = r'"([^"]+)"'
spoken = little['sentences'].str.extract(quote_re)
spoken

0    Christmas won't be Christmas without any prese...
1                         It's so dreadful to be poor!
2    I don't think it's fair for some girls to have...
3         We've got Father and Mother, and each other,
4    We haven't got Father, and shall not have him ...
Name: sentences, dtype: object

我们可以将此序列添加为littleDataFrame 的列：

little['dialog'] = spoken
little

	sentences	dialog
0	"Christmas won't be Christmas without any pres...	Christmas won't be Christmas without any prese...
1	"It's so dreadful to be poor!" sighed Meg, loo...	It's so dreadful to be poor!
2	"I don't think it's fair for some girls to hav...	I don't think it's fair for some girls to have...
3	"We've got Father and Mother, and each other,"...	We've got Father and Mother, and each other,
4	The four young faces on which the firelight sh...	We haven't got Father, and shall not have him ...

我们可以通过打印原始文本和提取文本来确认字符串操作在 DataFrame 中的最后一句话上是否如预期执行：

print(little.loc[4, 'sentences'])

The four young faces on which the firelight shone brightened at the cheerful words, but darkened again as Jo said sadly, "We haven't got Father, and shall not have him for a long time."

print(little.loc[4, 'dialog'])

We haven't got Father, and shall not have him for a long time.

摘要

python 中的re模块提供了一组使用正则表达式操作文本的实用方法。在处理 DataFrame 时，我们经常使用pandas中实现的类似的字符串操作方法。

有关re模块的完整文档，请参阅https://docs.python.org/3/library/re.html

有关pandas字符串方法的完整文档，请参阅https://pandas.pydata.org/pandas-docs/stable/text.html

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

8.md

8.md

八、处理文本

8.1 python 字符串方法

清洗文本数据

字符串方法

pandas 中的字符串方法

摘要

8.2 正则表达式

动机

regex 语法

文字

通配符

字符类

否定的字符类

量词

量词贪婪

固定

转义元字符

参考表

摘要

8.3 Python 和 pandas 中的 Regex

`re.search`

`re.findall`

Regex 组

`re.sub`

`re.split`

Regex 和 pandas

摘要

Files

8.md

Latest commit

History

8.md

File metadata and controls

八、处理文本

8.1 python 字符串方法

清洗文本数据

字符串方法

pandas 中的字符串方法

摘要

8.2 正则表达式

动机

regex 语法

文字

通配符

字符类

否定的字符类

量词

量词贪婪

固定

转义元字符

参考表

摘要

8.3 Python 和 pandas 中的 Regex

re.search

re.findall

Regex 组

re.sub

re.split

Regex 和 pandas

摘要

`re.search`

`re.findall`

`re.sub`

`re.split`