How to extract only one string from regex in Python?

Question

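I've been trying to build a simple account-manager type of application for myself in Python that reads the SMS messages from my phone and extracts information based on some regex patterns.

I wrote a complex regex pattern and tested it on https://pythex.org/. Example: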

Text: 1.00 is debited from ******1234  for food

Pattern: (account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)

Result: from ******1234

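However, when I try to do the same thing in Python using the str.extract() method, instead of a single result I get a DataFrame with one column for every group.

The Python code looks like this: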

all_sms=pd.read_csv("all_sms.csv")

pattern = '(account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)'

test = all_sms.extract(pattern, expand = False)

Output of the Python code for the message above:

0           from
1               
2            NaN
3            NaN
4            NaN
5     ******1234
6           1234
7           1234
8               
9               
10

I'm new to Python and trying to learn through hands-on experience. It would be very helpful if someone could point out where I'm going wrong.

Answer 1

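Before digging into the regex pattern, you should consider why you are using pandas. Pandas is well suited to data analysis (and so fits your problem), but it seems like overkill here.

If you are a beginner, I suggest sticking with pure Python, not because pandas is complicated, but because knowing the Python standard library will help you in the long run. Skipping the basics now may hurt you later.

Assuming you are using Python 3 (without pandas), I would proceed as follows: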

# Needed imports from standard library.
import csv
import re

# Declare the constants of my tiny program.
# Raw string, so the backslashes reach the regex engine unchanged.
PATTERN = r'(account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)'
COMPILED_REGEX = re.compile(PATTERN)

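# This list will store the text matched by the regex in each row.
found_regexes = list()

# Do the necessary loading to enable searching for the regex.
with open('mysmspath.csv', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=' ', quotechar='"')
    # Iterate over rows in your csv file.
    for row in csv_reader:
        # csv.reader yields each row as a list of fields, while search()
        # expects a string, so join the fields back together first.
        text = ' '.join(row)
        match = COMPILED_REGEX.search(text)
        if match:
            # group(0) is the whole match, not the individual groups.
            found_regexes.append(match.group(0))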

print(found_regexes)

This won't necessarily solve your problem by copy-paste, but it may give you an idea of a simpler way to approach it.
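
To address the original question more directly: pandas' str.extract returns one column per capturing group, whereas the single string shown by pythex is the whole match, which the re module exposes as match.group(0). Below is a minimal sketch on the sample text from the question. The pattern is a simplified, hypothetical stand-in for the full one, and the names SAMPLE and SIMPLE_PATTERN are mine. The inner groups are made non-capturing with (?:...) and one outer group wraps everything, so exactly one capturing group remains.

```python
import re

# Sample message taken from the question.
SAMPLE = "1.00 is debited from ******1234  for food"

# Simplified, hypothetical stand-in for the full pattern (an assumption,
# not the asker's regex). (?:...) groups do not capture, so the only
# capturing group left is the outer pair of parentheses.
SIMPLE_PATTERN = r"((?:account|a/c|acct|from)\s*(?:[0-9]+|\*+|x+)+)"
regex = re.compile(SIMPLE_PATTERN)

match = regex.search(SAMPLE)
whole_match = match.group(0) if match else None  # entire matched text
only_group = match.group(1) if match else None   # the single capturing group

print(regex.groups)  # number of capturing groups in the pattern
print(whole_match)
```

If you do stay with pandas, a single-group pattern like this passed to something like all_sms["text"].str.extract(SIMPLE_PATTERN, expand=False) should give back one Series instead of a multi-column DataFrame. The column name "text" is an assumption, since the question does not show the CSV layout.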

python

regex

string

text-extraction