Python regex for the pattern 2 digits to 2 digits, e.g. 26 to 40

Question

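Please help, regex is blowing my mind.

I am cleaning data in a Pandas dataframe (Python 3).

I have tried many regex combinations for digits that I found online, but none of them worked for my case. I can't seem to figure out how to write my own regex for the pattern: 2 digits, space, "to", space, 2 digits (example: 26 to 40).

My challenge is to extract the number of petals from the pandas column BLOOM (scraped data). Petals are usually specified as "dd to dd petals". I know that 2 digits in regex is \d\d or \d{2}, but how do I incorporate the separator "to"? Ideally there would also be a condition that the pattern is followed by the word "petals".

Surely I am not the first person who needs a Python regex for the pattern \d\d to \d\d.

Edit:

I realized that the question was a bit confusing without a sample dataframe, so here is one.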

import pandas as pd 
import re

# initialize list of lists 
data = [['Evert van Dijk', 'Carmine-pink, salmon-pink streaks, stripes, flecks.  Warm pink, clear carmine pink, rose pink shaded salmon.  Mild fragrance.  Large, very double, in small clusters, high-centered bloom form.  Blooms in flushes throughout the season.'],
    ['Every Good Gift', 'Red.  Flowers velvety red.  Moderate fragrance.  Average diameter 4".  Medium-large, full (26-40 petals), borne mostly solitary bloom form.  Blooms in flushes throughout the season.'], 
    ['Evghenya', 'Orange-pink.  75 petals.  Large, very double bloom form.  Blooms in flushes throughout the season.'], 
    ['Evita', 'White or white blend.  None to mild fragrance.  35 petals.  Large, full (26-40 petals), high-centered bloom form.  Blooms in flushes throughout the season.'],
    ['Evrathin', 'Light pink. [Deep pink.]  Outer petals white. Expand rarely.  Mild fragrance.  35 to 40 petals.  Average diameter 2.5".  Medium, double (17-25 petals), full (26-40 petals), cluster-flowered, in small clusters bloom form.  Prolific, once-blooming spring or summer.  Glandular sepals, leafy sepals, long sepals buds.'],
    ['Evita 2', 'White, blush shading.  Mild, wild rose fragrance.  20 to 25 petals.  Average diameter 1.25".  Small, very double, cluster-flowered bloom form.  Blooms in flushes throughout the season.']]

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['NAME', 'BLOOM']) 

# print dataframe. 
df

Answer 1

This works for me:

import re

sample = '2 digits (example 26 to 40 petals) and 16 to 43 petals.'
re.compile(r"\d{2}\sto\s\d{2}\spetals").findall(sample)

Output:

['26 to 40 petals', '16 to 43 petals']

As you said, \d{2} matches 2 digits, \sto\s matches the word 'to' surrounded by whitespace, then \d{2} again matches the second 2-digit number, followed by whitespace (\s) and the word 'petals'.
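If the two numbers are needed separately rather than as one string, the same pattern can be given capture groups. The following is only a minimal sketch: the sample string is made up in the style of the BLOOM column, and the names pattern, low and high are illustrative.

```python
import re

# Hypothetical sample text, modelled on the BLOOM strings in the question.
sample = 'Mild fragrance.  35 to 40 petals.  Average diameter 2.5".'

# Same pattern as above, with one capture group per number.
pattern = re.compile(r"(\d{2})\sto\s(\d{2})\spetals")

match = pattern.search(sample)
if match:
    # group(1) is the lower bound, group(2) the upper bound.
    low, high = int(match.group(1)), int(match.group(2))
    print(low, high)  # 35 40
```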

Answer 2

You can use

df['res_col'] = df['src_col'].str.extract(r'(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal', expand=False)

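See the regex demo.

Details

(?<!\d) - a negative lookbehind that ensures there is no digit immediately to the left of the current position
(\d{2}\s+to\s+\d{2}) - Group 1 (the value that str.extract actually returns):
- \d{2} - two digits
- \s+to\s+ - 1+ whitespace characters, the string to, 1+ whitespace characters
- \d{2} - two digits
\s*petal - 0+ whitespace characters followed by petal.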
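To see what the lookbehind contributes, the pattern can be compared with and without it on a string whose first number has three digits. This is a small sketch with a made-up input; with_guard and without_guard are illustrative names.

```python
import re

# The pattern from this answer, and the same pattern without the lookbehind.
with_guard = re.compile(r"(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal")
without_guard = re.compile(r"(\d{2}\s+to\s+\d{2})\s*petal")

text = "120 to 40 petals"  # hypothetical input with a 3-digit first number

# With the lookbehind, "20" is rejected because a digit precedes it.
guarded = with_guard.search(text)
print(guarded)  # None

# Without it, the tail of "120" is picked up as a 2-digit number.
unguarded = without_guard.search(text)
print(unguarded.group(1))  # 20 to 40
```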

Answer 3

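I am posting an answer to show how I solved the problem of extracting petal data from the BLOOM column. I had to use several regular expressions to get all the data I wanted; this question covers only one of them.

When printed, the sample dataframe looks like this:

I created these columns before running into the problem that led to this post. My initial approach was to get all the data inside brackets.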

# copy the content of the first pair of brackets in column BLOOM into new column PETALS
df['PETALS'] = df['BLOOM'].str.extract(r'(\(.*?)\)', expand=False).str.strip()
# drop the leading "(" (regex=False treats it as a literal character)
df['PETALS'] = df['PETALS'].str.replace("(", "", regex=False)

# copy the content of all brackets in column BLOOM into new column ALL_PETALS_BRACKETS
df['ALL_PETALS_BRACKETS'] = df['BLOOM'].str.findall(r'(\(.*?)\)')
df[['NAME','BLOOM','PETALS', 'ALL_PETALS_BRACKETS']]

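Later I realized that this approach only picks up petal values for some of the rows. Petals can be specified in the BLOOM column in more than one way. Another common pattern is "2 digits to 2 digits". There is also the pattern "2 digits petals".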
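For illustration only, the three formats mentioned so far could in principle be covered by a single pattern with an optional upper bound. This is a sketch under assumptions, not the code used in this answer: the miniature dataframe is made up from the sample rows in the question, and the group names low and high are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the BLOOM column from the question.
df = pd.DataFrame({
    "BLOOM": [
        "Red.  Medium-large, full (26-40 petals), borne mostly solitary.",
        "Orange-pink.  75 petals.  Large, very double bloom form.",
        "Light pink.  35 to 40 petals.  Medium, double (17-25 petals).",
        "Carmine-pink.  Large, very double, in small clusters.",
    ]
})

# low is required; the "to dd" or "-dd" part is optional, so the same
# pattern covers "dd petals", "dd to dd petals" and "dd-dd petals".
pattern = r"(?<!\d)(?P<low>\d{2})(?:\s*(?:to|-)\s*(?P<high>\d{2}))?\s+petals"

# Named groups become column names; only the first match per row is kept.
petals = df["BLOOM"].str.extract(pattern)
print(petals)
```

Rows without any petal count come back as NaN in both columns, and a row with several counts only yields the first one, so this does not replace str.findall when every occurrence matters.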

# solution provided by Wiktor Stribiżew
df['PETALS_Wiktor_S'] = df['BLOOM'].str.extract(r'(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal', expand=False)

# my modification, which worked on the main df and not only on the test one
# copy the part of column BLOOM that matches the pattern "two digits to two digits"
df['PETALS5'] = df['BLOOM'].str.extract(r'(\d{2}\s+to\s+\d{2})', expand=False).str.strip()

# I also came across cases where the pattern is two digits followed by the word "petals"
# copy the part of column BLOOM that matches two digits followed by the word "petals" and a period
df['PETALS6'] = df['BLOOM'].str.extract(r'(\d{2}\s+petals+\.)', expand=False).str.strip()
df

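Because I was looking for the pattern "2 digits petals.", I had to modify my regex so that it looks for a literal dot, using +\. in r'(\d{2}\s+petals+\.)'. If the regex is written as r'(\d{2}\s+petals.)', the unescaped . matches any character, so it also grabs cases where the word petals is followed by something other than a period.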
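The difference between the escaped and the unescaped dot can be checked on a short string. This is a minimal sketch with a made-up input; escaped and unescaped are illustrative names.

```python
import re

# Hypothetical text: one count followed by a period, one followed by a comma.
text = "Outer petals white.  35 petals.  20 petals, cupped."

# \. matches only a literal period after "petals".
escaped = re.findall(r"\d{2}\s+petals+\.", text)
print(escaped)  # ['35 petals.']

# An unescaped . matches any character, so the comma case is caught too.
unescaped = re.findall(r"\d{2}\s+petals.", text)
print(unescaped)  # ['35 petals.', '20 petals,']
```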

regex

python-3.x

data-cleaning

data-wrangling