Not able to extract a string using multiple special characters or patterns simultaneously in Python
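I have a dataset and I am trying to extract the plain town names from the longer, messy version shown here. Most names are followed by a parenthesis "(.*", but some do not follow that pattern and end with ":" instead (see row 200). Finally, a few have no parenthesis and are followed by a comma ", " (see rows 240 and 246).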
'Region'
196 Boston (Boston University, Boston College, Bos...
197 Bridgewater (Bridgewater State College)[2]
198 Cambridge (Harvard University, Massachusetts I...
199 Chestnut Hill (Boston College)
200 The Colleges of Worcester Consortium:
201 Dudley (Nichols College)
240 Faribault, South Central College
241 Mankato (Minnesota State University, Mankato),...
242 Marshall (Southwest Minnesota State University...
243 Moorhead (Minnesota State University, Moorhead...
244 Morris (University of Minnesota Morris)[2]
245 Northfield (Carleton College, St. Olaf College...
246 North Mankato, South Central College
247 St. Cloud (St. Cloud State University, The Col...
248 St. Joseph (College of Saint Benedict)[2]
249 St. Peter (Gustavus Adolphus College)[2]
What I would ideally like to see is:
'RegionName'
196 Boston
197 Bridgewater
198 Cambridge
199 Chestnut Hill
200 The Colleges of Worcester Consortium
201 Dudley
240 Faribault
241 Mankato
242 Marshall
243 Moorhead
244 Morris
245 Northfield
246 North Mankato
247 St. Cloud
248 St. Joseph
249 St. Peter
My current code is:
df['RegionName'] = df['Region'].str.extract('(.*)[:(,]', expand=False)
But this gives me a strange result in which the parentheses are not handled correctly:
196 Boston (Boston University, Boston College, Bos...
197 Bridgewater
198 Cambridge (Harvard University, Massachusetts I...
199 Chestnut Hill
200 The Colleges of Worcester Consortium
201 Dudley
240 Faribault
241 Mankato (Minnesota State University, Mankato)
242 Marshall
243 Moorhead (Minnesota State University, Moorhead
244 Morris
245 Northfield (Carleton College
246 North Mankato
247 St. Cloud (St. Cloud State University
248 St. Joseph
249 St. Peter
I have also tried:
df['RegionName'] = df['Region'].str.extract('(.*)[ (.*|:|,]', expand=False)
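I am not sure how to extract the string using all three patterns at the same time. I would also be open to a two-line solution.
Thanks (and apologies if the formatting is off!)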
Use this regex:
([\w\s.]+)(?<!\s)
If you don't care about trailing whitespace, you can drop the negative lookbehind (?<!\s) at the end.
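As a rough illustration of how this pattern behaves, here is a minimal sketch using the standard `re` module on a few values copied from the question. The helper name `first_match` is made up for this example; with pandas, the same pattern would presumably be passed to `str.extract` as in the question.

```python
import re

# Pattern from the answer above: word characters, whitespace and dots,
# followed by a negative lookbehind so the match cannot end on whitespace.
TOWN = re.compile(r'([\w\s.]+)(?<!\s)')

# Sample values copied from the question's Region column.
regions = [
    'Chestnut Hill (Boston College)',
    'The Colleges of Worcester Consortium:',
    'Faribault, South Central College',
    'St. Peter (Gustavus Adolphus College)[2]',
]

def first_match(text):
    """Return group 1 of the first match, or None if nothing matches."""
    m = TOWN.search(text)
    return m.group(1) if m else None

names = [first_match(r) for r in regions]
print(names)
```

One thing to keep in mind: the character class only allows word characters, whitespace and dots, so a name containing anything else (a hyphen or an apostrophe, say) would be cut short at that character. None of the rows shown in the question has that problem.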
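Since you only have three possible delimiters, you can take advantage of chained split() calls: if the delimiter is not found, split returns a one-element list containing the unmodified string, so taking [0] is always safe.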
>>> s = """196 Boston (Boston University, Boston College, Bos...
... 197 Bridgewater (Bridgewater State College)[2]
... 198 Cambridge (Harvard University, Massachusetts I...
... 199 Chestnut Hill (Boston College)
... 200 The Colleges of Worcester Consortium:
... 201 Dudley (Nichols College)
... 240 Faribault, South Central College
... 241 Mankato (Minnesota State University, Mankato),...
... 242 Marshall (Southwest Minnesota State University...
... 243 Moorhead (Minnesota State University, Moorhead...
... 244 Morris (University of Minnesota Morris)[2]
... 245 Northfield (Carleton College, St. Olaf College...
... 246 North Mankato, South Central College
... 247 St. Cloud (St. Cloud State University, The Col...
... 248 St. Joseph (College of Saint Benedict)[2]
... 249 St. Peter (Gustavus Adolphus College)[2]"""
>>> for i in s.split('\n'):
...     number, text = i.split('(')[0].split(',')[0].split(':')[0].split(' ', 1)
...     print('{} {}'.format(number, text.strip()))
...
196 Boston
197 Bridgewater
198 Cambridge
199 Chestnut Hill
200 The Colleges of Worcester Consortium
201 Dudley
240 Faribault
241 Mankato
242 Marshall
243 Moorhead
244 Morris
245 Northfield
246 North Mankato
247 St. Cloud
248 St. Joseph
249 St. Peter
You can use df.apply to perform the same transformation on the strings.
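A minimal sketch of what that transformation could look like as a reusable function. The name `town_name` is hypothetical, and the sample strings are taken from the question; here the values do not carry the leading row number, since in a DataFrame that number is the index rather than part of the string.

```python
def town_name(region):
    """Cut the string at the first '(', ',' or ':' and trim whitespace."""
    # str.split always returns a list; when the delimiter is absent the
    # list holds the whole string, so taking [0] is safe at every step.
    return region.split('(')[0].split(',')[0].split(':')[0].strip()

# Sample values copied from the question's Region column.
samples = [
    'Mankato (Minnesota State University, Mankato),...',
    'North Mankato, South Central College',
    'The Colleges of Worcester Consortium:',
    'St. Joseph (College of Saint Benedict)[2]',
]

names = [town_name(s) for s in samples]
print(names)
```

Assuming the column is named 'Region' as in the question, this would be applied with something like `df['RegionName'] = df['Region'].apply(town_name)`.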
You can use
df['RegionName'] = df['Region'].str.extract(r'^([^:(,]*)\b', expand=False)
If you are using Python 2.x, put (?u) at the start of the pattern so that the word boundary \b also matches at the correct positions in a Unicode string.
Details
^ - start of the string
([^:(,]*) - Group 1: zero or more (*) consecutive occurrences of any character other than :, ( and , ([^...] forms a negated character class)
\b - a word boundary
See the regex demo and the Python 3 demo below:
>>> import pandas as pd
>>> item_list = ['Boston (Boston University, Boston College, Bos...','Bridgewater (Bridgewater State College)[2]','Cambridge (Harvard University, Massachusetts I...','Chestnut Hill (Boston College)','The Colleges of Worcester Consortium:','Dudley (Nichols College)','Faribault, South Central College','Mankato (Minnesota State University, Mankato),...','Marshall (Southwest Minnesota State University...','Moorhead (Minnesota State University, Moorhead...','Morris (University of Minnesota Morris)[2]','Northfield (Carleton College, St. Olaf College...','North Mankato, South Central College','St. Cloud (St. Cloud State University, The Col...','St. Joseph (College of Saint Benedict)[2]','St. Peter (Gustavus Adolphus College)[2]']
>>> df = pd.DataFrame(item_list, columns=['Region'])
>>> df['RegionName'] = df['Region'].str.extract(r'^([^:(,]*)\b', expand=False)
>>> df['RegionName']
RegionName
0 Boston
1 Bridgewater
2 Cambridge
3 Chestnut Hill
4 The Colleges of Worcester Consortium
5 Dudley
6 Faribault
7 Mankato
8 Marshall
9 Moorhead
10 Morris
11 Northfield
12 North Mankato
13 St. Cloud
14 St. Joseph
15 St. Peter
>>>
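To see why the pattern in the question misbehaves while the negated character class does not, the two can be compared directly with the standard `re` module. This is only an illustrative sketch; the variable names are made up, and the input strings are rows 241 and 197 exactly as displayed in the question.

```python
import re

row_241 = 'Mankato (Minnesota State University, Mankato),...'
row_197 = 'Bridgewater (Bridgewater State College)[2]'

# Greedy .* runs to the end of the string and backtracks only as far as
# the LAST ':', '(' or ',', so everything before that point is captured.
greedy_241 = re.search(r'(.*)[:(,]', row_241).group(1)
greedy_197 = re.search(r'(.*)[:(,]', row_197).group(1)

# The negated class stops at the FIRST delimiter, and \b then backs the
# match up to the end of the last word, dropping the trailing space.
negated_241 = re.search(r'^([^:(,]*)\b', row_241).group(1)
negated_197 = re.search(r'^([^:(,]*)\b', row_197).group(1)

print(repr(greedy_241), repr(negated_241))
print(repr(greedy_197), repr(negated_197))
```

For row 241 the greedy version reproduces the unwanted result shown in the question, because the last delimiter in that string is the comma after the closing parenthesis. For row 197 it looks correct at first glance but actually keeps a trailing space.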