Remove content within parentheses under multiple conditions in Python
Given a list as follows:
l = ['hydrogenated benzene (purity: 99.9 density (g/cm3), produced in ZB): SD',
'Car board price (tax included): JT Port',
'Ex-factory price (low-end price): Triethanolamine (85% commercial grade): North'
]
I want to get the following expected result:
['hydrogenated benzene: SD', 'Car board price: JT Port', 'Ex-factory price: Triethanolamine: North']
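My code is as follows:
import re

def remove_extra(content):
    pat1 = r'[\s]'    # whitespace (defined, but not applied below)
    pat2 = r'\(.*\)'  # greedy: matches from the first '(' to the last ')'
    # The posted code joined (pat2, pat3), but pat3 is never defined.
    # Applying pat2 alone reproduces the output shown below.
    return re.sub(pat2, '', str(content))

[remove_extra(item) for item in l]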
It produces:
['hydrogenated benzene : SD',
'Car board price : JT Port',
'Ex-factory price : North']
As you can see, the last element of the result, 'Ex-factory price : North', is not what I expected. How can I get the result I need? Thanks.
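The inner parentheses make this complicated. The solution below works for your example, but it may not work for your whole dataset. If you run into an error, please update the question so that we can find a solution.
This function first counts how many separate parenthesized groups the string contains, then removes them one by one.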
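import re
import pandas as pd

def par_remover(st):
    # Positions of every '(' and ')' in the original string
    begin = [m.start() for m in re.finditer(r'\(', st)]
    end = [m.start() for m in re.finditer(r'\)', st)]
    if not end:
        return st  # no ')' at all: nothing to remove (avoids IndexError on end[0])
    # Number of removal passes: total '(' plus one, minus the '(' that open before the first ')'
    count = len(begin) + 1 - len([b for b in begin if b < end[0]])
    for _ in range(count):
        begin = [m.start() for m in re.finditer(r'\(', st)]
        end = [m.start() for m in re.finditer(r'\)', st)]
        if not end:
            break
        # How many '(' open before the first ')'
        end1 = len([b for b in begin if b < end[0]])
        # Slice from the first '(' through the ')' at index end1 - 1
        str_remove = st[st.find('('):end[end1 - 1] + 1]
        st = st.replace(str_remove, '')
    return st.replace(')', '')

df = pd.DataFrame({'value': l})
df['value'] = df['value'].apply(par_remover)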
Result:
| | value |
|---:|:-------------------------------------------|
| 0 | hydrogenated benzene : SD |
| 1 | Car board price : JT Port |
| 2 | Ex-factory price : Triethanolamine : North |
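If counting matches turns out to be fragile on the full dataset, a single pass with a depth counter is one alternative that handles arbitrary nesting. This is only a sketch and not the method used above: the name strip_parens is hypothetical, and it assumes that a stray ')' with no matching '(' should simply be dropped.

```python
def strip_parens(text):
    """Drop everything inside (possibly nested) parentheses in one pass."""
    depth = 0   # current nesting level
    kept = []   # characters found outside any parentheses
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            # Assumption: an unmatched ')' is discarded rather than kept.
            if depth > 0:
                depth -= 1
        elif depth == 0:
            kept.append(ch)
    return ''.join(kept)


l = ['hydrogenated benzene (purity: 99.9 density (g/cm3), produced in ZB): SD',
     'Car board price (tax included): JT Port',
     'Ex-factory price (low-end price): Triethanolamine (85% commercial grade): North']

cleaned = [strip_parens(s) for s in l]
print(cleaned)
```

Like par_remover, this keeps the space that preceded each '(' so the values match the table above, for example 'Car board price : JT Port'.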
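The problem is actually not your third item but the first one, because it contains nested parentheses. You should loop like this, using subn instead of sub. subn also returns the number of replacements made, so the loop can stop once nothing is left to remove:
import re

def remove_text_between_parens(text):
    n = 1  # force at least one pass
    while n:
        # Remove the innermost (non-nested) groups; n is the number of replacements
        text, n = re.subn(r'\s*\([^()]*\)\s*', '', text)
    return text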
>>> [remove_text_between_parens(t) for t in l]
['hydrogenated benzene: SD',
'Car board price: JT Port',
'Ex-factory price: Triethanolamine: North']
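To see why one pass is not enough for the first item, it can help to print the string after each subn call. The helper name trace_passes below is hypothetical and exists only for illustration; the pattern is the one from this answer.

```python
import re

pattern = re.compile(r'\s*\([^()]*\)\s*')

def trace_passes(text):
    """Return the string after each subn pass, until a pass replaces nothing."""
    steps = []
    while True:
        text, n = pattern.subn('', text)
        if n == 0:
            break
        steps.append(text)
    return steps


s = 'hydrogenated benzene (purity: 99.9 density (g/cm3), produced in ZB): SD'
steps = trace_passes(s)
for step in steps:
    print(step)
# hydrogenated benzene (purity: 99.9 density, produced in ZB): SD
# hydrogenated benzene: SD
```

The first pass can only match the inner (g/cm3), because [^()]* refuses to cross another parenthesis. Once that group is gone, the outer group has become flat and the second pass removes it.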
The correct explanation is in the linked solution. You can modify that solution with \s* to also remove the optional whitespace before (:
import re

def remove_text_between_parens(text):
    n = 1  # run at least once
    while n:
        text, n = re.subn(r'\s*\([^()]*\)', '', text)  # remove non-nested/flat balanced parts
    return text

a = [remove_text_between_parens(item) for item in l]
print(a)
['hydrogenated benzene: SD',
'Car board price: JT Port',
'Ex-factory price: Triethanolamine: North']
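The two patterns in this thread differ only in the trailing \s*. On the list in the question they give the same result, because every ')' is followed by ':' or ','. They diverge when a word follows the closing parenthesis. A small sketch, where the input string is made up for illustration and is not taken from the question:

```python
import re

with_trailing = re.compile(r'\s*\([^()]*\)\s*')   # also eats whitespace after ')'
without_trailing = re.compile(r'\s*\([^()]*\)')   # only eats whitespace before '('

sample = 'price (tax included) total'  # hypothetical input

joined = with_trailing.sub('', sample)
spaced = without_trailing.sub('', sample)

print(repr(joined))  # 'pricetotal'
print(repr(spaced))  # 'price total'
```

If the data can contain text directly after a parenthesized group, the version without the trailing \s* is probably the safer choice, since it does not glue neighbouring words together.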