Regular expression to pull out a variable string
I have this list of strings in Python 2.7:
list_a = ['temp_52_head sensor, uploaded by TS',
          'crack in the left quadrant, uploaded by AB, Left in 2hr sunlight',
          'FSL_pressure, uploaded by RS, no reported vacuum',
          'art 9943_mercury, Uploaded by DY, accelerated, hurst potential too low',
          'uploaded by KKP, Space 55',
          'avogadro reading level, uploaded by HB, started mini counter, pulled lever',
          'no comment yesterday, Uploaded to TFG, level 1 escape but temperature stable, pressure lever north']
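Each list item contains a string of the form
uploaded by SOMEONE
and I need to extract SOMEONE.
However, as you can see, SOMEONE:
- changes from one list item to the next.
- can be 2 or 3 characters long (letters only, no digits).
- appears at different positions in the string.
In addition:
- "uploaded" also occurs as "Uploaded".
- "uploaded" sometimes appears before any comma.
This is what I need to extract: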
someone_names = ['TS','AB','RS','DY','KKP','HB','TFG']
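I am thinking of using a regular expression, but the problems I face come from points 2 and 3 above.
Is there a way to extract these characters from the list?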
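A regular expression like this looks like it does what you want, unless I am missing something:
/[Uu]ploaded (?:by|to) ([A-Z]{2,3})/
Note that [Uu] needs no "|" inside the brackets (a pipe there is matched literally), that the pattern must not require a trailing comma because the first item ends right after the initials, and that the last item reads "Uploaded to" rather than "uploaded by".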
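Alternatively (judging from your examples), you could split each string on commas, pick the element that contains "ploaded by" (which sidesteps the upper/lowercase "u"), split that element on whitespace, and take the last element of the resulting array. The last item says "Uploaded to", so the check has to allow for that as well.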
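The split-based idea can be sketched as follows. This is only a minimal illustration, not the answerer's code: the helper name extract_uploader is hypothetical, and it tests for "ploaded " (without "by") as an assumption so that the "Uploaded to TFG" item is also covered.

```python
def extract_uploader(item):
    """Return the uploader initials from one list item, or None if absent."""
    for part in item.split(','):
        # Testing for "ploaded " matches both "uploaded" and "Uploaded",
        # and both the "by" and "to" variants (an assumption, see above).
        if 'ploaded ' in part:
            # The initials are the last whitespace-separated token.
            return part.split()[-1]
    return None


list_a = [
    'temp_52_head sensor, uploaded by TS',
    'crack in the left quadrant, uploaded by AB, Left in 2hr sunlight',
    'FSL_pressure, uploaded by RS, no reported vacuum',
    'art9943_mercury, Uploaded by DY, accelerated, hurst potential too low',
    'uploaded by KKP, Space 55',
    'avogadro reading level, uploaded by HB, started mini counter, pulled lever',
    'no comment yesterday, Uploaded to TFG, level 1 escape but temperature stable,pressure lever north',
]

names = [extract_uploader(item) for item in list_a]
```

This relies on the initials always being the last word before the next comma, which holds for the sample data but is not validated (no check for 2 or 3 letters).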
You can apply the regex with a list comprehension.
>>> import re
>>> list_a = [
'temp_52_head sensor, uploaded by TS',
'crack in the left quadrant, uploaded by AB, Left in 2hr sunlight',
'FSL_pressure, uploaded by RS, no reported vacuum',
'art9943_mercury, Uploaded by DY, accelerated, hurst potential too low',
'uploaded by KKP, Space 55',
'avogadro reading level, uploaded by HB, started mini counter, pulled lever',
'no comment yesterday, Uploaded to TFG, level 1 escape but temperature stable,pressure lever north'
]
>>> regex = re.compile(r'(?i)\buploaded\s*(?:by|to)\s*([a-z]{2,3})')
>>> names = [m.group(1) for x in list_a for m in [regex.search(x)] if m]
>>> names
['TS', 'AB', 'RS', 'DY', 'KKP', 'HB', 'TFG']
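A more verbose approach, building the pattern with re.escape, might look like this. It accepts either a comma or the end of the string after the initials, but it only handles the "by" form, so the "Uploaded to TFG" item is skipped:
import re
# "ploaded by " avoids the upper/lowercase "u"; (?:,|$) allows the initials to end the string
pattern = re.escape("ploaded by ") + "(.*?)(?:,|$)"
names = [m.group(1) for m in (re.search(pattern, s) for s in list_a) if m]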
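This regex hits all of them, and it still works if you change the number of letters in the uploader's initials. It matches whether or not the two or three letters are followed by a comma or a single quote. It also captures all the data you are looking for:
import re
m = re.compile('uploaded ((by)|(to)) ([a-z]+)', flags=re.IGNORECASE)
You can then call the search() method of the compiled pattern object m on each string; it returns the first match in that string, or None. Group 4 of each match is the data you are looking for.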
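Applying that pattern to the whole list might look like the sketch below. The loop and the variable names pattern and names are illustrative rather than part of the original answer, and the flag is spelled re.IGNORECASE, which is the name the re module defines.

```python
import re

# Groups: 1 = "by" or "to", 2 = "by", 3 = "to", 4 = the uploader initials.
pattern = re.compile(r'uploaded ((by)|(to)) ([a-z]+)', flags=re.IGNORECASE)

list_a = [
    'temp_52_head sensor, uploaded by TS',
    'crack in the left quadrant, uploaded by AB, Left in 2hr sunlight',
    'FSL_pressure, uploaded by RS, no reported vacuum',
    'art9943_mercury, Uploaded by DY, accelerated, hurst potential too low',
    'uploaded by KKP, Space 55',
    'avogadro reading level, uploaded by HB, started mini counter, pulled lever',
    'no comment yesterday, Uploaded to TFG, level 1 escape but temperature stable,pressure lever north',
]

names = []
for item in list_a:
    match = pattern.search(item)  # first match in this string, or None
    if match:
        names.append(match.group(4))
```

Because [a-z]+ is unbounded, this accepts initials of any length; use [a-z]{2,3} followed by \b if the 2 or 3 letter limit should be enforced.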