Extracting the Place and Publisher from a String
I have a list of data giving the place and publisher of journals.
Each entry in the list is a single string:
['Place: Amsterdam Publisher: Elsevier Science Bv WOS:000179813800003' ,
'Place: Hanoi Publisher: Vietnam Acad Science & Technology-Vast WOS:000530921100003' ,
'Publisher: SAGE Publications Ltd',
'Place: London']
As you can see, some strings give a Publisher but no Place, and others give a Place but no Publisher.
I would like the output to be two lists, something like:
Places = ['Amsterdam','Hanoi','London']
Publishers = ['Elsevier Science',
'Vietnam Acad Science & Technology- Vast',
'SAGE Publications Ltd']
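I am using Python for this data analysis.
I was thinking of using the split() function to detect where Place is written and pick the string next to it, but it doesn't seem to work.
My code so far: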
places=[]
for i in extrainfo : # extrainfo: name of the initial list
    if ('Place') in i :
        z=i
        i=i.split()
        counter=0
        for q in i :
            if q=='Place' :
                break
            counter=counter+1
        places=pleaces+z[counter+1]
print(places)
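- Use s.split(':') to split on the colon ':';
- Use s.strip() to discard leading and trailing whitespace;
- If one of the split substrings ends with 'Publisher' or 'Place', add the next substring to the relevant list;
- Some of the substrings added to the lists will themselves end with 'Place' or 'Publisher': use s.removesuffix('Place').removesuffix('Publisher') (available in Python 3.9+) to deal with that.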
from itertools import pairwise # python>=3.10
# from itertools import tee
# def pairwise(iterable):
#     "s -> (s0,s1), (s1,s2), (s2, s3), ..."
#     a, b = tee(iterable)
#     next(b, None)
#     return zip(a, b)

data = ['Place: Amsterdam Publisher: Elsevier Science Bv WOS:000179813800003' , 'Place: Hanoi Publisher: Vietnam Acad Science & Technology-Vast WOS:000530921100003' , 'Publisher: SAGE Publications Ltd','Place: London']

things = {'Place': [], 'Publisher': [], 'WOS': []}

for sentence in data:
    for k, v in pairwise(map(str.strip, sentence.split(':'))):
        for cat in things:
            if k.endswith(cat):
                for suffix in things:
                    v = v.removesuffix(suffix).strip()
                things[cat].append(v)
                break

print(things)
# {'Place': ['Amsterdam', 'Hanoi', 'London'],
#  'Publisher': ['Elsevier Science Bv', 'Vietnam Acad Science & Technology-Vast', 'SAGE Publications Ltd'],
#  'WOS': ['000179813800003', '000530921100003']}
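To see why the label has to be matched with endswith and why the value needs removesuffix, it may help to look at the intermediate pieces for a single string. The following is only an illustrative sketch: the names parts and pairs are made up for this example, and the try/except fallback is there so it also runs on Python versions older than 3.10.

```python
try:
    from itertools import pairwise  # Python >= 3.10
except ImportError:
    from itertools import tee

    def pairwise(iterable):
        # Fallback: s -> (s0, s1), (s1, s2), (s2, s3), ...
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)

sentence = 'Place: Amsterdam Publisher: Elsevier Science Bv WOS:000179813800003'

# Splitting on ':' leaves each label glued to the end of the previous value.
parts = [p.strip() for p in sentence.split(':')]
print(parts)
# ['Place', 'Amsterdam Publisher', 'Elsevier Science Bv WOS', '000179813800003']

# Each consecutive pair is (text ending with a label, text starting with its value).
pairs = list(pairwise(parts))
for k, v in pairs:
    print(repr(k), '->', repr(v))
# 'Place' -> 'Amsterdam Publisher'
# 'Amsterdam Publisher' -> 'Elsevier Science Bv WOS'
# 'Elsevier Science Bv WOS' -> '000179813800003'
```

In every pair the left element ends with the label and the right element still carries the next label as a suffix, which is what the endswith and removesuffix calls take care of.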
A solution using the re module:
import re

lst = [
    "Place: Amsterdam Publisher: Elsevier Science Bv WOS:000179813800003",
    "Place: Hanoi Publisher: Vietnam Acad Science & Technology-Vast WOS:000530921100003",
    "Publisher: SAGE Publications Ltd",
    "Place: London",
]

places = [
    m.group(1)
    for i in lst
    if (m := re.search(r"Place: (.*?)\s*(?:Publisher|$)", i))
]

publishers = [
    m.group(1)
    for i in lst
    if (m := re.search(r"Publisher: (.*?)\s*(?:WOS|$)", i))
]

print(places)
print(publishers)
Prints:
['Amsterdam', 'Hanoi', 'London']
['Elsevier Science Bv', 'Vietnam Acad Science & Technology-Vast', 'SAGE Publications Ltd']
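Note that the two lists above do not line up by index, because strings with a missing field are simply skipped. If you need to keep each place next to its publisher, one possible variation is to parse every string into a dict first. This is only a sketch under my own assumptions: the names FIELD and parse_record are made up for this example, and the pattern assumes the only labels are Place, Publisher and WOS, each followed by a colon.

```python
import re

# One pattern for all labels: capture the label, then lazily capture the value
# up to (but not including) the next "Label:" or the end of the string.
FIELD = re.compile(r"(Place|Publisher|WOS):\s*(.*?)\s*(?=(?:Place|Publisher|WOS):|$)")

def parse_record(text):
    # findall returns (label, value) tuples, which dict() turns into a mapping.
    return dict(FIELD.findall(text))

lst = [
    "Place: Amsterdam Publisher: Elsevier Science Bv WOS:000179813800003",
    "Place: Hanoi Publisher: Vietnam Acad Science & Technology-Vast WOS:000530921100003",
    "Publisher: SAGE Publications Ltd",
    "Place: London",
]

records = [parse_record(s) for s in lst]

# .get() yields None where a field is missing, so both lists stay aligned with lst.
places = [r.get("Place") for r in records]
publishers = [r.get("Publisher") for r in records]

print(places)
# ['Amsterdam', 'Hanoi', None, 'London']
print(publishers)
# ['Elsevier Science Bv', 'Vietnam Acad Science & Technology-Vast', 'SAGE Publications Ltd', None]
```

If you do want the compact lists from the question, you can drop the None entries afterwards, e.g. [p for p in places if p is not None].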