Extracting the Place and Publisher from a String

I have a list of data giving the place and publisher of journals.

Each entry in the list is a single string:

['Place: Amsterdam Publisher: Elsevier Science Bv WOS:000179813800003' ,
 'Place: Hanoi Publisher: Vietnam Acad Science & Technology-Vast WOS:000530921100003' , 
 'Publisher: SAGE Publications Ltd',
 'Place: London'] 

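As you can see, some strings give a publisher but no place, and others give a place but no publisher.

I would like the output to be two lists, something like this: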

Places = ['Amsterdam','Hanoi','London']
Publishers = ['Elsevier Science',
              'Vietnam Acad Science & Technology- Vast',
              'SAGE Publications Ltd']

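I am using Python for this data analysis.

I was thinking of using split() to find where 'Place' appears and then take the string next to it, but it does not seem to work.

My code so far: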

places=[]
for i in extrainfo :  # extrainfo: name of the initial list
 
 if ('Place') in i :
       z=i
       i=i.split()
       counter=0
       for q in i :
        if q=='Place' :
          break
        counter=counter+1
 places=pleaces+z[counter+1]       
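print(places)

  • Split on the colon ':' using s.split(':');
  • Discard surrounding whitespace using s.strip();
  • If one of the resulting substrings ends with 'Place' or 'Publisher', add the next substring to the corresponding list;
  • Some of the substrings added to the lists will themselves end with 'Place' or 'Publisher': handle that with s.removesuffix('Place').removesuffix('Publisher').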
from itertools import pairwise # python>=3.10
# from itertools import tee
# def pairwise(iterable):
#     "s -> (s0,s1), (s1,s2), (s2, s3), ..."
#     a, b = tee(iterable)
#     next(b, None)
#     return zip(a, b)

data = ['Place: Amsterdam Publisher: Elsevier Science Bv WOS:000179813800003' , 'Place: Hanoi Publisher: Vietnam Acad Science & Technology-Vast WOS:000530921100003' , 'Publisher: SAGE Publications Ltd','Place: London']

things = {'Place': [], 'Publisher': [], 'WOS': []}

for sentence in data:
    for k, v in pairwise(map(str.strip, sentence.split(':'))):
        for cat in things:
            if k.endswith(cat):
                for suffix in things:
                    v = v.removesuffix(suffix).strip()
                things[cat].append(v)
                break

print(things)
# {'Place': ['Amsterdam', 'Hanoi', 'London'],
#  'Publisher': ['Elsevier Science Bv', 'Vietnam Acad Science & Technology-Vast', 'SAGE Publications Ltd'],
#  'WOS': ['000179813800003', '000530921100003']}
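
Note that str.removesuffix is only available from Python 3.9 onward. On an older interpreter, a small stand-in could be used in its place. The helper name remove_suffix below is hypothetical and not part of the answer above; it is only a sketch of the same behaviour.

```python
def remove_suffix(text, suffix):
    """Return text without the trailing suffix, if present.

    Hypothetical stand-in for str.removesuffix (Python >= 3.9).
    """
    # An empty suffix must be handled separately: text[:-0] would be ''.
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text


print(remove_suffix('Amsterdam Publisher', 'Publisher').strip())  # Amsterdam
print(remove_suffix('London', 'Publisher'))                       # London
```

In the loop above, v.removesuffix(suffix) would then become remove_suffix(v, suffix).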

A solution using the re module:

import re

lst = [
    "Place: Amsterdam Publisher: Elsevier Science Bv WOS:000179813800003",
    "Place: Hanoi Publisher: Vietnam Acad Science & Technology-Vast WOS:000530921100003",
    "Publisher: SAGE Publications Ltd",
    "Place: London",
]

places = [
    m.group(1)
    for i in lst
    if (m := re.search(r"Place: (.*?)\s*(?:Publisher|$)", i))
]

publishers = [
    m.group(1)
    for i in lst
    if (m := re.search(r"Publisher: (.*?)\s*(?:WOS|$)", i))
]

print(places)
print(publishers)

Prints:

['Amsterdam', 'Hanoi', 'London']
['Elsevier Science Bv', 'Vietnam Acad Science & Technology-Vast', 'SAGE Publications Ltd']
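
Two separate lists lose track of which place belongs to which publisher, because entries with a missing field are simply skipped. If that pairing matters, one possible variation is to keep one record per input string and use None for a missing field. This is only a sketch built on the same two patterns; the function name parse_record is hypothetical, and it assumes each key appears at most once per string, in the order Place, Publisher, WOS.

```python
import re

# Same sample data as in the answer above.
lst = [
    "Place: Amsterdam Publisher: Elsevier Science Bv WOS:000179813800003",
    "Place: Hanoi Publisher: Vietnam Acad Science & Technology-Vast WOS:000530921100003",
    "Publisher: SAGE Publications Ltd",
    "Place: London",
]

# Assumption: keys appear at most once and in the order Place, Publisher, WOS.
PLACE_RE = re.compile(r"Place: (.*?)\s*(?:Publisher|$)")
PUBLISHER_RE = re.compile(r"Publisher: (.*?)\s*(?:WOS|$)")


def parse_record(text):
    """Return (place, publisher) for one string; a missing field becomes None."""
    place = PLACE_RE.search(text)
    publisher = PUBLISHER_RE.search(text)
    return (
        place.group(1) if place else None,
        publisher.group(1) if publisher else None,
    )


records = [parse_record(s) for s in lst]
for record in records:
    print(record)
# ('Amsterdam', 'Elsevier Science Bv')
# ('Hanoi', 'Vietnam Acad Science & Technology-Vast')
# (None, 'SAGE Publications Ltd')
# ('London', None)
```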