Splitting a string by multiple keywords and creating a dict
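After doing some web scraping, I was finally able to get a string out of the font body, and it turns out like this:
string = 'Date: 02/13/2020 Court Time: 1030 Court Room: 0206 Microfilm: SD000000000'
The last thing I need to figure out for my code, which probably looks fairly trivial at this point, is how to split that string into dictionary pairs, where the pairing looks like this: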
Date: 02/13/2020,
Court Time: 1030,
Court Room: 0206,
Microfilm: SD000000000
I thought I could maybe do something like this:
keywords = ['Date:', 'Court Time:', 'Court Room:', 'Microfilm:']
for k in keywords:
    print(string.split())
using those keywords as delimiters. But it just spits this out several times:
['Date:', '02/13/2020', 'Court', 'Time:', '1030', 'Court', 'Room:', '0206', 'Microfilm:', 'SD000000000']
['Date:', '02/13/2020', 'Court', 'Time:', '1030', 'Court', 'Room:', '0206', 'Microfilm:', 'SD000000000']
['Date:', '02/13/2020', 'Court', 'Time:', '1030', 'Court', 'Room:', '0206', 'Microfilm:', 'SD000000000']
['Date:', '02/13/2020', 'Court', 'Time:', '1030', 'Court', 'Room:', '0206', 'Microfilm:', 'SD000000000']
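Following your example:
# Note: two spaces after each colon and between pairs.
s = 'Date:  02/13/2020  Court Time:  1030  Court Room:  0206  Microfilm:  SD000000000'
Assuming a double space is your separator, both between each key and its value and between consecutive pairs:
sep = '  '  # two spaces
lst = s.split(sep)
# Even positions are keys, odd positions are values.
d = dict(zip(lst[0::2], lst[1::2]))
The output is: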
{'Date:': '02/13/2020',
'Court Time:': '1030',
'Court Room:': '0206',
'Microfilm:': 'SD000000000'}
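If the keys should come out without the trailing colon, the same double-space split can be wrapped in a small helper. This is only a sketch: the function name `pairs_from_double_spaced` is made up for illustration, and it still assumes that a double space separates every token in the input.

```python
def pairs_from_double_spaced(text, sep="  "):
    """Split on a double space and pair the tokens up as key/value.

    Hypothetical helper; assumes `sep` separates keys from values
    as well as one pair from the next.
    """
    # Drop empty tokens that a leading or trailing separator would produce.
    tokens = [t.strip() for t in text.split(sep) if t.strip()]
    if len(tokens) % 2:
        raise ValueError("expected an even number of tokens")
    # Even positions are keys, odd positions are values.
    return {k.rstrip(":"): v for k, v in zip(tokens[0::2], tokens[1::2])}


s = "Date:  02/13/2020  Court Time:  1030  Court Room:  0206  Microfilm:  SD000000000"
print(pairs_from_double_spaced(s))
```

The length check makes a malformed string fail loudly instead of silently dropping the last token, which is what `zip` would do on its own.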
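The following piece of code does the job, provided the pairs are separated by commas as in the desired output above:
my_string = "Date: 02/13/2020, Court Time: 1030, Court Room: 0206, Microfilm: SD000000000"
# Split into "key: value" chunks on commas, then split each chunk on the colon.
key_value_pair = [line.split(':') for line in my_string.split(',')]
# Strip surrounding whitespace from every key and value.
output_dict = {k.strip(): v.strip() for k, v in key_value_pair}
print(output_dict)
Output: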
{'Date': '02/13/2020', 'Court Time': '1030', 'Court Room': '0206', 'Microfilm': 'SD000000000'}
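One caveat with splitting on every colon is that a value which itself contains a colon would break the two-element unpacking. A possible variation, sketched below, splits only on the first colon with `str.partition`. The function name `parse_comma_separated` and the sample value `10:30` are assumptions made for illustration; they do not come from the scraped data.

```python
def parse_comma_separated(text):
    """Parse 'key: value, key: value' text, splitting on the first colon only."""
    result = {}
    for chunk in text.split(","):
        # partition returns (before, separator, after); separator is '' if absent.
        key, sep, value = chunk.partition(":")
        if not sep:
            continue  # skip chunks that contain no colon at all
        result[key.strip()] = value.strip()
    return result


# Hypothetical input in which one value contains a colon.
sample = "Date: 02/13/2020, Court Time: 10:30, Court Room: 0206"
print(parse_comma_separated(sample))
# {'Date': '02/13/2020', 'Court Time': '10:30', 'Court Room': '0206'}
```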
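I would use a regular expression and build the pattern from the list of keywords:
import re
pattern = '|'.join(['(' + i + ')' for i in keywords])
This gives '(Date:)|(Court Time:)|(Court Room:)|(Microfilm:)'.
We can now split the string with that pattern:
lst = re.split(pattern, string)
which yields: ['', 'Date:', None, None, None, ' 02/13/2020 ', None, 'Court Time:', None, None, ' 1030 ', None, None, 'Court Room:', None, ' 0206 ', None, None, None, 'Microfilm:', ' SD000000000']
Because every keyword is its own capture group, each match contributes one slot per group, and the groups that did not match show up as None.
Let's post-process the list to extract the keys and values of the final dictionary:
def getkey(ls):
    # Return the first non-None entry, i.e. the keyword that matched, without its colon.
    for i in ls:
        if i is not None:
            return i.strip().rstrip(':')
lk = len(keywords)
# Each match yields lk group slots followed by the text up to the next match.
elts = [(lst[i:i+lk], lst[i+lk]) for i in range(1, len(lst), lk+1)]
result = {getkey(i): j.strip() for i, j in elts}
This gives the expected result: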
{'Date': '02/13/2020', 'Court Time': '1030', 'Court Room': '0206', 'Microfilm': 'SD000000000'}
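The same idea can be written with a single capture group around the whole alternation, so that `re.split` returns keywords and values in strict alternation and no `None` placeholders need filtering. The sketch below is one possible variation rather than the approach above verbatim: the function name `split_by_keywords` is hypothetical, and `re.escape` plus the longest-first ordering are added precautions for keywords that contain regex metacharacters or share a prefix.

```python
import re


def split_by_keywords(text, keywords):
    """Build a dict by splitting `text` at each occurrence of a keyword."""
    # Longest keywords first, so a keyword that is a prefix of another cannot shadow it.
    ordered = sorted(keywords, key=len, reverse=True)
    # One capture group around the alternation keeps the matched keyword in the output.
    pattern = "(" + "|".join(re.escape(k) for k in ordered) + ")"
    parts = re.split(pattern, text)
    # parts[0] is the text before the first keyword; after that,
    # keywords (odd indices) and values (even indices) alternate.
    keys, values = parts[1::2], parts[2::2]
    return {k.rstrip(":").strip(): v.strip() for k, v in zip(keys, values)}


keywords = ['Date:', 'Court Time:', 'Court Room:', 'Microfilm:']
string = 'Date: 02/13/2020 Court Time: 1030 Court Room: 0206 Microfilm: SD000000000'
print(split_by_keywords(string, keywords))
# {'Date': '02/13/2020', 'Court Time': '1030', 'Court Room': '0206', 'Microfilm': 'SD000000000'}
```

Since the split depends only on the keywords, this works regardless of how many spaces separate the fields.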