How can I make sublists from a list based on strings in Python?
I have a list like this:
['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']
From this, I would like to make sublists such as:
id = ["32a45", "32a47", "32a48"]
date=["2017-01-01", "2017-01-05", "2017-01-07"]
How can I do this?
Thanks.
Edit: here is a simple solution using the re.search() function:
import re
l = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']
ids, dates = [], []
for i in l:
    ids.append(re.search(r'id="([^"]+)"', i).group(1))
    dates.append(re.search(r'date="([^"]+)"', i).group(1))
print(ids) # ['32a45', '32a47', '32a48']
print(dates) # ['2017-01-01', '2017-01-05', '2017-01-07']
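If more than two attributes are needed, the same idea extends to re.findall(), which can pull every name="value" pair out of a tag in one pass. This is only a minimal sketch: it assumes every value is double-quoted and contains no embedded quotes, and the helper name parse_attrs is made up for illustration.

```python
import re

# Matches name="value" pairs; assumes double-quoted values with no embedded quotes.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_attrs(tag):
    """Return the attributes of one tag string as a dict (hypothetical helper)."""
    return dict(ATTR_RE.findall(tag))

l = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
     '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
     '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

rows = [parse_attrs(tag) for tag in l]   # one dict per tag
ids = [row["id"] for row in rows]
dates = [row["date"] for row in rows]
print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']
```

Any other column can then be read the same way, for example [row["timezone"] for row in rows].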
# Alternatively, with plain string splitting (lst is the list from the question):
ids = [i.split(' ')[1].split('=')[1].strip('"') for i in lst]
dates = [i.split(' ')[3].split('=')[1].strip('"') for i in lst]
However, the data looks odd. If the original file is HTML or XML, there are better ways to get the data.
Parsing with ET:
import xml.etree.ElementTree as ET
strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']
id_ = []
date = []
for string in strings:
    tree = ET.fromstring(string + "</text>")  # corrects wrong format
    id_.append(tree.get("id"))
    date.append(tree.get("date"))
print(id_) # ['32a45', '32a47', '32a48']
print(date) # ['2017-01-01', '2017-01-05', '2017-01-07']
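If you would rather not patch a closing tag onto each string, the standard-library html.parser module (Python 3) is another option, since it reports start tags without requiring them to be closed. The sketch below is only an illustration under that assumption; the class name TextTagCollector is invented here.

```python
from html.parser import HTMLParser

class TextTagCollector(HTMLParser):
    """Collects the id and date attributes of every <text> start tag."""

    def __init__(self):
        super().__init__()
        self.ids = []
        self.dates = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "text":
            attrs = dict(attrs)
            self.ids.append(attrs.get("id"))
            self.dates.append(attrs.get("date"))

strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
           '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
           '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

collector = TextTagCollector()
for s in strings:
    collector.feed(s)  # unclosed tags are fine for this parser
collector.close()

print(collector.ids)    # ['32a45', '32a47', '32a48']
print(collector.dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']
```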
Update, full condensed example:
Based on your original problem as described here:
import xml.etree.ElementTree as ET
import pandas as pd
strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']
cols = ["id","language","date","time","timezone"]
data = [[ET.fromstring(string+"</text>").get(col) for col in cols] for string in strings]
df = pd.DataFrame(data,columns=cols)
      id language        date   time timezone
0  32a45      ENG  2017-01-01  11:00  Eastern
1  32a47      ENG  2017-01-05   1:00  Central
2  32a48      ENG  2017-01-07   3:00  Pacific
Now you can use: df.to_sql()
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
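For comparison, here is a rough sketch of the same storage step done without pandas, using the standard-library sqlite3 module in place of df.to_sql(). The table name texts and the in-memory database are assumptions made for the example and are not part of the original question.

```python
import sqlite3
import xml.etree.ElementTree as ET

strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
           '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
           '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']
cols = ["id", "language", "date", "time", "timezone"]

# One tuple per tag, in the same column order as cols.
rows = [tuple(ET.fromstring(s + "</text>").get(c) for c in cols) for s in strings]

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute('CREATE TABLE texts ("id" TEXT, "language" TEXT, "date" TEXT, '
             '"time" TEXT, "timezone" TEXT)')
conn.executemany("INSERT INTO texts VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# Read two columns back to check what was stored.
stored = conn.execute('SELECT "id", "date" FROM texts ORDER BY "id"').fetchall()
print(stored)
conn.close()
```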
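Since the data you have provided appears to be broken/partial XML fragments, I would personally try to repair the XML and use the xml.etree module to extract the data. However, if you obtained your current list from proper XML, it would be easier to use the xml.etree module on that data directly.
An example solution using xml.etree: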
from xml.etree import ElementTree as ET
data = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']
ids = []
dates = []
for element in data:
    # Append a closing tag so that the fragment becomes well-formed XML.
    root = ET.fromstring('{}</text>'.format(element))
    # The parsed root is the <text> element itself, so its attributes
    # can be read directly.
    ids.append(root.attrib['id'])
    dates.append(root.attrib['date'])
print(ids) # ['32a45', '32a47', '32a48']
print(dates) # ['2017-01-01', '2017-01-05', '2017-01-07']
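If the list did come from a well-formed XML document, the parsing could happen on that document directly, with no repair step. The sketch below is a guess at what that might look like: the <corpus> root element and the text inside each element are invented for illustration, since the original file is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed source document; the <corpus> root and the element
# text are made up for this example.
xml_doc = """<corpus>
  <text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">first</text>
  <text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">second</text>
  <text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">third</text>
</corpus>"""

root = ET.fromstring(xml_doc)

# iter("text") walks every <text> element in document order.
ids = [el.get("id") for el in root.iter("text")]
dates = [el.get("date") for el in root.iter("text")]

print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']
```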
Alongside the other, better answers, you can also parse the data manually (simpler):
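for line in lines:  # lines is the list of tag strings
    rest = line[line.index('"') + 1:]   # everything after the first quote
    value = rest[:rest.index('"')]      # the id value, up to the next quote
    line = rest[rest.index('"') + 1:]   # remainder, ready for the next attribute
    print('id: ' + value)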
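You can then simply push it into a new list and repeat the same process for the other values, just changing the variable names.
Not as elegant as @RomanPerekhrest's solution using re, but here it is: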
def extract(lst, kwd):
    out = []
    for t in lst:
        index1 = t.index(kwd) + len(kwd) + 1
        index2 = index1 + t[index1:].index('"') + 1
        index3 = index2 + t[index2:].index('"')
        out.append(t[index2:index3])
    return out
Then:
>>> extract(lst, kwd='id')
['32a45', '32a47', '32a48']
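One thing to watch with this kind of index arithmetic is that the keyword is matched anywhere in the string, so a short keyword can match inside a longer attribute name. A slightly stricter variant, offered only as a sketch, searches for the whole ' name="' prefix instead. The function name extract_attr is made up here, and it assumes each attribute is preceded by a single space and uses double quotes.

```python
def extract_attr(lst, kwd):
    """Collect the value of attribute kwd from each tag string (hypothetical helper)."""
    needle = ' {}="'.format(kwd)  # leading space and ="  anchor the full attribute name
    out = []
    for t in lst:
        start = t.index(needle) + len(needle)  # raises ValueError if the attribute is missing
        end = t.index('"', start)              # closing quote of the value
        out.append(t[start:end])
    return out

lst = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
       '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
       '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

print(extract_attr(lst, 'id'))        # ['32a45', '32a47', '32a48']
print(extract_attr(lst, 'time'))      # ['11:00', '1:00', '3:00']
print(extract_attr(lst, 'timezone'))  # ['Eastern', 'Central', 'Pacific']
```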
Using the re module is easier to understand. Here is the code:
l = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']
import re
ids = []
dates = []
for i in l:
    ids.append(re.search(r'id="(.+?)"', i, re.M|re.I).group(1))
    dates.append(re.search(r'date="(.+?)"', i, re.M|re.I).group(1))
Output:
print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']