How can I make sublists from a list based on strings in Python?

I have a list like this:

['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

From this, I would like to make sublists like the following:

id = ["32a45", "32a47", "32a48"]
date=["2017-01-01", "2017-01-05", "2017-01-07"]

How can I do this?

Thanks.

Edit: This is a broken XML file and the tags are also messed up, so I can't use xmltree. That is why I'm trying something else.

A simple solution using the re.search() function:

import re

l = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

ids, dates = [], []
for i in l:
    ids.append(re.search(r'id="([^"]+)"', i).group(1))
    dates.append(re.search(r'date="([^"]+)"', i).group(1))

print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']
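
Alternatively, without a regex, split each string on spaces and pick the attributes by position:

# Assumes every string lists its attributes in the same order (id second, date fourth).
ids = [i.split(' ')[1].split('=')[1].strip('"') for i in l]
dates = [i.split(' ')[3].split('=')[1].strip('"') for i in l]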

However, the file looks strange. If the original file is HTML or XML, there are better ways to get the data.
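
As one possible illustration of a parser-based approach, the sketch below uses the standard library's html.parser.HTMLParser, which does not require tags to be closed. The class name TextTagCollector and the variable names are made up for this example, and it assumes the tags of interest are all named text as in the question.

```python
from html.parser import HTMLParser


class TextTagCollector(HTMLParser):
    """Collects the id and date attributes of every <text> start tag."""

    def __init__(self):
        super().__init__()
        self.ids = []
        self.dates = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "text":
            attributes = dict(attrs)
            self.ids.append(attributes.get("id"))
            self.dates.append(attributes.get("date"))


strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
           '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
           '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

collector = TextTagCollector()
for s in strings:
    collector.feed(s)  # unclosed tags are accepted as-is
collector.close()

print(collector.ids)
print(collector.dates)
```

Because the parser never needs a closing tag, no repair step is required before feeding it the strings. A missing attribute shows up as None rather than raising an error.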

Parsing with ET:

import xml.etree.ElementTree as ET
strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

id_ = []
date = []
for string in strings:
    tree = ET.fromstring(string+"</text>") #corrects wrong format
    id_.append(tree.get("id"))
    date.append(tree.get("date"))

print(id_) #  ['32a45', '32a47', '32a48']
print(date) # ['2017-01-01', '2017-01-05', '2017-01-07']

Update, a complete compact example based on the original problem you described:

import xml.etree.ElementTree as ET
import pandas as pd

strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

cols = ["id","language","date","time","timezone"]
data = [[ET.fromstring(string+"</text>").get(col) for col in cols] for string in strings]    
df = pd.DataFrame(data,columns=cols)

      id  language        date   time  timezone
0  32a45       ENG  2017-01-01  11:00   Eastern
1  32a47       ENG  2017-01-05   1:00   Central
2  32a48       ENG  2017-01-07   3:00   Pacific

Now you can use: df.to_sql()

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html

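Since the data you provided appears to be broken or partial XML fragments, I would personally try to repair the XML and use the xml.etree module to extract the data. However, if you obtained your current list from proper XML, it would be easier to use the xml.etree module on that original data.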

An example solution using xml.etree:

from xml.etree import ElementTree as ET

data = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

ids = []
dates = []
for element in data:
    # Appending the closing tag repairs the fragment into well-formed XML.
    root = ET.fromstring('{}</text>'.format(element))

    # The parsed root is the <text> element itself, so its attributes can be
    # read directly.
    ids.append(root.attrib['id'])
    dates.append(root.attrib['date'])

print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']

Alongside the other, better answers, you can also parse the data manually (simpler):

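for line in lines:  # lines is the list of tag strings from the question
    rest = line[line.index('"') + 1:]   # everything after the first opening quote
    value = rest[:rest.index('"')]      # the id value, up to its closing quote
    line = rest[rest.index('"') + 1:]   # remainder, ready for the next attribute
    print('id: ' + value)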

You can then simply push the value into a new list and repeat the same process for the values that follow, just changing the variable names.
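
The repetition described above could be folded into a loop. The following is a minimal sketch, not the answer's own code: the helper name quoted_values is invented here, and it assumes the attributes always appear in the order shown in the question (id first, date third).

```python
def quoted_values(line):
    """Return every double-quoted value in line, in order of appearance."""
    values = []
    while '"' in line:
        rest = line[line.index('"') + 1:]  # text after the opening quote
        if '"' not in rest:
            break                          # unmatched opening quote: stop
        end = rest.index('"')
        values.append(rest[:end])          # text up to the closing quote
        line = rest[end + 1:]              # continue after the closing quote
    return values


lines = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
         '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
         '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

ids, dates = [], []
for line in lines:
    values = quoted_values(line)
    ids.append(values[0])    # id is the first quoted value
    dates.append(values[2])  # date is the third quoted value

print(ids)
print(dates)
```

Since this relies on position rather than attribute names, it will pick the wrong values if a line has its attributes in a different order.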

Not as elegant as @RomanPerekhrest's solution using re, but here it is:

def extract(lst, kwd):
   out = []
   for t in lst:
       index1 = t.index(kwd) + len(kwd) + 1
       index2 = index1 + t[index1:].index('"') + 1
       index3 = index2 + t[index2:].index('"')
       out.append(t[index2:index3])
   return out

Then:

>>> extract(lst, kwd='id')
['32a45', '32a47', '32a48']

Using the re module is easier to understand. Here is the code:

l = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">', 
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
 '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

import re
ids = []
dates = []
for i in l:
    ids.append(re.search(r'id="(.+?)"', i, re.M|re.I).group(1))
    dates.append(re.search(r'date="(.+?)"', i, re.M|re.I).group(1))

Output:

print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']
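
If more than two attributes are needed, one regex pass per string may be enough. This is a rough sketch under the assumption that every attribute has the form name="value" with double quotes; the names ATTR and rows are invented for the example.

```python
import re

l = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
     '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
     '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

# Two capture groups make findall return (name, value) pairs,
# which dict() turns into one mapping per string.
ATTR = re.compile(r'(\w+)="([^"]*)"')
rows = [dict(ATTR.findall(s)) for s in l]

ids = [row.get("id") for row in rows]
dates = [row.get("date") for row in rows]

print(ids)
print(dates)
```

Using row.get() instead of row[...] means a string without the attribute yields None rather than an exception.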