Convert one-line XML into CSV
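I have XML documents in the format shown below, but I can't find a way to convert them to CSV using Python that works. I am using the Spyder IDE and am very much an amateur Pythonista. I managed to use an online converter on one of the files, but the rest are too large to upload.
I am looking for output with the columns rowID, PostID, Score, Text.
Can anyone help?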
<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="1" Score="5" Text="Was there something in particular you didn't understand in the wikipedia article? http://en.wikipedia.org/wiki/Spin_%28physics%29" CreationDate="2010-11-02T19:11:07.043" UserId="42" />
<row Id="2" PostId="3" Score="1" Text="I thought the wikipedia article here was pretty good, but maybe it only makes sense if you have a little quantum mechanics background: http://en.wikipedia.org/wiki/Particle_physics_and_representation_theory Were you able to get anything out of it?" CreationDate="2010-11-02T19:13:34.870" UserId="42" />
<row Id="3" PostId="3" Score="0" Text="i mostly thought this was a better place for the question than MO." CreationDate="2010-11-02T19:16:09.873" UserId="40" />
<row Id="6" PostId="4" Score="11" Text="An accurate answer, but if the poster doesn't understand the actual concept of spin (not to mention group theory), this is all but useless." CreationDate="2010-11-02T19:32:15.410" UserId="13" />
<row Id="7" PostId="2" Score="2" Text="I'm tempted to answer: with much difficulty, in a highly qualitative way, and only by reading a fair-sized book. There are many decent pop-sci books on string theory; I can't remember the names of any I read, but I'm sure someone can recommend one or two." CreationDate="2010-11-02T19:36:53.290" UserId="13" />
<row Id="8" PostId="8" Score="0" Text="so the fundamental particle is acting on the quantum states?" CreationDate="2010-11-02T19:36:55.263" UserId="40" />
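Secondly, if some rows are missing fields or have extra ones, how can I ignore those and populate only the specified fields? I get the following error message, but I don't want the 3 extra columns: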
ParserError: Error tokenizing data. C error: Expected 4 fields in line 41, saw 7
The following worked for me:
import csv
import os
import xml.etree.ElementTree as ET

xml_file = "c:/temp/test.xml"
csv_file_output = '{}_out.csv'.format(os.path.splitext(xml_file)[0])

tree = ET.parse(xml_file)
xml_root = tree.getroot()

# newline='' lets the csv module control the line endings
with open(csv_file_output, 'w', newline='', encoding='utf-8') as fout:
    # csv.writer quotes fields that contain commas or double quotes
    writer = csv.writer(fout)
    writer.writerow(["Id", "PostId", "Score", "Text"])
    for row in xml_root.iter("row"):
        # Only these four attributes are read; any others are ignored
        writer.writerow([row.get("Id"), row.get("PostId"),
                         row.get("Score"), row.get("Text")])
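For the files that were too large to upload, and for rows with missing or extra attributes, a streaming variant may be worth trying. The sketch below is only an illustration under some assumptions: the function name `stream_rows_to_csv`, the `FIELDS` tuple, and the sample data are made up for this example, and a missing attribute is written as an empty field. It reads one `<row>` at a time with `ET.iterparse` and leaves quoting to the standard `csv` module.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Columns requested in the question; every other attribute is ignored.
FIELDS = ("Id", "PostId", "Score", "Text")


def stream_rows_to_csv(xml_source, csv_target):
    """Copy the FIELDS attributes of each <row> into csv_target.

    xml_source: a file name or a binary file object containing the XML.
    csv_target: a text file object opened with newline=''.
    Returns the number of data rows written.
    """
    writer = csv.writer(csv_target)
    writer.writerow(FIELDS)
    count = 0
    # iterparse yields each element once its end tag has been read,
    # so the whole document is never parsed in one step.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "row":
            # get() falls back to "" when an attribute is missing.
            writer.writerow([elem.get(name, "") for name in FIELDS])
            count += 1
            # Release the attributes that have already been written.
            elem.clear()
    return count


# Hypothetical sample: the second row has no Score and no extra attributes.
sample = b"""<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="1" Score="5" Text="plain text" CreationDate="2010-11-02T19:11:07.043" UserId="42" />
<row Id="2" PostId="3" Text="no score, but a comma" />
</comments>"""

out = io.StringIO()
n_rows = stream_rows_to_csv(io.BytesIO(sample), out)
csv_lines = out.getvalue().splitlines()
print(n_rows)
print(csv_lines)
```

With a real file, the same function could be called as `stream_rows_to_csv(xml_file, fout)` inside `with open(csv_file_output, 'w', newline='', encoding='utf-8') as fout:`. Every output line then has exactly four fields, whatever attributes a given row carries.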
This could also be done with pandas by saving a DataFrame as CSV, but I kept it simple.
The script generates a file in the same folder as the XML file, with the same name but ending in _out.csv (here, c:/temp/test_out.csv).
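The pandas route mentioned above might look roughly like the following. This is a sketch, not the code that was actually run: it assumes pandas is installed, and the helper name `xml_rows_to_frame`, the `COLUMNS` list, and the sample data are invented for illustration. The rows are still read with ElementTree, and pandas is used only to hold the table and write the CSV.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# Columns requested in the question; other attributes are dropped.
COLUMNS = ["Id", "PostId", "Score", "Text"]


def xml_rows_to_frame(xml_source):
    """Build a DataFrame with one line per <row> element."""
    root = ET.parse(xml_source).getroot()
    # row.get() returns None when an attribute is missing.
    records = [{col: row.get(col) for col in COLUMNS}
               for row in root.iter("row")]
    return pd.DataFrame(records, columns=COLUMNS)


# Hypothetical sample: the second row has no Score attribute.
sample = io.StringIO(
    '<comments>'
    '<row Id="1" PostId="1" Score="5" Text="a, b" UserId="42" />'
    '<row Id="2" PostId="3" Text="no score" />'
    '</comments>'
)

frame = xml_rows_to_frame(sample)
# index=False omits the DataFrame index; missing values become empty fields.
csv_text = frame.to_csv(index=False)
print(csv_text)
```

To write a file instead of a string, pass a path, for example `frame.to_csv(csv_file_output, index=False)`.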