使用 pandas 解析为 csv 读取 XML
Read XML using pandas parsing to csv
我有以下代码从XML中提取数据到CSV文件,但是出现错误,我不知道如何解决。
如果有人可以提供帮助,请。
url = "http://90.161.233.78:65519/services/user/records.xml?begin=04052022?end=06052022?var=EDSLINEEMBEDDED.Module2.VI1?var=EDSLINEEMBEDDED.Module2.API1?period=900"
s = unescape(requests.get(url).text)[5:-6]
df = pd.read_xml(s, xpath="//record/* | //dateTime")
df["field"] = df["field"].ffill()
df.to_csv('output0.csv')
错误是
doc = fromstring(
File "src\lxml\etree.pyx", line 3252, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1913, in lxml.etree._parseMemoryDocument
File "src\lxml\parser.pxi", line 1800, in lxml.etree._parseDoc
File "src\lxml\parser.pxi", line 1141, in lxml.etree._BaseParser._parseDoc
File "src\lxml\parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 725, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 654, in lxml.etree._raiseParseError
File "<string>", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 2
考虑在不使用 requests
的情况下阅读 URL 或将内容直接转义为 pandas.read_xml()
。根据文档,添加了重点:
path_or_buffer: str, path object, or file-like object
String, path object (implementing os.PathLike[str]
), or file-like object implementing a read()
function. The string can be any valid XML string or a path. The string can further be a URL. Valid URL schemes include http, ftp, s3, and file.
import pandas as pd
url = (
"http://90.161.233.78:65519/services/user/records.xml?"
"begin=04052022?end=06052022?var=EDSLINEEMBEDDED.Module2.VI1?"
"var=EDSLINEEMBEDDED.Module2.API1?period=900"
)
df = pd.read_xml(url, xpath="//record/* | //dateTime")
# FILL PARENT TEXT FORWARD TO CHILD ITEMS
df["dateTime"] = df["dateTime"].ffill()
# DROP UNNEEDED ROWS
df[(pd.notnull(df["id"])) & (pd.notnull(df["value"]))]
df.to_csv('output0.csv')
我有以下代码从XML中提取数据到CSV文件,但是出现错误,我不知道如何解决。
如果有人可以提供帮助,请。
url = "http://90.161.233.78:65519/services/user/records.xml?begin=04052022?end=06052022?var=EDSLINEEMBEDDED.Module2.VI1?var=EDSLINEEMBEDDED.Module2.API1?period=900"
s = unescape(requests.get(url).text)[5:-6]
df = pd.read_xml(s, xpath="//record/* | //dateTime")
df["field"] = df["field"].ffill()
df.to_csv('output0.csv')
错误是
doc = fromstring(
File "src\lxml\etree.pyx", line 3252, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1913, in lxml.etree._parseMemoryDocument
File "src\lxml\parser.pxi", line 1800, in lxml.etree._parseDoc
File "src\lxml\parser.pxi", line 1141, in lxml.etree._BaseParser._parseDoc
File "src\lxml\parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 725, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 654, in lxml.etree._raiseParseError
File "<string>", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 2
考虑在不使用 requests
的情况下阅读 URL 或将内容直接转义为 pandas.read_xml()
。根据文档,添加了重点:
path_or_buffer: str, path object, or file-like object
String, path object (implementingos.PathLike[str]
), or file-like object implementing aread()
function. The string can be any valid XML string or a path. The string can further be a URL. Valid URL schemes include http, ftp, s3, and file.
import pandas as pd
url = (
"http://90.161.233.78:65519/services/user/records.xml?"
"begin=04052022?end=06052022?var=EDSLINEEMBEDDED.Module2.VI1?"
"var=EDSLINEEMBEDDED.Module2.API1?period=900"
)
df = pd.read_xml(url, xpath="//record/* | //dateTime")
# FILL PARENT TEXT FORWARD TO CHILD ITEMS
df["dateTime"] = df["dateTime"].ffill()
# DROP UNNEEDED ROWS
df[(pd.notnull(df["id"])) & (pd.notnull(df["value"]))]
df.to_csv('output0.csv')