Searching for a list of words in an XML file in Python?
I have this XML file, which contains more than 2000 phrases. A small sample is below.
<TEXT>
<PHRASE>
<V>played</V>
<N>John</N>
<PREP>with</PREP>
<en x='PERS'>Adam</en>
<PREP>in</PREP>
<en x='LOC'>ASL school</en>
</PHRASE>
<PHRASE>
<V y='0'>went</V>
<en x='PERS'>Mark</en>
<PREP>to</PREP>
<en x='ORG'>United Nations</en>
<PREP>for</PREP>
<PREP>a</PREP>
<N>visit</N>
</PHRASE>
<PHRASE>
<PREP>in</PREP>
<en x='DATE'>1987</en>
<en x='PERS'>Nick</en>
<V>founded</V>
<en x='ORG'>XYZ company</en>
</PHRASE>
<PHRASE>
<en x='ORG'>Google's</en>
<en x='PERS'>Frank</en>
<V>went</V>
<N>yesterday</N>
<PREP>to</PREP>
<en x='LOC'>San Fransisco</en>
</PHRASE>
</TEXT>
I have a list of patterns:
finalPatterns=['went \n to \n','created\n the\n', 'founded\n a\n', 'went\n yesterday\n to\n', 'a\n visit\n', 'founded\n in\n']
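What I want is to take each finalPattern (for example, went to) and search for it in every phrase of the text. If a phrase contains both went AND to, print its two <en> tags. [If the en tags are not PERS and ORG, nothing is printed.]
When it searches for:
- "went" & "to" --> this is the output: Frank - San Fransisco
- "founded" & "in" --> output: Nick - XYZ Company
This is what I tried, but it did not work. Nothing is printed.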
for phrase in root.findall('./PHRASE'):
    ens = {en.get('x'): en.text for en in phrase.findall('en')}
    if 'ORG' in ens and 'PERS' in ens:
        if all(word in phrase for word in finalPatterns):
            x = "".join(phrase.itertext())  # print what's in between [since I would also like to print the whole sentence]
            print("ORG is: {}, PERS is: {} /".format(ens["ORG"], ens["PERS"]))
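Consider XSLT, the special-purpose language for manipulating XML documents, which can rewrite the original XML according to matched values while it handles the search. The XSLT below is embedded in Python so that the finalPatterns list dynamically removes the unmatched elements. From there, Python can transform the original document (using the lxml module) and use the output for your end needs.
Python Script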
import os
import lxml.etree as ET

# Assumption: the XML file sits in the same directory as this script
cd = os.path.dirname(os.path.abspath(__file__))

finalPatterns=['went \n to \n','created\n the\n', 'founded\n a\n', 'went\n yesterday\n to\n', 'a\n visit\n', 'founded\n in\n']

# BUILDING XSLT FILTER STRING
# Each pattern becomes "(contains(., 'w1') and contains(., 'w2'))"; patterns are joined with "or"
contains = ''
for p in finalPatterns:
    contains += "("
    for i in p.split('\n '):
        contains += "contains(., '{}') and \n".format(i.replace('\n', '').strip(' '))
    contains += ")"
    contains = contains.replace(' and \n)', ') or ')

xslstr = '''<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>

  <!-- Identity Transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Rewrites Matching Phrase elements -->
  <xsl:template match="PHRASE">
    <xsl:copy>
      <wholetext>
        <xsl:call-template name="join">
          <xsl:with-param name="valueList" select="*"/>
          <xsl:with-param name="separator" select="' '"/>
        </xsl:call-template>
      </wholetext>
      <xsl:choose>
        <xsl:when test="contains(., 'went') and contains(., 'to')">
          <match>went to</match>
        </xsl:when>
        <xsl:when test="contains(., 'founded') and contains(., 'in')">
          <match>founded in</match>
        </xsl:when>
        <xsl:when test="contains(., 'created') and contains(., 'the')">
          <match>created the</match>
        </xsl:when>
        <xsl:when test="contains(., 'a') and contains(., 'visit')">
          <match>a visit</match>
        </xsl:when>
      </xsl:choose>
      <person><xsl:value-of select="en[@x='PERS']"/></person>
      <organization><xsl:value-of select="en[@x='ORG']"/></organization>
      <location><xsl:value-of select="en[@x='LOC']"/></location>
    </xsl:copy>
  </xsl:template>

  <!-- Removes Unmatched Phrase elements -->
  <xsl:template match="PHRASE[not({0})]"/>

  <!-- Join Text values -->
  <xsl:template name="join">
    <xsl:param name="valueList" select="''"/>
    <xsl:param name="separator" select="','"/>
    <xsl:for-each select="$valueList">
      <xsl:choose>
        <xsl:when test="position() = 1">
          <xsl:value-of select="."/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="concat($separator, .) "/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each>
  </xsl:template>

</xsl:transform>'''.format(contains[:-4])  # [:-4] drops the trailing " or "

dom = ET.parse(os.path.join(cd, 'SearchWords.xml'))
xslt = ET.fromstring(xslstr)
transform = ET.XSLT(xslt)
newdom = transform(dom)

tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True)
print(tree_out.decode("utf-8"))

for phrase in newdom.findall('PHRASE'):
    print("Text: {} \n ORG is: {}, PERS is: {} /".format(phrase.find('wholetext').text,
                                                        phrase.find('organization').text,
                                                        phrase.find('person').text))
Output
The transformed XML is included below for demonstration. The tree_out string can be saved externally as a new XML file.
<TEXT>
<PHRASE>
<wholetext>went Mark to United Nations for a visit</wholetext>
<person>Mark</person>
<organization>United Nations</organization>
<location/>
</PHRASE>
<PHRASE>
<wholetext>in 1987 Nick founded XYZ company</wholetext>
<person>Nick</person>
<organization>XYZ company</organization>
<location/>
</PHRASE>
<PHRASE>
<wholetext>Google's Frank went yesterday to San Fransisco</wholetext>
<person>Frank</person>
<organization>Google's</organization>
<location>San Fransisco</location>
</PHRASE>
</TEXT>
Text: went Mark to United Nations for a visit
ORG is: United Nations, PERS is: Mark /
Text: in 1987 Nick founded XYZ company
ORG is: XYZ company, PERS is: Nick /
Text: Google's Frank went yesterday to San Fransisco
ORG is: Google's, PERS is: Frank /
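The filter-building loop in the script is the least obvious step, so here is a small standalone sketch of the same idea in plain Python. The function name build_filter is hypothetical (it is not part of lxml or of the script above), and it uses str.split() and str.join() instead of the string replacement trick. It assumes the pattern words contain no apostrophes, since those would break the quoted XPath literals.

```python
def build_filter(patterns):
    """Build an XPath 1.0 boolean expression from word patterns.

    Words inside one pattern are combined with "and";
    the patterns themselves are combined with "or".
    build_filter is a hypothetical helper name used for illustration.
    """
    groups = []
    for pattern in patterns:
        # split() with no argument drops the spaces and newlines around each word
        words = pattern.split()
        if not words:
            continue  # skip empty patterns rather than emit "()"
        tests = " and ".join("contains(., '{}')".format(w) for w in words)
        groups.append("(" + tests + ")")
    return " or ".join(groups)


finalPatterns = ['went \n to \n', 'founded\n in\n']
print(build_filter(finalPatterns))
```

The resulting string can be dropped into a PHRASE[not(...)] match pattern in the same way the script does with .format(). Keep in mind that contains() is a substring test, so 'to' would also match inside a longer word.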
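List Comprehension
For a list comprehension, see the attempt below using xpath. The challenge, however, is that your finalPatterns entries do not match the text consistently. For example, the text may contain went \n to with words in between, as in went \n Mark \n to. If each list element holds only a single keyword, the approach below works. Otherwise, consider regex for pattern recognition.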
dom = ET.parse(os.path.join(cd, 'Input.xml'))
phraselist = dom.xpath('//PHRASE')
for phrase in phraselist:
    if any(word in p for p in phrase.xpath('./*/text()') for word in finalPatterns):
        print(' '.join(phrase.xpath('./*/text()')))
        print('ORG is: {0}, PERS is: {1}'.format(phrase.xpath("./en[@x='ORG']")[0].text,
                                                  phrase.xpath("./en[@x='PERS']")[0].text))
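As a rough illustration of the regex route mentioned above, the sketch below turns one pattern into a regular expression that allows other words between the keywords. The helper name pattern_to_regex is hypothetical, and the sketch assumes the keywords must appear in the order given by the pattern, which is stricter than a plain "contains both words" test.

```python
import re


def pattern_to_regex(pattern):
    """Compile a pattern such as 'went \\n to \\n' into an ordered word regex.

    pattern_to_regex is a hypothetical helper name. Assumption: the words
    must occur in pattern order, with any text allowed in between.
    """
    words = [re.escape(w) for w in pattern.split()]
    # \b keeps 'to' from matching inside a longer word such as 'tomorrow'
    return re.compile(r"\b" + r"\b.*?\b".join(words) + r"\b")


sentence = "went Mark to United Nations for a visit"
for pattern in ['went \n to \n', 'a\n visit\n', 'founded\n in\n']:
    found = pattern_to_regex(pattern).search(sentence) is not None
    print(pattern.split(), found)
```

Because of the ordering assumption, 'founded\n in\n' would not match the sentence "in 1987 Nick founded XYZ company", where in comes first. If order should not matter, test each word separately instead of building one combined expression.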
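This should do the trick:
for phrase in root.findall('./PHRASE'):
    # Collect the text of every V, N and PREP element in this phrase
    phrasewords = [w.text for w in phrase.findall('V') + phrase.findall('N') + phrase.findall('PREP')]
    for words in finalPatterns:
        # split() discards the spaces and newlines around each word
        if all(word in phrasewords for word in words.split()):
            print("found")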
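Putting the pieces together, here is a minimal end-to-end sketch using only the standard library. It combines the PERS/ORG check from the question with the per-word test above. The function name find_matches and the inline SAMPLE string are illustrative assumptions, and the sample is a shortened, well-formed version of the XML in the question.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the sample XML (illustrative data)
SAMPLE = """<TEXT>
<PHRASE><V>played</V><N>John</N><PREP>with</PREP><en x='PERS'>Adam</en>
<PREP>in</PREP><en x='LOC'>ASL school</en></PHRASE>
<PHRASE><V>went</V><en x='PERS'>Mark</en><PREP>to</PREP>
<en x='ORG'>United Nations</en><PREP>for</PREP><PREP>a</PREP><N>visit</N></PHRASE>
<PHRASE><PREP>in</PREP><en x='DATE'>1987</en><en x='PERS'>Nick</en>
<V>founded</V><en x='ORG'>XYZ company</en></PHRASE>
</TEXT>"""

finalPatterns = ['went \n to \n', 'created\n the\n', 'founded\n a\n',
                 'went\n yesterday\n to\n', 'a\n visit\n', 'founded\n in\n']


def find_matches(root, patterns):
    """Return (pattern, sentence, PERS, ORG) tuples for matching phrases.

    find_matches is a hypothetical helper name. A phrase matches a pattern
    when it has both a PERS and an ORG tag and contains every pattern word
    among its non-<en> elements, in any order.
    """
    results = []
    for phrase in root.findall('./PHRASE'):
        ens = {en.get('x'): (en.text or '').strip() for en in phrase.findall('en')}
        if 'PERS' not in ens or 'ORG' not in ens:
            continue  # the question asks to print nothing in this case
        words = {(el.text or '').strip() for el in phrase if el.tag != 'en'}
        sentence = ' '.join((el.text or '').strip() for el in phrase)
        for pattern in patterns:
            needed = pattern.split()
            if needed and all(w in words for w in needed):
                results.append((' '.join(needed), sentence, ens['PERS'], ens['ORG']))
    return results


root = ET.fromstring(SAMPLE)
for pattern, sentence, pers, org in find_matches(root, finalPatterns):
    print("{!r}: {} | ORG is: {}, PERS is: {}".format(pattern, sentence, org, pers))
```

The original attempt printed nothing because word in phrase compares each whole pattern string against the child Element objects, and all(...) also required every pattern to match at once. Splitting each pattern into words and comparing against the element text avoids both problems.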