Extracting text from a JATS XML file using Python

I want to extract text from a JATS XML file.

JATS is a standardized XML format for representing research publications.

<article>
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Elsevier Science B.V. All rights reserved.
P I I S</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>How does foreign direct investment affect economic 1 growth? E. Borenszteina ,*, J. De Gregoriob, J-W. Leec</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>E. Borensztein</string-name>
          <email>eborensztein@imf.org</email>
          <xref ref-type="aff" rid="0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>J. De Gregorio</string-name>
          <xref ref-type="aff" rid="2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>J-W. Lee</string-name>
          <xref ref-type="aff" rid="3">3</xref>
        </contrib>
        <aff id="0">
          <label>0</label>
          <institution>International Monetary Fund, Research Department</institution>
          ,
          <addr-line>Washington DC 20431</addr-line>
          <country country="US">USA</country>
        </aff>
        <aff id="1">
          <label>1</label>
          <institution>We are grateful for comments from Robert Barro</institution>
          ,
          <addr-line>Elhanan Helpman, Boyan Jovanovic, Mohsin Khan, Se-Jik Kim, Donald Mathieson, Sergio Rebelo, Jeffrey Sachs</addr-line>
          ,
          <institution>Peter Wickham, and two anonymous referees. Comments by participants in seminars at 1995 World Congress of the Econometric Society, Korean Macroeconomics Workshop, Kobe University, and Osaka University were very helpful. This paper was partially prepared while Jose ́ de Gregorio and Jong-Wha Lee were at the Research Department, International Monetary Fund. Any opinions expressed are only those of the</institution>
        </aff>
        <aff id="2">
          <label>2</label>
          <institution>Center for Applied Economics, Department of Industrial Engineering, Universidad de Chile</institution>
          ,
          <addr-line>Santiago</addr-line>
          ,
          <country country="CL">Chile</country>
        </aff>
        <aff id="3">
          <label>3</label>
          <institution>Economics Department, Korea University and NBER</institution>
          ,
          <addr-line>Seoul 136 -701</addr-line>
          <country country="KR">Korea</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We test the effect of foreign direct investment (FDI) on economic growth in a cross-country regression framework, utilizing data on FDI flows from industrial countries to 69 developing countries over the last two decades. Our results suggest that FDI is an important vehicle for the transfer of technology, contributing relatively more to growth than domestic investment. However, the higher productivity of FDI holds only when the host country has a minimum threshold stock of human capital. Thus, FDI contributes to economic growth only when a sufficient absorptive capability of the advanced technologies is available in the host economy. 1998 Elsevier Science B.V.</p>
      </abstract>
      <kwd-group>
        <kwd>Foreign direct investment</kwd>
        <kwd>Economic growth</kwd>
        <kwd>Cross-country regression framework</kwd>
        <kwd>Developing countries</kwd>
      </kwd-group>
      <volume>0</volume>
      <issue>0</issue>
      <fpage>115</fpage>
      <lpage>135</lpage>
      <pub-date>
        <year>1998</year>
      </pub-date>
      <history>
        <date date-type="accepted">
          <day>20</day>
          <month>5</month>
          <year>1997</year>
        </date>
        <date date-type="received">
          <day>21</day>
          <month>2</month>
          <year>1996</year>
        </date>
        <date date-type="revised">
          <day>24</day>
          <month>2</month>
          <year>1997</year>
        </date>
      </history>
    </article-meta>
  </front>
  <back>
    <ref-list>
      <ref id="1">
        <mixed-citation>
          <string-name>
            <surname>Aitken</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harrison</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <year>1993</year>
          ,
          <article-title>Do Domestically-Owned Firms Benefit from Foreign Direct Investment: Evidence from Panel Data, Unpublished manuscript</article-title>
          ,
          <source>International Monetary Fund.</source>
        </mixed-citation>
      </ref>
      <ref id="2">
        <mixed-citation>
          <string-name>
            <surname>Barro</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J-W.</given-names>
          </string-name>
          ,
          <year>1993</year>
          .
          <article-title>International comparisons of educational attainment</article-title>
          .
          <source>Journal of Monetary Economics</source>
          <volume>32</volume>
          ,
          <fpage>361</fpage>
          -
          <lpage>394</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="3">
        <mixed-citation>
          <string-name>
            <surname>Barro</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J-W.</given-names>
          </string-name>
          ,
          <year>1994</year>
          .
          <article-title>Sources of economic growth</article-title>
          .
          <source>Carnegie Rochester Conference Series on Public Policy</source>
          <volume>40</volume>
          ,
          <fpage>1</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="4">
        <mixed-citation>
          <string-name>
            <surname>Barro</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <article-title>Sala-i-</article-title>
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <year>1995</year>
          . Economic Growth,
          <string-name>
            <surname>McGraw-Hill</surname>
          </string-name>
          , Cambridge, MA.
        </mixed-citation>
      </ref>
      <ref id="5">
        <mixed-citation>
          <string-name>
            <surname>Benhabib</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spiegel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <year>1994</year>
          .
          <article-title>The roles of human capital in economic development: evidence from aggregate cross-country data</article-title>
          .
          <source>Journal of Monetary Economics</source>
          <volume>34</volume>
          ,
          <fpage>143</fpage>
          -
          <lpage>173</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="6">
        <mixed-citation>
          <string-name>
            <surname>Blomstrom</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lipsey</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zejan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <year>1992</year>
          .
          <article-title>What Explains Developing Country Growth</article-title>
          . NBER Working Paper No.
          <volume>4132</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="7">
        <mixed-citation>
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <year>1993</year>
          .
          <article-title>Foreign Finance and Economic Growth - An Empirical Analysis</article-title>
          .
          <article-title>Unpublished manuscript</article-title>
          , CEPREMAP.
        </mixed-citation>
      </ref>
      <ref id="8">
        <mixed-citation>
          <string-name>
            <surname>De Gregorio</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <year>1992</year>
          .
          <article-title>Economic growth in Latin America</article-title>
          .
          <source>Journal of Development Economics</source>
          <volume>39</volume>
          ,
          <fpage>58</fpage>
          -
          <lpage>84</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="9">
        <mixed-citation>
          <string-name>
            <surname>Easterly</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <year>1993</year>
          .
          <article-title>How much do distortions affect growth</article-title>
          .
          <source>Journal of Monetary Economics</source>
          <volume>32</volume>
          ,
          <fpage>187</fpage>
          -
          <lpage>212</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="10">
        <mixed-citation>
          <string-name>
            <surname>Easterly</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>King</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levine</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rebelo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <year>1994</year>
          . Policy,
          <article-title>Technology Adoption and Growth</article-title>
          . NBER Working Paper No.
          <volume>4681</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="11">
        <mixed-citation>
          <string-name>
            <surname>Edwards</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <year>1990</year>
          . Capital Flows, Foreign Direct Investment, and
          <article-title>Debt-Equity Swaps in Developing Countries</article-title>
          . NBER Working Paper No.
          <volume>3497</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="12">
        <mixed-citation>
          <string-name>
            <surname>Ethier</surname>
            ,
            <given-names>W.J.</given-names>
          </string-name>
          ,
          <year>1982</year>
          .
          <article-title>National and international returns to scale in the modern theory of international trade</article-title>
          .
          <source>American Economic Review</source>
          <volume>72</volume>
          ,
          <fpage>389</fpage>
          -
          <lpage>405</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="13">
        <mixed-citation>
          <string-name>
            <surname>Findlay</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <year>1978</year>
          .
          <article-title>Relative backwardness, direct foreign investment, and the transfer of technology: a simple dynamic model</article-title>
          .
          <source>Quarterly Journal of Economics 92</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="14">
        <mixed-citation>
          <string-name>
            <surname>Grossman</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Helpman</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <year>1991</year>
          .
          <article-title>Innovation and Growth in the Global Economy</article-title>
          , MIT Press Cambridge, MA.
        </mixed-citation>
      </ref>
      <ref id="15">
        <mixed-citation>
          <string-name>
            <surname>Gastil</surname>
            ,
            <given-names>R.D.</given-names>
          </string-name>
          ,
          <year>1987</year>
          . Freedom in the World, Greenwood Press, Westport, CT.
        </mixed-citation>
      </ref>
      <ref id="16">
        <mixed-citation>
          <string-name>
            <surname>Graham</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krugman</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <year>1991</year>
          .
          <article-title>Foreign Direct Investment in the United States</article-title>
          , Institute for International Economics, Washington DC.
        </mixed-citation>
      </ref>
      <ref id="17">
        <mixed-citation>
          <string-name>
            <surname>Jovanovic</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rob</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <year>1989</year>
          .
          <article-title>Growth and diffusion of technology</article-title>
          .
          <source>Review of Economic Studies</source>
          <volume>56</volume>
          ,
          <fpage>569</fpage>
          -
          <lpage>582</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="18">
        <mixed-citation>
          <string-name>
            <surname>King</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levine</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <year>1993</year>
          .
          <article-title>Finance and growth: Schumpeter might be right</article-title>
          .
          <source>Quarterly Journal of Economics</source>
          <volume>108</volume>
          ,
          <fpage>717</fpage>
          -
          <lpage>738</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="19">
        <mixed-citation>
          <string-name>
            <surname>Knack</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keefer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <year>1995</year>
          .
          <article-title>Institutions and economic performance: cross-country tests using alternative institutional measures</article-title>
          .
          <source>Economics and Politics</source>
          <volume>7</volume>
          ,
          <fpage>207</fpage>
          -
          <lpage>227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="20">
        <mixed-citation>
          <string-name>
            <surname>Levine</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Renelt</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <year>1992</year>
          .
          <article-title>A sensitivity analysis of cross-country growth regressions</article-title>
          .
          <source>American Economic Review</source>
          <volume>82</volume>
          ,
          <fpage>942</fpage>
          -
          <lpage>963</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="21">
        <mixed-citation>
          <string-name>
            <surname>Nelson</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Phelps</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <year>1966</year>
          .
          <article-title>Investment in humans, technological diffusion, and economic growth</article-title>
          .
          <source>American Economic Review: Papers and Proceedings</source>
          <volume>61</volume>
          ,
          <fpage>69</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="22">
        <mixed-citation>
          <string-name>
            <surname>Romer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <year>1990</year>
          .
          <article-title>Endogenous technological change</article-title>
          .
          <source>Journal of Political Economy</source>
          <volume>98</volume>
          ,
          <fpage>S71</fpage>
          -
          <lpage>S102</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="23">
        <mixed-citation>
          <string-name>
            <surname>Romer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <year>1993</year>
          .
          <article-title>Idea gaps and object gaps in economic development</article-title>
          .
          <source>Journal of Monetary Economics</source>
          <volume>32</volume>
          ,
          <fpage>543</fpage>
          -
          <lpage>573</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="24">
        <mixed-citation>
          <string-name>
            <surname>Segerstrom</surname>
            ,
            <given-names>P.S.</given-names>
          </string-name>
          ,
          <year>1991</year>
          . Innovation, imitation, and
          <article-title>economic growth</article-title>
          .
          <source>Journal of Political Economy</source>
          <volume>99</volume>
          ,
          <fpage>807</fpage>
          -
          <lpage>827</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="25">
        <mixed-citation>
          <string-name>
            <given-names>United</given-names>
            <surname>Nations</surname>
          </string-name>
          ,
          <year>1992</year>
          .
          <source>World Investment Report 1992 Transnational Corporations as Engines of Growth</source>
          , Department of Economic and Social Development, United Nations, New York.
        </mixed-citation>
      </ref>
      <ref id="26">
        <mixed-citation>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J-Y.</given-names>
          </string-name>
          ,
          <year>1990</year>
          .
          <article-title>Growth, technology transfer, and the long-run theory of international capital movements</article-title>
          .
          <source>Journal of International Economics</source>
          <volume>29</volume>
          ,
          <fpage>255</fpage>
          -
          <lpage>271</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>

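There is an <abstract> tag around line 58, and I want to extract its text. The caveat is that the file structure is complex: although it looks like XML, I cannot get any output. I have tried several libraries, such as untangle, lxml, and BeautifulSoup, without success.

Here is one of the code snippets I tried.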

from lxml import etree

fo = open('The international law on foreign investment.cermxml')
doc = etree.parse(fo)
## TRY 1
doc.find('abstract')  # This yields nothing

## TRY 2
path_result = doc.xpath('//abstract')  # Returns an empty list

## TRY 3
root = doc.getroot()
result = root.iter('abstract')  # This yields <lxml.etree.ElementDepthFirstIterator at 0x7f1f71c15a20>
## I don't know how to proceed from here. Printing in a loop doesn't work.

## TRY 4
for child in root[0][1]:
    print(child.tag)
## The abstract tag is a child of article-meta, which sits under the first child of the root tag; hence [0][1].
## Ideally this should list abstract as one of the children, but it does not.

Edit: I also have some nested tags with dynamic names. I want to extract the text between tags such as the following:

<body>
    <sec id="1">
      <title>1. Introduction</title>
      <p>Technology diffusion plays a central role in the process of economic
    development.2 In contrast to the traditional growth framework, where technological
    change was left as an unexplained residual, the recent growth literature has
    highlighted the dependence of growth rates on the state of domestic technology
    relative to that of the rest of the world. Thus, growth rates in developing countries
    are, in part, explained by a ‘catch-up’ process in the level of technology. In a
    typical model of technology diffusion, the rate of economic growth of a backward
    country depends on the extent of adoption and implementation of new
    technologies that are already in use in leading countries.</p>
<p>The paper is divided into four sections. Section 2 presents a simple model to
motivate our empirical investigation; Section 3 provides an account of the data
used in the empirical analysis; Section 4 describes the regression results, and
Section 5 presents some concluding remarks.</p>
     </sec>
 <sec id="2">... </sec>
</body>

You can do this with the bs4 library.

from bs4 import BeautifulSoup

# xmla holds the XML document as a string; the 'xml' parser requires lxml to be installed
soup = BeautifulSoup(xmla, 'xml')
print(soup.find('abstract'))

>>> <abstract>haha</abstract>

lxml with XPath seems to work for me on your data:

>>> from lxml import etree
>>> d = etree.parse(open('...'))  # file with your exact content
>>> e = d.getroot()
>>> e.xpath('.//abstract')
[<Element abstract at 0x7f9239c10710>]
>>> e.xpath('.//abstract/p')[0].text  # first p inside abstract
'We test the effect of foreign direct investment (FDI) ...'
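
The difference between a bare tag name and a './/' path may explain TRY 1 in the question, and the iterator returned in TRY 3 simply needs to be consumed. The following is a minimal sketch of both points, not a diagnosis of the original file. It swaps in the standard library's xml.etree.ElementTree (whose find, iter, and itertext calls lxml.etree also provides) so that it runs without extra packages; SAMPLE is a hypothetical, trimmed-down stand-in for the file.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the file in the question (hypothetical sample).
SAMPLE = """<article>
  <front>
    <article-meta>
      <abstract>
        <p>We test the effect of foreign direct investment (FDI) on economic growth.</p>
      </abstract>
    </article-meta>
  </front>
</article>"""

root = ET.fromstring(SAMPLE)

# A bare tag name only matches direct children of the element searched from,
# and <abstract> sits three levels below <article>, so this finds nothing.
direct = root.find("abstract")

# The './/' prefix searches all descendants instead.
nested = root.find(".//abstract")

# iter() returns a lazy iterator; loop over it or wrap it in list()
# to get the matching elements.
abstracts = list(root.iter("abstract"))

# itertext() yields every text fragment under the element; split/join
# collapses the indentation whitespace into single spaces.
abstract_text = " ".join("".join(nested.itertext()).split())
print(abstract_text)
```

If the real file declares an XML namespace on the root element, the tag names would also need that namespace; the sample in the question does not show one, so this sketch assumes there is none.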

I also managed to get the abstract using XPath with the lxml.etree module.

import os
import lxml.etree as et

def get_article_abstract(article_file, tag_path_elements=None):
    """
    :param article_file: the XML file for a single article (originally a local PLOS XML article)
    :param tag_path_elements: XPath steps locating the abstract in the article's XML tree
    :return: plain-text string of the abstract's content
    """
    if tag_path_elements is None:
        tag_path_elements = ("/",
                             "article",
                             "front",
                             "article-meta",
                             "abstract")

    article_tree = et.parse(article_file)
    article_root = article_tree.getroot()
    tag_location = '/'.join(tag_path_elements)
    abstract = article_root.xpath(tag_location)
    abstract_text = et.tostring(abstract[0], encoding='unicode', method='text')

    # Clean up text: remove indentation whitespace and blank lines
    abstract_text = abstract_text.strip().replace('  ', '')
    abstract_text = os.linesep.join([s for s in abstract_text.splitlines() if s])

    return abstract_text
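
For the edit in the question about nested sec tags, a similar approach might work. The sketch below is only an illustration under assumptions: BODY is a shortened, hypothetical stand-in for the body fragment (the second section's content is a placeholder), extract_sections and normalize are names made up here, and the standard library's xml.etree.ElementTree is used in place of lxml.etree so the example runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the <body> fragment in the question (hypothetical sample;
# the content of the second section is a placeholder).
BODY = """<body>
  <sec id="1">
    <title>1. Introduction</title>
    <p>Technology diffusion plays a central role in the process of economic
    development.</p>
    <p>The paper is divided into four sections.</p>
  </sec>
  <sec id="2">
    <title>2. Placeholder title</title>
    <p>Placeholder paragraph.</p>
  </sec>
</body>"""


def normalize(text):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return " ".join(text.split())


def extract_sections(root):
    """Return a list of (id, title, paragraphs) tuples, one per <sec> element."""
    sections = []
    for sec in root.iter("sec"):
        title = sec.find("title")
        title_text = normalize("".join(title.itertext())) if title is not None else ""
        # findall("p") only looks at paragraphs that are direct children of this <sec>.
        paragraphs = [normalize("".join(p.itertext())) for p in sec.findall("p")]
        sections.append((sec.get("id"), title_text, paragraphs))
    return sections


sections = extract_sections(ET.fromstring(BODY))
for sec_id, title, paragraphs in sections:
    print(sec_id, title)
    for paragraph in paragraphs:
        print("  ", paragraph)
```

Because the sections are matched by tag name rather than by id, the changing id values do not matter. Note that iter("sec") also visits subsections nested inside other sections, each reported as its own entry.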