PyParsing bibliographic citations

Question

I'm having some trouble using PyParsing. I need to parse bibliographic information out of CVs. An example:

AuthorA, B., AuthorB, M. R., AuthorC, V., and B. LastAuthor. Some sciency title. Name of the confernce, City, State, December 3, 2012

I came up with some code to parse the (main) author list and the date... the other information isn't particularly important to me.

from pyparsing import (Word, Literal, OneOrMore, alphanums, delimitedList, printables, 
    alphas, nums)

family_name = Word(alphanums+'-')
first_init = Word(alphanums+'.')
author = (family_name("LastName") + Literal(',').suppress() + 
          OneOrMore(first_init("FirstInitials") ) )
last_author = first_init("FirstInitials") + family_name("LastName")

author_list = delimitedList(author) + Literal('and').suppress() + last_author

sentence = OneOrMore(Word(printables))
location = delimitedList(Word(printables))
date = Word(alphas) + Word(nums) + Literal(',').suppress() + Word(nums)

citation = (author_list('AuthorLst') + sentence('Title') + location('Location') 
            + date('Date'))

citation.parseString(ntext)

However, it chokes on the "and" that separates the author list from the last author.

I get the error message:

---------------------------------------------------------------------------
ParseException                            Traceback (most recent call last)
<ipython-input-142-5d7946dcb775> in <module>()
     15 
     16 
---> 17 citation.parseString(ntext)

/Users/willdampier/anaconda/lib/python2.7/site-packages/pyparsing.pyc in parseString(self, instring, parseAll)
   1123             else:
   1124                 # catch and re-raise exception from here, clears out pyparsing internal stack trace
-> 1125                 raise exc
   1126         else:
   1127             return tokens

ParseException: Expected "and" (at char 40), (line:1, col:41)

Any suggestions?

Answer 1

After defining author, add this line:

author.setName("author").setDebug()

to trace the matching of the author expression. Then, for better overall diagnostics, change your test line to:

author_list.runTests(ntext)

With these changes, you get output like this:

Match author at loc 0(1,1)
Matched author -> ['AuthorA', 'B.']
Match author at loc 12(1,13)
Matched author -> ['AuthorB', 'M.', 'R.']
Match author at loc 28(1,29)
Matched author -> ['AuthorC', 'V.']
Match author at loc 41(1,42)
Exception raised:Expected "," (at char 46), (line:1, col:47)

AuthorA, B., AuthorB, M. R., AuthorC, V., and B. LastAuthor. Some sciency title. Name of the confernce, City, State, December 3, 2012
                                        ^
FAIL: Expected "and" (at char 40), (line:1, col:41)
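The same diagnostic steps can be tried on the author expression alone. This is a minimal sketch, assuming the definitions from the question; the two test strings are made up for illustration, and the return value of runTests is unpacked into the names success and results only so it can be inspected.

```python
from pyparsing import OneOrMore, Suppress, Word, alphanums

family_name = Word(alphanums + '-')
first_init = Word(alphanums + '.')
author = family_name + Suppress(',') + OneOrMore(first_init)

# Name the expression and switch on match tracing, as suggested above.
author.setName("author").setDebug()

# runTests takes one test string per line, prints a report for each,
# and returns an overall success flag plus (test_string, result) pairs.
success, results = author.runTests("""
    AuthorB, M. R.
    and B. LastAuthor
    """)
```

The second test string is expected to fail, because "and" is read as a family name and no comma follows it. That is the same mismatch the trace above shows at loc 41.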

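So your immediate problem is that you are not handling the trailing ',' before the 'and'. You will also need to add the trailing '.' to your definition of author_list.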
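One way the two punctuation fixes might look is sketched below. It reuses the definitions from the question and is not necessarily the exact code behind the output shown in this answer; authors_text is an illustrative name holding only the author portion of the sample citation.

```python
from pyparsing import OneOrMore, Suppress, Word, alphanums, delimitedList

family_name = Word(alphanums + '-')
first_init = Word(alphanums + '.')
author = (family_name("LastName") + Suppress(',')
          + OneOrMore(first_init("FirstInitials")))
last_author = first_init("FirstInitials") + family_name("LastName")

author_list = (delimitedList(author)
               + Suppress(',')    # the comma that precedes "and"
               + Suppress('and')
               + last_author
               + Suppress('.'))   # the period that closes the author list

authors_text = "AuthorA, B., AuthorB, M. R., AuthorC, V., and B. LastAuthor."
result = author_list.parseString(authors_text)
print(result.asList())
```

Using Suppress here keeps the punctuation out of the results; plain ',' and '.' literals would match just as well but leave those tokens in the output.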

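From there, though, your sentence parser is going to be a problem, because it will consume the entire rest of the string. Since your main interest is getting the date, this may work for you: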

stuff = OneOrMore(Word(printables), stopOn=date)
citation = (author_list('AuthorLst') + stuff('body') + date('Date'))
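To see what stopOn does in isolation, here is a small sketch run on just the part of the sample citation that follows the author list. The names tail and text are illustrative only; the date expression is the one from the question.

```python
from pyparsing import OneOrMore, Suppress, Word, alphas, nums, printables

date = Word(alphas) + Word(nums) + Suppress(',') + Word(nums)

# Before consuming each word, OneOrMore checks whether a date starts here;
# if so it stops and leaves the date for the next expression.
stuff = OneOrMore(Word(printables), stopOn=date)
tail = stuff('body') + date('Date')

text = "Some sciency title. Name of the confernce, City, State, December 3, 2012"
result = tail.parseString(text)
print(result['body'].asList())
print(result['Date'].asList())
```

One caveat of this approach: any earlier run of text that happens to look like "word number, number" would also stop the repetition, so a stricter date expression (for example, one restricted to month names) may be worth considering.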

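Lastly, on your use of results names ("FirstInitials", "LastName", etc.): kudos, this is a feature of pyparsing that I'm especially pleased with. But you need to isolate the names within each author reference, or else you will only get the names of the last author. To do this, wrap each author in a pyparsing Group: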

from pyparsing import Group

author = Group(family_name("LastName") + Literal(',').suppress() +
               OneOrMore(first_init("FirstInitials")))
last_author = Group(first_init("FirstInitials") + family_name("LastName"))

Now your author_list should give you a list of substructures. You can see them if you do this:

print(citation.parseString(ntext).dump())

With my changes, I get this for your sample text:

[['AuthorA', 'B.'], ['AuthorB', 'M.', 'R.'], ['AuthorC', 'V.'], ',', 
 ['B.', 'LastAuthor'], '.', 'Some', 'sciency', 'title.', 'Name', 'of', 
 'the', 'confernce,', 'City,', 'State,', 'December', '3', '2012']
- AuthorLst: [['AuthorA', 'B.'], ['AuthorB', 'M.', 'R.'], 
              ['AuthorC', 'V.'], ',', ['B.', 'LastAuthor'], '.']
  [0]:
    ['AuthorA', 'B.']
    - FirstInitials: 'B.'
    - LastName: 'AuthorA'
  [1]:
    ['AuthorB', 'M.', 'R.']
    - FirstInitials: 'R.'
    - LastName: 'AuthorB'
  [2]:
    ['AuthorC', 'V.']
    - FirstInitials: 'V.'
    - LastName: 'AuthorC'
  [3]:
    ,
  [4]:
    ['B.', 'LastAuthor']
    - FirstInitials: 'B.'
    - LastName: 'LastAuthor'
  [5]:
    .

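The ',' and '.' punctuation still needs to be suppressed, but that is just cleanup. After that, you can easily iterate over your author list and get each author's name.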
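A rough sketch of that cleanup and iteration follows, assuming the grouped author definitions and treating the punctuation as suppressed tokens. The names parsed and last_names are illustrative, and only the author portion of the sample text is parsed.

```python
from pyparsing import Group, OneOrMore, Suppress, Word, alphanums, delimitedList

family_name = Word(alphanums + '-')
first_init = Word(alphanums + '.')

# Each author is wrapped in a Group so its results names stay separate.
author = Group(family_name("LastName") + Suppress(',')
               + OneOrMore(first_init("FirstInitials")))
last_author = Group(first_init("FirstInitials") + family_name("LastName"))

# Suppress drops the ',' before "and" and the closing '.' from the results.
author_list = (delimitedList(author) + Suppress(',') + Suppress('and')
               + last_author + Suppress('.'))

parsed = author_list.parseString(
    "AuthorA, B., AuthorB, M. R., AuthorC, V., and B. LastAuthor.")

# Every element of parsed is now one author's group.
last_names = [a["LastName"] for a in parsed]
print(last_names)
```

Note that "FirstInitials" as defined here keeps only the last initial matched for an author, as the dump above shows for AuthorB; the full token list of each group (for example parsed[1].asList()) still contains all the initials.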
