Get bibliography list and its count from text - Python
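In my Python task I have to read a PDF paper and get all its references together with how many times each one is mentioned in the paper. This is the PDF as an example: it has 18 references, and say Ref#1 is mentioned 3 times in the paper and Ref#2 is mentioned once. This is what I want: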
Ref# Count Reference
1 3 Arto Anttila. 1995. How to recognise subjects in English. In Karlsson et al., chapt. 9, pp. 315-358.
2 1 Dekang Lin. 1996. Evaluation of Principar with the Susanne corpus. In John Carroll, editor, Workshop on Robust Parsing, pages 54-69, Prague
...
I already have the Ref# and Reference columns of the list, and with this regular expression I have somehow managed to get the lines of text that contain a citation:
regex = re.compile(r'[A-Z]{1}[a-z\u0000-\u007F]+ \([0-9]{4}\)|\([A-Z]{1}[a-z\u0000-\u007F]+, [0-9]{4}\)|\([A-Z]{1}[a-z\u0000-\u007F]+, [0-9]{4}; [A-Za-z \u0000-\u007F,;]*\)|[A-Z]{1}[a-z\u0000-\u007F]+ \([0-9]{4},[A-Za-z0-9\u0000-\u007F ]*\)|[A-Z]{1}[a-z\u0000-\u007F ]+ [a-z]{2} [a-z]{2}. \([0-9]{4}\)')
So when I loop over the list of strings (the text split into sentences) and search with the regex above using this code:
for i in range(0, len(lstString)):
    refLine = re.findall(regex, lstString[i])
    if(refLine != [] and refLine [0] != []):
        print(refLine)
I get output like this:
(Karls- son et al., 1995)
Our work is partly based on the work done with the Constraint Grammar framework that was orig- inally proposed by Fred Karlsson
(1990)
(Tapanainen, 1996)
(Tapanainen, 1996) is dif- ferent from the former (Karlsson et al., 1995)
Hurskainen (1996)
In essence, the same formalism is used in the syn- tactic analysis in J~rvinen (1994) and Anttila (1995)
Our notation follows the classical model of depen- dency theory (Heringer, 1993) introduced by Lucien Tesni~re (1959) and later
advocated by Igor Mel'~uk (1987)
Hudson (1991)
(Hays, 1964)
(McCord, 1990; Sleator and Tem- perley, 1991; Eisner, 1996)
(Hudson, 1991)
(J~irvinen, 1994)
The CG-2 program (Tapanainen, 1996) runs a mod- ified disambiguation grammar of Voutilainen (1995)
(J~rvinen, 1994; Tapanainen and J/~rvinen, 1994)
(Eisner, 1996)
Dekang Lin (1996)
Acknowledgments We are using Atro Voutilainen's (1995)
It returns all the strings that contain citations, but I have some problems:
- It does not capture citations like Karlsson et al. (1995)
- Some of the matches contain 2 citations
- How can I update the count for each reference in the reference list?
I tried this code to get the count for each reference, but it always returns the whole list:
matching = [s for s in lstRef if any(xs in s for xs in refLine)]
Any help would be appreciated.
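My idea is to take the names (and years) from the References section at the end of the document and use them to search for citations in the body of the document.
In your previous question you got code which extracts the References section at the end of the document.
Using the regex '((.*)\. (\d{4})\.)' I can get the names as one string and the year as another string (and also both together in one string):
authors_and_year = re.match(r'((.*)\. (\d{4})\.)', line)
text, authors, year = authors_and_year.groups()
i.e.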
text: Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen. 1996.
authors: Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen
year: 1996
Using the next regex, ',[ ]*and |,[ ]*| and ', I can split the string with the names into a list of names:
names = re.split(',[ ]*and |,[ ]*| and ', authors)
and using a plain split(" ") I can get the surnames, which are more useful than the full names:
names = [(name, name.split(' ')[-1]) for name in names]
i.e.
names: [('Christer Samuelsson', 'Samuelsson'), ('Pasi Tapanainen', 'Tapanainen'), ('Atro Voutilainen', 'Voutilainen')]
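Now I can use these names (or rather the surnames) and the year to generate strings like surname (year) and surname, year, and then search for them in the document.
If there are several surnames, I can take the first surname and generate surname et al. (year), and so on.
Using these strings and the standard string function text.count(generated_string) I can count them.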
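The generate-and-count step described above could be wrapped in two small helpers. This is only a minimal sketch: the names citation_variants and count_citations are made up for illustration, and the set of variants is an assumption about how citations are written in this particular paper.

```python
def citation_variants(surnames, year):
    """Build the in-text citation strings to look for, for one reference."""
    first = surnames[0]
    if len(surnames) == 1:
        heads = [first]
    elif len(surnames) == 2:
        heads = [first + ' and ' + surnames[1], first + ' et al.']
    else:
        heads = [first + ' et al.']
    variants = []
    for head in heads:
        variants.append(head + ' (' + year + ')')  # e.g. Hudson (1991)
        variants.append(head + ', ' + year)        # e.g. Hudson, 1991
    return variants


def count_citations(doc, surnames, year):
    """Sum the occurrences of every variant in the document text."""
    return sum(doc.count(variant) for variant in citation_variants(surnames, year))


doc = 'as in Hudson (1991) and later (Hudson, 1991; Karlsson et al., 1995).'
print(count_citations(doc, ['Hudson'], '1991'))                             # 2
print(count_citations(doc, ['Karlsson', 'Voutilainen', 'Anttila'], '1995')) # 1
```

One caveat of plain substring counting: a two-author citation such as A and B, 1994 also contains B, 1994, so a separate single-author reference B (1994) would be counted there too.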
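That is all I have for now, and it is still not ideal.
You can find all citations in the document manually and use them to test the code. You will see which ones are counted correctly and which need more changes.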
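To help with that manual check, here is a rough sketch that lists candidate in-text citations, including the Karlsson et al. (1995) form and parentheses that hold several citations separated by ";". The patterns and the function name find_citations are my own assumptions, not the regex from the question, and they will miss citations that are still broken by hyphenation such as "Karls- son".

```python
import re

# One capitalised token; garbled characters such as ~ or / are allowed inside it.
NAME = r"[A-Z][^\s(),;]+?"
# "Name", "Name and Name" or "Name et al."
AUTHORS = NAME + r"(?: and " + NAME + r"| et al\.)?"
# Narrative form: Name (1995), Name's (1995), Name et al. (1995)
NARRATIVE = re.compile(r"(" + AUTHORS + r")(?:'s)? \((\d{4})\)")
# Parenthetical form: (Name, 1995; Name and Name, 1991)
PARENTHETICAL = re.compile(r"\(([^()]*\d{4})\)")
ITEM = re.compile(r"^(" + AUTHORS + r"), (\d{4})$")


def find_citations(text):
    """Return (authors, year) pairs for every citation the patterns recognise."""
    found = []
    for authors, year in NARRATIVE.findall(text):
        found.append((authors, year))
    for group in PARENTHETICAL.findall(text):
        # a single pair of parentheses may hold several citations
        for part in group.split(';'):
            match = ITEM.match(part.strip())
            if match:
                found.append(match.groups())
    return found


sample = ("the same formalism is used in Karlsson et al. (1995) and Anttila (1995), "
          "see (McCord, 1990; Sleator and Temperley, 1991; Eisner, 1996) "
          "and Atro Voutilainen's (1995)")
for authors, year in find_citations(sample):
    print(authors, year)
```

Comparing this list with the counts from the surname-based search shows which citations either approach misses.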
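For example, the text We are using Atro Voutilainen's (1995) contains a citation with 's. Maybe the document should first be cleaned, for example with nltk, as is done in NLP (Natural Language Processing).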
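As a lighter alternative to a full NLP pipeline, a few regex substitutions may already be enough for counting. The sketch below uses only the re module instead of nltk; the function name clean_text and the three rules are assumptions based on the problems visible in the output above (words hyphenated at line breaks, possessive 's, citations split across lines).

```python
import re


def clean_text(text):
    """Normalise extracted body text before counting citations."""
    # join words split by a line-break hyphen: "Karls- son" -> "Karlsson"
    text = re.sub(r"(?<=[a-z])-\s+(?=[a-z])", "", text)
    # drop possessive 's: "Voutilainen's (1995)" -> "Voutilainen (1995)"
    text = re.sub(r"(?<=\w)'s\b", "", text)
    # collapse newlines and repeated spaces so "Karlsson\n(1990)" becomes one phrase
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("(Karls- son et al., 1995)"))
print(clean_text("We are using Atro Voutilainen's (1995)"))
```

This should be applied only to the body text, not to the reference list, which is later split on newlines. The hyphen rule can also join a genuinely hyphenated word that happened to be followed by a space, so the result still needs a manual check.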
Some non-ASCII characters also cause problems: the name Järvinen was extracted as J~rvinen in one place and as J/irvinen in another.
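One possible workaround for such inconsistent extraction is approximate matching of surnames with the standard difflib module. This is a sketch under assumptions: the helper names and the 0.8 threshold are mine, and the threshold would need tuning, since two genuinely different but similar surnames could be merged.

```python
import difflib


def same_surname(a, b, threshold=0.8):
    """True when two extracted surnames are similar enough to be treated as one."""
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold


def canonical_surname(token, known_surnames, cutoff=0.8):
    """Map a surname found in the text to the closest one from the reference list."""
    matches = difflib.get_close_matches(token, known_surnames, n=1, cutoff=cutoff)
    return matches[0] if matches else None


known = ['Karlsson', 'J~rvinen', 'Tapanainen']
print(canonical_surname('J/irvinen', known))  # J~rvinen
print(canonical_surname('Hudson', known))     # None
```

With this, a surname found in the body text can be replaced by its canonical spelling from the reference list before the citation strings are counted.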
import re
import PyPDF2
from PyPDF2.pdf import *  # to import names used in the original `extractText`

# NOTE: written against the old PyPDF2 1.x API (see the link below)

# --- functions ---

def myExtractText(self, distance=None):
    # original code from `page.extractText()`
    # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645

    text = u_("")

    content = self["/Contents"].getObject()

    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)

    prev_x = 0
    prev_y = 0

    for operands, operator in content.operations:
        # used only for test to see values in variables
        #print('>>>', operator, operands)

        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"

        if operator == b_("Tm"):
            if distance is True:
                text += '\n'
            elif isinstance(distance, int):
                x = operands[-2]
                y = operands[-1]
                diff_x = prev_x - x
                diff_y = prev_y - y
                #print('>>>', diff_x, diff_y - y)
                #text += f'| {diff_x}, {diff_y - y} |'
                if diff_y > distance or diff_y < 0:  # (bigger margin) or (move to top in next column)
                    text += '\n'
                    #text += '\n'  # to add empty line between elements
                prev_x = x
                prev_y = y

    return text

# --- main ---

pdfFileObj = open('A97-1011.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

text = ''

for page in pdfReader.pages:
    #text += page.extractText()  # original function
    #text += myExtractText(page)  # modified function (works like original version)
    #text += myExtractText(page, True)  # modified function (add `\n` after every `Tm`)
    text += myExtractText(page, 17)  # modified function (add `\n` only if distance is bigger than `17`)

# get only text after word `References`
pos = text.lower().find('references')

# only references as text
references = text[pos+len('references '):]

# doc without references
doc = text[:pos]

# references as list
references = references.split('\n')

# remove empty lines and lines which have 2 chars (ie. page number)
references = [item.strip() for item in references if len(item.strip()) > 2]

print('\n--- names ---\n')

data = []

for number, line in enumerate(references, 1):
    line = line.strip()
    if line:  # skip empty line
        authors_and_year = re.match(r'((.*)\. (\d{4})\.)', line)
        if not authors_and_year:
            continue  # skip lines which don't look like `Authors. Year.`
        text, authors, year = authors_and_year.groups()
        #print(text, '|', authors, '|', year)
        names = re.split(',[ ]*and |,[ ]*| and ', authors)
        #print(names)
        # [(name, last_name), ...]
        names = [(name, name.split(' ')[-1]) for name in names]
        #print(names)
        #print(' line:', line)
        print(' text:', text)
        print('authors:', authors)
        print(' year:', year)
        print(' names:', names)
        print('---')
        data.append((authors, names, year))

print('\n--- counting ---\n')

# https://guides.lib.monash.edu/citing-referencing/APA-In-text
# Tapanainen and J/~rvine,

for authors, names, year in data:
    print('authors:', authors)
    print(' year:', year)
    print(' names:', names)
    print(' et al.:', len(names) > 1)
    print(' and :', len(names) == 2)
    print('---')

    first_lastname = names[0][-1]
    print(doc.count(first_lastname), first_lastname)
    print(doc.count(first_lastname + ', ' + year), first_lastname + ', ' + year)
    print(doc.count(first_lastname + ' (' + year + ')'), first_lastname + ' (' + year + ')')

    if len(names) > 1:
        first_lastname_et_al = first_lastname + ' et al.'
        print(doc.count(first_lastname_et_al), first_lastname_et_al)
        print(doc.count(first_lastname_et_al + ', ' + year), first_lastname_et_al + ', ' + year)
        print(doc.count(first_lastname_et_al + ' (' + year + ')'), first_lastname_et_al + ' (' + year + ')')

    if len(names) == 2:
        all_lastnames = ' and '.join(item[-1] for item in names)
        print(doc.count(all_lastnames), all_lastnames)
        print(doc.count(all_lastnames + ', ' + year), all_lastnames + ', ' + year)
        print(doc.count(all_lastnames + ' (' + year + ')'), all_lastnames + ' (' + year + ')')

    print('----------')
Result of name extraction:
--- names ---
text: Arto Anttila. 1995.
authors: Arto Anttila
year: 1995
names: [('Arto Anttila', 'Anttila')]
---
text: Dekang Lin. 1996.
authors: Dekang Lin
year: 1996
names: [('Dekang Lin', 'Lin')]
---
text: Jason M. Eisner. 1996.
authors: Jason M. Eisner
year: 1996
names: [('Jason M. Eisner', 'Eisner')]
---
text: David G. Hays. 1964.
authors: David G. Hays
year: 1964
names: [('David G. Hays', 'Hays')]
---
text: Hans Jiirgen Heringer. 1993.
authors: Hans Jiirgen Heringer
year: 1993
names: [('Hans Jiirgen Heringer', 'Heringer')]
---
text: Richard Hudson. 1991.
authors: Richard Hudson
year: 1991
names: [('Richard Hudson', 'Hudson')]
---
text: Arvi Hurskainen. 1996.
authors: Arvi Hurskainen
year: 1996
names: [('Arvi Hurskainen', 'Hurskainen')]
---
text: Time J~rvinen. 1994.
authors: Time J~rvinen
year: 1994
names: [('Time J~rvinen', 'J~rvinen')]
---
text: Fred Karlsson, Atro Voutilainen, Juha Heikkil~, and Arto Anttila, editors. 1995.
authors: Fred Karlsson, Atro Voutilainen, Juha Heikkil~, and Arto Anttila, editors
year: 1995
names: [('Fred Karlsson', 'Karlsson'), ('Atro Voutilainen', 'Voutilainen'), ('Juha Heikkil~', 'Heikkil~'), ('Arto Anttila', 'Anttila'), ('editors', 'editors')]
---
text: Fred Karlsson. 1990.
authors: Fred Karlsson
year: 1990
names: [('Fred Karlsson', 'Karlsson')]
---
text: Michael McCord. 1990.
authors: Michael McCord
year: 1990
names: [('Michael McCord', 'McCord')]
---
text: Igor A. Mel'~uk. 1987.
authors: Igor A. Mel'~uk
year: 1987
names: [("Igor A. Mel'~uk", "Mel'~uk")]
---
text: Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen. 1996.
authors: Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen
year: 1996
names: [('Christer Samuelsson', 'Samuelsson'), ('Pasi Tapanainen', 'Tapanainen'), ('Atro Voutilainen', 'Voutilainen')]
---
text: Daniel Sleator and Davy Temperley. 1991.
authors: Daniel Sleator and Davy Temperley
year: 1991
names: [('Daniel Sleator', 'Sleator'), ('Davy Temperley', 'Temperley')]
---
text: Pasi Tapanainen and Time J/irvinen. 1994.
authors: Pasi Tapanainen and Time J/irvinen
year: 1994
names: [('Pasi Tapanainen', 'Tapanainen'), ('Time J/irvinen', 'J/irvinen')]
---
text: Pasi Tapanainen. 1996.
authors: Pasi Tapanainen
year: 1996
names: [('Pasi Tapanainen', 'Tapanainen')]
---
text: Lucien TesniSre. 1959.
authors: Lucien TesniSre
year: 1959
names: [('Lucien TesniSre', 'TesniSre')]
---
text: Atro Voutilainen. 1995.
authors: Atro Voutilainen
year: 1995
names: [('Atro Voutilainen', 'Voutilainen')]
---
Result of counting:
--- counting ---
authors: Arto Anttila
year: 1995
names: [('Arto Anttila', 'Anttila')]
et al.: False
and : False
---
1 Anttila
0 Anttila, 1995
1 Anttila (1995)
----------
authors: Dekang Lin
year: 1996
names: [('Dekang Lin', 'Lin')]
et al.: False
and : False
---
4 Lin
0 Lin, 1996
1 Lin (1996)
----------
authors: Jason M. Eisner
year: 1996
names: [('Jason M. Eisner', 'Eisner')]
et al.: False
and : False
---
2 Eisner
2 Eisner, 1996
0 Eisner (1996)
----------
authors: David G. Hays
year: 1964
names: [('David G. Hays', 'Hays')]
et al.: False
and : False
---
1 Hays
1 Hays, 1964
0 Hays (1964)
----------
authors: Hans Jiirgen Heringer
year: 1993
names: [('Hans Jiirgen Heringer', 'Heringer')]
et al.: False
and : False
---
1 Heringer
1 Heringer, 1993
0 Heringer (1993)
----------
authors: Richard Hudson
year: 1991
names: [('Richard Hudson', 'Hudson')]
et al.: False
and : False
---
2 Hudson
1 Hudson, 1991
1 Hudson (1991)
----------
authors: Arvi Hurskainen
year: 1996
names: [('Arvi Hurskainen', 'Hurskainen')]
et al.: False
and : False
---
1 Hurskainen
0 Hurskainen, 1996
1 Hurskainen (1996)
----------
authors: Time J~rvinen
year: 1994
names: [('Time J~rvinen', 'J~rvinen')]
et al.: False
and : False
---
2 J~rvinen
1 J~rvinen, 1994
1 J~rvinen (1994)
----------
authors: Fred Karlsson, Atro Voutilainen, Juha Heikkil~, and Arto Anttila, editors
year: 1995
names: [('Fred Karlsson', 'Karlsson'), ('Atro Voutilainen', 'Voutilainen'), ('Juha Heikkil~', 'Heikkil~'), ('Arto Anttila', 'Anttila'), ('editors', 'editors')]
et al.: True
and : False
---
3 Karlsson
0 Karlsson, 1995
0 Karlsson (1995)
2 Karlsson et al.
1 Karlsson et al., 1995
1 Karlsson et al. (1995)
----------
authors: Fred Karlsson
year: 1990
names: [('Fred Karlsson', 'Karlsson')]
et al.: False
and : False
---
3 Karlsson
0 Karlsson, 1990
1 Karlsson (1990)
----------
authors: Michael McCord
year: 1990
names: [('Michael McCord', 'McCord')]
et al.: False
and : False
---
1 McCord
1 McCord, 1990
0 McCord (1990)
----------
authors: Igor A. Mel'~uk
year: 1987
names: [("Igor A. Mel'~uk", "Mel'~uk")]
et al.: False
and : False
---
1 Mel'~uk
0 Mel'~uk, 1987
1 Mel'~uk (1987)
----------
authors: Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen
year: 1996
names: [('Christer Samuelsson', 'Samuelsson'), ('Pasi Tapanainen', 'Tapanainen'), ('Atro Voutilainen', 'Voutilainen')]
et al.: True
and : False
---
1 Samuelsson
0 Samuelsson, 1996
0 Samuelsson (1996)
1 Samuelsson et al.
0 Samuelsson et al., 1996
1 Samuelsson et al. (1996)
----------
authors: Daniel Sleator and Davy Temperley
year: 1991
names: [('Daniel Sleator', 'Sleator'), ('Davy Temperley', 'Temperley')]
et al.: True
and : True
---
1 Sleator
0 Sleator, 1991
0 Sleator (1991)
0 Sleator et al.
0 Sleator et al., 1991
0 Sleator et al. (1991)
0 Sleator and Temperley
0 Sleator and Temperley, 1991
0 Sleator and Temperley (1991)
----------
authors: Pasi Tapanainen and Time J/irvinen
year: 1994
names: [('Pasi Tapanainen', 'Tapanainen'), ('Time J/irvinen', 'J/irvinen')]
et al.: True
and : True
---
6 Tapanainen
0 Tapanainen, 1994
0 Tapanainen (1994)
0 Tapanainen et al.
0 Tapanainen et al., 1994
0 Tapanainen et al. (1994)
0 Tapanainen and J/irvinen
0 Tapanainen and J/irvinen, 1994
0 Tapanainen and J/irvinen (1994)
----------
authors: Pasi Tapanainen
year: 1996
names: [('Pasi Tapanainen', 'Tapanainen')]
et al.: False
and : False
---
6 Tapanainen
3 Tapanainen, 1996
0 Tapanainen (1996)
----------
authors: Lucien TesniSre
year: 1959
names: [('Lucien TesniSre', 'TesniSre')]
et al.: False
and : False
---
0 TesniSre
0 TesniSre, 1959
0 TesniSre (1959)
----------
authors: Atro Voutilainen
year: 1995
names: [('Atro Voutilainen', 'Voutilainen')]
et al.: False
and : False
---
3 Voutilainen
0 Voutilainen, 1995
1 Voutilainen (1995)
----------
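In the results above the bare surname count is shared between different references (Karlsson gives 3 for both the 1990 and the 1995 entry), so for the final table it seems safer to sum only the variants that contain the year. A minimal sketch, using a few of the counts printed above as sample data; the function name total_mentions and the Ref# numbering in this sample are only illustrative.

```python
def total_mentions(variant_counts, year):
    """Sum counts of year-qualified variants only; bare surnames are ambiguous."""
    return sum(count for variant, count in variant_counts.items() if year in variant)


# (reference, year, counts per searched string) copied from the output above
results = [
    ('Richard Hudson. 1991.', '1991',
     {'Hudson': 2, 'Hudson, 1991': 1, 'Hudson (1991)': 1}),
    ('Fred Karlsson. 1990.', '1990',
     {'Karlsson': 3, 'Karlsson, 1990': 0, 'Karlsson (1990)': 1}),
    ('Pasi Tapanainen. 1996.', '1996',
     {'Tapanainen': 6, 'Tapanainen, 1996': 3, 'Tapanainen (1996)': 0}),
]

print('Ref#  Count  Reference')
for number, (reference, year, variant_counts) in enumerate(results, 1):
    print(f'{number:<5} {total_mentions(variant_counts, year):<6} {reference}')
```

For this sample the totals are 2, 1 and 3, which gives the Ref# / Count / Reference layout asked for in the question.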