How to count the number of words from a list in text extracted from a PDF using Python?
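I am trying to count the occurrences of a list of words in text extracted from a PDF, but every count comes back as 0, which is not correct.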
total_number_of_keywords = 0
pdf_file = "CapitalCorp.pdf"
tables = []
words = ['blank', 'warrant ', 'offering', 'combination ', 'SPAC', 'founders']
count = {}  # is a dictionary data structure in Python
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for i, pg in enumerate(pages):
        tbl = pages[i].extract_tables()
        for elem in words:
            count[elem] = 0
        for line in f'{i} --- {tbl}':
            elements = line.split()
            for word in words:
                count[word] = count[word] + elements.count(word)
print(count)
This will do the job:
import pdfplumber

pdf_file = "CapitalCorp.pdf"
words = ['blank', 'warrant ', 'offering', 'combination ', 'SPAC', 'founders']

# Get text: join the extracted text of every page into one string
text = ''
with pdfplumber.open(pdf_file) as pdf:
    for page in pdf.pages:
        # Guard against pages with no extractable text
        # (str(None) would append the literal 'None')
        text = text + '\n' + (page.extract_text() or '')

# Set up the count dictionary
count = {}
for elem in words:
    count[elem] = 0

# Count occurrences (str.count is case-sensitive and matches substrings)
for el in words:
    count[el] = text.count(el)
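First, you store the content of the PDF in the variable text, which is a string.
Then you set up the count dictionary, with one key for each element of words and each value initialized to 0.
Finally, you use the count() method to count the occurrences of each element of words in text, and store the result under the corresponding key of the count dictionary.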
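One caveat about the counting step: str.count() is case-sensitive and counts substrings, so 'blank' would also be counted inside 'blanket', and keywords with a trailing space such as 'warrant ' will miss a match that is followed by punctuation or a line break. If whole-word, case-insensitive counts are what you want, a minimal sketch along these lines might help. The helper name count_keywords and the sample string are made up for illustration; in practice you would pass the text string built from the PDF.

```python
import re


def count_keywords(text, keywords):
    """Count whole-word, case-insensitive occurrences of each keyword.

    Hypothetical helper for illustration; not part of pdfplumber.
    """
    counts = {}
    for kw in keywords:
        term = kw.strip()  # drop stray spaces such as in 'warrant '
        # \b anchors the match at word boundaries on both sides
        pattern = r'\b' + re.escape(term) + r'\b'
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


# Made-up sample text standing in for the extracted PDF text
sample = "Blank check offering. The warrant and the warrants; a blanket offering."
result = count_keywords(sample, ['blank', 'warrant ', 'offering'])
print(result)
```

Note that the keys of the returned dictionary are the stripped keywords, so 'warrant ' is reported as 'warrant'. Plurals such as 'warrants' are deliberately not counted here; if they should be, the pattern would need to be loosened.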