How to parse a table in a PDF for a non-English language

Question

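I am using Camelot and tabula to parse PDF files that contain Cyrillic characters. However, in the output CSV file I get garbled characters, with no sign of any Russian text.

What can help me parse tables in PDFs written in a non-English language?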

import camelot
file = 'file-name.pdf'
tables = camelot.read_pdf(file, pages = "1-end", encoding='utf-8')

Output: 00550529-1295-06-UP. Р§Р§45
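The garbled fragment Р§Р§45 has the byte pattern you get when UTF-8 encoded Cyrillic text is read back as Windows-1251. A minimal sketch of that round trip follows. The original value "ЧЧ45" is an assumption, chosen because it reproduces the garbled text exactly, and the helper name repair_mojibake is hypothetical. Whether this is what actually happened to the CSV depends on the program used to open it.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Windows-1251.

    Returns the input unchanged if it does not fit that pattern.
    """
    try:
        return text.encode("cp1251").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Assumed original cell value (not taken from the real PDF).
original = "ЧЧ45"

# Reading UTF-8 bytes with the wrong codec produces the garbled form.
garbled = original.encode("utf-8").decode("cp1251")
print(garbled)

# Reversing the mistake recovers the Cyrillic text.
repaired = repair_mojibake(garbled)
print(repaired)
```

If the repaired text looks right, the data itself is probably intact and only the encoding used to open the file needs to change. The repair is not guaranteed for every string, because some bytes have no Windows-1251 mapping.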

Answer 1

So, basically, Camelot handles Cyrillic characters quite well.

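# Install first, from a shell: pip install "camelot-py[cv]"
import pandas as pd
import camelot

file = 'file-name.pdf'
# Extract the tables found on pages 4 and 5
tables = camelot.read_pdf(file, pages='4,5', encoding='utf-8')
# Each extracted table exposes its cells as a pandas DataFrame
df_p4 = tables[0].df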

The output will be very raw and will need cleaning up, but the characters are not corrupted, which I consider a good result.
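The cleanup mentioned above could look roughly like the sketch below. It uses only pandas, with a small hand-made DataFrame standing in for tables[0].df. The sample values, the clean_table helper, the output file name, and the assumption that the first row holds the headers are all illustrative, not taken from the real PDF. Writing the CSV as utf-8-sig adds a byte order mark, which usually helps spreadsheet programs such as Excel recognise the file as UTF-8 and show the Cyrillic text correctly.

```python
import tempfile
from pathlib import Path

import pandas as pd


def clean_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Tidy a raw table of strings, such as the one in tables[0].df."""
    # Collapse line breaks and repeated whitespace inside every cell.
    df = raw.apply(
        lambda col: col.astype(str).str.replace(r"\s+", " ", regex=True).str.strip()
    )
    # Assumption: the first row holds the column headers.
    df.columns = df.iloc[0].tolist()
    return df.iloc[1:].reset_index(drop=True)


# Hypothetical stand-in for tables[0].df: string cells, integer column labels.
raw = pd.DataFrame(
    [
        ["Номер", "Описание"],
        ["00550529-1295-06-UP ", "Первая\nстрока"],
    ]
)

cleaned = clean_table(raw)

# Hypothetical output location; utf-8-sig writes a BOM at the start of the file.
out_path = Path(tempfile.gettempdir()) / "table_p4.csv"
cleaned.to_csv(out_path, index=False, encoding="utf-8-sig")
```

Reading the file back with pd.read_csv(out_path, encoding="utf-8-sig") is a quick way to confirm that the Cyrillic text survived the round trip.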

pdf

parsing

python-3.x

python-camelot