How to improve Hindi text extraction?

Question

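I am trying to extract Hindi text from a PDF. I tried all the usual methods for extracting text from a PDF, but none of them worked. There are explanations of why it doesn't work, but no answers. So I decided to convert the PDF to an image and then use pytesseract to extract the text. I have downloaded the Hindi trained data, but that also gives highly inaccurate text.

This is the actual Hindi text from the PDF (download link):

This is my code so far: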

import fitz

filepath = r"D:\BADI KA BANS-Ward No-002.pdf"  # raw string: backslash is not an escape

doc = fitz.open(filepath)
page = doc.load_page(3)  # zero-based page index (loadPage in older PyMuPDF)
pix = page.get_pixmap()  # getPixmap in older PyMuPDF
output = "outfile.png"
pix.save(output)  # writePNG in older PyMuPDF
from PIL import Image
import pytesseract

# Include tesseract executable in your path
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Create an image object of PIL library
image = Image.open('outfile.png')

# pass image into pytesseract module
# pytesseract is trained in many languages
image_to_text = pytesseract.image_to_string(image, lang='hin')

# Print the text
print(image_to_text)

Here is some sample output:

कार बिता देवी व ०... नाम बाइुनान िक०क नाक तो
पति का नाव: रवजी लात. “50९... पिला का सामशामाव.... “पति का नाम: बादुलल
कान सब: 43 लसमनंध्या: 93९. मकान ंब्या: 3९
आप: 29 _ लिंग सी. | आइ 57 लिंग पुरुष आप: 62 लिंग सी
एजगल्णब्णस्य (बन्द जगाख्मिणण्य
नमः बायगी बसों ०४... नि बयावर्णो ०५०... निफर सनक नी
चिता का नामजबूजल वर्ष.“ ००० | पिला का नामब्राइलाल वर्षो... 0 2... | पिता कामामशुल चब्द .... “20०
|सकानसंब्या: 43९ बसवकंब्या: 43९. कान संब्या: 44
जाए: 27 लिंग सो कई: 27 नि खी मा लिंग पुरुष

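This question has an answer that seems to say how to do it, but it gives no explanation.

Is there any way to do this other than training the language model myself?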

Answer 1

It seems the pdfplumber module does the job:

import pdfplumber

pdf = pdfplumber.open('BADI KA BANS-Ward No-002.pdf')

pages = pdf.pages
text = ""

for page in pages:
    # extract_text() can return None for a page without a text layer
    text += page.extract_text() or ""

pdf.close()

with open('output.txt', 'w', encoding="utf8") as f:
    f.write(text)

Output (fragment):

ररजज ननरररचन आजयग, ररजससरन 
 पपचरजत चचनरर ननरररचक नरमररलल, 2021   
नजलरपररषद कर नरम : जजपचर नज॰ प॰ सदसज ननरररचन ककत : 21
पपचरजत सनमनत कर नरम : सरपगरनकर पप॰ स॰ सदसज ननरररचन ककत : 6
गरमपपचरजत : बरल कर बरपस रररर कमरपक : 2
नरधरनसभर ककत कक सपखजर एरप नरम:-56-बगर
मचखज गरपर        : लकमलपचरर उरर नटरनलपचरर
तहसलल         : सरपगरनकर
नजलर            : जजपचर
पचनरलकण कर नरररण
पचनरलकण कर रषर  :  2021
पचनरलकण कर पकरर               :  गहन पचनरलकण
अहतर र ददनरपक  :  01-01-2021
अपनतम पकरशन कक ददनरपक     :  19-04-2021
...

Input (first page):

But I don't know any Hindi, so I can't tell whether the output is good enough.
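One rough way to judge the output without reading Hindi is to count Devanagari combining marks (vowel signs, virama, and so on). Real Hindi text is full of them, so a line with none at all is almost certainly garbled. The sketch below is only a heuristic; `devanagari_mark_ratio` is a hypothetical helper, and the reference string is the OCR result for the same header line taken from Answer 3.

```python
import unicodedata

def devanagari_mark_ratio(text):
    """Share of combining marks among Devanagari letters and marks in text."""
    letters = marks = 0
    for ch in text:
        if not "\u0900" <= ch <= "\u097f":
            continue  # ignore spaces, Latin characters, ASCII punctuation
        category = unicodedata.category(ch)
        if category.startswith("M"):    # Mn / Mc: vowel signs, virama, ...
            marks += 1
        elif category.startswith("L"):  # Lo: consonants, independent vowels
            letters += 1
    total = letters + marks
    return marks / total if total else 0.0

# First header line as extracted by pdfplumber (see the output above)
extracted = "ररजज ननरररचन आजयग, ररजससरन"
# The same line as recognized by the OCR approach in Answer 3
reference = "राज्य निर्वाचन आयोग, राजस्थान"

print(devanagari_mark_ratio(extracted))
print(devanagari_mark_ratio(reference))
```

A ratio of zero for the pdfplumber line suggests that the vowel signs were lost or mapped to the wrong characters during extraction, so the text is probably not usable as-is.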

https://github.com/jsvine/pdfplumber

Installing the module (Windows 7, Python 3.8):

pip install pdfplumber

The module is said to handle tables as well. I haven't tried that, though.

Answer 2

If you want to get the text from these 'cards', this is how I did it for page 3 with the tabula-py module:

import tabula

pdf_file = "BADI KA BANS-Ward No-002.pdf"
page = 3

x = 30      # left edge of the table
y = 160     # top edge of the table
w = 173     # width of a card
h = 73      # height of a card
photo = 61  # width of a photo

rows = 8    # number of rows of the table
cols = 3    # number of columns of the table

counter = 1

def get_area(row, col):
    ''' return area of the card in given position in the table '''
    top    = y + h * row
    left   = x + w * col
    bottom = top + h
    right  = left + w - photo
    return (top, left, bottom, right)

for row in range(rows):
    for col in range(cols):
        file_name = f"card_{counter:03d}.txt"
        tabula.convert_into(pdf_file, file_name,
                            pages=page,
                            output_format="csv",
                            java_options="-Dfile.encoding=UTF8",
                            lattice=False,
                            area=get_area(row, col))
        counter += 1

Input:

Output:

24 txt files:

card_001.txt
card_002.txt
card_003.txt
card_004.txt
.
.
.
card_023.txt
card_024.txt

card_001.txt:

1 RBP2469583
नरम: आरतल चररलर
नपतर कर नरम:लरलर ररम चररल
मकरन सखजर: १९
आज:  21 ललग: सल

card_002.txt:

2 MRQ3101367
नरम: सरज दरल
नपतर कर नरम:ररमररतरर
मकरन सखजर: रल /18
आज:  44 ललग: सल

card_024.txt:

24 RBP0230979
नरम: सनमतकरर
पनत कर नरम: हररलसह
मकरन सखजर: 13
आज:  41 ललग: सल

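As far as I can tell, all the 'cards' have the same size, so the solution can be applied to all pages that look similar. Unfortunately, the pages differ, so the initial variables have to be changed for each page. I see no way to make those changes automatically. One possible improvement is to take the card's number from the card itself instead of using a simple counter.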
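Taking the number from the card could look roughly like the sketch below. It assumes that the first line of each extracted card is a serial number followed by an ID of three uppercase letters and seven digits, as in the three sample cards above; `parse_card_header` and `card_file_name` are hypothetical helpers, not part of tabula-py.

```python
import re

# Assumption: first line looks like "1 RBP2469583" (serial number, then ID)
CARD_HEADER = re.compile(r"^\s*(\d+)\s+([A-Z]{3}\d{7})\s*$")

def parse_card_header(card_text):
    """Return (serial, card_id) from a card's text, or None if no match."""
    lines = card_text.splitlines()
    if not lines:
        return None
    match = CARD_HEADER.match(lines[0])
    if match is None:
        return None
    return int(match.group(1)), match.group(2)

def card_file_name(card_text, fallback):
    """Name the file after the card's own serial number, else the fallback."""
    header = parse_card_header(card_text)
    serial = header[0] if header else fallback
    return f"card_{serial:03d}.txt"

print(parse_card_header("2 MRQ3101367\nनरम: सरज दरल"))
print(card_file_name("24 RBP0230979", fallback=1))
```

The fallback keeps the simple counter for cards whose first line does not match, for example when the area was set slightly off and the header got cut.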

https://pypi.org/project/tabula-py/

https://aegis4048.github.io/parse-pdf-files-while-retaining-structure-with-tabula-py

Answer 3

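I'll give some ideas on how to process your images, but I'll limit that to page 3 of the given document, i.e. the page shown in the question.

To convert the PDF page to an image, I used pdf2image.

For the OCR, I use pytesseract, but with lang='Devanagari' instead of lang='hin', cf. the Tesseract GitHub. In general, make sure to work through Improving the quality of the output from the Tesseract documentation, especially the page segmentation method.

Here's a (lengthy) description of the whole procedure: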

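Inverse binarize the image for contour finding: white text, shapes, etc. on a black background.
Find all contours and pick out the two very large ones, i.e. the two tables.
Extract the texts outside of the two tables:
1. Mask out the tables in the binarized image.
2. Apply morphological closing to connect the remaining text lines.
3. Find the contours and bounding rectangles of these text lines.
4. Run pytesseract to extract the texts.
Extract the texts inside the two tables:
1. Find the bounding rectangles of the cells inside the current table.
2. For the first table:
  1. Run pytesseract to extract the texts as-is.
3. For the second table:
  1. Floodfill the rectangle around the number to prevent faulty OCR output.
  2. Mask the left (Hindi) and right (English) parts.
  3. Run pytesseract with lang='Devanagari' on the left part and lang='eng' on the right part to improve the OCR quality for both.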

Here's the full code:

import cv2
import numpy as np
import pdf2image
import pytesseract

# Extract page 3 from PDF in proper quality
page_3 = np.array(pdf2image.convert_from_path('BADI KA BANS-Ward No-002.pdf',
                                              first_page=3, last_page=3,
                                              dpi=300, grayscale=True)[0])

# Inverse binarize for contour finding
thr = cv2.threshold(page_3, 128, 255, cv2.THRESH_BINARY_INV)[1]

# Find contours w.r.t. the OpenCV version
cnts = cv2.findContours(thr, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]

# STEP 1: Extract texts outside of the two tables

# Mask out the two tables
cnts_tables = [cnt for cnt in cnts if cv2.contourArea(cnt) > 10000]
no_tables = cv2.drawContours(thr.copy(), cnts_tables, -1, 0, cv2.FILLED)

# Find bounding rectangles of texts outside of the two tables
no_tables = cv2.morphologyEx(no_tables, cv2.MORPH_CLOSE, np.full((21, 51), 255))
cnts = cv2.findContours(no_tables, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
rects = sorted([cv2.boundingRect(cnt) for cnt in cnts], key=lambda r: r[1])

# Extract texts from each bounding rectangle
print('\nExtract texts outside of the two tables\n')
for (x, y, w, h) in rects:
    text = pytesseract.image_to_string(page_3[y:y+h, x:x+w],
                                       config='--psm 6', lang='Devanagari')
    text = text.replace('\n', '').replace('\f', '')
    print('x: {}, y: {}, text: {}'.format(x, y, text))

# STEP 2: Extract texts from inside of the two tables

rects = sorted([cv2.boundingRect(cnt) for cnt in cnts_tables],
               key=lambda r: r[1])

# Iterate each table
for i_r, (x, y, w, h) in enumerate(rects, start=1):

    # Find bounding rectangles of cells inside of the current table
    cnts = cv2.findContours(page_3[y+2:y+h-2, x+2:x+w-2],
                            cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    inner_rects = sorted([cv2.boundingRect(cnt) for cnt in cnts],
                         key=lambda r: (r[1], r[0]))

    # Extract texts from each cell of the current table
    print('\nExtract texts inside table {}\n'.format(i_r))
    for (xx, yy, ww, hh) in inner_rects:

        # Set current coordinates w.r.t. full image
        xx += x
        yy += y

        # Get current cell
        cell = page_3[yy+2:yy+hh-2, xx+2:xx+ww-2]

        # For table 1, simply extract texts as-is
        if i_r == 1:
            text = pytesseract.image_to_string(cell, config='--psm 6',
                                               lang='Devanagari')
            text = text.replace('\n', '').replace('\f', '')
            print('x: {}, y: {}, text: {}'.format(xx, yy, text))

        # For table 2, extract single elements
        if i_r == 2:

            # Floodfill rectangles around numbers
            ys, xs = np.min(np.argwhere(cell == 0), axis=0)
            temp = cv2.floodFill(cell.copy(), None, (xs, ys), 255)[1]
            mask = cv2.floodFill(thr[yy+2:yy+hh-2, xx+2:xx+ww-2].copy(),
                                 None, (xs, ys), 0)[1]

            # Extract left (Hindi) and right (English) parts
            mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE,
                                    np.full((2 * hh, 5), 255))
            cnts = cv2.findContours(mask,
                                    cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
            cnts = cnts[0] if len(cnts) == 2 else cnts[1]
            boxes = sorted([cv2.boundingRect(cnt) for cnt in cnts],
                           key=lambda b: b[0])

            # Extract texts from each part of the current cell
            for i_b, (x_b, y_b, w_b, h_b) in enumerate(boxes, start=1):

                # For the left (Hindi) part, extract Hindi texts
                if i_b == 1:

                    text = pytesseract.image_to_string(
                        temp[y_b:y_b+h_b, x_b:x_b+w_b],
                        config='--psm 6',
                        lang='Devanagari')
                    text = text.replace('\f', '')

                # For the right (English) part, extract English texts
                if i_b == 2:

                    text = pytesseract.image_to_string(
                        temp[y_b:y_b+h_b, x_b:x_b+w_b],
                        config='--psm 6',
                        lang='eng')
                    text = text.replace('\f', '')

                print('x: {}, y: {}, text:\n{}'.format(xx, yy, text))

And here are the first few lines of the output:

Extract texts outside of the two tables

x: 972, y: 93, text: राज्य निर्वाचन आयोग, राजस्थान
x: 971, y: 181, text: पंचायत चुनाव निर्वाचक नामावली, 2021
x: 166, y: 610, text: मिश्र का बाढ़ ,श्रीराम की नॉगल
x: 151, y: 3417, text: आयु 1 जनवरी 2021 के अनुसार
x: 778, y: 3419, text: पृष्ठ संख्या : 3 / 10

Extract texts inside table 1

x: 146, y: 240, text: जिलापरिषद का नाम : जयपुर
x: 1223, y: 240, text: जि° प° सदस्य निर्वाचन क्षेत्र : 21
x: 146, y: 327, text: पंचायत समिति का नाम : सांगानेर
x: 1223, y: 327, text: पं° स° सदस्य निर्वाचन क्षेत्र : 6
x: 146, y: 415, text: ग्रामपंचायत : बडी का बांस
x: 1223, y: 415, text: वार्ड क्रमांक : 2
x: 146, y: 502, text: विधानसभा क्षेत्र की संख्या एवं नाम:- 56-बगरु

Extract texts inside table 2

x: 142, y: 665, text:
1 RBP2469583
नाम: आरती चावला
पिता का नामःलाला राम चावला
मकान संख्याः १९
आयुः 21 लिंगः स्त्री

x: 142, y: 665, text:
Photo is
Available

x: 867, y: 665, text:
2 MRQ3101367
नामः सूरज देवी
पिता का नामःरामावतार
मकान संख्याः डी /18
आयुः 44 लिंगः स्त्री

x: 867, y: 665, text:
Photo is
Available

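I checked some of the texts with a manual character-by-character comparison and think they look quite good, but since I can neither understand Hindi nor read Devanagari script, I can't comment on the overall quality of the OCR. Please let me know!

Annoyingly, the number 9 in the corresponding 'card' is wrongly extracted as 2. I assume this happens because its font differs from the rest of the text, in combination with lang='Devanagari'. I couldn't find a solution for that short of extracting the rectangle separately from the 'card'.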
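On a related note, the OCR output above returns the house number in Devanagari digits (१९) while the other numbers come out as ASCII. If ASCII digits are wanted throughout, a small post-processing step can normalize them. This is only a sketch; `to_ascii_digits` is a hypothetical helper, and it does not fix the misrecognized 9.

```python
# Map the ten Devanagari digits (U+0966..U+096F) to ASCII '0'..'9'
DEVANAGARI_TO_ASCII = {0x0966 + i: ord("0") + i for i in range(10)}

def to_ascii_digits(text):
    """Replace Devanagari digits with ASCII digits; leave everything else."""
    return text.translate(DEVANAGARI_TO_ASCII)

print(to_ascii_digits("मकान संख्याः १९"))
```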

----------------------------------------
System information
----------------------------------------
Platform:      Windows-10-10.0.19041-SP0
Python:        3.9.1
PyCharm:       2021.1.1
NumPy:         1.19.5
OpenCV:        4.5.2
pdf2image      1.14.0
pytesseract:   5.0.0-alpha.20201127
----------------------------------------

Answer 4

If you want to scrape 100% correct text from a PDF, you should use the correct font family and encoding when parsing from image to text.
