How to detect language or script from an input image using Python or Tesseract OCR?

Question

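Given an input image that may contain text in any language or writing system, how do I detect the script of the text in the picture?

Any Python-based or Tesseract-OCR-based solution would be appreciated.

Note that "script" here means a writing system such as Latin, Cyrillic, or Devanagari, used for languages such as English, Russian, and Hindi, respectively.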

Answer 1

Prerequisites:

Install Tesseract: sudo apt install tesseract-ocr tesseract-ocr-all
Install pytesseract: pip install pytesseract

Script detection:

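import re

import pytesseract

def detect_image_lang(img_path):
    """Return (script_name, confidence), or (None, 0.0) if detection fails."""
    try:
        # OSD = orientation and script detection
        osd = pytesseract.image_to_osd(img_path)
    except pytesseract.TesseractError:
        # Tesseract could not run OSD, e.g. the image has too little text
        return None, 0.0
    script = re.search(r"Script: ([a-zA-Z]+)", osd)
    conf = re.search(r"Script confidence: (\d+(?:\.\d+)?)", osd)
    if script is None or conf is None:
        # The report did not contain the expected fields
        return None, 0.0
    return script.group(1), float(conf.group(1))

script_name, confidence = detect_image_lang("image.png")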
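
The text returned by image_to_osd is a short report of "Key: value" lines, so instead of one regex per field it can be parsed into a dictionary in one pass. The following is a minimal sketch; parse_osd is a hypothetical helper name, and SAMPLE_OSD is an illustrative report in the usual OSD layout, not output from a real image.

```python
def parse_osd(osd_text):
    """Turn an OSD report of 'Key: value' lines into a dict."""
    fields = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that have no "key: value" shape
        value = value.strip()
        try:
            fields[key.strip()] = float(value)  # numeric fields become floats
        except ValueError:
            fields[key.strip()] = value  # e.g. the script name stays a string
    return fields


# Illustrative sample only (assumed layout, made-up values)
SAMPLE_OSD = """Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 2.01
Script: Latin
Script confidence: 1.11
"""

report = parse_osd(SAMPLE_OSD)
print(report["Script"], report["Script confidence"])
```

With a real image, the same helper could be applied as parse_osd(pytesseract.image_to_osd("image.png")). Low script-confidence values are probably best treated as "unknown" rather than trusted.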

Language detection:

After running OCR (using Tesseract), pass the recognized text through the langdetect library (or any other language-detection library).
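
A rough sketch of that two-step approach, assuming langdetect is installed (pip install langdetect). The function names clean_ocr_text and detect_language_from_image, and the min_chars threshold, are hypothetical choices for illustration, not part of either library.

```python
def clean_ocr_text(text):
    """Collapse runs of whitespace (newlines, form feeds) into single spaces."""
    return " ".join(text.split())


def detect_language_from_image(img_path, ocr_lang=None, min_chars=20):
    """OCR the image, then guess the language of the recognized text.

    Returns (language_code, probability), or (None, 0.0) if no guess is possible.
    """
    # Third-party imports are local so clean_ocr_text stays usable on its own.
    import pytesseract
    from langdetect import DetectorFactory, detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    # langdetect is non-deterministic by default; fix the seed for repeatable results.
    DetectorFactory.seed = 0

    text = clean_ocr_text(pytesseract.image_to_string(img_path, lang=ocr_lang))
    if len(text) < min_chars:
        # Very short text gives unreliable guesses (min_chars is an arbitrary cutoff).
        return None, 0.0
    try:
        best = detect_langs(text)[0]  # candidates are sorted by probability
    except LangDetectException:
        # Raised when the text has no usable features, e.g. only digits
        return None, 0.0
    return best.lang, best.prob
```

It might be called as detect_language_from_image("image.png", ocr_lang="eng+rus"). The quality of the guess depends on the OCR step, so choosing ocr_lang to match the script detected earlier is likely to help.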

tesseract

python-tesseract