Error reading file extracted from tesseract using fopen PHP

Question

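I am using fopen in PHP to open a file produced by tesseract OCR. The returned text contains <<<<<<, and fopen seems to read until it finds the first < character and then stop.

The file returned from the OCR: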

P<dsdasdasd<<dasd<adsda<dsada<<<<<<<<<<ec<
dasdasdsdasdasdasdasd<<<<<<<<<<<<<<06

£ y

The echoed output from fopen:

If I view the page source, I see the rest of the text shown in red.

The code I am using:

<?php
file_put_contents("tmpFile.jpg",file_get_contents("1.jpg"));
$cmd = "tesseract tmpFile.jpg ee ";
exec($cmd);
$myfile = fopen("ee.txt", "r") or die("Unable to open file!");
$data= fread($myfile,100000000);
fclose($myfile);
echo $data;
?>

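When I pasted the problematic text into this question, it was hidden here as well.

[Screenshot: the question as typed, next to the rendered question with the text hidden]

[Screenshot: the output and the page source]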

Answer 1

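As far as I can tell, the problem has nothing to do with tesseract or with your input text file.

"fopen reads till it finds the first < character then stops"

I don't think that is true. If it were, why would you see the rest of the text in "view source"? fopen and fread read the whole file; the problem lies in how the browser displays that data.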
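
One way to check this is to compare the number of bytes read with the size of the file on disk. The following is only a minimal sketch: the helper name readWholeFile and the temporary file standing in for ee.txt are hypothetical, and the sample line is copied from the question.

```php
<?php
// Hypothetical helper: opens a file like the question does and returns
// its full contents, or throws if the file cannot be opened or read.
function readWholeFile(string $path): string
{
    $handle = fopen($path, "r");
    if ($handle === false) {
        throw new RuntimeException("Unable to open file: $path");
    }
    // stream_get_contents reads up to end of file, so no size guess is needed.
    $data = stream_get_contents($handle);
    fclose($handle);
    if ($data === false) {
        throw new RuntimeException("Unable to read file: $path");
    }
    return $data;
}

// A temporary file stands in for ee.txt (assumption, for illustration only).
$sample = "P<dsdasdasd<<dasd<adsda<dsada<<<<<<<<<<ec<\n";
$path = tempnam(sys_get_temp_dir(), "ocr");
file_put_contents($path, $sample);

$data = readWholeFile($path);

// If reading really stopped at the first "<", the two numbers would differ.
printf("read %d of %d bytes\n", strlen($data), filesize($path));

unlink($path);
```

If both numbers match for the real ee.txt, the file was read completely and only the display step needs fixing.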

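You want to display a character that is reserved for HTML tags, in this case < (the "less than" sign). The browser treats a < followed by a letter as the start of a tag rather than as text, so what follows is not rendered on the page. That is also why you get red text in "view source": the browser cannot interpret it as valid HTML.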

As a first workaround, simply wrap the <?php block in a <textarea> tag to view the data:

<textarea><?php
/* ...
your regular code here
... */
?></textarea>

The next step should be to encode these special characters before passing them to echo. Have a look at htmlspecialchars or htmlentities.
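
As a rough sketch of that encoding step, the OCR text can be passed through htmlspecialchars before it is echoed. The helper name escapeForHtml is hypothetical, the sample string is copied from the question, and the sketch assumes the page is served as UTF-8.

```php
<?php
// Hypothetical helper: escapes text so the browser shows it literally
// instead of trying to parse it as markup.
function escapeForHtml(string $text): string
{
    // ENT_QUOTES also escapes single quotes; ENT_SUBSTITUTE replaces invalid
    // UTF-8 sequences instead of making the function return an empty string.
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, "UTF-8");
}

// In the question's script, $data would be the string read from ee.txt.
$data = "P<dsdasdasd<<dasd<adsda<dsada<<<<<<<<<<ec<";

// Wrapping the escaped text in <pre> keeps the OCR line breaks visible.
echo "<pre>" . escapeForHtml($data) . "</pre>";
```

htmlspecialchars only converts &, <, > and quotes, which is enough for this case. htmlentities additionally converts every character that has a named entity (such as £), which may help if the page encoding is not declared.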

You can also find useful information on this topic in:

Print less-than and greater-than symbols in PHP
How to display HTML tags as plain text
