How can I make a searchable PDF from a PDF of scanned pages?
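How can I use tesseract in my Java application to make a searchable PDF from a PDF of scanned pages? This is what I have so far; it only returns the recognized text: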
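import android.graphics.Bitmap;
import android.graphics.BitmapFactory;
import android.os.Environment;
import com.googlecode.tesseract.android.TessBaseAPI;
import java.io.File;

// Runs OCR on a single image file and returns the recognized text.
// appContext is assumed to be a Context field of the enclosing class.
String image2Text(String imagePath)
{
    String dataPath = Environment.getExternalStorageDirectory().toString() + "/Android/data/" + appContext.getPackageName() + "/";
    // Tesseract looks for the trained data in <dataPath>/tessdata.
    File tessdata = new File(dataPath, "tessdata");
    if (!tessdata.exists() || !tessdata.isDirectory())
    {
        throw new IllegalArgumentException("Data path must contain subfolder tessdata!");
    }
    Bitmap image = BitmapFactory.decodeFile(imagePath);
    TessBaseAPI baseApi = new TessBaseAPI();
    baseApi.init(dataPath, "eng");
    baseApi.setImage(image);
    String recognizedText = baseApi.getUTF8Text();
    baseApi.end(); // release the native resources
    return recognizedText;
}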
You can use Gnostice XtremeDocumentStudio (for Java).
http://www.gnostice.com/nl_article.asp?id=289&t=How_to_convert_scanned_images_to_searchable_PDF_using_OCR_in_Java
DocumentConverter dc = new DocumentConverter();
DigitizerSettings ds = dc.getPreferences().getDigitizerSettings();
// Run OCR on every image and recognize text elements only.
ds.setDigitizationMode(DigitizationMode.ALL_IMAGES);
ds.setRecognizeElementTypes(RecognizeElementTypes.TEXT);
try {
    // Backslashes in Windows paths must be escaped in Java string literals.
    dc.convertToFile(
        "H:\\Screenshot-2.png",
        "e:\\converted_image.pdf");
} catch (FormatNotSupportedException e) {
    e.printStackTrace();
} catch (ConverterException e) {
    e.printStackTrace();
} catch (XDocException e) {
    e.printStackTrace();
}
Disclaimer: I work for Gnostice.
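If you would rather stay with tesseract itself, one possible route on a desktop JVM is to call the tesseract command-line program, whose `pdf` output config writes a PDF with an invisible text layer over the page image. The following is only a minimal sketch under several assumptions: a `tesseract` executable (3.03 or later) is on the PATH, each page of the scanned PDF has already been rendered to an image file by some other tool (tesseract does not read PDF input), and the class name, method names, and file names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class SearchablePdfSketch {

    // Builds the command: tesseract <image> <outputBase> -l <language> pdf
    // The trailing "pdf" selects tesseract's PDF output config, so the
    // result is written to <outputBase>.pdf.
    static List<String> buildCommand(Path image, Path outputBase, String language) {
        return List.of("tesseract", image.toString(), outputBase.toString(),
                "-l", language, "pdf");
    }

    // Starts tesseract and waits for it; returns the process exit code.
    // Assumes the tesseract executable can be found on the PATH.
    static int runOcr(Path image, Path outputBase, String language)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, outputBase, language))
                .inheritIO() // forward tesseract's messages to this console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: SearchablePdfSketch <page-image> <output-base>");
            return;
        }
        int exitCode = runOcr(Paths.get(args[0]), Paths.get(args[1]), "eng");
        System.out.println("tesseract exited with code " + exitCode);
    }
}
```

Run it with a page image and an output base name, for example `java SearchablePdfSketch page-1.png page-1`, which should leave `page-1.pdf` next to the image if tesseract succeeds. For a multi-page scan you would repeat this per page and merge the resulting PDFs with a PDF library of your choice.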