How to extract only the text from a .ppt using Apache Tika
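import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class Test {
    public static void main(String[] args) throws Exception {
        String data;
        TikaConfig tikaConfig = TikaConfig.getDefaultConfig();
        Metadata metadata = new Metadata();
        ContentHandler handler;
        // Backslashes in a Windows path must be escaped inside a Java string literal
        try (InputStream stream = new BufferedInputStream(new FileInputStream(new File("E:\\AllTypes\\PPT\\Presentation1.pptx")))) {
            Detector detector = tikaConfig.getDetector();
            Parser parser = tikaConfig.getParser();
            MediaType type = detector.detect(stream, metadata);
            metadata.set(Metadata.CONTENT_TYPE, type.toString());
            handler = new BodyContentHandler(-1); // -1 disables the write limit
            parser.parse(stream, handler, metadata, new ParseContext());
            data = handler.toString();
            System.out.println(data);
        }
    }
}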
My input ppt contains only "Hello world!", so I want only Hello world! in the output.
Output:
[Content_Types].xml
_rels/.rels
ppt/slides/_rels/slide1.xml.rels
ppt/_rels/presentation.xml.rels
ppt/presentation.xml
ppt/slides/slide1.xml
Hello world!
ppt/slideLayouts/_rels/slideLayout6.xml.rels
ppt/slideLayouts/_rels/slideLayout7.xml.rels
ppt/slideLayouts/_rels/slideLayout9.xml.rels
ppt/slideLayouts/_rels/slideLayout10.xml.rels
ppt/slideLayouts/_rels/slideLayout8.xml.rels
ppt/slideLayouts/_rels/slideLayout11.xml.rels
ppt/slideLayouts/_rels/slideLayout1.xml.rels
ppt/slideLayouts/_rels/slideLayout2.xml.rels
ppt/slideLayouts/_rels/slideLayout3.xml.rels
ppt/slideLayouts/_rels/slideLayout4.xml.rels
ppt/slideMasters/_rels/slideMaster1.xml.rels
ppt/slideLayouts/slideLayout11.xml
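Click to edit Master title style Click to edit Master text styles Second level Third level Fourth level Fifth level 1/30/2018 ‹#›
ppt/slideLayouts/slideLayout10.xml
Click to edit Master title style Click to edit Master text styles Second level Third level Fourth level Fifth level 1/30/2018 ‹#›
ppt/slideLayouts/slideLayout3.xml
Click to edit Master title style Click to edit Master text styles 1/30/2018 ‹#›
ppt/slideLayouts/slideLayout2.xml
Click to edit Master title style Click to edit Master text styles Second level Third level Fourth level Fifth level 1/30/2018 ‹#›
ppt/slideLayouts/slideLayout1.xml
Click to edit Master title style Click to edit Master subtitle style 1/30/2018 ‹#›
ppt/slideMasters/slideMaster1.xml
Click to edit Master title style Click to edit Master text styles Second level Third level Fourth level Fifth level 1/30/2018 ‹#›
ppt/slideLayouts/slideLayout4.xml
Click to edit Master title style Click to edit Master text styles Second level Third level Fourth level Fifth level Click to edit Master text styles Second level Third level Fourth level Fifth level 1/30/2018 ‹#›
ppt/slideLayouts/slideLayout5.xml
Click to edit Master title style Click to edit Master text styles Click to edit Master text styles Second level Third level Fourth level Fifth level Click to edit Master text styles Click to edit Master text styles Second level Third level Fourth level Fifth level 1/30/2018 ‹#›
ppt/slideLayouts/slideLayout6.xml
Click to edit Master title style 1/30/2018 ‹#›
ppt/slideLayouts/slideLayout7.xml
1/30/2018 ‹#›
ppt/slideLayouts/slideLayout8.xml
Click to edit Master title style Click to edit Master text styles Second level Third level Fourth level Fifth level Click to edit Master text styles 1/30/2018 ‹#›
ppt/slideLayouts/slideLayout9.xml
Click to edit Master title style Click to edit Master text styles 1/30/2018 ‹#›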
ppt/slideLayouts/_rels/slideLayout5.xml.rels
ppt/theme/theme1.xml
docProps/thumbnail.jpeg
ppt/presProps.xml
ppt/tableStyles.xml
ppt/viewProps.xml
docProps/core.xml
PowerPoint Presentation srinuk srinuk 1 2018-01-30T10:19:34Z 2018-01-30T10:22:05Z
docProps/app.xml
2 3 Microsoft Office PowerPoint Widescreen 1 1 0 0 0 false Fonts Used 3 Theme 1 Slide Titles 1 Arial Calibri Calibri Light Office Theme PowerPoint Presentation false false false 15.0000
You can try using tika-app.jar. Just use Tika's text extraction function:
Tika tika = new Tika();
File file = new File("path");
String str = tika.parseToString(file);
This code parses only the text content of the file.
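If you want to cross-check which text actually lives on the slides, here is a minimal sketch that does not use Tika at all. It relies on what the listing in the question shows: a .pptx is a ZIP package, and the slides are stored as ppt/slides/slideN.xml, separate from the layouts and masters. It assumes the text runs are DrawingML `a:t` elements in the transitional OOXML namespace, and it ignores notes, charts and everything else outside the slide parts. The class name `SlideTextOnly` and the method `extract` are made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class SlideTextOnly {
    // Assumption: transitional OOXML; strict-conformance files use a different namespace URI.
    private static final String DRAWINGML_NS =
            "http://schemas.openxmlformats.org/drawingml/2006/main";

    /** Collects the text runs of every ppt/slides/slideN.xml entry, in ZIP entry order. */
    public static List<String> extract(InputStream pptx) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        List<String> runs = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(pptx)) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                // Skip layouts, masters, rels and docProps; keep only the slides themselves.
                if (!entry.getName().matches("ppt/slides/slide\\d+\\.xml")) {
                    continue;
                }
                // Buffer the entry so the XML parser cannot close the underlying ZIP stream.
                byte[] xml = zip.readAllBytes();
                Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
                NodeList textNodes = doc.getElementsByTagNameNS(DRAWINGML_NS, "t");
                for (int i = 0; i < textNodes.getLength(); i++) {
                    runs.add(textNodes.item(i).getTextContent());
                }
            }
        }
        return runs;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java SlideTextOnly <path-to-pptx>
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            extract(in).forEach(System.out::println);
        }
    }
}
```

This only works for the ZIP-based .pptx format, not for the old binary .ppt, so treat it as a sanity check next to `tika.parseToString(file)` rather than a replacement for it.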