Apache POI Word to HTML conversion - word boundaries
I am using the code below to convert a Word document to an HTML file:
public Map convert(String wordDocPath, String htmlPath,
        Map conversionParams)
{
    log.info("Converting word file " + wordDocPath)
    try
    {
        // The backslash must be escaped; "C:\temp" would contain a tab character
        String workingFolder = "C:\\temp"
        File workingFolderFile = new File(workingFolder)
        FileInputStream fis = new FileInputStream(wordDocPath);
        XWPFDocument document = new XWPFDocument(fis);
        // Extracted images are written to, and resolved from, the working folder
        XHTMLOptions options = XHTMLOptions.create().URIResolver(new FileURIResolver(workingFolderFile));
        options.setExtractor(new FileImageExtractor(workingFolderFile))
        File htmlFile = new File(htmlPath);
        OutputStream out = new FileOutputStream(htmlFile)
        XHTMLConverter.getInstance().convert(document, out, options);
        // Release the file handles once the conversion has finished
        out.close()
        fis.close()
        log.info("Converted to HTML file " + htmlPath)
    }
    catch(Exception e)
    {
        log.error("Exception :" + e.getMessage(), e)
    }
}
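The code generates the HTML output correctly.
I need to add some parameters to the document, such as [[AGENT_NAME]], which I will later replace in code using a regular expression. However, Apache POI does not treat this pattern as a single word: sometimes it splits it into "[[", "AGENT_NAME" and "]]" and inserts styled tags in between. As a result, I cannot write a regex that finds and replaces the parameters.
How does Apache POI determine word boundaries? Is there a way to control this?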
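To make the failure concrete, here is a minimal sketch of the replacement step, assuming the generated HTML is available as a string. The class `PlaceholderReplacer`, the method `replaceParams`, the sample value and the `<span>` markup are hypothetical and only illustrate the problem; they are not part of the code above.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderReplacer {
    // Matches [[NAME]] where NAME consists of letters, digits or underscores.
    private static final Pattern PARAM =
            Pattern.compile("\\[\\[([A-Za-z0-9_]+)\\]\\]");

    // Replaces every intact [[NAME]] with its value; unknown names are left untouched.
    static String replaceParams(String html, Map<String, String> values) {
        Matcher m = PARAM.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement keeps "$" and "\" in the value from being treated as group references
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With "AGENT_NAME" mapped to some value, an intact `<p>Hi [[AGENT_NAME]]</p>` is replaced as expected, while a split form such as `<p>Hi [[<span>AGENT_NAME</span>]]</p>` comes back unchanged because the pattern never sees the brackets and the name as contiguous text.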
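After some struggle, I finally decided to write code that parses the Word document and merges the split runs. Here is the code; I hope it helps someone.
Note: the pattern I use is ${pattern}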
void mergeSplittedPatterns(XWPFDocument document)
{
    List<XWPFParagraph> paragraphs = document.paragraphs
    for(XWPFParagraph paragraph : paragraphs)
    {
        List<XWPFRun> runs = paragraph.getRuns()
        // Indexes of the runs that contain the "$" and the closing "}"
        int firstCharRun = 0, closingCharRun = 0
        boolean firstCharFound = false;
        boolean secondCharFoundImmediately = false;
        boolean closingCharFound = false;
        boolean gotoNextRun = true
        boolean scan = (runs!=null && runs.size()>0)
        int index = 0
        while(scan)
        {
            gotoNextRun = true;
            XWPFRun run = runs.get(index)
            String runText = run.getText(0)
            if(runText!=null)
            for (int i = 0; i < runText.length(); i++)
            {
                char character = runText.charAt(i);
                if(secondCharFoundImmediately)
                {
                    // Inside "${": look for the closing brace
                    closingCharFound = (character=="}")
                    if(closingCharFound)
                    {
                        closingCharRun = index
                        if(firstCharRun==closingCharRun)
                        {
                            // Pattern already sits in one run: nothing to merge
                            firstCharFound = secondCharFoundImmediately = closingCharFound = false
                            continue;
                        }
                        else
                        {
                            // Pattern spans several runs: concatenate their text into the first run
                            String mergedText= ""
                            for(int j=firstCharRun;j<=closingCharRun;j++)
                            {
                                // A run without text returns null; treat it as empty
                                mergedText += (runs.get(j).getText(0) ?: "")
                            }
                            runs.get(firstCharRun).setText(mergedText,0)
                            // Remove the merged runs from last to first so the indexes stay valid
                            for(int j=closingCharRun;j>firstCharRun;j--)
                            {
                                paragraph.removeRun(j)
                            }
                            firstCharFound = secondCharFoundImmediately = closingCharFound = gotoNextRun = false
                            // Rescan the merged run; it may contain further patterns
                            index = firstCharRun
                            break;
                        }
                    }
                }
                else if(firstCharFound)
                {
                    // "$" only counts when "{" follows immediately
                    secondCharFoundImmediately = (character=="{")
                    if(!secondCharFoundImmediately)
                    {
                        firstCharFound = secondCharFoundImmediately = closingCharFound = false
                    }
                }
                else if(character=="\$")
                {
                    firstCharFound = true;
                    firstCharRun = index
                }
            }
            if(gotoNextRun)
            {
                index++;
            }
            if(index>=runs.size())
            {
                scan = false;
            }
        }
    }
}
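The merge logic can be tried out without a .docx file. Below is a minimal sketch in plain Java that works on a list of strings standing in for the run texts of one paragraph. It is a simplified reformulation, not the method above: the class `RunTextMerger` and its methods are hypothetical names, formatting is ignored, and an entry ending in a lone "$" is merged with the next one even when no "{" follows.

```java
import java.util.ArrayList;
import java.util.List;

class RunTextMerger {
    // Joins adjacent entries so that each complete ${...} placeholder ends up in a single entry.
    static List<String> merge(List<String> runTexts) {
        List<String> result = new ArrayList<>();
        List<String> parts = new ArrayList<>();      // entries collected while a placeholder is open
        StringBuilder joined = new StringBuilder();  // concatenation of the collected entries
        for (String raw : runTexts) {
            String text = (raw == null) ? "" : raw;  // an entry without text counts as empty
            if (parts.isEmpty() && !isOpen(text)) {
                result.add(text);                    // nothing pending and nothing opened here
                continue;
            }
            parts.add(text);
            joined.append(text);
            if (!isOpen(joined.toString())) {
                result.add(joined.toString());       // placeholder closed: emit one merged entry
                parts.clear();
                joined.setLength(0);
            }
        }
        result.addAll(parts);                        // never closed: keep the entries unmerged
        return result;
    }

    // True when the text ends with "$" or contains a "${" that has no "}" after it.
    static boolean isOpen(String text) {
        if (text.endsWith("$")) {
            return true;
        }
        int open = text.lastIndexOf("${");
        return open >= 0 && text.indexOf('}', open) < 0;
    }
}
```

For example, the entries "Hello $", "{AGENT", "_NAME}" and " bye" collapse into "Hello ${AGENT_NAME}" and " bye", after which a regex over the text can find the placeholder.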