Reading equations and formulas from Word (Docx) into HTML and saving them to a database using Java
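I have a Word/docx file that contains equations.
I want to read the data from the Word/docx file and save it to my database.
When needed, I want to fetch the data from the database and display it on my HTML page.
I use Apache POI to read data from the docx file, but it cannot handle the equations.
Please help me!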
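Word *.docx files are ZIP archives containing XML files in the Office Open XML format. The formulas contained in Word *.docx documents are stored as Office Math Markup Language (OMML). Unfortunately, this XML format is not well known outside Microsoft Office, so it cannot be used directly in HTML, for example. Fortunately, it is XML, so it can be transformed with XSLT. We can therefore transform the OMML into MathML, which is usable in a much wider range of use cases.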
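To make the XSLT step concrete, here is a minimal sketch of applying a stylesheet with the JDK's `javax.xml.transform` API. The stylesheet in it is a made-up toy (`TOY_XSL`, turning a hypothetical `<num>` element into a MathML `<mn>`), not OMML2MML.XSL, and the class and method names are assumptions chosen for illustration. The mechanics are the same for the real stylesheet.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltSketch {

    // Toy stylesheet (hypothetical, NOT OMML2MML.XSL): rewrites <num> as <mn>.
    static final String TOY_XSL =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/num\"><mn><xsl:value-of select=\".\"/></mn></xsl:template>"
      + "</xsl:stylesheet>";

    // Parses and compiles a stylesheet once; the result can be reused for many inputs.
    static Templates compile(String xsl) throws Exception {
        return TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(xsl)));
    }

    // Applies the compiled stylesheet to one XML input and returns the output as a string.
    static String transform(Templates templates, String xml) throws Exception {
        Transformer transformer = templates.newTransformer();
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Templates templates = compile(TOY_XSL);
        System.out.println(transform(templates, "<num>42</num>"));
    }
}
```

Compiling the stylesheet into a `Templates` object is a design choice worth considering when a document contains many formulas, since the stylesheet is then parsed only once.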
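A transformation via XSLT is driven mainly by an XSL definition of the transformation. Unfortunately, writing such a definition is not easy either. Fortunately, Microsoft has already done it: if you have a current Microsoft Office installed, you can find the file OMML2MML.XSL in the Microsoft Office program directory under %ProgramFiles%\. If you cannot find it there, search the web for it.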
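Knowing all this, we can get the OMML from the XWPFDocument, transform it into MathML, and then save that for later use.
My example stores the formulas it finds as MathML in an ArrayList of strings. You should be able to store these strings in your database as well.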
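Storing the strings could look roughly like the following JDBC sketch. The table `formula` with its columns `position` and `mathml`, and the class name `MathMLStore`, are assumptions made up for this illustration. The JDBC URL and driver depend on your database and are left to the caller.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class MathMLStore {

    // Hypothetical schema: CREATE TABLE formula (position INT, mathml TEXT)
    static final String INSERT_SQL = "INSERT INTO formula (position, mathml) VALUES (?, ?)";

    // Inserts every MathML string together with its 1-based position in the document.
    static void saveAll(Connection connection, List<String> mathMLList) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(INSERT_SQL)) {
            int position = 1;
            for (String mathML : mathMLList) {
                statement.setInt(1, position++);
                statement.setString(2, mathML);
                statement.addBatch();
            }
            // Sends all queued inserts to the database in one batch.
            statement.executeBatch();
        }
    }
}
```

A caller would obtain the `Connection` from `java.sql.DriverManager.getConnection(...)` or a connection pool and pass in the list of MathML strings collected from the document. Using a `PreparedStatement` avoids having to escape the quotes and angle brackets in the MathML by hand.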
The example needs the full ooxml-schemas-1.3.jar, as mentioned in https://poi.apache.org/faq.html#faq-N10025. This is because it uses CTOMath, which is not shipped with the smaller poi-ooxml-schemas jar.
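If it is unclear which schema jar ended up on the classpath, a small check like the sketch below can tell whether CTOMath is actually available before the extraction code runs. The class name `SchemaCheck` and the helper `isOnClasspath` are made up for this illustration. Only `Class.forName` from the JDK is used.

```java
public class SchemaCheck {

    // Returns true if the named class can be found by this class's class loader.
    static boolean isOnClasspath(String className) {
        try {
            // 'false' means: look the class up without running its static initializers.
            Class.forName(className, false, SchemaCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String ctOMath = "org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath";
        System.out.println(ctOMath + " available: " + isOnClasspath(ctOMath));
    }
}
```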
Java code:
import java.io.*;

import org.apache.poi.xwpf.usermodel.*;

import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMathPara;

import org.w3c.dom.Node;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;

import java.awt.Desktop;

import java.util.List;
import java.util.ArrayList;

/*
 Needs the full ooxml-schemas-1.3.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025
*/
public class WordReadFormulas {

    static File stylesheet = new File("OMML2MML.XSL");
    static TransformerFactory tFactory = TransformerFactory.newInstance();
    static StreamSource stylesource = new StreamSource(stylesheet);

    // Transforms one OMML formula into a MathML string that can be embedded in HTML.
    static String getMathML(CTOMath ctomath) throws Exception {
        Transformer transformer = tFactory.newTransformer(stylesource);

        Node node = ctomath.getDomNode();

        DOMSource source = new DOMSource(node);
        StringWriter stringwriter = new StringWriter();
        StreamResult result = new StreamResult(stringwriter);
        transformer.setOutputProperty("omit-xml-declaration", "yes");
        transformer.transform(source, result);

        String mathML = stringwriter.toString();

        // The native OMML2MML.XSL transforms OMML into MathML as XML with special namespaces.
        // We do not need these, since we want to use the MathML in HTML, not in XML.
        // Ideally we would change OMML2MML.XSL so that it does not emit them.
        // To keep this example as simple as possible, we use literal string replacement
        // to get rid of the XML specifics.
        mathML = mathML.replace("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
        mathML = mathML.replace("xmlns:mml", "xmlns");
        mathML = mathML.replace("mml:", "");

        return mathML;
    }

    // Adds the MathML of every formula found in one paragraph to the list.
    static void addFormulas(XWPFParagraph paragraph, List<String> mathMLList) throws Exception {
        // inline formulas
        for (CTOMath ctomath : paragraph.getCTP().getOMathList()) {
            mathMLList.add(getMathML(ctomath));
        }
        // formulas in their own math paragraph
        for (CTOMathPara ctomathpara : paragraph.getCTP().getOMathParaList()) {
            for (CTOMath ctomath : ctomathpara.getOMathList()) {
                mathMLList.add(getMathML(ctomath));
            }
        }
    }

    public static void main(String[] args) throws Exception {

        // storing the found MathML in an ArrayList of strings
        List<String> mathMLList = new ArrayList<String>();

        // try-with-resources closes the stream and the document even if an exception is thrown
        try (InputStream in = new FileInputStream("Formula.docx");
             XWPFDocument document = new XWPFDocument(in)) {

            // getting the formulas out of all body elements
            for (IBodyElement ibodyelement : document.getBodyElements()) {
                if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) {
                    addFormulas((XWPFParagraph) ibodyelement, mathMLList);
                } else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) {
                    XWPFTable table = (XWPFTable) ibodyelement;
                    for (XWPFTableRow row : table.getRows()) {
                        for (XWPFTableCell cell : row.getTableCells()) {
                            for (XWPFParagraph paragraph : cell.getParagraphs()) {
                                addFormulas(paragraph, mathMLList);
                            }
                        }
                    }
                }
            }
        }

        // creating a sample HTML file
        String encoding = "UTF-8";
        try (Writer writer = new OutputStreamWriter(new FileOutputStream("result.html"), encoding)) {
            writer.write("<!DOCTYPE html>\n");
            writer.write("<html lang=\"en\">");
            writer.write("<head>");
            writer.write("<meta charset=\"utf-8\"/>");

            // using MathJax to help all browsers interpret MathML
            writer.write("<script type=\"text/javascript\"");
            writer.write(" async src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=MML_CHTML\"");
            writer.write(">");
            writer.write("</script>");

            writer.write("</head>");
            writer.write("<body>");
            writer.write("<p>The following formulas were found in the Word document:</p>");

            int i = 1;
            for (String mathML : mathMLList) {
                writer.write("<p>Formula " + i++ + ":</p>");
                writer.write(mathML);
            }

            writer.write("</body>");
            writer.write("</html>");
        }

        // opening the result in the default browser
        Desktop.getDesktop().browse(new File("result.html").toURI());
    }
}
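For the last part of the question, displaying the formulas again after loading them from the database, the page generation can be separated from the Word parsing. The sketch below assumes the MathML strings have already been read back into a list. The class name `FormulaPage` and the method `render` are made up for this illustration. It also assumes the stored MathML is trusted, since it is written into the page as markup without escaping, and it leaves out the MathJax script, which can be added to the head as in the code above if the target browsers do not render MathML themselves.

```java
import java.util.Arrays;
import java.util.List;

public class FormulaPage {

    // Builds a complete HTML page that lists the given MathML fragments in order.
    static String render(List<String> mathMLList) {
        StringBuilder html = new StringBuilder();
        html.append("<!DOCTYPE html>\n");
        html.append("<html lang=\"en\">\n");
        html.append("<head><meta charset=\"utf-8\"></head>\n");
        html.append("<body>\n");
        int i = 1;
        for (String mathML : mathMLList) {
            html.append("<p>Formula ").append(i++).append(":</p>\n");
            // The MathML is inserted as markup, so it must come from a trusted source.
            html.append(mathML).append("\n");
        }
        html.append("</body>\n");
        html.append("</html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        // Stand-in for strings loaded from the database.
        List<String> stored = Arrays.asList(
            "<math xmlns=\"http://www.w3.org/1998/Math/MathML\"><mi>x</mi></math>");
        System.out.print(render(stored));
    }
}
```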
I have just tested this code using Apache POI 5.0.0, and it works. Apache POI 5.0.0 needs poi-ooxml-full-5.0.0.jar. Please read https://poi.apache.org/help/faq.html#faq-N10025 to find out which ooxml library is needed for which Apache POI version.
Adding to @Axel Richter's answer: I found it hard to work out the required set of dependencies, so here they are:
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>ooxml-schemas</artifactId>
    <version>1.4</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.15</version>
</dependency>
For Office 2019, I guess OMML2MML.XSL is no longer shipped, so here is a link: https://github.com/Versal/word2markdown/blob/master/libs/omml2mml.xsl