Qt Regexp extract <p> tags from Html string

Question

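I have a RichText whose Html source, taken from a QTextEdit, I store in a string. What I want to do is extract all the lines one by one (I have 4-6 lines). The string looks like this: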

//html opening stuff
<p style = attributes...><span style = attributes...>My Text</span></p>
//more lines like this
//html closing stuff

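So I need the whole line from the opening p tag to the closing p tag (including the p tags themselves). I checked and tried everything I found here and on other sites, but still got no result.

Here is my code ("htmlStyle" is the input string):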

QStringList list;
QRegExp rx("(<p[^>]*>.*?</p>)");
int pos = 0;

while ((pos = rx.indexIn(htmlStyle, pos)) != -1) {
    list << rx.cap(1);
    pos += rx.matchedLength();
}

Or is there any other way to do it without regex?

Answer 1

HTML/XML is not a regular grammar, so you can't parse it with regular expressions. See e.g. this question. Parsing HTML is not trivial.
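As a small illustration of how a tag pattern can go wrong, here is a sketch using std::regex from the standard library rather than QRegExp, so it compiles without Qt. The helper name openingTagByRegex is hypothetical. It applies the opening-tag part of the question's pattern, <p[^>]*>, and shows that a '>' inside a quoted attribute value ends the match too early.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns what the pattern <p[^>]*> treats as the
// opening tag, or an empty string if the pattern does not match at all.
std::string openingTagByRegex(const std::string& html) {
    static const std::regex openTag("<p[^>]*>");
    std::smatch m;
    return std::regex_search(html, m, openTag) ? m.str(0) : std::string();
}
```

For the input <p title="a>b">text</p> this returns <p title="a> instead of the full opening tag, because [^>]* cannot tell a '>' inside quotes from the one that closes the tag.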

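You can iterate over the paragraphs of a rich text document using QTextDocument, QTextBlock, QTextCursor, etc. All HTML parsing is handled for you. This is exactly the subset of HTML that QTextEdit supports: it uses QTextDocument as its internal representation. You can get it directly from the widget with QTextEdit::document(). For example: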

#include <QTextBlock>
#include <QTextDocument>
#include <QTextEdit>

void iterate(QTextEdit * edit) {
   auto const & doc = *edit->document();
   // next() returns the following block; it does not advance in place
   for (auto block = doc.begin(); block != doc.end(); block = block.next()) {
      // do something with text block e.g. iterate its fragments
      for (auto fragment = block.begin(); fragment != block.end(); ++fragment) {
         // do something with text fragment
      }
   }
}

Rather than parsing HTML by hand, and getting it wrong, explore the structure of the QTextDocument and use it as needed.

Answer 2

Here is a plain Java way; hope it helps:

int startIndex = htmlStyle.indexOf("<p"); // "<p" also matches tags with attributes
while (startIndex >= 0) {
    // look for the closing tag only after the current opening tag
    int endIndex = htmlStyle.indexOf("</p>", startIndex);
    if (endIndex < 0) break; // unclosed <p>: nothing more to extract
    endIndex = endIndex + 4; // to include </p> in the substring
    System.out.println(htmlStyle.substring(startIndex, endIndex));
    startIndex = htmlStyle.indexOf("<p", endIndex);
}

Answer 3

For those who need a complete Qt solution, I figured it out based on @Aditya Poorna's answer. Thanks for the hint!

Here is the code:

int startIndex = htmlStyle.indexOf("<p");

while (startIndex >= 0) {
    // look for the closing tag only after the current opening tag
    int endIndex = htmlStyle.indexOf("</p>", startIndex);
    if (endIndex < 0)
        break; // unclosed <p>: nothing more to extract
    endIndex += 4; // include "</p>" in the substring
    QStringRef subString(&htmlStyle, startIndex, endIndex - startIndex);
    qDebug() << subString;
    startIndex = htmlStyle.indexOf("<p", endIndex);
}

"QStringRef subString" refers to the part of "htmlStyle" that starts at "startIndex" and has a length of "endIndex-startIndex"!
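The same index scan can be sketched as a reusable function. To keep it compilable without Qt, this version uses std::string instead of QString, and the name extractParagraphs is hypothetical. It assumes the input is the simple, non-nested markup shown in the question; note that searching for "<p" would also match tags such as <pre>.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of Qt: collects every "<p ...>...</p>" span,
// including the tags themselves, by scanning for the two markers.
std::vector<std::string> extractParagraphs(const std::string& html) {
    const std::string closeTag = "</p>";
    std::vector<std::string> out;
    std::size_t start = html.find("<p");
    while (start != std::string::npos) {
        // Search for the closing tag only after the current opening tag.
        const std::size_t close = html.find(closeTag, start);
        if (close == std::string::npos) break;  // unclosed <p>: stop
        const std::size_t end = close + closeTag.size();
        out.push_back(html.substr(start, end - start));
        start = html.find("<p", end);  // continue after this paragraph
    }
    return out;
}
```

Calling extractParagraphs on a string holding two paragraphs returns a vector with two entries, each running from its opening <p to the end of its </p>.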
