DomCrawler filterXpath for emails

Question

In my project I am trying to use filterXPath on emails. I receive an email via IMAP and load the mail body into my DomCrawler.

$crawler = new Crawler();
$crawler->addHtmlContent($mail->textHtml); //mail html content utf8

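Now to my problem. I only want the plain text of the mail body, but with all line breaks, spaces, etc. preserved: exactly what the mail looks like, just as plain text without HTML (still containing \n, \r and so on).

For that reason I tried $crawler->filterXPath('//body/descendant-or-self::*/text()') to get every text node in the mail.

However, my test mail contains HTML like this: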

<p>&#13;
    <u>
        <span>
            <a href="mailto:mail@example.com">
                <span style="color:#0563C1">mail@example.com</span>
            </a>
        </span>
    </u>
    <span>&#13;</span>
    <span>·</span>
    <span>
        <b>
            <a href="http://www.example.com">
                <span style="color:#0563C1">www.example.com</span>
            </a>
        </b>
    <p/>
    </span>
</p>&#13;

In my mail client this is displayed as mail@example.com · www.example.com (on a single line).

With my filterXPath I get multiple nodes, and the result looks like this (multiple lines):

mail@example.com
· www.example.com

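I know the &#13; is probably the problem, since it is a \r, but I cannot change the HTML in the mail, so I need another solution. As mentioned, in the mail this is only one line.

Please keep in mind that my solution has to work for every mail. I do not know what the mail HTML looks like, and it changes every time, so I need a generic solution.

I have already tried strip_tags, which does not change the result at all.

My current approach: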

$crawler = new Crawler();
$crawler->addHtmlContent($mail->textHtml);

$text = "";
foreach ($crawler->filterXPath('//body/descendant-or-self::*/text()') as $element) {
    $part = trim($element->textContent);
    if($part) {
        $text .= "|".$part."|\n"; //to see whitespaces etc
    }
}
echo $text;

//OUTPUT
|mail@example.com|
|·|
| |
|www.example.com|
| |

Answer 1

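Note that you are dealing with two different ways of handling whitespace-only text nodes. HTML has its own rules about whether those nodes are rendered (the difference is mainly between block and inline elements, plus whitespace normalization). XPath, on the other hand, works on the document tree provided by the parser (or DOM API), which has its own configuration for preserving or discarding whitespace-only text nodes. With that in mind, one solution could be to use the string() function to get the string value of the element containing the email.

For this input: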

<root>
<p>&#13;
    <u>
        <span>
            <a href="mailto:mail@example.com">
                <span style="color:#0563C1">mail@example.com</span>
            </a>
        </span>
    </u>
    <span>&#13;</span>
    <span>·</span>
    <span>
        <b>
            <a href="http://www.example.com">
                <span style="color:#0563C1">www.example.com</span>
            </a>
        </b>
    <p/>
    </span>
</p>&#13;
</root>

This XPath expression:

string(/root)

Output:





                mail@example.com




    ·



                www.example.com
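
The string value above still carries the indentation and carriage returns of the markup. As a rough sketch of how this could look in PHP, the block below evaluates string() with the plain DOM extension and then collapses whitespace runs the way a mail client would when rendering inline content. The helper name collapseWhitespace() and the shortened sample markup are assumptions made for illustration, not part of the question; with DomCrawler, $crawler->evaluate('string(//body)') should be the rough equivalent of the DOMXPath call.

```php
<?php
// Hypothetical helper (not from the question): collapse every run of
// whitespace (spaces, tabs, \r, \n) into one space and trim the ends.
function collapseWhitespace(string $text): string
{
    return trim(preg_replace('/\s+/u', ' ', $text));
}

// Shortened sample markup (an assumption). The middle dot is written as a
// numeric entity so the example does not depend on the input charset.
$html = '<body><p>&#13;<u><span><a href="mailto:mail@example.com">mail@example.com</a></span></u>'
      . '<span>&#13;</span> <span>&#183;</span> '
      . '<span><b><a href="http://www.example.com">www.example.com</a></b></span></p></body>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // mail HTML is rarely valid; keep warnings quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// string() returns the concatenated text of all descendant text nodes.
$raw = $xpath->evaluate('string(//body)');

echo collapseWhitespace($raw), "\n";
// prints: mail@example.com · www.example.com
```

This flattens the whole body into one line, so it only fits fragments that are displayed inline; real line breaks between paragraphs would be lost as well.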


Answer 2

I believe something like this should work:

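// DOMXPath expects a DOMDocument, so take it from the crawler's first node
$xpath = new DOMXPath($crawler->getNode(0)->ownerDocument);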
$result = $xpath->query('(//span[not(descendant::*)])');

$text = "";
foreach ($result as $element) {
    $part = trim($element->textContent);
    if($part) {
        $text .= "|".$part."|"; //to see whitespaces etc
    }
}
echo $text;

Output:

|mail@example.com||·||www.example.com|
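
Since the mail markup changes every time, matching only leaf span elements may miss text that sits in other tags. A more generic variant, sketched below under assumptions, walks the DOM, collapses whitespace inside text nodes, and starts a new line only at block-level elements and br. The function names nodeToText() and htmlToPlainText(), the BLOCK_TAGS list and the sample markup are all made up for illustration, and the tag list is deliberately incomplete. Note that this sketch also drops empty lines.

```php
<?php
// Assumed, incomplete list of tags that start a new line when rendered.
const BLOCK_TAGS = ['p', 'div', 'br', 'li', 'tr', 'table', 'ul', 'ol',
                    'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'blockquote'];

function nodeToText(DOMNode $node): string
{
    if ($node instanceof DOMText) {
        // \r, \n and indentation inside the markup render as a single space.
        return preg_replace('/\s+/u', ' ', $node->nodeValue);
    }
    if (!($node instanceof DOMElement)) {
        return ''; // skip comments, processing instructions, etc.
    }
    $text = '';
    foreach ($node->childNodes as $child) {
        $text .= nodeToText($child);
    }
    if (in_array(strtolower($node->nodeName), BLOCK_TAGS, true)) {
        return "\n" . $text . "\n"; // block elements begin and end a line
    }
    return $text;
}

function htmlToPlainText(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate invalid mail HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $lines = [];
    foreach (explode("\n", nodeToText($body)) as $line) {
        // Merge spaces that came from adjacent text nodes, then trim.
        $line = trim(preg_replace('/ {2,}/', ' ', $line));
        if ($line !== '') {
            $lines[] = $line;
        }
    }
    return implode("\n", $lines);
}

// Sample input (an assumption); the middle dot is a numeric entity to stay
// independent of the input charset.
$sample = '<p>&#13;<u><span><a href="mailto:mail@example.com">mail@example.com</a></span></u>'
        . '<span>&#13;</span> <span>&#183;</span> '
        . '<span><b><a href="http://www.example.com">www.example.com</a></b></span></p>'
        . '<p>Second line</p>';

echo htmlToPlainText($sample), "\n";
// prints:
// mail@example.com · www.example.com
// Second line
```

For real mails the input charset still has to be handled before parsing, for example by keeping the Crawler's addHtmlContent() and passing its body node to nodeToText().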


xpath

filter

symfony

domcrawler