Getting rid of &#039; in output of file_get_contents

Question

I am developing a tool for Wikipedia. I am trying to retrieve the page https://de.wikipedia.org/wiki/Spezial:Linkliste/Hans_Jansen_(Arabist) using file_get_contents. I then extract all list items by locating the list and exploding it at \n.

Then I want to retrieve the text of the articles named by the list items. For this I use

 file_get_contents("https://de.wikipedia.org/w/index.php?action=raw&title=".urlencode($article));

Everything went well until the article named Ka'b ibn As'ad, which resulted in the retrieval of

https://de.wikipedia.org/w/index.php?action=raw&title=Ka

When I copy the article name in as plain text, everything works:

 $article = "Ka'b ibn As'ad";
 $page = "https://".$server."/w/index.php?action=raw&title=".urlencode($article);

Comparing the urlencode output for $article entered manually and for $article retrieved from the website shows the difference:

  manually: Ka%27b+ibn+As%27ad
  website:  Ka%26%23039%3Bb%20ibn%20As%26%23039%3Bad

Comparing the output of htmlspecialchars() is even more striking:

  manually: Ka'b ibn As'ad
  website:  Ka&#039;b ibn As&#039;ad

How do I get rid of those &#039; special characters? Apparently htmlspecialchars_decode() does not work.

Answer 1

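htmlspecialchars_decode() only converts a handful of special entities, and with its default flags before PHP 8.1 it leaves the numeric entity &#039; untouched. You need html_entity_decode() with the ENT_QUOTES flag for this!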
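
A minimal sketch of that fix, assuming the scraped title arrives exactly as shown in the question. The hard-coded $article value and the variable names $decoded and $url are illustrative and do not come from the original tool. Passing ENT_QUOTES explicitly keeps the behaviour the same on PHP versions older than 8.1, where single quotes are not decoded by default.

```php
<?php
// Hypothetical input: the title as scraped from the HTML of the link list page.
$article = "Ka&#039;b ibn As&#039;ad";

// Decode HTML entities. ENT_QUOTES makes sure &#039; is turned back into '
// even where the default flags would leave single quotes alone.
$decoded = html_entity_decode($article, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Build the raw-text URL from the decoded title, as in the question.
$url = "https://de.wikipedia.org/w/index.php?action=raw&title=" . urlencode($decoded);

echo $decoded, "\n"; // Ka'b ibn As'ad
echo $url, "\n";     // https://de.wikipedia.org/w/index.php?action=raw&title=Ka%27b+ibn+As%27ad
```

The resulting URL can then be passed to file_get_contents() as before. Decoding before urlencode() matters: encoding the still-escaped title is what produces the %26%23039%3B sequence seen in the question.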
