Find all elements with text string?
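I am trying to remove all HTML elements (tags) that contain a specific text string. I have 2376 HTML documents, all with different doctype standards. Some don't even have a doctype (probably irrelevant to this question).
So, I am looking for a text string that says "How to cite this paper", and I have found that it is contained in either a <p>-tag, an <h4>-tag, or a <legend>-tag.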
The <p>-tag often looks like this,
<p style="text-align : center; color : Red; font-weight : bold;">How to cite this paper:</i></p>
The <h4>-tag often looks like this,
<h4>How to cite this paper:</h4>Antunes, P., Costa, C.J. & Pino, J.A. (2006).
The <legend>-tag looks like this,
<legend style="color: white; background-color: maroon; font-size: medium; padding: .1ex .5ex; border-right: 1px solid navy; border-bottom: 1px solid navy; font-weight: bold;">How to cite this paper</legend>
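The task at hand is to find these tags, remove them from the file, and then save the file again. I do have more tags to remove, but I need some help understanding HAP and XPath, and how to locate a specific tag based on its value or other unique data.
So far I have written this code in C#; it is a console application.
This is my Main (sorry for the bad indentation),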
//Variables
string Ext = "*.html";
string folder = @"D:\websites\dev.openjournal.tld\public\arkivet\";
IEnumerable<string> files = GetHTMLFiles(folder, Ext);
List<string> cite_files = new List<string>();
var doc = new HtmlDocument();
//Loop to match all html-elements to query
foreach (var file in files)
{
    try
    {
        doc.Load(file);
        cite_files.Add(doc.DocumentNode.SelectNodes("//h4[contains(., 'How to cite this paper')]").ToString());
        cite_files.Add(doc.DocumentNode.SelectNodes("//p[contains(., 'How to cite this paper')]").ToString());
    }
    catch (Exception Ex)
    {
        Console.WriteLine(Ex.Message);
    }
}
//Counts numbers of hits and prints data to user
int filecount = files.Count();
int citations = cite_files.Count();
Console.WriteLine("Number of files scanned: " + filecount);
Console.WriteLine("Number of citations: {0}", citations);
// Program end
Console.WriteLine("Press any key to close program....");
Console.ReadKey();
Here is the helper method that finds the files in the directory,
//List all HTML-files recursively and return them to a list
public static IEnumerable<string> GetHTMLFiles(string directory, string Ext)
{
    List<string> files = new List<string>();
    try
    {
        files.AddRange(Directory.GetFiles(directory, Ext, SearchOption.AllDirectories));
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.Message);
    }
    return files;
}
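The only unique thing seems to be "How to cite this paper", so I am trying to find all the specific tags that contain these exact words and then delete them. My Notepad shows that 1094 files should contain this phrase, so I would like to catch them all. :)
Any help is greatly appreciated! :)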
Html Agility Pack supports LINQ selectors, which are very handy in this case. Given some HTML based on your examples above:
var html =
@"<html><head></head><body>
<!-- selector match: delete these nodes -->
<p style='text-align: center; color: Red; font-weight: bold;'>How to cite this paper:</i></p>
<h4> How to cite this paper:</h4> Antunes, P., Costa, C.J. &amp; Pino, J.A. (2006).
<legend style='color: white; background-color: maroon; font-size: medium; padding: .1ex .5ex; border-right: 1px solid navy; border-bottom: 1px solid navy; font-weight: bold;'>How to cite this paper </legend>
<div><p><i><b>How to cite this paper (NESTED)</b></i></p></div>
<!-- no match: keep these nodes -->
<p>DO NOT DELETE - How to cite</p>
<h4>DO NOT DELETE - cite this paper:</h4>
<legend>DO NOT DELETE</legend>
</body></html>";
You can create a collection of the tags that should be searched, select the matching nodes, and then delete them like this:
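// Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
var tagsToDelete = new string[] { "p", "h4", "legend" };
var nodesToDelete = new List<HtmlNode>();
var document = new HtmlDocument();
document.LoadHtml(html);

// Collect every matching node first, so nothing is removed
// while Descendants() is still being enumerated.
foreach (var tag in tagsToDelete)
{
    nodesToDelete.AddRange(
        from searchText in document.DocumentNode.Descendants(tag)
        where searchText.InnerText.Contains("How to cite this paper")
        select searchText
    );
}

// Detach each collected node from the document.
foreach (var node in nodesToDelete) node.Remove();

// OUTPUT is a placeholder for the destination file path.
document.Save(OUTPUT);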
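Since the question asks about XPath specifically, here is a rough XPath counterpart to the LINQ query, offered as a sketch rather than the definitive way to do it. The class and method names (`XPathCitationRemover`, `RemoveCitations`) are made up for illustration. One thing to watch: `SelectNodes` returns `null` instead of an empty collection when nothing matches, which is a likely reason the loop in the question throws on files that lack the phrase.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of Html Agility Pack.
public static class XPathCitationRemover
{
    // Matches any <p>, <h4> or <legend> whose text, including the text of
    // nested elements, contains the phrase. contains() is case-sensitive.
    private const string Query =
        "//*[self::p or self::h4 or self::legend]" +
        "[contains(., 'How to cite this paper')]";

    // Removes the matching nodes and returns how many were removed.
    public static int RemoveCitations(HtmlDocument document)
    {
        // SelectNodes yields null (not an empty collection) when nothing matches.
        var matches = document.DocumentNode.SelectNodes(Query);
        if (matches == null) return 0;

        // Copy to a list so removal does not disturb the collection being walked.
        var nodes = matches.ToList();
        foreach (var node in nodes) node.Remove();
        return nodes.Count;
    }
}
```

Called as `XPathCitationRemover.RemoveCitations(document)` after `document.LoadHtml(html)`, it should select the same nodes as the LINQ version; which of the two to use is mostly a matter of taste.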
For the sample HTML above, the result is:
<html><head></head><body>
<!-- selector match: delete these nodes -->
Antunes, P., Costa, C.J. &amp; Pino, J.A. (2006).
<div></div>
<!-- no match: keep these nodes -->
<p>DO NOT DELETE - How to cite</p>
<h4>DO NOT DELETE - cite this paper:</h4>
<legend>DO NOT DELETE</legend>
</body></html>
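To apply the same idea to the whole archive from the question, something along these lines might work. It is a minimal sketch under a few assumptions: the names `CitationCleaner`, `Clean` and `CleanFolder` are invented for this example, files are overwritten in place (so try it on a copy of the folder first), and the encoding that Html Agility Pack detects on load is assumed to be acceptable when saving.

```csharp
using System;
using System.IO;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of Html Agility Pack.
public static class CitationCleaner
{
    private static readonly string[] TagsToDelete = { "p", "h4", "legend" };
    private const string Phrase = "How to cite this paper";

    // Removes matching nodes from one document and returns the number removed.
    public static int Clean(HtmlDocument document)
    {
        var nodes = TagsToDelete
            .SelectMany(tag => document.DocumentNode.Descendants(tag))
            .Where(node => node.InnerText.Contains(Phrase))
            .ToList(); // materialize before modifying the tree

        foreach (var node in nodes) node.Remove();
        return nodes.Count;
    }

    // Cleans every *.html file under the folder (recursively) and
    // returns the number of files that were changed.
    public static int CleanFolder(string folder)
    {
        int changedFiles = 0;
        foreach (var file in Directory.EnumerateFiles(
                     folder, "*.html", SearchOption.AllDirectories))
        {
            // A fresh document per file avoids carrying state between loads.
            var document = new HtmlDocument();
            document.Load(file);

            if (Clean(document) > 0)
            {
                document.Save(file); // overwrites the original file
                changedFiles++;
            }
        }
        return changedFiles;
    }
}
```

The return value of `CleanFolder` can be compared against the 1094 files expected to contain the phrase; a lower number would suggest that some occurrences sit in tags other than the three listed.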