Counting the number of words with HtmlAgilityPack
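I need to get the total word count of a web page. This method returns 336, but when I check the page manually with wordcounter.net I get about 1192 words. How can I get the word count of just the article?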
    int kelimeSayisi()
    {
        Uri url = new Uri("https://www.fitekran.com/hamilelik-ve-spor-hamileyken-hangi-spor-nasil-yapilir/");
        WebClient client = new WebClient();
        client.Encoding = System.Text.Encoding.UTF8;
        string html = client.DownloadString(url);
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);
        var kelime = doc.DocumentNode.SelectNodes("//text()").Count;
        return kelime;
    }
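As HereticMonkey mentioned in the comments, you are only retrieving the number of text nodes, so you need to count the words in each node's InnerText instead. There are also a few other things you most likely want to do:
- Only look at the body of the page
- Exclude script nodes, so that you don't count JavaScript
I have written a modified version of your code that counts words by splitting on the space character and treating only strings that start with a letter as words: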
    // Requires: using System; using System.Linq; using System.Net;
    int kelimeSayisi()
    {
        Uri url = new Uri("https://www.fitekran.com/hamilelik-ve-spor-hamileyken-hangi-spor-nasil-yapilir/");
        WebClient client = new WebClient();
        client.Encoding = System.Text.Encoding.UTF8;
        string html = client.DownloadString(url);
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);
        char[] delimiter = new char[] {' '};
        int kelime = 0;
        // Visit every text node inside <body> whose direct parent is not a <script> element.
        foreach (string text in doc.DocumentNode
            .SelectNodes("//body//text()[not(parent::script)]")
            .Select(node => node.InnerText))
        {
            // Split on spaces and keep only the tokens that start with a letter.
            var words = text.Split(delimiter, StringSplitOptions.RemoveEmptyEntries)
                .Where(s => Char.IsLetter(s[0]));
            int wordCount = words.Count();
            if (wordCount > 0)
            {
                // Print what was counted so the result can be checked by eye.
                Console.WriteLine(String.Join(" ", words));
                kelime += wordCount;
            }
        }
        return kelime;
    }
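This returns a total of 1487 words, and it also writes everything that was counted as a word to the console so you can see what is included. wordcounter.net probably excludes some content, such as the header and footer.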
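Part of any remaining difference may also come from how the text is split. Splitting only on the space character leaves words joined when they are separated by a newline, a tab, or an entity such as `&nbsp;`. The following is a minimal sketch of a stricter counter that uses only the standard library. The names `WordCounter`, `CountWords` and `CountWordsInAll` are made up for illustration and are not part of HtmlAgilityPack, and the rule "a word starts with a letter" is simply carried over from the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class WordCounter
{
    // Counts the whitespace-separated tokens in one piece of text
    // that start with a letter.
    public static int CountWords(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
        {
            return 0;
        }

        // Decode HTML entities first, so that "&nbsp;" becomes a real
        // non-breaking space instead of staying glued between two words.
        string decoded = WebUtility.HtmlDecode(text);

        // A null separator array makes Split break on every white-space
        // character (spaces, tabs, newlines, non-breaking spaces).
        return decoded
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Count(token => char.IsLetter(token[0]));
    }

    // Sums the word counts of several pieces of text, for example the
    // InnerText values of the selected text nodes.
    public static int CountWordsInAll(IEnumerable<string> texts)
    {
        return texts.Sum(t => CountWords(t));
    }
}
```

For example, `WordCounter.CountWords("Hamilelik ve\nspor&nbsp;nasil 123 yapilir")` returns 5, whereas splitting on the space character alone would find only 3 tokens that start with a letter. To count just the article, the `InnerText` values passed to `CountWordsInAll` could come from a narrower XPath than `//body`, for instance `//article//text()[not(ancestor::script) and not(ancestor::style)]`. That expression assumes the page wraps its content in an `article` element, which would need to be checked against the actual markup.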