Get the most used words with special characters
I want to get the most used words from an array. The only problem is that the Swedish characters (Å, Ä, and Ö) only show up as �.
$string = 'This is just a test post with the Swedish characters Å, Ä, and Ö. Also as lower cased characters: å, ä, and ö.';
echo '<pre>';
print_r(array_count_values(str_word_count($string, 1, 'àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ')));
echo '</pre>';
The code outputs the following:
Array
(
[This] => 1
[is] => 1
[just] => 1
[a] => 1
[test] => 1
[post] => 1
[with] => 1
[the] => 1
[Swedish] => 1
[characters] => 2
[�] => 1
[�] => 1
[and] => 2
[�] => 1
[Also] => 1
[as] => 1
[lower] => 1
[cased] => 1
[�] => 1
[�] => 1
[�] => 1
)
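How can I get it to "see" the Swedish characters and other special characters?
I managed to get rid of the � symbols by adding ÅåÄäÖö to àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ.
All of this assumes you are using UTF-8.
You can take a naive approach and use preg_split() to split the string on any separator, punctuation, or control character.
preg_split example: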
$split = preg_split('/[\pZ\pP\pC]/u', $string, -1, PREG_SPLIT_NO_EMPTY);
print_r(array_count_values($split));
Output:
Array
(
[This] => 1
[is] => 1
[just] => 1
[a] => 1
[test] => 1
[post] => 1
[with] => 1
[the] => 1
[Swedish] => 1
[characters] => 2
[Å] => 1
[Ä] => 1
[and] => 2
[Ö] => 1
[Also] => 1
[as] => 1
[lower] => 1
[cased] => 1
[å] => 1
[ä] => 1
[ö] => 1
)
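This works for your given string, but it does not necessarily split words in a locale-aware way. For example, a contraction such as "isn't" would be broken up into "isn" and "t".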
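To illustrate the limitation, the sketch below runs the same pattern on a short sentence containing a contraction, then tries one possible workaround: matching runs of letters that may contain an internal apostrophe, instead of splitting on punctuation. The second pattern is an assumption added for illustration, not part of the approach above, and it only covers the ASCII apostrophe and the typographic ’ (U+2019). It assumes PHP 7+ and a UTF-8 source file.

```php
<?php
$text = "This isn't a test, is it?";

// Splitting on separators, punctuation and control characters
// also splits on the apostrophe inside "isn't".
$split = preg_split('/[\pZ\pP\pC]/u', $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($split); // contains "isn" and "t" as separate entries

// Hypothetical alternative: match letter runs, allowing an apostrophe
// between letters so that contractions stay in one piece.
preg_match_all("/\\pL+(?:['\u{2019}]\\pL+)*/u", $text, $matches);
$words = $matches[0];
print_r($words); // "isn't" is kept as a single entry
```

This is still not locale-aware; it only patches one known case.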
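Thankfully, the Intl extension added a great deal of functionality in PHP 7 for dealing with things like this.
The plan is to:
- Normalize the input with Normalizer::normalize()* to ensure that graphemes are all encoded in a consistent way. For example, ä could be encoded in several ways, and would be counted accordingly:
  - U+00E4 'LATIN SMALL LETTER A WITH DIAERESIS', or
  - U+0061 'LATIN SMALL LETTER A' followed by U+0308 'COMBINING DIAERESIS'
- Get an IntlBreakIterator that breaks on words in a locale-dependent way via IntlBreakIterator::createWordInstance(). This understands what constitutes a "word" for the given locale, including handling contractions like "isn't".
- Get its IntlPartsIterator via IntlBreakIterator::getPartsIterator() for easy iteration over the text fragments.
- Skip the things you don't care about.
(*Note that you will probably want to normalize regardless of the method used to break up the string: in the preg_split example above, or whatever you decide to go with.)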
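As a small illustration of why the normalization step matters, the sketch below counts the two encodings of ä listed above, first as-is and then after normalizing to NFC. It assumes the intl extension is loaded and PHP 7+ (for the "\u{...}" escape). The variable names are only for this example.

```php
<?php
$precomposed = "\u{00E4}";   // U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
$decomposed  = "a\u{0308}";  // U+0061 followed by U+0308 COMBINING DIAERESIS

$words = [$precomposed, $decomposed];

// Without normalization the two byte sequences are different array keys.
$rawCounts = array_count_values($words);

// After normalizing to NFC both become the precomposed form.
$normalized = [];
foreach ($words as $word) {
    $normalized[] = Normalizer::normalize($word, Normalizer::FORM_C);
}
$normalizedCounts = array_count_values($normalized);

var_dump(count($rawCounts));        // int(2)
var_dump(count($normalizedCounts)); // int(1)
```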
Intl example:
$string = Normalizer::normalize($string);
$iter = IntlBreakIterator::createWordInstance("sv_SE");
$iter->setText($string);
$words = $iter->getPartsIterator();
$split = [];
foreach ($words as $word) {
    // skip text fragments consisting only of a space or punctuation character
    if (IntlChar::isspace($word) || IntlChar::ispunct($word)) {
        continue;
    }
    $split[] = $word;
}
print_r(array_count_values($split));
Output:
Array
(
[This] => 1
[is] => 1
[just] => 1
[a] => 1
[test] => 1
[post] => 1
[with] => 1
[the] => 1
[Swedish] => 1
[characters] => 2
[Å] => 1
[Ä] => 1
[and] => 2
[Ö] => 1
[Also] => 1
[as] => 1
[lower] => 1
[cased] => 1
[å] => 1
[ä] => 1
[ö] => 1
)
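This is more verbose, but it may be worth it if you would rather have ICU (the library backing the Intl extension) do the heavy lifting of understanding what makes up a word.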
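Whichever way the string is split, the counts still have to be ranked to get the most used words. One possible final step is to sort the counts in descending order and keep the first few. The sketch below uses the preg_split approach for brevity; the helper name mostUsedWords() and its $limit parameter are hypothetical, and the order of words with equal counts is not something to rely on.

```php
<?php
// Hypothetical helper: returns the $limit most frequent words as word => count.
function mostUsedWords(string $text, int $limit): array
{
    $words  = preg_split('/[\pZ\pP\pC]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($words);
    arsort($counts); // highest count first, keys preserved

    // preserve_keys = true keeps numeric "words" such as 2020 as keys too
    return array_slice($counts, 0, $limit, true);
}

$string = 'This is just a test post with the Swedish characters Å, Ä, and Ö. Also as lower cased characters: å, ä, and ö.';

// The two words that occur twice: "characters" and "and"
print_r(mostUsedWords($string, 2));
```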
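Here is a solution that uses a regular expression in Unicode mode to split the "words" on punctuation and whitespace, then simply counts the occurrences in the resulting array.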
array_count_values(preg_split('/[[:punct:]\s]+/u', $string, -1, PREG_SPLIT_NO_EMPTY));
Produces:
Array
(
[This] => 1
[is] => 1
[just] => 1
[a] => 1
[test] => 1
[post] => 1
[with] => 1
[the] => 1
[Swedish] => 1
[characters] => 2
[Å] => 1
[Ä] => 1
[and] => 2
[Ö] => 1
[Also] => 1
[as] => 1
[lower] => 1
[cased] => 1
[å] => 1
[ä] => 1
[ö] => 1
)
This was tested in a Unicode console; if you are using a browser, you may need to force the encoding. Either add a <meta> tag, set the encoding in your browser, or send PHP headers.
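A minimal sketch of the header option, assuming the script itself is saved as UTF-8 and nothing has been output before the header() call. The shortened sample string is only for illustration.

```php
<?php
// Must run before any output is sent to the browser.
header('Content-Type: text/html; charset=utf-8');

$string = 'Å, Ä, and Ö. Also å, ä, and ö.';
$counts = array_count_values(
    preg_split('/[[:punct:]\s]+/u', $string, -1, PREG_SPLIT_NO_EMPTY)
);

echo '<pre>';
// Escape the dump so any markup-like words are shown literally.
echo htmlspecialchars(print_r($counts, true), ENT_QUOTES, 'UTF-8');
echo '</pre>';
```

The equivalent <meta> tag would be <meta charset="utf-8"> in the page's head.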