"é", "è" and all accented letters are not displayed by a PHP function that uses preg_replace
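I am working on a search engine. I found a well-written PHP function on the web that lists the keywords of a text. The function works perfectly in English. However, when I tried to adapt it to French, I noticed that "é", "è", "à" and all other accented letters do not appear in the output.
For example, if the text contains "Hello Héllo", the "é" is missing from the output.
I guess the problem is in the following line of code: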
$text = preg_replace('/[^a-zA-Z0-9 -.]/', '', $text); // only take alphanumerical characters, but keep the spaces and dashes too…
Any ideas? Many thanks from France!
The full code is below:
function generateKeywordsFromText($text){
    // List of words NOT to be included in keywords
    $stopWords = array('à','à demi','à peine','à peu près','absolument','actuellement','ainsi');
    $text = preg_replace('/\s\s+/i', '', $text); // replace multiple spaces etc. in the text
    $text = trim($text); // trim any extra spaces at start or end of the text
    $text = preg_replace('/[^a-zA-Z0-9 -.]/', '', $text); // only take alphanumerical characters, but keep the spaces and dashes too…
    $text = strtolower($text); // Make the text lowercase so that the output is lowercase and the whole operation is case-insensitive.
    // Find all words
    preg_match_all('/\b.*?\b/i', $text, $allTheWords);
    $allTheWords = $allTheWords[0];
    // Now loop through the whole list and remove short or empty words
    foreach ( $allTheWords as $key=>$item )
    {
        if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
            unset($allTheWords[$key]);
        }
    }
    // Create an array that will later have the keyword as its index and the keyword count as its value.
    $wordCountArr = array();
    // Now populate this array with keywords and their occurrence counts
    if ( is_array($allTheWords) ) {
        foreach ( $allTheWords as $key => $val ) {
            $val = strtolower($val);
            if ( isset($wordCountArr[$val]) ) {
                $wordCountArr[$val]++;
            } else {
                $wordCountArr[$val] = 1;
            }
        }
    }
    // Sort array by the number of repetitions
    arsort($wordCountArr);
    // Keep the first 50 keywords, discard the rest
    $wordCountArr = array_slice($wordCountArr, 0, 50);
    // Now generate a space-separated list from the array
    $words="";
    foreach ($wordCountArr as $key=>$value) {
        $words .= " " . $key ;
    }
    // Trim the space-separated keyword list and return it
    return trim($words," ");
}
echo $contentkeywords = generateKeywordsFromText("Hello, Héllo");
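You need to fix all three regex calls (the two preg_replace calls and the preg_match_all call). Without the u modifier, PCRE processes the string byte by byte, so the two bytes of a UTF-8 encoded "é" are both matched by [^a-zA-Z0-9 -.] and removed: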
$text = preg_replace('/\s{2,}/ui', '', $text); // replace multiple spaces etc. in the text
$text = preg_replace('/[^\p{L}0-9 .-]+/u', '', $text); // only take alphanumerical characters, but keep the spaces and dashes too…
// Find all words
preg_match_all('/\w+/u', $text, $allTheWords);
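To see the effect of the u modifier in isolation, the following minimal sketch runs the original and the corrected character-class pattern on the sample string from the question. It assumes that both the source file and the input are UTF-8 encoded; the variable names are only illustrative.

```php
<?php
// Assumption: this file and the input string are UTF-8 encoded.
$text = "Hello, Héllo";

// Original pattern, no /u: "é" is seen as two separate bytes (0xC3 0xA9).
// Neither byte is in the allowed set, so both are removed.
// The comma survives because " -." is a range from space to dot.
$bytewise = preg_replace('/[^a-zA-Z0-9 -.]/', '', $text);

// Corrected pattern: /u decodes UTF-8 and \p{L} matches any Unicode letter.
$unicode = preg_replace('/[^\p{L}0-9 .-]+/u', '', $text);

echo $bytewise, "\n"; // prints "Hello, Hllo"
echo $unicode, "\n";  // prints "Hello Héllo"
```

The first result also shows the unintended range in the original character class: the comma is kept even though the comment only mentions spaces and dashes.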
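Details
'/\s{2,}/ui' - matches any run of two or more Unicode whitespace characters (the i flag has no effect on this pattern and can be dropped).
'/[^\p{L}0-9 .-]+/u' - matches one or more characters other than a Unicode letter (\p{L}), an ASCII digit (0-9), a space, a dot or a hyphen. Note that the - must be placed at the end of the character class to be taken literally; in the original ' -.' it formed a range from space to dot.
'/\w+/u' - matches all Unicode words, i.e. sequences of one or more letter/digit/underscore characters.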
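Putting the three patterns together, the whole function could look roughly like the sketch below. It is not the original author's code: the name generateKeywordsFromTextUtf8 and its parameters are hypothetical, it assumes valid UTF-8 input and a loaded mbstring extension, and it goes beyond the answer in three ways. It uses mb_strtolower and mb_strlen because strtolower and strlen work on bytes, not characters. It also replaces whitespace runs with a single space instead of an empty string, so that neighbouring words are not glued together.

```php
<?php
// Sketch only. Assumptions: valid UTF-8 input, mbstring extension available.
// The function name and parameters are hypothetical, not from the original post.
function generateKeywordsFromTextUtf8(string $text, array $stopWords, int $limit = 50): string
{
    // Collapse runs of whitespace into one space (deviation from the original,
    // which replaced them with '').
    $text = trim(preg_replace('/\s{2,}/u', ' ', $text) ?? '');
    // Keep Unicode letters, ASCII digits, spaces, dots and hyphens.
    $text = preg_replace('/[^\p{L}0-9 .-]+/u', '', $text) ?? '';
    // Multibyte-aware lowercasing, so that "É" becomes "é".
    $text = mb_strtolower($text, 'UTF-8');

    // Find all Unicode words.
    preg_match_all('/\w+/u', $text, $matches);

    // Count words, skipping stop words and words of 3 characters or fewer.
    $counts = [];
    foreach ($matches[0] as $word) {
        if (mb_strlen($word, 'UTF-8') <= 3 || in_array($word, $stopWords, true)) {
            continue;
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }

    // Most frequent first, then keep at most $limit keywords.
    arsort($counts);
    return implode(' ', array_keys(array_slice($counts, 0, $limit, true)));
}

echo generateKeywordsFromTextUtf8("Héllo, héllo! Hello", ['ainsi']), "\n"; // prints "héllo hello"
```

Because mb_strlen counts characters, a word such as "été" is treated as three characters and filtered out, whereas strlen would report five bytes and keep it.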