How to generate a copy of the original text including the highlighted entities with PHP using results from Google Natural Language API
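I'm processing some text with the Google Natural Language API and the PHP client library, and I want to generate a copy of the original text formatted the way it appears on the Google Natural Language try-out page (see the screenshot), where the entities are highlighted and carry an index that links them to their parent term:
From the results I get the entity name, its mentions, and each mention's beginOffset. For example: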
array (
'name' => 'Google Cloud Natural Language API',
'type' => 'OTHER',
'metadata' =>
array (
'mid' => '/g/11bc5pm43l',
'wikipedia_url' => 'https://pl.wikipedia.org/wiki/NAPI_(API)',
),
'salience' => 0.045935749999999997417177155512035824358463287353515625,
'mentions' =>
array (
0 =>
array (
'text' =>
array (
'content' => 'Google Cloud Natural Language API',
'beginOffset' => 90,
),
'type' => 'PROPER',
'sentiment' =>
array (
'magnitude' => 0.90000000000000002220446049250313080847263336181640625,
'score' => 0.90000000000000002220446049250313080847263336181640625,
),
), //and so on
I extract the main variables from the results so that I can amend the original text with the relevant information:
array (
0 => 'Google Cloud Natural Language API', //The parent term
1 => 'OTHER',
2 => 0.045935749999999997417177155512035824358463287353515625,
3 => 1.600000000000000088817841970012523233890533447265625,
4 => 0,
5 => 16, //This term has 16 associated mentions
6 =>
array ( //Array containing all of the associated mentions
0 => 'Google Cloud Natural Language API',
1 => 'Natural Language API',
2 => 'Natural Language API',
3 => 'Natural Language API',
4 => 'Natural Language API',
5 => 'REST API',
6 => 'Natural Language API',
7 => 'Natural Language API',
8 => 'Natural Language API',
9 => 'Natural Language API',
10 => 'Natural Language API',
11 => 'Natural Language API',
12 => 'Natural Language API',
13 => 'Natural Language API',
14 => 'Natural Language API',
15 => 'Natural Language API',
),
7 =>
array ( //Array containing the beginOffset of each associated mention
0 => 90,
1 => 196,
2 => 321,
3 => 463,
4 => 2421,
5 => 2447,
6 => 2946,
7 => 6167,
8 => 6414,
9 => 8958,
10 => 12039,
11 => 12168,
12 => 12256,
13 => 13179,
14 => 13294,
15 => 13802,
),
),
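For reference, the extraction step can be sketched roughly as below. This is only an illustration: `collectMentions` is a hypothetical helper name, not part of the client library, and it assumes each entity is already a plain PHP array shaped like the dump above. As far as I know, beginOffset comes back as -1 when no encoding type is set on the request, so the sketch skips negative offsets.

```php
<?php
// Hypothetical helper (not part of the client library): flattens one entity
// array, shaped like the dump above, into a beginOffset => mention text map.
function collectMentions(array $entity): array
{
    $mentions = [];
    foreach ($entity['mentions'] ?? [] as $mention) {
        $offset = $mention['text']['beginOffset'] ?? -1;
        if ($offset < 0) {
            continue; // no usable offset for this mention
        }
        $mentions[$offset] = $mention['text']['content'];
    }
    ksort($mentions); // ascending by offset
    return $mentions;
}

// Sample data trimmed down from the values shown above.
$entity = [
    'name' => 'Google Cloud Natural Language API',
    'mentions' => [
        ['text' => ['content' => 'Natural Language API', 'beginOffset' => 196]],
        ['text' => ['content' => 'Google Cloud Natural Language API', 'beginOffset' => 90]],
    ],
];

$map = collectMentions($entity);
var_export($map);
```

Keying the map by offset keeps every mention unique, even when the same text (such as 'Natural Language API') occurs many times.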
So far I have tried:
<?php
# Open file
$myfile = fopen("sampleText.txt", "r") or die("Unable to open file!");
$data = fread($myfile, filesize("sampleText.txt"));
fclose($myfile);
echo 'Original Text: <br>';
echo $data;
echo '<br>';
echo '<br>';
# Entities occurrence List
$entitiesList = array (
0 => 'Google Cloud Natural Language API',
1 => 'Natural Language API',
2 => 'Natural Language API',
3 => 'Natural Language API',
4 => 'Natural Language API',
5 => 'REST API',
6 => 'Natural Language API',
7 => 'Natural Language API',
8 => 'Natural Language API',
9 => 'Natural Language API',
10 => 'Natural Language API',
11 => 'Natural Language API',
12 => 'Natural Language API',
13 => 'Natural Language API',
14 => 'Natural Language API',
15 => 'Natural Language API',
);
# Sample offsets
$ofsettList = array (
0 => 90,
1 => 196,
2 => 321,
3 => 463,
4 => 2421,
5 => 2447,
6 => 2946,
7 => 6167,
8 => 6414,
9 => 8958,
10 => 12039,
11 => 12168,
12 => 12256,
13 => 13179,
14 => 13294,
15 => 13802,
);
# Size of the offsets list
$ofsettListLenght = sizeof($ofsettList);
# Index of the entity in the returned results
$index = 1;
# Temporary array holding the newly formatted strings
$tempAmendedEntity = [];
for($i = 0; $i < $ofsettListLenght; $i++) {
$tempAmendedEntity[] = '('. $entitiesList[$i] . ')' . $index;
}
echo 'List of new amended Strings';
echo '<pre>', var_export($tempAmendedEntity, true), '</pre>', "\n";
echo '<br>';
echo '<br>';
// Method 1
for($i = 0; $i < $ofsettListLenght; $i++) {
$temp1 = str_replace(substr($data, $ofsettList[$i], strlen($entitiesList[$i])), $tempAmendedEntity[$i] , $data);
}
echo 'Text after method 1: <br>';
echo '<pre>', var_export($temp1, true), '</pre>', "\n";
echo '<br>';
echo '<br>';
// Method 2
$keyPairArray = [];
for($i = 0; $i < $ofsettListLenght; $i++) {
$keyPairArray[$entitiesList[$i]] = $tempAmendedEntity[$i];
}
echo 'List of key => value strings';
echo '<pre>', var_export($keyPairArray, true), '</pre>', "\n";
echo '<br>';
echo '<br>';
$temp2 = strtr($data, $keyPairArray);
echo 'Text after method 2: <br>';
echo '<pre>', var_export($temp2, true), '</pre>', "\n";
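With method 1: not all of the new string values are substituted; for example, (Google Cloud Natural Language API)1 and (REST API)1 do not appear in the result.
With method 2: every string that matches an original string is replaced with its new string, but that includes occurrences of matching strings that are not related to the parent term.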
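A small reproduction may make both symptoms easier to see. It uses a made-up one-line `$text` instead of sampleText.txt, so the exact strings are assumptions. In method 1 every `str_replace` call starts again from the untouched text, so only the last iteration survives, and `str_replace` itself replaces every occurrence of the needle, not just the one at the offset. `strtr` with an array likewise replaces every occurrence of each key.

```php
<?php
// Made-up sample text standing in for sampleText.txt.
$text = 'REST API and Natural Language API; the Natural Language API again.';

// Method 1 pattern: each call reads the original $text, so $result only
// ever holds the outcome of the final iteration.
$needles = ['REST API', 'Natural Language API'];
foreach ($needles as $needle) {
    $result = str_replace($needle, '(' . $needle . ')1', $text);
}
echo $result, "\n";

// Method 2 pattern: strtr() replaces every occurrence of each key,
// whether or not that occurrence was reported as a mention.
$result2 = strtr($text, ['Natural Language API' => '(Natural Language API)1']);
echo $result2, "\n";
```

In the first result 'REST API' is left untouched and both occurrences of 'Natural Language API' are wrapped, which matches what the question describes.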
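It would be great to be able to replace only the string that starts at a specific 'beginOffset' with the new amended string.
The text I'm using for testing can be downloaded from here: sampleText.txt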
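I found this useful for the project I'm working on, because when you receive the answer from Google Natural Language, the JSON response contains all of the mentions of each entity and the original text, but it is hard to visualize where in the text a specific mention sits. So I took all of the returned mentions, sorted them in a key => value array, and then used these two functions to rebuild the original text, including each word's offset as a unique identifier so that it is easy to find where that word is in the text.
$originalString is the entity value from the key => value entities array and $originalOfsett is the key. I then build a new string that includes HTML span tags for styling: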
function stringReplacementInfo($originalString, $originalOfsett) {
$modifiedString = " <span id=\"entity\">(" . $originalString . ")</span><span id=\"index\">" . $originalOfsett . "</span> ";
replaceString($originalString, $originalOfsett, $modifiedString);
}
Then I replace the strings in the original text with their modified versions, one by one:
function replaceString($originalString, $originalOfsett, $modifiedString) {
global $data;
$originalStringLenght = mb_strlen($originalString, '8bit');
$preChunk = '';
$postChunk = '';
$chunk = '';
$preChunk = substr($data, 0, $originalOfsett);
$chunk = substr($data, $originalOfsett, $originalStringLenght);
$postChunk = substr($data, $originalOfsett + $originalStringLenght);
$data = $preChunk . $modifiedString . $postChunk;
}
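One thing to watch with in-place replacement: every replacement makes the text longer, so any offset that comes after it no longer points at the right spot. A possible way around that, sketched below, is to work from the highest offset down to the lowest. `highlightMentions` is a hypothetical name, the `(text)offset` output format is only for illustration, and the sketch assumes the offsets are byte offsets (which is what the '8bit' length above implies).

```php
<?php
// Hypothetical helper: wraps each mention in place, given a
// byte offset => mention text map, without relying on a global variable.
function highlightMentions(string $text, array $mentions): string
{
    krsort($mentions); // highest offset first, so lower offsets stay valid
    foreach ($mentions as $offset => $mention) {
        $length = strlen($mention); // byte length, to match byte offsets
        if (substr($text, $offset, $length) !== $mention) {
            continue; // offset does not line up with the text; leave it alone
        }
        $replacement = '(' . $mention . ')' . $offset;
        $text = substr_replace($text, $replacement, $offset, $length);
    }
    return $text;
}

// Made-up sample text and offsets for illustration.
$text = 'The REST API and the Natural Language API.';
$mentions = [4 => 'REST API', 21 => 'Natural Language API'];
echo highlightMentions($text, $mentions), "\n";
// prints: The (REST API)4 and the (Natural Language API)21.
```

The `substr` check is just a guard: if the text at an offset is not the expected mention (for example because the offsets were computed in a different encoding), that mention is skipped instead of corrupting the output.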
Then I used some CSS to make it look more like the results of the web try-out version of the Natural Language API:
#index{
font-size: 0.7em;
color: #3d1766;
}
With the offset value it is easier to find each returned mention of an entity. The final result looks like this:
Hope this helps others.