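MITIE NER model
I have been exploring named entity extraction with the pretrained MITIE models. Is there any way I can look at their actual NER model rather than just using the pretrained one? Is the model available as open source?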
Setting things up:
For starters, download the English language model, which contains a
corpus of annotated text from a huge dump in a file called
total_word_feature_extractor.dat.
After that, download or clone the MITIE-master project from its
official Git repository.
If you are running Windows, download CMake.
If you are running x64 Windows, install Visual Studio 2015 Community
edition for the C++ compiler.
After downloading the above, extract all of them into a folder.
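Open the VS 2015 Developer Command Prompt from Start > All Apps > Visual Studio, then navigate to the tools folder, where you will see 5 subfolders.
The next step is to build the ner_conll, ner_stream, train_freebase_relation_detector and wordrep packages by running the following CMake commands in the Visual Studio Developer Command Prompt.
Something like this: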
For ner_conll:
cd "C:\Users\xyz\Documents\MITIE-master\tools\ner_conll"
i) mkdir build
ii) cd build
iii) cmake -G "Visual Studio 14 2015 Win64" ..
iv) cmake --build . --config Release --target install
For ner_stream:
cd "C:\Users\xyz\Documents\MITIE-master\tools\ner_stream"
i) mkdir build
ii) cd build
iii) cmake -G "Visual Studio 14 2015 Win64" ..
iv) cmake --build . --config Release --target install
For train_freebase_relation_detector:
cd "C:\Users\xyz\Documents\MITIE-master\tools\train_freebase_relation_detector"
i) mkdir build
ii) cd build
iii) cmake -G "Visual Studio 14 2015 Win64" ..
iv) cmake --build . --config Release --target install
For wordrep:
cd "C:\Users\xyz\Documents\MITIE-master\tools\wordrep"
i) mkdir build
ii) cd build
iii) cmake -G "Visual Studio 14 2015 Win64" ..
iv) cmake --build . --config Release --target install
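After you build them you will get some 150-160 warnings; don't worry about them.
Now, navigate to "C:\Users\xyz\Documents\MITIE-master\examples\cpp\train_ner"
Using Visual Studio Code, make a JSON file "data.json" with manually annotated text, something like this: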
{
  "AnnotatedTextList": [
    {
      "text": "I want to travel from New Delhi to Bangalore tomorrow.",
      "entities": [
        {
          "type": "FromCity",
          "startPos": 5,
          "length": 2
        },
        {
          "type": "ToCity",
          "startPos": 8,
          "length": 1
        },
        {
          "type": "TimeOfTravel",
          "startPos": 9,
          "length": 1
        }
      ]
    }
  ]
}
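The startPos and length values above count space-separated tokens, not characters. A minimal sketch of that mapping, assuming the utterance is split on whitespace as the training code further down does; Tokenize and EntityText are hypothetical helper names used only for illustration, not part of MITIE:

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Split an utterance on whitespace. For text with single spaces, such as the
// sample utterance, this gives the same tokens as splitting on " ".
std::vector<std::string> Tokenize(const std::string& text)
{
    std::istringstream stream(text);
    std::vector<std::string> tokens;
    std::string token;
    while (stream >> token)
        tokens.push_back(token);
    return tokens;
}

// Return the tokens covered by [startPos, startPos + length), joined by spaces.
// Throws if the span is empty or reaches past the end of the utterance.
std::string EntityText(const std::vector<std::string>& tokens,
                       std::size_t startPos, std::size_t length)
{
    if (length == 0 || startPos >= tokens.size() || length > tokens.size() - startPos)
        throw std::out_of_range("entity span lies outside the utterance");
    std::string result = tokens[startPos];
    for (std::size_t i = startPos + 1; i < startPos + length; ++i)
        result += " " + tokens[i];
    return result;
}
```

For the sample utterance, EntityText(tokens, 5, 2) gives "New Delhi" and EntityText(tokens, 9, 1) gives "tomorrow." with the period still attached, because only spaces separate tokens.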
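You can add more utterances and annotate them; the more training data you have, the better the prediction accuracy is likely to be.
This annotated JSON can also be produced with front-end tools such as jQuery or Angular, but for brevity I created it by hand.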
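Counting token positions by hand is error-prone, so a small helper can compute them instead. The following is only a sketch under the assumption that tokens are separated by whitespace; FindEntitySpan and SplitOnWhitespace are hypothetical names, not part of MITIE or RapidJSON:

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split text into whitespace-separated tokens.
std::vector<std::string> SplitOnWhitespace(const std::string& text)
{
    std::istringstream stream(text);
    std::vector<std::string> tokens;
    std::string token;
    while (stream >> token)
        tokens.push_back(token);
    return tokens;
}

// Find the first place where `phrase` occurs in `utterance` as a run of whole
// tokens. On success, fills startPos (token index) and length (token count)
// and returns true; otherwise returns false and leaves both untouched.
bool FindEntitySpan(const std::string& utterance, const std::string& phrase,
                    std::size_t& startPos, std::size_t& length)
{
    const std::vector<std::string> tokens = SplitOnWhitespace(utterance);
    const std::vector<std::string> wanted = SplitOnWhitespace(phrase);
    if (wanted.empty() || wanted.size() > tokens.size())
        return false;
    for (std::size_t i = 0; i + wanted.size() <= tokens.size(); ++i)
    {
        if (std::equal(wanted.begin(), wanted.end(), tokens.begin() + i))
        {
            startPos = i;
            length = wanted.size();
            return true;
        }
    }
    return false;
}
```

Matching is on whole tokens, so searching the sample utterance for "tomorrow" fails while "tomorrow." succeeds; strip punctuation consistently in both the annotation and the tokenizer if that matters for your data.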
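Now, parse the annotated JSON file and pass it to the add_entity method of ner_training_instance.
C++ has no reflection to deserialize JSON with, which is why you can use the RapidJSON parser library. Download the package from its Git page and place it under "C:\Users\xyz\Documents\MITIE-master\mitielib\include\mitie".
Now we have to customize the train_ner_example.cpp file so that it parses our annotated custom-entity JSON and passes it to MITIE for training.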
#include "mitie/rapidjson/document.h"
#include "mitie/ner_trainer.h"
#include <cassert>
#include <fstream>
#include <iostream>
#include <list>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <tuple>
#include <utility>
#include <vector>
using namespace mitie;
using namespace dlib;
using namespace std;
using namespace rapidjson;

// Utterance text -> list of (entity type, start token index, token count).
typedef std::map<std::string, std::list<std::tuple<std::string, int, int>>> AnnotatedUtterances;

// Read the whole annotation file into a string.
string ReadJSONFile(const string& filePath)
{
    ifstream file(filePath);
    if (!file)
        throw std::runtime_error("Unable to open " + filePath);
    std::stringstream buffer;
    buffer << file.rdbuf();
    return buffer.str();
}

// Helper function to tokenize a string based on multiple delimiters such as ,.;:- or whitespace
std::vector<string> SplitStringIntoMultipleParameters(const string& input, const string& delimiter)
{
    std::stringstream stringStream(input);
    std::string line;
    std::vector<string> TokenizedStringVector;
    while (std::getline(stringStream, line))
    {
        size_t prev = 0, pos;
        while ((pos = line.find_first_of(delimiter, prev)) != string::npos)
        {
            if (pos > prev)
                TokenizedStringVector.push_back(line.substr(prev, pos - prev));
            prev = pos + 1;
        }
        if (prev < line.length())
            TokenizedStringVector.push_back(line.substr(prev, string::npos));
    }
    return TokenizedStringVector;
}

// Parse the JSON and store it in C++ containers for processing.
AnnotatedUtterances FindUtteranceTuple(const string& stringifiedJSONFromFile)
{
    Document document;
    document.Parse(stringifiedJSONFromFile.c_str());
    if (document.HasParseError() || !document.IsObject() || !document.HasMember("AnnotatedTextList"))
        throw std::runtime_error("The JSON is invalid or has no AnnotatedTextList member");
    const Value& a = document["AnnotatedTextList"];
    assert(a.IsArray());
    AnnotatedUtterances annotatedUtterancesMap;
    for (SizeType outerIndex = 0; outerIndex < a.Size(); outerIndex++)
    {
        assert(a[outerIndex].IsObject());
        assert(a[outerIndex]["entities"].IsArray());
        const Value& entitiesArray = a[outerIndex]["entities"];
        std::list<std::tuple<string, int, int>> entitiesTuple;
        for (SizeType innerIndex = 0; innerIndex < entitiesArray.Size(); innerIndex++)
        {
            const Value& entity = entitiesArray[innerIndex];
            entitiesTuple.push_back(std::make_tuple(string(entity["type"].GetString()), entity["startPos"].GetInt(), entity["length"].GetInt()));
        }
        annotatedUtterancesMap.insert(std::make_pair(string(a[outerIndex]["text"].GetString()), entitiesTuple));
    }
    return annotatedUtterancesMap;
}

int main(int argc, char** argv)
{
    try
    {
        if (argc != 3)
        {
            cout << "You must give the path to the MITIE English total_word_feature_extractor.dat file" << endl;
            cout << "followed by the path to the annotated JSON file." << endl;
            cout << "So run this program with a command like: " << endl;
            cout << "./train_ner_example ../../../MITIE-models/english/total_word_feature_extractor.dat data.json" << endl;
            return 1;
        }
        const string stringifiedJSONFromFile = ReadJSONFile(argv[2]);
        const AnnotatedUtterances annotatedUtterancesMap = FindUtteranceTuple(stringifiedJSONFromFile);
        ner_trainer trainer(argv[1]);
        for (const auto& item : annotatedUtterancesMap)
        {
            // Tokens are split on spaces, so startPos and length count tokens, not characters.
            const std::vector<string> tokenizedUtterance = SplitStringIntoMultipleParameters(item.first, " ");
            ner_training_instance currentInstance(tokenizedUtterance);
            for (const auto& entity : item.second)
                currentInstance.add_entity(std::get<1>(entity), std::get<2>(entity), std::get<0>(entity).c_str());
            trainer.add(currentInstance);
        }
        trainer.set_num_threads(4);
        named_entity_extractor ner = trainer.train();
        // Save the trained model next to the executable.
        serialize("new_ner_model.dat") << "mitie::named_entity_extractor" << ner;
        const std::vector<std::string> tagstr = ner.get_tag_name_strings();
        cout << "The tagger supports " << tagstr.size() << " tags:" << endl;
        for (unsigned int i = 0; i < tagstr.size(); ++i)
            cout << "\t" << tagstr[i] << endl;
        return 0;
    }
    catch (std::exception& e)
    {
        cerr << "Failed because: " << e.what() << endl;
        return 1;
    }
}
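The ner_training_instance is constructed from the tokenized utterance (a vector of strings). add_entity then takes 3 parameters: the start index of the entity's first word in the sentence, the number of words the entity spans, and the custom entity type name.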
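Because the start index and span come straight from hand-written JSON, it can help to sanity-check them before they reach add_entity. The sketch below assumes that a usable span must be non-empty, lie inside the token vector, and not share tokens with another entity of the same utterance; that last requirement is my understanding of what MITIE expects, so verify it against ner_trainer.h. EntitySpan and SpansAreValid are hypothetical names:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One annotation as stored in data.json (field names follow the sample file).
struct EntitySpan
{
    std::string type;
    std::size_t startPos;
    std::size_t length;
};

// True when every span is non-empty, lies inside the utterance,
// and no two spans cover the same token.
bool SpansAreValid(const std::vector<EntitySpan>& spans, std::size_t tokenCount)
{
    std::vector<bool> taken(tokenCount, false);
    for (const EntitySpan& span : spans)
    {
        if (span.length == 0 || span.startPos >= tokenCount ||
            span.length > tokenCount - span.startPos)
            return false;
        for (std::size_t i = span.startPos; i < span.startPos + span.length; ++i)
        {
            if (taken[i])
                return false; // token already claimed by an earlier span
            taken[i] = true;
        }
    }
    return true;
}
```

Calling a check like this once per utterance, with tokenCount set to the size of the tokenized vector, turns a bad annotation into a clear error message instead of a failure somewhere inside training.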
Now we have to build train_ner_example.cpp with the following commands in the Visual Studio Developer Command Prompt.
1) cd "C:\Users\xyz\Documents\MITIE-master\examples\cpp\train_ner"
2) mkdir build
3) cd build
4) cmake -G "Visual Studio 14 2015 Win64" ..
5) cmake --build . --config Release --target install
6) cd Release
7) train_ner_example "C:\Users\xyz\Documents\MITIE-master\MITIE-models\english\total_word_feature_extractor.dat" "C:\Users\xyz\Documents\MITIE-master\examples\cpp\train_ner\data.json"
After the above runs successfully, we get a new_ner_model.dat file, which is the serialized, trained version of our utterances.
That .dat file can now be passed to RASA or used standalone.
To pass it to RASA:
Make a config.json file as follows (JSON treats a backslash as an escape character, so each backslash in a Windows path is doubled):
{
  "project": "demo",
  "path": "C:\\Users\\xyz\\Desktop\\RASA\\models",
  "response_log": "C:\\Users\\xyz\\Desktop\\RASA\\logs",
  "pipeline": ["nlp_mitie", "tokenizer_mitie", "ner_mitie", "ner_synonyms", "intent_entity_featurizer_regex", "intent_classifier_mitie"],
  "data": "C:\\Users\\xyz\\Desktop\\RASA\\data\\examples\\rasa.json",
  "mitie_file": "C:\\Users\\xyz\\Documents\\MITIE-master\\examples\\cpp\\train_ner\\Release\\new_ner_model.dat",
  "fixed_model_name": "demo",
  "cors_origins": ["*"],
  "aws_endpoint_url": null,
  "token": null,
  "num_threads": 2,
  "port": 5000
}
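A note on the Windows paths in config.json: inside a JSON string the backslash is an escape character, so a path needs every backslash doubled (or forward slashes) to parse. If you generate the file from code, a small helper along these lines could do the escaping; EscapeJsonPath is a hypothetical name and the sketch only handles backslashes and double quotes, not control characters:

```cpp
#include <string>

// Escape backslashes and double quotes so a Windows path can be placed
// inside a JSON string literal.
std::string EscapeJsonPath(const std::string& path)
{
    std::string escaped;
    escaped.reserve(path.size());
    for (char c : path)
    {
        if (c == '\\' || c == '"')
            escaped += '\\'; // prefix the character with an escaping backslash
        escaped += c;
    }
    return escaped;
}
```

The returned string still needs the surrounding double quotes added when it is written into the file.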