Best Approach for Custom Information Extraction (NER)

I am trying to extract locations (NER/IE) from blobs of text and have tried many solutions (spaCy, Stanford, etc.), all of which are far too inaccurate.

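All of them are really only about 80-90% accurate on my dataset (spaCy was more like 70%). Another problem is that none of them gives a probability for an entity that actually means anything, so I don't know the confidence and can't act on it.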

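I tried a super naive approach: split my blobs into single words, extract the surrounding context as features, and use a location place-name lookup (30/40k place names) as an additional feature. Then I just used a classifier (XGBoost), and the results were much better once I had trained it on about 3k manually labelled data points (100k in total, of which only 3k were locations): 95% precision for states/countries and about 85% for cities.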
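A minimal sketch of what such per-token features might look like, assuming whitespace tokenisation and a lower-cased gazetteer set. The function name `token_features`, the feature names, and the window size are illustrative assumptions, not the asker's actual code.

```python
from typing import Dict, List, Set


def token_features(
    tokens: List[str], i: int, gazetteer: Set[str], window: int = 2
) -> Dict[str, object]:
    """Build features for tokens[i]: the word, its neighbours, gazetteer hits."""
    word = tokens[i]
    feats: Dict[str, object] = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "word.in_gazetteer": word.lower() in gazetteer,
    }
    # Surrounding context: lower-cased neighbours, padded at the edges.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        feats[f"ctx[{offset}]"] = tokens[j].lower() if 0 <= j < len(tokens) else "<pad>"
    # Bigram lookup so multi-word names such as "san francisco" can fire.
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    feats["bigram_next.in_gazetteer"] = f"{word} {nxt}".strip().lower() in gazetteer
    return feats


# Hypothetical gazetteer and a fragment of the example blob.
tokens = "519 Mt Petrie Road Mackenzie , Qld 4156".split()
gazetteer = {"mackenzie", "qld", "san francisco"}
rows = [token_features(tokens, i, gazetteer) for i in range(len(tokens))]
```

The resulting dicts could then be vectorised (for instance with scikit-learn's `DictVectorizer`) before being passed to a classifier such as XGBoost.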

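This approach is obviously crude, so why does it outperform everything else I've tried? I think the black-box approach to NER just isn't working for my data; I tried spaCy custom training and it really didn't seem to work. Not having a confidence for each entity is also a killer, since the probabilities these tools give you are almost meaningless.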

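Is there some way I can approach this problem better to improve my results even further? Shallow NLP over 2/3/4-grams, for example? Another problem with my approach is that the classifier's output is not a sequence of entities; it is literally just individually classified words, which somehow need to be grouped back into one entity. For example, San Francisco, CA comes out as just 'city', 'city', '0', 'state', with no notion that these belong to the same entity.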

spaCy example:

Example blob:

About Us - Employment Opportunities Donate Donate Now The Power of Mushrooms Enhancing Response Where We Work Map Australia Africa Asia Pacific Our Work Agriculture Anti - Trafficking and Gender - based Violence Education Emergency Response Health and Nutrition Rural and Economic Development About Us Who We Are Annual Report Newsletters Employment Opportunities Video Library Contact Us Login My Profile Donate Join Our Email List Employment Opportunities Annual Report Newsletters Policies Video Library Contact Us Employment Opportunities Current Career Opportunity Internships Volunteer Who We Are Our History Employment Opportunities with World Hope International Working in Service to the Poor Are you a professional that wants a sense of satisfaction out of your job that goes beyond words of affirmation or a pat on the back ? You could be a part of a global community serving the poor in the name of Jesus Christ . You could use your talents and resources to make a significant difference to millions . Help World Hope International give a hand up rather than a hand out . Career opportunities . Internship opportunities . Volunteer Why We Work Here World Hope International envisions a world free of poverty . Where young girls aren ’ t sold into sexual slavery . Where every child has enough to eat . Where men and women can earn a fair and honest wage , and their children aren ’ t kept from an education . Where every community in Africa has clean water . As an employee of World Hope International , these are the people you will work for . Regardless of their religious beliefs , gender , race or ethnic background , you will help shine the light of hope into the darkness of poverty , injustice and oppression . Find out more by learning about the of World Hope International and reviewing a summary of our work in the most recent history annual report . 
Equal Opportunity Employer World Hope International is both an equal opportunity employer and a faith - based religious organization . We hire US employees without regard to race , color , ancestry , national origin , citizenship , age , sex , marital status , parental status , membership in any labor organization , political ideology or disability of an otherwise qualified individual . We hire national employees in our countries of operation pursuant to the law of the country where we hire the employees . The status of World Hope International as an equal opportunity employer does not prevent the organization from hiring US staff based on their religious beliefs so that all US staff share the same religious commitment . Pursuant to the United States Civil Rights Act of 1964 , Section 702 ( 42 U . S . C . 2000e 1 ( a ) ) , World Hope International has the right to , and does , hire only candidates whose beliefs align with the Apostle ’ s Creed . Apostle ’ s Creed : I believe in Jesus Christ , Gods only Son , our Lord , who was conceived by the Holy Spirit , born of the Virgin Mary , suffered under Pontius Pilate , was crucified , died , and was buried ; he descended to the dead . On the third day he rose again ; he ascended into heaven , he is seated at the right hand of the Father , and he will come again to judge the living and the dead . I believe in the Holy Spirit , the holy catholic church , the communion of saints , the forgiveness of sins , the resurrection of the body , and the life everlasting . AMEN . Christian Commitment All applicants will be screened for their Christian commitment . This process will include a discussion of : The applicant ’ s spiritual journey and relationship with Jesus Christ as indicated in their statement of faith The applicant ’ s understanding and acceptance of the Apostle ’ s Creed . Statement of Faith A statement of faith describes your faith and how you see it as relevant to your involvement with World Hope International . 
It must include , at a minimum , a description of your spiritual disciplines ( prayer , Bible study , etc . ) and your current fellowship or place of worship . Applicants can either incorporate their statement of faith into their cover letter content or submit it as a separate document . 519 Mt Petrie Road Mackenzie , Qld 4156 1 - 800 - 967 - 534 ( World Hope ) + 61 7 3624 9977 CHEQUE Donations World Hope International ATTN : Gift Processing 519 Mt Petrie Road Mackenzie , Qld 4156 Spread the Word Stay Informed Join Email List Focused on the Mission In fiscal year 2015 , 88 % of all expenditures went to program services . Find out more . Privacy Policy | Terms of Service World Hope Australia Overseas Aid Fund is registered with the ACNC and all donations over $ 2 are tax deductible . ABN : 64 983 196 241 © 2017 WORLD HOPE INTERNATIONAL . All rights reserved .'

Results:

('US', 'GPE')
('US', 'GPE')
('US', 'GPE')
('the', 'GPE')
('United', 'GPE')
('States', 'GPE')
('Jesus', 'GPE')
('Christ', 'GPE')
('Pontius', 'GPE')
('Pilate', 'GPE')
('Faith', 'GPE')
('A', 'GPE')

Can you give some sample output from spaCy on your data? Countries and cities usually perform well. Are you using the v2 models or v1?

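Edit: In your text the context is often irrelevant, which is why splitting the text into single words helps. That represents the data more faithfully than keeping everything in one "blob".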

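You should probably try to segment your data better (possibly by improving the HTML extraction). You should probably also true-case the text somehow, using a rule-based process or another model.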

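You'll get the best results by training your own classifier. You can do that with spaCy or with something custom; either way, training on your own data matters more than which model you use.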

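Nowadays even the best deep-learning-based NER systems only reach an F1 of 92.0. A deep-learning-based system (CNN-BiLSTM-CRF) should outperform Stanford CoreNLP's plain CRF sequence tagger. More recently there has been further progress from integrating language models. You may want to look at AllenNLP.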

But if you want super-high accuracy, like 99.0%, for the time being you will need to integrate rule-based methods.

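I think rule-based processing could help. For example, you could write a pattern saying that "city city O , state" should be merged into one entity. You might also consider discarding entities that do not appear in a location/places dictionary, or discarding entities that are not in the location dictionary but do appear in a dictionary for another type. I find it hard to believe that many unknown string sequences are location names you care about extracting; I think people's names are the most likely to be missing from a dictionary.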
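A rough sketch of such a merge rule, assuming the classifier emits one label per token and that any label outside `LOCATION_LABELS` means "not a location". The label names and the helper `merge_locations` are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

# Assumed label set; anything else (e.g. "O") is treated as non-location.
LOCATION_LABELS = {"city", "state", "country"}


def merge_locations(
    tokens: List[str], labels: List[str]
) -> List[Tuple[str, List[str]]]:
    """Group runs of location tokens into entities.

    A comma sitting between two location tokens is bridged, so that
    'San Francisco , CA' becomes one entity instead of two.
    """
    entities: List[Tuple[str, List[str]]] = []
    cur_tokens: List[str] = []
    cur_labels: List[str] = []

    def flush() -> None:
        if cur_tokens:
            text = " ".join(cur_tokens).replace(" ,", ",")
            entities.append((text, list(cur_labels)))
            cur_tokens.clear()
            cur_labels.clear()

    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        if lab in LOCATION_LABELS:
            cur_tokens.append(tok)
            cur_labels.append(lab)
        elif (
            tok == ","
            and cur_tokens
            and i + 1 < len(tokens)
            and labels[i + 1] in LOCATION_LABELS
        ):
            cur_tokens.append(tok)  # bridge the comma, keep the entity open
        else:
            flush()
    flush()
    return entities


tokens = ["Offices", "in", "San", "Francisco", ",", "CA", "and", "Austin"]
labels = ["O", "O", "city", "city", "O", "state", "O", "city"]
merged = merge_locations(tokens, labels)
```

Each merged entity keeps its per-token labels, so a later step could still tell that the first two tokens were tagged as a city and the last as a state, or drop entities whose text is missing from the location dictionary.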

UIUC's NLP tools include some dictionaries if you download their software.

When running StanfordCoreNLP, using the ner,regexner,entitymentions annotators will automatically group consecutive NE tags into entities. Full information on the pipeline: https://stanfordnlp.github.io/CoreNLP/cmdline.html

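Also, keep in mind that the out-of-the-box versions of these systems are typically trained on news articles from the last 15 years. Retraining on data closer to your own dataset is crucial. Ultimately, you might be best off just writing some dictionary-based extraction rules.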
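As one possible reading of "dictionary-based extraction rules", here is a small sketch that greedily matches the longest 1- to 4-token n-gram against a place-name dictionary. The function name `gazetteer_matches`, the `max_n` parameter, and the tiny dictionary are assumptions for illustration only.

```python
from typing import List, Set, Tuple


def gazetteer_matches(
    tokens: List[str], gazetteer: Set[str], max_n: int = 4
) -> List[Tuple[int, int, str]]:
    """Greedy longest-match lookup of n-grams (n <= max_n) in a gazetteer.

    Returns (start, end, text) spans with `end` exclusive. The gazetteer
    is assumed to hold lower-cased, space-joined place names.
    """
    matches: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "Asia Pacific" beats "Asia".
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i : i + n])
            if candidate.lower() in gazetteer:
                matches.append((i, i + n, candidate))
                i += n
                break
        else:
            i += 1  # no n-gram starting here is a known place name
    return matches


# Fragment of the example blob with a hypothetical gazetteer.
tokens = "Where We Work Map Australia Africa Asia Pacific".split()
gazetteer = {"australia", "africa", "asia", "asia pacific"}
spans = gazetteer_matches(tokens, gazetteer)
```

Because matches come back as whole spans, this also sidesteps the problem of regrouping individually classified words into one entity.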

You can look at Stanford CoreNLP's TokensRegex and RegexNER functionality to see how to use Stanford CoreNLP for that purpose.

TokensRegex: https://nlp.stanford.edu/software/tokensregex.html
RegexNER: https://nlp.stanford.edu/software/regexner.html

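We ran into the same problem when designing our custom NER model. There are many solutions available, but I suggest reading this paper to get a full picture of NER models and approaches and their limitations.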

Title: A Survey on Deep Learning for Named Entity Recognition

URL: https://arxiv.org/pdf/1812.09449.pdf