Extracting fields from emails based on values in a database as a training set

Question

I have 480 emails, and each email contains one or all of the following values:

[person, degree, working/not working, role]

For example, one of the emails looks like this:

    Hi Amy,

    I wanted to discuss about Bob. I can see that he has a degree in 
    Computer Science which he got three years ago but hes still unemployed. 
    I dont know whetehr he'll be fit for the role of junior programmer at 
    our institute.
    Will get back to you on this.

    Thanks

The database entry corresponding to this email looks like this:

Email_123 | Bob | Computer Science | Unemployed | Junior Programmer

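Even though the data is not tagged, we still have a database of sorts in which to look up the values that were extracted into the 4 fields from each email. My question is: how can I use this corpus of 480 emails to learn to extract these 4 fields using machine learning/NLP? Do I need to manually tag all 480 emails, like this: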

    I wanted to discuss about <person>Bob</person>. I can see that he has a degree in 
    <degree>Computer Science</degree> which he got....

Or is there a better way? Something like this (MarI/O - Machine Learning for Video Games): https://www.youtube.com/watch?v=qv6UVOQ0F44&t=149s

Answer 1

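Assuming each email has only one value per field, and that the value is always copied verbatim from the email, you could use something like WikiReading.

The problem is that WikiReading was trained on 4.7 million examples, so if you only have 480, that is nowhere near enough to train a good model.

My suggestion is to preprocess your dataset to add the tags automatically, as in your example. Something like this, in pseudo-Python: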

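    entity = "Junior Programmer"   # value taken from the database row
    entity_type = "role"           # name of the field, used as the tag
    mail = "...[text of email]..."

    # Exact, case-sensitive search for the first occurrence of the value.
    ind = mail.find(entity)
    if ind == -1:
        tagged = mail  # value not found verbatim: leave the email untagged
    else:
        end = ind + len(entity)
        # Wrap the matched text in <tag>...</tag>, keeping the rest unchanged.
        tagged = "{front}<{tag}>{ent}</{tag}>{back}".format(
            front=mail[:ind],
            back=mail[end:],
            tag=entity_type,
            ent=entity)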

You will need to adapt this for case differences, multiple matches, and so on.
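As a minimal sketch of what those adjustments might look like: the hypothetical helper `find_spans` below matches each database value case-insensitively and returns every occurrence as character offsets instead of inline tags, so that tagging one field does not shift the positions of the others. The function name, the `fields` dictionary, the `status` label, and the decision to keep all matches are assumptions for illustration; the sample text is an abridged version of the email from the question.

```python
import re

def find_spans(mail, fields):
    """Return sorted (start, end, label) spans for each database value found in mail."""
    spans = []
    for label, value in fields.items():
        if not value:
            continue  # this field is empty in the database row
        # re.escape makes the value match literally; IGNORECASE handles
        # "Unemployed" in the database vs. "unemployed" in the email.
        for m in re.finditer(re.escape(value), mail, flags=re.IGNORECASE):
            spans.append((m.start(), m.end(), label))
    return sorted(spans)

mail = ("I wanted to discuss about Bob. I can see that he has a degree in "
        "Computer Science but hes still unemployed. I dont know whether he'll "
        "be fit for the role of junior programmer at our institute.")
fields = {"person": "Bob", "degree": "Computer Science",
          "status": "Unemployed", "role": "Junior Programmer"}

spans = find_spans(mail, fields)
for start, end, label in spans:
    print(label, repr(mail[start:end]))
```

Note that this is plain substring matching, so a value such as "Bob" would also match inside "Bobby"; whether to add word-boundary checks, or to keep only the first match per field, depends on your data.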

Once you have tagged data, you can use a traditional NER system such as a CRF. Here is a tutorial using spaCy in Python.
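A CRF-style tagger usually expects one label per token rather than inline tags. As a rough sketch of that conversion, the hypothetical function `to_bio` below takes character spans of the form (start, end, label) and produces BIO labels. The regex tokenizer is an assumption made to keep the example self-contained; a real tokenizer, such as the one in spaCy, would likely be a better fit.

```python
import re

def to_bio(text, spans):
    """Convert (start, end, label) character spans into per-token BIO labels."""
    tokens, labels = [], []
    # Naive tokenizer: runs of word characters, or single punctuation marks.
    for m in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in spans:
            if m.start() >= start and m.end() <= end:
                # The token that opens a span gets B-, the following ones I-.
                tag = ("B-" if m.start() == start else "I-") + label
                break
        tokens.append(m.group())
        labels.append(tag)
    return tokens, labels

text = "He has a degree in Computer Science."
start = text.index("Computer Science")
spans = [(start, start + len("Computer Science"), "degree")]

tokens, labels = to_bio(text, spans)
for token, label in zip(tokens, labels):
    print(token, label)
```

The resulting token and label sequences can then be turned into whatever feature format your chosen CRF or NER library requires.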

nlp

machine-learning

named-entity-recognition