Regular Expression and Rule-Based Matcher to extract legal citation titles and volumes

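I am trying to extract case titles, volumes, and page numbers from inconsistently formatted legal documents. I am using two approaches: regular expressions, and spaCy's rule-based matching on entities and POS tags (still learning this...). With regex I get more than half of the citations (thanks to the code in the answer below), but with spaCy I get zero. My code is: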

import re
import spacy
from spacy.matcher import Matcher

# Load the small English model once and build a matcher on its vocabulary.
nlp = spacy.load('en_core_web_sm')
m_tool = Matcher(nlp.vocab)

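# Keep the raw text in `contents` (used by the regex below) and the parsed text in `doc`.
with open('text1.txt', mode='r', encoding='utf-8') as f:
    contents = f.read()
#print(contents)

doc = nlp(contents)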
#print([(ent.text, ent.label_) for ent in doc.ents])


p1 = [{'IS_TITLE': 'NN'}, {'LOWER': 'v'}, {'IS_PUNCT': True}, {'IS_TITLE': 'NN'}]
p2 = [{'IS_TITLE': 'NN'}, {'IS_TITLE': 'NN'}, {'LOWER': 'v'}, {'IS_PUNCT': True}, {'IS_TITLE': 'NN'}]
p3 = [{'IS_TITLE': 'NN'}, {'LOWER': 'v'}, {'IS_PUNCT': True}, {'IS_TITLE': 'NN'}, {'IS_TITLE': 'NN'},]
p4 = [{'IS_TITLE': 'NN'}, {'IS_TITLE': 'NN'}, {'LOWER': 'v'}, {'IS_PUNCT': True}, {'IS_TITLE': 'NN'}, {'IS_TITLE': 'NN'}]
p5 = [{'IS_TITLE': 'NN'}, {'IS_TITLE': 'NN'}, {'IS_TITLE': 'NN'}, {'LOWER': 'v'}, {'IS_PUNCT': True}, {'IS_TITLE': 'NN'}, {'IS_TITLE': 'NN'}, {'IS_TITLE': 'NN'}]
p6 = [{'IS_TITLE': 'NN'}, {'IS_TITLE': 'NN'}, {'LOWER': 'v'}, {'IS_PUNCT': True}, {'IS_TITLE': 'NN'}, {'IS_TITLE': 'NN'}, {'IS_TITLE': 'NN'}]
p7 = [{'IS_TITLE': 'NN'}, {'LOWER': 'v'}, {'IS_PUNCT': True}, {'IS_TITLE': 'NN'}, {'IS_TITLE': 'NN'}, {'IS_TITLE': 'NN'}]
p8 = [{'IS_TITLE': 'NN'}, {'IS_TITLE': 'NN'}, {'IS_TITLE': 'NN'}, {'LOWER': 'v'}, {'IS_PUNCT': True}, {'IS_TITLE': 'NN'}]
p9 = [{'IS_TITLE': 'NN'}, {'IS_TITLE': 'NN'}, {'IS_TITLE': 'NN'}, {'LOWER': 'v'}, {'IS_PUNCT': True}, {'IS_TITLE': 'NN'}, {'IS_TITLE': 'NN'}]
p10 = [{'label': 'PERSON'}]
p11 = [{'label': 'ORG'}, {'label': 'PERSON'}]
p12 = [{'label': 'PERSON'}, {'label': 'ORG'}]
p13 = [{'label': 'ORG'}, {'label': 'ORG'}, {'label': 'ORG'}, {'label': 'ORG'}]

m_tool.add('QBF', None, p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13)

phrase_matches = m_tool(doc)
print(phrase_matches)

matches = re.findall(r'(?:[A-Z]\w*\.? )+v\. .*?\d{4}\)', contents)
for match in matches:
    print(match)

My text1 looks like this:

text1 = "material fact challenged. Brill v. Guardian Life Ins. Co. of America, 142 N.J. 520, 529 (1995)
(emphasis original).
When a movant establishes certain facts, those who would oppose the motion are under See Della v. Guard Lifal Ins. Co. of SA, 142 N.J. 420, 549 (2011)
an obligation to come forward with controverting facts. Heljon Mgmt. Corp. v. DiLeo, 55 N.J.
Super. 306, 312-13 (No Citations. This was extracted from NJ Sup..). Mere assertions and allegations in the pleadings are
insufficient to defeat motions for summary judgment. Ocean Cape Hotel Corp. v. Masefield
Corp., 63 N.J. Super. 369, 383 (App. Div. 1960). Where the party opposing summary
 "

I expect both approaches to return all of these matches:

"Brill v. Guardian Life Ins. Co. of America, 142 N.J. 520, 529 (1995)"
"Della v. Guard Lifal Ins. Co. of SA, 142 N.J. 420, 549 (2011)"
"Heljon Mgmt. Corp. v. DiLeo, 55 N.J. Super. 306, 312-13 (No Citations. This was extracted from NJ Sup..)"
"Ocean Cape Hotel Corp. v. Masefield Corp., 63 N.J. Super. 369, 383 (App. Div. 1960)"

I'm not sure it works in all cases, but you can try this:

matches = re.findall(r"(?:[A-Z]\w*\.? )+v\. .*?\d{4}\)", contents)

It gives:

['Brill v. Guardian Life Ins. Co. of America, 142 N.J. 520, 529 (1995)',
 'Heljon Mgmt. Corp. v. DiLeo, 55 N.J. Super. 306, 312-13 (App. Div. 1959)',
 'Ocean Cape Hotel Corp. v. Masefield Corp., 63 N.J. Super. 369, 383 (App. Div. 1960)']
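Since `.` does not cross line breaks by default, a citation that wraps onto the next line in the file is skipped by the pattern as written. Here is a minimal sketch of one way around that, assuming it is acceptable to collapse all whitespace before searching. It also shows how the title, volume, and first page might be pulled out of each hit. The names `find_citations`, `split_citation`, and the `FIELDS` pattern are hypothetical helpers, not part of the answer above, and the sample is shortened from the question's text.

```python
import re

# Pattern from the answer: capitalized words, "v. ", then the shortest run of
# characters that ends in a four-digit year followed by ")".
CITATION = re.compile(r"(?:[A-Z]\w*\.? )+v\. .*?\d{4}\)")

# Hypothetical helper pattern: title, then ", <volume> <reporter> <first page>".
FIELDS = re.compile(
    r"(?P<title>.+?), (?P<volume>\d+) (?P<reporter>[A-Z][\w. ]*?) (?P<page>\d+)"
)

def find_citations(text):
    """Collapse whitespace runs (including newlines) to single spaces, then search."""
    flat = " ".join(text.split())
    return CITATION.findall(flat)

def split_citation(citation):
    """Return title, volume, reporter and first page, or None if the shape differs."""
    m = FIELDS.match(citation)
    return m.groupdict() if m else None

# Shortened from the question; the second citation wraps across two lines.
sample = (
    "material fact challenged. Brill v. Guardian Life Ins. Co. of America, 142 N.J. 520, 529 (1995)\n"
    "(emphasis original).\n"
    "insufficient to defeat motions for summary judgment. Ocean Cape Hotel Corp. v. Masefield\n"
    "Corp., 63 N.J. Super. 369, 383 (App. Div. 1960). Where the party opposing summary\n"
)

for citation in find_citations(sample):
    print(citation)
    print(split_citation(citation))
```

Two caveats follow from how the pattern is built. A capitalized word directly before the case name, such as "See" in "See Della v. ...", is captured as part of the title. And a citation whose parenthetical has no four-digit year, like the Heljon entry in the question's text, has no valid end point, so the lazy `.*?` keeps going until the next year in parentheses and merges two citations into one match.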