Extracting specific information from data
How can I convert data of the form:
James Smith was born on November 17, 1948
into something like
("James Smith", DOB, "November 17, 1948")
without relying on positional indexes into the string?
I have tried the following:
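import nltk  # needed for nltk.RegexpParser below
from nltk import word_tokenize, pos_tag

new = "James Smith was born on November 17, 1948"
# Tokenize the sentence, then tag each token with its part of speech
sentences = word_tokenize(new)
sentences = pos_tag(sentences)
# Chunk two adjacent proper-noun tags together (the candidate name)
grammar = "Chunk: {<NNP*><NNP*>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentences)
print(result)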
How do I proceed from here to get the desired output?
Split the string on 'was born on', then trim the whitespace and assign the two parts to name and dob.
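The split suggested above could be sketched roughly as follows. The function name `split_on_marker` is made up for illustration, and the sketch assumes the `DOB` label in the desired output can be represented as the plain string "DOB".

```python
def split_on_marker(sentence, marker="was born on"):
    """Split a sentence into (name, "DOB", date) around the marker phrase."""
    # str.partition splits at the first occurrence of the marker and
    # returns (before, marker, after); the middle item is empty if the
    # marker is absent.
    name, found, dob = sentence.partition(marker)
    if not found:
        raise ValueError(f"marker {marker!r} not found in {sentence!r}")
    # Trim the whitespace left on either side of the marker.
    return (name.strip(), "DOB", dob.strip())


print(split_on_marker("James Smith was born on November 17, 1948"))
# ('James Smith', 'DOB', 'November 17, 1948')
```

This avoids positional indexes entirely, but it only works when the sentence contains the exact phrase 'was born on'.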
You can always use a regular expression.
The regex (\S+)\s(\S+)\s\bwas born on\b\s(\S+)\s(\S+),\s(\S+)
will match and return the data from a string in the format above.
Here it is in action: https://regex101.com/r/W2ykKS/1
The regex in Python:
import re
regex = r"(\S+)\s(\S+)\s\bwas born on\b\s(\S+)\s(\S+),\s(\S+)"
test_str = "James Smith was born on November 17, 1948"
matches = re.search(regex, test_str)
# group(0) is the entire match; groups 1-5 are the captured parts
print(matches.group(1)) # James
print(matches.group(2)) # Smith
print(matches.group(3)) # November
print(matches.group(4)) # 17
print(matches.group(5)) # 1948
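To get from the individual groups to the tuple asked for in the question, one possible variation uses named groups. This is only a sketch: the pattern, the names `PATTERN` and `extract_dob`, and the choice to represent `DOB` as the string "DOB" are assumptions, not part of the answer above. Unlike the five-group regex, it does not assume the name is exactly two words.

```python
import re

# Lazily capture everything before "was born on" as the name, then a
# date shaped like "Month D, YYYY" (assumed format, as in the example).
PATTERN = re.compile(
    r"(?P<name>.+?)\s+was born on\s+(?P<dob>\w+ \d{1,2}, \d{4})"
)


def extract_dob(sentence):
    """Return (name, "DOB", date) or None if the sentence does not match."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return (match.group("name").strip(), "DOB", match.group("dob"))


print(extract_dob("James Smith was born on November 17, 1948"))
# ('James Smith', 'DOB', 'November 17, 1948')
```

Returning None on a failed match keeps the caller from hitting an AttributeError, which is what calling `.group()` on a failed `re.search` would raise.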