Create a vocabulary with pos
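I want to use part-of-speech (POS) tags to create lists of semantic entities (nouns, verbs, punctuation, etc.).
I currently run the following code: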
import spacy
import pandas as pd

nlp = spacy.load('en_core_web_sm', disable=['ner', 'textcat'])

def fun(text):
    doc = nlp(text)
    pos = ""
    for token in doc:
        pos += token.pos_ + " "
    return pos

df['S'] = df.Text.apply(fun)
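This creates the sentence structure.
For example, given a Text column (see below), this code generates a column S containing the POS structure of each sentence: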
   Text                                            S
0  “I will meet quite a few people, it’s well...   PUNCT NOUN VERB VERB DET DET ADJ NOUN PUNCT PR...
1  Says “Cristiano Ronaldo’s family still owns”... VERB PUNCT PROPN PROPN PART NOUN ADV VERB PUNC...
2  Joe Biden plagiarized Donald Trump in his...    PROPN PROPN VERB PROPN PROPN ADP DET PROP...
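I would like to know whether I can build a vocabulary of nouns, verbs, det, adj, ... by editing the code above, or whether I need a different approach.
To get all entities (nouns, verbs, etc.) in the dataframe, I would keep only the unique values, so that there is one list per entity type.
Example output (it could also be lists rather than a dataframe):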
PUNCT   NOUN     VERB          ....
“       I        will
,       people   meet
”       family   says
                 owns
                 plagiarized
You can try:
import spacy
import pandas as pd
from pprint import pprint

nlp = spacy.load('en_core_web_sm', disable=['ner', 'textcat'])
texts = ['"I will meet quite a few people, it\'s well',
         'Says "Cristiano Ronaldo\'s family still owns"',
         'Joe Biden plagiarized Donald Trump in his...']
df = pd.DataFrame({"Text": texts})

d = dict()  # maps each POS tag to the list of tokens carrying it

def func(text):
    doc = nlp(text)
    for tok in doc:
        if tok.pos_ not in d:
            d[tok.pos_] = [tok.text]
        else:
            d[tok.pos_].append(tok.text)

df.Text.apply(func)
pprint(d)
{'ADJ': ['few'],
'ADP': ['in'],
'ADV': ['well', 'still'],
'AUX': ["'s"],
'DET': ['quite', 'a', 'his'],
'NOUN': ['people', 'family'],
'PART': ["'s"],
'PRON': ['I', 'it'],
'PROPN': ['Cristiano', 'Ronaldo', 'Joe', 'Biden', 'Donald', 'Trump'],
'PUNCT': ['"', ',', '"', '"', '...'],
'VERB': ['will', 'meet', 'Says', 'owns', 'plagiarized']}
Note that you don't need the pandas dependency at all:
docs = nlp.pipe(texts)
d = dict()
for doc in docs:
    for tok in doc:
        if tok.pos_ not in d:
            d[tok.pos_] = [tok.text]
        else:
            d[tok.pos_].append(tok.text)
pprint(d)
{'ADJ': ['few'],
'ADP': ['in'],
'ADV': ['well', 'still'],
'AUX': ["'s"],
'DET': ['quite', 'a', 'his'],
'NOUN': ['people', 'family'],
'PART': ["'s"],
'PRON': ['I', 'it'],
'PROPN': ['Cristiano', 'Ronaldo', 'Joe', 'Biden', 'Donald', 'Trump'],
'PUNCT': ['"', ',', '"', '"', '...'],
'VERB': ['will', 'meet', 'Says', 'owns', 'plagiarized']}
Both versions collect every token under its POS tag.
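The if/else bookkeeping can also be expressed with collections.defaultdict. The sketch below is only an illustration: group_by_pos is a hypothetical helper name, and it takes plain (text, pos) pairs rather than spaCy tokens so that it runs without a model installed.

```python
from collections import defaultdict

def group_by_pos(tagged_tokens):
    """Group token texts by POS tag, keeping duplicates in first-seen order."""
    vocab = defaultdict(list)
    for text, pos in tagged_tokens:
        vocab[pos].append(text)  # a missing key starts as an empty list
    return dict(vocab)

# Hand-written (text, pos) pairs standing in for spaCy output (assumed tags)
sample = [("Joe", "PROPN"), ("Biden", "PROPN"),
          ("plagiarized", "VERB"), ("in", "ADP")]
vocab = group_by_pos(sample)
print(vocab)
```

With spaCy, the pairs could be produced by a generator such as ((tok.text, tok.pos_) for doc in nlp.pipe(texts) for tok in doc).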
If you only need lists of unique tokens:
texts = ['"I will will meet quite a few people, it\'s well',
         'Says "Cristiano Ronaldo\'s family still owns"',
         'Joe Biden plagiarized Donald Trump in his...']
docs = nlp.pipe(texts)
d = dict()
for doc in docs:
    for tok in doc:
        if tok.pos_ not in d:
            d[tok.pos_] = [tok.text]
        elif tok.text not in d[tok.pos_]:
            d[tok.pos_].append(tok.text)
pprint(d)
{'ADJ': ['few'],
'ADP': ['in'],
'ADV': ['well', 'still'],
'AUX': ["'s"],
'DET': ['quite', 'a', 'his'],
'NOUN': ['people', 'family'],
'PART': ["'s"],
'PRON': ['I', 'it'],
'PROPN': ['Cristiano', 'Ronaldo', 'Joe', 'Biden', 'Donald', 'Trump'],
'PUNCT': ['"', ',', '...'],
'VERB': ['will', 'meet', 'Says', 'owns', 'plagiarized']}
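The question asked for one column per tag, and the "not in" check on a list gets slow for large vocabularies. A rough sketch that addresses both follows; unique_vocab and to_rows are hypothetical helper names, and the input is again plain (text, pos) pairs rather than spaCy tokens so the example runs on its own.

```python
from itertools import zip_longest

def unique_vocab(tagged_tokens):
    """Collect unique token texts per POS tag, preserving first-seen order."""
    seen = {}
    for text, pos in tagged_tokens:
        # Inner dict keys act as an ordered set with constant-time lookup
        seen.setdefault(pos, {})[text] = None
    return {pos: list(texts) for pos, texts in seen.items()}

def to_rows(vocab, fill=""):
    """Lay the per-tag lists side by side, padding shorter columns with fill."""
    header = list(vocab)
    rows = [list(row) for row in zip_longest(*vocab.values(), fillvalue=fill)]
    return header, rows

# Hand-written (text, pos) pairs standing in for spaCy output (assumed tags)
sample = [("I", "PRON"), ("will", "VERB"), ("will", "VERB"),
          ("meet", "VERB"), ("people", "NOUN")]
vocab = unique_vocab(sample)
header, rows = to_rows(vocab)
print(header)
for row in rows:
    print(row)
```

The comparison is case-sensitive, so "Says" and "says" count as different entries; passing tok.text.lower() or tok.lemma_ instead of tok.text is one way to merge them, if that suits the use case. The header and rows can be handed to pd.DataFrame(rows, columns=header) when a dataframe is preferred.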