AttributeError when building list comprehension for Wordnet.Synsets().Definition()
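First off, I'm a Python noob and I only half understand how some of this stuff works. I've been trying to build word matrices for a tagging project, and I hoped I could figure this out on my own, but I'm not seeing much documentation on my particular error. So I apologize up front if this is something super obvious.
I've been trying to get a set of functions to work in a few different variations, but I keep getting "AttributeError: 'list' has no attribute definition."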
import pandas as pd
from pandas import DataFrame, Series
import nltk.data
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.tokenize import TreebankWordTokenizer

# Gets synsets for a given term.
def get_synset(word):
    for word in wn.synsets(word):
        return word.name()

# Gets definitions for a synset.
def get_def(syn):
    return wn.synsets(syn).defnition()

# Creates a dataframe called sector_matrix based on another dataframe's column. Should be followed with an export.
def sector_tagger(frame):
    sentences = frame.tolist()
    tok_list = [tok.tokenize(w) for w in frame]
    split_words = [w.lower() for sub in tok_list for w in sub]
    clean_words = [w for w in split_words if w not in english_stops]
    synset = [get_synset(w) for w in clean_words]
    sector_matrix = DataFrame({'Categories': clean_words,
                               'Synsets': synset})
    sec_syn = sector_matrix['Synsets'].tolist()
    sector_matrix['Definition'] = [get_def(w) for w in sector_matrix['Synsets']]
    return sector_matrix
I call the function on a dataframe that I read in from Excel:
test = pd.read_excel('data.xlsx')
The sector_tagger function is called like this:
agri_matrix = sector_tagger(agri['Category'])
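Previous versions called wn.synsets(w).definition() inside the list comprehension that populates the DataFrame. Another version tried to call the definition after the fact in a Jupyter Notebook. I almost always get the attribute error. That said, when I check the dtype of sector_matrix['Synsets'] I get "object", and when I print that column I don't see [] around the items.
I've tried:
- Wrapping "w" in str()
- Calling the list comprehension outside the function (i.e. deleting the line and calling it in my notebook)
- Passing the 'Synsets' column to a new list and building the list comprehension around that
Oddly enough, I was playing around with this yesterday and was able to get something working directly in my notebook, but (a) it's messy, (b) it doesn't scale, and (c) it doesn't work on the other categories I apply it to.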
agrimask = (df['Agri-Food']==1) & (df['Total']==1)
df_agri = df.loc[agrimask,['Category']]
agri_words = [tok.tokenize(a) for a in df_agri['Category']]
agri_cip_words = [a.lower() for sub in agri_words for a in sub]
agri_clean = [w for w in agri_cip_words if w not in english_stops]
df_agri_clean = DataFrame({'Category': agri_clean})
df_agri_clean = df_agri_clean[df_agri_clean != ','].replace('horticulture/horticultural','horticulture').dropna().drop_duplicates()
df_agri_clean['Synsets'] = [x[0].name() for x in df_agri_clean['Category'].apply(syn)]
df_agri_clean['Definition'] = [wn.synset(x).definition() for x in df_agri_clean['Synsets']]
df_agri_clean['Lemma'] = [wn.synset(x).lemmas()[0].name() for x in df_agri_clean['Synsets']]
df_agri_clean
Edit 1: Here is a link to a sample of the data.
Edit 2: Also, the static variables I'm using are here (all based on the standard NLTK library):
tok = TreebankWordTokenizer()
english_stops = set(stopwords.words('english'))
french_stops = set(stopwords.words('french'))
Edit 3: You can see a working version of this code here: Working Code
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.tokenize import TreebankWordTokenizer as tok

english_stops = set(stopwords.words('english'))

# Gets synsets for a given term.
def get_synset(word):
    for word in wn.synsets(word):
        return word.name()

# Gets definitions for a synset.
def get_def(syn):
    return wn.synset(syn).definition()  # wn.synset (singular); "definition" was also misspelled

# Creates a dataframe called sector_matrix based on another dataframe's column. Should be followed with an export.
def sector_tagger(frame):
    tok_list = tok().tokenize(frame)
    split_words = [w.lower() for w in tok_list]
    clean_words = [w for w in split_words if w not in english_stops]
    synset = [get_synset(w) for w in clean_words]
    sector_matrix = pd.DataFrame({'Categories': clean_words,
                                  'Synsets': synset})
    sec_syn = list(sector_matrix['Synsets'])
    # get_synset returns None when WordNet has no entry for a word
    sector_matrix['Definition'] = [get_def(w) if w is not None else '' for w in sec_syn]
    return sector_matrix

agri_matrix = df['Category'].apply(sector_tagger)
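The AttributeError in the question is consistent with how the two WordNet lookups differ: wn.synsets(word) returns a list of Synset objects, while wn.synset(name) returns a single Synset, and only a Synset has a definition() method. The sketch below reproduces that shape without NLTK. FakeSynset, fake_synsets and fake_synset are hypothetical stand-ins and not part of NLTK; the one entry in the lookup table is copied from the sample output on this page.

```python
class FakeSynset:
    """Hypothetical stand-in for an NLTK Synset (illustration only)."""

    def __init__(self, name, definition):
        self._name = name
        self._definition = definition

    def name(self):
        return self._name

    def definition(self):
        return self._definition


# Hypothetical one-entry "corpus"; the text comes from the sample output on this page.
_ENTRIES = {
    "agricultural.a.01": FakeSynset(
        "agricultural.a.01",
        "relating to or used in or promoting agriculture or farming",
    ),
}


def fake_synsets(word):
    """Mimics wn.synsets(word): always returns a list, possibly empty."""
    return [s for s in _ENTRIES.values() if s.name().split(".")[0] == word]


def fake_synset(name):
    """Mimics wn.synset(name): returns exactly one synset object."""
    return _ENTRIES[name]


try:
    fake_synsets("agricultural").definition()  # same mistake as in the question
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

first = fake_synsets("agricultural")[0]   # index into the list first ...
same = fake_synset("agricultural.a.01")   # ... or look the synset up by its name
print(error_message)
print(first.definition() == same.definition())
```

Either indexing into the list or looking the synset up by name gets you an object that actually has definition(); calling it on the list itself does not.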
If this answers your question, please mark it as the accepted answer.
The output of get_def is a list of phrases.
Alternate approach
def sector_tagger(frame):
    # strip punctuation that would otherwise end up as tokens
    mapping = [('/', ' '), ('(', ''), (')', ''), (',', '')]
    for k, v in mapping:
        frame = frame.replace(k, v)
    tok_list = tok().tokenize(frame)  # note () after tok
    split_words = [w.lower() for w in tok_list]
    clean_words = [w for w in split_words if w not in english_stops]
    synset = [get_synset(w) for w in clean_words]
    def_matrix = [get_def(w) if w is not None else '' for w in synset]
    return clean_words, synset, def_matrix

poo = df['Category'].apply(sector_tagger)
poo[0] =
(['agricultural', 'domestic', 'animal', 'services'],
['agricultural.a.01', 'domestic.n.01', 'animal.n.01', 'services.n.01'],
['relating to or used in or promoting agriculture or farming',
'a servant who is paid to perform menial tasks around the household',
'a living organism characterized by voluntary movement',
'performance of duties or provision of space and equipment helpful to others'])
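The punctuation clean-up at the top of this sector_tagger can be tried on its own, without NLTK. A minimal sketch, assuming a made-up category string; clean_phrase is a hypothetical helper name, and str.split only stands in for the tokenizer here.

```python
MAPPING = [('/', ' '), ('(', ''), (')', ''), (',', '')]


def clean_phrase(phrase, mapping=MAPPING):
    """Apply each (old, new) replacement in order and return the cleaned string."""
    for old, new in mapping:
        phrase = phrase.replace(old, new)
    return phrase


sample = "Horticulture/Horticultural (business), services"  # made-up input
cleaned = clean_phrase(sample)
# str.split is used instead of the NLTK tokenizer, purely for illustration
words = [w.lower() for w in cleaned.split()]
print(words)
```

Replacing '/' with a space rather than an empty string is what keeps the two words on either side from being glued together.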
list_clean_words = []
list_synset = []
list_def_matrix = []
for x in poo:
    list_clean_words.append(x[0])
    list_synset.append(x[1])
    list_def_matrix.append(x[2])
agri_matrix = pd.DataFrame()
agri_matrix['Categories'] = list_clean_words
agri_matrix['Synsets'] = list_synset
agri_matrix['Definition'] = list_def_matrix
agri_matrix
   Categories                                  Synsets                                            Definition
0  [agricultural, domestic, animal, services]  [agricultural.a.01, domestic.n.01, animal.n.01...  [relating to or used in or promoting agricultu...
1  [agricultural, food, products, processing]  [agricultural.a.01, food.n.01, merchandise.n.0...  [relating to or used in or promoting agricultu...
2  [agricultural, business, management]        [agricultural.a.01, business.n.01, management....  [relating to or used in or promoting agricultu...
3  [agricultural, mechanization]               [agricultural.a.01, mechanization.n.01]            [relating to or used in or promoting agricultu...
4  [agricultural, production, operations]      [agricultural.a.01, production.n.01, operation...  [relating to or used in or promoting agricultu...
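The loop that fills list_clean_words, list_synset and list_def_matrix can also be written with zip, which transposes a sequence of 3-tuples into three sequences. A small sketch, assuming made-up placeholder rows in place of the real apply() result:

```python
# Made-up stand-in for the Series returned by df['Category'].apply(sector_tagger):
# each element is a (clean_words, synset, def_matrix) tuple.
rows = [
    (['agricultural', 'services'], ['syn_a', 'syn_b'], ['def a', 'def b']),
    (['agricultural', 'mechanization'], ['syn_a', 'syn_c'], ['def a', 'def c']),
]

# zip(*rows) groups all first items together, then all second items, then all third items.
list_clean_words, list_synset, list_def_matrix = (list(col) for col in zip(*rows))
print(list_clean_words)
```

Note that the three-way unpacking fails on an empty input, so the explicit loop is the safer choice if the column might have no rows.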
Split each list of lists into one long list (they stay in order)
def create_long_list_from_list_of_lists(list_of_lists):
    long_list = []
    for one_list in list_of_lists:
        for word in one_list:
            long_list.append(word)
    return long_list
long_list_clean_words = create_long_list_from_list_of_lists(list_clean_words)
long_list_synset = create_long_list_from_list_of_lists(list_synset)
long_list_def_matrix = create_long_list_from_list_of_lists(list_def_matrix)
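The flattening helper does the same job as itertools.chain.from_iterable from the standard library. A minimal sketch, where flatten is a hypothetical name and the nested list is made-up sample data:

```python
from itertools import chain


def flatten(list_of_lists):
    """Concatenate the inner lists into one list, preserving order."""
    return list(chain.from_iterable(list_of_lists))


nested = [['agricultural', 'services'], ['agricultural', 'mechanization']]
flat = flatten(nested)
print(flat)
```

Because order is preserved, flattening the words, synsets and definitions separately still leaves the three long lists aligned position by position.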
Convert them into a DataFrame of unique Categories
agri_df = pd.DataFrame.from_dict(dict([('Categories', long_list_clean_words), ('Synsets', long_list_synset), ('Definitions', long_list_def_matrix)])).drop_duplicates().reset_index(drop=True)
agri_df.head(4)
   Categories     Synsets             Definitions
0  ceramic        ceramic.n.01        an artifact made of hard brittle material prod...
1  horticultural  horticultural.a.01  of or relating to the cultivation of plants
2  construction   construction.n.01   the act of constructing something
3  building       building.n.01       a structure that has a roof and walls and stan...
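By default, drop_duplicates() keeps the first occurrence of each repeated row. The same idea can be sketched in plain Python, which may help when checking the result by hand; the three lists below are made-up placeholders, not real WordNet output.

```python
# Made-up, already-flattened columns; position i in each list belongs to the same row.
words = ['agricultural', 'services', 'agricultural', 'mechanization']
synsets = ['syn_a', 'syn_b', 'syn_a', 'syn_c']
definitions = ['def a', 'def b', 'def a', 'def c']

# zip builds one tuple per row; dict.fromkeys drops repeated rows and keeps
# the first occurrence, since dicts preserve insertion order (Python 3.7+).
unique_rows = list(dict.fromkeys(zip(words, synsets, definitions)))
print(unique_rows)
```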
Final notes
Import:
from nltk.tokenize import TreebankWordTokenizer as tok
or:
from nltk.tokenize import word_tokenize
Use:
tok().tokenize(string_text_phrase)  # the argument is a string phrase, not a list of words
or:
word_tokenize(string_text_phrase)
Both methods appear to produce the same output here, which is a list of words.
input = "Agricultural and domestic animal services"
output_of_both_methods = ['Agricultural', 'and', 'domestic', 'animal', 'services']