AttributeError when building list comprehension for Wordnet.Synsets().Definition()

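First off, I'm a Python newbie and only half understand how some of this works. I've been trying to build word matrices for a tagging project. I was hoping to figure this out on my own, but I'm not finding much documentation on my particular error, so I apologize up front if this is something really obvious.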

I've tried to get a set of functions working in a few different variations, but I keep getting "AttributeError: 'list' object has no attribute 'definition'".

import pandas as pd
from pandas import DataFrame, Series
import nltk.data
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.tokenize import TreebankWordTokenizer

# Gets synsets for a given term.

def get_synset(word):
    for word in wn.synsets(word):
        return word.name()

#Gets definitions for a synset.

def get_def(syn):
    return wn.synsets(syn).defnition()

# Creates a dataframe called sector_matrix based on another dataframe's column. Should be followed with an export.

def sector_tagger(frame):
    sentences = frame.tolist()
    tok_list = [tok.tokenize(w) for w in frame]
    split_words = [w.lower() for sub in tok_list for w in sub]
    clean_words = [w for w in split_words if w not in english_stops]
    synset = [get_synset(w) for w in clean_words]
    sector_matrix = DataFrame({'Categories': clean_words,
                               'Synsets': synset})
    sec_syn = sector_matrix['Synsets'].tolist()
    sector_matrix['Definition'] = [get_def(w) for w in sector_matrix['Synsets']]
    return sector_matrix

The functions are called on a DataFrame that I read in from Excel:

test = pd.read_excel('data.xlsx')

The sector_tagger function is called like so:

agri_matrix = sector_tagger(agri['Category'])

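Previous versions called wn.synsets(w).definition() inside the list comprehension that populates the DataFrame. Another version tried calling the definition after the fact in the Jupyter Notebook. I almost always get the AttributeError. That said, when I check the dtype of sector_matrix['Synsets'] I get "object", and when I print that column I don't see [] around the items.

I have tried:

Weirdly, I was playing around with this yesterday and was able to get something working directly in my notebook, but (a) it's messy, (b) it doesn't scale, and (c) it doesn't work on the other categories I apply it to.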

agrimask = (df['Agri-Food']==1) & (df['Total']==1)
df_agri = df.loc[agrimask,['Category']]
agri_words = [tok.tokenize(a) for a in df_agri['Category']]
agri_cip_words = [a.lower() for sub in agri_words for a in sub]
agri_clean = [w for w in agri_cip_words if w not in english_stops]
df_agri_clean = DataFrame({'Category': agri_clean})
df_agri_clean = df_agri_clean[df_agri_clean != ','].replace('horticulture/horticultural','horticulture').dropna().drop_duplicates()
df_agri_clean['Synsets'] = [x[0].name() for x in df_agri_clean['Category'].apply(syn)]
df_agri_clean['Definition'] = [wn.synset(x).definition() for x in df_agri_clean['Synsets']]
df_agri_clean['Lemma'] = [wn.synset(x).lemmas()[0].name() for x in df_agri_clean['Synsets']]
df_agri_clean

Edit 1: Here is a link to a sample of the data.

Edit 2: Also, the static variables I use are here (all based on the standard NLTK library):

tok = TreebankWordTokenizer()
english_stops = set(stopwords.words('english'))
french_stops = set(stopwords.words('french'))

Edit 3: You can see a working version of this code here: Working Code

2018-09-18_CIP.ipynb

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.tokenize import TreebankWordTokenizer as tok

english_stops = set(stopwords.words('english'))

# Gets synsets for a given term.
def get_synset(word):
    for word in wn.synsets(word):
        return word.name()

#Gets definitions for a synset.
def get_def(syn):
    return wn.synset(syn).definition()  # your definition is misspelled

# Creates a dataframe called sector_matrix based on another dataframe's column. Should be followed with an export.
def sector_tagger(frame):
    tok_list = tok().tokenize(frame)
    split_words = [w.lower() for w in tok_list]
    clean_words = [w for w in split_words if w not in english_stops]
    synset = [get_synset(w) for w in clean_words]
    sector_matrix = pd.DataFrame({'Categories': clean_words,
                                  'Synsets': synset})
    sec_syn = list(sector_matrix['Synsets'])
    sector_matrix['Definition'] = [get_def(w) if w is not None else '' for w in sec_syn]
    return sector_matrix

agri_matrix = df['Category'].apply(sector_tagger)
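
For context on why the original get_def raised: wn.synsets(...) returns a list of Synset objects, and a list has no definition() method, whereas wn.synset(name) returns a single Synset. The sketch below reproduces that shape without NLTK so it runs anywhere. FakeSynset, FAKE_WORDNET, fake_synsets and first_definition are hypothetical names made up for illustration, not part of NLTK.

```python
class FakeSynset:
    """Hypothetical stand-in for an NLTK Synset (illustration only)."""

    def __init__(self, name, definition):
        self._name = name
        self._definition = definition

    def name(self):
        return self._name

    def definition(self):
        return self._definition


# Made-up lookup table playing the role of WordNet.
FAKE_WORDNET = {"animal": [FakeSynset("animal.n.01", "a living organism")]}


def fake_synsets(word):
    """Like wn.synsets: always returns a list, possibly empty."""
    return FAKE_WORDNET.get(word, [])


def first_definition(word):
    """Return the first sense's definition, or '' when there is none."""
    senses = fake_synsets(word)
    return senses[0].definition() if senses else ""


try:
    # Same mistake as in the question: calling a Synset method on the list.
    fake_synsets("animal").definition()
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
print(first_definition("animal"))
print(repr(first_definition("xyzzy")))
```

The empty-list guard plays the same role as the None check in sector_tagger: words with no synset get an empty string instead of raising.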

If this answers your question, please mark it as the accepted answer.

The output of get_def is a list of phrases.

Alternative approach

def sector_tagger(frame):
    mapping = [('/', ' '), ('(', ''), (')', ''), (',', '')]
    for k, v in mapping:
        frame = frame.replace(k, v)
    tok_list = tok().tokenize(frame)  # note () after tok
    split_words = [w.lower() for w in tok_list]
    clean_words = [w for w in split_words if w not in english_stops]
    synset = [get_synset(w) for w in clean_words]
    def_matrix = [get_def(w) if w is not None else '' for w in synset]
    return clean_words, synset, def_matrix


poo = df['Category'].apply(sector_tagger)

poo[0] = 
(['agricultural', 'domestic', 'animal', 'services'],
 ['agricultural.a.01', 'domestic.n.01', 'animal.n.01', 'services.n.01'],
 ['relating to or used in or promoting agriculture or farming',
  'a servant who is paid to perform menial tasks around the household',
  'a living organism characterized by voluntary movement',
  'performance of duties or provision of space and equipment helpful to others'])
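
As a side note, the loop of replace() calls at the top of sector_tagger could probably be done in one pass with str.translate. This is only a sketch of that alternative. CLEANUP_TABLE, clean_phrase and the sample string are names and data I made up, and the table simply mirrors the mapping list above.

```python
# Same replacements as the `mapping` list: '/' becomes a space,
# parentheses and commas are dropped (None means "delete").
CLEANUP_TABLE = str.maketrans({"/": " ", "(": None, ")": None, ",": None})


def clean_phrase(phrase):
    """Apply all character replacements in a single pass."""
    return phrase.translate(CLEANUP_TABLE)


# Made-up input, not taken from the real data.
sample = "Horticulture/horticultural operations (general), services"
print(clean_phrase(sample))
```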

list_clean_words = []
list_synset = []
list_def_matrix = []
for x in poo:
    list_clean_words.append(x[0])
    list_synset.append(x[1])
    list_def_matrix.append(x[2])

agri_matrix = pd.DataFrame()
agri_matrix['Categories'] = list_clean_words
agri_matrix['Synsets'] = list_synset
agri_matrix['Definition'] = list_def_matrix
agri_matrix

                                    Categories      Synsets       Definition
0   [agricultural, domestic, animal, services]  [agricultural.a.01, domestic.n.01, animal.n.01...   [relating to or used in or promoting agricultu...
1   [agricultural, food, products, processing]  [agricultural.a.01, food.n.01, merchandise.n.0...   [relating to or used in or promoting agricultu...
2   [agricultural, business, management]    [agricultural.a.01, business.n.01, management....   [relating to or used in or promoting agricultu...
3   [agricultural, mechanization]   [agricultural.a.01, mechanization.n.01] [relating to or used in or promoting agricultu...
4   [agricultural, production, operations]  [agricultural.a.01, production.n.01, operation...   [relating to or used in or promoting agricultu...
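
The three append loops that split each tuple into its own list can also be written as a transposition with zip. A minimal sketch, assuming every element has the same three-part shape as poo[0]; the rows below are made-up placeholders, not real output.

```python
# Made-up rows shaped like the tuples returned by sector_tagger:
# (clean_words, synset names, definitions).
rows = [
    (["agricultural", "services"],
     ["agricultural.a.01", "services.n.01"],
     ["def a", "def b"]),
    (["food"], ["food.n.01"], ["def c"]),
]

# zip(*rows) turns a sequence of 3-tuples into three columns.
words_col, synsets_col, defs_col = (list(col) for col in zip(*rows))

print(words_col)
print(synsets_col)
print(defs_col)
```

One caveat: the unpacking fails if rows is empty, so the explicit loops are safer when the input column might have no entries.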

Flatten each list of lists into one long list (the order is preserved)

def create_long_list_from_list_of_lists(list_of_lists):
    long_list = []
    for one_list in list_of_lists:
        for word in one_list:
            long_list.append(word)
    return long_list

long_list_clean_words = create_long_list_from_list_of_lists(list_clean_words)
long_list_synset = create_long_list_from_list_of_lists(list_synset)
long_list_def_matrix = create_long_list_from_list_of_lists(list_def_matrix)
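
If you prefer the standard library, itertools.chain.from_iterable should give the same one-level flattening as the helper above. A small sketch, where flatten and the nested sample are my own illustrative names:

```python
from itertools import chain


def flatten(list_of_lists):
    """Flatten one level of nesting, keeping the original order."""
    return list(chain.from_iterable(list_of_lists))


# Made-up sample, including an empty inner list.
nested = [["agricultural", "domestic"], [], ["animal"]]
print(flatten(nested))
```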

Convert it into a DataFrame of unique categories

agri_df = pd.DataFrame.from_dict(dict([('Categories', long_list_clean_words), ('Synsets', long_list_synset), ('Definitions', long_list_def_matrix)])).drop_duplicates().reset_index(drop=True)

agri_df.head(4)

       Categories              Synsets                         Definitions
0   ceramic               ceramic.n.01  an artifact made of hard brittle material prod...
1   horticultural   horticultural.a.01  of or relating to the cultivation of plants
2   construction     construction.n.01  the act of constructing something
3   building             building.n.01  a structure that has a roof and walls and stan...
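
To make the de-duplication step concrete: drop_duplicates() keeps the first occurrence of each full row by default. The sketch below mimics that with plain Python on three aligned lists. The sample values are invented, and this is only an illustration of the idea, not a replacement for the pandas call.

```python
# Made-up aligned lists, standing in for the three long lists.
words = ["agricultural", "food", "agricultural"]
synsets = ["agricultural.a.01", "food.n.01", "agricultural.a.01"]
defs = ["relating to farming", "any substance eaten", "relating to farming"]

# dict.fromkeys keeps the first occurrence of each key and preserves
# insertion order, so identical (word, synset, definition) rows collapse.
unique_rows = list(dict.fromkeys(zip(words, synsets, defs)))

for row in unique_rows:
    print(row)
```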

Final notes

from nltk.tokenize import TreebankWordTokenizer as tok

or:

from nltk.tokenize import word_tokenize

Use:

tok().tokenize(string_text_phrase)  # text is a string phrase, not a list of words

or:

word_tokenize(string_text_phrase)

For a simple phrase like the one below, both methods produce the same output: a list of words.

input = "Agricultural and domestic animal services"

output_of_both_methods = ['Agricultural', 'and', 'domestic', 'animal', 'services']