计算评论中大量名词和 verbs/adjectives 的所有共现

Question

我有一个包含大量评论的数据框，一个包含名词词的大列表 (1000) 和另一个包含 verbs/adjectives (1000) 的大列表。

示例数据框和列表：

import pandas as pd

data = {'reviews':['Very professional operation. Room is very clean and comfortable',
                    'Daniel is the most amazing host! His place is extremely clean, and he provides everything you could possibly want (comfy bed, guidebooks & maps, mini-fridge, towels, even toiletries). He is extremely friendly and helpful.',
                    'The room is very quiet, and well decorated, very clean.',
                    'He provides the room with towels, tea, coffee and a wardrobe.',
                    'Daniel is a great host. Always recomendable.',
                    'My friend and I were very satisfied with our stay in his apartment.']}

df = pd.DataFrame(data)

nouns = ['place','Amsterdam','apartment','location','host','stay','city','room','everything','time','house',
         'area','home','’','center','restaurants','centre','Great','tram','très','minutes','walk','space','neighborhood',
         'à','station','bed','experience','hosts','Thank','bien']

verbs_adj = ['was','is','great','nice','had','clean','were','recommend','stay','are','good','perfect','comfortable',
             'have','easy','be','quiet','helpful','get','beautiful',"'s",'has','est','located','un','amazing','wonderful',]

我想创建一个字典的字典来存储每个评论中名词和 verbs/adjectives 的所有共现，例如

'Very professional operation. Room is very clean and comfortable.'

{'room': {'is': 1, 'clean': 1, 'comfortable': 1}

使用以下代码：

def count_co_occurences(reviews):
    # Iterate on each review and count
    occurences_per_review = {
        f"review_{i+1}": {
            noun: dict(Counter(review.lower().split(" ")))
            for noun in nouns
            if noun in review.lower()
        }
        for i, review in enumerate(reviews)
    }
    # Remove verb_adj not found in main list
    opr = deepcopy(occurences_per_review)
    for review, occurences in opr.items():
        for noun, counts in occurences.items():
            for verb_adj in counts.keys():
                if verb_adj not in verbs_adj:
                    del occurences_per_review[review][noun][verb_adj]
                    
    return occurences_per_review

pprint(count_co_occurences(data["reviews"]))

适用于列表和评论数量较少的情况，但在大量使用此功能时我的笔记本会崩溃 lists/large 否。的评论。我该如何修改代码来处理这个问题？

Answer 1

我认为您可能需要使用几个库来让您的生活更轻松。在这个例子中，我使用的是 nltk 和集合，当然 pandas 除外：

import pandas as pd
import nltk
from collections import Counter

data = {'reviews':['Very professional operation. Room is very clean and comfortable',
                    'Daniel is the most amazing host! His place is extremely clean, and he provides everything you could possibly want (comfy bed, guidebooks & maps, mini-fridge, towels, even toiletries). He is extremely friendly and helpful.',
                    'The room is very quiet, and well decorated, very clean.',
                    'He provides the room with towels, tea, coffee and a wardrobe.',
                    'Daniel is a great host. Always recomendable.',
                    'My friend and I were very satisfied with our stay in his apartment.']}

df = pd.DataFrame(data)

nouns = ['place','Amsterdam','apartment','location','host','stay','city','room','everything','time','house',
         'area','home','’','center','restaurants','centre','Great','tram','très','minutes','walk','space','neighborhood',
         'à','station','bed','experience','hosts','Thank','bien']

verbs_adj = ['was','is','great','nice','had','clean','were','recommend','stay','are','good','perfect','comfortable',
             'have','easy','be','quiet','helpful','get','beautiful',"'s",'has','est','located','un','amazing','wonderful',]

def buildict(x):
    occurdict={}
    tokens = nltk.word_tokenize(x)
    tokenslower = list(map(str.lower, tokens)) 
    allnouns=[word for word in tokenslower if word in nouns]
    allverbs_adj=Counter(word for word in tokenslower if word in verbs_adj)
    for noun in allnouns:
        occurdict[noun]=dict(allverbs_adj)
    return occurdict

df['words']=df['reviews'].apply(lambda x: buildict(x))

输出：

0   Very professional operation. Room is very clea...   {'room': {'is': 1, 'clean': 1, 'comfortable': 1}}
1   Daniel is the most amazing host! His place is ...   {'host': {'is': 3, 'amazing': 1, 'clean': 1, '...
2   The room is very quiet, and well decorated, ve...   {'room': {'is': 1, 'quiet': 1, 'clean': 1}}
3   He provides the room with towels, tea, coffee ...   {'room': {}}
4   Daniel is a great host. Always recomendable.    {'host': {'is': 1, 'great': 1}}
5   My friend and I were very satisfied with our s...   {'stay': {'were': 1, 'stay': 1}, 'apartment': ...

计算评论中大量名词和 verbs/adjectives 的所有共现

Counting all co-occurrences of a large list of nouns and verbs/adjectives within reviews

python

nltk

pandas