Using pandas to count sentences and words inside a CSV
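I am trying to write a Python program that goes through a CSV file selected by the user and prints the total number of sentences, split on periods or newlines, along with the total number of words across all sentences.
Insert file
Total number of sentences: 3
Total number of words: 15
Total number of unique words: 12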
import pandas as pd

data = pd.read_csv('dundun.csv', sep='\t')
words = data['sentences'].str.split(expand=True)
word_count = {}
for word in words:
    count = word_count.get(word, 0)
    count += 1
    word_count[word] = count
print(word_count)
I am trying this code, but it gives me the wrong word count.
My CSV looks like:
Try using:
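import string
# Total words: split each row on whitespace and sum the token counts
nwords = data['sentences'].str.split().map(len).sum()
# Sentences: number of non-null rows in the column (one sentence per row)
nsentences = data['sentences'].count()
# Unique words: strip punctuation from every token, then deduplicate
strip = str.maketrans('', '', string.punctuation)
nunique_words = len({w.translate(strip) for ws in data['sentences'].str.split() for w in ws})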
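To make the difference from the loop in the question concrete, here is a minimal self-contained sketch. The sample rows are made up (the real CSV is not shown), and lowercasing the tokens is my own assumption about what should count as the same word. It illustrates that `str.split(expand=True)` returns a DataFrame, so iterating over it yields column labels rather than words, and it uses `collections.Counter` for the per-word tally.

```python
import string
from collections import Counter

import pandas as pd

# Hypothetical sample data standing in for the 'sentences' column of the CSV.
data = pd.DataFrame(
    {"sentences": ["The cat sat.", "The dog ran!", "A bird flew away."]}
)

# With expand=True, str.split returns a DataFrame. Iterating a DataFrame
# yields its column labels (0, 1, 2, ...), not the words in the cells.
expanded = data["sentences"].str.split(expand=True)
labels = list(expanded)

# Without expand=True, each row holds a list of tokens that can be flattened.
strip_punct = str.maketrans("", "", string.punctuation)
tokens = [
    word.translate(strip_punct).lower()  # lowercasing is an assumption
    for words in data["sentences"].str.split()
    for word in words
]
word_count = Counter(tokens)

nsentences = int(data["sentences"].count())
nwords = len(tokens)
nunique_words = len(word_count)

print(labels)
print(nsentences, nwords, nunique_words)
print(word_count.most_common(3))
```

Note that this treats each row as one sentence, the same as the snippet above; it does not split a row that contains several sentences.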
For a DataFrame df, count the sentences:
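from nltk.tokenize import sent_tokenize
# Split each review into a list of sentences, then keep only the list length
df['review_sentence_count'] = df['reviews'].apply(sent_tokenize).tolist()
df['review_sentence_count'] = df['review_sentence_count'].apply(len)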
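`sent_tokenize` needs NLTK and its Punkt tokenizer data to be installed. If the rule really is just "period or newline", as in the question, a plain regular expression may be enough. A rough sketch under that assumption; `count_sentences` is a hypothetical helper name and the sample reviews are invented:

```python
import re

import pandas as pd


def count_sentences(text: str) -> int:
    """Count non-empty chunks after splitting on periods or newlines."""
    parts = re.split(r"[.\n]+", text)
    return sum(1 for part in parts if part.strip())


# Hypothetical sample data; the column name follows the snippet above.
df = pd.DataFrame(
    {"reviews": ["Great phone. Battery lasts long.\nWould buy again", "Bad."]}
)
df["review_sentence_count"] = df["reviews"].apply(count_sentences)
total_sentences = int(df["review_sentence_count"].sum())

print(df["review_sentence_count"].tolist())
print(total_sentences)
```

This is cruder than a real tokenizer: abbreviations such as "e.g." or decimal numbers would be counted as extra sentence breaks.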
Count the words after removing punctuation:
import string
from nltk.tokenize import word_tokenize
# Strip punctuation in place (this overwrites the 'reviews' column)
df['reviews'] = df['reviews'].str.translate(str.maketrans('', '', string.punctuation))
# Tokenize each review and store the number of tokens
df['review_word_count'] = df['reviews'].apply(word_tokenize).apply(len)
Write the new data, including the new columns, to a CSV:
df.to_csv('./data/dataset.csv')
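Two details that may matter when writing the file: `to_csv` does not create missing directories, and by default it also writes the row index as an extra first column. A small sketch of one way to handle both; the temporary directory and sample frame are placeholders, not part of the original answer:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical frame with the columns produced by the steps above.
df = pd.DataFrame({"reviews": ["Good", "Bad"], "review_word_count": [1, 1]})

# A temporary directory is used here only so the example touches no real files.
out_path = Path(tempfile.mkdtemp()) / "data" / "dataset.csv"
out_path.parent.mkdir(parents=True, exist_ok=True)  # create ./data if missing

df.to_csv(out_path, index=False)  # index=False omits the row-index column

header = out_path.read_text().splitlines()[0]
print(header)
```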