How to count word repetitions per gender and append the counts to a DataFrame
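I have a dataset containing abstracts and the gender of each author. I want to count how often each word occurs per gender, so that I can plot word frequency by gender.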
import pandas as pd

data_path = '/content/digitalhumanities - forum-and-fiction.csv'

def change_table(data_path):
    df = pd.read_csv(data_path)
    # drop the metadata columns, then use Gender as the index
    final = df.drop(["Title", "Author", "Season", "Year", "Keywords", "Issue No", "Volume"], axis=1)
    fin = final.set_index('Gender')
    return fin

change_table(data_path).T
This is the output I got:
| Gender | None | Female | Male | None | None | Male ,Female | None | Male ,Female |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| Abstract | This article describes Virginia Woolf's preocc... | The Amazonian region occupies a singular place... | This article examines Kipling's 1901 novel Kim... | Pamela; or, Virtue Rewarded uses a literary fo... | This article examines Nuruddin Farah's 1979 no... | Ecological catastrophe has challenged the cont... | British political fiction was a satirical genr... | The Lydgates have bought too much furniture an... |
How can I now get the count of each word in the abstracts per gender and append it to the DataFrame?
Expected output (example):
|gender|male|female|none|
|------|----|------|----|
| This | 3| 0| 0|
| occupies | 5| 3| 0|
| examines | 6| 0| 0|
| British | 0| 0| 7|
...
Use crosstab after splitting the stacked values with DataFrame.stack:
import pandas as pd
from nltk.corpus import stopwords  # needs the NLTK 'stopwords' corpus to be downloaded

# removed .T
df = change_table(data_path)

# reshape to one row per word
# ('Type' holds the gender label from the index, 'Gender' holds the original column name)
df = (df.stack()
        .rename_axis(('Type', 'Gender'))
        .str.split(expand=True)
        .stack()
        .reset_index(name='Word'))

# explode Type by splitting on commas (tolerating surrounding spaces, e.g. 'Male ,Female')
df = df.assign(Type=df['Type'].str.split(r'\s*,\s*')).explode('Type', ignore_index=True)

# remove stop words
stop_words = set(stopwords.words('english'))
df = df[~df['Word'].isin(stop_words)].copy()

# remove punctuation (regex=True is required in pandas >= 2.0)
df['Word'] = df['Word'].str.replace(r'[^\w\s]+', '', regex=True)

# get counts per Gender, Word and Type
df1 = pd.crosstab([df['Gender'], df['Word']], df['Type']).reset_index()

# or get counts per Word and Type
df2 = pd.crosstab(df['Word'], df['Type'])
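To try the idea without the original CSV, here is a minimal self-contained sketch on made-up data. The `toy` frame, the tiny `stop_words` set, and the `word_counts_by_gender` helper are all hypothetical stand-ins, not part of the original dataset or answer. It also lowercases the words and treats a missing gender as the category `"None"`, which are assumptions you may not want.

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male ,Female", None],
    "Abstract": [
        "This article examines empire.",
        "The region occupies a singular place.",
        "This article examines fiction.",
        "British fiction was satirical.",
    ],
})

# Small illustrative stop-word list (the answer above uses NLTK's English list).
stop_words = {"a", "the", "was"}

def word_counts_by_gender(frame):
    out = frame.copy()
    # Assumption: a missing gender becomes its own "None" category.
    out["Gender"] = out["Gender"].fillna("None")
    # One row per gender label: "Male ,Female" becomes two rows.
    out["Gender"] = out["Gender"].str.split(r"\s*,\s*")
    out = out.explode("Gender", ignore_index=True)
    # Strip punctuation, lowercase, then one row per word.
    cleaned = out["Abstract"].str.replace(r"[^\w\s]+", "", regex=True).str.lower()
    out["Word"] = cleaned.str.split()
    out = out.explode("Word", ignore_index=True)
    # Drop stop words and count each word per gender.
    out = out[~out["Word"].isin(stop_words)]
    return pd.crosstab(out["Word"], out["Gender"])

counts = word_counts_by_gender(toy)
print(counts)
```

The result has one row per word and one column per gender label, which is the shape asked for in the question. From there, something like `counts.plot.bar()` should be enough for a first chart.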