CleanTextEmptyString: No text is provided to clean. Apply on each row in a dataframe

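I am trying to apply the cleantext function clean to each row of a dataframe column. It works perfectly when I call it without apply, and I get the result I want. The problem appears here: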

import cleantext
from cleantext import clean
master_df_m['col'] = master_df_m.Presentation.apply(lambda row: clean(row))
CleanTextEmptyString: No text is provided to clean

This works without any problem:

print(clean(master_df_m.Presentation[0], clean_all=True))

Output:

oper good morn name janeka confer oper time would like welcom everyon comerica second quarter earn call line place mute prevent background nois speaker remark questionandansw session oper instruct thank would like turn call ms darlen person director investor relat may begin darlen person comerica incorpor director ir thank janeka good morn welcom comerica

What is going wrong? I also tried passing axis=1 to apply.

Assuming your dataframe does not contain any empty strings, you can try something like this:

from cleantext import clean
import pandas as pd

df = pd.DataFrame(data={'Presentation': [' This is some kind of sentence', ' This is anoTher! kind of sentence']})
df['cleaned_text'] = df.Presentation.apply(clean)

Output:

                         Presentation        cleaned_text
0       This is some kind of sentence        kind sentenc
1   This is anoTher! kind of sentence  anoth kind sentenc

If you want to overwrite your Presentation column, just assign the result to df['Presentation'] instead. Or use map:

df['Presentation'] = df['Presentation'].map(clean)

Update 1: If your dataframe does contain empty strings, which is what raises CleanTextEmptyString, try something like this:

df = pd.DataFrame(data={'Presentation': [' This is some kind of sentence', ' This is anoTher! kind of sentence', ""]})
df = df.replace('', 'NaN') 
# or df.loc[df.Presentation == '', 'Presentation'] = 'NaN'

df['Presentation'] = df['Presentation'].map(clean)

Or:

df['Presentation'] = df.loc[df.Presentation != '', 'Presentation'].map(clean)

Output:

         Presentation
0        kind sentenc
1  anoth kind sentenc
2                 NaN
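
Another option is to leave the data alone and wrap the cleaner so that empty cells are passed through untouched. The following is only a minimal sketch: `safe_clean` and `fake_clean` are hypothetical names that are not part of cleantext. `fake_clean` is a stand-in that raises on an empty string, as the error in the question shows for the real clean, so the snippet runs without the library installed. Skipping whitespace-only strings and non-string values such as None is an extra assumption on my part.

```python
def fake_clean(text):
    # Hypothetical stand-in for cleantext.clean: it only lowercases and
    # collapses whitespace, but it fails on "" like the error in the question.
    if text == "":
        raise ValueError("No text is provided to clean")
    return " ".join(text.lower().split())


def safe_clean(value, cleaner=fake_clean):
    """Clean non-empty strings; return anything else unchanged."""
    if isinstance(value, str) and value.strip():
        return cleaner(value)
    # Empty strings, whitespace-only strings and non-strings (None, NaN)
    # are returned as they are instead of being handed to the cleaner.
    return value


rows = [" This is some kind of sentence", "", None]
cleaned = [safe_clean(row) for row in rows]
print(cleaned)
```

With a real dataframe the same wrapper could be passed to map, for example df['Presentation'].map(lambda v: safe_clean(v, clean)), assuming clean has been imported from cleantext.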

Here is a simple way:

from cleantext import clean
for col in master_df_m.columns:
    master_df_m[col] = master_df_m[col].apply(lambda word: clean(word))

Using a lambda this way lets you pass additional arguments to clean() as needed. See https://pypi.org/project/cleantext/
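
If you would rather not write a lambda, functools.partial can bind the keyword arguments once, for example the clean_all=True used in the question. This is a rough sketch only: `demo_clean` is a hypothetical stand-in so the snippet runs without cleantext installed, and its behavior is not what the real clean does.

```python
from functools import partial


def demo_clean(text, clean_all=False):
    # Hypothetical stand-in for cleantext.clean, NOT its real behavior:
    # always strips outer whitespace, and lowercases only when clean_all=True.
    text = text.strip()
    return text.lower() if clean_all else text


# Bind the keyword argument once; the result is a one-argument callable.
clean_everything = partial(demo_clean, clean_all=True)

texts = [" This is anoTher! kind of sentence"]
result = [clean_everything(t) for t in texts]
print(result)
```

With the real library this would look like master_df_m[col].apply(partial(clean, clean_all=True)), assuming clean has been imported from cleantext.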