Pandas dataframe filter out rows with non-English text

Question

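I have a pandas df with 6 columns, the last of which is input_text. I want to remove from the df every row whose value in that column is non-English text. I would like to use langdetect's detect function.

Some boilerplate: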

from langdetect import detect
import pandas as pd

def filter_nonenglish(df):
    new_df = None  # Do some magical operations here to create the filtered df
    return new_df

df = pd.read_csv('somecsv.csv')
df_new = filter_nonenglish(df)
print('New df is: ', df_new)

Note: it doesn't matter what the other 5 columns are. Also note that using detect is very simple:

t = 'I am very cool!'
print(detect(t))

The output is:

en

Answer 1

You can do the following on your df to get all the rows whose input_text column contains English text:

df_new = df[df.input_text.apply(detect).eq('en')]

This simply applies the langdetect.detect function to each value in the input_text column and keeps the rows whose text is detected as "en".
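One caveat with the one-liner: detect raises an exception when it is given text with no usable features (for example an empty string), and it cannot handle missing values such as NaN. The following is a minimal sketch of a more defensive variant, not a definitive implementation. The function name filter_english and its column and detector parameters are hypothetical, introduced here only for illustration; detector defaults to langdetect.detect and can be swapped for any callable that returns a language code.

```python
import pandas as pd


def filter_english(df, column="input_text", detector=None):
    """Return only the rows of df whose `column` text is detected as English.

    `detector` is a hypothetical hook: any callable mapping a string to a
    language code. When omitted, langdetect.detect is used.
    """
    if detector is None:
        # Imported lazily so the function also works with a custom detector.
        from langdetect import detect as detector

    def is_english(text):
        # Missing values (None/NaN) and blank strings are treated as non-English.
        if not isinstance(text, str) or not text.strip():
            return False
        try:
            return detector(text) == "en"
        except Exception:
            # langdetect raises LangDetectException when it finds no features;
            # caught broadly here so custom detectors are covered as well.
            return False

    mask = df[column].apply(is_english)
    return df[mask]
```

With the question's boilerplate this would be called as df_new = filter_english(df). langdetect is non-deterministic by default, so short or ambiguous texts can be classified differently between runs; setting DetectorFactory.seed = 0 (after from langdetect import DetectorFactory) before calling detect makes the results reproducible.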

python

algorithm

nlp

nltk

pandas