Detect languages in a column, but ignore ambiguous values. Why am I getting an error?

Here is a sample dataset:

ID Details
1 Here Are the Details on Facebook's Global Part...
2 Aktien New York Schluss: Moderate Verluste nac...
3 Clôture de Wall Street : Trump plombe la tend...
4 ''
5 NaN

I need to add a 'Language' column indicating which language is used in the 'Details' column, so that the result looks like this:

ID Details Language
1 Here Are the Details on Facebook's Global Part... en
2 Aktien New York Schluss: Moderate Verluste nac... de
3 Clôture de Wall Street : Trump plombe la tend... fr
4 '' NaN
5 NaN NaN

I tried this code:

!pip install langdetect
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0
df2=df.dropna(subset=['Details'])
df2['Language']=df2['Details'].apply(detect)

It failed, and I guess that is because of rows like the one with 'ID'=4. So I tried this:

!pip install langdetect
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0
df2=df.dropna(subset=['Details'])
df2['Language']=df2['Details'].apply(lambda x: detect(x) if len(x)>1 else np.NaN)

However, I still got an error:

LangDetectException: No features in text.

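You can catch the error inside the function you apply and return NaN. Note that you can pass any callable that takes one input and returns one output as the argument to .apply(); it doesn't have to be a lambda.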

import numpy as np
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def detect_lang(x):
    # Match the original lambda: skip strings of zero or one character
    if len(x) <= 1:
        return np.nan
    try:
        lang = detect(x)
        if lang:
            return lang  # Return lang if it is not empty
    except LangDetectException:
        pass  # Ignore the error and fall through to the final return
    return np.nan  # Reached if lang was empty or detection failed

df2['Language']=df2['Details'].apply(detect_lang)

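I'm not sure why you have the if len(x)>1 check there: it only returns NaN when the original string is zero or one character long. I included it in my detect_lang function anyway to keep the behavior consistent with your lambda.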
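A likely reason the length check alone does not help: a value such as '' (two quote characters), or a string made only of digits or whitespace, is longer than one character but probably gives the detector no letters to work with. The sketch below is one possible alternative, not the only way to do it. It checks for at least one alphabetic character instead of checking the length, and it also tolerates NaN, so dropna would not be needed. The names has_letters and make_safe_detector are hypothetical helpers introduced for illustration; the detector is passed in as a parameter, so the wrapper itself does not depend on langdetect.

```python
import math


def has_letters(text):
    """Return True if text is a string with at least one alphabetic character."""
    return isinstance(text, str) and any(ch.isalpha() for ch in text)


def make_safe_detector(detector, errors=(Exception,)):
    """Wrap `detector` so that undetectable input yields NaN instead of raising.

    `detector` is any callable taking a string and returning a language code.
    `errors` is the tuple of exception types to swallow.
    """
    def safe_detect(value):
        # NaN, None, empty strings and strings without letters are skipped
        if not has_letters(value):
            return math.nan
        try:
            result = detector(value)
        except errors:
            return math.nan
        # Treat an empty result the same way as a failed detection
        return result if result else math.nan

    return safe_detect


# Hypothetical usage with langdetect and pandas (not run here):
# from langdetect import detect
# from langdetect.lang_detect_exception import LangDetectException
# safe = make_safe_detector(detect, (LangDetectException,))
# df['Language'] = df['Details'].apply(safe)
```

Keeping the try/except in addition to the letter check is deliberate: the check filters out the obvious cases cheaply, while the exception handler covers any text the detector still rejects.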