Detect languages in a column, but ignore ambiguous values. Why am I getting an error?

Question

Here is a sample dataset:

ID	Details
1	Here Are the Details on Facebook's Global Part...
2	Aktien New York Schluss: Moderate Verluste nac...
3	Clôture de Wall Street : Trump plombe la tend...
4	''
5	NaN

I need to add a 'Language' column that indicates which language is used in the 'Details' column, so that in the end it looks like this:

ID	Details	Language
1	Here Are the Details on Facebook's Global Part...	en
2	Aktien New York Schluss: Moderate Verluste nac...	de
3	Clôture de Wall Street : Trump plombe la tend...	fr
4	''	NaN
5	NaN	NaN

I tried this code:

!pip install langdetect
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0
df2=df.dropna(subset=['Details'])
df2['Language']=df2['Details'].apply(detect)

It failed, and I guess that is because of rows with values like the one at 'ID'=4. So I tried this:

!pip install langdetect
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0
df2=df.dropna(subset=['Details'])
df2['Language']=df2['Details'].apply(lambda x: detect(x) if len(x)>1 else np.NaN)

However, I still got an error:

LangDetectException: No features in text.

Answer 1

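You can catch the error inside the function you apply and return NaN. Note that you can pass any callable that takes one input and returns one output as the argument to .apply(); it doesn't have to be a lambda.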

import numpy as np
from langdetect import detect, LangDetectException

def detect_lang(x):
    # Mirror the original lambda: skip strings of zero or one character
    if len(x) <= 1:
        return np.nan
    try:
        lang = detect(x)
        if lang:
            return lang  # Return lang if lang is not empty
    except LangDetectException:
        pass  # Ignore the error and fall through to the final return
    return np.nan  # If lang was empty or there was an error, we reach this line

df2['Language'] = df2['Details'].apply(detect_lang)

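I'm not sure why you have if len(x)>1 there: it only returns NaN when the original string is zero or one character long. I included it in my detect_lang function anyway to keep the behavior consistent with your lambda.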
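
As a side note on why the length check does not prevent the error: a value such as '' (two quote characters) or 12345 is longer than one character but contains no letters, and that is most likely what triggers "No features in text". Below is a minimal sketch using only the standard library. The names looks_detectable and safe_detect are hypothetical helpers, and fake_detect is a stand-in for langdetect's detect so the snippet runs without the library installed. The "at least one letter" rule is a heuristic assumption, not documented langdetect behavior.

```python
import math

def looks_detectable(text):
    """Heuristic (an assumption, not langdetect's documented rule):
    a value is worth detecting only if it is a string with at least one letter."""
    return isinstance(text, str) and any(ch.isalpha() for ch in text)

def fake_detect(text):
    """Hypothetical stand-in for langdetect.detect; raises on letter-free input."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("No features in text.")
    return "en"

def safe_detect(text, detector=fake_detect):
    """Return the detector's result, or NaN when the value cannot be classified."""
    if not looks_detectable(text):
        return math.nan
    try:
        return detector(text)
    except Exception:  # with langdetect, catch LangDetectException instead
        return math.nan

samples = ["Here are the details", "''", "12345", "", None]

# Length alone lets "''" and "12345" through, although neither contains a letter
passes_len_check = [s for s in samples if isinstance(s, str) and len(s) > 1]

# The letter check plus try/except maps every unusable value to NaN
results = [safe_detect(s) for s in samples]
```

With the real library, the same helper could be applied as df2['Details'].apply(lambda x: safe_detect(x, detect)). Because it also handles non-string values such as NaN, the dropna step would then become optional.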

python

language-detection