Detect languages in a column, but ignore ambiguous values. Why am I getting an error?
Here is a sample dataset:

ID | Details |
---|---|
1 | Here Are the Details on Facebook's Global Part... |
2 | Aktien New York Schluss: Moderate Verluste nac... |
3 | Clôture de Wall Street : Trump plombe la tend... |
4 | '' |
5 | NaN |
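
I need to add a 'Language' column that indicates which language is used in the 'Details' column, so that the result looks like this:

ID | Details | Language |
---|---|---|
1 | Here Are the Details on Facebook's Global Part... | en |
2 | Aktien New York Schluss: Moderate Verluste nac... | de |
3 | Clôture de Wall Street : Trump plombe la tend... | fr |
4 | '' | NaN |
5 | NaN | NaN |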

I tried this code:

    !pip install langdetect
    from langdetect import detect, DetectorFactory
    DetectorFactory.seed = 0
    df2 = df.dropna(subset=['Details'])
    df2['Language'] = df2['Details'].apply(detect)
It failed, and I guess that is because of rows like the one with 'ID'=4. So I tried this:

    !pip install langdetect
    from langdetect import detect, DetectorFactory
    DetectorFactory.seed = 0
    df2 = df.dropna(subset=['Details'])
    df2['Language'] = df2['Details'].apply(lambda x: detect(x) if len(x)>1 else np.NaN)
However, I still got an error:

    LangDetectException: No features in text.
You can catch the error inside the function you apply and return NaN. Note that you can pass any callable that takes one input and returns one output as the argument to .apply(); it does not have to be a lambda.

    import numpy as np
    from langdetect import detect, LangDetectException

    def detect_lang(x):
        # Mirror the lambda from the question: skip strings of zero or one character
        if len(x) <= 1:
            return np.nan
        try:
            lang = detect(x)
            if lang:
                return lang  # Return lang if lang is not empty
        except LangDetectException:
            pass  # Ignore the error and fall through to the next line, which returns NaN
        return np.nan  # If lang was empty or there was an error, we reach this line

    df2['Language'] = df2['Details'].apply(detect_lang)
I'm not sure why you have if len(x)>1 there: it only returns NaN when the original string has zero or one characters. I kept it in my detect_lang function so that it behaves the same as your lambda.
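
One plausible reason the len(x)>1 check does not prevent the error: a value such as '' (two quote characters), a run of spaces, or a string of digits is longer than one character but appears to give langdetect nothing to extract features from. The sketch below swaps the length check for a "contains at least one letter" check and also tolerates NaN, so dropna would not be needed. It is a minimal illustration, not the answer's method: has_letters, safe_detect, the detector parameter, and fake_detect are all names made up for this example, and fake_detect is only a stand-in so the snippet runs without langdetect installed.

```python
import math

def has_letters(text):
    """True only for strings that contain at least one alphabetic character."""
    return isinstance(text, str) and any(ch.isalpha() for ch in text)

def safe_detect(text, detector):
    """Return detector(text), or NaN when text is unusable or detection fails.

    `detector` is a hypothetical parameter added for this sketch; in the
    question's setting it would be langdetect's `detect`.
    """
    if not has_letters(text):
        return math.nan  # covers NaN, '', "''", whitespace-only and digits-only values
    try:
        return detector(text) or math.nan
    except Exception:
        # Broad on purpose so the sketch stays library-free;
        # with langdetect, narrowing this to LangDetectException is preferable.
        return math.nan

def fake_detect(text):
    """Stand-in detector for demonstration only (not langdetect)."""
    if len(text) < 5:
        raise ValueError("No features in text.")
    return "en"

for value in ["Here Are the Details", "''", "  ", "12345", math.nan, "ab"]:
    print(repr(value), "->", safe_detect(value, fake_detect))
```

With pandas and langdetect available, this could presumably be applied as df['Language'] = df['Details'].apply(safe_detect, detector=detect) on the original df, which would keep rows 4 and 5 with NaN as in the expected output.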