Wordcount Nonetype 错误 pyspark-
Wordcount Nonetype error pyspark-
我正在尝试进行一些文本分析:
def cleaning_text(sentence):
sentence=sentence.lower()
sentence=re.sub('\'','',sentence.strip())
sentence=re.sub('^\d+\/\d+|\s\d+\/\d+|\d+\-\d+\-\d+|\d+\-\w+\-\d+\s\d+\:\d+|\d+\-\w+\-\d+|\d+\/\d+\/\d+\s\d+\:\d+',' ',sentence.strip())# dates removed
sentence=re.sub(r'(.)(\/)(.)',r'',sentence.strip())
sentence=re.sub("(.*?\//)|(.*?\\)|(.*?\\)|(.*?\/)",' ',sentence.strip())
sentence=re.sub('^\d+','',sentence.strip())
sentence = re.sub('[%s]' % re.escape(string.punctuation),'',sentence.strip())
cleaned=' '.join([w for w in sentence.split() if not len(w)<2 and w not in ('no', 'sc','ln') ])
cleaned=cleaned.strip()
if(len(cleaned)<=1):
return "NA"
else:
return cleaned
org_val=udf(cleaning_text,StringType())
df_new =df.withColumn("cleaned_short_desc", org_val(df["symptom_short_description_"]))
df_new =df_new.withColumn("cleaned_long_desc", org_val(df_new["long_description"]))
longWordsDF = (df_new.select(explode(split('cleaned_long_desc',' ')).alias('word'))
longWordsDF.count()
我收到以下错误。
File "<stdin>", line 2, in cleaning_text
AttributeError: 'NoneType' object has no attribute 'lower'
我想执行字数统计,但任何一种聚合函数都会给我一个错误。
我尝试了以下操作:
sentence=sentence.encode("ascii", "ignore")
在 cleaning_text 函数中添加了这条语句
df.dropna()
还是一样的问题,不知如何解决。
您似乎在某些列中有空值。在cleaning_text
函数的开头加一个if,错误就会消失:
if sentence is None:
return "NA"
我正在尝试进行一些文本分析:
def cleaning_text(sentence):
sentence=sentence.lower()
sentence=re.sub('\'','',sentence.strip())
sentence=re.sub('^\d+\/\d+|\s\d+\/\d+|\d+\-\d+\-\d+|\d+\-\w+\-\d+\s\d+\:\d+|\d+\-\w+\-\d+|\d+\/\d+\/\d+\s\d+\:\d+',' ',sentence.strip())# dates removed
sentence=re.sub(r'(.)(\/)(.)',r'',sentence.strip())
sentence=re.sub("(.*?\//)|(.*?\\)|(.*?\\)|(.*?\/)",' ',sentence.strip())
sentence=re.sub('^\d+','',sentence.strip())
sentence = re.sub('[%s]' % re.escape(string.punctuation),'',sentence.strip())
cleaned=' '.join([w for w in sentence.split() if not len(w)<2 and w not in ('no', 'sc','ln') ])
cleaned=cleaned.strip()
if(len(cleaned)<=1):
return "NA"
else:
return cleaned
org_val=udf(cleaning_text,StringType())
df_new =df.withColumn("cleaned_short_desc", org_val(df["symptom_short_description_"]))
df_new =df_new.withColumn("cleaned_long_desc", org_val(df_new["long_description"]))
longWordsDF = (df_new.select(explode(split('cleaned_long_desc',' ')).alias('word'))
longWordsDF.count()
我收到以下错误。
File "<stdin>", line 2, in cleaning_text
AttributeError: 'NoneType' object has no attribute 'lower'
我想执行字数统计,但任何一种聚合函数都会给我一个错误。
我尝试了以下操作:
sentence=sentence.encode("ascii", "ignore")
在 cleaning_text 函数中添加了这条语句
df.dropna()
还是一样的问题,不知如何解决。
您似乎在某些列中有空值。在cleaning_text
函数的开头加一个if,错误就会消失:
if sentence is None:
return "NA"