Spark MLLib 的问题导致所有事物的概率和预测都相同
Issue with Spark MLLib that causes probability and prediction to be the same for everything
我正在学习如何将机器学习与 Spark MLLib 结合使用,目的是对推文进行情感分析。我从这里得到了一个情绪分析数据集:
http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip
该数据集包含 100 万条归类为正面或负面的推文。此数据集的第二列包含情绪,第四列包含推文。
这是我当前的 PySpark 代码:
import csv
from pyspark.sql import Row
from pyspark.sql.functions import rand
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import Word2Vec
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import LogisticRegression
data = sc.textFile("/home/omar/sentiment-train.csv")
header = data.first()
rdd = data.filter(lambda row: row != header)
r = rdd.mapPartitions(lambda x : csv.reader(x))
r2 = r.map(lambda x: (x[3], int(x[1])))
parts = r2.map(lambda x: Row(sentence=x[0], label=int(x[1])))
partsDF = spark.createDataFrame(parts)
partsDF = partsDF.orderBy(rand()).limit(10000)
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(partsDF)
remover = StopWordsRemover(inputCol="words", outputCol="base_words")
base_words = remover.transform(tokenized)
train_data_raw = base_words.select("base_words", "label")
word2Vec = Word2Vec(vectorSize=100, minCount=0, inputCol="base_words", outputCol="features")
model = word2Vec.fit(train_data_raw)
final_train_data = model.transform(train_data_raw)
final_train_data = final_train_data.select("label", "features")
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(final_train_data)
lrModel.transform(final_train_data).show()
我正在 PySpark 交互式 shell 上使用此命令执行此操作:
pyspark --master yarn --deploy-mode client --conf='spark.executorEnv.PYTHONHASHSEED=223'
(仅供参考:我有一个 HDFS 集群,其中包含 10 个带有 YARN、Spark 等的虚拟机)
最后一行代码的结果是这样的:
>>> lrModel.transform(final_train_data).show()
+-----+--------------------+--------------------+--------------------+----------+
|label| features| rawPrediction| probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
| 1|[0.00885206627292...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.02994908031541...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.03443818541709...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02838905728422...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.00561632859171...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02029798456545...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.02020387646293...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.01861085715063...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.00212163510598...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.01254413221031...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.01443821341672...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02591390228879...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.00590923184063...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02487089103516...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.00999667861365...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.00416736607439...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.00715923445144...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02524911996890...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.01635813603934...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02773649083489...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 20 rows
如果我对手动创建的较小数据集执行相同操作,它就可以工作。我不知道发生了什么,整天都在处理这个问题。
有什么建议吗?
感谢您的宝贵时间!
TL;DR 十次迭代对于任何现实生活中的应用来说都是低级的。在大型和非平凡的数据集上,可能需要数千次或更多次迭代(以及调整剩余参数)才能收敛。
二项式LogisticRegressionModel
has summary
attribute, which can give you an access to a LogisticRegressionSummary
对象。在其他有用的指标中,它包含 objectiveHistory
,可用于调试训练过程:
import matplotlib.pyplot as plt
lrm = LogisticRegression(..., family="binomial").fit(df)
plt.plot(lrm.summary.objectiveHistory)
plt.show()
我正在学习如何将机器学习与 Spark MLLib 结合使用,目的是对推文进行情感分析。我从这里得到了一个情绪分析数据集: http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip
该数据集包含 100 万条归类为正面或负面的推文。此数据集的第二列包含情绪,第四列包含推文。
这是我当前的 PySpark 代码:
import csv
from pyspark.sql import Row
from pyspark.sql.functions import rand
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import Word2Vec
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import LogisticRegression
data = sc.textFile("/home/omar/sentiment-train.csv")
header = data.first()
rdd = data.filter(lambda row: row != header)
r = rdd.mapPartitions(lambda x : csv.reader(x))
r2 = r.map(lambda x: (x[3], int(x[1])))
parts = r2.map(lambda x: Row(sentence=x[0], label=int(x[1])))
partsDF = spark.createDataFrame(parts)
partsDF = partsDF.orderBy(rand()).limit(10000)
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(partsDF)
remover = StopWordsRemover(inputCol="words", outputCol="base_words")
base_words = remover.transform(tokenized)
train_data_raw = base_words.select("base_words", "label")
word2Vec = Word2Vec(vectorSize=100, minCount=0, inputCol="base_words", outputCol="features")
model = word2Vec.fit(train_data_raw)
final_train_data = model.transform(train_data_raw)
final_train_data = final_train_data.select("label", "features")
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(final_train_data)
lrModel.transform(final_train_data).show()
我正在 PySpark 交互式 shell 上使用此命令执行此操作:
pyspark --master yarn --deploy-mode client --conf='spark.executorEnv.PYTHONHASHSEED=223'
(仅供参考:我有一个 HDFS 集群,其中包含 10 个带有 YARN、Spark 等的虚拟机)
最后一行代码的结果是这样的:
>>> lrModel.transform(final_train_data).show()
+-----+--------------------+--------------------+--------------------+----------+
|label| features| rawPrediction| probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
| 1|[0.00885206627292...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.02994908031541...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.03443818541709...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02838905728422...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.00561632859171...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02029798456545...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.02020387646293...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.01861085715063...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.00212163510598...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.01254413221031...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.01443821341672...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02591390228879...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.00590923184063...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02487089103516...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.00999667861365...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.00416736607439...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.00715923445144...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02524911996890...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.01635813603934...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02773649083489...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 20 rows
如果我对手动创建的较小数据集执行相同操作,它就可以工作。我不知道发生了什么,整天都在处理这个问题。
有什么建议吗?
感谢您的宝贵时间!
TL;DR 十次迭代对于任何现实生活中的应用来说都是低级的。在大型和非平凡的数据集上,可能需要数千次或更多次迭代(以及调整剩余参数)才能收敛。
二项式LogisticRegressionModel
has summary
attribute, which can give you an access to a LogisticRegressionSummary
对象。在其他有用的指标中,它包含 objectiveHistory
,可用于调试训练过程:
import matplotlib.pyplot as plt
lrm = LogisticRegression(..., family="binomial").fit(df)
plt.plot(lrm.summary.objectiveHistory)
plt.show()