IllegalArgumentException: u'requirement failed' on kmeans.fit
Using Spark from a Zeppelin notebook, I have been getting this error since yesterday.
Here is my code:
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
df = sqlContext.table("rfmdata_clust")
k = 4
# Set Kmeans input/output columns
vecAssembler = VectorAssembler(inputCols=["v1_clust", "v2_clust", "v3_clust"], outputCol="features")
featuresDf = vecAssembler.transform(df)
# Run KMeans
kmeans = KMeans().setInitMode("k-means||").setK(k)
model = kmeans.fit(featuresDf)
resultDf = model.transform(featuresDf)
# KMeans WSSSE
wssse = model.computeCost(featuresDf)
print("Within Set Sum of Squared Errors = " + str(wssse))
Here is the error:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-8890997346928959256.py", line 346, in <module>
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-8890997346928959256.py", line 334, in <module>
exec(code)
File "<stdin>", line 8, in <module>
File "/usr/lib/spark/python/pyspark/ml/base.py", line 64, in fit
return self._fit(dataset)
File "/usr/lib/spark/python/pyspark/ml/wrapper.py", line 236, in _fit
java_model = self._fit_java(dataset)
File "/usr/lib/spark/python/pyspark/ml/wrapper.py", line 233, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
IllegalArgumentException: u'requirement failed'
The line throwing the error is the kmeans.fit() call.
I inspected the rfmdata_clust dataframe, and nothing about it looks unusual.
df.printSchema()
gives:
root
|-- id: string (nullable = true)
|-- v1_clust: double (nullable = true)
|-- v2_clust: double (nullable = true)
|-- v3_clust: double (nullable = true)
featuresDf.printSchema()
gives:
root
|-- id: string (nullable = true)
|-- v1_clust: double (nullable = true)
|-- v2_clust: double (nullable = true)
|-- v3_clust: double (nullable = true)
|-- features: vector (nullable = true)
Another interesting observation: adding featuresDf = featuresDf.limit(10000) right after the definition of featuresDf makes the code run without error. Could this be related to the size of the data?
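A quick way to check whether bad values, rather than data size, are the culprit (a diagnostic sketch, assuming the column names from the schema above):
from pyspark.sql import functions as F
# Count NULL and NaN entries in each assembler input column;
# a non-zero count would explain the failing requirement check.
df.select([F.count(F.when(F.col(c).isNull() | F.isnan(c), c)).alias(c)
           for c in ["v1_clust", "v2_clust", "v3_clust"]]).show()
If the counts are non-zero only beyond the first 10000 rows, that would explain why limit(10000) avoids the error.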
Hopefully this is resolved by now; if not, try the following:
df = df.na.fill(1)
This fills every NaN value with 1 (you can of course choose any other value).
The error is caused by NaN values in your feature vector.
You may need to import the necessary packages.
This should help as well.
Let me know if it does not work.
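If filling with a constant would distort the clusters, a hedged alternative (my own suggestion, not part of the fix above) is to drop the affected rows before assembling the features:
# Alternative sketch: drop rows containing NULL/NaN in the input columns
# so the placeholder value does not pull the cluster centers around.
clean_df = df.na.drop(subset=["v1_clust", "v2_clust", "v3_clust"])
featuresDf = vecAssembler.transform(clean_df)
model = kmeans.fit(featuresDf)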