TypeError: unsupported operand type(s) for +: 'map' and 'list' with Pyspark
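I am working through a PySpark example in a Jupyter notebook to understand how it works, and I have run into a problem I can't find any help for.
So, here is the code after loading the SparkContext and SQLContext: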
census_data = SQLCtx.read.load('/home/john/Downloads/census.csv',
                               format="com.databricks.spark.csv",
                               header="true",
                               inferSchema="true")
#The data looks like this:
pd.DataFrame(census_data.take(3), columns = census_data.columns)
age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country income
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
Next, I tried label encoding with OneHotEncoder:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
categoricalColumns = ["workclass", "education", "marital_status", "occupation", "relationship", "race", "sex", "native_country"]
stages = []
for categoricalCol in categoricalColumns:
    # Index the category strings with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + 'Index')
    encoder = OneHotEncoder(inputCol=categoricalCol + 'Index',
                            outputCol=categoricalCol + 'classVec')
    # Add stages
    stages += [stringIndexer, encoder]
# Convert label into label indices using the StringIndexer
label_stringIdx = StringIndexer(inputCol = "income", outputCol = "label")
stages += [label_stringIdx]
All of this runs fine. When I try to use VectorAssembler, Python throws an error:
# Transform all features into a vector using VectorAssembler
numericCols = ["age", "fnlwgt", "education_num", "capital_gain", "capital_loss", "hours_per_week"]
assemblerInputs = map(lambda c: c + "classVec", categoricalColumns) + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
And the full traceback:
TypeError Traceback (most recent call last)
<ipython-input-23-16c50b42e41c> in <module>
1 # Transform all features into a vector using VectorAssembler
2 numericCols = ["age", "fnlwgt", "education_num", "capital_gain", "capital_loss", "hours_per_week"]
----> 3 assemblerInputs = map(lambda c: c + "classVec", categoricalColumns) + numericCols
4 assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
5 stages += [assembler]
TypeError: unsupported operand type(s) for +: 'map' and 'list'
So I guess I can't use a list object together with a lambda function? I hope someone knows how to deal with this. Thanks!
In Python 3, map() returns a map object (a lazy iterator), not a list, so it cannot be added to a list with +. Convert it to a list first:
assemblerInputs = list(map(lambda c: c + "classVec", categoricalColumns)) + numericCols
This should work.
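To see the behaviour in isolation, here is a minimal sketch that needs no Spark at all. It reuses the variable names from the question, but the column lists are shortened for illustration, and the helper names (mapped, is_list, raised, inputs_from_map, inputs_from_comprehension) are made up for this example. It shows that the lambda is not the problem: the error comes from adding a map object to a list.

```python
# Shortened column lists, for illustration only.
categoricalColumns = ["workclass", "education", "sex"]
numericCols = ["age", "fnlwgt"]

# In Python 3, map() returns a lazy iterator, not a list.
mapped = map(lambda c: c + "classVec", categoricalColumns)
is_list = isinstance(mapped, list)

# Adding that iterator to a list raises the TypeError from the question.
try:
    mapped + numericCols
    raised = False
except TypeError:
    raised = True

# Option 1: materialize the iterator with list() before concatenating.
inputs_from_map = list(map(lambda c: c + "classVec", categoricalColumns)) + numericCols

# Option 2: a list comprehension builds a list directly.
inputs_from_comprehension = [c + "classVec" for c in categoricalColumns] + numericCols
```

Both options produce the same list of column names, so either can be passed as inputCols to VectorAssembler. The list comprehension is arguably the more common idiom in Python 3, since it avoids the lambda entirely.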