How do I replace null values with the mode of a non-numeric column?
The Continent_Name column in my DataFrame has null values, and I want to replace them with the mode of that same column.
+-----------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+
| Country_Name|Number_of_Beer_Servings|Number_of_Spirit_Servings|Number_of_Wine_servings|Pure_alcohol_Consumption_litres|Continent_Name|
+-----------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+
| Afghanistan| 0| 0| 0| 0.0| AS|
| Albania| 89| 132| 54| 4.9| EU|
| Algeria| 25| 0| 14| 0.7| AF|
| Andorra| 245| 138| 312| 12.4| EU|
| Angola| 217| 57| 45| 5.9| AF|
|Antigua & Barbuda| 102| 128| 45| 4.9| null|
| Argentina| 193| 25| 221| 8.3| SA|
| Armenia| 21| 179| 11| 3.8| EU|
| Australia| 261| 72| 212| 10.4| OC|
| Austria| 279| 75| 191| 9.7| EU|
| Azerbaijan| 21| 46| 5| 1.3| EU|
| Bahamas| 122| 176| 51| 6.3| null|
| Bahrain| 42| 63| 7| 2.0| AS|
| Bangladesh| 0| 0| 0| 0.0| AS|
| Barbados| 143| 173| 36| 6.3| null|
| Belarus| 142| 373| 42| 14.4| EU|
| Belgium| 295| 84| 212| 10.5| EU|
| Belize| 263| 114| 8| 6.8| null|
| Benin| 34| 4| 13| 1.1| AF|
| Bhutan| 23| 0| 0| 0.4| AS|
+-----------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+
only showing top 20 rows
This is what I tried:
for column in df_copy['Continent_Name']:
    df_copy['Continent_Name'].fillna(df_copy['Continent_Name'].mode()[0], inplace=True)
The error I get:
TypeError: Column is not iterable
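The pandas-style call fails because a Spark Column cannot be iterated or modified in place the way a pandas Series can; everything has to go through DataFrame transformations. If the data were small enough to collect to the driver, the original idea would work after converting with toPandas() (a sketch, assuming df_copy is the Spark DataFrame shown above):

pdf = df_copy.toPandas()  # collects all rows to the driver; only viable for small data
pdf['Continent_Name'] = pdf['Continent_Name'].fillna(pdf['Continent_Name'].mode()[0])

To stay in Spark, compute the mode with an aggregation and then fill the nulls, as shown below.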
Start by creating the DataFrame:
df = spark.createDataFrame([('Afghanistan',0,0,0,0.0,'AS'),('Albania',89,132,54,4.9,'EU'),
('Algeria',25,0,14,0.7,'AF'),('Andorra',245,138,312,12.4,'EU'),
('Angola',217,57,45,5.9,'AF'),('Antigua&Barbuda',102,128,45,4.9,None),
('Argentina',193,25,221,8.3,'SA'),('Armenia',21,179,11,3.8,'EU'),
('Australia',261,72,212,10.4,'OC'),('Austria',279,75,191,9.7,'EU'),
('Azerbaijan',21,46,5,1.3,'EU'),('Bahamas',122,176,51,6.3,None),
('Bahrain',42,63,7,2.0,'AS'),('Bangladesh',0,0,0,0.0,'AS'),
('Barbados',143,173,36,6.3,None),('Belarus',142,373,42,14.4,'EU'),
('Belgium',295,84,212,10.5,'EU'),('Belize',263,114,8,6.8,None),
('Benin',34,4,13,1.1,'AF'),('Bhutan',23,0,0,0.4,'AS')],
['Country_Name','Number_of_Beer_Servings','Number_of_Spirit_Servings',
'Number_of_Wine_servings','Pure_alcohol_Consumption_litres',
'Continent_Name'])
Since we want the mode, we need the most frequently occurring value of Continent_Name, so first drop the null rows:
from pyspark.sql.functions import col

df1 = df.where(col('Continent_Name').isNotNull())
Register our DataFrame as a temporary view and run a SQL query that groups by Continent_Name and counts how often each value appears.
df1.createOrReplaceTempView('table')
df2 = spark.sql(
    'SELECT Continent_Name, COUNT(Continent_Name) AS count FROM table GROUP BY Continent_Name ORDER BY count DESC'
)
df2.show()
+--------------+-----+
|Continent_Name|count|
+--------------+-----+
| EU| 7|
| AS| 4|
| AF| 3|
| SA| 1|
| OC| 1|
+--------------+-----+
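As a side note, the same counts can be obtained without registering a view, using only the DataFrame API (df2_api below is just an illustrative name):

from pyspark.sql.functions import count, desc

# Group by Continent_Name, count occurrences, sort by count descending.
df2_api = (df1.groupBy('Continent_Name')
               .agg(count('Continent_Name').alias('count'))
               .orderBy(desc('count')))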
Finally, take the first row of the result, which holds the mode.
mode_value = df2.first()['Continent_Name']
print(mode_value)
EU
Once we have mode_value, we can fill in the nulls with the .fillna() function.
df = df.fillna({'Continent_Name':mode_value})
df.show()
+---------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+
| Country_Name|Number_of_Beer_Servings|Number_of_Spirit_Servings|Number_of_Wine_servings|Pure_alcohol_Consumption_litres|Continent_Name|
+---------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+
| Afghanistan| 0| 0| 0| 0.0| AS|
| Albania| 89| 132| 54| 4.9| EU|
| Algeria| 25| 0| 14| 0.7| AF|
| Andorra| 245| 138| 312| 12.4| EU|
| Angola| 217| 57| 45| 5.9| AF|
|Antigua&Barbuda| 102| 128| 45| 4.9| EU|
| Argentina| 193| 25| 221| 8.3| SA|
| Armenia| 21| 179| 11| 3.8| EU|
| Australia| 261| 72| 212| 10.4| OC|
| Austria| 279| 75| 191| 9.7| EU|
| Azerbaijan| 21| 46| 5| 1.3| EU|
| Bahamas| 122| 176| 51| 6.3| EU|
| Bahrain| 42| 63| 7| 2.0| AS|
| Bangladesh| 0| 0| 0| 0.0| AS|
| Barbados| 143| 173| 36| 6.3| EU|
| Belarus| 142| 373| 42| 14.4| EU|
| Belgium| 295| 84| 212| 10.5| EU|
| Belize| 263| 114| 8| 6.8| EU|
| Benin| 34| 4| 13| 1.1| AF|
| Bhutan| 23| 0| 0| 0.4| AS|
+---------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+
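If you are on Spark 3.4 or later, the mode can, as far as I know, also be computed directly with the built-in mode aggregate function; a minimal sketch under that assumption:

from pyspark.sql.functions import col, mode

# Requires Spark >= 3.4, where pyspark.sql.functions.mode is available.
mode_value = df.where(col('Continent_Name').isNotNull()).agg(mode('Continent_Name')).first()[0]
df = df.fillna({'Continent_Name': mode_value})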