How do I replace null values with the mode of a non-numeric column?
The Continent_Name column in my DataFrame has null values, and I want to replace them with the mode of that same column.
+-----------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+
| Country_Name|Number_of_Beer_Servings|Number_of_Spirit_Servings|Number_of_Wine_servings|Pure_alcohol_Consumption_litres|Continent_Name|
+-----------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+
| Afghanistan| 0| 0| 0| 0.0| AS|
| Albania| 89| 132| 54| 4.9| EU|
| Algeria| 25| 0| 14| 0.7| AF|
| Andorra| 245| 138| 312| 12.4| EU|
| Angola| 217| 57| 45| 5.9| AF|
|Antigua & Barbuda| 102| 128| 45| 4.9| null|
| Argentina| 193| 25| 221| 8.3| SA|
| Armenia| 21| 179| 11| 3.8| EU|
| Australia| 261| 72| 212| 10.4| OC|
| Austria| 279| 75| 191| 9.7| EU|
| Azerbaijan| 21| 46| 5| 1.3| EU|
| Bahamas| 122| 176| 51| 6.3| null|
| Bahrain| 42| 63| 7| 2.0| AS|
| Bangladesh| 0| 0| 0| 0.0| AS|
| Barbados| 143| 173| 36| 6.3| null|
| Belarus| 142| 373| 42| 14.4| EU|
| Belgium| 295| 84| 212| 10.5| EU|
| Belize| 263| 114| 8| 6.8| null|
| Benin| 34| 4| 13| 1.1| AF|
| Bhutan| 23| 0| 0| 0.4| AS|
+-----------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+
only showing top 20 rows
This is what I tried:
for column in df_copy['Continent_Name']:
    df_copy['Continent_Name'].fillna(df_copy['Continent_Name'].mode()[0], inplace=True)
The error I get:
TypeError: Column is not iterable
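The pandas-style call fails because a Spark Column cannot be iterated or modified in place the way a pandas Series can; everything has to go through DataFrame transformations. If the data were small enough to collect to the driver, the original idea would work after converting with toPandas() (a sketch, assuming df_copy is the Spark DataFrame shown above):

pdf = df_copy.toPandas()  # collects all rows to the driver; only viable for small data
pdf['Continent_Name'] = pdf['Continent_Name'].fillna(pdf['Continent_Name'].mode()[0])

To stay in Spark, compute the mode with an aggregation and then fill the nulls, as shown below.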
Start by creating the DataFrame:
df = spark.createDataFrame([('Afghanistan',0,0,0,0.0,'AS'),('Albania',89,132,54,4.9,'EU'),
('Algeria',25,0,14,0.7,'AF'),('Andorra',245,138,312,12.4,'EU'),
('Angola',217,57,45,5.9,'AF'),('Antigua&Barbuda',102,128,45,4.9,None),
('Argentina',193,25,221,8.3,'SA'),('Armenia',21,179,11,3.8,'EU'),
('Australia',261,72,212,10.4,'OC'),('Austria',279,75,191,9.7,'EU'),
('Azerbaijan',21,46,5,1.3,'EU'),('Bahamas',122,176,51,6.3,None),
('Bahrain',42,63,7,2.0,'AS'),('Bangladesh',0,0,0,0.0,'AS'),
('Barbados',143,173,36,6.3,None),('Belarus',142,373,42,14.4,'EU'),
('Belgium',295,84,212,10.5,'EU'),('Belize',263,114,8,6.8,None),
('Benin',34,4,13,1.1,'AF'),('Bhutan',23,0,0,0.4,'AS')],
['Country_Name','Number_of_Beer_Servings','Number_of_Spirit_Servings',
'Number_of_Wine_servings','Pure_alcohol_Consumption_litres',
'Continent_Name'])
Since we want the mode, we need the most frequently occurring value of Continent_Name, so first drop the null rows:
from pyspark.sql.functions import col

df1 = df.where(col('Continent_Name').isNotNull())
Register our DataFrame as a temporary view and run a SQL query that groups by Continent_Name and counts how often each value appears.
df1.createOrReplaceTempView('table')
df2 = spark.sql(
    'SELECT Continent_Name, COUNT(Continent_Name) AS count FROM table GROUP BY Continent_Name ORDER BY count DESC'
)
df2.show()
+--------------+-----+
|Continent_Name|count|
+--------------+-----+
| EU| 7|
| AS| 4|
| AF| 3|
| SA| 1|
| OC| 1|
+--------------+-----+
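As a side note, the same counts can be obtained without registering a view, using only the DataFrame API (df2_api below is just an illustrative name):

from pyspark.sql.functions import count, desc

# Group by Continent_Name, count occurrences, sort by count descending.
df2_api = (df1.groupBy('Continent_Name')
               .agg(count('Continent_Name').alias('count'))
               .orderBy(desc('count')))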
Finally, take the first row of the result, which holds the mode.
mode_value = df2.first()['Continent_Name']
print(mode_value)
EU
Once we have mode_value, we can fill in the nulls with the .fillna() function.
df = df.fillna({'Continent_Name':mode_value})
df.show()
+---------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+
| Country_Name|Number_of_Beer_Servings|Number_of_Spirit_Servings|Number_of_Wine_servings|Pure_alcohol_Consumption_litres|Continent_Name|
+---------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+
| Afghanistan| 0| 0| 0| 0.0| AS|
| Albania| 89| 132| 54| 4.9| EU|
| Algeria| 25| 0| 14| 0.7| AF|
| Andorra| 245| 138| 312| 12.4| EU|
| Angola| 217| 57| 45| 5.9| AF|
|Antigua&Barbuda| 102| 128| 45| 4.9| EU|
| Argentina| 193| 25| 221| 8.3| SA|
| Armenia| 21| 179| 11| 3.8| EU|
| Australia| 261| 72| 212| 10.4| OC|
| Austria| 279| 75| 191| 9.7| EU|
| Azerbaijan| 21| 46| 5| 1.3| EU|
| Bahamas| 122| 176| 51| 6.3| EU|
| Bahrain| 42| 63| 7| 2.0| AS|
| Bangladesh| 0| 0| 0| 0.0| AS|
| Barbados| 143| 173| 36| 6.3| EU|
| Belarus| 142| 373| 42| 14.4| EU|
| Belgium| 295| 84| 212| 10.5| EU|
| Belize| 263| 114| 8| 6.8| EU|
| Benin| 34| 4| 13| 1.1| AF|
| Bhutan| 23| 0| 0| 0.4| AS|
+---------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+
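If you are on Spark 3.4 or later, the mode can, as far as I know, also be computed directly with the built-in mode aggregate function; a minimal sketch under that assumption:

from pyspark.sql.functions import col, mode

# Requires Spark >= 3.4, where pyspark.sql.functions.mode is available.
mode_value = df.where(col('Continent_Name').isNotNull()).agg(mode('Continent_Name')).first()[0]
df = df.fillna({'Continent_Name': mode_value})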