Create New Column With List Comprehension Python

Question

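I am trying to create a new column that contains city names. I have a list of the city names I want, and a CSV file in which the city names appear under several differently named columns.

What I want to do is check whether a city name from the list occurs within a specific range of columns in the CSV file, and fill the new City column with that city name.

My code is: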

 
 
import pandas as pd
import numpy as np
 
City_Name_List = ['Amsterdam', 'Antwerp', 'Brussels', 'Ghent', 'Asheville', 'Austin', 'Boston', 'Broward County', 
                  'Cambridge', 'Chicago', 'Clark County Nv', 'Columbus', 'Denver', 'Hawaii', 'Jersey City', 'Los Angeles', 
                  'Nashville', 'New Orleans', 'New York City', 'Oakland', 'Pacific Grove', 'Portland', 'Rhode Island', 'Salem Or', 'San Diego']
 
 
data = {'host_identity_verified':['t','t','t','t','t','t','t','t','t','t'],
      'neighbourhood':['Amsterdam, North Holland, Netherlands', 'Amsterdam, North Holland, Netherlands', 'NaN',
                       'Amsterdam, North Holland, Netherlands', 'Amsterdam, North Holland, Netherlands',
                        'Amsterdam, North Holland, Netherlands', 'Amsterdam, North Holland, Netherlands', 'NaN',
                        'Amsterdam, North Holland, Netherlands', 'Amsterdam, North Holland, Netherlands'],
      'neighbourhood_cleansed':['Oostelijk Havengebied - Indische Buurt', 'Centrum-Oost', 'Centrum-West', 'Centrum-West', 'Centrum-West',
                                'Oostelijk Havengebied - Indische Buurt', 'Centrum-Oost', 'Centrum-West', 'Centrum-West', 'Centrum-West'],
     'neighbourhood_group_cleansed': ['NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN'],
      'latitude':[ 52.36575, 52.36509, 52.37297, 52.38761, 52.36719, 52.36575, 52.36509, 52.37297, 52.38761, 52.36719]}
 
df = pd.DataFrame(data)
 
 
df['City']  = [x for x in City_Name_List if x in df.loc[:,'host_identity_verified':'latitude'].values][0]

When I run the code, I get this message:

Traceback (most recent call last):
  File "C:/Users/YAZAN/PycharmProjects/Yazan_Work/try.py", line 63, in <module>
    df['City'] = [x for x in City_Name_List if x in df.loc[:,'host_identity_verified':'latitude'].values][0]
IndexError: list index out of range

This happens because, in the data, the city name Amsterdam is followed by other words.

I would like my output to look like this:

0    Amsterdam
1    Amsterdam
2    Amsterdam
3    Amsterdam
4    Amsterdam
5    Amsterdam
6    Amsterdam
7    Amsterdam
8    Amsterdam
9    Amsterdam
Name: City, dtype: object

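I have tried hard to solve this. I tried endswith, startswith, and regular expressions, but to no avail. I may have been using those methods incorrectly. I hope someone can help me.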

Answer 1

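The problem is that when you write x in df.loc[...].values, you are not checking whether the city name appears inside each individual string. You are checking whether it equals an entire cell somewhere in the array, and it does not, because each cell holds a longer string such as 'Amsterdam, North Holland, Netherlands'. What you need is something like this:

df['City'] = [x[0] if x[0] in City_Name_List else '' for x in df['neighbourhood'].str.split(',')]

This splits each row of df['neighbourhood'] on commas and takes the first value, then checks whether that value is in your list of city names. If it is, it is placed in the 'City' column; otherwise an empty string is used.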
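To make the difference between whole-cell membership and substring matching concrete, here is a minimal sketch. The shortened `city_names` list and the three sample rows (including the Antwerp row) are illustrative assumptions, not the asker's data, and the `str.split` / `isin` / `where` chain is just one way to express the same first-token idea without a Python loop.

```python
import pandas as pd

# Shortened, illustrative list (assumption: not the asker's full list)
city_names = ['Amsterdam', 'Antwerp', 'Brussels']

# Illustrative rows; 'NaN' is a plain string here, as in the question's data
df = pd.DataFrame({
    'neighbourhood': ['Amsterdam, North Holland, Netherlands',
                      'NaN',
                      'Antwerp, Flanders, Belgium'],
})

# Membership on the underlying array compares whole cells, so this is False
whole_cell_match = 'Amsterdam' in df[['neighbourhood']].values

# A substring test on a single string is what the question actually needs
substring_match = 'Amsterdam' in df.loc[0, 'neighbourhood']

# Take the text before the first comma, then keep it only if it is a known city
first_token = df['neighbourhood'].str.split(',').str[0].str.strip()
df['City'] = first_token.where(first_token.isin(city_names), '')
```

Rows whose first token is not in the list end up with an empty string, which matches the behavior of the list comprehension above.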

Answer 2

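# Keep the comma-separated parts of each value that are known city names
df['City'] = df['neighbourhood'].apply(lambda x: [i.strip() for i in x.split(',') if i.strip() in City_Name_List])
# Take the first match, or an empty string when nothing matched
df['City'] = df['City'].apply(lambda x: "" if len(x) == 0 else x[0])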

Answer 3

Basic solution using `pandas.DataFrame.apply`

df['City'] = df.apply(
    # First listed city that occurs as a substring of the row's value; '' when none does
    lambda row: next((x for x in City_Name_List if x in row.loc['neighbourhood']), ''),
    axis=1
)

After this runs, df['City'] will contain a city (defined by its inclusion in City_Name_List) for each row in which one is found in the 'neighbourhood' column.
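Since the question mentions trying regular expressions, here is a hedged sketch of a vectorized alternative to the row-wise apply, using `Series.str.extract` with a pattern built from the city list. This is a swapped-in technique, not the answerer's method. The shortened `city_names` list and the sample rows are assumptions for illustration; names are sorted longest first so that a longer name wins over a shorter one it contains.

```python
import re

import pandas as pd

# Shortened, illustrative list (assumption: not the asker's full list)
city_names = ['Amsterdam', 'New York City', 'Oakland']

# Illustrative rows; 'NaN' is a plain string here, as in the question's data
df = pd.DataFrame({
    'neighbourhood': ['Amsterdam, North Holland, Netherlands',
                      'NaN',
                      'Downtown, New York City, NY'],
})

# Escape each name and try longer names first within the alternation
ordered = sorted(city_names, key=len, reverse=True)
pattern = '(' + '|'.join(re.escape(name) for name in ordered) + ')'

# expand=False returns a Series; rows with no match are missing, so fill with ''
df['City'] = df['neighbourhood'].str.extract(pattern, expand=False).fillna('')
```

Like the apply version, this matches a city anywhere in the string, so it carries the same substring caveats.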

Modified solution

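You can be a little more explicit and specify that City should be filled from the first substring that appears before the first comma in each row's 'neighbourhood' field. If the 'neighbourhood' column is reliably uniform in structure, this may be a good idea, because it helps mitigate unwanted behavior caused by similarly named cities, cities whose names are substrings of other cities in City_Name_List, and so on.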

df['City'] = df.apply(
    # Only look at the text before the first comma; '' when no listed city occurs there
    lambda row: next((x for x in City_Name_List if x in row.loc['neighbourhood'].split(',')[0]), ''),
    axis=1
)

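Note: the solutions above are only examples of how you might approach the problem you ran into. They do not properly handle exceptions, edge cases, and so on. As always, you should take care to account for these in your own code.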
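As one possible way to cover the edge cases the note mentions, the sketch below wraps the first-token check in a small helper that tolerates real missing values and returns None when nothing matches. The helper name `first_city`, the shortened `city_names` set, and the sample rows are all hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd


def first_city(value, city_names):
    """Return the text before the first comma if it is a known city, else None."""
    if not isinstance(value, str):  # covers real NaN / None cells
        return None
    token = value.split(',')[0].strip()
    return token if token in city_names else None


# Shortened, illustrative set (assumption: not the asker's full list)
city_names = {'Amsterdam', 'Antwerp', 'New York City'}

# Illustrative rows: a match, a real NaN, the string 'NaN', and a near miss
df = pd.DataFrame({
    'neighbourhood': ['Amsterdam, North Holland, Netherlands',
                      np.nan,
                      'NaN',
                      'New York, United States'],
})

df['City'] = df['neighbourhood'].apply(first_city, args=(city_names,))
```

Requiring an exact match on the first token means 'New York' does not match 'New York City'; whether that strictness is desirable depends on how uniform the real column is.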
