Extract new feature from sentence column - Python

I have two dataframes:

The city_state dataframe:

    city        state
0   huntsville  alabama
1   montgomery  alabama
2   birmingham  alabama
3   mobile      alabama
4   dothan      alabama
5   chicago     illinois
6   boise       idaho
7   des moines  iowa

and the sentence dataframe:

    sentence
0   marthy was born in dothan
1   michelle reads some books at her home
2   hasan is highschool student in chicago
3   hartford of the west is the nickname of des moines

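I want to extract a new feature, named city, from the sentence dataframe. If a sentence contains the name of a city from the column city_state['city'], that name should be extracted into the new city column; if the sentence does not contain any city name, the value should be Null.

The expected new dataframe would look like this: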

    sentence                                              city
0   marthy was born in dothan                             dothan
1   michelle reads some books at her home                 Null
2   hasan is highschool student in chicago                chicago
3   hartford of the west is the nickname of des moines    des moines

I ran this code:

sentence['city'] ={}

for city in city_state.city:
    for text in sentence.sentence:
        words = text.split()
        for word in words:
            if word == city:
                sentence['city'].append(city)
                break
    else:
        sentence['city'].append(None)

But the result of this code is:

ValueError: Length of values does not match length of index

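If you have feature-engineering experience with a similar case, could you give me some advice on how to write correct code for the expected result?

Thanks

Note: this is the full error log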

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-205-8a9038a015ee> in <module>
----> 1 sentence['city'] ={}
      2 
      3 for city in city_state.city:
      4     for text in sentence.sentence:
      5         words = text.split()

~\Anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
   3117         else:
   3118             # set column
-> 3119             self._set_item(key, value)
   3120 
   3121     def _setitem_slice(self, key, value):

~\Anaconda3\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value)
   3192 
   3193         self._ensure_valid_index(value)
-> 3194         value = self._sanitize_column(key, value)
   3195         NDFrame._set_item(self, key, value)
   3196 

~\Anaconda3\lib\site-packages\pandas\core\frame.py in _sanitize_column(self, key, value, broadcast)
   3389 
   3390             # turn me into an ndarray
-> 3391             value = _sanitize_index(value, self.index, copy=False)
   3392             if not isinstance(value, (np.ndarray, Index)):
   3393                 if isinstance(value, list) and len(value) > 0:

~\Anaconda3\lib\site-packages\pandas\core\series.py in _sanitize_index(data, index, copy)
   3999 
   4000     if len(data) != len(index):
-> 4001         raise ValueError('Length of values does not match length of ' 'index')
   4002 
   4003     if isinstance(data, ABCIndexClass) and not copy:

ValueError: Length of values does not match length of index

Something like this should work. I would try it myself, but I'm on my phone.

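sentence_cities = []
# use a set of the values: `in` on a Series checks the index, not the values
cities = set(city_state.city)

for text in sentence.sentence:
    # append exactly one value per sentence: the first word that is a city, else None
    # (splitting on whitespace means multi-word names such as 'des moines' are not found)
    sentence_cities.append(next((word for word in text.split() if word in cities), None))

sentence['city'] = sentence_cities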
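
The ValueError in the question appears because a column assignment needs exactly one value per row. As a rough sketch (not tested beyond the sample data, which is copied from the question), the loop can iterate over sentences on the outside and use for/else to append None when no city matched. The list name `found` and the space-padding trick are my own choices for illustration; padding is one simple way to match whole words while still allowing names with a space.

```python
import pandas as pd

# Sample data copied from the question.
city_state = pd.DataFrame({
    "city": ["huntsville", "montgomery", "birmingham", "mobile",
             "dothan", "chicago", "boise", "des moines"],
    "state": ["alabama", "alabama", "alabama", "alabama",
              "alabama", "illinois", "idaho", "iowa"],
})
sentence = pd.DataFrame({"sentence": [
    "marthy was born in dothan",
    "michelle reads some books at her home",
    "hasan is highschool student in chicago",
    "hartford of the west is the nickname of des moines",
]})

found = []  # hypothetical name; ends up with exactly one entry per sentence
for text in sentence["sentence"]:
    # Collapse runs of whitespace and pad both ends with a space.
    padded = " " + " ".join(text.split()) + " "
    for city in city_state["city"]:
        # Padding the city too means only whole words match,
        # and a name with a space such as "des moines" still works.
        if " " + city + " " in padded:
            found.append(city)
            break
    else:
        # The inner loop ended without a break: no city in this sentence.
        found.append(None)

sentence["city"] = found
print(sentence)
```

Because the list is built first and assigned once, its length always equals the number of rows.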

A quick and dirty apply; it has not been tested on large dataframes, so use it with caution. First define a function that extracts the city names:

def ex_city(col, cities):
    output = []
    for w in cities:
        if w in col:
            output.append(w)
    return ','.join(output) if output else None

Then apply it to your sentence dataframe:

city_list = city_state.city.unique().tolist()
sentence['city'] = sentence['sentence'].apply(lambda x: ex_city(x, city_list))
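
One thing to watch with `w in col` is that it is a substring test, so a city name can match inside a longer word. A possible variant, assuming whole-word matches are what you want, uses the standard `re` module with word boundaries. The function name `ex_city_whole_word` and the sample sentences are made up for illustration.

```python
import re

def ex_city_whole_word(text, cities):
    """Return the comma-joined cities found in text as whole words, or None."""
    output = []
    for city in cities:
        # \b is a word boundary, so "mobile" does not match inside "automobile".
        # re.escape guards against regex metacharacters in a city name.
        if re.search(r"\b" + re.escape(city) + r"\b", text):
            output.append(city)
    return ",".join(output) if output else None

cities = ["mobile", "chicago", "des moines"]
print(ex_city_whole_word("he sells automobile parts", cities))             # None
print(ex_city_whole_word("she moved from chicago to des moines", cities))  # chicago,des moines
```

It can be dropped into the same `apply` call in place of `ex_city`.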

sdf = the sentence dataframe, cdf = the city_state dataframe

des moines will be a problem when you use str.split, because its name contains a space.

First (or last, this needs testing) pick out that city:

sdf.loc[sdf['sentence'].str.contains('des moines'), 'city'] = 'des moines'

Then the rest:

def get_city(sentence, cities):
    # return the first word of the sentence that is a known city name
    for word in sentence.split(' '):
        if word in cities:
            return word
    return None

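cities = cdf['city'].tolist()
# fill only the rows that still have no city, so the 'des moines' rows set above are kept
sdf['city'] = sdf['city'].fillna(sdf['sentence'].apply(lambda x: get_city(x, cities)))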
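
If special-casing names with spaces gets awkward, another option is to skip the split entirely and let pandas search for any city name in one pass with `Series.str.extract`. This is only a sketch on the sample data from the question; the variable names `names` and `pattern` are mine, and sorting longest-first is an assumption that a longer name should win when one city name is contained in another.

```python
import re
import pandas as pd

# Sample data copied from the question.
cdf = pd.DataFrame({
    "city": ["huntsville", "montgomery", "birmingham", "mobile",
             "dothan", "chicago", "boise", "des moines"],
    "state": ["alabama", "alabama", "alabama", "alabama",
              "alabama", "illinois", "idaho", "iowa"],
})
sdf = pd.DataFrame({"sentence": [
    "marthy was born in dothan",
    "michelle reads some books at her home",
    "hasan is highschool student in chicago",
    "hartford of the west is the nickname of des moines",
]})

# Longest names first, so a longer name is tried before a shorter one it contains.
names = sorted(cdf["city"].unique(), key=len, reverse=True)

# One alternation of all escaped names, wrapped in word boundaries.
pattern = r"\b(" + "|".join(re.escape(n) for n in names) + r")\b"

# expand=False returns a Series; rows without a match are left as missing values.
sdf["city"] = sdf["sentence"].str.extract(pattern, expand=False)
print(sdf)
```

This returns only the first city found in each sentence, and rows without a match hold a missing value rather than the string Null.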