Extract new feature from sentence column - Python
I have two dataframes. The city_state dataframe:

       city        state
    0  huntsville  alabama
    1  montgomery  alabama
    2  birmingham  alabama
    3  mobile      alabama
    4  dothan      alabama
    5  chicago     illinois
    6  boise       idaho
    7  des moines  iowa
and the sentence dataframe:

       sentence
    0  marthy was born in dothan
    1  michelle reads some books at her home
    2  hasan is highschool student in chicago
    3  hartford of the west is the nickname of des moines
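I want to extract a new feature, a column named city, from the sentence dataframe. If a sentence contains the name of a city from the column city_state['city'], that name is extracted from the sentence into the new city column; if the sentence contains no city name, the value should be Null.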
The expected new dataframe would look like this:

       sentence                                            city
    0  marthy was born in dothan                           dothan
    1  michelle reads some books at her home               Null
    2  hasan is highschool student in chicago              chicago
    3  hartford of the west is the nickname of des moines  des moines
I ran this code:

    sentence['city'] ={}

    for city in city_state.city:
        for text in sentence.sentence:
            words = text.split()
            for word in words:
                if word == city:
                    sentence['city'].append(city)
                    break
                else:
                    sentence['city'].append(None)
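but it fails with this error:

    ValueError: Length of values does not match length of index

If you have feature-engineering experience with a similar case, could you suggest how to write correct code for the expected result?
Thanks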
Note: here is the full error log
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-205-8a9038a015ee> in <module>
----> 1 sentence['city'] ={}
2
3 for city in city_state.city:
4 for text in sentence.sentence:
5 words = text.split()
~\Anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
3117 else:
3118 # set column
-> 3119 self._set_item(key, value)
3120
3121 def _setitem_slice(self, key, value):
~\Anaconda3\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value)
3192
3193 self._ensure_valid_index(value)
-> 3194 value = self._sanitize_column(key, value)
3195 NDFrame._set_item(self, key, value)
3196
~\Anaconda3\lib\site-packages\pandas\core\frame.py in _sanitize_column(self, key, value, broadcast)
3389
3390 # turn me into an ndarray
-> 3391 value = _sanitize_index(value, self.index, copy=False)
3392 if not isinstance(value, (np.ndarray, Index)):
3393 if isinstance(value, list) and len(value) > 0:
~\Anaconda3\lib\site-packages\pandas\core\series.py in _sanitize_index(data, index, copy)
3999
4000 if len(data) != len(index):
-> 4001 raise ValueError('Length of values does not match length of ' 'index')
4002
4003 if isinstance(data, ABCIndexClass) and not copy:
ValueError: Length of values does not match length of index
Something like this should work. I would try it myself, but I'm on my phone.
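    sentence_cities = []
    # use a set of the values: `word in city_state.city` would test the index, not the names
    cities = set(city_state.city)
    for text in sentence.sentence:
        # one entry per sentence: the first word that is a known city, else None
        # (single words only, so a name with a space such as 'des moines' is not found)
        match = next((word for word in text.split() if word in cities), None)
        sentence_cities.append(match)
    sentence['city'] = sentence_cities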
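A detail worth checking when testing membership against city_state.city directly: the in operator on a pandas Series looks at the index labels, not the values. The small sketch below uses a made-up three-city Series (not the full data from the question) to show the difference.

```python
import pandas as pd

# hypothetical, shortened city column for illustration
cities = pd.Series(["huntsville", "dothan", "chicago"])

# `in` on a Series checks the index labels (0, 1, 2 here), not the values
in_series = "dothan" in cities
label_in_series = 1 in cities

# to test against the values, use the underlying array or a set
in_values = "dothan" in cities.values
in_set = "dothan" in set(cities)

print(in_series, label_in_series, in_values, in_set)
```

Converting the column to a set once also keeps each lookup cheap when there are many sentences.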
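A quick and dirty apply; it has not been tested on large dataframes, so use it with caution.
First define a function that extracts the city names:

    def ex_city(col, cities):
        # collect every city whose name occurs as a substring of the text
        output = []
        for w in cities:
            if w in col:
                output.append(w)
        # comma-join multiple hits; None when no city is found
        return ','.join(output) if output else None

Then apply it to your sentence dataframe:

    city_list = city_state.city.unique().tolist()
    sentence['city'] = sentence['sentence'].apply(lambda x: ex_city(x, city_list))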
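Because a plain substring test also fires inside longer words (for example "mobile" inside "automobiles"), a stricter variant could match whole words only. This is a sketch under that assumption, not part of the answer above; ex_city_words is a hypothetical name, and the first test sentence is made up.

```python
import re

def ex_city_words(text, cities):
    """Return the cities found in text as whole words or phrases, comma-joined, or None."""
    found = [
        c for c in cities
        # \b anchors the match at word boundaries; re.escape guards special characters
        if re.search(r"\b" + re.escape(c) + r"\b", text)
    ]
    return ",".join(found) if found else None

city_list = ["mobile", "des moines", "chicago"]

no_hit = ex_city_words("she repairs automobiles", city_list)
hit = ex_city_words("hartford of the west is the nickname of des moines", city_list)
print(no_hit, hit)
```

It can be passed to apply in the same way as ex_city.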
Let sdf = the sentence dataframe and cdf = the city_state dataframe.
des moines will be a problem when doing str.split, because the name contains a space.
First (or last, this needs testing) get that city:

    sdf.loc[sdf['sentence'].str.contains('des moines'), 'city'] = 'des moines'

Then the rest:
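    def get_city(sentence, cities):
        # return the first single word of the sentence that is a known city
        for word in sentence.split(' '):
            if word in cities:
                return word
        return None

    cities = cdf['city'].tolist()
    # fill only the rows that are still empty, so the 'des moines' rows set above are kept
    sdf['city'] = sdf['city'].fillna(sdf['sentence'].apply(lambda x: get_city(x, cities)))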
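As an alternative to special-casing des moines, one regular expression built from all the city names can handle single-word and multi-word names in one pass. The sketch below assumes the data from the question and that only the first city found in each sentence is wanted; the pattern construction is an illustration, not something taken from the answers above.

```python
import re
import pandas as pd

city_state = pd.DataFrame({
    "city": ["huntsville", "montgomery", "birmingham", "mobile",
             "dothan", "chicago", "boise", "des moines"],
    "state": ["alabama", "alabama", "alabama", "alabama",
              "alabama", "illinois", "idaho", "iowa"],
})
sentence = pd.DataFrame({"sentence": [
    "marthy was born in dothan",
    "michelle reads some books at her home",
    "hasan is highschool student in chicago",
    "hartford of the west is the nickname of des moines",
]})

# longest names first, so a longer name is tried before any shorter one
names = sorted(city_state["city"].unique(), key=len, reverse=True)
pattern = r"\b(" + "|".join(re.escape(n) for n in names) + r")\b"

# str.extract returns the first match per row and NaN where nothing matches
sentence["city"] = sentence["sentence"].str.extract(pattern, expand=False)
print(sentence)
```

Rows without a city come back as NaN rather than the string Null, which is what pandas' own missing-value methods (isna, fillna) expect.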