Adding new records to a dataframe for variables extracted from the same dataframe
I am trying to consolidate variables in a dataset.
I have something like this:
import pandas as pd
import numpy as np
data = np.array([[160,90,'skirt_trousers', 'tight_comfy'],[180,100,'trousers_skirt', 'long_short']])
dford = pd.DataFrame(data, columns = ['height','size','order', 'preference'])
I am trying to get it into this form:
dataForTarget = np.array([['o1',160,90,'skirt', 'tight'],['o2', 180,100,'trousers', 'long'],['o1',160,90,'trousers', 'comfy'],['o2', 180,100,'skirt', 'short']])
Targetdford = pd.DataFrame(dataForTarget, columns = ['orderID','height','size','order', 'preference'])
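One thing worth knowing about the setup above: a NumPy array holds a single dtype, so mixing numbers and strings in np.array turns the numbers into strings as well. A minimal sketch of the difference, assuming the same two rows as in the question (the names rows, cols, via_numpy and direct are only for illustration):

```python
import numpy as np
import pandas as pd

rows = [[160, 90, 'skirt_trousers', 'tight_comfy'],
        [180, 100, 'trousers_skirt', 'long_short']]
cols = ['height', 'size', 'order', 'preference']

# Going through np.array coerces every value to a string.
via_numpy = pd.DataFrame(np.array(rows), columns=cols)

# Passing the nested list directly lets pandas infer a dtype per column.
direct = pd.DataFrame(rows, columns=cols)

print(type(via_numpy.loc[0, 'height']))  # a string, not a number
print(direct.dtypes)                     # height and size are integer columns
```

If height and size need to be numeric later on, building the frame from the list directly (or converting with pd.to_numeric) avoids the surprise.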
As a first step, I extracted as much data as I could from the strings
and then cleaned them up:
# Move the part after the underscore in 'order' into a new column 'ord1'
variables = dford.columns.tolist()
variables.append('ord1')
secondord = dford.order.str.extractall(r'_(.*)')
secondord = secondord.unstack()
secondord.columns = secondord.columns.droplevel()
dford1 = dford.join(secondord)
dford1.columns = variables
dford1.order = dford1.order.str.replace(r'(_.*)', '', regex=True)
# Same again for 'preference', giving 'pref1'
variables = dford1.columns.tolist()
variables.append('pref1')
secondpref = dford.preference.str.extractall(r'_(.*)')
secondpref = secondpref.unstack()
secondpref.columns = secondpref.columns.droplevel()
dford2 = dford1.join(secondpref)
dford2.columns = variables
dford2.preference = dford2.preference.str.replace(r'(_.*)', '', regex=True)
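This gets me to here:
At this stage, I don't know how to add this new information as observations (row-wise).
The best I could come up with is the following, but it fails because the index contains
duplicate entries. Even if it didn't fail, I suspect it would
only be useful if I were trying to fill in missing values.
So I am getting nowhere.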
dford3 = dford2.rename(columns={'ord1': 'order', 'pref1': 'preference'})
dford3 = dford3.stack()
dford3 = dford3.unstack()
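Starting from a wide intermediate frame like the one described above, one possible way to turn the extra columns into rows is to cut the frame into two slices with the same column names and stack them with pd.concat. This is only a sketch: wide is a hand-written stand-in for the intended intermediate frame (first item in order/preference, second item in ord1/pref1), and the names first, second and stacked are made up for the example.

```python
import pandas as pd

# Hand-written stand-in for the wide intermediate frame (an assumption).
wide = pd.DataFrame({
    'height': [160, 180],
    'size': [90, 100],
    'order': ['skirt', 'trousers'],
    'preference': ['tight', 'long'],
    'ord1': ['trousers', 'skirt'],
    'pref1': ['comfy', 'short'],
})

# Slice 1: the first item of every order.
first = wide[['height', 'size', 'order', 'preference']]
# Slice 2: the second item, renamed so the columns line up with slice 1.
second = (wide[['height', 'size', 'ord1', 'pref1']]
          .rename(columns={'ord1': 'order', 'pref1': 'preference'}))

# Stacking keeps the original row labels (0, 1, 0, 1), which become the order ID.
stacked = pd.concat([first, second]).rename_axis('orderID').reset_index()
stacked['orderID'] = 'o' + (stacked['orderID'] + 1).astype(str)
print(stacked)
```

Because the two slices never share a column name, the duplicate-label problem that stack/unstack runs into after the rename does not arise here.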
Use Series.str.split with DataFrame.stack and concat to build a new DataFrame, then add it to the original with DataFrame.join:
df = pd.concat([dford.pop('order').str.split('_', expand=True).stack().rename('order'),
                dford.pop('preference').str.split('_', expand=True).stack().rename('preference')], axis=1)
# level_1 holds the position of each piece within its original string
dford = (dford.join(df.reset_index(level=1)).rename_axis('orderID')
              .reset_index()
              .sort_values(['level_1', 'orderID'])
              .drop(columns='level_1')
              .reset_index(drop=True)
              .assign(orderID=lambda x: 'o' + x['orderID'].add(1).astype('str')))
print(dford)
orderID height size order preference
0 o1 160 90 skirt tight
1 o2 180 100 trousers long
2 o1 160 90 trousers comfy
3 o2 180 100 skirt short
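As a possible alternative to stack plus join, pandas 1.3 and later can explode several list columns at once with DataFrame.explode. The sketch below assumes every row has the same number of underscore-separated pieces in order and preference, and it builds the input with integer height and size for readability; out and the helper column item are names invented for the example.

```python
import pandas as pd

dford = pd.DataFrame({
    'height': [160, 180],
    'size': [90, 100],
    'order': ['skirt_trousers', 'trousers_skirt'],
    'preference': ['tight_comfy', 'long_short'],
})

# Turn each string into a list, then give every list element its own row.
out = dford.assign(
    order=dford['order'].str.split('_'),
    preference=dford['preference'].str.split('_'),
).explode(['order', 'preference'])

# Position of each piece within its original row (0 = first item, 1 = second).
out['item'] = out.groupby(level=0).cumcount()

out = (out.rename_axis('orderID')
          .reset_index()
          .sort_values(['item', 'orderID'])
          .drop(columns='item')
          .reset_index(drop=True))
out['orderID'] = 'o' + (out['orderID'] + 1).astype(str)
print(out)
```

The sort on item and orderID is only there to reproduce the row order of the target; it can be dropped if the order of rows does not matter.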
Use DataFrame.apply + Series.str.split. Concatenate the resulting DataFrames with pd.concat and use Series.map to create the height and size series:
parts = dford[['order', 'preference']].apply(lambda x: x.str.split('_', expand=True), axis=1)
# keys= keeps each piece tied to the row (order) it came from; 'item' is its position in the string
df = pd.concat([p.T for p in parts], keys=parts.index).rename_axis(index=['OrderID', 'item']).reset_index()
df = df.sort_values(['item', 'OrderID']).drop(columns='item').reset_index(drop=True)
df['height'] = df['OrderID'].map(dford['height'])
df['size'] = df['OrderID'].map(dford['size'])
print(df)
OrderID order preference height size
0 0 skirt tight 160 90
1 1 trousers long 180 100
2 0 trousers comfy 160 90
3 1 skirt short 180 100
Finally, add one to the OrderID column and prepend the character o:
df['OrderID'] = 'o' + df['OrderID'].add(1).astype('str')
print(df)
OrderID order preference height size
0 o1 skirt tight 160 90
1 o2 trousers long 180 100
2 o1 trousers comfy 160 90
3 o2 skirt short 180 100
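Since the column order and the spelling of the ID column can differ between approaches, it may help to compare a result with the target from the question programmatically rather than by eye. The following is a rough sketch: matches_target is a hypothetical helper, result is a hand-written stand-in holding the target's values with OrderID-style naming, and the numeric columns are assumed to be integers in both frames.

```python
import pandas as pd

target = pd.DataFrame(
    [['o1', 160, 90, 'skirt', 'tight'],
     ['o2', 180, 100, 'trousers', 'long'],
     ['o1', 160, 90, 'trousers', 'comfy'],
     ['o2', 180, 100, 'skirt', 'short']],
    columns=['orderID', 'height', 'size', 'order', 'preference'])

# Stand-in for a reshaped frame with a different column order and ID spelling.
result = pd.DataFrame({
    'OrderID': ['o1', 'o2', 'o1', 'o2'],
    'order': ['skirt', 'trousers', 'trousers', 'skirt'],
    'preference': ['tight', 'long', 'comfy', 'short'],
    'height': [160, 180, 160, 180],
    'size': [90, 100, 90, 100],
})

def matches_target(result, target):
    """Compare two frames while ignoring column order and row order."""
    tidy = result.rename(columns={'OrderID': 'orderID'})[list(target.columns)]
    key = ['orderID', 'order']
    left = tidy.sort_values(key).reset_index(drop=True)
    right = target.sort_values(key).reset_index(drop=True)
    return left.equals(right)

print(matches_target(result, target))
```

A frame in which the second items of the two orders are swapped would fail this check, because each order/preference pair is compared together with its height and size.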