graphlab: adding variable columns from an existing SFrame
I have an SFrame, for example:
a | b
-----
2 | 31 4 5
0 | 1 9
1 | 2 84
Now I want to get the following result:
a | b | c | d | e
----------------------
2 | 31 4 5 | 31 | 4 | 5
0 | 1 9 | 1 | 9 | 0
1 | 2 84 | 2 | 84 | 0
Any idea how to do this? Or do I have to use some other tool?
Thanks
Using pandas:
In [409]: sf
Out[409]:
Columns:
a int
b str
Rows: 3
Data:
+---+--------+
| a | b |
+---+--------+
| 2 | 31 4 5 |
| 0 | 1 9 |
| 1 | 2 84 |
+---+--------+
[3 rows x 2 columns]
In [410]: df = sf.to_dataframe()
In [411]: newdf = pd.DataFrame(df.b.str.split().tolist(), columns = ['c', 'd', 'e']).fillna('0')
In [412]: df.join(newdf)
Out[412]:
a b c d e
0 2 31 4 5 31 4 5
1 0 1 9 1 9 0
2 1 2 84 2 84 0
Convert back to an SFrame:
In [498]: SFrame(df.join(newdf))
Out[498]:
Columns:
a int
b str
c str
d str
e str
Rows: 3
Data:
+---+--------+----+----+---+
| a | b | c | d | e |
+---+--------+----+----+---+
| 2 | 31 4 5 | 31 | 4 | 5 |
| 0 | 1 9 | 1 | 9 | 0 |
| 1 | 2 84 | 2 | 84 | 0 |
+---+--------+----+----+---+
[3 rows x 5 columns]
If you want integers/floats, you can also do this (missing values become NaN here, which makes column e a float):
In [506]: newdf = pd.DataFrame(map(lambda x: [int(y) for y in x], df.b.str.split().tolist()), columns = ['c', 'd', 'e'])
In [507]: newdf
Out[507]:
c d e
0 31 4 5.0
1 1 9 NaN
2 2 84 NaN
In [508]: SFrame(df.join(newdf))
Out[508]:
Columns:
a int
b str
c int
d int
e float
Rows: 3
Data:
+---+--------+----+----+-----+
| a | b | c | d | e |
+---+--------+----+----+-----+
| 2 | 31 4 5 | 31 | 4 | 5.0 |
| 0 | 1 9 | 1 | 9 | nan |
| 1 | 2 84 | 2 | 84 | nan |
+---+--------+----+----+-----+
[3 rows x 5 columns]
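If the missing entries should be 0 and all new columns should stay integers, as in the desired result from the question, one option is to fill before converting. The following is a minimal sketch, assuming the data is already in a pandas DataFrame with the same sample values and that no row has more than three tokens; it uses `str.split(expand=True)` rather than the `tolist()` route shown above.

```python
import pandas as pd

# Sample data matching the question; in practice this would come from sf.to_dataframe().
df = pd.DataFrame({"a": [2, 0, 1], "b": ["31 4 5", "1 9", "2 84"]})

# expand=True spreads the tokens over columns; short rows get missing values.
parts = df["b"].str.split(expand=True)

# Fill the gaps with the string "0" first, then convert every column to int.
parts = parts.fillna("0").astype(int)

# Assumption: at most three tokens per row, so three names are enough.
parts.columns = ["c", "d", "e"]

result = df.join(parts)
print(result)
```

Filling before the conversion avoids the NaN values that force column e to float in the transcript above.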
def customsplit(string, column):
    val = string.split(' ')
    diff = column - len(val)
    val += ['0'] * diff
    return val
a = sf['b'].apply(lambda x: customsplit(x,3))
sf['c'] = [i[0] for i in a]
sf['d'] = [i[1] for i in a]
sf['e'] = [i[2] for i in a]
sf
Output:
a | b | c | d | e
----------------------
2 | 31 4 5 | 31 | 4 | 5
0 | 1 9 | 1 | 9 | 0
1 | 2 84 | 2 | 84 | 0
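The padding idea in customsplit can be checked without GraphLab. Below is a minimal plain-Python sketch, assuming whitespace-separated integer tokens; the helper name `split_padded` and its `width` and `fill` parameters are hypothetical and not part of any library. Unlike customsplit, it converts the tokens to int and pads with the integer 0.

```python
def split_padded(text, width, fill=0):
    """Split a whitespace-separated string into `width` ints, padding with `fill`."""
    values = [int(token) for token in text.split()]
    # Pad short rows; a non-positive difference adds nothing.
    values += [fill] * (width - len(values))
    # Rows longer than `width` are truncated.
    return values[:width]


rows = ["31 4 5", "1 9", "2 84"]

# Transpose the padded rows into one list per new column.
c, d, e = (list(col) for col in zip(*(split_padded(r, 3) for r in rows)))
print(c, d, e)
```

On an SFrame, a helper like this could presumably be passed to `sf['b'].apply(...)` in the same way customsplit is used above.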
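This can be done with SFrame itself, without using Pandas. Just use the 'unpack' function.
Pandas provides many functions for working with datasets, but converting an SFrame to a Pandas DataFrame and back again is inconvenient.
If you are working with more than about 10 GB of data, Pandas may not be able to handle the dataset properly (but SFrame can).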
import sframe
# your SFrame
sf = sframe.SFrame({'a': [2, 0, 1], 'b': [[31, 4, 5], [1, 9], [2, 84]]})
# just use the 'unpack()' function
sf2 = sf.unpack('b')
# change the column names
sf2 = sf2.rename({'b.0': 'c', 'b.1': 'd', 'b.2': 'e'})
# fill the missing values in 'e' with zero, keeping sf2 an SFrame
sf2['e'] = sf2['e'].fillna(0)
# merge the original SFrame and the new SFrame on 'a'
sf.join(sf2, 'a')