Unmelt only part of a column from pandas dataframe
I have the following sample dataframe:
import pandas as pd

df = pd.DataFrame(data={
    'RecordID': [1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5],
    'DisplayLabel': ['Source','Test','Value 1','Value 2','Value 3','Source','Test','Value 1','Value 2','Source','Test','Value 1','Value 2','Source','Test','Value 1','Value 2','Source','Test','Value 1','Value 2'],
    'Value': ['Web','Logic','S','I','Complete','Person','Voice','>20','P','Mail','OCR','A','I','Dictation','Understandable','S','I','Web','Logic','R','S']})
which creates this dataframe:
+-------+----------+---------------+----------------+
| Index | RecordID | Display Label | Value |
+-------+----------+---------------+----------------+
| 0 | 1 | Source | Web |
| 1 | 1 | Test | Logic |
| 2 | 1 | Value 1 | S |
| 3 | 1 | Value 2 | I |
| 4 | 1 | Value 3 | Complete |
| 5 | 2 | Source | Person |
| 6 | 2 | Test | Voice |
| 7 | 2 | Value 1 | >20 |
| 8 | 2 | Value 2 | P |
| 9 | 3 | Source | Mail |
| 10 | 3 | Test | OCR |
| 11 | 3 | Value 1 | A |
| 12 | 3 | Value 2 | I |
| 13 | 4 | Source | Dictation |
| 14 | 4 | Test | Understandable |
| 15 | 4 | Value 1 | S |
| 16 | 4 | Value 2 | I |
| 17 | 5 | Source | Web |
| 18 | 5 | Test | Logic |
| 19 | 5 | Value 1 | R |
| 20 | 5 | Value 2 | S |
+-------+----------+---------------+----------------+
I am trying to "unmelt" (though not exactly) the Source and Test values into new dataframe columns, so that the result looks like this:
+-------+----------+-----------+----------------+---------------+----------+
| Index | RecordID | Source | Test | Result | Value |
+-------+----------+-----------+----------------+---------------+----------+
| 0 | 1 | Web | Logic | Value 1 | S |
| 1 | 1 | Web | Logic | Value 2 | I |
| 2 | 1 | Web | Logic | Value 3 | Complete |
| 3 | 2 | Person | Voice | Value 1 | >20 |
| 4 | 2 | Person | Voice | Value 2 | P |
| 5 | 3 | Mail | OCR | Value 1 | A |
| 6 | 3 | Mail | OCR | Value 2 | I |
| 7 | 4 | Dictation | Understandable | Value 1 | S |
| 8 | 4 | Dictation | Understandable | Value 2 | I |
| 9 | 5 | Web | Logic | Value 1 | R |
| 10 | 5 | Web | Logic | Value 2 | S |
+-------+----------+-----------+----------------+---------------+----------+
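As far as I understand, pivot and melt operate on the entire DisplayLabel column, not just on some of its values.
Any help would be greatly appreciated. I have read the pandas melt and pandas pivot documentation, as well as several references on Stack Overflow, but I can't seem to find a quick way to do this.
Thanks!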
set_index, unstack, then melt
df.set_index(['RecordID', 'DisplayLabel']).Value.unstack().reset_index() \
.melt(['RecordID', 'Source', 'Test'], var_name='Result', value_name='Value') \
.sort_values('RecordID').dropna(subset=['Value'])
    RecordID     Source            Test   Result     Value
0          1        Web           Logic  Value 1         S
5          1        Web           Logic  Value 2         I
10         1        Web           Logic  Value 3  Complete
1          2     Person           Voice  Value 1       >20
6          2     Person           Voice  Value 2         P
2          3       Mail             OCR  Value 1         A
7          3       Mail             OCR  Value 2         I
3          4  Dictation  Understandable  Value 1         S
8          4  Dictation  Understandable  Value 2         I
4          5        Web           Logic  Value 1         R
9          5        Web           Logic  Value 2         S
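To make the chain above easier to follow, here is a minimal sketch that splits it into named steps on a two-record subset of the question's data. The intermediate names (wide, long_again, result) are only illustrative, and the final sort on ['RecordID', 'Result'] plus reset_index(drop=True) is an added assumption to get a deterministic row order and a clean index.

```python
import pandas as pd

# Two records from the question's data are enough to show the idea.
df = pd.DataFrame({
    'RecordID': [1, 1, 1, 1, 1, 2, 2, 2, 2],
    'DisplayLabel': ['Source', 'Test', 'Value 1', 'Value 2', 'Value 3',
                     'Source', 'Test', 'Value 1', 'Value 2'],
    'Value': ['Web', 'Logic', 'S', 'I', 'Complete',
              'Person', 'Voice', '>20', 'P'],
})

# Step 1: long -> wide. Every distinct DisplayLabel becomes its own column;
# record 2 has no 'Value 3', so that cell is filled with NaN.
wide = df.set_index(['RecordID', 'DisplayLabel'])['Value'].unstack().reset_index()

# Step 2: wide -> long again, but keep Source and Test as identifier columns,
# so only the remaining 'Value N' columns are melted back into rows.
long_again = wide.melt(id_vars=['RecordID', 'Source', 'Test'],
                       var_name='Result', value_name='Value')

# Step 3: drop the NaN padding rows created in step 1 and tidy the order.
result = (long_again.dropna(subset=['Value'])
                    .sort_values(['RecordID', 'Result'])
                    .reset_index(drop=True))
print(result)
```

The NaN rows only exist because unstack pads every record to the same set of columns, which is why the dropna step is needed.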
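Custom function with groupby
def f(t):
    # t is a (group key, sub-dataframe) pair as yielded by iterating a groupby
    name, df = t
    d = dict(zip(df['DisplayLabel'], df['Value']))
    # pull Source and Test out; whatever is left are the Result/Value pairs
    source = d.pop('Source')
    test = d.pop('Test')
    result, value = zip(*d.items())
    return pd.DataFrame(
        dict(RecordID=name, Source=source, Test=test, Result=result, Value=value)
    )

pd.concat(map(f, df.groupby('RecordID')))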
   RecordID     Source            Test   Result     Value
0         1        Web           Logic  Value 1         S
1         1        Web           Logic  Value 2         I
2         1        Web           Logic  Value 3  Complete
0         2     Person           Voice  Value 1       >20
1         2     Person           Voice  Value 2         P
0         3       Mail             OCR  Value 1         A
1         3       Mail             OCR  Value 2         I
0         4  Dictation  Understandable  Value 1         S
1         4  Dictation  Understandable  Value 2         I
0         5        Web           Logic  Value 1         R
1         5        Web           Logic  Value 2         S
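Note that the index in the output above restarts at 0 for every RecordID, because each group produces its own small dataframe. A minimal variation on the same idea, assuming a continuous index is wanted, passes ignore_index=True to pd.concat. The helper name unmelt_group is hypothetical, and the data is a two-record subset of the question's dataframe.

```python
import pandas as pd

df = pd.DataFrame({
    'RecordID': [1, 1, 1, 1, 1, 2, 2, 2, 2],
    'DisplayLabel': ['Source', 'Test', 'Value 1', 'Value 2', 'Value 3',
                     'Source', 'Test', 'Value 1', 'Value 2'],
    'Value': ['Web', 'Logic', 'S', 'I', 'Complete',
              'Person', 'Voice', '>20', 'P'],
})


def unmelt_group(record_id, group):
    """Turn one RecordID's rows into Source/Test columns plus Result/Value rows."""
    labels = dict(zip(group['DisplayLabel'], group['Value']))
    source = labels.pop('Source')
    test = labels.pop('Test')
    # The scalars (record_id, source, test) are broadcast across the list columns.
    return pd.DataFrame({
        'RecordID': record_id,
        'Source': source,
        'Test': test,
        'Result': list(labels.keys()),
        'Value': list(labels.values()),
    })


# ignore_index=True discards the per-group 0..n indexes and renumbers the rows.
result = pd.concat(
    [unmelt_group(rid, grp) for rid, grp in df.groupby('RecordID')],
    ignore_index=True,
)
print(result)
```

This version fails with a KeyError if a record has no Source or Test row, which may or may not be the behavior you want.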
Setup
import pandas as pd

df = pd.DataFrame(data={
    'RecordID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5],
    'DisplayLabel': [
        'Source', 'Test', 'Value 1', 'Value 2', 'Value 3',
        'Source', 'Test', 'Value 1', 'Value 2',
        'Source', 'Test', 'Value 1', 'Value 2',
        'Source', 'Test', 'Value 1', 'Value 2',
        'Source', 'Test', 'Value 1', 'Value 2'
    ],
    'Value': [
        'Web', 'Logic', 'S', 'I', 'Complete',
        'Person', 'Voice', '>20', 'P',
        'Mail', 'OCR', 'A', 'I',
        'Dictation', 'Understandable', 'S', 'I',
        'Web', 'Logic', 'R', 'S'
    ]
})
We can achieve your result with boolean masking and a pivot: we split your data by checking whether DisplayLabel contains Value, pivot the rows that do not, and then join the two parts back together:
mask = df['DisplayLabel'].str.contains('Value')
df2 = df[~mask].pivot(index='RecordID', columns='DisplayLabel', values='Value')
dfpiv = (
    df[mask].rename(columns={'DisplayLabel': 'Result'})
    .set_index('RecordID')
    .join(df2)
    .reset_index()
)
    RecordID   Result     Value     Source            Test
0          1  Value 1         S        Web           Logic
1          1  Value 2         I        Web           Logic
2          1  Value 3  Complete        Web           Logic
3          2  Value 1       >20     Person           Voice
4          2  Value 2         P     Person           Voice
5          3  Value 1         A       Mail             OCR
6          3  Value 2         I       Mail             OCR
7          4  Value 1         S  Dictation  Understandable
8          4  Value 2         I  Dictation  Understandable
9          5  Value 1         R        Web           Logic
10         5  Value 2         S        Web           Logic
If you want exactly the same column order as in your example, use DataFrame.reindex:
dfpiv.reindex(columns=['RecordID', 'Source', 'Test', 'Result', 'Value'])
    RecordID     Source            Test   Result     Value
0          1        Web           Logic  Value 1         S
1          1        Web           Logic  Value 2         I
2          1        Web           Logic  Value 3  Complete
3          2     Person           Voice  Value 1       >20
4          2     Person           Voice  Value 2         P
5          3       Mail             OCR  Value 1         A
6          3       Mail             OCR  Value 2         I
7          4  Dictation  Understandable  Value 1         S
8          4  Dictation  Understandable  Value 2         I
9          5        Web           Logic  Value 1         R
10         5        Web           Logic  Value 2         S
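One caveat about the pivot step in this approach: DataFrame.pivot needs each (RecordID, DisplayLabel) pair to occur at most once and raises a ValueError otherwise. The sketch below uses made-up data with a duplicated Source row to show the failure. It then falls back to pivot_table with aggfunc='first', which is one possible way to resolve duplicates, assuming that keeping the first value is acceptable.

```python
import pandas as pd

# Hypothetical data: record 1 has two 'Source' rows.
dup = pd.DataFrame({
    'RecordID': [1, 1, 1],
    'DisplayLabel': ['Source', 'Source', 'Test'],
    'Value': ['Web', 'Mail', 'Logic'],
})

# pivot cannot decide which 'Source' value to keep and raises ValueError.
try:
    dup.pivot(index='RecordID', columns='DisplayLabel', values='Value')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates duplicates; 'first' keeps the first value per cell.
wide = dup.pivot_table(index='RecordID', columns='DisplayLabel',
                       values='Value', aggfunc='first')
print(pivot_failed)
print(wide)
```

With the question's data this does not come up, since every record has exactly one Source row and one Test row.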
Details, step by step:
# mask all rows where "Value" is in column DisplayLabel
mask = df['DisplayLabel'].str.contains('Value')
0 False
1 False
2 True
3 True
4 True
5 False
6 False
7 True
8 True
9 False
10 False
11 True
12 True
13 False
14 False
15 True
16 True
17 False
18 False
19 True
20 True
Name: DisplayLabel, dtype: bool
# select all rows which do NOT have "Value" in DisplayLabel
df[~mask]
    RecordID  DisplayLabel           Value
0          1        Source             Web
1          1          Test           Logic
5          2        Source          Person
6          2          Test           Voice
9          3        Source            Mail
10         3          Test             OCR
13         4        Source       Dictation
14         4          Test  Understandable
17         5        Source             Web
18         5          Test           Logic
# pivot the values in DisplayLabel to columns
df2 = df[~mask].pivot(index='RecordID', columns='DisplayLabel', values='Value')
DisplayLabel     Source            Test
RecordID
1                   Web           Logic
2                Person           Voice
3                  Mail             OCR
4             Dictation  Understandable
5                   Web           Logic
(
    df[mask].rename(columns={'DisplayLabel': 'Result'})  # rename the column DisplayLabel to Result
    .set_index('RecordID')  # set RecordID as index so we can join df2
    .join(df2)              # join df2 back to our dataframe on RecordID
    .reset_index()          # reset index so we get RecordID back as a column
)
    RecordID   Result     Value     Source            Test
0          1  Value 1         S        Web           Logic
1          1  Value 2         I        Web           Logic
2          1  Value 3  Complete        Web           Logic
3          2  Value 1       >20     Person           Voice
4          2  Value 2         P     Person           Voice
5          3  Value 1         A       Mail             OCR
6          3  Value 2         I       Mail             OCR
7          4  Value 1         S  Dictation  Understandable
8          4  Value 2         I  Dictation  Understandable
9          5  Value 1         R        Web           Logic
10         5  Value 2         S        Web           Logic
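The str.contains('Value') mask relies on every result label containing the text "Value". As a hedged variation on the same split-pivot-rejoin idea, the sketch below selects the rows to lift into columns by an explicit list with isin, and rejoins with merge instead of join. The id_labels list and the 'Grade' row are assumptions made up for illustration and are not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({
    'RecordID': [1, 1, 1, 1, 2, 2, 2, 2],
    # 'Grade' is a hypothetical label that does not contain the text 'Value'.
    'DisplayLabel': ['Source', 'Test', 'Value 1', 'Value 2',
                     'Source', 'Test', 'Value 1', 'Grade'],
    'Value': ['Web', 'Logic', 'S', 'I',
              'Person', 'Voice', '>20', 'B'],
})

# Assumption: these are the only labels that should become columns.
id_labels = ['Source', 'Test']
is_id = df['DisplayLabel'].isin(id_labels)

# Pivot only the Source/Test rows into one row per RecordID.
wide = (df[is_id]
        .pivot(index='RecordID', columns='DisplayLabel', values='Value')
        .rename_axis(columns=None)
        .reset_index())

# Everything else stays in long form; attach the new columns by RecordID.
rest = df[~is_id].rename(columns={'DisplayLabel': 'Result'})
result = rest.merge(wide, on='RecordID', how='left')
result = result[['RecordID', 'Source', 'Test', 'Result', 'Value']]
print(result)
```

With the contains-based mask, the 'Grade' row would have been pivoted into its own column. With the explicit list it stays a Result row.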
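I tried a different approach: first pivoting everything with unstack, then partially converting from wide back to long (sorry if it is not efficient, but it seems to produce the desired output):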
# first converting all long to wide
df2 = df.set_index(['RecordID','DisplayLabel']).unstack()
# flattening the unstacked columns
df2.columns = df2.columns.to_series().str.join('_')
df2.columns = df2.columns.str.replace('Value_','',regex=True) #just removing the junk in the column name
df2 = df2.reset_index() #resetting index to access RecordID
df2 = (pd.melt(df2,id_vars=['RecordID',"Source","Test"],var_name='Result', value_name='Value')
.sort_values(['RecordID',"Source","Test"])
.dropna()
.reset_index())
    index  RecordID     Source            Test   Result     Value
0       0         1        Web           Logic  Value 1         S
1       5         1        Web           Logic  Value 2         I
2      10         1        Web           Logic  Value 3  Complete
3       1         2     Person           Voice  Value 1       >20
4       6         2     Person           Voice  Value 2         P
5       2         3       Mail             OCR  Value 1         A
6       7         3       Mail             OCR  Value 2         I
7       3         4  Dictation  Understandable  Value 1         S
8       8         4  Dictation  Understandable  Value 2         I
9       4         5        Web           Logic  Value 1         R
10      9         5        Web           Logic  Value 2         S
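The extra index column in the output above comes from the final reset_index(), which keeps the old row labels as a column. A minimal sketch of the same approach, assuming that column is not wanted, uses reset_index(drop=True). It also flattens the column MultiIndex with droplevel instead of joining and string-replacing the names. The data is a two-record subset of the question's dataframe, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'RecordID': [1, 1, 1, 1, 1, 2, 2, 2, 2],
    'DisplayLabel': ['Source', 'Test', 'Value 1', 'Value 2', 'Value 3',
                     'Source', 'Test', 'Value 1', 'Value 2'],
    'Value': ['Web', 'Logic', 'S', 'I', 'Complete',
              'Person', 'Voice', '>20', 'P'],
})

# Long -> wide: the columns become a two-level index ('Value', <label>).
wide = df.set_index(['RecordID', 'DisplayLabel']).unstack()

# Drop the outer 'Value' level so the columns are just the labels.
wide.columns = wide.columns.droplevel(0)
wide = wide.reset_index()

# Wide -> long for everything except Source and Test.
result = (wide.melt(id_vars=['RecordID', 'Source', 'Test'],
                    var_name='Result', value_name='Value')
              .dropna(subset=['Value'])
              .sort_values(['RecordID', 'Result'])
              .reset_index(drop=True))  # drop=True discards the old row labels
print(result)
```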