Getting a DataFrame from an existing DataFrame after some filtering
I have a dataframe like this:
Aman Aggarwal       Amar Jannela    Vipin Kumar        Roshan Pati
BlackBuck           DJ CHETAS       WOW Editions       MensXP
Transport/Freight   Musician/Band   Furniture          News/Media Website
Like                Like            Like               Like
NaN                 NaN             NaN                NaN
GiveMeSport         NaN             500 Startups       No Abuse KG
News/Media Website  Celina Jaitly   Internet/Software  Community
Like                Actor/Director  Like               Liked
NaN                 Like            NaN                NaN
NaN                 NaN             Jitendra Kumar     Monogatari Series
Anushka Sharma      Durjoy Datta    Actor/Director     TV Show
Actor/Director      Author          Liked              Like
Like                Like            NaN                NaN
NaN                 NaN             NaN                NaN
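The NaN values are empty rows in the original CSV file. I need to extract two dataframes from this one. In the first, each column name should become the first element of a row, and that column's page_name elements (e.g. BlackBuck) should follow as the remaining elements of the same row. Like this: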
Aman Aggarwal  BlackBuck     GiveMeSport    Anushka Sharma
Amar Jannela   DJ CHETAS     Celina Jaitly  Durjoy Datta
Vipin Kumar    WOW Editions  500 Startups   Jitendra Kumar
Roshan Pati    MensXP        No Abuse KGP   Monogatari Series
The second dataframe is built the same way, but from the categories:
Aman Aggarwal  Transport/Freight   News/Media Website  Actor/Director
Amar Jannela   Musician/Band       Actor/Director      Author
Vipin Kumar    Furniture           Internet/Software   Actor/Director
Roshan Pati    News/Media Website  Community           TV Show
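The real problem is that the NaN values are placed arbitrarily, and in some places the blank may be Like/Liked instead. The only thing I can rely on is that a name (BlackBuck) and its category (Transport/Freight) always appear together. As it stands, my code cannot tell which value is a page_name and which is a category. So I probably have to remove the NaN values and the 'Like' and 'Liked' entries from each column separately, then align the columns and transpose. How can I do this efficiently in Python 2.7?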
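You clearly have to work column by column, because the names and categories are not aligned across columns. I use apply to process each column, filtering out values that are either null or in a list of strings to drop: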
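# Values to drop in addition to nulls (named to_drop to avoid shadowing the builtin filter)
to_drop = ['Like', 'Liked']
# Keep the remaining values of each column and renumber them from 0
df.apply(lambda column:
    column[~(column.isnull() | column.isin(to_drop))].reset_index(drop=True)
)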
Note that this also works, but I don't fully trust it:
import numpy as np
# Relies on isin matching NaN against np.nan
to_drop = [np.nan, 'Like', 'Liked']
df.apply(lambda column: column[~column.isin(to_drop)].reset_index(drop=True))
Output:
   Aman Aggarwal       Amar Jannela    Vipin Kumar        Roshan Pati
0  BlackBuck           DJ CHETAS       WOW Editions       MensXP
1  Transport/Freight   Musician/Band   Furniture          News/Media Website
2  GiveMeSport         Celina Jaitly   500 Startups       No Abuse KG
3  News/Media Website  Actor/Director  Internet/Software  Community
4  Anushka Sharma      Durjoy Datta    Jitendra Kumar     Monogatari Series
5  Actor/Director      Author          Actor/Director     TV Show
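Notes
- If you filter with column.str.contains('Like') rather than isin, it is important to test column.isnull() first; otherwise the latter fails on null values.
- You need to reset the index; otherwise the results are aligned on the original index, which is exactly what you don't want.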
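The answer stops at the cleaned frame, so here is a minimal sketch of the remaining step the question asks for: splitting the cleaned frame into the two transposed dataframes. It assumes that, after filtering, every column alternates strictly between page name and category, as in the sample data. The variable names `to_drop`, `cleaned`, `names` and `categories` are only illustrative and do not come from the original post.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample data copied from the question; NaN marks the empty CSV rows.
rows = [
    ['BlackBuck', 'DJ CHETAS', 'WOW Editions', 'MensXP'],
    ['Transport/Freight', 'Musician/Band', 'Furniture', 'News/Media Website'],
    ['Like', 'Like', 'Like', 'Like'],
    [nan, nan, nan, nan],
    ['GiveMeSport', nan, '500 Startups', 'No Abuse KG'],
    ['News/Media Website', 'Celina Jaitly', 'Internet/Software', 'Community'],
    ['Like', 'Actor/Director', 'Like', 'Liked'],
    [nan, 'Like', nan, nan],
    [nan, nan, 'Jitendra Kumar', 'Monogatari Series'],
    ['Anushka Sharma', 'Durjoy Datta', 'Actor/Director', 'TV Show'],
    ['Actor/Director', 'Author', 'Liked', 'Like'],
    ['Like', 'Like', nan, nan],
    [nan, nan, nan, nan],
]
df = pd.DataFrame(
    rows,
    columns=['Aman Aggarwal', 'Amar Jannela', 'Vipin Kumar', 'Roshan Pati'],
)

# Same per-column filter as in the answer: drop nulls and Like/Liked.
to_drop = ['Like', 'Liked']
cleaned = df.apply(
    lambda column: column[~(column.isnull() | column.isin(to_drop))]
    .reset_index(drop=True)
)

# Assumption: even positions hold page names, odd positions hold categories.
names = cleaned.iloc[::2].T
categories = cleaned.iloc[1::2].T

# Renumber the columns so both frames use 0, 1, 2, ...
names.columns = range(names.shape[1])
categories.columns = range(categories.shape[1])

print(names)
print(categories)
```

After the transpose, the original column names (the people) become the row index, which matches the layout requested in the question. If a column ever contained a page name without a category, or the reverse, the even/odd split would silently misalign, so it is worth checking that each cleaned column has an even number of values.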