Getting a DataFrame from an existing DataFrame after some filtering

I have a DataFrame like this.

Aman Aggarwal      Amar Jannela   Vipin Kumar       Roshan Pati
BlackBuck          DJ CHETAS      WOW Editions      MensXP
Transport/Freight  Musician/Band  Furniture         News/Media Website
Like               Like           Like              Like
NaN                NaN            NaN               NaN   
GiveMeSport        NaN            500 Startups      No Abuse KG
News/Media Website Celina Jaitly  Internet/Software Community
Like               Actor/Director Like              Liked
NaN                Like           NaN               NaN
NaN                NaN            Jitendra Kumar    Monogatari Series
Anushka Sharma     Durjoy Datta   Actor/Director    TV Show
Actor/Director     Author         Liked             Like
Like               Like           NaN               NaN
NaN                NaN            NaN               NaN

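The NaN values are the empty rows in the original csv file. I have to extract two DataFrames from this. In the new DataFrame, each column name becomes the first element of a row, and the page_name elements of that column (for example BlackBuck) become the further elements of the same row. Like this.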

Aman Aggarwal     BlackBuck        GiveMeSport    Anushka Sharma 
Amar Jannela      DJ CHETAS        Celina Jaitly  Durjoy Datta 
Vipin Kumar       WOW Editions     500 Startups   Jitendra Kumar
Roshan Pati       MensXP           No Abuse KGP   Monogatari Series

The second DataFrame is similar, but holds the categories:

Aman Aggarwal   Transport/Freight  News/Media Website  Actor/Director
Amar Jannela       Musician/Band      Actor/Director          Author
Vipin Kumar           Furniture   Internet/Software  Actor/Director
Roshan Pati  News/Media Website           Community         TV Show

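The real problem is that the NaN values are arbitrary, and in some places there may also be 'Like' or 'Liked' entries. The only thing I can rely on is that a name (BlackBuck) and its category (Transport/Freight) always appear together. Since my code cannot recognize which value is a page_name and which is a category, I probably have to remove the NaN values and the 'Like' and 'Liked' entries from each column separately first, then align the columns accordingly and transpose. How can I do this efficiently in Python 2.7?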

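You obviously have to proceed column by column, since the names and categories are not aligned across columns. I use apply to process each column, filtering out values that are either null or in a list of strings to drop: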

# Values to drop (named to_drop so the built-in filter() is not shadowed)
to_drop = ['Like', 'Liked']

df.apply(lambda column: 
    column[~(column.isnull() | column.isin(to_drop))].reset_index(drop=True)
)

Note that this also works, but I don't fully trust it, since it relies on isin matching NaN:

import numpy as np
to_drop = [np.nan, 'Like', 'Liked']

df.apply(lambda column: column[~column.isin(to_drop)].reset_index(drop=True))

Output:

        Aman Aggarwal    Amar Jannela        Vipin Kumar         Roshan Pati
0           BlackBuck       DJ CHETAS       WOW Editions              MensXP
1   Transport/Freight   Musician/Band          Furniture  News/Media Website
2         GiveMeSport   Celina Jaitly       500 Startups         No Abuse KG
3  News/Media Website  Actor/Director  Internet/Software           Community
4      Anushka Sharma    Durjoy Datta     Jitendra Kumar   Monogatari Series
5      Actor/Director          Author     Actor/Director             TV Show
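The output above still interleaves page names and categories. Because each name is immediately followed by its category, one possible way to get the two DataFrames from the question is to take the even and odd rows and transpose each. This is only a sketch under that assumption: the two-column `df` below is a cut-down stand-in for the real data, and the names `stacked`, `page_names` and `categories` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Cut-down stand-in for the question's data (two columns only).
df = pd.DataFrame({
    'Aman Aggarwal': ['BlackBuck', 'Transport/Freight', 'Like', np.nan,
                      'GiveMeSport', 'News/Media Website', 'Like', np.nan],
    'Amar Jannela': ['DJ CHETAS', 'Musician/Band', 'Like', np.nan,
                     np.nan, 'Celina Jaitly', 'Actor/Director', 'Like'],
}, columns=['Aman Aggarwal', 'Amar Jannela'])

to_drop = ['Like', 'Liked']

# Same filtering step as in the answer: drop nulls and Like/Liked per column.
stacked = df.apply(
    lambda column: column[~(column.isnull() | column.isin(to_drop))]
    .reset_index(drop=True)
)

# Even rows hold page names, odd rows hold categories; transpose so that
# each original column name becomes a row label.
page_names = stacked.iloc[::2].T
categories = stacked.iloc[1::2].T

# Renumber the columns 0, 1, ... instead of 0, 2, ... and 1, 3, ...
page_names.columns = range(page_names.shape[1])
categories.columns = range(categories.shape[1])
```

This only holds if every column contains complete name/category pairs after filtering. A column with a missing category would shift everything after it, so it is worth checking that each column keeps an even number of values.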

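Remarks

  • If you use column.str.contains('Like') instead of isin, it is important to test column.isnull() first; otherwise the string test fails on the null values.
  • You need to reset the index; otherwise the results are aligned on the original index, which is exactly what you don't want.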
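The effect of the index reset can be seen on a tiny example. The frame and the names `kept` and `compact` below are made up for illustration, and the behaviour described is what I would expect from apply combining the per-column results on their index labels.

```python
import numpy as np
import pandas as pd

# Toy frame with nulls at different positions in each column.
df = pd.DataFrame({'a': ['x', np.nan, 'y'],
                   'b': [np.nan, 'p', 'q']}, columns=['a', 'b'])

# Without reset_index: each filtered column keeps its original labels,
# so the combined result is realigned and the gaps come back as NaN.
kept = df.apply(lambda c: c[c.notnull()])

# With reset_index: each filtered column is renumbered from 0,
# so the surviving values are packed at the top.
compact = df.apply(lambda c: c[c.notnull()].reset_index(drop=True))
```

Here `kept` should still have three rows with NaN in the gaps, while `compact` has two rows and no gaps.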