How to drop rows from a dataframe based on a condition in Python?
I have a CSV file with data laid out like this:
|Variable |Time |Value|
|A1 |Jan | 33 |
| |Feb | 21 |
| |Mar | 08 |
| |Apr | 17 |
| |May | 04 |
| |Jun | 43 |
| |Jul | 40 |
| |Aug | 37 |
| |Sep | 30 |
| |Oct | 46 |
| |Nov | 10 |
| |Dec | 13 |
| B1 |Jan | 20 |
| |Feb | 11 |
| |Mar | 02 |
| |Apr | 18 |
| |May | 10 |
| |Jun | 35 |
| |Jul | 45 |
| |Aug | 32 |
| |Sep | 39 |
| |Oct | 42 |
| |Nov | 15 |
| |Dec | 18 |
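This continues in the same way up to A10 and B10.
I only need the A variables, with their times (Jan to Dec) and corresponding values, and I want to drop the rows for B. How can I do this, and what should the condition be?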
Two different approaches:
If the column widths are fixed:
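import numpy as np
import pandas as pd
df = pd.read_fwf('file.csv', colspecs=[(1,9), (11,16), (17, 22)])
# Forward-fill the blank Variable cells, then keep rows whose label starts with 'A'
df = df[df.replace('', np.nan).ffill()['Variable'].str.startswith('A')]
print(df)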
Output:
Variable Time Value
0 A1 Jan 33
1 Feb 21
2 Mar 8
3 Apr 17
4 May 4
5 Jun 43
6 Jul 40
7 Aug 37
8 Sep 30
9 Oct 46
10 Nov 10
11 Dec 13
If the data is messier:
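import pandas as pd

with open('file.csv', 'r') as f:
    # Split each non-empty line on '|' and keep the three data fields, stripped
    df = pd.DataFrame([[y.strip() for y in x.split('|')[1:4]] for x in f.readlines() if x.strip()])
# Promote the first parsed row to the column header
df.columns = df.iloc[0].values
df = df.drop(0).reset_index(drop=True)
df['Value'] = pd.to_numeric(df['Value'])
print(df)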
Output:
Variable Time Value
0 A1 Jan 33
1 Feb 21
2 Mar 8
3 Apr 17
4 May 4
5 Jun 43
6 Jul 40
7 Aug 37
8 Sep 30
9 Oct 46
10 Nov 10
11 Dec 13
12 B1 Jan 20
13 Feb 11
14 Mar 2
15 Apr 18
16 May 10
17 Jun 35
18 Jul 45
19 Aug 32
20 Sep 39
21 Oct 42
22 Nov 15
23 Dec 18
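The parsed frame above still contains the B rows, so a filter step is needed. A minimal sketch of that step, assuming the blank Variable cells come through as empty strings (as they would after strip()); the small inline frame is a hypothetical stand-in for the parsed file, not the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the parsed frame: the Variable label
# appears only on the first row of each block.
df = pd.DataFrame({
    "Variable": ["A1", "", "", "B1", "", "", "A2", ""],
    "Time": ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar", "Jan", "Feb"],
    "Value": [33, 21, 8, 20, 11, 2, 5, 7],
})

# Treat empty strings as missing, then carry each label down its block.
labels = df["Variable"].replace("", np.nan).ffill()

# Keep rows whose filled label starts with "A"; the B rows are dropped.
a_only = df[labels.str.startswith("A")].reset_index(drop=True)
print(a_only)
```

The forward-filled labels are used only to build the mask, so the kept rows retain their original blank Variable cells.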
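Assuming your data is laid out the way you describe, and extrapolating from there:
Use pandas' ffill() to fill in the Variable column, which makes the desired selection easier, as shown below.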
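import numpy as np
import pandas as pd

sample = pd.read_csv('sample.csv')
# Assign the result back; ffill(inplace=True) on a selected column may not update the frame
sample['Variable'] = sample['Variable'].ffill()
# Keep the 'A' rows and renumber them so the positions below match the index labels
sample = sample.loc[sample['Variable'].str.startswith('A')].reset_index(drop=True)
n_months = 12
# Only the first row of each 12-month block keeps its label; the rest are blanked
indexes_to_impute_as_empty = list(range(0,len(sample),n_months))
sample.loc[indexes_to_impute_as_empty,'temp_Variable'] = sample.loc[indexes_to_impute_as_empty,'Variable']
sample['Variable'] = sample['temp_Variable']
sample.drop(columns=['temp_Variable'],inplace=True)
sample.replace(np.nan,"",inplace=True)
sample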
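If the blocks are not guaranteed to have exactly 12 rows each, the labels can be blanked by comparing each row with the previous one instead of stepping through fixed positions. This is a sketch of that variation, not the answer's own method; the inline frame is made-up illustration data with blank labels already read as NaN:

```python
import numpy as np
import pandas as pd

# Made-up illustration data; blocks deliberately have different lengths.
sample = pd.DataFrame({
    "Variable": ["A1", np.nan, np.nan, "B1", np.nan, "A2", np.nan],
    "Time": ["Jan", "Feb", "Mar", "Jan", "Feb", "Jan", "Feb"],
    "Value": [33, 21, 8, 20, 11, 5, 7],
})

# Carry each label down its block and keep only the "A" rows.
filled = sample["Variable"].ffill()
result = sample[filled.str.startswith("A")].copy()

# Show a label only where it differs from the previous kept row.
kept = filled.loc[result.index]
result["Variable"] = kept.where(kept.ne(kept.shift()), "")
result = result.reset_index(drop=True)
print(result)
```

This assumes consecutive kept blocks carry different labels (A1, A2, ...), which matches the layout described in the question.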