使用 cumsum 查找独特的章节
Using cumsum to find unique chapters
我有这样一个数据框:
df = pd.DataFrame()
text secFlag
0 book 1
1 headings 1
2 chapter 1
3 one 1
4 page 0
5 one 0
6 text 0
7 chapter 1
8 two 1
9 page 0
10 two 0
11 text 0
12 page 0
13 three 0
10 text 0
11 chapter 1
12 three 1
13 something 0
我想求出总和,这样我就可以用 运行 索引号标记属于特定章节的所有页面。
**Desired output**
text secFlag chapter
0 book 1 1
1 headings 1 1
2 chapter 1 2
3 one 1 2
4 page 0 2
5 one 0 2
6 text 0 2
7 chapter 1 3
8 two 1 3
9 page 0 3
10 two 0 3
11 text 0 3
12 page 0 3
13 three 0 3
10 text 0 3
11 chapter 1 4
12 three 1 4
13 something 0 4
这是我试过的:
df['chapter'] = ((df['secFlag'].shift(-1) == 1)).cumsum()
但是,这并没有给我想要的输出,因为只要节标志中的值为 1,它就会递增。请注意,多个单词是文本的一部分,章节标题通常会有多个单词。
你能推荐一个简单的方法来完成这个吗?
谢谢
如果需要在secFlag
中先1
标记解决方案是:
df['chapter'] = ((df['secFlag'] == 1) & (df['secFlag'] != df['secFlag'].shift())).cumsum()
print (df)
text secFlag chapter
0 book 1 1
1 headings 1 1
2 chapter 1 1
3 one 1 1
4 page 0 1
5 one 0 1
6 text 0 1
7 chapter 1 2
8 two 1 2
9 page 0 2
10 two 0 2
11 text 0 2
12 page 0 2
13 three 0 2
10 text 0 2
11 chapter 1 3
12 three 1 3
13 something 0 3
详情:
a = (df['secFlag'] == 1)
b = (df['secFlag'] != df['secFlag'].shift())
c = a & b
d = c.cumsum()
print (pd.concat([df,a,b,c,d],
axis=1,
keys=('orig','==1','!=shifted','chained by &','cumsum')))
orig ==1 !=shifted chained by & cumsum
text secFlag secFlag secFlag secFlag secFlag
0 book 1 True True True 1
1 headings 1 True False False 1
2 chapter 1 True False False 1
3 one 1 True False False 1
4 page 0 False True False 1
5 one 0 False False False 1
6 text 0 False False False 1
7 chapter 1 True True True 2
8 two 1 True False False 2
9 page 0 False True False 2
10 two 0 False False False 2
11 text 0 False False False 2
12 page 0 False False False 2
13 three 0 False False False 2
10 text 0 False False False 2
11 chapter 1 True True True 3
12 three 1 True False False 3
13 something 0 False True False 3
我有这样一个数据框:
df = pd.DataFrame()
text secFlag
0 book 1
1 headings 1
2 chapter 1
3 one 1
4 page 0
5 one 0
6 text 0
7 chapter 1
8 two 1
9 page 0
10 two 0
11 text 0
12 page 0
13 three 0
10 text 0
11 chapter 1
12 three 1
13 something 0
我想求出总和,这样我就可以用 运行 索引号标记属于特定章节的所有页面。
**Desired output**
text secFlag chapter
0 book 1 1
1 headings 1 1
2 chapter 1 2
3 one 1 2
4 page 0 2
5 one 0 2
6 text 0 2
7 chapter 1 3
8 two 1 3
9 page 0 3
10 two 0 3
11 text 0 3
12 page 0 3
13 three 0 3
10 text 0 3
11 chapter 1 4
12 three 1 4
13 something 0 4
这是我试过的:
df['chapter'] = ((df['secFlag'].shift(-1) == 1)).cumsum()
但是,这并没有给我想要的输出,因为只要节标志中的值为 1,它就会递增。请注意,多个单词是文本的一部分,章节标题通常会有多个单词。
你能推荐一个简单的方法来完成这个吗? 谢谢
如果需要在secFlag
中先1
标记解决方案是:
df['chapter'] = ((df['secFlag'] == 1) & (df['secFlag'] != df['secFlag'].shift())).cumsum()
print (df)
text secFlag chapter
0 book 1 1
1 headings 1 1
2 chapter 1 1
3 one 1 1
4 page 0 1
5 one 0 1
6 text 0 1
7 chapter 1 2
8 two 1 2
9 page 0 2
10 two 0 2
11 text 0 2
12 page 0 2
13 three 0 2
10 text 0 2
11 chapter 1 3
12 three 1 3
13 something 0 3
详情:
a = (df['secFlag'] == 1)
b = (df['secFlag'] != df['secFlag'].shift())
c = a & b
d = c.cumsum()
print (pd.concat([df,a,b,c,d],
axis=1,
keys=('orig','==1','!=shifted','chained by &','cumsum')))
orig ==1 !=shifted chained by & cumsum
text secFlag secFlag secFlag secFlag secFlag
0 book 1 True True True 1
1 headings 1 True False False 1
2 chapter 1 True False False 1
3 one 1 True False False 1
4 page 0 False True False 1
5 one 0 False False False 1
6 text 0 False False False 1
7 chapter 1 True True True 2
8 two 1 True False False 2
9 page 0 False True False 2
10 two 0 False False False 2
11 text 0 False False False 2
12 page 0 False False False 2
13 three 0 False False False 2
10 text 0 False False False 2
11 chapter 1 True True True 3
12 three 1 True False False 3
13 something 0 False True False 3