Python pandas concatenate: join="inner" works on toy data, not on real data

Question

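I am working with topic-modeling data. I have one data frame (called "scores") containing a small subset of topics and their scores for each document or author, and another data frame (called "words") containing the top three words for each of the 250 topics.

I am trying to combine the two data frames so that "scores" gets an extra column holding, for each topic in "scores", the top three words from "words". This is useful for visualizing the data as a heatmap, because seaborn or pyplot will pick up the labels from such a data frame automatically.

I have tried a number of merge and concat commands but did not get the desired result. The strange thing is that the command that seems most logical according to my understanding of the relevant documentation and its examples (i.e. using concat on the two dfs with axis=1 and join="inner") works on toy data but not on my real data.

Here is my toy data, along with the code I used to generate it and do the merge: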

import pandas as pd

## Defining the two data frames
scores = pd.DataFrame({'author1': ['1.00', '1.50'],
                    'author2': ['2.75', '1.20'],
                    'author3': ['0.55', '1.25'],
                    'author4': ['0.95', '1.3']},
                     index=[1, 3])                     

words = pd.DataFrame({'words': ['cadavre','fenêtre','musique','mariage']},
                     index=[0, 1, 2, 3])

## Inspecting the two dataframes
print("\n==scores==\n", scores)
print("\n==words==\n", words)

## Merging the dataframes
merged = pd.concat([scores, words], axis=1, join="inner")

## Check the result
print("\n==merged==\n", merged)

This is the output, as expected:

==scores==
   author1 author2 author3 author4
1    1.00    2.75    0.55    0.95
3    1.50    1.20    1.25     1.3

==words==
      words
0  cadavre
1  fenêtre
2  musique
3  mariage

==merged==
   author1 author2 author3 author4    words
1    1.00    2.75    0.55    0.95  fenêtre
3    1.50    1.20    1.25     1.3  mariage

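That is exactly what I want to accomplish with my real data. Although the two data frames look no different from the test data, I get an empty data frame after merging.

Here is a small example from my real data:

someScores (complete table):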

      blanche  policier
108  0.003028  0.017494
71   0.002997  0.016956
115  0.029324  0.016127
187  0.004867  0.017631
122  0.002948  0.015118

firstWords (first rows only; the index goes up to 249, and every index entry in "someScores" has an equivalent in "firstWords"):

                               topicwords
0              château-pays-intendant (0)
1                 esclave-palais-race (1)
2                  linge-voisin-chose (2)
3          question-messieurs-réponse (3)
4        prince-princesse-monseigneur (4)
5               arbre-branche-feuille (5)

My merge command:

dataToPlot = pd.concat([someScores, firstWords], axis=1, join="inner")

And the resulting data frame (empty!):

Empty DataFrame
Columns: [blanche, policier, topicwords]
Index: []

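I have tried many variants, such as using merge instead, or creating additional columns that duplicate the index and then merging on those columns with left_on and right_on, but I either get the same result or I just get NaN in the "topicwords" column.

Any hints and help would be greatly appreciated!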

Answer 1

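An inner join only returns rows whose index is present in both data frames. Judging by the row indices shown for someScores (108, 71, 115, 187, 122) and firstWords (0, 1, 2, 3, 4, 5), the two have no values in common, so the result is an empty data frame.

Either set these indices correctly or specify a different join condition.
You can confirm the problem by checking the index for common values: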

someScores.index.intersection(firstWords.index)
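
As a minimal sketch of that check, here are hypothetical stand-ins for the two frames (the real data is not available here). The string index on `some_scores` is an assumption, not something the question confirms: it illustrates one possible way for labels to look identical when printed yet never compare equal, which leaves the intersection empty. The names `some_scores`, `first_words` and the placeholder topic words are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the real frames. The score values come from the
# excerpt in the question; the string index is an ASSUMPTION made here to
# show one way an inner join can come out empty.
some_scores = pd.DataFrame(
    {"blanche": [0.003028, 0.002997], "policier": [0.017494, 0.016956]},
    index=["108", "71"],  # labels stored as strings (assumed)
)
first_words = pd.DataFrame(
    {"topicwords": [f"placeholder-{i} ({i})" for i in range(250)]},
    index=range(250),  # labels stored as integers
)

# Labels shared by both indices; an empty result means an inner join
# on the index will return no rows.
common = some_scores.index.intersection(first_words.index)
print("shared labels:", len(common))

# Comparing the dtypes shows whether the labels can match at all.
print("dtypes:", some_scores.index.dtype, first_words.index.dtype)
```

If the two dtypes differ, the indices need to be brought to a common type before any index-based join can find matches.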

See the documentation for the different join strategies.
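
To illustrate what "setting the indices correctly" might look like, the sketch below assumes that the mismatch is one of dtype (string labels on one side, integer labels on the other). That cause is an assumption, and the frames, the names `some_scores` and `first_words`, and the placeholder topic words are hypothetical. Once both indices share a dtype, the original concat call and an index-on-index merge both find the matching rows.

```python
import pandas as pd

# Hypothetical stand-ins for the real frames; the string index is an
# ASSUMPTION used to reproduce an empty inner join.
some_scores = pd.DataFrame(
    {"blanche": [0.003028, 0.002997], "policier": [0.017494, 0.016956]},
    index=["108", "71"],
)
first_words = pd.DataFrame(
    {"topicwords": [f"placeholder-{i} ({i})" for i in range(250)]},
    index=range(250),
)

# Bring both indices to the same dtype before joining.
some_scores.index = some_scores.index.astype(int)

# The concat call from the question now finds the shared labels.
data_to_plot = pd.concat([some_scores, first_words], axis=1, join="inner")
print(data_to_plot)

# The same join expressed as an explicit index-on-index merge.
merged = some_scores.merge(
    first_words, left_index=True, right_index=True, how="inner"
)
print(merged)
```

Casting the smaller frame's index is the less invasive choice here; casting `first_words.index` to strings instead would work equally well as long as both sides end up with the same type.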
