SurveyMonkey data formatting using Pandas

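I have a survey to analyze that participants completed on SurveyMonkey. Unfortunately, the data is organized in a less-than-ideal way: every categorical response to every question gets its own column.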

For example, here are the first few rows of the DataFrame:

     How long have you been participating in the Garden Awards Program?  \
0                                           One year                   
1                                                NaN                   
2                                                NaN                   
3                                                NaN                   
4                                                NaN                   

  Unnamed: 10 Unnamed: 11      Unnamed: 12  \
0   2-3 years   4-5 years  5 or more years   
1         NaN         NaN              NaN   
2         NaN   4-5 years              NaN   
3   2-3 years         NaN              NaN   
4         NaN         NaN  5 or more years   

  How did you initially learn of the Garden Awards Program?  \
0              I nominated my garden to be evaluated          
1                                                NaN          
2              I nominated my garden to be evaluated          
3                                                NaN          
4                                                NaN          

                                         Unnamed: 14  etc...
0  A friend or family member nominated my garden ...  
1  A friend or family member nominated my garden ...  
2                                                NaN  
3                                                NaN  
4                                                NaN  

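The question How long have you been participating in the Garden Awards Program? has the valid responses One year, 2-3 years, etc. All of them appear in the first row, which acts as a key showing which column holds which response. This is the first problem. (The same goes for How did you initially learn of the Garden Awards Program?, whose valid responses are I nominated my garden to be evaluated, A friend or family member nominated my garden, etc.)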

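The second problem is that the additional columns for each categorical response are named Unnamed: N, where N runs over as many columns as there are categories across all questions.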

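Before I start remapping and flattening/collapsing each question's columns into a single column, I would like to know whether there is any other way to handle survey data presented like this with Pandas. All my searches point to the SurveyMonkey API, but I don't see how it would help.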

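I'm guessing I need to flatten the columns, so if anyone can suggest a method, that would be great. I'm thinking there is a way to grab all the columns belonging to a categorical response by continuing to take adjacent columns until Unnamed no longer appears in the column name, but I don't know how to do it.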

I will use the following DataFrame (the CSV file can be downloaded from here):

     Q1 Unnamed: 2 Unnamed: 3    Q2 Unnamed: 5 Unnamed: 6    Q3 Unnamed: 7 Unnamed: 8
0  A1-A       A1-B       A1-C  A2-A       A2-B       A2-C  A3-A       A4-B       A3-C
1  A1-A        NaN        NaN   NaN       A2-B        NaN   NaN        NaN       A3-C
2   NaN       A1-B        NaN  A2-A        NaN        NaN   NaN       A4-B        NaN
3   NaN        NaN       A1-C   NaN       A2-B        NaN  A3-A        NaN        NaN
4   NaN       A1-B        NaN   NaN        NaN       A2-C   NaN        NaN       A3-C
5  A1-A        NaN        NaN   NaN       A2-B        NaN  A3-A        NaN        NaN

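Key assumptions:

  1. Every column whose name does not start with Unnamed is actually the title of a question
  2. The columns between two question titles hold the options for the question at the left end of that column interval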
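
To make the two assumptions concrete, here is a minimal sketch (not part of the solution itself) that maps every column to the question it belongs to. The names `columns`, `is_title` and `owner` are made up for illustration; the column labels are copied from the sample DataFrame above.

```python
import pandas as pd

# Column names copied from the sample DataFrame above.
columns = pd.Series([
    "Q1", "Unnamed: 2", "Unnamed: 3",
    "Q2", "Unnamed: 5", "Unnamed: 6",
    "Q3", "Unnamed: 7", "Unnamed: 8",
])

# Assumption 1: a name that does not start with "Unnamed" is a question title.
is_title = ~columns.str.startswith("Unnamed")

# Assumption 2: every other column belongs to the nearest title on its left,
# so blank out the non-titles and forward-fill the titles over them.
owner = columns.where(is_title).ffill()

print(owner.tolist())
# ['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2', 'Q3', 'Q3', 'Q3']
```

If a real export has free-text or metadata columns that do not start with Unnamed, assumption 1 would treat them as questions too, so it is worth inspecting this mapping before flattening.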

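Solution overview:

  1. Find the indices where each question starts and ends
  2. Flatten each question into a single column (pd.Series)
  3. Merge the question columns back together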

Implementation (part 1):

indices = [i for i, c in enumerate(df.columns) if not c.startswith('Unnamed')]
questions = [c for c in df.columns if not c.startswith('Unnamed')]
slices = [slice(i, j) for i, j in zip(indices, indices[1:] + [None])]
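
As a quick check, the same three expressions can be run on a plain list of the sample column names (`cols` is a stand-in for `df.columns`, introduced here only so the snippet runs without the CSV):

```python
# Stand-in for df.columns, copied from the sample DataFrame above.
cols = [
    "Q1", "Unnamed: 2", "Unnamed: 3",
    "Q2", "Unnamed: 5", "Unnamed: 6",
    "Q3", "Unnamed: 7", "Unnamed: 8",
]

# Positions of the question titles.
indices = [i for i, c in enumerate(cols) if not c.startswith("Unnamed")]
# The titles themselves.
questions = [c for c in cols if not c.startswith("Unnamed")]
# Each question spans from its own title up to the next title;
# the trailing None lets the last slice run to the final column.
slices = [slice(i, j) for i, j in zip(indices, indices[1:] + [None])]

print(indices)    # [0, 3, 6]
print(questions)  # ['Q1', 'Q2', 'Q3']
print(slices)     # [slice(0, 3, None), slice(3, 6, None), slice(6, None, None)]
```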

As you can see, iterating over the slices as shown below gives you one DataFrame per question:

for q in slices:
    print(df.iloc[:, q])  # Use `display` if using Jupyter

Implementation (parts 2-3):

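import numpy as np
import pandas as pd

def parse_response(s):
    # Return the first non-null answer in the row, or NaN if nothing was selected.
    try:
        return s.dropna().iloc[0]
    except IndexError:
        return np.nan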

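# Collapse each question's columns into one Series.
# [1:] drops row 0, which only holds the option labels.
data = [df.iloc[:, q].apply(parse_response, axis=1)[1:] for q in slices]
df = pd.concat(data, axis=1)
df.columns = questions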

Output:

     Q1    Q2    Q3
1  A1-A  A2-B  A3-C
2  A1-B  A2-A  A4-B
3  A1-C  A2-B  A3-A
4  A1-B  A2-C  A3-C
5  A1-A  A2-B  A3-A
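
For anyone who wants to run the whole thing without downloading the CSV, here is a self-contained sketch of the same three steps. It builds the sample DataFrame in memory, which is an assumption about what reading the CSV would produce, and it uses a hypothetical helper `first_answer` that picks the first non-null value by position with `.iloc[0]`, since label-style `[0]` lookups on a string-indexed Series may warn or fail on newer pandas versions.

```python
import numpy as np
import pandas as pd

nan = np.nan
# In-memory copy of the sample data; row 0 holds the option labels.
df = pd.DataFrame(
    [
        ["A1-A", "A1-B", "A1-C", "A2-A", "A2-B", "A2-C", "A3-A", "A4-B", "A3-C"],
        ["A1-A", nan, nan, nan, "A2-B", nan, nan, nan, "A3-C"],
        [nan, "A1-B", nan, "A2-A", nan, nan, nan, "A4-B", nan],
        [nan, nan, "A1-C", nan, "A2-B", nan, "A3-A", nan, nan],
        [nan, "A1-B", nan, nan, nan, "A2-C", nan, nan, "A3-C"],
        ["A1-A", nan, nan, nan, "A2-B", nan, "A3-A", nan, nan],
    ],
    columns=[
        "Q1", "Unnamed: 2", "Unnamed: 3",
        "Q2", "Unnamed: 5", "Unnamed: 6",
        "Q3", "Unnamed: 7", "Unnamed: 8",
    ],
)

# Part 1: locate each question's block of columns.
indices = [i for i, c in enumerate(df.columns) if not c.startswith("Unnamed")]
questions = [c for c in df.columns if not c.startswith("Unnamed")]
slices = [slice(i, j) for i, j in zip(indices, indices[1:] + [None])]

def first_answer(row):
    # Hypothetical helper: first non-null value in the row, NaN if none.
    answers = row.dropna()
    return answers.iloc[0] if len(answers) else np.nan

# Parts 2-3: skip the label row, collapse each block, and stitch the results.
data = [df.iloc[1:, q].apply(first_answer, axis=1) for q in slices]
flat = pd.concat(data, axis=1)
flat.columns = questions

print(flat)
```

This prints the same table as the output above. Note that it keeps only the first non-null value per question, so it fits single-choice questions; a multi-select question would need a different reduction.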