SurveyMonkey data formatting using Pandas
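I have a survey to analyze that participants completed on SurveyMonkey. Unfortunately, the data is organized awkwardly: each categorical response to each question gets its own column.
For example, here are the first few rows for one of the responses in the DataFrame: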
How long have you been participating in the Garden Awards Program? \
0 One year
1 NaN
2 NaN
3 NaN
4 NaN
Unnamed: 10 Unnamed: 11 Unnamed: 12 \
0 2-3 years 4-5 years 5 or more years
1 NaN NaN NaN
2 NaN 4-5 years NaN
3 2-3 years NaN NaN
4 NaN NaN 5 or more years
How did you initially learn of the Garden Awards Program? \
0 I nominated my garden to be evaluated
1 NaN
2 I nominated my garden to be evaluated
3 NaN
4 NaN
Unnamed: 14 etc...
0 A friend or family member nominated my garden ...
1 A friend or family member nominated my garden ...
2 NaN
3 NaN
4 NaN
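The question How long have you been participating in the Garden Awards Program? has the valid responses One year, 2-3 years, etc., and they all appear in the first row, which acts as a key to which column holds which value. This is the first problem. (The same goes for How did you initially learn of the Garden Awards Program?, where the valid responses are I nominated my garden to be evaluated, A friend or family member nominated my garden, etc.)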
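The second problem is that the additional columns for each categorical response are named Unnamed: N, where N runs over as many columns as there are categories across all questions.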
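Before I start remapping and flattening/collapsing the columns for each question into a single column, I was wondering whether there is any other way to handle survey data presented like this using Pandas. All my searches pointed to the SurveyMonkey API, but I don't see how that would be useful.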
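I am guessing that I will need to flatten the columns, so if anyone can suggest a method, that would be great. I am thinking there is a way to grab all the columns belonging to one categorical response by taking adjacent columns until Unnamed no longer appears in the column name, but I don't know how to do it.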
I will use the following DataFrame (the CSV file can be downloaded from here):
Q1 Unnamed: 2 Unnamed: 3 Q2 Unnamed: 5 Unnamed: 6 Q3 Unnamed: 7 Unnamed: 8
0 A1-A A1-B A1-C A2-A A2-B A2-C A3-A A4-B A3-C
1 A1-A NaN NaN NaN A2-B NaN NaN NaN A3-C
2 NaN A1-B NaN A2-A NaN NaN NaN A4-B NaN
3 NaN NaN A1-C NaN A2-B NaN A3-A NaN NaN
4 NaN A1-B NaN NaN NaN A2-C NaN NaN A3-C
5 A1-A NaN NaN NaN A2-B NaN A3-A NaN NaN
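Key assumptions:
- Every column whose name does not start with Unnamed is actually the title of a question
- The columns between two question titles hold the options for the question at the left end of that column interval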
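Solution overview:
- Find the indices where each question starts and ends
- Flatten each question into a single column (pd.Series)
- Merge the question columns back together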
Implementation (part 1):
indices = [i for i, c in enumerate(df.columns) if not c.startswith('Unnamed')]
questions = [c for c in df.columns if not c.startswith('Unnamed')]
slices = [slice(i, j) for i, j in zip(indices, indices[1:] + [None])]
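To make the slicing step concrete, here is a minimal sketch that applies the same two comprehensions to the column names of the sample DataFrame. The names columns and df_demo are illustrative only, and the frame is empty because only the headers matter at this step.

```python
import pandas as pd

# Column layout copied from the sample DataFrame above (headers only)
columns = ["Q1", "Unnamed: 2", "Unnamed: 3",
           "Q2", "Unnamed: 5", "Unnamed: 6",
           "Q3", "Unnamed: 7", "Unnamed: 8"]
df_demo = pd.DataFrame(columns=columns)

# Position of every question title (a column not named "Unnamed: N")
indices = [i for i, c in enumerate(df_demo.columns) if not c.startswith("Unnamed")]
# Each question spans from its title up to the next title; the last runs to the end
slices = [slice(i, j) for i, j in zip(indices, indices[1:] + [None])]

print(indices)  # [0, 3, 6]
print(slices)   # [slice(0, 3, None), slice(3, 6, None), slice(6, None, None)]
```

Appending None to the shifted list is what lets the final question extend to the last column without a special case.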
Iterating over the slices as below shows that you get one DataFrame per question:
for q in slices:
    print(df.iloc[:, q])  # Use `display` if using Jupyter
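Implementation (parts 2-3):
import numpy as np
import pandas as pd

def parse_response(s):
    # Return the first non-null value in the row, or NaN if the row is empty
    try:
        return s.dropna().iloc[0]  # positional access; s[0] would be a label lookup here
    except IndexError:
        return np.nan

# Flatten each question's block row by row; [1:] drops row 0, which holds the option labels
data = [df.iloc[:, q].apply(parse_response, axis=1)[1:] for q in slices]
df = pd.concat(data, axis=1)
df.columns = questions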
Output:
Q1 Q2 Q3
1 A1-A A2-B A3-C
2 A1-B A2-A A4-B
3 A1-C A2-B A3-A
4 A1-B A2-C A3-C
5 A1-A A2-B A3-A
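As a possible alternative to the row-wise apply, the same flattening can be sketched with a back-fill along each row, which pulls the first non-null value of every question block into its leftmost column. This is only a sketch under the same assumptions as above (one answer per question per respondent); the names raw, starts, answers and tidy are illustrative, and the sample data is retyped here so the block runs on its own.

```python
import numpy as np
import pandas as pd

nan = np.nan
columns = ["Q1", "Unnamed: 2", "Unnamed: 3",
           "Q2", "Unnamed: 5", "Unnamed: 6",
           "Q3", "Unnamed: 7", "Unnamed: 8"]
rows = [
    ["A1-A", "A1-B", "A1-C", "A2-A", "A2-B", "A2-C", "A3-A", "A4-B", "A3-C"],
    ["A1-A", nan, nan, nan, "A2-B", nan, nan, nan, "A3-C"],
    [nan, "A1-B", nan, "A2-A", nan, nan, nan, "A4-B", nan],
    [nan, nan, "A1-C", nan, "A2-B", nan, "A3-A", nan, nan],
    [nan, "A1-B", nan, nan, nan, "A2-C", nan, nan, "A3-C"],
    ["A1-A", nan, nan, nan, "A2-B", nan, "A3-A", nan, nan],
]
raw = pd.DataFrame(rows, columns=columns)

# Positions of the question titles, as in part 1
starts = [i for i, c in enumerate(raw.columns) if not c.startswith("Unnamed")]
# Row 0 only lists the available options, so it is not a respondent
answers = raw.iloc[1:]

# Back-fill each block along the row, then keep its first column
tidy = pd.DataFrame({
    raw.columns[i]: answers.iloc[:, i:j].bfill(axis=1).iloc[:, 0]
    for i, j in zip(starts, starts[1:] + [None])
})
print(tidy)
```

A respondent who skipped a question ends up with NaN in that column, which matches what parse_response returns for an all-null row.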