How would I collapse rows based on a value in a column?
I'll describe what I mean in more detail here.
Say I have a data sheet that looks like this:
+-----------+---------+---------+---------+---------+---------+---------+--------------+
| | Person1 | Person2 | Person3 | Person4 | Person5 | Person6 | City |
+-----------+---------+---------+---------+---------+---------+---------+--------------+
| January | - | - | Yes | - | Yes | - | SanFrancisco |
| February | Yes | - | - | - | - | - | SanFrancisco |
| March | - | - | - | - | - | - | SanFrancisco |
| April | - | - | - | - | - | - | NewYork |
| May | Yes | - | - | - | - | - | NewYork |
| June | - | - | - | - | - | - | NewYork |
| July | - | - | - | - | Yes | - | NewYork |
| August | - | - | - | - | - | - | NewYork |
| September | - | - | - | - | - | - | Miami |
| November | - | - | - | - | - | Yes | Miami |
| December | - | - | - | - | - | - | Miami |
+-----------+---------+---------+---------+---------+---------+---------+--------------+
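Ignoring the Stack Overflow ASCII formatting, this is a simple spreadsheet that tracks 6 people by which city they were in during which months.
I only want to know which people have been to which cities. Effectively, a condensed list like this: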
+---------+---------+---------+---------+---------+---------+--------------+
| Person1 | Person2 | Person3 | Person4 | Person5 | Person6 | City |
+---------+---------+---------+---------+---------+---------+--------------+
| Yes | - | Yes | - | Yes | - | SanFrancisco |
| Yes | - | - | - | Yes | - | NewYork |
| - | - | - | - | - | Yes | Miami |
+---------+---------+---------+---------+---------+---------+--------------+
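Each row has just one city, along with the people who visited it. Is there a best way to do this, or rather, is there some kind of tr (squeeze)/sed tool that already does it? If I have to write code, what is the best logic?
The proper term for what you're trying to do here is aggregation. In my experience, the word "collapse" isn't commonly used for this operation.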
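To make the term concrete, here is a minimal plain-Python sketch of the logic: group the rows by city, then reduce each person's flags to a single value. The `rows` sample, the `people` list, and the variable names are made up for illustration and are not the question's full data.

```python
from collections import defaultdict

# Hypothetical rows in the question's shape: (month, city, {person: flag}).
rows = [
    ("January", "SanFrancisco", {"Person1": "-", "Person2": "Yes"}),
    ("February", "SanFrancisco", {"Person1": "Yes", "Person2": "-"}),
    ("April", "NewYork", {"Person1": "-", "Person2": "-"}),
]

# Group: collect, per city, the set of people with at least one "Yes".
visitors = defaultdict(set)
for _month, city, flags in rows:
    visitors[city].update(p for p, flag in flags.items() if flag == "Yes")

# Reduce: one row per city, "Yes" if the person ever visited, else "-".
people = ["Person1", "Person2"]
collapsed = {
    city: ["Yes" if p in seen else "-" for p in people]
    for city, seen in visitors.items()
}
print(collapsed)
```

Because the dictionary is keyed by city, the rows for a city don't need to be next to each other in the input.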
I'm learning Python myself, so there may be a better way, but I got it working using the pandas module, specifically the DataFrame type:
import pandas as pd

df = pd.DataFrame({
    'Date': ['January','February','March','April','May','June','July','August','September','November','December'],
    'Person1': ['-','Yes','-','-','Yes','-','-','-','-','-','-'],
    'Person2': ['-','-','-','-','-','-','-','-','-','-','-'],
    'Person3': ['Yes','-','-','-','-','-','-','-','-','-','-'],
    'Person4': ['-','-','-','-','-','-','-','-','-','-','-'],
    'Person5': ['Yes','-','-','-','-','-','Yes','-','-','-','-'],
    'Person6': ['-','-','-','-','-','-','-','-','-','Yes','-'],
    'City': ['SanFrancisco','SanFrancisco','SanFrancisco','NewYork','NewYork','NewYork','NewYork','NewYork','Miami','Miami','Miami'],
})

# Every column whose name starts with "Person" gets aggregated.
person_cols = [c for c in df.columns if c.startswith('Person')]

# Per city, a person's column becomes 'Yes' if any row in that group says 'Yes'.
print(df.groupby('City')[person_cols].agg(lambda s: 'Yes' if (s == 'Yes').any() else '-'))
##              Person1 Person2 Person3 Person4 Person5 Person6
## City
## Miami              -       -       -       -       -     Yes
## NewYork          Yes       -       -       -     Yes       -
## SanFrancisco     Yes       -     Yes       -     Yes       -
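One possible variation, sketched on a smaller made-up sample rather than the full sheet: convert the flags to booleans first and let `any()` do the reduction. Passing `sort=False` to `groupby` keeps the cities in order of first appearance, which matches the order in the question's desired output. The sample frame and the names `visited` and `collapsed` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical smaller sample in the same shape as the question's sheet.
df = pd.DataFrame({
    "Month":   ["January", "February", "April", "May", "November"],
    "Person1": ["-", "Yes", "-", "Yes", "-"],
    "Person2": ["Yes", "-", "-", "-", "-"],
    "Person3": ["-", "-", "-", "-", "Yes"],
    "City":    ["SanFrancisco", "SanFrancisco", "NewYork", "NewYork", "Miami"],
})

person_cols = [c for c in df.columns if c.startswith("Person")]

# Turn "Yes"/"-" into True/False, then ask "any True?" per city.
# sort=False keeps cities in order of first appearance instead of sorting them.
visited = df[person_cols].eq("Yes").groupby(df["City"], sort=False).any()

# Map the booleans back to the sheet's "Yes"/"-" notation.
collapsed = pd.DataFrame(
    np.where(visited, "Yes", "-"),
    index=visited.index,
    columns=visited.columns,
)
print(collapsed)
```

Keeping the intermediate boolean frame around can be handy if you later want counts (`sum()`) instead of a yes/no answer.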
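Also, I strongly recommend you take a look at the R programming language, an excellent and increasingly widespread platform for statistics, graphics, and general data analysis that is well suited to working with Excel-style tabular data. These kinds of data-format transformations are definitely more natural in R, although the learning curve is fairly steep. Here's the R implementation: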
df <- read.csv(stringsAsFactors=FALSE,text=
'Date,Person1,Person2,Person3,Person4,Person5,Person6,City
January,-,-,Yes,-,Yes,-,SanFrancisco
February,Yes,-,-,-,-,-,SanFrancisco
March,-,-,-,-,-,-,SanFrancisco
April,-,-,-,-,-,-,NewYork
May,Yes,-,-,-,-,-,NewYork
June,-,-,-,-,-,-,NewYork
July,-,-,-,-,Yes,-,NewYork
August,-,-,-,-,-,-,NewYork
September,-,-,-,-,-,-,Miami
November,-,-,-,-,-,Yes,Miami
December,-,-,-,-,-,-,Miami'
);
aggregate(.~City,df[-1L],function(x) if (any(x=='Yes')) 'Yes' else '-');
## City Person1 Person2 Person3 Person4 Person5 Person6
## 1 Miami - - - - - Yes
## 2 NewYork Yes - - - Yes -
## 3 SanFrancisco Yes - Yes - Yes -
$ cat tst.awk
function prt() {
    if ( prev != "" ) {
        for (i=2;i<=NF;i++) {
            printf "%s%s", vals[i], (i<NF ? OFS : ORS)
        }
    }
    delete vals
}
BEGIN { FS=OFS="," }
$NF != prev { prt() }
{
    for (i=1;i<=NF;i++) {
        vals[i] = (vals[i] ~ /[[:alpha:]]/ ? vals[i] : $i)
    }
    prev = $NF
}
END { prt() }
$ awk -f tst.awk file
Person1,Person2,Person3,Person4,Person5,Person6,City
Yes,-,Yes,-,Yes,-,SanFrancisco
Yes,-,-,-,Yes,-,NewYork
-,-,-,-,-,Yes,Miami
The above assumes your input format is actually CSV, like this:
$ cat file
Month,Person1,Person2,Person3,Person4,Person5,Person6,City
January,-,-,Yes,-,Yes,-,SanFrancisco
February,Yes,-,-,-,-,-,SanFrancisco
March,-,-,-,-,-,-,SanFrancisco
April,-,-,-,-,-,-,NewYork
May,Yes,-,-,-,-,-,NewYork
June,-,-,-,-,-,-,NewYork
July,-,-,-,-,Yes,-,NewYork
August,-,-,-,-,-,-,NewYork
September,-,-,-,-,-,-,Miami
November,-,-,-,-,-,Yes,Miami
December,-,-,-,-,-,-,Miami
and that you want CSV output.
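For comparison, the same run-based idea can be sketched in Python with `csv` and `itertools.groupby`. This is a rough equivalent, not a port: it tests for the literal string "Yes" where the awk script keeps the first alphabetic value, and the `CSV_TEXT` sample and the `collapse_runs` name are made up for illustration. Like the awk script, it only merges consecutive rows, so it relies on each city's rows being adjacent in the input.

```python
import csv
import io
from itertools import groupby

# Hypothetical CSV in the same layout as `file`: month first, city last.
CSV_TEXT = """Month,Person1,Person2,City
January,-,Yes,SanFrancisco
February,Yes,-,SanFrancisco
April,-,-,NewYork
May,Yes,-,NewYork
November,-,-,Miami
"""

def collapse_runs(text):
    """Merge consecutive rows that share the last column (the city)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    out = [header[1:]]  # drop the month column, as the awk script does
    # groupby only merges adjacent rows, so a city that reappears later
    # would start a new output row (same limitation as the awk script).
    for city, run in groupby(reader, key=lambda row: row[-1]):
        flags = zip(*(row[1:-1] for row in run))  # one tuple per person
        out.append(["Yes" if "Yes" in col else "-" for col in flags] + [city])
    return out

result = collapse_runs(CSV_TEXT)
for row in result:
    print(",".join(row))
```

If the input isn't already grouped by city, sorting the data rows by the last column first (or using a dictionary keyed by city) avoids the adjacency requirement.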