Counting lineages in a CSV from the end to the beginning
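Below I have a CSV file in which each column contains one lineage, and the lineages have different lengths. I'm trying to count the groups starting from the end of each lineage, i.e. from the last element back to the beginning of the lineage.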
Column1             Column2             Column3
root                root                root
cellular organisms  cellular organisms  cellular organisms
Eukaryota           Eukaryota           Eukaryota
Sar                 Sar                 Viridiplantae
Alveolata           Alveolata
Apicomplexa         Apicomplexa
Aconoidasida
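I tried the following code provided by @xjcl, but the problem is that the script treats all lineages as having the same length and therefore produces wrong values. Any help would be appreciated.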
import pandas as pd
from io import StringIO

def filter_and_count(df, search_string):
    # keep only the columns that contain search_string somewhere
    df_filtered = df.loc[:, (df == search_string).any(axis=0)]
    # flatten the remaining columns into one 'value' column and count each value
    return pd.melt(df_filtered)['value'].value_counts()

df = pd.read_csv("/Users/Desktop/test.csv")  # read the csv file
df = df.transpose()
df = pd.melt(df[-2:])['value'].value_counts()  # counting phyla
df.to_csv(r'/Users/Desktop/eukaryotes.csv')  # the output file
The output I'm looking for has the last-listed groups at the top, with their counts, as follows:
group count
Aconoidasida 1
Apicomplexa 2
Alveolata 2
Sar 2
Viridiplantae 1
Eukaryota 3
cellular organisms 3
root 3
I'm assuming that each row contains the same category (e.g. order, family, species, etc.):
import pandas as pd
import numpy as np

data = {
    'Column1': ['root', 'cellular organisms', 'Eukaryota', 'Sar', 'Alveolata', 'Apicomplexa', 'Aconoidasida'],
    'Column2': ['root', 'cellular organisms', 'Eukaryota', 'Sar', 'Alveolata', 'Apicomplexa', ''],
    'Column3': ['root', 'cellular organisms', 'Eukaryota', 'Viridiplantae', '', '', '']
}
df = pd.DataFrame(data).replace('', np.nan)
which gives
              Column1             Column2             Column3
0                root                root                root
1  cellular organisms  cellular organisms  cellular organisms
2           Eukaryota           Eukaryota           Eukaryota
3                 Sar                 Sar       Viridiplantae
4           Alveolata           Alveolata                 NaN
5         Apicomplexa         Apicomplexa                 NaN
6        Aconoidasida                 NaN                 NaN
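You can iterate over the rows, compute each row's value counts as a dictionary, and then merge the dictionaries (note the reversal with .loc[::-1]):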
counts = df.apply(lambda row: row.value_counts().to_dict(), axis=1)
merged = {group: count for d in counts.loc[::-1] for group, count in d.items()}
giving
{'Aconoidasida': 1,
'Alveolata': 2,
'Apicomplexa': 2,
'Eukaryota': 3,
'Sar': 2,
'Viridiplantae': 1,
'cellular organisms': 3,
'root': 3}
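The merge above keeps a single count per name, so it relies on the assumption that a given name only ever shows up in one row. If that may not hold for real data, one possible variation is to reshape the frame to long format and count every occurrence of each name, ordering by the deepest row in which it appears. This is only a sketch: the column names depth, lineage and group, and the choice to break ties within a row by descending count, are assumptions made for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same sample data as above; empty strings mark the end of a shorter lineage.
data = {
    'Column1': ['root', 'cellular organisms', 'Eukaryota', 'Sar', 'Alveolata', 'Apicomplexa', 'Aconoidasida'],
    'Column2': ['root', 'cellular organisms', 'Eukaryota', 'Sar', 'Alveolata', 'Apicomplexa', ''],
    'Column3': ['root', 'cellular organisms', 'Eukaryota', 'Viridiplantae', '', '', ''],
}
df = pd.DataFrame(data).replace('', np.nan)

# Long format: one record per (depth, lineage, group); 'depth' is the row number.
long = (
    df.rename_axis('depth')
      .reset_index()
      .melt(id_vars='depth', var_name='lineage', value_name='group')
      .dropna(subset=['group'])  # drop the padding of shorter lineages
)

# Count all occurrences of each name and remember the deepest row it appears in.
summary = (
    long.groupby('group')
        .agg(count=('depth', 'size'), depth=('depth', 'max'))
        .sort_values(['depth', 'count'], ascending=[False, False])
)
print(summary['count'])
```

On the sample data this should list the groups in the same order and with the same counts as the dictionary merge; the two approaches only differ when a name occurs in more than one row.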
If needed, you can convert the merged dictionary to a DataFrame:
pd.DataFrame.from_dict(merged, orient='index', columns=['count'])
giving
count
Aconoidasida 1
Apicomplexa 2
Alveolata 2
Sar 2
Viridiplantae 1
Eukaryota 3
cellular organisms 3
root 3
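To connect this back to the CSV in the question, the steps could be strung together roughly as follows. This is a sketch under stated assumptions: csv_text is a hypothetical comma-separated stand-in for test.csv (the real delimiter may differ), the explicit loop is meant to mirror the dictionary merge above rather than replace it, and the output goes to an in-memory buffer instead of a file on disk.

```python
from io import StringIO

import pandas as pd

# Hypothetical comma-separated stand-in for test.csv; shorter lineages end in empty fields.
csv_text = """Column1,Column2,Column3
root,root,root
cellular organisms,cellular organisms,cellular organisms
Eukaryota,Eukaryota,Eukaryota
Sar,Sar,Viridiplantae
Alveolata,Alveolata,
Apicomplexa,Apicomplexa,
Aconoidasida,,
"""

df = pd.read_csv(StringIO(csv_text))  # empty fields are read as NaN

# Walk the rows from last to first; value_counts() skips NaN by default.
merged = {}
for _, row in df.iloc[::-1].iterrows():
    for group, count in row.value_counts().items():
        merged[group] = int(count)

# Two named columns, matching the 'group count' layout asked for in the question.
result = pd.DataFrame(list(merged.items()), columns=['group', 'count'])

# Written to a buffer here; pass a file path to to_csv() instead to save to disk.
buffer = StringIO()
result.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Because read_csv already turns empty fields into NaN, the replace('', np.nan) step from the hand-built example is not needed in this variant.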