Row sorting/counting in a Python Pandas dataframe

Question

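I am trying to develop a quality-check script that goes through a set of data (in a pandas dataframe) and counts the total number of samples of each type. Here is an example from the database: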

The samples in question are all the XXX123 ones.

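My current script picks out all the QC samples, such as blanks or IRMs, but I cannot count the actual XXX123 samples, because some of them have two types of replicates that serve as internal quality checks.

One is "ORIG" and "PREP"
The second is ".1" and ".2"

Another problem is that, in rare cases, a single sample gets both, as you can see for XXX123 85-90.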

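So the question is: how can I account for this? How do I tell Python the following:

Whenever a row contains ".1" and the next row contains ".2", count the two entries as 1
Whenever a row contains "ORIG" and the next row contains "PREPDUP", count the two entries as 1
Whenever a row contains "ORIG .1", the next row contains "ORIG .2", and the third row below contains "PREPDUP", count the three entries as 1.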
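One way to read the three rules above is that consecutive rows sharing the same base sample ID should count once. The following is only a minimal sketch of that reading: the sample IDs are made up (the real layout of `SampleID` is an assumption, since the example data is not reproduced here), and the marker patterns may need adjusting to the actual labels.

```python
import pandas as pd

# Hypothetical sample IDs; the real ID layout is assumed, not taken from the data.
df = pd.DataFrame({"SampleID": [
    "XXX123 70-75",
    "XXX123 75-80.1",
    "XXX123 75-80.2",
    "XXX123 80-85 ORIG",
    "XXX123 80-85 PREPDUP",
    "XXX123 85-90 ORIG .1",
    "XXX123 85-90 ORIG .2",
    "XXX123 85-90 PREPDUP",
]})

# Strip the replicate markers so every replicate collapses to its base ID.
base = (
    df["SampleID"]
    .str.replace(r"\s*(ORIG|PREPDUP)", "", regex=True)  # drop ORIG / PREPDUP
    .str.replace(r"\s*\.[12]$", "", regex=True)         # drop trailing .1 / .2
    .str.strip()
)

# A new sample starts wherever the base ID differs from the row above,
# so adjacent replicates of the same sample are counted once.
starts = base != base.shift()
n_samples = int(starts.sum())
print(n_samples)
```

Because the comparison is against the previous row only, the same base ID appearing again later in the run (not adjacent) would be counted as a new sample, which matches the "next row" wording of the rules.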

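Let me know if I can clarify this further. Thanks! Here is the code I am currently running, but everything below "# Replicates" does not do what I want, because I could not figure it out: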

# IRMs
IRMs = CorrectedDF[CorrectedDF['SampleID'].str.match('IRM')]
print('Total number of IRM samples in the run is: {}'.format(len(IRMs.index)))

# BLANKs
searchfor = ['blk', 'Blank', 'BLK', 'blank']
BLANKs = CorrectedDF[CorrectedDF['SampleID'].str.contains('|'.join(searchfor))]
print('Total number of BLANKs in the run is: {}'.format(len(BLANKs.index)))

# OREAS 239
searchfor2 = ['OREAS 239', 'oreas 239', 'Oreas 239']
OREAS_239 = CorrectedDF[CorrectedDF['SampleID'].str.contains('|'.join(searchfor2))]
print('Total number of OREAS 239 Samples in the run is: {}'.format(len(OREAS_239.index)))

# Cal Standards
searchfor3 = ['Standard', 'Au 15']
CalSTD = CorrectedDF[CorrectedDF['SampleID'].str.contains('|'.join(searchfor3))]
print('Total number of Cal Standard Samples in the run is: {}'.format(len(CalSTD.index)))

# Prep samples
searchfor4 = ['Prep']
Prep = CorrectedDF[CorrectedDF['SampleID'].str.contains('|'.join(searchfor4))]
print('Total number of Prep Samples in the run is: {}'.format(len(Prep.index)))

# Replicates
searchfor5 = ['ORIG', 'PREPDUP']
Replicates = CorrectedDF[CorrectedDF['SampleID'].str.contains('|'.join(searchfor5))]
print('Total number of Replicate Samples in the run is: {}'.format(len(Replicates.index)))

print('Total number of ALL Samples in the run is: {}'.format(len(CorrectedDF.index)))
ClientSamples = len(CorrectedDF.index) - (len(IRMs.index) + len(BLANKs.index)
                                          + len(OREAS_239.index) + len(CalSTD.index)
                                          + len(Prep.index) + len(Replicates.index))
print('Total number of Client-ONLY Samples in the run is: {}'.format(ClientSamples))

Answer 1

df['Label'].str.extract(r'(XXX123 \d+-\d+)').nunique()

You can just use a regular expression to extract what you are looking for, then use nunique to find out how many unique values there are.
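To make the idea concrete, here is a small worked sketch of the extract-then-nunique approach. The rows are invented for illustration, and the column name `Label` follows the snippet above (the question's own code uses `SampleID`); the pattern assumes IDs of the form "XXX123 <from>-<to>" followed by optional replicate markers.

```python
import pandas as pd

# Hypothetical labels mixing QC rows with client samples and their replicates.
df = pd.DataFrame({"Label": [
    "IRM 1",
    "Blank",
    "XXX123 70-75",
    "XXX123 75-80.1",
    "XXX123 75-80.2",
    "XXX123 85-90 ORIG .1",
    "XXX123 85-90 ORIG .2",
    "XXX123 85-90 PREPDUP",
]})

# expand=False returns a Series; rows that do not match become NaN.
base = df["Label"].str.extract(r"(XXX123 \d+-\d+)", expand=False)

# nunique() ignores NaN by default, so QC rows do not contribute to the count.
n_client = base.nunique()
print(n_client)
```

Note that nunique counts each base ID once regardless of where its replicates sit in the run, so it does not depend on the replicate rows being adjacent.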

python

sorting

counting

pandas