How can I identify a subset of a dataset which is indicative of the dataset as a whole?

data-visualization
dataset
sampling

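I have two datasets: one containing a list of businesses, and another containing a list of reviews of those businesses (keyed by business ID). The review dataset is large, around 4 million values, and each business may have anywhere from 0 to 100 reviews. I want to create a word cloud or a unique word counter for each business, but my computer cannot process that many reviews locally. Is there a way to shrink the dataset without compromising its integrity? For example, could I select at most 50 reviews per business?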

What you are looking for is a representative sample without selection bias. There are several ways to select your sample. Check this link for some ideas: https://humansofdata.atlan.com/2017/07/6-sampling-techniques-choose-representative-subset/
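As one possible illustration of the "at most 50 reviews per business" idea from the question, here is a minimal sketch that draws a uniform random sample within each business using reservoir sampling. This is my own choice of technique, not necessarily one the linked article covers. The function name `sample_reviews_per_business`, the `(business_id, review_text)` pair layout, and the cap of 50 are all assumptions made for the example. Because it reads the reviews one at a time, it should not need the full dataset in memory.

```python
import random
from collections import defaultdict


def sample_reviews_per_business(reviews, max_per_business=50, seed=0):
    """Keep a uniform random sample of at most `max_per_business` reviews per business.

    `reviews` is any iterable of (business_id, review_text) pairs (an assumed
    layout), so it can be a generator that reads a file line by line.
    """
    rng = random.Random(seed)      # fixed seed so the sample is reproducible
    kept = defaultdict(list)       # business_id -> sampled reviews
    seen = defaultdict(int)        # business_id -> reviews seen so far
    for business_id, text in reviews:
        seen[business_id] += 1
        bucket = kept[business_id]
        if len(bucket) < max_per_business:
            # Still under the cap: keep every review.
            bucket.append(text)
        else:
            # Reservoir sampling (Algorithm R): the new review replaces a
            # random kept one with probability max_per_business / seen.
            j = rng.randrange(seen[business_id])
            if j < max_per_business:
                bucket[j] = text
    return dict(kept)


# Toy data: one business with 120 reviews, one with a single review.
reviews = [("b1", f"review {i}") for i in range(120)] + [("b2", "only review")]
sample = sample_reviews_per_business(reviews, max_per_business=50)
print(len(sample["b1"]), len(sample["b2"]))
```

Businesses with fewer reviews than the cap keep all of them, and businesses with no reviews simply do not appear in the result. Picking reviews at random within each business, rather than taking the first 50 in file order, is what avoids selection bias here. Very rare words may still drop out of a capped sample, so the word counts are an approximation.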
