Why could I not rehydrate more than 18 tweets out of 24000 tweet IDs using twarc or the Hydrator app? Does anyone know a better way?

Question

I have a question about rehydrating tweet text. Any help would be appreciated.

This is my data source; it contains tweets about corona:

source of data set

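From it, I downloaded the dataset shown in the photo (named 01-feb-2020).

Then I filtered this data to show only the tweets from 'GB', which left almost 24000 tweets.

I used twarc to hydrate the tweet text as follows:

First, install twarc using pip.

Then, on the command line, enter: twarc configure

Then, enter the key and the secret key.

Then, run the command: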

twarc hydrate id.txt > tweet_hydrated.jsonl

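However, I got only 18 tweet texts back from the 24000 tweet IDs.

I also tried the Hydrator app, but the result was the same. What am I doing wrong? Is it plausible to get only 18 tweets out of such a large amount of data? I would appreciate any new suggestions about hydrating tweet text. (Sorry for my poor English; I am not a native speaker.)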

Answer 1

The method I used to collect the tweet IDs (copy-pasting) was incorrect. Once I wrote proper code to save the tweet IDs to a text file, the problem was solved.
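The answer does not show the code that was used, so the following is only a minimal sketch of one way to do it. It assumes the downloaded data is a JSON Lines file in which each record has a `tweet_id` field and a `country_code` field; both field names, the function names, and the file names are hypothetical and would need to be adapted to the real dataset. The point is that the IDs go straight from JSON to a text file as strings, without passing through a spreadsheet.

```python
import json


def extract_tweet_ids(lines, country_code=None):
    """Yield tweet IDs as strings from an iterable of JSON Lines records.

    The field names "tweet_id" and "country_code" are assumptions made
    for this sketch; they are not taken from the original dataset.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if country_code is not None and record.get("country_code") != country_code:
            continue  # keep only the requested country
        # Python's json module parses integers exactly, so converting the
        # value with str() keeps every digit of the ID.
        yield str(record["tweet_id"])


def write_ids(in_path, out_path, country_code="GB"):
    """Write one tweet ID per line to out_path and return how many were written."""
    count = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for tweet_id in extract_tweet_ids(src, country_code):
            dst.write(tweet_id + "\n")
            count += 1
    return count
```

Calling something like `write_ids("01-feb-2020.jsonl", "id.txt")` (the input file name is a guess) would produce an `id.txt` that can be passed to `twarc hydrate` as in the question.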

Andy Piper also pointed out the same thing in the comments section, so I have copied his comments here.

How are you getting from JSON format downloaded, into a CSV format? I'm wondering whether the Tweet ID values are valid. – Andy Piper 5 hours ago

I've managed to reproduce this now, and I believe that in the process of converting your JSON input to CSV / Excel to a list of Tweet IDs to hydrate, you are probably using JavaScript (?) and the Tweet IDs are losing their accuracy. The clue was when I noticed all of the Tweet IDs ending in 0000 in my Excel column. You'll need to use a more precise method of getting the Tweet IDs into twarc
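The loss of accuracy described in the comment can be illustrated with a short sketch. Tweet IDs are 64-bit integers, and tools that store numbers as double-precision floats (JavaScript numbers, for example) cannot represent every integer above 2**53 exactly. The helper names and the example ID below are made up for illustration, and the "ends in 0000" check is only a heuristic based on the clue mentioned in the comment.

```python
def through_double(tweet_id: int) -> int:
    """Round-trip an ID through a double-precision float, as a
    float-based tool would do when it treats the ID as a number."""
    return int(float(tweet_id))


def share_ending_in_zeros(ids):
    """Return the fraction of ID strings that end in '0000'.

    A value close to 1.0 suggests the IDs were rounded somewhere on
    the way into the file (heuristic, not a proof).
    """
    ids = [i.strip() for i in ids if i.strip()]
    if not ids:
        return 0.0
    return sum(i.endswith("0000") for i in ids) / len(ids)


example_id = 1223395535882768385  # made-up 19-digit ID for illustration

# The round trip changes the value, so the result no longer matches
# the original ID.
print(example_id == through_double(example_id))
```

Running `share_ending_in_zeros(open("id.txt"))` on the ID file from the question would be a quick way to check whether the IDs were damaged before hydration.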

Answer 2

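I've managed to reproduce this now, and I believe that in the process of converting your JSON input to CSV / Excel and then to a list of tweet IDs to hydrate, you are probably using JavaScript (?) and the tweet IDs are losing their accuracy. The clue was that all of the tweet IDs in my Excel column ended in 0000. You'll need to use a more precise method of getting the tweet IDs into twarc.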

python

twitter

hydration