Scraping data PRAW - How can I improve my code?

Question

I have this code:

posts = []

subs = list(set(['Futurology', 'wallstreetbets', 'DataIsBeautiful','RenewableEnergy', 'Bitcoin', 'Android', 'programming',
'gaming','tech', 'google','hardware', 'oculus', 'software', 'startups', 'linus', 'microsoft',  'AskTechnology', 'realtech', 
'homeautomation', 'HomeKit','singularity', 'technews','Entrepreneur', 'investing', 'BusinessHub', 'CareerSuccess', 
'growmybusiness','venturecapital', 'ladybusiness', 'productivity', 'NFT', 'CryptoCurrency']))

targeted_date = '01-09-19 12:00:00'
targeted_date = datetime.datetime.strptime(targeted_date, '%d-%m-%y %H:%M:%S')


for sub_name in subs:
    for submission in reddit.subreddit(sub_name).hot(limit = 1):
        date = submission.created
        date = datetime.datetime.fromtimestamp(date)
        if date >= targeted_date and reddit.subreddit(sub_name).subscribers >= 35000:
            posts.append([date, submission.subreddit, reddit.subreddit(sub_name).subscribers, 
                      submission.title, submission.selftext])
        
df = pd.DataFrame(posts, columns = ['date', 'subreddit','subscribers', 'title', 'text'])
df

Running time with limit = 16 (~500 rows): 905.9099962711334 s

This gives me this result:

date    subreddit   subscribers title   text
0   2021-11-08 09:18:22 Bitcoin 3546142 Please upgrade your node to enable Taproot. 
1   2021-09-19 17:01:03 homeautomation  1333753 Looking for developers interested in helping t...   A while back I opened sourced all of my source...
2   2021-11-11 11:00:17 Entrepreneur    1036934 Thank you Thursday! - November 11, 2021 **Your opportunity to thank the** /r/Entrepren...
3   2021-11-08 01:36:05 oculus  396752  [Weekly] What VR games have you been enjoying ...   Welcome to the weekly recommendation thread! :...
4   2021-06-17 19:25:01 microsoft   141810  Microsoft: Official Support Thread  Microsoft: Official Support Thread\n\nMicrosof...
5   2021-11-12 11:02:14 investing   1946917 Daily General Discussion and spitballin thread...   Have a general question? Want to offer some c...
6   2021-11-12 04:16:13 tech    413040  Mars rover scrapes at rock to 'look at somethi...   
7   2021-11-12 12:00:15 wallstreetbets  11143628    Daily Discussion Thread for November 12, 2021   Your daily trading discussion thread. Please k...
8   2021-04-17 14:50:02 singularity 134940  Re: The Discord Link Expired, so here's a new ...   
9   2021-11-12 11:40:04 programming 3682438 It's probably time to stop recommending Clean ...   
10  2021-09-10 10:26:07 software    149655  What I do/install on every Windows PC - Softwa...   Hello, I have to spend a lot of time finding s...
11  2021-11-12 13:00:18 Android 2315799 Daily Superthread (Nov 12 2021) - Your daily t...   Note 1. Check [MoronicMondayAndroid](https://o...
12  2021-11-11 23:32:33 CryptoCurrency  3871810 Live Recording: Kevin O’Leary Talks About Cryp...   
13  2021-11-02 20:53:21 productivity    874076  Self-promotion/shout out thread This is the place to share your personal blogs...
14  2021-11-12 14:57:19 RenewableEnergy 97364   Northvolt produces first fully recycled batter...   
15  2021-11-12 08:00:16 gaming  30936297    Free Talk Friday!   Use this post to discuss life, post memes, or ...
16  2021-11-01 05:01:23 startups    884574  Share Your Startup - November 2021 - Upvote Th...   [r/startups](https://www.reddit.com/r/startups...
17  2021-11-01 09:00:11 HomeKit 107076  Monthly Buying Megathread - Ask which accessor...   Looking for lights, a thermostat, a plug, or a...
18  2021-11-01 13:00:13 dataisbeautiful 16467198    [Topic][Open] Open Discussion Thread — Anybody...   Anybody can post a question related to data vi...
19  2021-11-12 12:29:47 technews    339611  Peter Jackson sells visual effects firm for ...   
20  2021-10-07 19:15:14 NFT 221897  Join our official —and the #1 NFT— Discord Ser...   
21  2020-12-01 12:11:36 google  1622449 Monthly Discussion and Support Thread - Decemb...   Have a question you need answered? A new Googl...

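The problem is that it takes too much time. As you can see, I set limit = 1 and it still takes about 1 minute to run. Yesterday I set the limit to 300 to analyze the data, and it ran for about 2 hours.

My question: is there a way to reorganize the code to reduce the running time?

The code below used to run much faster, but I wanted a column with the subscriber count and had to add a second for loop: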

posts = []
subs = reddit.subreddit('Futurology+wallstreetbets+DataIsBeautiful+RenewableEnergy+Bitcoin+Android+programming+gaming+tech+google+hardware+oculus+software+startups+linus+microsoft+AskTechnology+realtech+homeautomation+HomeKit+singularity+technews+Entrepreneur+investing+BusinessHub+CareerSuccess+growmybusiness+venturecapital+ladybusiness+productivity+NFT+CryptoCurrency')
targeted_date = '01-09-19 12:00:00'
targeted_date = datetime.datetime.strptime(targeted_date, '%d-%m-%y %H:%M:%S')   

for subreddit in subs.new(limit = 500):
    date = subreddit.created
    date = datetime.datetime.fromtimestamp(date)
    posts.append([date, subreddit.subreddit, subreddit.title, subreddit.selftext])

df = pd.DataFrame(posts, columns = ['date', 'subreddit', 'title', 'text'])
df

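Running time with limit = 500 (500 rows): 7.630232095718384 s

I know the two versions don't do exactly the same thing, but the only reason I tried the new code was to add the new column 'subscribers', which seems to behave differently from the other calls.

Any suggestions or improvements?

One last thing: does anyone know a way to retrieve a list of all subreddits for a specific topic (such as technology)? I found this page that lists subreddits: https://www.reddit.com/r/ListOfSubreddits/wiki/listofsubreddits/#wiki_technology

Thanks :)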

Answer 1

Here is your existing code, improved by reducing conversions and server calls (explanation at the end):

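import datetime
import pandas as pd

# `reddit` is assumed to be an already-authenticated praw.Reddit instance
posts = []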

subs = list(set(['Futurology', 'wallstreetbets', 'DataIsBeautiful','RenewableEnergy', 'Bitcoin', 'Android', 'programming',
'gaming','tech', 'google','hardware', 'oculus', 'software', 'startups', 'linus', 'microsoft',  'AskTechnology', 'realtech', 
'homeautomation', 'HomeKit','singularity', 'technews','Entrepreneur', 'investing', 'BusinessHub', 'CareerSuccess', 
'growmybusiness','venturecapital', 'ladybusiness', 'productivity', 'NFT', 'CryptoCurrency']))

# convert target date into epoch format
targeted_date = '01-09-19 12:00:00'
targeted_date = datetime.datetime.strptime(targeted_date, '%d-%m-%y %H:%M:%S').timestamp()


for sub_name in subs:
    subreddit = reddit.subreddit(sub_name)
    subscriber_number = subreddit.subscribers  # fetched once per subreddit, reused below
    if subscriber_number < 35000:  # too small: skip before requesting any posts
        continue

    for submission in subreddit.hot(limit = 1):
        date = submission.created  # Reddit timestamps are epoch seconds
        if date >= targeted_date:
            posts.append([date, submission.subreddit, subscriber_number,
                      submission.title, submission.selftext])

df = pd.DataFrame(posts, columns = ['date', 'subreddit','subscribers', 'title', 'text'])
df

By splitting the logical AND into two separate checks, you can skip the loops that would have evaluated to false anyway.
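A small offline sketch of that guard, for illustration only. `fetch_hot_posts` and the `SUBSCRIBERS` table (including the made-up "tinysub") are hypothetical stand-ins for the PRAW calls, not part of PRAW; the point is just that the check runs before the expensive request.

```python
# Offline sketch: made-up subscriber counts and a stub in place of network calls.
SUBSCRIBERS = {"Bitcoin": 3546142, "tinysub": 1200, "oculus": 396752}
fetch_calls = []  # records which subreddits we actually "requested" posts for


def fetch_hot_posts(sub_name):
    """Hypothetical stand-in for reddit.subreddit(sub_name).hot(limit=1)."""
    fetch_calls.append(sub_name)
    return [f"top post of r/{sub_name}"]


def collect(sub_names, min_subscribers=35000):
    posts = []
    for sub_name in sub_names:
        # Guard first: small subreddits never reach the expensive call below.
        if SUBSCRIBERS[sub_name] < min_subscribers:
            continue
        for title in fetch_hot_posts(sub_name):
            posts.append((sub_name, title))
    return posts


result = collect(["Bitcoin", "tinysub", "oculus"])
print(result)
print(fetch_calls)  # "tinysub" is absent: its posts were never requested
```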

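Instead of converting every date to a human-readable date inside the for loop, the target date is converted once into the epoch format that Reddit uses. That removes a conversion from each iteration and leaves only a simple numeric comparison.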
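The conversion can be tried on its own, without touching Reddit. In this sketch the epoch values are made up, and treating the cutoff as UTC (to pair with `submission.created_utc`) is an assumption; the code above uses local time and `submission.created` instead.

```python
import datetime

# Parse the cutoff once, outside any loop. Treating it as UTC is an assumption
# made here so the result does not depend on the machine's local time zone.
cutoff = datetime.datetime.strptime("01-09-19 12:00:00", "%d-%m-%y %H:%M:%S")
cutoff_ts = cutoff.replace(tzinfo=datetime.timezone.utc).timestamp()

# Made-up epoch values standing in for submission.created_utc.
created_values = [1636363102.0, 1500000000.0, 1567339200.0]

# Inside the loop only floats are compared; no datetime objects are built.
kept = [ts for ts in created_values if ts >= cutoff_ts]

# Convert back to readable dates once, and only for the rows that were kept.
readable = [
    datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc).isoformat()
    for ts in kept
]
print(cutoff_ts)
print(readable)
```

If you still want readable dates in the DataFrame, converting the kept rows afterwards like this keeps the conversion out of the hot loop.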

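By storing the subscriber count in a variable, you reduce the number of calls made to retrieve that information; the number is looked up in memory instead.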
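The same idea could be combined with the faster combined-subreddit listing from the question. The following is only a sketch under assumptions: `collect_posts` and its parameters are names invented here, `reddit` is assumed to be an authenticated `praw.Reddit` instance, and the code has not been timed against the real API.

```python
def collect_posts(reddit, sub_names, limit, min_subscribers, cutoff_ts):
    """Read one combined listing and look up each subscriber count only once.

    `reddit` is assumed to be an authenticated praw.Reddit instance.
    """
    subscribers = {}  # cache: lowercase subreddit name -> subscriber count
    rows = []
    combined = reddit.subreddit("+".join(sub_names))
    for submission in combined.new(limit=limit):
        name = submission.subreddit.display_name
        key = name.lower()
        if key not in subscribers:
            # The only extra request: happens once per distinct subreddit.
            subscribers[key] = reddit.subreddit(name).subscribers
        if subscribers[key] < min_subscribers:
            continue
        if submission.created_utc < cutoff_ts:
            continue
        rows.append([submission.created_utc, name, subscribers[key],
                     submission.title, submission.selftext])
    return rows
```

The returned rows can be passed to `pd.DataFrame(rows, columns=['date', 'subreddit', 'subscribers', 'title', 'text'])`. With the dictionary in place, the subscriber lookup should cost roughly one extra request per distinct subreddit rather than one per post.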


Tags: python, performance, coding-efficiency, web-scraping, praw