Scraping data PRAW - How can I improve my code?
I have this code:
posts = []
subs = list(set(['Futurology', 'wallstreetbets', 'DataIsBeautiful','RenewableEnergy', 'Bitcoin', 'Android', 'programming',
                 'gaming','tech', 'google','hardware', 'oculus', 'software', 'startups', 'linus', 'microsoft', 'AskTechnology', 'realtech',
                 'homeautomation', 'HomeKit','singularity', 'technews','Entrepreneur', 'investing', 'BusinessHub', 'CareerSuccess',
                 'growmybusiness','venturecapital', 'ladybusiness', 'productivity', 'NFT', 'CryptoCurrency']))
targeted_date = '01-09-19 12:00:00'
targeted_date = datetime.datetime.strptime(targeted_date, '%d-%m-%y %H:%M:%S')
for sub_name in subs:
    for submission in reddit.subreddit(sub_name).hot(limit = 1):
        date = submission.created
        date = datetime.datetime.fromtimestamp(date)
        if date >= targeted_date and reddit.subreddit(sub_name).subscribers >= 35000:
            posts.append([date, submission.subreddit, reddit.subreddit(sub_name).subscribers,
                          submission.title, submission.selftext])
df = pd.DataFrame(posts, columns = ['date', 'subreddit','subscribers', 'title', 'text'])
df
Run time with limit = 16 (~500 rows): 905.9099962711334 s
This gives me the following result:
   | date | subreddit | subscribers | title | text
0  | 2021-11-08 09:18:22 | Bitcoin | 3546142 | Please upgrade your node to enable Taproot. |
1  | 2021-09-19 17:01:03 | homeautomation | 1333753 | Looking for developers interested in helping t... | A while back I opened sourced all of my source...
2  | 2021-11-11 11:00:17 | Entrepreneur | 1036934 | Thank you Thursday! - November 11, 2021 | **Your opportunity to thank the** /r/Entrepren...
3  | 2021-11-08 01:36:05 | oculus | 396752 | [Weekly] What VR games have you been enjoying ... | Welcome to the weekly recommendation thread! :...
4  | 2021-06-17 19:25:01 | microsoft | 141810 | Microsoft: Official Support Thread | Microsoft: Official Support Thread\n\nMicrosof...
5  | 2021-11-12 11:02:14 | investing | 1946917 | Daily General Discussion and spitballin thread... | Have a general question? Want to offer some c...
6  | 2021-11-12 04:16:13 | tech | 413040 | Mars rover scrapes at rock to 'look at somethi... |
7  | 2021-11-12 12:00:15 | wallstreetbets | 11143628 | Daily Discussion Thread for November 12, 2021 | Your daily trading discussion thread. Please k...
8  | 2021-04-17 14:50:02 | singularity | 134940 | Re: The Discord Link Expired, so here's a new ... |
9  | 2021-11-12 11:40:04 | programming | 3682438 | It's probably time to stop recommending Clean ... |
10 | 2021-09-10 10:26:07 | software | 149655 | What I do/install on every Windows PC - Softwa... | Hello, I have to spend a lot of time finding s...
11 | 2021-11-12 13:00:18 | Android | 2315799 | Daily Superthread (Nov 12 2021) - Your daily t... | Note 1. Check [MoronicMondayAndroid](https://o...
12 | 2021-11-11 23:32:33 | CryptoCurrency | 3871810 | Live Recording: Kevin O’Leary Talks About Cryp... |
13 | 2021-11-02 20:53:21 | productivity | 874076 | Self-promotion/shout out thread | This is the place to share your personal blogs...
14 | 2021-11-12 14:57:19 | RenewableEnergy | 97364 | Northvolt produces first fully recycled batter... |
15 | 2021-11-12 08:00:16 | gaming | 30936297 | Free Talk Friday! | Use this post to discuss life, post memes, or ...
16 | 2021-11-01 05:01:23 | startups | 884574 | Share Your Startup - November 2021 - Upvote Th... | [r/startups](https://www.reddit.com/r/startups...
17 | 2021-11-01 09:00:11 | HomeKit | 107076 | Monthly Buying Megathread - Ask which accessor... | Looking for lights, a thermostat, a plug, or a...
18 | 2021-11-01 13:00:13 | dataisbeautiful | 16467198 | [Topic][Open] Open Discussion Thread — Anybody... | Anybody can post a question related to data vi...
19 | 2021-11-12 12:29:47 | technews | 339611 | Peter Jackson sells visual effects firm for ... |
20 | 2021-10-07 19:15:14 | NFT | 221897 | Join our official —and the #1 NFT— Discord Ser... |
21 | 2020-12-01 12:11:36 | google | 1622449 | Monthly Discussion and Support Thread - Decemb... | Have a question you need answered? A new Googl...
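The problem is that it takes far too long. As you can see, with limit = 1 it takes about 1 minute to run. Yesterday I set the limit to 300 in order to analyse the data, and it ran for about 2 hours.
My question: is there a way to reorganize the code to reduce the run time?
The code below used to run much faster, but I wanted a column with the subscriber count and had to add a second for loop: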
posts = []
subs = reddit.subreddit('Futurology+wallstreetbets+DataIsBeautiful+RenewableEnergy+Bitcoin+Android+programming+gaming+tech+google+hardware+oculus+software+startups+linus+microsoft+AskTechnology+realtech+homeautomation+HomeKit+singularity+technews+Entrepreneur+investing+BusinessHub+CareerSuccess+growmybusiness+venturecapital+ladybusiness+productivity+NFT+CryptoCurrency')
targeted_date = '01-09-19 12:00:00'
targeted_date = datetime.datetime.strptime(targeted_date, '%d-%m-%y %H:%M:%S')
for subreddit in subs.new(limit = 500):
    date = subreddit.created
    date = datetime.datetime.fromtimestamp(date)
    posts.append([date, subreddit.subreddit, subreddit.title, subreddit.selftext])
df = pd.DataFrame(posts, columns = ['date', 'subreddit', 'title', 'text'])
df
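Run time with limit = 500 (500 rows): 7.630232095718384 s
I know the two snippets don't do exactly the same thing, but the only reason I tried the new code was to add the new column 'subscribers', which seems to behave differently from the other calls.
Any suggestions or improvements?
One last thing: does anyone know a way to retrieve a list of all subreddits for a specific topic (technology, for example)? I found this page that lists subreddits: https://www.reddit.com/r/ListOfSubreddits/wiki/listofsubreddits/#wiki_technology
Thanks :)
An improvement to your existing code that reduces conversions and server calls (explanation at the end):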
import datetime
import pandas as pd

# `reddit` is assumed to be an already authenticated praw.Reddit instance
posts = []
subs = list(set(['Futurology', 'wallstreetbets', 'DataIsBeautiful','RenewableEnergy', 'Bitcoin', 'Android', 'programming',
                 'gaming','tech', 'google','hardware', 'oculus', 'software', 'startups', 'linus', 'microsoft', 'AskTechnology', 'realtech',
                 'homeautomation', 'HomeKit','singularity', 'technews','Entrepreneur', 'investing', 'BusinessHub', 'CareerSuccess',
                 'growmybusiness','venturecapital', 'ladybusiness', 'productivity', 'NFT', 'CryptoCurrency']))
# convert the target date into epoch format once, outside the loop
targeted_date = '01-09-19 12:00:00'
targeted_date = datetime.datetime.strptime(targeted_date, '%d-%m-%y %H:%M:%S').timestamp()
for sub_name in subs:
    subreddit = reddit.subreddit(sub_name)
    subscriber_number = subreddit.subscribers  # looked up once per subreddit and reused below
    if subscriber_number < 35000:  # the original condition would have been false, so skip fetching posts
        continue
    for submission in subreddit.hot(limit = 1):
        date = submission.created  # Reddit uses epoch timestamps
        if date >= targeted_date:
            posts.append([date, submission.subreddit, subscriber_number,
                          submission.title, submission.selftext])
df = pd.DataFrame(posts, columns = ['date', 'subreddit','subscribers', 'title', 'text'])
df
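By splitting the logical AND into two separate checks, you can skip the loops that would have evaluated to false anyway: a subreddit with too few subscribers never has its posts fetched.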
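Instead of converting every date to a human-readable date inside the for loop, convert the target date once into the format Reddit uses (epoch seconds). This removes the conversion from the loop and leaves only a comparison between two numbers.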
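A minimal sketch of that idea, using only the standard library. The helper names `to_epoch` and `to_readable` are made up for illustration, the sample timestamps are invented stand-ins for `submission.created_utc`, and treating the target date as UTC is an assumption (the code above uses local time instead):

```python
import datetime

def to_epoch(date_string, fmt='%d-%m-%y %H:%M:%S'):
    """Parse a date string, interpret it as UTC, and return epoch seconds."""
    parsed = datetime.datetime.strptime(date_string, fmt)
    return parsed.replace(tzinfo=datetime.timezone.utc).timestamp()

def to_readable(epoch_seconds):
    """Format epoch seconds as a UTC date string, for display only."""
    moment = datetime.datetime.fromtimestamp(epoch_seconds, tz=datetime.timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')

# Convert the target date once, before any loop.
targeted_date = to_epoch('01-09-19 12:00:00')

# Invented epoch values standing in for submission.created_utc.
created = [1636363102.0, 1546300800.0]

# Inside the loop only two floats are compared; nothing is parsed or formatted.
recent = [ts for ts in created if ts >= targeted_date]
```

If a readable date is still wanted in the final table, it can be produced once at the end, for example with `to_readable` on the rows that were kept.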
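By storing the subscriber count in a variable, you cut down the number of calls made to retrieve that information and read the number from memory instead.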
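The same caching idea can possibly be combined with the faster combined-subreddit listing from the question. The following is a sketch, not a tested drop-in: `collect_posts` is a hypothetical helper, `reddit` is assumed to be an authenticated `praw.Reddit` instance, and it uses `.new()` on the combined subreddit as in the question's second snippet, so it does not return the same posts as the per-subreddit `.hot()` loop.

```python
def collect_posts(reddit, sub_names, targeted_ts, min_subscribers=35000, limit=500):
    """Return rows of [created, subreddit, subscribers, title, text].

    Subscriber counts are looked up once per subreddit and cached in a dict,
    then all qualifying subreddits are read through one combined listing.
    """
    # One lookup per distinct subreddit; keys are lowercased because
    # subreddit names are case-insensitive.
    subscribers = {}
    for name in set(sub_names):
        count = reddit.subreddit(name).subscribers
        if count >= min_subscribers:
            subscribers[name.lower()] = count
    if not subscribers:
        return []

    # 'a+b+c' addresses several subreddits at once, as in the question.
    combined = reddit.subreddit('+'.join(sorted(subscribers)))
    posts = []
    for submission in combined.new(limit=limit):
        if submission.created_utc < targeted_ts:
            continue  # older than the target date
        name = submission.subreddit.display_name
        posts.append([submission.created_utc, name, subscribers[name.lower()],
                      submission.title, submission.selftext])
    return posts
```

With a real client, the result could be passed to `pd.DataFrame(rows, columns=['date', 'subreddit', 'subscribers', 'title', 'text'])`, as in the code above.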