Pandas:一次迭代一行以自动执行 google 搜索?
Pandas: Iterate on a column one row at a time to automate a google search?
我正在尝试在 csv 中的特定列(通过 python 2.7);但是,我无法让 Pandas 将行内容读取到 Google 搜索自动程序。
*Google搜索源=https://breakingcode.wordpress.com/2010/06/29/google-search-python/
总的来说,当我使用以下代码时,我可以为查询成功打印 Urls:
from google import search
query = "apples"
for url in search(query, stop=5, pause=2.0):
print(url)
但是,当我添加 Pandas(读取每个 "query")时,行未读取 -> 按预期查询。 即正在查询 "data.irow(n)" 而不是行内容,一次一个。
from google import search
import pandas as pd
from pandas import DataFrame
query_performed = 0
querying = True
query = 'data.irow(n)'
#read the excel file at column 2 (i.e. "Fruit")
df = pd.read_csv('C:\Users\Desktop\query_results.csv', header=0, sep=',', index_col= 'Fruit')
# need to specify "Column2" and one "data.irow(n)" queried at a time
while querying:
if query_performed <= 100:
print("query")
query_performed +=1
else:
querying = False
print("Asked all 100 query's")
#prints initial urls for each "query" in a google search
for url in search(query, stop=5, pause=2.0):
print(url)
我在命令行收到的输出不正确:
query
Asked all 100 query's
query
Asked all 100 query's
Asked all 100 query's
http://www.irondata.com/
http://www.irondata.com/careers
http://transportation.irondata.com/
http://www.irondata.com/about
http://www.irondata.com/public-sector/regulatory/products/versa
http://www.irondata.com/contact-us
http://www.irondata.com/public-sector/regulatory/products/cavu
https://www.linkedin.com/company/iron-data-solutions
http://www.glassdoor.com/Reviews/Iron-Data-Reviews-E332311.htm
https://www.facebook.com/IronData
http://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=35267805
http://www.indeed.com/cmp/Iron-Data
http://www.ironmountain.com/Services/Data-Centers.aspx
仅供参考:我的 Excel .CSV 格式如下:
B
1 **Fruit**
2 apples
2 oranges
4 mangos
5 mangos
6 mangos
...
101 mangos
非常感谢任何有关后续步骤的建议!提前致谢!
这是我得到的。就像我在评论中提到的那样,我无法让停止参数像我认为的那样工作。也许我误解了它的使用方式。我假设您每次搜索只需要前 5 个网址。
样本 df
d = {"B" : ["mangos", "oranges", "apples"]}
df = pd.DataFrame(d)
然后
stop = 5
urlcols = ["C","D","E","F","G"]
# Here i'm using an apply() to call the google search for each 'row'
# and a list is built for the urls return by search()
df[urlcols] = df["B"].apply(lambda fruit : pd.Series([url for url in
search(fruit, stop=stop, pause=2.0)][:stop])) #get 5 by slicing
这给了你。格式有点粗糙
B C D E F G
0 mangos http://en.wikipedia.org/wiki/Mango http://en.wikipedia.org/wiki/Mango_(disambigua... http://en.wikipedia.org/wiki/Mangifera http://en.wikipedia.org/wiki/Mangifera_indica http://en.wikipedia.org/wiki/Purple_mangosteen
1 oranges http://en.wikipedia.org/wiki/Orange_(fruit) http://en.wikipedia.org/wiki/Bitter_orange http://en.wikipedia.org/wiki/Valencia_orange http://en.wikipedia.org/wiki/Rutaceae http://en.wikipedia.org/wiki/Cherry_Orange
2 apples https://www.apple.com/ http://desmoines.citysearch.com/review/692986920 http://local.yahoo.com/info-28919583-apple-sto... http://www.judysbook.com/Apple-Store-BtoB~Cell... https://tr.foursquare.com/v/apple-store/4b466b...
如果您不想指定列(即 ["C",D"..]),您可以执行以下操作。
df.join(df["B"].apply(lambda fruit : pd.Series([url for url in
search(fruit, stop=stop, pause=2.0)][:stop])))
我正在尝试在 csv 中的特定列(通过 python 2.7);但是,我无法让 Pandas 将行内容读取到 Google 搜索自动程序。
*Google搜索源=https://breakingcode.wordpress.com/2010/06/29/google-search-python/
总的来说,当我使用以下代码时,我可以为查询成功打印 Urls:
from google import search
query = "apples"
for url in search(query, stop=5, pause=2.0):
print(url)
但是,当我添加 Pandas(读取每个 "query")时,行未读取 -> 按预期查询。 即正在查询 "data.irow(n)" 而不是行内容,一次一个。
from google import search
import pandas as pd
from pandas import DataFrame
query_performed = 0
querying = True
query = 'data.irow(n)'
#read the excel file at column 2 (i.e. "Fruit")
df = pd.read_csv('C:\Users\Desktop\query_results.csv', header=0, sep=',', index_col= 'Fruit')
# need to specify "Column2" and one "data.irow(n)" queried at a time
while querying:
if query_performed <= 100:
print("query")
query_performed +=1
else:
querying = False
print("Asked all 100 query's")
#prints initial urls for each "query" in a google search
for url in search(query, stop=5, pause=2.0):
print(url)
我在命令行收到的输出不正确:
query
Asked all 100 query's
query
Asked all 100 query's
Asked all 100 query's
http://www.irondata.com/
http://www.irondata.com/careers
http://transportation.irondata.com/
http://www.irondata.com/about
http://www.irondata.com/public-sector/regulatory/products/versa
http://www.irondata.com/contact-us
http://www.irondata.com/public-sector/regulatory/products/cavu
https://www.linkedin.com/company/iron-data-solutions
http://www.glassdoor.com/Reviews/Iron-Data-Reviews-E332311.htm
https://www.facebook.com/IronData
http://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=35267805
http://www.indeed.com/cmp/Iron-Data
http://www.ironmountain.com/Services/Data-Centers.aspx
仅供参考:我的 Excel .CSV 格式如下:
B
1 **Fruit**
2 apples
2 oranges
4 mangos
5 mangos
6 mangos
...
101 mangos
非常感谢任何有关后续步骤的建议!提前致谢!
这是我得到的。就像我在评论中提到的那样,我无法让停止参数像我认为的那样工作。也许我误解了它的使用方式。我假设您每次搜索只需要前 5 个网址。
样本 df
d = {"B" : ["mangos", "oranges", "apples"]}
df = pd.DataFrame(d)
然后
stop = 5
urlcols = ["C","D","E","F","G"]
# Here i'm using an apply() to call the google search for each 'row'
# and a list is built for the urls return by search()
df[urlcols] = df["B"].apply(lambda fruit : pd.Series([url for url in
search(fruit, stop=stop, pause=2.0)][:stop])) #get 5 by slicing
这给了你。格式有点粗糙
B C D E F G
0 mangos http://en.wikipedia.org/wiki/Mango http://en.wikipedia.org/wiki/Mango_(disambigua... http://en.wikipedia.org/wiki/Mangifera http://en.wikipedia.org/wiki/Mangifera_indica http://en.wikipedia.org/wiki/Purple_mangosteen
1 oranges http://en.wikipedia.org/wiki/Orange_(fruit) http://en.wikipedia.org/wiki/Bitter_orange http://en.wikipedia.org/wiki/Valencia_orange http://en.wikipedia.org/wiki/Rutaceae http://en.wikipedia.org/wiki/Cherry_Orange
2 apples https://www.apple.com/ http://desmoines.citysearch.com/review/692986920 http://local.yahoo.com/info-28919583-apple-sto... http://www.judysbook.com/Apple-Store-BtoB~Cell... https://tr.foursquare.com/v/apple-store/4b466b...
如果您不想指定列(即 ["C",D"..]),您可以执行以下操作。
df.join(df["B"].apply(lambda fruit : pd.Series([url for url in
search(fruit, stop=stop, pause=2.0)][:stop])))