Pandas: Iterate over a column one row at a time to automate a Google search?

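I'm trying to iterate over a specific column in a CSV file (with Python 2.7) and run a Google search for each row; however, I can't get Pandas to pass the row contents to the Google search automation.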

*Google search source: https://breakingcode.wordpress.com/2010/06/29/google-search-python/

In general, when I use the following code, I can successfully print the URLs for a query:

from google import search

query = "apples"
for url in search(query, stop=5, pause=2.0):
    print(url)

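However, when I add Pandas (to read each "query"), the rows are not read and queried as expected. That is, the literal string "data.irow(n)" is being searched instead of the row contents, one at a time.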

from google import search
import pandas as pd
from pandas import DataFrame

query_performed = 0
querying = True
query = 'data.irow(n)'

#read the excel file at column 2 (i.e. "Fruit")
df = pd.read_csv('C:\Users\Desktop\query_results.csv', header=0, sep=',', index_col= 'Fruit')

# need to specify "Column2" and one "data.irow(n)" queried at a time
while querying: 
    if query_performed <= 100:
        print("query") 
        query_performed +=1
    else:
        querying =  False
    print("Asked all 100 query's")


#prints initial urls for each "query" in a google search
for url in search(query, stop=5, pause=2.0):
    print(url)

The output I get on the command line is incorrect:

query
Asked all 100 query's
query
Asked all 100 query's
Asked all 100 query's
http://www.irondata.com/
http://www.irondata.com/careers
http://transportation.irondata.com/
http://www.irondata.com/about
http://www.irondata.com/public-sector/regulatory/products/versa
http://www.irondata.com/contact-us
http://www.irondata.com/public-sector/regulatory/products/cavu
https://www.linkedin.com/company/iron-data-solutions
http://www.glassdoor.com/Reviews/Iron-Data-Reviews-E332311.htm
https://www.facebook.com/IronData
http://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=35267805
http://www.indeed.com/cmp/Iron-Data
http://www.ironmountain.com/Services/Data-Centers.aspx

FYI: my Excel .CSV is formatted as follows:

     B
1   **Fruit**
2   apples
3   oranges
4   mangos
5   mangos
6   mangos
...
101 mangos

Any advice on next steps is greatly appreciated! Thanks in advance!

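Here's what I came up with. As I mentioned in the comments, I couldn't get the stop parameter to work the way I thought it would. Maybe I'm misunderstanding how it's meant to be used. I'm assuming you only want the first 5 URLs per search.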

Sample df

import pandas as pd
from google import search
d = {"B": ["mangos", "oranges", "apples"]}
df = pd.DataFrame(d)

Then

stop = 5 
urlcols = ["C","D","E","F","G"]
# Here I'm using apply() to call the Google search for each row;
# a list is built from the URLs returned by search()
df[urlcols] = df["B"].apply(lambda fruit: pd.Series([url for url in
              search(fruit, stop=stop, pause=2.0)][:stop]))  # keep 5 by slicing

This gives you the following. The formatting is a bit rough:

    B   C   D   E   F   G
0    mangos  http://en.wikipedia.org/wiki/Mango  http://en.wikipedia.org/wiki/Mango_(disambigua...   http://en.wikipedia.org/wiki/Mangifera  http://en.wikipedia.org/wiki/Mangifera_indica   http://en.wikipedia.org/wiki/Purple_mangosteen
1    oranges     http://en.wikipedia.org/wiki/Orange_(fruit)     http://en.wikipedia.org/wiki/Bitter_orange  http://en.wikipedia.org/wiki/Valencia_orange    http://en.wikipedia.org/wiki/Rutaceae   http://en.wikipedia.org/wiki/Cherry_Orange
2    apples  https://www.apple.com/  http://desmoines.citysearch.com/review/692986920    http://local.yahoo.com/info-28919583-apple-sto...   http://www.judysbook.com/Apple-Store-BtoB~Cell...   https://tr.foursquare.com/v/apple-store/4b466b...
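
To see how the apply() call expands into one column per URL without actually hitting Google, here is a minimal sketch. It assumes Python 3 and a recent pandas, and `fake_search` is a hypothetical stand-in I made up for illustration; it is not part of the google module and only yields placeholder URLs.

```python
import pandas as pd


def fake_search(query, stop=5):
    """Hypothetical stand-in for google.search(): yields made-up URLs."""
    # Deliberately yields more than `stop` items, to mimic the behaviour
    # where the stop argument did not limit results as expected.
    for i in range(stop * 2):
        yield "http://example.com/{}/{}".format(query, i)


stop = 5
urlcols = ["C", "D", "E", "F", "G"]
df = pd.DataFrame({"B": ["mangos", "oranges", "apples"]})

# Each query string becomes a Series of `stop` URLs; apply() stacks those
# Series row by row into a DataFrame with one column per URL.
urls = df["B"].apply(lambda fruit: pd.Series(list(fake_search(fruit))[:stop]))

# Assign the URL columns back under the chosen names.
df[urlcols] = urls
print(df.shape)
```

Because the lambda returns a pd.Series rather than a plain list, the result of apply() is a DataFrame, which is what lets it be assigned to several columns at once.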

If you don't want to specify the columns (i.e. ["C", "D", ...]), you can do the following.

df.join(df["B"].apply(lambda fruit : pd.Series([url for url in 
                     search(fruit, stop=stop, pause=2.0)][:stop])))
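
If you'd rather have an explicit loop, closer to the code in the question, something like the following sketch may help. It reads the "Fruit" column and takes the first 5 results per query with itertools.islice instead of relying on the stop argument. The assumptions here are mine: Python 3, an in-memory string standing in for your CSV file, and a hypothetical `fake_search` generator in place of google.search() so that it runs offline.

```python
import io
from itertools import islice

import pandas as pd


def fake_search(query):
    """Hypothetical stand-in for google.search(): an endless URL generator."""
    i = 0
    while True:
        yield "http://example.com/{}/{}".format(query, i)
        i += 1


# Stand-in for the CSV file: a single "Fruit" column with a header row.
csv_text = "Fruit\napples\noranges\nmangos\n"
df = pd.read_csv(io.StringIO(csv_text))

results = {}
for fruit in df["Fruit"]:
    # Pass the row's value (not a string literal) as the query;
    # islice stops pulling from the generator after 5 items.
    results[fruit] = list(islice(fake_search(fruit), 5))

for fruit, urls in results.items():
    print(fruit, len(urls))
```

The key difference from the code in the question is that the query is the value taken from the column on each iteration, not the fixed string 'data.irow(n)'. If the column contains repeats (like the many "mangos" rows), iterating over df["Fruit"].unique() instead should avoid running the same search twice.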