Extracting geo coordinates from a complex nested Twitter JSON, using Python

Question

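I'm reading multiple complex JSON files and trying to extract the geo coordinates. I can't attach the files themselves right now, but I can print the tree here. Each file has hundreds of fields, and some objects are repeated.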

See the file structure in .txt format.

When I read the JSON with Spark in Python, the coordinates show up in the coordinates column, so the data is there.

It is stored in the coordinates column. See the sample.

I'm obviously trying to reduce the number of columns, so I select only some of them.

The last two columns are my geo coordinates. I tried coordinates and geo, as well as coordinates.coordinates and geo.coordinates. Neither option works.

df_tweets = tweets.select(['text', 
                       'user.name', 
                       'user.screen_name', 
                       'user.id', 
                       'user.location',  
                       'place.country', 
                       'place.full_name', 
                       'place.name',
                       'user.followers_count', 
                       'retweet_count',
                       'retweeted',
                       'user.friends_count',
                       'entities.hashtags.text', 
                       'created_at', 
                       'timestamp_ms', 
                       'lang',
                       'coordinates.coordinates', # or just `coordinates`
                       'geo.coordinates' # or just `geo`
                       ])

In the first case, with coordinates and geo, printing the schema gives me the following:

df_tweets.printSchema()

root
 |-- text: string (nullable = true)
 |-- name: string (nullable = true)
 |-- screen_name: string (nullable = true)
 |-- id: long (nullable = true)
 |-- location: string (nullable = true)
 |-- country: string (nullable = true)
 |-- full_name: string (nullable = true)
 |-- name: string (nullable = true)
 |-- followers_count: long (nullable = true)
 |-- retweet_count: long (nullable = true)
 |-- retweeted: boolean (nullable = true)
 |-- friends_count: long (nullable = true)
 |-- text: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- created_at: string (nullable = true)
 |-- timestamp_ms: string (nullable = true)
 |-- lang: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)
 |-- geo: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)

When I use coordinates.coordinates and geo.coordinates, I get

root
 |-- text: string (nullable = true)
 |-- name: string (nullable = true)
 |-- screen_name: string (nullable = true)
 |-- id: long (nullable = true)
 |-- location: string (nullable = true)
 |-- country: string (nullable = true)
 |-- full_name: string (nullable = true)
 |-- name: string (nullable = true)
 |-- followers_count: long (nullable = true)
 |-- retweet_count: long (nullable = true)
 |-- retweeted: boolean (nullable = true)
 |-- friends_count: long (nullable = true)
 |-- text: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- created_at: string (nullable = true)
 |-- timestamp_ms: string (nullable = true)
 |-- lang: string (nullable = true)
 |-- coordinates: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- coordinates: array (nullable = true)
 |    |-- element: double (containsNull = true)

When I print either dataframe in Pandas, neither of them gives me the coordinates; I still get None.

How do I extract the geo coordinates properly?

Answer 1

If I look at my own dataframe of tweet data, I see that the column looks like this

In [44]: df[df.coordinates.notnull()]['coordinates']
Out[44]:
98    {'type': 'Point', 'coordinates': [-122.32111, ...
99    {'type': 'Point', 'coordinates': [-122.32111, ...
Name: coordinates, dtype: object

So each value is a dictionary that has to be parsed

tweets_coords = df[df.coordinates.notnull()]['coordinates'].tolist()

for coords in tweets_coords:
    print(coords)
    print(coords['coordinates'])
    print(coords['coordinates'][0])
    print(coords['coordinates'][1])

Output:

{'type': 'Point', 'coordinates': [-122.32111, 47.62366]}
[-122.32111, 47.62366]
-122.32111
47.62366
{'type': 'Point', 'coordinates': [-122.32111, 47.62362]}
[-122.32111, 47.62362]
-122.32111
47.62362

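You can set up a lambda function in apply() to parse these row by row; otherwise, you can use the list I built above with tolist() as the basis for your analysis.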
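As a rough sketch of the apply() route: the function below is the kind of per-row parser you could hand to apply(). The name point_to_lon_lat is made up for illustration, and it assumes each cell is either empty (None or NaN) or a dict shaped like the ones printed above. It is shown on a plain list of hypothetical cells so that it runs without pandas.

```python
def point_to_lon_lat(coords):
    """Return (lon, lat) from a {'type': 'Point', 'coordinates': [lon, lat]} dict.

    Returns (None, None) for anything that is not such a dict, e.g. the
    None/NaN cells of tweets that carry no coordinates.
    """
    if not isinstance(coords, dict):
        return (None, None)
    pair = coords.get('coordinates')
    if not pair or len(pair) < 2:
        return (None, None)
    return (pair[0], pair[1])


# Hypothetical cells standing in for df['coordinates']; many tweets have none.
cells = [
    None,
    {'type': 'Point', 'coordinates': [-122.32111, 47.62366]},
    float('nan'),
]

# Parse every cell; empty cells become (None, None) instead of raising.
parsed = [point_to_lon_lat(cell) for cell in cells]
print(parsed)
```

With a DataFrame, the same function could be used as df['coordinates'].apply(point_to_lon_lat), which should give a Series of (lon, lat) tuples. Returning (None, None) for empty cells is a design choice here so that rows without coordinates do not raise.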

All that said, maybe check one thing first... where you are using coordinates.coordinates and geo.coordinates, try coordinates['coordinates'] and geo['coordinates'] instead.
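To make that check concrete outside of Spark or pandas, here is a minimal sketch in plain Python. It assumes one tweet per JSON string; the two sample tweets and the helper get_path are made up for illustration and are not taken from your files. Each value is stored under its full path, because the leaf names alone (name, text, coordinates) collide, as the schema you printed shows.

```python
import json

# Made-up sample tweets; real tweets carry many more fields.
# Assumption: `coordinates` is GeoJSON order [lon, lat], `geo` is [lat, lon].
raw_with_geo = '''{
  "text": "hello",
  "user": {"name": "Ann"},
  "place": {"name": "Seattle"},
  "coordinates": {"type": "Point", "coordinates": [-122.32111, 47.62366]},
  "geo": {"type": "Point", "coordinates": [47.62366, -122.32111]}
}'''
raw_without_geo = ('{"text": "bye", "user": {"name": "Bob"}, '
                   '"place": null, "coordinates": null, "geo": null}')


def get_path(obj, dotted_path):
    """Follow a dotted path such as 'geo.coordinates' through nested dicts.

    Returns None as soon as a level is missing or null.
    """
    for key in dotted_path.split('.'):
        if not isinstance(obj, dict):
            return None
        obj = obj.get(key)
    return obj


paths = ['text', 'user.name', 'place.name',
         'coordinates.coordinates', 'geo.coordinates']

rows = []
for line in (raw_with_geo, raw_without_geo):
    tweet = json.loads(line)
    # Key each value by its full path so 'user.name' and 'place.name' stay distinct.
    rows.append({path.replace('.', '_'): get_path(tweet, path) for path in paths})

for row in rows:
    print(row)
```

The second sample is there on purpose: a tweet without a location has null for both coordinates and geo, so seeing None for many rows does not by itself mean the extraction failed. Filtering on non-null values first, as in df[df.coordinates.notnull()] above, separates the two cases.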
