How to use sort_index() on a dataframe?
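I loaded a JSON file into a DataFrame using the Spark SQLContext.
It stores tweets from different users and looks like the output below. I am using the pandas library in Python to explore the data in this DataFrame.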
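import pandas as pd
from pyspark.sql import SQLContext
# Read the tweets into a pandas DataFrame
tweets = pd.read_json('/filepath')
# sc is assumed to be an existing SparkContext
sqlcontext = SQLContext(sc)
# Convert the pandas DataFrame into a Spark DataFrame
tweet_sdf = sqlcontext.createDataFrame(tweets)
tweet_sdf.show(10)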
+-------------+------------------+-------------+--------------------+-------------------+
| country| id| place| text| user|
+-------------+------------------+-------------+--------------------+-------------------+
| India|572692378957430784| Orissa|@always_nidhi @Yo...| Srkian_nishu :)|
|United States|572575240615796736| Manhattan|@OnlyDancers Bell...| TagineDiningGlobal|
|United States|572575243883036672| Claremont|1/ "Without the a...| Daniel Beer|
|United States|572575252020109312| Vienna|idk why people ha...| someone actually|
|United States|572575274539356160| Boston|Taste of Iceland!...| BostonAttitude|
|United States|572647819401670656| Suwanee|Know what you don...|Collin A. Zimmerman|
| Indonesia|572647831053312000| Mario Riawa|Serasi ade haha @...| Rinie Syamsuddin|
| Indonesia|572647839521767424|Bogor Selatan|Akhirnya bisa jug...| Vinny Sylvia|
|United States|572647841220337664| Norwalk|@BeezyDH_ it's li...| Cas|
|United States|572647842277396480| Santee| obsessed with music| kimo|
+-------------+------------------+-------------+--------------------+-------------------+
only showing top 10 rows
tweet_sdf.printSchema()
root
|-- country: string (nullable = true)
|-- id: long (nullable = true)
|-- place: string (nullable = true)
|-- text: string (nullable = true)
|-- user: string (nullable = true)
I am trying to sort the DataFrame on the index 'id' using the following call.
tweet_sdf.sort_index(by='id', ascending=False, inplace=True)
But I get the attribute error shown below.
AttributeError: 'DataFrame' object has no attribute 'sort_index'
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-106-6cd99444a12a> in <module>()
----> 1 tweet_sdf.sort_index(by='id', ascending=False, inplace=True)
/home/notebook/spark-1.6.0-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in __getattr__(self, name)
837 if name not in self.columns:
838 raise AttributeError(
--> 839 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
840 jc = self._jdf.apply(name)
841 return Column(jc)
AttributeError: 'DataFrame' object has no attribute 'sort_index'
The pandas version is 0.18.0 and the Python version is 2.7.11.
Can someone help me understand why this happens?
I think you can use sort_values, because you need to sort by the column id.
print tweet_sdf
country id place text \
0 India 572692378957430784 Orissa @always_nidhi@Yo
1 United States 572575240615796736 Manhattan @OnlyDancers Bell
2 United States 572575243883036672 Claremont 1/ "Without the a
3 United States 572575252020109312 Vienna idk why people ha
4 United States 572575274539356160 Boston Taste of Iceland!
5 United States 572647819401670656 Suwanee Know what you don
6 Indonesia 572647831053312000 Mario Riawa Serasi ade haha @
7 Indonesia 572647839521767424 Bogor Selatan Akhirnya bisa jug
8 United States 572647841220337664 Norwalk @BeezyDH_ it's li
9 United States 572647842277396480 Santee obsessed with music
user
0 Srkian_nishu :)
1 TagineDiningGlobal
2 Daniel Beer
3 someone actually
4 BostonAttitude
5 Collin A Zimmerman
6 Rinie Syamsuddin
7 Vinny Sylvia
8 Cas
9 kimo
tweet_sdf.sort_values(by='id', ascending=False, inplace=True)
print tweet_sdf
country id place text \
0 India 572692378957430784 Orissa @always_nidhi@Yo
9 United States 572647842277396480 Santee obsessed with music
8 United States 572647841220337664 Norwalk @BeezyDH_ it's li
7 Indonesia 572647839521767424 Bogor Selatan Akhirnya bisa jug
6 Indonesia 572647831053312000 Mario Riawa Serasi ade haha @
5 United States 572647819401670656 Suwanee Know what you don
4 United States 572575274539356160 Boston Taste of Iceland!
3 United States 572575252020109312 Vienna idk why people ha
2 United States 572575243883036672 Claremont 1/ "Without the a
1 United States 572575240615796736 Manhattan @OnlyDancers Bell
user
0 Srkian_nishu :)
9 kimo
8 Cas
7 Vinny Sylvia
6 Rinie Syamsuddin
5 Collin A Zimmerman
4 BostonAttitude
3 someone actually
2 Daniel Beer
1 TagineDiningGlobal
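One thing worth checking: the traceback in the question comes from pyspark/sql/dataframe.pyc, which suggests that tweet_sdf there is a Spark DataFrame rather than a pandas one. sort_values is a pandas method, so the sketch below applies it to a pandas DataFrame (the role played by tweets in the question). The three sample rows are copied from the output above; the names sorted_tweets and renumbered are made up for illustration.

```python
import pandas as pd

# Small stand-in for the pandas DataFrame `tweets`; values come from the question's output.
tweets = pd.DataFrame({
    'country': ['India', 'United States', 'Indonesia'],
    'id': [572692378957430784, 572575240615796736, 572647831053312000],
    'place': ['Orissa', 'Manhattan', 'Mario Riawa'],
})

# Without inplace=True, sort_values returns a new sorted DataFrame
# and leaves `tweets` unchanged.
sorted_tweets = tweets.sort_values(by='id', ascending=False)

# The original row labels travel with the rows; reset_index(drop=True) renumbers them.
renumbered = sorted_tweets.reset_index(drop=True)

print(sorted_tweets)
print(renumbered)

# For a Spark DataFrame the analogous call would be (not run here, needs a SparkContext):
# tweet_sdf.orderBy(tweet_sdf.id.desc())
```

Returning a new DataFrame instead of sorting in place keeps the unsorted data available for comparison.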
DataFrame.sort_index API reference
I believe the "by" argument was deprecated as of 0.17.0. You may need to drop that argument or use sort_values instead.
The by argument of DataFrame.sort_index() has been deprecated and will be removed in a future version.
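To illustrate the difference, here is a minimal sketch, assuming a pandas DataFrame with an id column. sort_index orders rows by their index labels, so id has to become the index first, while sort_values orders by a column directly. The sample values are taken from the question; by_index and by_column are illustrative names.

```python
import pandas as pd

tweets = pd.DataFrame({
    'id': [572692378957430784, 572575240615796736, 572647831053312000],
    'user': ['Srkian_nishu :)', 'TagineDiningGlobal', 'Rinie Syamsuddin'],
})

# sort_index orders rows by index labels, so make 'id' the index first.
by_index = tweets.set_index('id').sort_index(ascending=False)

# sort_values orders rows by a column and keeps the existing index labels.
by_column = tweets.sort_values(by='id', ascending=False)

print(by_index)
print(by_column)
```

Both produce the same row order; they differ only in whether id ends up as the index or stays a regular column.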