如何在存在联合条件和两个单独条件的 sframe 中提取行?
How to extract rows in sframe where there's a joint condition and two separate conditions?
我有一个 sframe
这样的:
+---------+------+-------------------------------+-----------+------------------+
| term_id | lang | term_str | term_type | reliability_code |
+---------+------+-------------------------------+-----------+------------------+
| IATE-14 | ro | Agenție de aprovizionare | fullForm | 3 |
| IATE-84 | bg | компетенции на държави чле... | fullForm | 3 |
| IATE-84 | cs | příslušnost členských stát... | fullForm | 3 |
| IATE-84 | da | medlemsstatskompetence | fullForm | 3 |
| IATE-84 | de | Zuständigkeit der Mitglied... | fullForm | 3 |
| IATE-84 | el | αρμοδιότητα των κρατών μελ... | fullForm | 3 |
| IATE-84 | en | competence of the Member S... | fullForm | 3 |
| IATE-84 | es | competencias de los Estado... | fullForm | 3 |
| IATE-84 | et | liikmesriikide pädevus | fullForm | 3 |
| IATE-84 | fi | jäsenvaltioiden toimivalta | fullForm | 3 |
| IATE-84 | fr | compétence des États membres | fullForm | 3 |
| IATE-84 | ga | inniúlacht na mBallstát | fullForm | 3 |
| IATE-84 | hu | tagállami hatáskör | fullForm | 3 |
| IATE-84 | it | competenza degli Stati membri | fullForm | 3 |
| IATE-84 | lt | valstybių narių kompetencija | fullForm | 2 |
| IATE-84 | lv | dalībvalstu kompetence | fullForm | 3 |
| IATE-84 | nl | bevoegdheid van de lidstaten | fullForm | 3 |
| IATE-84 | pl | kompetencje państw członko... | fullForm | 3 |
| IATE-84 | pt | competência dos Estados-Me... | fullForm | 3 |
| IATE-84 | ro | competența statelor membre... | fullForm | 3 |
+---------+------+-------------------------------+-----------+------------------+
我需要提取 lang == 'de' or lang == 'en'
所在的所有行,但是我用 lang == 'en'
提取的行需要具有相应的 lang == 'de'
以便它们共享相同的 term_id
.
我一直在使用 graphlab
和 sframe
:
sf = gl.SFrame.read_csv('iate.csv', delimiter='\t', quote_char='[=11=]', column_type_hints=[str,str,unicode,str,int])
de = sf[sf['lang'] == 'de']
de_termids = de['term_id']
和de.print_rows(10)
:
+------------+------+-------------------------------+-----------+------------------+
| term_id | lang | term_str | term_type | reliability_code |
+------------+------+-------------------------------+-----------+------------------+
| IATE-84 | de | Zuständigkeit der Mitglied... | fullForm | 3 |
| IATE-290 | de | Schutz der öffentlichen Ge... | fullForm | 3 |
| IATE-662 | de | mengenmäßigen Ausfuhrbesch... | fullForm | 3 |
| IATE-801 | de | Eintragungshindernisse | fullForm | 2 |
| IATE-1326 | de | Sonderregelung für Reisebü... | fullForm | 4 |
| IATE-1702 | de | Erwerbslose | fullForm | 4 |
| IATE-2818 | de | Verwaltungsvorschriften | fullForm | 3 |
| IATE-21139 | de | frisches Obst und Gemüse | fullForm | 3 |
| IATE-21563 | de | chemische Erzeugnisse zur ... | fullForm | 3 |
| IATE-21564 | de | Mineralsäuren | fullForm | 3 |
+------------+------+-------------------------------+-----------+------------------+
然后:
en = sf[sf['lang'] == 'en']
en.print_rows(10)
[输出]:
+------------+------+-------------------------------+--------------+------------------+
| term_id | lang | term_str | term_type | reliability_code |
+------------+------+-------------------------------+--------------+------------------+
| IATE-84 | en | competence of the Member S... | fullForm | 3 |
| IATE-254 | en | award of public works cont... | fullForm | 3 |
| IATE-290 | en | public health protection | fullForm | 3 |
| IATE-662 | en | quantitative restriction o... | fullForm | 3 |
| IATE-801 | en | grounds for refusal | fullForm | 2 |
| IATE-1299 | en | CEP | abbreviation | 3 |
| IATE-1326 | en | special scheme for travel ... | fullForm | 3 |
| IATE-2818 | en | regulations | fullForm | 3 |
| IATE-7128 | en | company name | fullForm | 2 |
| IATE-21139 | en | fresh fruits and vegetables | fullForm | 3 |
+------------+------+-------------------------------+--------------+------------------+
我试过:
en_de = en[en['term_id'] in de_termids]
但是我的语法错误,给我这个错误:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-12-9656091794b8> in <module>()
1 en = sf[sf['lang'] == 'en']
----> 2 en_de = en[en['term_id'] in de_termids]
/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sarray.pyc in __contains__(self, item)
691
692 """
--> 693 return (self == item).any()
694
695 def contains(self, item):
/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sarray.pyc in __eq__(self, other)
973 return SArray(_proxy = self.__proxy__.vector_operator(other.__proxy__, '=='))
974 else:
--> 975 return SArray(_proxy = self.__proxy__.left_scalar_operator(other, '=='))
976
977
/usr/local/lib/python2.7/dist-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
47 if not self.show_cython_trace:
48 # To hide cython trace, we re-raise from here
---> 49 raise exc_type(exc_value)
50 else:
51 # To show the full trace, we do nothing and let exception propagate
RuntimeError: Runtime Exception. Array size mismatch
我应该如何过滤 sframe 以便我得到带有 en
和 de
以及对应的 term_id
的行?
生成的数据框应如下所示:
+---------+-----------------+-------------+
| term_id | term_str_en | term_str_de |
+---------+-------------------------------+
| IATE-999 | something | etwas |
...
+---------+-----------------+-------------+
如何对 pandas
执行相同的操作?
你的意思是,对于每个 term_id,语言应该是 'de',或者 ('de', 'en')
但不只是 'en'?如果是,
我觉得你可以select lang = 'de' or 'en',过滤掉lang[=10中只有'en'的term_id =]
假设您已经有两个数据框,其中包含两种语言的过滤数据:df_en
和 df_de
。然后你可以 merge
他们:
new_df = pd.merge(df_en[['term_id','term_str']], df_de[['term_id','term_str']], how = 'inner', on ='term_id', suffixes = ('_en', '_de'))
方法inner
负责跳过所有不匹配的行。
您可以在 pandas docs and refs
中找到 merge
的更多选项
编辑
没有创建两个数据框的相同结果(df
是包含所有条目的原始数据框,可能也包含其他语言):
new_df = pd.merge(df.loc[df['lang']=='en',['term_id','term_str']], df.loc[df['lang']=='de',['term_id','term_str']], how = 'inner', on ='term_id', suffixes = ('_en', '_de'))
我目前正在 phone,但我认为您想利用索引对齐并进行连接。
例如
de_terms = sf [sf ['lang'] == 'de'].set_index ('term_id')
这将使 term_id
成为数据帧的索引。对 en
执行相同的操作,然后以新名称将术语列从一个分配给另一个。
通常情况下,目的地决定包含哪些索引,例如如果您只想包含带有 'en' 的条目,如果它们也有 'de',但所有 'de',请将 en 列合并到 de 框架中;仅在 en 帧中的索引将被删除。或者,您可以进行过滤以获得 no nan 的布尔组合。
由于您只得到了 pandas 的答案,下面是如何在 SFrame 中做到这一点,给定代码示例中的 de_termids
和 en
:
en.filter_by(de_termids, 'term_id')
我有一个 sframe
这样的:
+---------+------+-------------------------------+-----------+------------------+
| term_id | lang | term_str | term_type | reliability_code |
+---------+------+-------------------------------+-----------+------------------+
| IATE-14 | ro | Agenție de aprovizionare | fullForm | 3 |
| IATE-84 | bg | компетенции на държави чле... | fullForm | 3 |
| IATE-84 | cs | příslušnost členských stát... | fullForm | 3 |
| IATE-84 | da | medlemsstatskompetence | fullForm | 3 |
| IATE-84 | de | Zuständigkeit der Mitglied... | fullForm | 3 |
| IATE-84 | el | αρμοδιότητα των κρατών μελ... | fullForm | 3 |
| IATE-84 | en | competence of the Member S... | fullForm | 3 |
| IATE-84 | es | competencias de los Estado... | fullForm | 3 |
| IATE-84 | et | liikmesriikide pädevus | fullForm | 3 |
| IATE-84 | fi | jäsenvaltioiden toimivalta | fullForm | 3 |
| IATE-84 | fr | compétence des États membres | fullForm | 3 |
| IATE-84 | ga | inniúlacht na mBallstát | fullForm | 3 |
| IATE-84 | hu | tagállami hatáskör | fullForm | 3 |
| IATE-84 | it | competenza degli Stati membri | fullForm | 3 |
| IATE-84 | lt | valstybių narių kompetencija | fullForm | 2 |
| IATE-84 | lv | dalībvalstu kompetence | fullForm | 3 |
| IATE-84 | nl | bevoegdheid van de lidstaten | fullForm | 3 |
| IATE-84 | pl | kompetencje państw członko... | fullForm | 3 |
| IATE-84 | pt | competência dos Estados-Me... | fullForm | 3 |
| IATE-84 | ro | competența statelor membre... | fullForm | 3 |
+---------+------+-------------------------------+-----------+------------------+
我需要提取 lang == 'de' or lang == 'en'
所在的所有行,但是我用 lang == 'en'
提取的行需要具有相应的 lang == 'de'
以便它们共享相同的 term_id
.
我一直在使用 graphlab
和 sframe
:
sf = gl.SFrame.read_csv('iate.csv', delimiter='\t', quote_char='[=11=]', column_type_hints=[str,str,unicode,str,int])
de = sf[sf['lang'] == 'de']
de_termids = de['term_id']
和de.print_rows(10)
:
+------------+------+-------------------------------+-----------+------------------+
| term_id | lang | term_str | term_type | reliability_code |
+------------+------+-------------------------------+-----------+------------------+
| IATE-84 | de | Zuständigkeit der Mitglied... | fullForm | 3 |
| IATE-290 | de | Schutz der öffentlichen Ge... | fullForm | 3 |
| IATE-662 | de | mengenmäßigen Ausfuhrbesch... | fullForm | 3 |
| IATE-801 | de | Eintragungshindernisse | fullForm | 2 |
| IATE-1326 | de | Sonderregelung für Reisebü... | fullForm | 4 |
| IATE-1702 | de | Erwerbslose | fullForm | 4 |
| IATE-2818 | de | Verwaltungsvorschriften | fullForm | 3 |
| IATE-21139 | de | frisches Obst und Gemüse | fullForm | 3 |
| IATE-21563 | de | chemische Erzeugnisse zur ... | fullForm | 3 |
| IATE-21564 | de | Mineralsäuren | fullForm | 3 |
+------------+------+-------------------------------+-----------+------------------+
然后:
en = sf[sf['lang'] == 'en']
en.print_rows(10)
[输出]:
+------------+------+-------------------------------+--------------+------------------+
| term_id | lang | term_str | term_type | reliability_code |
+------------+------+-------------------------------+--------------+------------------+
| IATE-84 | en | competence of the Member S... | fullForm | 3 |
| IATE-254 | en | award of public works cont... | fullForm | 3 |
| IATE-290 | en | public health protection | fullForm | 3 |
| IATE-662 | en | quantitative restriction o... | fullForm | 3 |
| IATE-801 | en | grounds for refusal | fullForm | 2 |
| IATE-1299 | en | CEP | abbreviation | 3 |
| IATE-1326 | en | special scheme for travel ... | fullForm | 3 |
| IATE-2818 | en | regulations | fullForm | 3 |
| IATE-7128 | en | company name | fullForm | 2 |
| IATE-21139 | en | fresh fruits and vegetables | fullForm | 3 |
+------------+------+-------------------------------+--------------+------------------+
我试过:
en_de = en[en['term_id'] in de_termids]
但是我的语法错误,给我这个错误:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-12-9656091794b8> in <module>()
1 en = sf[sf['lang'] == 'en']
----> 2 en_de = en[en['term_id'] in de_termids]
/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sarray.pyc in __contains__(self, item)
691
692 """
--> 693 return (self == item).any()
694
695 def contains(self, item):
/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sarray.pyc in __eq__(self, other)
973 return SArray(_proxy = self.__proxy__.vector_operator(other.__proxy__, '=='))
974 else:
--> 975 return SArray(_proxy = self.__proxy__.left_scalar_operator(other, '=='))
976
977
/usr/local/lib/python2.7/dist-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
47 if not self.show_cython_trace:
48 # To hide cython trace, we re-raise from here
---> 49 raise exc_type(exc_value)
50 else:
51 # To show the full trace, we do nothing and let exception propagate
RuntimeError: Runtime Exception. Array size mismatch
我应该如何过滤 sframe 以便我得到带有 en
和 de
以及对应的 term_id
的行?
生成的数据框应如下所示:
+---------+-----------------+-------------+
| term_id | term_str_en | term_str_de |
+---------+-------------------------------+
| IATE-999 | something | etwas |
...
+---------+-----------------+-------------+
如何对 pandas
执行相同的操作?
你的意思是,对于每个 term_id,语言应该是 'de',或者 ('de', 'en') 但不只是 'en'?如果是,
我觉得你可以select lang = 'de' or 'en',过滤掉lang[=10中只有'en'的term_id =]
假设您已经有两个数据框,其中包含两种语言的过滤数据:df_en
和 df_de
。然后你可以 merge
他们:
new_df = pd.merge(df_en[['term_id','term_str']], df_de[['term_id','term_str']], how = 'inner', on ='term_id', suffixes = ('_en', '_de'))
方法inner
负责跳过所有不匹配的行。
您可以在 pandas docs and refs
merge
的更多选项
编辑
没有创建两个数据框的相同结果(df
是包含所有条目的原始数据框,可能也包含其他语言):
new_df = pd.merge(df.loc[df['lang']=='en',['term_id','term_str']], df.loc[df['lang']=='de',['term_id','term_str']], how = 'inner', on ='term_id', suffixes = ('_en', '_de'))
我目前正在 phone,但我认为您想利用索引对齐并进行连接。
例如
de_terms = sf [sf ['lang'] == 'de'].set_index ('term_id')
这将使 term_id
成为数据帧的索引。对 en
执行相同的操作,然后以新名称将术语列从一个分配给另一个。
通常情况下,目的地决定包含哪些索引,例如如果您只想包含带有 'en' 的条目,如果它们也有 'de',但所有 'de',请将 en 列合并到 de 框架中;仅在 en 帧中的索引将被删除。或者,您可以进行过滤以获得 no nan 的布尔组合。
由于您只得到了 pandas 的答案,下面是如何在 SFrame 中做到这一点,给定代码示例中的 de_termids
和 en
:
en.filter_by(de_termids, 'term_id')