Search for a value in a column

I want to check whether a column contains a given value.

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pandas as pd

df_init = pd.DataFrame({'id':['1', '2'], 'val':[100, 200]})

spark = SparkSession.builder.appName('pandasToSparkDF').getOrCreate()


mySchema = StructType([ StructField("id", StringType(), True),
                        StructField("val", IntegerType(), True)])


df = spark.createDataFrame(df_init, schema=mySchema)


if df.filter(df.id == "3"):
    print('Yes')
else:
    print('No')

It always prints 'Yes'.

With a pandas DataFrame I would do:

if '3' in df_init['id'].values:
    print('Yes')
else:
    print('No')

but with PySpark I don't know how to handle this.
I tried using 'contains' and 'isin', but it still always prints 'Yes'.

The reason it always prints 'Yes' is that filter returns a DataFrame, and a DataFrame object is always truthy, so the if branch is always taken. You can use collect_list to gather all the values of the 'id' column into a list, then check whether your element is in that list:

from pyspark.sql import functions as F

if '3' in df.select(F.collect_list('id')).first()[0]:
    print('Yes')
else:
    print('No')
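
Note that collect_list gathers every value of the 'id' column into a single list on the driver, which is fine for small DataFrames but can get expensive (or exhaust driver memory) on large ones.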

Or, after the filter operation, check whether the count is >= 1:

if df.filter(df.id == "3").count() >= 1:
    print('Yes')
else:
    print('No')
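
A variant of the same idea, sketched with only standard DataFrame API calls: first() returns None when the filtered DataFrame is empty, so it can replace count() and avoids counting every matching row. Wrapping the filtered DataFrame this way is also what makes the 'isin' attempt from the question work:

# first() returns None when the filtered DataFrame has no rows,
# so there is no need to count all matches.
if df.filter(df.id == "3").first() is not None:
    print('Yes')
else:
    print('No')

# The same wrapping fixes the 'isin' attempt; isin accepts one or
# more values and returns a boolean Column usable inside filter().
if df.filter(df.id.isin('3')).first() is not None:
    print('Yes')
else:
    print('No')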