Create a dataframe in Pyspark using random values from a list
I need to convert this code to its PySpark equivalent. I cannot use pandas to create the dataframe.
This is how I create the dataframe using Pandas:
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['Name'] = np.random.choice(["Alex","James","Michael","Peter","Harry"], size=3)
df['ID'] = np.random.randint(1, 10, 3)
df['Fruit'] = np.random.choice(["Apple","Grapes","Orange","Pear","Kiwi"], size=3)
The dataframe in PySpark should look like this:
df
Name ID Fruit
Alex 3 Apple
James 6 Grapes
Harry 5 Pear
This is what I have tried for a single column:
sdf1 = spark.createDataFrame([(k,) for k in ['Alex','James', 'Harry']]).orderBy(rand()).limit(6).show()
names = np.random.choice(["Alex","James","Michael","Peter","Harry"], size=3)
id = np.random.randint(1, 10, 3)
fruits = np.random.choice(["Apple","Grapes","Orange","Pear","Kiwi"], size=3)
columns = ['Name', 'ID', "Fruit"]
dataframe = spark.createDataFrame(zip(names, id, fruits), columns)
dataframe.show()
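The zip attempt above typically fails (or hits a type-inference error, depending on the PySpark version) because the numpy arrays yield numpy scalar types such as `numpy.str_` and `numpy.int64`, which Spark's schema inference does not accept as plain Python values. A minimal sketch of the type issue, runnable without a SparkSession:

```python
import numpy as np

names = np.random.choice(["Alex", "James", "Michael", "Peter", "Harry"], size=3)
ids = np.random.randint(1, 10, 3)

# The array elements are numpy scalar types, not plain Python str/int:
print(type(names[0]), type(ids[0]))

# Casting each value produces plain tuples that Spark can infer a schema from:
rows = [(str(n), int(i)) for n, i in zip(names, ids)]
print(type(rows[0][0]), type(rows[0][1]))
```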
You can create a pandas dataframe first and then convert it to a PySpark dataframe. Or you can zip 3 random numpy arrays and create the spark dataframe like this:
import numpy as np

# cast numpy scalars to native Python str/int so Spark can infer the schema
names = [str(x) for x in np.random.choice(["Alex", "James", "Michael", "Peter", "Harry"], size=3)]
ids = [int(x) for x in np.random.randint(1, 10, 3)]
fruits = [str(x) for x in np.random.choice(["Apple", "Grapes", "Orange", "Pear", "Kiwi"], size=3)]

df = spark.createDataFrame(list(zip(names, ids, fruits)), ["Name", "ID", "Fruit"])
df.show()
#+-------+---+------+
#| Name| ID| Fruit|
#+-------+---+------+
#| Peter| 8| Pear|
#|Michael| 7| Kiwi|
#| Harry| 4|Orange|
#+-------+---+------+
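If numpy is not strictly required, the standard library's random module builds the same row list with native Python types directly, so no casting is needed. A hedged alternative sketch (the `spark.createDataFrame` call is left commented out so the snippet runs without a SparkSession; note `random.randint(1, 9)` is inclusive on both ends, matching `np.random.randint(1, 10)`):

```python
import random

names_pool = ["Alex", "James", "Michael", "Peter", "Harry"]
fruits_pool = ["Apple", "Grapes", "Orange", "Pear", "Kiwi"]

# Plain-Python rows; each value is already a native str or int.
rows = [
    (random.choice(names_pool), random.randint(1, 9), random.choice(fruits_pool))
    for _ in range(3)
]

# df = spark.createDataFrame(rows, ["Name", "ID", "Fruit"])
print(rows)
```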