df to rdd:
df.rdd.map(list)
df.rdd.map(tuple)
df.rdd.map(lambda x: list(x[0]))
rdd to list:
rdd1.collect()
df to list:
df.rdd.map(list).collect()
df.rdd.map(tuple).collect()
rdd to df:
rdd1.toDF()
list to rdd:
sc.parallelize(list1)
list to df:
sc.parallelize(list1).toDF()
example 1:
df
df = spark.createDataFrame([
(1, 144.5, 5.9, 33, 'M'),
(2, 167.2, 5.4, 45, 'M'),
(3, 124.1, 5.2, 23, 'F'),
], ['id', 'weight', 'height', 'age', 'gender'])
1. df to rdd
rdd1 = df.rdd.map(list)
rdd1.take(3)
[[1, 144.5, 5.9, 33, 'M'], [2, 167.2, 5.4, 45, 'M'], [3, 124.1, 5.2, 23, 'F']]
or
rdd2 = df.rdd.map(tuple)
rdd2.take(3)
[(1, 144.5, 5.9, 33, 'M'), (2, 167.2, 5.4, 45, 'M'), (3, 124.1, 5.2, 23, 'F')]
2. rdd to list
list1 = rdd1.collect()
print(list1)
[[1, 144.5, 5.9, 33, 'M'], [2, 167.2, 5.4, 45, 'M'], [3, 124.1, 5.2, 23, 'F']]
or
list2 = rdd2.collect()
print(list2)
[(1, 144.5, 5.9, 33, 'M'), (2, 167.2, 5.4, 45, 'M'), (3, 124.1, 5.2, 23, 'F')]
3. df to list
list1 = df.rdd.map(list).collect()
print(list1)
[[1, 144.5, 5.9, 33, 'M'], [2, 167.2, 5.4, 45, 'M'], [3, 124.1, 5.2, 23, 'F']]
or
list2 = df.rdd.map(tuple).collect()
print(list2)
[(1, 144.5, 5.9, 33, 'M'), (2, 167.2, 5.4, 45, 'M'), (3, 124.1, 5.2, 23, 'F')]
4. rdd to df
df1=rdd1.toDF()
df1.show()
+---+-----+---+---+---+
| _1|   _2| _3| _4| _5|
+---+-----+---+---+---+
|  1|144.5|5.9| 33|  M|
|  2|167.2|5.4| 45|  M|
|  3|124.1|5.2| 23|  F|
+---+-----+---+---+---+
5. list to rdd
rdd1 = sc.parallelize(list1)
rdd1.take(3)
[[1, 144.5, 5.9, 33, 'M'], [2, 167.2, 5.4, 45, 'M'], [3, 124.1, 5.2, 23, 'F']]
6. list to df
df1 = sc.parallelize(list1).toDF()
df1.show()
+---+-----+---+---+---+
| _1|   _2| _3| _4| _5|
+---+-----+---+---+---+
|  1|144.5|5.9| 33|  M|
|  2|167.2|5.4| 45|  M|
|  3|124.1|5.2| 23|  F|
+---+-----+---+---+---+
example 2:
df
df.show(truncate=False)
+------------------------+
|ITEM_CD_joint           |
+------------------------+
|[756830]                |
|[720305, 738767, 738770]|
|[229344, 229382, 749208]|
+------------------------+
df to rdd
rdd = df.rdd.map(list)
rdd.take(3)
[[[756830]], [[720305, 738767, 738770]], [[229344, 229382, 749208]]]
df to rdd, removing the nesting
rdd = df.rdd.map(lambda x: list(x[0]))
rdd.take(3)
[[756830], [720305, 738767, 738770], [229344, 229382, 749208]]