How to show full column content in a PySpark Dataframe?
Sometimes, when a column of a Dataframe contains long content or large sentences, PySpark SQL displays the dataframe in a compressed form: only the first few words of a sentence are shown, followed by dots indicating that more data is available.
In such a compressed view, it is easy to see that the content of a long column, such as the Name column in the first example below, is not displayed completely. PySpark does this automatically so that the dataframe is printed in an orderly way and does not look messy, but in some cases we need to read or review the full content of a particular column.
So in this article, we will learn how to show the full column content of a PySpark Dataframe. The only way to show the full column content is to use the show() function.
Syntax: df.show(n, truncate=True)
Where df is the dataframe
- show(): Function used to display the Dataframe.
- n: Number of rows to display.
- truncate: Through this parameter we can tell show() to display the full column content by setting truncate to False; by default its value is True, and string values longer than 20 characters are cut off. A minimal sketch of the difference follows this list.
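The following is a minimal sketch of the two settings (not taken from the examples below; the session name and the sample row are made up for illustration):
Python
# minimal sketch: default truncated display vs. full column content
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("truncate_demo") \
    .getOrCreate()

df = spark.createDataFrame(
    [("A product name that is much longer than twenty characters", 1)],
    ["Name", "ID"])

# default (truncate=True): strings longer than 20 characters
# are cut down to 20 and end with "..."
df.show()

# truncate=False: the whole string is printed
df.show(truncate=False)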
Example 1: Showing full column content of the PySpark Dataframe by setting truncate to False.
Python
# importing necessary libraries
from pyspark.sql import SparkSession

# function to create new SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Product_details.com") \
        .getOrCreate()
    return spk

# function to create a Dataframe from data and schema
def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1

if __name__ == "__main__":

    # calling function to create SparkSession
    spark = create_session()

    input_data = [("Mobile(Fluid Black, 8GB RAM, 128GB Storage)",
                   112345, 4.0, 12499),
                  ("LED TV", 114567, 4.2, 49999),
                  ("Refrigerator", 123543, 4.4, 13899),
                  ("6.5 kg Fully-Automatic Top Loading Washing Machine "
                   "(WA65A4002VS/TL, Imperial Silver, Center Jet Technology)",
                   113465, 3.9, 6999),
                  ("T-shirt", 124378, 4.1, 1999),
                  ("Jeans", 126754, 3.7, 3999),
                  ("Men's Casual Shoes in White Sneakers for Outdoor and "
                   "Daily use", 134565, 4.7, 1499),
                  ("Vitamin C Ultra Light Gel Oil-Free Moisturizer",
                   145234, 4.6, 999)]

    schema = ["Name", "ID", "Rating", "Price"]

    # calling function to create dataframe
    df = create_df(spark, input_data, schema)

    # visualizing full content of the Dataframe
    # by setting truncate to False
    df.show(truncate=False)
Output:
Example 2: Showing full column content of the Dataframe by setting truncate to 0.
In this example we set the parameter truncate=0. If we instead pass any integer starting from 1, say 3, show() will display at most three characters (or digits) of each column's content and no more. But if we pass 0 in place of False, it also acts as False, just as 0 means false in binary, and the full column content is displayed.
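Before the full example, here is a minimal sketch (with a made-up one-row dataframe) of how an integer truncate value behaves:
Python
# minimal sketch: integer values of truncate
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("truncate_int_demo") \
    .getOrCreate()

df = spark.createDataFrame([("Have to work hard", 18)],
                           ["Remark", "Percentage"])

# truncate=3 keeps at most three characters of each cell,
# so "Have to work hard" is shown as "Hav"
df.show(truncate=3)

# truncate=0 acts like truncate=False: full content is shown
df.show(truncate=0)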
Python
# importing necessary libraries
from pyspark.sql import SparkSession

# function to create new SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Student_report.com") \
        .getOrCreate()
    return spk

# function to create a Dataframe from data and schema
def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1

if __name__ == "__main__":

    # calling function to create SparkSession
    spark = create_session()

    input_data = [(1, "Shivansh", "Male", 80, "Good Performance"),
                  (2, "Arpita", "Female", 18, "Have to work hard otherwise "
                   "result will not improve"),
                  (3, "Raj", "Male", 21, "Work hard can do better"),
                  (4, "Swati", "Female", 69, "Good performance can do more better"),
                  (5, "Arpit", "Male", 20, "Focus on some subject to improve"),
                  (6, "Swaroop", "Male", 65, "Good performance"),
                  (7, "Reshabh", "Male", 70, "Good performance"),
                  (8, "Dinesh", "Male", 65, "Can do better"),
                  (9, "Rohit", "Male", 55, "Can do better"),
                  (10, "Sanjana", "Female", 67, "Have to work hard")]

    schema = ["ID", "Name", "Gender", "Percentage", "Remark"]

    # calling function to create dataframe
    df = create_df(spark, input_data, schema)

    # visualizing full column content of the
    # dataframe by setting truncate to 0
    df.show(truncate=0)
Output:
Example 3: Showing full column content of the PySpark Dataframe using the show() function with both n and truncate.
In this code, to display the full column content we call the show() function with the arguments df.count() and truncate=False, written as df.show(df.count(), truncate=False). Here show() takes n, the number of rows to display, as its first argument. Since df.count() returns the total number of rows in the Dataframe, which is 10 in this case, n is passed as 10, so every row is displayed along with the full column content.
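As a quick aside before the full example, the sketch below (with a made-up 25-row dataframe) shows why n matters: show() prints only the first 20 rows by default, while passing df.count() as n displays every row:
Python
# minimal sketch: the n argument of show()
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("n_rows_demo") \
    .getOrCreate()

df = spark.createDataFrame([(i, "row %d" % i) for i in range(25)],
                           ["ID", "Label"])

# default n is 20, so only the first 20 of the 25 rows are printed
df.show()

# df.count() returns the total number of rows (25 here), so all
# rows are printed with full column content
df.show(df.count(), truncate=False)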
Python
# importing necessary libraries
from pyspark.sql import SparkSession

# function to create new SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Student_report.com") \
        .getOrCreate()
    return spk

# function to create a Dataframe from data and schema
def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1

if __name__ == "__main__":

    # calling function to create SparkSession
    spark = create_session()

    input_data = [(1, "Shivansh", "Male", (70, 66, 78, 70, 71, 50), 80,
                   "Good Performance"),
                  (2, "Arpita", "Female", (20, 16, 8, 40, 11, 20), 18,
                   "Have to work hard otherwise result will not improve"),
                  (3, "Raj", "Male", (10, 26, 28, 10, 31, 20),
                   21, "Work hard can do better"),
                  (4, "Swati", "Female", (70, 66, 78, 70, 71, 50),
                   69, "Good performance can do more better"),
                  (5, "Arpit", "Male", (20, 46, 18, 20, 31, 10),
                   20, "Focus on some subject to improve"),
                  (6, "Swaroop", "Male", (70, 66, 48, 30, 61, 50),
                   65, "Good performance"),
                  (7, "Reshabh", "Male", (70, 66, 78, 70, 71, 50),
                   70, "Good performance"),
                  (8, "Dinesh", "Male", (40, 66, 68, 70, 71, 50),
                   65, "Can do better"),
                  (9, "Rohit", "Male", (50, 66, 58, 50, 51, 50),
                   55, "Can do better"),
                  (10, "Sanjana", "Female", (60, 66, 68, 60, 61, 50),
                   67, "Have to work hard")]

    schema = ["ID", "Name", "Gender",
              "Sessionals Marks", "Percentage", "Remark"]

    # calling function to create dataframe
    df = create_df(spark, input_data, schema)

    # visualizing full column content of the
    # dataframe by setting n to df.count()
    # and truncate to False
    df.show(df.count(), truncate=False)
Output: