Spark SQL Data Type Conversion

Reposted

爱是与世界平行 2021-06-01 12:16:12

Tags: Spark tutorial, Learning Spark. Categories: Spark, Big Data

1. Spark SQL data types
- 1.1 Numeric types
- 1.2 Complex types
2. Spark SQL data types compared with Scala data types
3. Spark SQL data type conversion examples
4. Spark DateType cast

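1. Spark SQL data types

1.1 Numeric types

ByteType: a 1-byte signed integer. Range: -128 to 127.
ShortType: a 2-byte signed integer. Range: -32768 to 32767.
IntegerType: a 4-byte signed integer. Range: -2147483648 to 2147483647.
LongType: an 8-byte signed integer. Range: -9223372036854775808 to 9223372036854775807.
FloatType: a 4-byte single-precision floating-point number.
DoubleType: an 8-byte double-precision floating-point number.
DecimalType: an arbitrary-precision decimal number, backed internally by java.math.BigDecimal. A BigDecimal consists of an arbitrary-precision integer unscaled value and a 32-bit integer scale.

Other basic types:

StringType: a character string value.
BinaryType: a byte sequence value.
BooleanType: a boolean value.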

Datetime types:

TimestampType: a value with the fields year, month, day, hour, minute, and second.
DateType: a value with the fields year, month, and day.
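
The integer ranges and the DecimalType description above can be checked on the JVM without Spark. The following is a small sketch using only the Scala and Java standard libraries; the value 123.45 and the variable names are illustrative and not taken from the original article.

```scala
// The ranges listed for ByteType..LongType are those of the matching JVM primitives.
val byteRange  = (Byte.MinValue, Byte.MaxValue)   // ByteType
val shortRange = (Short.MinValue, Short.MaxValue) // ShortType
val intRange   = (Int.MinValue, Int.MaxValue)     // IntegerType
val longRange  = (Long.MinValue, Long.MaxValue)   // LongType

// A BigDecimal is an unscaled integer plus a scale: 123.45 = 12345 * 10^-2.
val price    = new java.math.BigDecimal("123.45")
val unscaled = price.unscaledValue() // arbitrary-precision java.math.BigInteger
val scale    = price.scale()         // 32-bit Int

println(s"ByteType range: $byteRange")
println(s"LongType range: $longRange")
println(s"$price = $unscaled x 10^-$scale")
```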

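1.2 Complex types

ArrayType(elementType, containsNull): a sequence of elements of type elementType. containsNull indicates whether the elements of the array may be null.
MapType(keyType, valueType, valueContainsNull): a set of key-value pairs. keyType is the type of the keys and valueType is the type of the values. valueContainsNull indicates whether the values of the map may be null.
StructType(fields): a value whose structure is a sequence of StructFields (fields).
StructField(name, dataType, nullable): a single field of a StructType. name is the field's name, dataType is the field's data type, and nullable indicates whether the field's value may be null.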
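
As a rough illustration of how these complex types compose, the sketch below builds a schema by hand. The field names (id, tags, scores) are hypothetical and chosen only for the example; building a schema only needs spark-sql on the classpath, not a running SparkSession.

```scala
import org.apache.spark.sql.types._

// A struct with one primitive field, one array field, and one map field.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  // Array of strings whose elements may be null.
  StructField("tags", ArrayType(StringType, containsNull = true), nullable = true),
  // Map from string keys to double values; the values may not be null.
  StructField("scores", MapType(StringType, DoubleType, valueContainsNull = false), nullable = true)
))

// Print each field with its data type and nullability.
schema.fields.foreach(f => println(s"${f.name}: ${f.dataType} (nullable = ${f.nullable})"))
```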

2. Spark SQL data types compared with Scala data types

Spark SQL data type	Scala data type
ByteType	Byte
ShortType	Short
IntegerType	Int
LongType	Long
FloatType	Float
DoubleType	Double
DecimalType	scala.math.BigDecimal
StringType	String
BinaryType	Array[Byte]
BooleanType	Boolean
TimestampType	java.sql.Timestamp
DateType	java.sql.Date
ArrayType	scala.collection.Seq
MapType	scala.collection.Map
StructType	org.apache.spark.sql.Row
StructField	The value type in Scala of the data type of this field (For example, Int for a StructField with the data type IntegerType)
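
To make the mapping concrete, here is a minimal sketch that stores Scala values in a DataFrame with an explicit schema and reads them back with the typed Row getters. The session settings, column names, and sample row are assumptions made for the example, and the java.sql.Date mapping assumes Spark's default settings.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("type-mapping").master("local[*]").getOrCreate()

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("birth", DateType),
  StructField("tags", ArrayType(StringType))
))

// Each Row holds plain Scala/Java values matching the table above.
val rows = Seq(Row(1, "tom", java.sql.Date.valueOf("1998-01-01"), Seq("a", "b")))
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

val first = df.head()
val id: Int = first.getInt(0)                   // IntegerType -> Int
val name: String = first.getString(1)           // StringType  -> String
val birth: java.sql.Date = first.getDate(2)     // DateType    -> java.sql.Date
val tags: Seq[String] = first.getSeq[String](3) // ArrayType   -> Seq

println(s"$id, $name, $birth, $tags")
```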

3. Spark SQL data type conversion examples

Conversions are done by calling the cast method of the Column class.

3.1 Getting a Column object

df("columnName")            // On a specific `df` DataFrame.
col("columnName")           // A generic column not yet associated with a DataFrame.
col("columnName.field")     // Extracting a struct field
col("`a.column.with.dots`") // Escape `.` in column names.
$"columnName"               // Scala short hand for a named column.
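
The forms above can be tried side by side. The sketch below assumes a local SparkSession and a tiny in-memory DataFrame (both made up for the example); note that the $"..." shorthand only works after importing spark.implicits._.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("get-column").master("local[*]").getOrCreate()
import spark.implicits._ // enables $"..." and toDF

val df = Seq((1, "tom", 23)).toDF("id", "name", "age")

val viaApply  = df.select(df("age"))  // column bound to this DataFrame
val viaCol    = df.select(col("age")) // generic column, resolved at select time
val viaDollar = df.select($"age")     // Scala shorthand for col("age")

Seq(viaApply, viaCol, viaDollar).foreach(d => println(d.columns.mkString(",")))
```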

3.2 Preparing the test data

1,tom,23
2,jack,24
3,lily,18
4,lucy,19

3.3 Spark entry-point code

import org.apache.spark.sql.SparkSession

val spark = SparkSession
      .builder()
      .appName("test")
      .master("local[*]")
      .getOrCreate()

3.4 Checking the default data types

import spark.implicits._ // needed for map on the Dataset and for toDF

spark.read
      .textFile("./data/user")
      .map(_.split(","))
      .map(x => (x(0), x(1), x(2)))
      .toDF("id", "name", "age")
      .dtypes
      .foreach(println)

Result:

(id,StringType)
(name,StringType)
(age,StringType)

3.5 Casting the numeric columns to IntegerType

import spark.implicits._

spark.read
      .textFile("./data/user")
      .map(_.split(","))
      .map(x => (x(0), x(1), x(2)))
      .toDF("id", "name", "age")
      .select($"id".cast("int"), $"name", $"age".cast("int"))
      .dtypes
      .foreach(println)

Result:

(id,IntegerType)
(name,StringType)
(age,IntegerType)
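
The same conversion can also be written with withColumn or with a SQL CAST expression in selectExpr. This is a sketch rather than the article's own code: it uses an in-memory stand-in for ./data/user so that it runs without the file.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.IntegerType

val spark = SparkSession.builder().appName("cast-alternatives").master("local[*]").getOrCreate()
import spark.implicits._

// In-memory stand-in for the test file; every column starts out as a string.
val raw = Seq(("1", "tom", "23"), ("2", "jack", "24")).toDF("id", "name", "age")

// Replace each column in place with its cast version.
val viaWithColumn = raw
  .withColumn("id", $"id".cast(IntegerType))
  .withColumn("age", $"age".cast(IntegerType))

// The same casts expressed as SQL fragments.
val viaSelectExpr = raw.selectExpr("CAST(id AS INT) AS id", "name", "CAST(age AS INT) AS age")

viaWithColumn.dtypes.foreach(println)
viaSelectExpr.dtypes.foreach(println)
```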

3.6 The two overloads of Column.cast

First overload:
def cast(to: String): Column
Casts the column to a different data type, using the canonical string representation of the type. The supported types are:
string, boolean, byte, short, int, long, float, double, decimal, date, timestamp.

// Casts colA to integer.
df.select(df("colA").cast("int"))
Since 1.3.0

Second overload:
def cast(to: DataType): Column
Casts the column to a different data type.

// Casts colA to IntegerType.
import org.apache.spark.sql.types.IntegerType
df.select(df("colA").cast(IntegerType))

// equivalent to
df.select(df("colA").cast("int"))
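
The two overloads are interchangeable for other target types as well. The sketch below casts to long and to a decimal with explicit precision and scale; the column names and sample values are invented for the example, and the parameterized string form "decimal(10,2)" is assumed to be accepted because the string is parsed as a SQL data type.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DecimalType, LongType}

val spark = SparkSession.builder().appName("cast-overloads").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("1", "19.99")).toDF("id", "price")

// Overload 1: target type given as a string.
val byName = df.select($"id".cast("long"), $"price".cast("decimal(10,2)"))

// Overload 2: target type given as a DataType object.
val byType = df.select($"id".cast(LongType), $"price".cast(DecimalType(10, 2)))

byName.dtypes.foreach(println)
byType.dtypes.foreach(println)
```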

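4. Spark DateType cast

There are two ways to cast a timestamp column to a Long (seconds since the epoch):

Option 1: set Spark's session time zone with config("spark.sql.session.timeZone", "UTC"). This is the most straightforward approach, and the cast can then be written directly:

df.select(df.col("birth").cast(TimestampType).cast(LongType))

Option 2: without setting the config, convert between time zones explicitly (this needs java.util.TimeZone and org.apache.spark.sql.functions._ in scope):

df.select(from_utc_timestamp(to_utc_timestamp(df.col("birth"), TimeZone.getTimeZone("UTC").getID), TimeZone.getDefault.getID).cast(LongType))

Without UTC configured:

from_utc_timestamp(to_utc_timestamp(lit("2012-12-11 16:00:00"), TimeZone.getTimeZone("UTC").getID), TimeZone.getDefault.getID)

With UTC configured (8 hours later):

from_utc_timestamp(to_utc_timestamp(lit("2012-12-12 00:00:00"), TimeZone.getTimeZone("UTC").getID), TimeZone.getDefault.getID)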
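
The configured-UTC approach can be sketched end to end as follows. The birth column and its single sample value are assumptions for the example; the result is cross-checked against java.time instead of a hard-coded number.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, TimestampType}

val spark = SparkSession.builder()
  .appName("timestamp-to-long")
  .master("local[*]")
  .config("spark.sql.session.timeZone", "UTC") // interpret timestamp strings as UTC
  .getOrCreate()
import spark.implicits._

val df = Seq("2012-12-12 00:00:00").toDF("birth")

// String -> Timestamp -> Long (seconds since 1970-01-01 00:00:00 UTC).
val epochSeconds = df
  .select($"birth".cast(TimestampType).cast(LongType))
  .head()
  .getLong(0)

// Independent cross-check of the same instant using java.time.
val expected = java.time.LocalDateTime
  .parse("2012-12-12T00:00:00")
  .toEpochSecond(java.time.ZoneOffset.UTC)

println(s"Spark: $epochSeconds, java.time: $expected")
```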
