Compiling Spark 3.x
1. Modify D:\source\spark-3.0.0\dev\make-distribution.sh
Comment out lines 133-151 and replace them as follows:
VERSION=3.0.0
SCALA_VERSION=2.12
SPARK_HADOOP_VERSION=2.6.0-cdh5.16.2
SPARK_HIVE=1
#VERSION=$("$MVN" help:evaluate -Dexpression=project.version $@ 2>/dev/null\
#    | grep -v "INFO"\
#    | grep -v "WARNING"\
#    | tail -n 1)
#SCALA_VERSION=$("$MVN" help:evaluate -Dexpression=scala.binary.version $@ 2>/dev/null\
#    | grep -v "INFO"\
#    | grep -v "WARNING"\
#    | tail -n 1)
#SPARK_HADOOP_VERSION=$("$MVN" help:evaluate -Dexpression=hadoop.version $@ 2>/dev/null\
#    | grep -v "INFO"\
#    | grep -v "WARNING"\
#    | tail -n 1)
#SPARK_HIVE=$("$MVN" help:evaluate -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null\
#    | grep -v "INFO"\
#    | grep -v "WARNING"\
#    | fgrep --count "<id>hive</id>";\
#    # Reset exit status to 0, otherwise the script stops here if the last grep finds nothing
#    # because we use "set -o pipefail"
#    echo -n)
2. Modify D:\source\spark-3.0.0\pom.xml
In the relevant profile, change the default versions to the Hadoop and Scala versions you are actually building against.
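What this edit looks like depends on your pom.xml; as a rough sketch, it amounts to pointing the profile's hadoop.version property at the CDH release. The profile id and surrounding structure below are illustrative, not the stock layout:
<!-- illustrative profile sketch; adapt to the profiles already present in pom.xml -->
<profile>
  <id>hadoop-2.6</id>
  <properties>
    <hadoop.version>2.6.0-cdh5.16.2</hadoop.version>
  </properties>
</profile>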
3. Start the build in Git Bash
1 ./dev/change-scala-version.sh 2.12
2 ./dev/make-distribution.sh --name 2.6.0-cdh5.16.2 --tgz -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.16.2 -Dscala.version=2.12.10
If your Hadoop version is older than 2.6.4, the build will fail when it reaches the yarn module.
Fix: https://github.com/apache/spark/pull/16884/files
Modify the source file:
D:\source\spark-3.0.0\resource-managers\yarn\src\main\scala\org\apache\spark\deploy\yarn\Client.scala
Edit the code around line 295. setRolledLogsIncludePattern and setRolledLogsExcludePattern were only added in Hadoop 2.6.4, so building against an older Hadoop fails to compile here.
Changing the code as follows resolves the problem:
//    sparkConf.get(ROLLED_LOG_INCLUDE_PATTERN).foreach { includePattern =>
//      try {
//        val logAggregationContext = Records.newRecord(classOf[LogAggregationContext])
//        logAggregationContext.setRolledLogsIncludePattern(includePattern)
//        sparkConf.get(ROLLED_LOG_EXCLUDE_PATTERN).foreach { excludePattern =>
//          logAggregationContext.setRolledLogsExcludePattern(excludePattern)
//        }
//        appContext.setLogAggregationContext(logAggregationContext)
//      } catch {
//        case NonFatal(e) =>
//          logWarning(s"Ignoring ${ROLLED_LOG_INCLUDE_PATTERN.key} because the version of YARN " +
//            "does not support it", e)
//      }
//    }
//    appContext.setUnmanagedAM(isClientUnmanagedAMEnabled)
//
//    sparkConf.get(APPLICATION_PRIORITY).foreach { appPriority =>
//      appContext.setPriority(Priority.newInstance(appPriority))
//    }
//    appContext
//  }
    sparkConf.get(ROLLED_LOG_INCLUDE_PATTERN).foreach { includePattern =>
      try {
        val logAggregationContext = Records.newRecord(classOf[LogAggregationContext])
        // These two methods were added in Hadoop 2.6.4, so we still need to use reflection to
        // avoid compile error when building against Hadoop 2.6.0 ~ 2.6.3.
        val setRolledLogsIncludePatternMethod =
          logAggregationContext.getClass.getMethod("setRolledLogsIncludePattern", classOf[String])
        setRolledLogsIncludePatternMethod.invoke(logAggregationContext, includePattern)
        sparkConf.get(ROLLED_LOG_EXCLUDE_PATTERN).foreach { excludePattern =>
          val setRolledLogsExcludePatternMethod =
            logAggregationContext.getClass.getMethod("setRolledLogsExcludePattern", classOf[String])
          setRolledLogsExcludePatternMethod.invoke(logAggregationContext, excludePattern)
        }
        appContext.setLogAggregationContext(logAggregationContext)
      } catch {
        case NonFatal(e) =>
          logWarning(s"Ignoring ${ROLLED_LOG_INCLUDE_PATTERN.key} because the version of YARN " +
            "does not support it", e)
      }
    }
    appContext
  }
Spark 3.x: Reading Hive Data with Spark-SQL
When launching Spark-SQL, you must pass the MySQL driver with --driver-class-path; --jars may not take effect.
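A typical launch command looks roughly like the sketch below; the MySQL connector jar name and its location under the Hive lib directory are placeholders, so substitute whatever driver your metastore database actually uses:
# Sketch: put the MySQL driver on the driver classpath (jar name/path are examples)
./bin/spark-sql \
  --driver-class-path /home/badass/app/hive/lib/mysql-connector-java-5.1.47.jar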
Problem 1
Error message and cause:
Spark 3.x targets Hive 2.x, while the Hive in use here is 1.1.0. Spark 3.0 calls ThriftHiveMetastore$Client.get_table_req, which Hive 1.1.0 does not support:
Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table hive_metadata_test. Invalid method name: 'get_table_req';
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table hive_metadata_test. Invalid method name: 'get_table_req';
Caused by: org.apache.thrift.TApplicationException: Invalid method name: 'get_table_req'
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table_req(ThriftHiveMetastore.java:1567)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table_req(ThriftHiveMetastore.java:1554)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1350)
at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.getTable(SessionHiveMetaStoreClient.java:127)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173)
at com.sun.proxy.$Proxy18.getTable(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2336)
at com.sun.proxy.$Proxy18.getTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1274)
... 88 more
Fix: https://github.com/apache/spark/pull/27161
When starting Spark-SQL, pass the following two parameters with --conf, or add them to spark3.x/conf/spark-defaults.conf:
spark.sql.hive.metastore.version=1.1.0                       # Hive version
spark.sql.hive.metastore.jars=/home/badass/app/hive/lib/*    # lib directory of the Hive installation in use
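Passed on the command line instead of spark-defaults.conf, the same settings look roughly like this (the MySQL driver jar is again a placeholder; the metastore settings are the two values above):
# Sketch: pass the metastore version and jars via --conf at launch time
./bin/spark-sql \
  --driver-class-path /home/badass/app/hive/lib/mysql-connector-java-5.1.47.jar \
  --conf spark.sql.hive.metastore.version=1.1.0 \
  --conf "spark.sql.hive.metastore.jars=/home/badass/app/hive/lib/*"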
Problem 2
After fixing Problem 1 and starting Spark-SQL again, the following problem appears.
Error message and cause:
Because of the Hive version mismatch in Problem 1, we set spark.sql.hive.metastore.jars to change which classes Spark loads to access the Hive Metastore. Spark allows users to set spark.sql.hive.metastore.jars to specify the jars used to talk to the metastore, and those jars are loaded by an isolated classloader. Since Hadoop classes are shared with that isolated classloader, users do not need to add Hadoop jars to spark.sql.hive.metastore.jars, which means the hadoop-common jar is not on the isolated classloader's own classpath. If Hadoop's VersionInfo has not been initialized before switching to the isolated classloader (the current thread context classloader), initializing it there fails and the version is reported as "Unknown", which causes Hive to throw the following exception.
java.lang.RuntimeException: Illegal Hadoop Version: Unknown (expected A.B.* format)
at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:147)
at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:122)
at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:88)
at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:370)
at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:108)
at org.apache.spark.sql.hive.client.HiveClientImpl$.newHiveConf(HiveClientImpl.scala:1233)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:163)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:128)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:301)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:431)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:324)
at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:68)
at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:67)
at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:221)
at org.apache.spark.sql.hive.HiveExternalCatalog$$Lambda$785/90628418.apply$mcZ$sp(Unknown Source)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:221)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:137)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:127)
at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.<init>(SparkSQLCLIDriver.scala:321)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:155)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Fix:
https://github.com/apache/spark/pull/29059 resolves this by changing how the class is loaded; you can apply that patch to the source and rebuild.
That approach is fairly cumbersome. Reading VersionInfo requires the hadoop-common jar, and with spark.sql.hive.metastore.jars configured it cannot be found, so the simplest workaround is to copy hadoop/share/hadoop/common/hadoop-common-2.6.0-cdh5.16.2.jar into Hive's lib directory.
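As a minimal sketch, assuming HADOOP_HOME points at the CDH 5.16.2 installation and Hive lives under /home/badass/app/hive as above:
# Sketch: make hadoop-common visible to the isolated Hive metastore classloader
cp $HADOOP_HOME/share/hadoop/common/hadoop-common-2.6.0-cdh5.16.2.jar /home/badass/app/hive/lib/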
With the jar in place, Spark-SQL starts successfully and can read data from Hive.