背景知识:这两天公司想把xgboost模型做的件量预测移植到spark xgboost上,然后就开始了漫漫长路。踩了很多坑,然后把自己的目前可运行的一个demo放上来跟大家分享。
1.环境:
idea
linux系统
这里有个坑:如果不想去编译xgboost,通过maven引入的xgboost4j包只支持linux系统,因为windows需要.dll文件,linux需要.so文件,而maven引入的xgboost4j里面只有.so文件,所以只能在linux上跑。
scala:2.11.0
jdk:1.8
xgboost:0.72
spark:必须要2.3.0及其以上,否则会出千奇百怪的错
<java.version>1.8</java.version>
<spark.version>2.3.0</spark.version>
<hadoop.version>2.7.3</hadoop.version>
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j</artifactId>
<version>0.72</version>
</dependency>
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-spark</artifactId>
<version>0.72</version>
</dependency>
2.可运行的demo
import ml.dmlc.xgboost4j.scala.spark.XGBoost
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.feature._
import org.apache.spark.sql._
object myCallXGBoost {
Logger.getLogger("org").setLevel(Level.WARN)
def main(args: Array[String]): Unit = {
val inputPath = "/Users/01376233/IdeaProjects/myxgboost/src/main/data"
// create SparkSession
val spark = SparkSession
.builder()
.appName("SimpleXGBoost Application")
.config("spark.executor.memory", "2G")
.config("spark.executor.cores", "4")
.config("hive.metastore.uris","thrift://10.202.77.200:9083")
.config("spark.driver.memory", "1G")
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("spark.default.parallelism", "4")
.enableHiveSupport()
//.master("local[*]")
.getOrCreate()
//从csv中读取数据
//val myTrainCsv = spark.read.option("header", "true").option("inferSchema", true).csv(inputPath + "/my_train.csv")
//val myTestCsv = spark.read.option("header", "true").option("inferSchema", true).csv(inputPath + "/my_test.csv")
//从hive中读取数据
val myTrainCsv = spark.sql("select * from dm_analysis.lsm_xgboost_train")
val myTestCsv = spark.sql("select * from dm_analysis.lsm_xgboost_test")
//println(myTrainCsv.getClass.getSimpleName) //Dataset
//sys.exit()
//myTrainCsv.show(10)
//把特征转化为一个vector
//将多列的特征转化为一个vector,这个vector叫features
val vectorAssembler = new VectorAssembler()
.setInputCols(Array("iswork","rank","cntLag1","cntLag2","Monday",
"Saturday","Sunday","Thursday","Tuesday","Wednesday",
"August","December","February","January","July","June","March"
,"May","November","October","September","lateMonth","midMonth"))
.setOutputCol("features")
val xGBoostTrainInput = vectorAssembler.transform(myTrainCsv).drop("_c0").withColumnRenamed("cnt","label").select("features", "label")
val xGBoostTestInput = vectorAssembler.transform(myTestCsv).select("features")
xGBoostTestInput.show(10)
//sys.exit()
//sys.exit()
// number of iterations
val numRound = 10
val numWorkers = 4
// training parameters
val paramMap = List(
"colsample_bytree" -> 1,
"eta" -> 0.05f, //就是学习率
"max_depth" -> 8, //树的最大深度
"min_child_weight" -> 5, //
"n_estimators" -> 120,
"subsample" -> 0.7
).toMap
println("Starting Xgboost ")
//val a = new XGBoostRegressionModel
val xgBoostModel = XGBoost.trainWithDataFrame(xGBoostTrainInput, paramMap, round = 10, nWorkers = 4, useExternalMemory = true)
val output = xgBoostModel.transform(xGBoostTestInput)
output.show()
}
}
这是可以在linux调试的版本,上传到spark 集群上的后续会继续更新。