背景知识:这两天公司想把xgboost模型做的件量预测移植到spark xgboost上,然后就开始了漫漫长路。踩了很多坑,然后把自己的目前可运行的一个demo放上来跟大家分享。

1.环境:

idea

linux系统

这里有个坑:如果不想去编译xgboost,通过maven引入的xgboost4j包只支持linux系统,因为windows需要.dll文件,linux需要.so文件,而maven引入的xgboost4j里面只有.so文件,所以只能在linux上跑。

分享一个spark xgboost可运行的实例_其他

scala:2.11.0

jdk:1.8

xgboost:0.72

spark:必须要2.3.0及其以上,否则会出千奇百怪的错

<java.version>1.8</java.version>
<spark.version>2.3.0</spark.version>
<hadoop.version>2.7.3</hadoop.version>
<dependency>
    <groupId>ml.dmlc</groupId>
    <artifactId>xgboost4j</artifactId>
    <version>0.72</version>
</dependency>
<dependency>
    <groupId>ml.dmlc</groupId>
    <artifactId>xgboost4j-spark</artifactId>
    <version>0.72</version>
</dependency>

2.可运行的demo

import ml.dmlc.xgboost4j.scala.spark.XGBoost
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.feature._
import org.apache.spark.sql._



object myCallXGBoost {
  Logger.getLogger("org").setLevel(Level.WARN)

  def main(args: Array[String]): Unit = {
    val inputPath = "/Users/01376233/IdeaProjects/myxgboost/src/main/data"

    // create SparkSession
    val spark = SparkSession
      .builder()
      .appName("SimpleXGBoost Application")
      .config("spark.executor.memory", "2G")
      .config("spark.executor.cores", "4")
      .config("hive.metastore.uris","thrift://10.202.77.200:9083")
      .config("spark.driver.memory", "1G")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.default.parallelism", "4")
      .enableHiveSupport()
      //.master("local[*]")
      .getOrCreate()

    //从csv中读取数据
    //val myTrainCsv = spark.read.option("header", "true").option("inferSchema", true).csv(inputPath + "/my_train.csv")
    //val myTestCsv = spark.read.option("header", "true").option("inferSchema", true).csv(inputPath + "/my_test.csv")

    //从hive中读取数据
    val myTrainCsv = spark.sql("select * from dm_analysis.lsm_xgboost_train")
    val myTestCsv = spark.sql("select * from dm_analysis.lsm_xgboost_test")



    //println(myTrainCsv.getClass.getSimpleName)      //Dataset
    //sys.exit()
    //myTrainCsv.show(10)

    //把特征转化为一个vector
    //将多列的特征转化为一个vector,这个vector叫features
    val vectorAssembler = new VectorAssembler()
      .setInputCols(Array("iswork","rank","cntLag1","cntLag2","Monday",
        "Saturday","Sunday","Thursday","Tuesday","Wednesday",
        "August","December","February","January","July","June","March"
        ,"May","November","October","September","lateMonth","midMonth"))
      .setOutputCol("features")


    val xGBoostTrainInput = vectorAssembler.transform(myTrainCsv).drop("_c0").withColumnRenamed("cnt","label").select("features", "label")

    val xGBoostTestInput = vectorAssembler.transform(myTestCsv).select("features")
    xGBoostTestInput.show(10)
    //sys.exit()

    //sys.exit()
    // number of iterations
    val numRound = 10
    val numWorkers = 4
    // training parameters
    val paramMap = List(
      "colsample_bytree" -> 1,
      "eta" -> 0.05f,                        //就是学习率
      "max_depth" -> 8,                       //树的最大深度
      "min_child_weight" -> 5,                //
      "n_estimators" -> 120,
      "subsample" -> 0.7
      ).toMap


    println("Starting Xgboost ")

    //val a = new XGBoostRegressionModel

    val xgBoostModel = XGBoost.trainWithDataFrame(xGBoostTrainInput, paramMap, round = 10, nWorkers = 4, useExternalMemory = true)

    val output = xgBoostModel.transform(xGBoostTestInput)

    output.show()
  }
}

这是可以在linux调试的版本,上传到spark 集群上的后续会继续更新。