Our big data platform follows this pipeline: an ETL tool extracts data from relational databases into HBase; Phoenix secondary indexes and SQL join queries then assemble the training and validation sets, which are handed to Spark; Spark ML's machine learning library runs the analysis (for example, linear regression or decision trees); finally, the results are written to temporary tables in Phoenix and visualized with Zeppelin. That is the overall design of the platform.

Let's walk through the steps one by one:

1. Set up single-node Phoenix and HBase with Docker (a cluster deployment is recommended for production; the repository below can serve as a reference)

Download https://gitee.com/astra_zhao/hbase-phoenix-docker, follow the README.md, and then start the container with:

docker run -it -p 8765:8765 -p 2181:2181 iteblog/hbase-phoenix-docker
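
Once the container is up, Phoenix is reachable over JDBC through the ZooKeeper port (2181). The following is a minimal smoke test, assuming the phoenix-core dependency from section 4.1 is on the classpath; the SMOKE_TEST table is purely illustrative:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixSmokeTest {

    public static void main(String[] args) throws Exception {
        // The JDBC URL points at the ZooKeeper port exposed by the container
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181");
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate("CREATE TABLE IF NOT EXISTS SMOKE_TEST (ID BIGINT PRIMARY KEY, NAME VARCHAR)");
            stmt.executeUpdate("UPSERT INTO SMOKE_TEST VALUES (1, 'hello')");
            conn.commit(); // Phoenix connections do not auto-commit by default
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM SMOKE_TEST")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1) + " -> " + rs.getString(2));
                }
            }
        }
    }
}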

2. Set up a multi-node Spark environment with Docker (Docker can be used in production, but docker-compose must be configured carefully, because the storage files need real-time backup)

Download https://gitee.com/astra_zhao/docker-spark and run docker-compose up -d. Once the environment is up, the exposed ports are as follows:

[screenshot: exposed ports]

Note that the docker-compose.yml file needs entries like the following:

master:
   image: gettyimages/spark
   command: bin/spark-class org.apache.spark.deploy.master.Master -h master
   hostname: master
   environment:
     MASTER: spark://master:7077
     SPARK_CONF_DIR: /conf
     SPARK_PUBLIC_DNS: localhost
   extra_hosts:
     - "主机名:192.168.63.9"    -"phoenix容器ID:172.17.0.2"

Adding extra_hosts lets the containers resolve the host machine and the Phoenix container by name, so they can communicate with each other; without these entries, startup will fail.

3. Phoenix join operations and optimization

Refer to this article:

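As a standalone illustration (not taken from the referenced article), here is a minimal sketch of a Phoenix join over JDBC. The ORDERS and CUSTOMERS tables and their columns are hypothetical; USE_SORT_MERGE_JOIN is a real Phoenix hint for joining two large tables:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixJoinExample {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:192.168.61.102:2181");
             Statement stmt = conn.createStatement()) {
            // Phoenix executes joins as broadcast hash joins by default, which
            // requires one side to fit in region-server memory. For two large
            // tables, the USE_SORT_MERGE_JOIN hint switches to a sort-merge join.
            String sql = "SELECT /*+ USE_SORT_MERGE_JOIN */ o.ORDER_ID, c.NAME "
                       + "FROM ORDERS o JOIN CUSTOMERS c ON o.CUSTOMER_ID = c.ID";
            try (ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1) + " -> " + rs.getString(2));
                }
            }
        }
    }
}
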
4. Build the Java example

4.1 Create the Maven project (setting up a Spring Boot project is left to the reader)

The Maven configuration below supports two packaging modes. mvn clean package -Dmaven.test.skip=true packages the project and copies the third-party jars into the lib directory under target.

mvn clean package assembly:single builds a single self-contained jar. The first approach is recommended.

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <spark.version>2.4.0</spark.version>
    <scala.binary.version>2.11</scala.binary.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
    <!--phoenix core-->
    <dependency>
      <groupId>org.apache.phoenix</groupId>
      <artifactId>phoenix-core</artifactId>
      <version>5.0.0-HBase-2.0</version>
      <exclusions>
        <exclusion>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
        <exclusion>
          <groupId>log4j</groupId>
          <artifactId>log4j</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.apache.phoenix</groupId>
      <artifactId>phoenix-spark</artifactId>
      <version>5.0.0-HBase-2.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.binary.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_${scala.binary.version}</artifactId>
      <version>${spark.version}</version>
     </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-client</artifactId>
      <version>2.0.6</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-common</artifactId>
      <version>2.0.6</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-server</artifactId>
      <version>2.0.6</version>
    </dependency>
    <dependency>
      <groupId>org.apache.zookeeper</groupId>
      <artifactId>zookeeper</artifactId>
      <version>3.4.10</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-protocol</artifactId>
      <version>2.0.6</version>
    </dependency>
    <dependency>
      <groupId>org.apache.htrace</groupId>
      <artifactId>htrace-core</artifactId>
      <version>3.2.0-incubating</version>
    </dependency>
    <dependency>
      <groupId>io.dropwizard.metrics</groupId>
      <artifactId>metrics-core</artifactId>
      <version>3.2.6</version>
    </dependency>
  </dependencies>

  <build>
    <resources>
      <!-- 编译之后包含properties -->
      <resource>
        <directory>src/main/resources</directory>
        <includes>
          <include>**/*.properties</include>
        </includes>
        <filtering>true</filtering>
      </resource>
    </resources>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>3.7.0</version>
          <configuration>
            <source>1.8</source>
            <target>1.8</target>
            <encoding>UTF-8</encoding>
          </configuration>
        </plugin>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-jar-plugin</artifactId>
          <configuration>
            <archive>
              <manifest>
                <addClasspath>true</addClasspath>
                <classpathPrefix>lib/</classpathPrefix>
                <mainClass>tech.zhaoxin.App</mainClass>
              </manifest>
            </archive>
          </configuration>
        </plugin>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-dependency-plugin</artifactId>
          <executions>
            <execution>
              <id>copy-dependencies</id>
              <phase>package</phase>
              <goals>
                <goal>copy-dependencies</goal>
              </goals>
              <configuration>
                <outputDirectory>${project.build.directory}/lib</outputDirectory>
                <overWriteReleases>false</overWriteReleases>
                <overWriteSnapshots>false</overWriteSnapshots>
                <overWriteIfNewer>true</overWriteIfNewer>
              </configuration>
            </execution>
          </executions>
        </plugin>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-assembly-plugin</artifactId>
          <version>2.5.5</version>
          <configuration>
            <archive>
              <manifest>
                <mainClass>tech.zhaoxin.App</mainClass>
              </manifest>
            </archive>
            <descriptorRefs>
              <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
          </configuration>
        </plugin>
      </plugins>
  </build>

Create the Java class:

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class PhoenixSparkRead {

    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf()
                .setMaster("spark://192.168.61.102:7077")
                .setAppName("phoenix-test");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        SQLContext sqlContext = new SQLContext(jsc);
        System.out.println("Step 1: load the table from Phoenix");
        // Load the "iteblog" table through the phoenix-spark connector
        Dataset<Row> df = sqlContext
                .read()
                .format("org.apache.phoenix.spark")
                .option("table", "iteblog")
                .option("zkUrl", "192.168.61.102:2181")
                .load();
        df.createOrReplaceTempView("iteblog");
        System.out.println("Step 2: query the temp view with Spark SQL");
        df = sqlContext.sql("SELECT * FROM iteblog");
        System.out.println("Step 3: collect the results to the driver");
        List<Row> rows = df.collectAsList();
        System.out.println(rows);
        jsc.stop();
        System.out.println("Done");
    }
}
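
The loaded Dataset can then be handed to Spark ML, as described in the overview. The following is only a minimal sketch: FEATURE1, FEATURE2, LABEL, and RESULT_TABLE are hypothetical names, spark-mllib_2.11 would need to be added to the pom, and RESULT_TABLE must already exist in Phoenix with matching columns:

import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class PhoenixSparkTrain {

    // df is the Dataset<Row> loaded from Phoenix as in PhoenixSparkRead
    static void trainAndWriteBack(Dataset<Row> df) {
        // Pack the feature columns into the single vector column spark.ml expects
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"FEATURE1", "FEATURE2"})
                .setOutputCol("features");
        Dataset<Row> training = assembler.transform(df);

        LinearRegression lr = new LinearRegression()
                .setLabelCol("LABEL")
                .setFeaturesCol("features")
                .setMaxIter(10);
        LinearRegressionModel model = lr.fit(training);

        // Score the training set and write label + prediction back to Phoenix.
        // The phoenix-spark connector only accepts SaveMode.Overwrite and
        // performs UPSERTs into the existing table under the hood.
        model.transform(training)
                .select("LABEL", "prediction")
                .write()
                .format("org.apache.phoenix.spark")
                .option("table", "RESULT_TABLE")
                .option("zkUrl", "192.168.61.102:2181")
                .mode(SaveMode.Overwrite)
                .save();
    }
}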

5. Configuration and deployment

5.1 On the Linux server, put spark-2.4.1-bin-hadoop2.7.tgz under /opt and extract it:

tar -xvzf spark-2.4.1-bin-hadoop2.7.tgz

5.2 Copy the jars from the lib directory produced by the Maven build above into /opt/jars/lib.

5.3 Copy all of the following jars into the spark-2.4.1-bin-hadoop2.7/jars directory:

[screenshots: the Phoenix and HBase jars to copy]

5.4 Copy these files into the Spark Docker containers, for example:

docker cp /opt/phoenix/ 8ead:/usr/spark-2.4.1/jars/    (the /opt/phoenix directory contains only the jars shown above)

Then enter the container and move the jars from /usr/spark-2.4.1/jars/phoenix up one level into /usr/spark-2.4.1/jars.

Perform the same steps on both containers (master and worker).

5.5 Finally, from /opt/spark-2.4.1-bin-hadoop2.7/bin on the host, run:

./spark-submit --class com.astra.PhoenixSparkRead \
  --master spark://192.168.61.102:7077 \
  --driver-memory 4g \
  --jars $(echo /opt/jars/lib/*.jar | tr ' ' ',') \
  /opt/jars/spark-zeppelin-learn-1.0-SNAPSHOT.jar

Note that the options must come before the application jar (anything after it is passed as an application argument), and --jars expects a comma-separated list, hence the glob-to-comma conversion.

5.6 You should now see the data:

[screenshot: query results]