(1) Common questions from Spark beginners | http://aperise.iteye.com/blog/2302481 |
(2) Setting up a Spark development environment | http://aperise.iteye.com/blog/2302535 |
(3) Spark Standalone cluster installation | http://aperise.iteye.com/blog/2305905 |
(4) Reading and writing HDFS, Redis, and HBase from spark-shell | http://aperise.iteye.com/blog/2324253 |
Setting up a Spark development environment
- JDK download and installation
- Scala download and installation
- Scala IDE for Eclipse download and installation
- IntelliJ IDEA for Scala download and installation
- Installing IntelliJ IDEA Ultimate
- Installing the Scala plugin online
- Installing the Scala plugin offline
- Creating a Maven Scala project
- Common IntelliJ IDEA settings
- IntelliJ IDEA local development cannot resolve the logical cluster name of a Hadoop HA setup
Spark's source code is written in Scala, a JVM-based language, so Scala is the best choice for your own development: you keep practicing your Scala skills, which in turn lets you read the Spark source code more deeply and keep improving. This article only covers setting up a Scala development environment.
1. JDK installation
Download page for all Java versions on Oracle's site: http://www.oracle.com/technetwork/java/archive-139210.html
(1) JDK download
JDK 1.7 download: http://download.oracle.com/otn-pub/java/jdk/7u79-b15/jdk-7u79-linux-x64.tar.gz
JDK 1.8 download: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
Choose whichever JDK version suits your needs and download it.
(2) JDK environment variable configuration
My JDK 1.7 is located at: D:\Java\jdk1.7.0_55
Set the JAVA_HOME environment variable:
JAVA_HOME=D:\Java\jdk1.7.0_55
Set the CLASSPATH environment variable:
CLASSPATH=.;%JAVA_HOME%\lib;%JAVA_HOME%\lib\tools.jar
Note: do not overwrite the existing PATH value, which still contains the Windows DOS command configuration; instead append ;%JAVA_HOME%\bin;%JAVA_HOME%\jre\bin; to the end of PATH, like so:
PATH=<existing PATH value>;%JAVA_HOME%\bin;%JAVA_HOME%\jre\bin;
To test whether the JDK is installed correctly, run java -version in a command window and check the reported Java version.
2. Scala installation
(1) Scala download
Scala 2.10.6 download: http://www.scala-lang.org/download/2.10.6.html
Scala 2.11.8 download: http://www.scala-lang.org/download/2.11.8.html
(2) Scala environment variable configuration
The official introduction to setting up Scala: http://www.scala-lang.org/documentation/getting-started.html
My Scala is located at: D:\scala\scala-2.10.6
Set the SCALA_HOME environment variable:
SCALA_HOME=D:\scala\scala-2.10.6
Set the PATH environment variable. Again, do not overwrite the existing PATH value; append ;%SCALA_HOME%\bin to the end of it, like so:
PATH=<existing PATH value>;%SCALA_HOME%\bin
Once this is done, verify in a command window that Scala is installed correctly, for example by running scala -version.
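You can also start the Scala REPL (type scala in a command window) and print the version from Scala code itself; this is just a quick sanity check:

println(scala.util.Properties.versionString)  // should print something like "version 2.10.6"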
3. IDE installation
(1) Scala IDE for Eclipse
Download: http://scala-ide.org/download/sdk.html
(2) IntelliJ IDEA free edition (Java, Scala, Android)
Here we download the free edition of IntelliJ IDEA.
Download: http://www.jetbrains.com/idea/download/download-thanks.html?code=IIC
4. Installing IntelliJ IDEA Ultimate
4.1 Download: https://www.jetbrains.com/idea/download/#section=windows
Download the latest IDEA Ultimate edition from the official site (by default it is free for only one month, more on that later); this is the edition with no feature restrictions.
4.2 Installing IntelliJ IDEA
Here I downloaded ideaIU-2016.2.4.exe; double-click it and follow the installer steps:
5. Installing the Scala plugin online
After installing IntelliJ IDEA, the Scala plugin is not installed by default on first launch, so you need to install it yourself; this section describes the online installation.
If IntelliJ IDEA is already open, you can reach the plugin installation window from the following menu.
Then continue with these steps:
6. Installing the Scala plugin offline
The plugin version to download and its download address are indicated in the prompt shown here.
7. Creating a Maven Scala project
7.1 File -> New Project
7.2 Set the Project SDK
7.3 Create from archetype
7.4 Set the groupId and artifactId
7.5 Set Maven
7.6 Set the project name
7.7 Edit the Maven pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.XXX</groupId>
<artifactId>spark-offline</artifactId>
<version>1.0-SNAPSHOT</version>
<packaging>jar</packaging>
<inceptionYear>2008</inceptionYear>
<properties>
<scala.version>2.10.5</scala.version>
<spark.version>1.6.0</spark.version>
<hadoop.version>2.7.1</hadoop.version>
<jedis.version>2.9.0</jedis.version>
<commons-pool2.version>2.4.2</commons-pool2.version>
<hbase.version>1.2.1</hbase.version>
<!-- plugin properties -->
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<repositories>
<repository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</repository>
</repositories>
<pluginRepositories>
<pluginRepository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</pluginRepository>
</pluginRepositories>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.4</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.specs</groupId>
<artifactId>specs</artifactId>
<version>1.2.5</version>
<scope>test</scope>
</dependency>
<!--spark-->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>${spark.version}</version>
</dependency>
<!--hadoop-->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<!--jedis-->
<dependency>
<groupId>redis.clients</groupId>
<artifactId>jedis</artifactId>
<version>${jedis.version}</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-pool2</artifactId>
<version>${commons-pool2.version}</version>
</dependency>
<!--hbase-->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>${hbase.version}</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
<args>
<arg>-target:jvm-1.5</arg>
</args>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-eclipse-plugin</artifactId>
<configuration>
<downloadSources>true</downloadSources>
<buildcommands>
<buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
</buildcommands>
<additionalProjectnatures>
<projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
</additionalProjectnatures>
<classpathContainers>
<classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
<classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
</classpathContainers>
</configuration>
</plugin>
</plugins>
</build>
<reporting>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
</configuration>
</plugin>
</plugins>
</reporting>
</project>
7.8 Use the Maven plugin to build and install the Scala project
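Once the project builds and installs, a minimal Spark program can confirm that the Spark dependencies declared in the pom above resolve and run locally. This is only a sketch; the object name and sample data are placeholders, not part of an actual project:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal local word count used purely to verify the Maven/Scala/Spark setup (Spark 1.6-style API).
object WordCountCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountCheck").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val counts = sc.parallelize(Seq("hello spark", "hello scala"))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)  // expect (hello,2), (spark,1), (scala,1)
    sc.stop()
  }
}

If this prints the word counts, the project is set up correctly.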
8. Common IntelliJ IDEA settings
8.1 Set the UI theme
After the setting is applied, it looks like this:
8.2 Set fonts and colors
To avoid changing the default configuration, first save a copy of the scheme under your own name; here it is named myself.
Here we increase the code font size to 16.
8.3 Set the code template
/**
* Project Name:${PROJECT_NAME}
* File Name:${FILE_NAME}
* Package Name:${PACKAGE_NAME}
* Date:${DATE} ${TIME}
* User:${USER}
* Description: TODO
* Copyright (c) ${YEAR}, xxx@xxx.xxx All Rights Reserved.
*/
8.4 Set the Scala SDK
8.5 Export your settings and import them anywhere
During project development you accumulate many code templates and code-style settings; these personal settings can easily be exported for later reuse. The export steps are as follows:
Anywhere else, you can then import the previously saved settings file settings.jar.
8.6 Keymap reference
See the attached "Intellij IDEA default keymap.pdf".
9. IntelliJ IDEA local development cannot resolve the logical cluster name of a Hadoop HA setup
9.1 Developing Spark locally on Windows with IntelliJ IDEA
Hadoop is installed in HA mode, and when developing Spark locally the HA cluster's logical nameservice cannot be resolved, because the local program does not know which NameNode hostnames and IPs the nameservice maps to. The fix is to append the HA parameters to the Hadoop configuration used by Spark, so that Spark running in local mode knows the Hadoop HA configuration, as follows:
val spark = SparkSession
  .builder()
  .master("local[2]")
  .appName("HtSecApp UserEvent Processor")
  .getOrCreate()
val sc = spark.sparkContext

// Tell the local client about the HA nameservice and the NameNodes behind it
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("dfs.nameservices", "mycluster")
hadoopConf.set("dfs.client.failover.proxy.provider.mycluster", "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
hadoopConf.set("dfs.ha.namenodes.mycluster", "nn1,nn2")
hadoopConf.set("dfs.namenode.rpc-address.mycluster.nn1", "192.168.77.38:9000")
hadoopConf.set("dfs.namenode.rpc-address.mycluster.nn2", "192.168.77.39:9000")
This resolves the error where the local program fails to resolve the nameservice name.
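With these parameters in place, HDFS paths can use the logical nameservice directly instead of a specific NameNode address; a small hedged example (the file path is hypothetical):

// Read a file through the logical nameservice; failover between nn1 and nn2 is handled by the proxy provider
val lines = sc.textFile("hdfs://mycluster/user/hadoop/test.txt")
println(lines.count())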
9.2 Fixing nameservice resolution for Spark on the server side
First, add parameters in spark-env.sh so that Spark knows where to load the Hadoop HA configuration files from:
export HADOOP_HOME=/home/hadoop/hadoop-2.7.1
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_USER_CLASSPATH_FIRST=true
Second, when submitting with spark-submit you can explicitly pass the --files option so that Spark reads the extra Hadoop configuration files:
./spark-submit \
--master spark://hadoop31:7077,hadoop35:7077 \
--class "com.xxx.offline.FridayReportAnalysis" \
--files "/home/hadoop/hadoop-2.7.1/etc/hadoop/core-site.xml,/home/hadoop/hadoop-2.7.1/etc/hadoop/hdfs-site.xml" \
/home/hadoop/sparkoffline/spark-offline-1.0-SNAPSHOT.jar \
20170303