Spark:
(1) A fast, scalable engine for processing massive datasets.
(2) Written in Scala.
(3) Provides interactive shells (spark-shell) so developers can learn the API or explore data.
(4) Spark applications for large-scale data processing can be written in Python, Java, R, or Scala (see the sketch after this list).
(5) A low-latency distributed cluster computing system aimed at very large datasets, roughly 40x faster than MapReduce on many workloads.
(6) Spark can be seen as the next step after Hadoop MapReduce: the first generation relied on HDFS and batch Map/Reduce jobs, later generations added caching of intermediate results and more proactive Map/Reduce task scheduling, and Spark pushes further with in-memory computation and the streaming model it champions.
(7) Spark is compatible with the Hadoop APIs and can read and write HDFS, HBase, SequenceFiles, and so on.
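As a taste of point (4), here is a minimal sketch of a standalone Scala word-count application; the input path hdfs:///tmp/input.txt is a hypothetical placeholder, and spark-sql 2.3.1 is assumed on the classpath:

import org.apache.spark.sql.SparkSession

// Minimal word-count sketch; the HDFS path below is a placeholder.
object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    val counts = spark.sparkContext
      .textFile("hdfs:///tmp/input.txt")   // hypothetical input file
      .flatMap(_.split("\\s+"))            // split each line into words
      .map(word => (word, 1))
      .reduceByKey(_ + _)                  // sum the counts per word
    counts.take(10).foreach(println)
    spark.stop()
  }
}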
Note: before installing Spark on Linux, Hadoop must already be deployed and Scala installed.
Versions used in this walkthrough:

| Name   | Version     |
| ------ | ----------- |
| JDK    | 1.8.0_151   |
| hadoop | 2.6.3.0-235 |
| scala  | 2.11.0      |
| spark  | 2.3.1       |
1. Download
Official site: http://spark.apache.org/downloads.html
Or use the mirror: https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.3.1/
Download the release used here: spark-2.3.1-bin-hadoop2.6.tgz
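For example, fetch it directly from the mirror (URL assembled from the mirror path and file name above):
[root@wugenqiang ~]# wget https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.6.tgz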
2. Extract
Command: tar -zxvf spark-2.3.1-bin-hadoop2.6.tgz
[root@wugenqiang ~]# ls
anaconda-ks.cfg metastore_db spark-2.3.1-bin-hadoop2.6.tgz
derby.log original-ks.cfg wugenqiang.hello
[root@wugenqiang ~]# tar xvfz spark-2.3.1-bin-hadoop2.6.tgz
3. Move to the install location
[root@wugenqiang ~]# cd /usr/local
[root@wugenqiang local]# ls
bin etc games include lib lib64 libexec sbin share src
[root@wugenqiang local]# mv /root/spark-2.3.1-bin-hadoop2.6 /usr/local
[root@wugenqiang local]# ls
bin games lib libexec share src
etc include lib64 sbin spark-2.3.1-bin-hadoop2.6
4. Configure environment variables
(1) Edit the profile: vim /etc/profile
[root@wugenqiang ~]# vim /etc/profile
Add:
export SPARK_HOME=/usr/local/spark-2.3.1-bin-hadoop2.6
export PATH=${PATH}:$SPARK_HOME/bin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
(2) Apply the changes:
[root@wugenqiang ~]# source /etc/profile
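To confirm the variables took effect, check SPARK_HOME and ask Spark for its version (the echo output simply reflects the value set above):
[root@wugenqiang ~]# echo $SPARK_HOME
/usr/local/spark-2.3.1-bin-hadoop2.6
[root@wugenqiang ~]# spark-submit --version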
5. Using the rpm packages
(1) Install the packages. Note: these come from the cluster's existing repository and provide Spark 2.2.0.2.6.3.0-235 (apparently part of the HDP 2.6.3 stack, matching the hadoop version above), so they coexist with the 2.3.1 tarball:
[root@wugenqiang ~]# yum install -y spark2 spark2-python
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
Package spark2-2.2.0.2.6.3.0-235.noarch already installed and latest version
Package spark2-python-2.2.0.2.6.3.0-235.noarch already installed and latest version
Nothing to do
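With two Spark installations now present (the 2.3.1 tarball and the 2.2.0 rpm packages), which one a bare spark-shell command launches depends on PATH order; which reveals the winner:
[root@wugenqiang ~]# which spark-shell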
6. Enter the Spark install directory, then its bin folder
[root@wugenqiang ~]# cd /usr/local/spark-2.3.1-bin-hadoop2.6/
[root@wugenqiang spark-2.3.1-bin-hadoop2.6]# cd bin
[root@wugenqiang bin]# ls
beeline pyspark.cmd spark-shell
beeline.cmd run-example spark-shell2.cmd
docker-image-tool.sh run-example.cmd spark-shell.cmd
find-spark-home spark-class spark-sql
find-spark-home.cmd spark-class2.cmd spark-sql2.cmd
load-spark-env.cmd spark-class.cmd spark-sql.cmd
load-spark-env.sh sparkR spark-submit
pyspark sparkR2.cmd spark-submit2.cmd
pyspark2.cmd sparkR.cmd spark-submit.cmd
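Besides the interactive shells (spark-shell for Scala, pyspark for Python, sparkR for R, spark-sql for SQL), the listing includes spark-submit, which runs packaged applications. A minimal sketch, assuming the WordCount object from the earlier example has been packaged into a hypothetical wordcount.jar:
[root@wugenqiang bin]# ./spark-submit --class WordCount --master local[*] /root/wordcount.jar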
(1) Run spark-shell for interactive Scala. Note that the banner below reports version 2.2.0.2.6.3.0-235: invoked without a ./ prefix, spark-shell is resolved through PATH, so this session apparently comes from the rpm-installed Spark 2.2.0 rather than the 2.3.1 build in the current directory (use ./spark-shell to force the local copy).
[root@wugenqiang bin]# spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/07/27 11:11:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://192.168.75.213:4040
Spark context available as 'sc' (master = local[*], app id = local-1532661080069).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.2.6.3.0-235
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
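At the scala> prompt, a quick smoke test (a minimal in-memory example; the REPL output shown is what Spark 2.x typically prints):

scala> val rdd = sc.parallelize(1 to 100)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.sum
res0: Double = 5050.0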
(2) Run pyspark for interactive Python:
[root@wugenqiang ~]# source /etc/profile
[root@wugenqiang ~]# pyspark
Python 2.7.5 (default, Aug 4 2017, 00:39:18)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
18/07/27 11:47:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/
Using Python version 2.7.5 (default, Aug 4 2017 00:39:18)
SparkSession available as 'spark'.
>>>
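The same smoke test at the >>> prompt (pyspark predefines sc, the SparkContext, alongside spark):

>>> sc.parallelize(range(1, 101)).sum()
5050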
7. Adjust the log level to control output verbosity
In the conf directory, copy log4j.properties.template to log4j.properties, then find the line log4j.rootCategory=INFO, console
[root@wugenqiang spark-2.3.1-bin-hadoop2.6]# ls
bin data jars LICENSE NOTICE R RELEASE yarn
conf examples kubernetes licenses python README.md sbin
[root@wugenqiang spark-2.3.1-bin-hadoop2.6]# cd conf
[root@wugenqiang conf]# ls
docker.properties.template slaves.template
fairscheduler.xml.template spark-defaults.conf.template
log4j.properties.template spark-env.sh.template
metrics.properties.template
[root@wugenqiang conf]# cp log4j.properties.template log4j.properties
[root@wugenqiang conf]# ls
docker.properties.template metrics.properties.template
fairscheduler.xml.template slaves.template
log4j.properties spark-defaults.conf.template
log4j.properties.template spark-env.sh.template
Change INFO to WARN (any other level also works):
[root@wugenqiang conf]# vim log4j.properties
log4j.rootCategory=INFO, console
becomes:
log4j.rootCategory=WARN, console
After this change, newly started shells print far less log output.
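As the spark-shell banner itself suggests, the level can also be changed per session, without editing any file:
scala> sc.setLogLevel("WARN")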
8. Open the Web UI
While a shell or application is running, the Spark context Web UI (reported in the spark-shell banner above) is available at:
http://192.168.75.213:4040