Setting up a Hadoop + PySpark Environment

1. Deploying Hadoop

Set up a pseudo-distributed Hadoop environment in which all services run on a single node.

1.1 Installing the JDK

The JDK is installed from the pre-built binary package (no compilation required), obtained from the Oracle download page.

  • Download the JDK
$ cd /opt/local/src/
$ curl -o jdk-8u171-linux-x64.tar.gz  http://download.oracle.com/otn-pub/java/jdk/8u171-b11/512cd62ec5174c3487ac17c61aaa89e8/jdk-8u171-linux-x64.tar.gz?AuthParam=1529719173_f230ce3269ab2fccf20e190d77622fe1 
  • Extract the archive and configure environment variables
### Extract to the target location
$ tar -zxf jdk-8u171-linux-x64.tar.gz -C /opt/local

### Create a symlink
$ cd /opt/local/
$ ln -s jdk1.8.0_171 jdk

### Configure environment variables: add the following to the current user's ~/.bashrc
$ tail ~/.bashrc 

# Java 
export JAVA_HOME=/opt/local/jdk
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JRE_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
  • Reload the environment variables
$ source ~/.bashrc

### Verify that it took effect; Java version output indicates success
$ java -version
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)

1.2 Configuring /etc/hosts

### Map the hostname to its IP address in /etc/hosts
$ head -n 3 /etc/hosts
# ip --> hostname or domain
192.168.20.10    node

### Verify
$ ping node -c 2
PING node (192.168.20.10) 56(84) bytes of data.
64 bytes from node (192.168.20.10): icmp_seq=1 ttl=64 time=0.063 ms
64 bytes from node (192.168.20.10): icmp_seq=2 ttl=64 time=0.040 ms

1.3 Setting up passwordless SSH login

  • Generate an SSH key
### Generate an SSH key
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
  • Add the public key to authorized_keys
### A password prompt appears here
$ ssh-copy-id node

### Verify the login; success means no password is required
$ ssh node

1.4 Installing and configuring Hadoop

  • Download Hadoop
### Download Hadoop 2.7.6
$ cd /opt/local/src/
$ wget -c http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-2.7.6/hadoop-2.7.6.tar.gz
  • Create the Hadoop data directories
$ mkdir -p /opt/local/hdfs/{namenode,datanode,tmp}
$ tree /opt/local/hdfs/
/opt/local/hdfs/
├── datanode
├── namenode
└── tmp
  • Extract the Hadoop archive
### Extract to the target location
$ cd /opt/local/src/
$ tar -zxf hadoop-2.7.6.tar.gz -C /opt/local/

### Create a symlink
$ cd /opt/local/
$ ln -s hadoop-2.7.6 hadoop

1.5 Configuring Hadoop

1.5.1 Configuring core-site.xml

$ vim /opt/local/hadoop/etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/local/hdfs/tmp/</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://node:9000</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
</configuration>

1.5.2 Configuring hdfs-site.xml

$ vim /opt/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/local/hdfs/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/local/hdfs/datanode</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
</configuration>

1.5.3 Configuring mapred-site.xml

### mapred-site.xml must be created by copying the template and then editing it
$ cp /opt/local/hadoop/etc/hadoop/mapred-site.xml.template  /opt/local/hadoop/etc/hadoop/mapred-site.xml
$ vim /opt/local/hadoop/etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>node:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>node:19888</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.done-dir</name>
        <value>/history/done</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.intermediate-done-dir</name>
        <value>/history/done_intermediate</value>
    </property>
</configuration>

1.5.4 Configuring yarn-site.xml

$ vim /opt/local/hadoop/etc/hadoop/yarn-site.xml
<configuration>

<!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>node</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>node:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>node:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>node:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>node:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>node:8088</value>
    </property>
    <property>  
        <name>yarn.log-aggregation-enable</name>  
        <value>true</value>  
    </property> 
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
</configuration>

1.5.5 Configuring slaves

$ cat  /opt/local/hadoop/etc/hadoop/slaves 
node

1.5.6 Configuring master

$ cat  /opt/local/hadoop/etc/hadoop/master
node

1.5.7 Configuring hadoop-env

$ vim  /opt/local/hadoop/etc/hadoop/hadoop-env.sh
### Set JAVA_HOME
export JAVA_HOME=/opt/local/jdk

1.5.8 Configuring yarn-env

$ vim  /opt/local/hadoop/etc/hadoop/yarn-env.sh
### Set JAVA_HOME
export JAVA_HOME=/opt/local/jdk

1.5.9 Configuring mapred-env

$ vim  /opt/local/hadoop/etc/hadoop/mapred-env.sh
### Set JAVA_HOME
export JAVA_HOME=/opt/local/jdk

1.5.10 Configuring Hadoop environment variables

  • Add the Hadoop settings

Add the following Hadoop environment variables to ~/.bashrc:

# hadoop
export HADOOP_HOME=/opt/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME 
export HADOOP_COMMON_HOME=$HADOOP_HOME 
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
  • Apply the configuration
$ source ~/.bashrc

### Verify
$ hadoop version
Hadoop 2.7.6
Subversion https://shv@git-wip-us.apache.org/repos/asf/hadoop.git -r 085099c66cf28be31604560c376fa282e69282b8
Compiled by kshvachk on 2018-04-18T01:33Z
Compiled with protoc 2.5.0
From source with checksum 71e2695531cb3360ab74598755d036
This command was run using /opt/local/hadoop-2.7.6/share/hadoop/common/hadoop-common-2.7.6.jar

1.6 Formatting the HDFS filesystem

### Format HDFS; use with caution if data already exists, since this deletes it
$ hadoop namenode -format

### The NameNode storage directory now contains data
$ ls /opt/local/hdfs/namenode/
current

1.7 Starting Hadoop

Starting Hadoop mainly involves HDFS (NameNode, DataNode) and YARN (ResourceManager, NodeManager). Everything can be started with start-all.sh and stopped with stop-all.sh; individual services can also be started separately.

1.7.1 Starting DFS

Starting DFS covers the NameNode and DataNode services. They can be started together with start-dfs.sh; below they are started one at a time.

1.7.1.1 Starting the NameNode
### Start the namenode
$ hadoop-daemon.sh start namenode
starting namenode, logging to /opt/local/hadoop-2.7.6/logs/hadoop-hadoop-namenode-node.out

### Check the processes
$ jps
7547 Jps
7500 NameNode

### Start the SecondaryNameNode
$ hadoop-daemon.sh start secondarynamenode
starting secondarynamenode, logging to /opt/local/hadoop-2.7.6/logs/hadoop-hadoop-secondarynamenode-node.out

### Check the processes
$ jps
10001 SecondaryNameNode
10041 Jps
9194 NameNode
1.7.1.2 Starting the DataNode
### Start the datanode
$ hadoop-daemon.sh start datanode
starting datanode, logging to /opt/local/hadoop-2.7.6/logs/hadoop-hadoop-datanode-node.out

### Check the processes
$ jps
7607 DataNode
7660 Jps
7500 NameNode
10001 SecondaryNameNode

1.7.2 Starting YARN

Starting YARN covers the ResourceManager and NodeManager. They can be started together with start-yarn.sh; below they are started one at a time.

1.7.2.1 Starting the ResourceManager
### Start the resourcemanager
$ yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /opt/local/hadoop-2.7.6/logs/yarn-hadoop-resourcemanager-node.out

### Check the processes
$ jps
7607 DataNode
7993 Jps
7500 NameNode
7774 ResourceManager
10001 SecondaryNameNode
1.7.2.2 Starting the NodeManager
### Start the nodemanager
$ yarn-daemon.sh start nodemanager
starting nodemanager, logging to /opt/local/hadoop-2.7.6/logs/yarn-hadoop-nodemanager-node.out

### Check the processes
$ jps
7607 DataNode
8041 NodeManager
8106 Jps
7500 NameNode
7774 ResourceManager
10001 SecondaryNameNode

1.7.3 Starting the HistoryServer

### Start the historyserver
$ mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /opt/local/hadoop/logs/mapred-hadoop-historyserver-node.out

### Check the processes
$ jps
8278 JobHistoryServer
7607 DataNode
8041 NodeManager
7500 NameNode
8317 Jps
7774 ResourceManager
10001 SecondaryNameNode

1.7.4 Overview of Hadoop components

After Hadoop starts, the main components are:

  • HDFS: NameNode, SecondaryNameNode, DataNode
  • YARN: ResourceManager, NodeManager
  • HistoryServer: JobHistoryServer

1.8 Basic Hadoop operations

1.8.1 Common Hadoop commands

Command                    Description
hadoop fs -mkdir           Create an HDFS directory
hadoop fs -ls              List an HDFS directory
hadoop fs -copyFromLocal   Copy a local file to HDFS
hadoop fs -put             Copy a local file to HDFS; put can also read from stdin
hadoop fs -cat             Print the contents of an HDFS file
hadoop fs -copyToLocal     Copy a file from HDFS to the local filesystem
hadoop fs -get             Copy a file from HDFS to the local filesystem
hadoop fs -cp              Copy files within HDFS
hadoop fs -rm              Delete an HDFS file or directory (with -R)
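These commands can also be driven from scripts. The sketch below is illustrative rather than part of the original setup: it assumes Python is available and that the hadoop binary configured above is on the PATH, and simply shells out to hadoop fs.

#### hdfs_ls.py - minimal sketch: call "hadoop fs" from Python via subprocess.
#### Assumes the hadoop binary configured above is on PATH.
import subprocess

def hdfs_ls(path="/"):
    """Return the raw output of `hadoop fs -ls <path>` as a string."""
    return subprocess.check_output(["hadoop", "fs", "-ls", path]).decode("utf-8")

if __name__ == "__main__":
    print(hdfs_ls("/user"))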

1.8.2 Hadoop command examples

1.8.2.1 Basic commands
  • Create a directory
$ hadoop fs -mkdir /user/hadoop
  • Create multiple directories
$ hadoop fs -mkdir -p /user/hadoop/{input,output} 
  • List HDFS directories
$ hadoop fs -ls /
Found 2 items
drwxrwx---   - hadoop supergroup          0 2018-06-23 12:20 /history
drwxr-xr-x   - hadoop supergroup          0 2018-06-23 13:20 /user
$ hadoop fs -ls /user
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2018-06-23 13:20 /user/hadoop
  • List all directories recursively
$ hadoop fs -ls -R /
drwxrwx---   - hadoop supergroup          0 2018-06-23 12:20 /history
drwxrwx---   - hadoop supergroup          0 2018-06-23 12:20 /history/done
drwxrwxrwt   - hadoop supergroup          0 2018-06-23 12:20 /history/done_intermediate
drwxr-xr-x   - hadoop supergroup          0 2018-06-23 13:20 /user
drwxr-xr-x   - hadoop supergroup          0 2018-06-23 13:24 /user/hadoop
drwxr-xr-x   - hadoop supergroup          0 2018-06-23 13:24 /user/hadoop/input
drwxr-xr-x   - hadoop supergroup          0 2018-06-23 13:24 /user/hadoop/output
  • Upload a local file to HDFS
$ hadoop fs -copyFromLocal /opt/local/hadoop/README.txt /user/hadoop/input
  • View the contents of a file on HDFS
$ hadoop fs -cat /user/hadoop/input/README.txt
  • Download a file from HDFS to the local filesystem
$ hadoop fs -get /user/hadoop/input/README.txt ./
  • Delete a file or directory
### Deleting a file prints a message
$ hadoop fs -rm /user/hadoop/input/examples.desktop
18/06/23 13:47:06 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/hadoop/input/examples.desktop

### Delete a directory
$ hadoop fs -rm -R /user/hadoop
18/06/23 13:48:17 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/hadoop
1.8.2.2 Running a MapReduce job

Use Hadoop's built-in wordcount program to count words.

  • Run the job
$ hadoop fs -put /opt/local/hadoop/README.txt /user/input
$ cd /opt/local/hadoop/share/hadoop/mapreduce
#### hadoop jar <jar file> <class> <input path> <output directory>
$ hadoop jar hadoop-mapreduce-examples-2.7.6.jar wordcount /user/input/ /user/output/wordcount
  • Check the running job
#### Check the current job status; it can also be viewed at http://node:8088
$ yarn application -list
18/06/23 13:55:34 INFO client.RMProxy: Connecting to ResourceManager at node/192.168.20.10:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
                Application-Id      Application-Name        Application-Type          User       Queue               State         Final-State         Progress                        Tracking-URL
application_1529732240998_0001            word count               MAPREDUCE        hadoop     default             RUNNING           UNDEFINED               5%                   http://node:41713
  • View the results
#### _SUCCESS indicates success; files whose names start with part hold the results
$ hadoop fs -ls /user/output/wordcount
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2018-06-23 13:55 /user/output/wordcount/_SUCCESS
-rw-r--r--   1 hadoop supergroup       1306 2018-06-23 13:55 /user/output/wordcount/part-r-00000

#### View the contents
$ hadoop fs -cat /user/output/wordcount/part-r-00000|tail
uses    1
using   2
visit   1
website 1
which   2
wiki,   1
with    1
written 1
you 1
your    1

1.9 Hadoop web UIs

  • The Hadoop NameNode HDFS web UI shows the current state of HDFS and the DataNodes
http://node:50070
  • The Hadoop ResourceManager web UI shows node status and the state of running applications and tasks (a scripted check of both UIs follows this list)
http://node:8088
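Both web UIs also expose REST endpoints, which makes them scriptable. The sketch below is an illustrative check, not part of the original text: it assumes the Python requests package is installed, and relies on WebHDFS (enabled earlier via dfs.webhdfs.enabled) and the ResourceManager REST API on the ports configured above.

#### check_cluster.py - minimal sketch: query the HDFS and YARN REST APIs.
#### Assumes `pip install requests`; hosts and ports follow the configuration above.
import requests

# WebHDFS: list the HDFS root directory (NameNode HTTP port 50070)
listing = requests.get("http://node:50070/webhdfs/v1/?op=LISTSTATUS").json()
for entry in listing["FileStatuses"]["FileStatus"]:
    print(entry["type"], entry["pathSuffix"])

# YARN ResourceManager REST API: basic cluster metrics (port 8088)
metrics = requests.get("http://node:8088/ws/v1/cluster/metrics").json()
print("active nodes:", metrics["clusterMetrics"]["activeNodes"])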

2. Deploying Spark

2.1 Scala overview and installation

2.1.1 Scala overview

Spark is written in Scala (official site: https://www.scala-lang.org/), so Scala must be installed first. Scala has the following characteristics:

  • Scala compiles to Java bytecode, so it runs on the JVM (Java Virtual Machine) and is therefore cross-platform;
  • Existing Java libraries can be used directly, so the rich Java open-source ecosystem remains available;
  • Scala is a functional language: functions are values, on the same footing as integers and strings, and can be passed as arguments to other functions;
  • Scala is also a pure object-oriented language: everything is an object and every operation is a method.

2.1.2 Installing Scala

Scala can be downloaded from https://www.scala-lang.org/files/archive/
Starting with Spark 2.0, Spark is built with Scala 2.11 by default, so download a scala-2.11 release.

  • Download Scala
$ cd /opt/local/src/
$ wget -c https://www.scala-lang.org/files/archive/scala-2.11.11.tgz
  • Extract the Scala archive
#### Extract to the target location and create a symlink
$ tar -zxf scala-2.11.11.tgz -C /opt/local/
$ cd /opt/local/
$ ln -s scala-2.11.11 scala
  • Configure the Scala environment variables
#### Add the following to ~/.bashrc
$ tail -n 5 ~/.bashrc
# scala
export SCALA_HOME=/opt/local/scala
export PATH=$PATH:$SCALA_HOME/bin

#### Apply the configuration
$ source ~/.bashrc

#### Verify
$ scala -version
Scala code runner version 2.11.11 -- Copyright 2002-2017, LAMP/EPFL

2.2 Installing Spark

2.2.1 Downloading Spark

The Spark download page is http://spark.apache.org/downloads.html; choose the package built for Hadoop 2.7 and later.

$ cd /opt/local/src/
$ wget -c http://mirror.bit.edu.cn/apache/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz

2.2.2 Extracting and configuring Spark

  • Extract Spark to the target directory and create a symlink
$ tar zxf spark-2.3.1-bin-hadoop2.7.tgz -C /opt/local/
$ cd /opt/local/
$ ln -s spark-2.3.1-bin-hadoop2.7 spark
  • Configure the Spark environment variables
$ tail -n 5 ~/.bashrc 
# spark
export SPARK_HOME=/opt/local/spark
export PATH=$PATH:$SPARK_HOME/bin
  • Apply the environment variables
$ source ~/.bashrc

2.3 Running pyspark

2.3.1 Running pyspark locally

Typing pyspark in a terminal starts Spark's Python interface; on startup it prints the Python and Spark version information.

pyspark --master local[4]: local[N] means run locally with N threads; local[*] uses as many CPU cores as possible.
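The master can also be set programmatically instead of on the pyspark command line, which is handy for standalone scripts. A minimal sketch (an illustration; it assumes the pyspark libraries are importable, e.g. when the script is launched with spark-submit or after pip install pyspark):

#### local_demo.py - minimal sketch: choose the master when building the SparkSession.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[4]")          # run locally with 4 threads
         .appName("local-demo")
         .getOrCreate())
print(spark.sparkContext.master)      # prints local[4]
spark.stop()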

  • Start pyspark locally
$ pyspark 
Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
2018-06-23 19:25:00 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 2.7.12 (default, Dec  4 2017 14:50:18)
SparkSession available as 'spark'.
  • Check the current master
>>> sc.master
u'local[*]'
  • Read a local file
>>> textFile=sc.textFile("file:/opt/local/spark/README.md")
>>> textFile.count()
103
  • Read an HDFS file (a word-count sketch on this RDD follows below)
>>> textFile=sc.textFile("hdfs://node:9000/user/input/README.md")
>>> textFile.count()
103
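As a quick end-to-end check, the README RDD read above can be word-counted with plain RDD operations, the PySpark counterpart of the MapReduce wordcount run in section 1.8.2.2. A minimal sketch typed into the same session (output omitted):

>>> words = textFile.flatMap(lambda line: line.split())
>>> counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
>>> counts.takeOrdered(5, key=lambda kv: -kv[1])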

2.3.2 Running Spark on Hadoop YARN

Spark can run on Hadoop YARN and let YARN handle resource management.

HADOOP_CONF_DIR=/opt/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client

  • HADOOP_CONF_DIR=/opt/local/hadoop/etc/hadoop sets the Hadoop configuration directory;
  • pyspark is the program being run;
  • --master yarn --deploy-mode client selects the YARN-Client mode.

$ HADOOP_CONF_DIR=/opt/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
2018-06-23 20:27:48 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-06-23 20:27:52 WARN  Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 2.7.12 (default, Dec  4 2017 14:50:18)
SparkSession available as 'spark'.
  • Check the current master
>>> sc.master
u'yarn'
  • Read an HDFS file
>>> textFile=sc.textFile("hdfs://node:9000/user/input/README.md")
>>> textFile.count()
103  
  • Check the job with yarn
#### It can also be viewed in the web UI: http://node:8088
$ yarn application -list
18/06/23 20:34:40 INFO client.RMProxy: Connecting to ResourceManager at node/192.168.20.10:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
                Application-Id      Application-Name        Application-Type          User       Queue               State         Final-State         Progress                        Tracking-URL
application_1529756801315_0001          PySparkShell                   SPARK        hadoop     default             RUNNING           UNDEFINED              10%                    http://node:4040
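The YARN client mode can likewise be selected from a standalone script rather than the pyspark command line. A minimal sketch (illustrative; it assumes HADOOP_CONF_DIR is exported as above and that the script is launched with spark-submit):

#### yarn_demo.py - minimal sketch: build a SparkSession against YARN.
#### Assumes HADOOP_CONF_DIR points at /opt/local/hadoop/etc/hadoop.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("yarn")              # client deploy mode is the default
         .appName("yarn-demo")
         .getOrCreate())
print(spark.sparkContext.master)      # prints yarn
print(spark.read.text("hdfs://node:9000/user/input/README.md").count())
spark.stop()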

2.3.3 Running Spark on a Spark Standalone Cluster

Set up a pseudo-distributed Spark Standalone Cluster in which all services run on the same node.

2.3.3.1 Configuring spark-env.sh
  • Create spark-env.sh by copying the template file
$ cp /opt/local/spark/conf/spark-env.sh.template /opt/local/spark/conf/spark-env.sh
  • Configure spark-env.sh
$ tail -n 6 /opt/local/spark/conf/spark-env.sh
#### Spark Standalone Cluster
export JAVA_HOME=/opt/local/jdk
export SPARK_MASTER_HOST=node
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=512m
export SPARK_WORKER_INSTANCES=1
2.3.3.2 Configuring slaves
#### Edit the file directly, or copy it from the template
$ tail /opt/local/spark/conf/slaves
node
2.3.3.3 Running pyspark on the Spark Standalone Cluster
2.3.3.3.1 Starting the Spark Standalone Cluster

The Spark Standalone Cluster can be started with a single script, ${SPARK_HOME}/sbin/start-all.sh, which launches all services; the master and slaves can also be started one at a time.

  • Start the master
$ /opt/local/spark/sbin/start-master.sh 
starting org.apache.spark.deploy.master.Master, logging to /opt/local/spark/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-node.out
$ jps
4185 Master
  • Start the slaves
$ /opt/local/spark/sbin/start-slaves.sh 
node: starting org.apache.spark.deploy.worker.Worker, logging to /opt/local/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-node.out
$ jps
4185 Master
4313 Worker
  • Check the cluster status at http://node:8080 (a JSON-based check follows the page dump below)
$ w3m http://node:8080/
[spark-logo] 2.3.1 Spark Master at spark://node:7077

  • URL: spark://node:7077
  • REST URL: spark://node:6066 (cluster mode)
  • Alive Workers: 1
  • Cores in use: 1 Total, 0 Used
  • Memory in use: 256.0 MB Total, 0.0 B Used
  • Applications: 0 Running, 0 Completed
  • Drivers: 0 Running, 0 Completed
  • Status: ALIVE

Workers (1)

Worker Id                                   Address              State  Cores       Memory
worker-20180624102100-192.168.20.10-42469   192.168.20.10:42469  ALIVE  1 (0 Used)  512.0 MB (0.0 B Used)

Running Applications (0)

Application ID  Name  Cores  Memory per Executor  Submitted Time  User  State  Duration

Completed Applications (0)

Application ID  Name  Cores  Memory per Executor  Submitted Time  User  State  Duration
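Besides browsing the page, the standalone master can be polled programmatically: its web UI also serves a JSON view at /json. A hedged sketch (assumes the Python requests package; the exact field names can vary between Spark versions):

#### master_status.py - hedged sketch: read the standalone master status as JSON.
import requests

info = requests.get("http://node:8080/json").json()
print("status :", info.get("status"))
print("workers:", len(info.get("workers", [])))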
2.3.3.3.2 Running pyspark on the Spark Standalone Cluster
  • Run pyspark
$ pyspark --master spark://node:7077 --num-executors 1 --total-executor-cores 1 --executor-memory 512m
Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
2018-06-24 10:39:09 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 2.7.12 (default, Dec  4 2017 14:50:18)
SparkSession available as 'spark'.
  • Check the current master
>>> sc.master
u'spark://node:7077'
  • Read a local file
>>> textFile=sc.textFile("file:/opt/local/spark/README.md")
>>> textFile.count()
103
  • Read an HDFS file (a batch word-count sketch follows below)
>>> textFile=sc.textFile("hdfs://node:9000/user/input/README.md")
>>> textFile.count()
103
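Interactive sessions aside, the same cluster can run batch jobs. Below is a minimal word-count script as an illustration; the output path /user/output/pyspark-wordcount is an example name, and the script would be submitted with something like spark-submit --master spark://node:7077 wordcount.py.

#### wordcount.py - minimal sketch: batch word count on the standalone cluster.
#### The output path is an example name, not part of the original setup.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-wordcount").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs://node:9000/user/input/README.md")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs://node:9000/user/output/pyspark-wordcount")

spark.stop()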

2.4 Summary

Spark can run in several modes: standalone cluster, YARN, Mesos, and local mode. The master values are summarized below.

Master value        Description
spark://host:port   Spark standalone cluster; default port 7077
yarn                YARN cluster; set HADOOP_CONF_DIR to the Hadoop configuration directory so the cluster can be located
mesos://host:port   Mesos cluster; default port 5050
local               Local mode, 1 core
local[n]            Local mode, n cores
local[*]            Local mode, as many cores as possible
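Since the master is just a string, one script can target any of the cluster managers above. A minimal sketch that takes the master URL as a command-line argument (an illustrative pattern, not from the original text):

#### any_master.py - minimal sketch: pass the master URL as an argument.
#### Example: spark-submit any_master.py local[2]
import sys
from pyspark.sql import SparkSession

master = sys.argv[1] if len(sys.argv) > 1 else "local[*]"
spark = SparkSession.builder.master(master).appName("any-master").getOrCreate()
print("connected to:", spark.sparkContext.master)
spark.stop()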