Deploying a Spark 1.6.0 Cluster with Docker
Author: fiisio | Published 2016-03-30
Tags: Docker, Spark, Big Data

Spark on Docker

1. Preparation

      Docker runs well on Linux (CentOS, Ubuntu, and others); either distribution will do. Install the latest stable release to avoid unnecessary trouble. Most existing tutorials are based on Spark releases before 1.4; deploying 1.6.0 is not very different, but there are still a few issues to watch for.

      The environment used here:

  • Docker 1.6.2
  • Ubuntu 14.04 Desktop
  • Spark 1.6.0

      After installing Docker you can pull a Spark image; sequenceiq/spark:1.6.0 is recommended. Before pulling, make sure the machine can reach external HTTPS sites: because of our company's server proxy I wasted a day and a half on this and eventually gave up.

      Installing Docker itself is not covered here; there are plenty of tutorials for that. In addition, using Docker avoids many of the problems of a direct deployment, such as Java and Python version mismatches.

    2. Pull the Spark image

      docker pull sequenceiq/spark:1.6.0

    If the network is fine, this downloads an image of about 2.87 GB, so expect a long wait. If you cannot reach HTTPS directly, configure Docker's own proxy by adding http_proxy and https_proxy to /etc/default/docker; the Docker daemon uses its own proxy settings, so a system-wide proxy may not take effect. A minimal example follows.
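    A minimal sketch of /etc/default/docker on Ubuntu 14.04 (the proxy host and port are placeholders; replace them with your own, then restart the daemon):

      # /etc/default/docker -- proxy settings read by the Docker daemon
      export http_proxy="http://proxy.example.com:8080/"
      export https_proxy="http://proxy.example.com:8080/"

      # restart the daemon so the settings take effect
      sudo service docker restart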

    When the pull succeeds, Docker prints the image ID and status, which means the image has been downloaded; you can confirm it with docker images, as shown below.
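    A hedged verification step (not in the original write-up):

      docker images | grep sequenceiq/spark
      # the output should include a row for sequenceiq/spark with tag 1.6.0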

    3. Build the image

      docker build --rm -t sequenceiq/spark:1.6.0 .

      Before running this, download the Dockerfile used on Docker Hub into the directory where you run the command: https://github.com/sequenceiq/docker-spark/blob/master/Dockerfile . One way to fetch it is sketched below.
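    A sketch of fetching the raw Dockerfile and building from it (it assumes the raw.githubusercontent.com mirror of the path above is still available):

      # download the Dockerfile into the current directory
      wget https://raw.githubusercontent.com/sequenceiq/docker-spark/master/Dockerfile
      # then build from it
      docker build --rm -t sequenceiq/spark:1.6.0 .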

    4. Start the containers

    I set up one master and two workers. Create the master container:

      docker run --name master -it -p 8088:8088 -p 8042:8042 -p 8085:8080 -p 4040:4040 -p 7077:7077 -p 2022:22 -v /data:/data -h master sequenceiq/spark:1.6.0 bash

    Notes: --name sets the container's name; -it allocates an interactive terminal; -p maps a host port to a container port; -v mounts a local directory into the container, so the host's /data directory is shared with the container; -h sets the container's hostname. Now create the two worker containers (run each in its own terminal):

      docker run --name worker1 -it -h worker1 sequenceiq/spark:1.6.0 bash
      docker run --name worker2 -it -h worker2 sequenceiq/spark:1.6.0 bash

    The three containers are now up. Ping them from one another; if the pings fail, turn off the firewall inside each container with service iptables stop (this must be redone after every restart). A way to look up the container IPs and test connectivity is sketched below.
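    A hedged way to find each container's bridge IP from the host and then test connectivity from inside the master (the 172.17.0.x addresses depend on your Docker bridge and may differ):

      # on the host: print each container's IP address
      docker inspect -f '{{ .NetworkSettings.IPAddress }}' master worker1 worker2

      # inside the master container: make sure the workers answer
      ping -c 2 172.17.0.2
      ping -c 2 172.17.0.3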

    5. Configure the cluster

          

    5.1 Stop the Hadoop cluster in each container

      cd /usr/local/hadoop-2.6.0/sbin/
      ./stop-all.sh

    5.2 Configure /etc/hosts in each container (must be redone after every restart)

      vi /etc/hosts

      172.17.0.1 master
      172.17.0.2 worker1
      172.17.0.3 worker2
      127.0.0.1 localhost
      ::1 localhost ip6-localhost ip6-loopback
      fe00::0 ip6-localnet
      ff00::0 ip6-mcastprefix
      ff02::1 ip6-allnodes
      ff02::2 ip6-allrouters

    5.3 Configure Hadoop's slaves file in each container

      vi /usr/local/hadoop-2.6.0/etc/hadoop/slaves

      Set it to:

      worker1
      worker2

    5.4 Configure core-site.xml in each container

      vi /usr/local/hadoop-2.6.0/etc/hadoop/core-site.xml

      Set it to:

      <configuration>
        <property><name>fs.defaultFS</name><value>hdfs://master:9000</value></property>
      </configuration>

    5.5 Configure Spark's slaves file in each container

      cd /usr/local/spark-1.6.0-bin-hadoop2.6/conf
      cp slaves.template slaves
      vi slaves

      Set it to:

      worker1
      worker2

    5.6 Configure Spark's spark-env.sh in each container

      cd /usr/local/spark-1.6.0-bin-hadoop2.6/conf/
      cp spark-env.sh.template spark-env.sh
      vi spark-env.sh

      Add the following:

      export JAVA_HOME=/usr/java/default
      export SPARK_MASTER_IP=master
      export SPARK_WORKER_CORES=1
      # export SPARK_WORKER_INSTANCES=1
      export SPARK_EXECUTOR_INSTANCES=3
      export SPARK_MASTER_PORT=7077
      export SPARK_WORKER_MEMORY=1g
      export MASTER=spark://${SPARK_MASTER_IP}:${SPARK_MASTER_PORT}

    5.7 Configure Spark's YARN client files in each container (must be redone after every restart)

      cd /usr/local/spark-1.6.0-bin-hadoop2.6/yarn-remote-client/

    1> Edit core-site.xml:

      <configuration>
        <property><name>fs.default.name</name><value>hdfs://master:9000</value></property>
        <property><name>dfs.client.use.legacy.blockreader</name><value>true</value></property>
      </configuration>

    2> Edit yarn-site.xml:

      <configuration>
        <property><name>yarn.resourcemanager.scheduler.address</name><value>master:8030</value></property>
        <property><name>yarn.resourcemanager.address</name><value>master:8032</value></property>
        <property><name>yarn.resourcemanager.webapp.address</name><value>master:8088</value></property>
        <property><name>yarn.resourcemanager.resource-tracker.address</name><value>master:8031</value></property>
        <property><name>yarn.resourcemanager.admin.address</name><value>master:8033</value></property>
        <property><name>yarn.application.classpath</name><value>/usr/local/hadoop/etc/hadoop, /usr/local/hadoop/share/hadoop/common/*, /usr/local/hadoop/share/hadoop/common/lib/*, /usr/local/hadoop/share/hadoop/hdfs/*, /usr/local/hadoop/share/hadoop/hdfs/lib/*, /usr/local/hadoop/share/hadoop/mapreduce/*, /usr/local/hadoop/share/hadoop/mapreduce/lib/*, /usr/local/hadoop/share/hadoop/yarn/*, /usr/local/hadoop/share/hadoop/yarn/lib/*, /usr/local/hadoop/share/spark/*</value></property>
      </configuration>

    5.8 Configure the Hadoop cluster in each container (must be redone after every restart)

    1> Create the storage directories

      Under the Hadoop home directory /usr/local/hadoop-2.6.0/, create:

      mkdir -p dfs/name
      mkdir -p dfs/data
      mkdir -p tmp

      (Since these are brand-new HDFS directories, see the note after this list about formatting the NameNode.)
    2> Edit core-site.xml:

      <configuration>
        <property><name>fs.defaultFS</name><value>hdfs://master:9000</value></property>
        <property><name>io.file.buffer.size</name><value>131702</value></property>
        <property><name>hadoop.tmp.dir</name><value>/usr/local/hadoop-2.6.0/tmp</value></property>
      </configuration>
    3> Edit hdfs-site.xml:

      <configuration>
        <property><name>dfs.replication</name><value>3</value></property>
        <property><name>dfs.namenode.name.dir</name><value>/usr/local/hadoop-2.6.0/dfs/name</value></property>
        <property><name>dfs.datanode.data.dir</name><value>/usr/local/hadoop-2.6.0/dfs/data</value></property>
        <property><name>dfs.namenode.secondary.http-address</name><value>master:9001</value></property>
        <property><name>dfs.webhdfs.enabled</name><value>true</value></property>
      </configuration>
    4> Edit mapred-site.xml:

      cp mapred-site.xml.template mapred-site.xml

      <configuration>
        <property><name>mapreduce.framework.name</name><value>yarn</value></property>
        <property><name>mapreduce.jobhistory.address</name><value>master:10020</value></property>
        <property><name>mapreduce.jobhistory.webapp.address</name><value>master:19888</value></property>
      </configuration>
    5> Configure yarn-site.xml (pay attention to the memory and CPU settings here):

      <configuration>
        <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
        <property><name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name><value>org.apache.hadoop.mapred.ShuffleHandler</value></property>
        <property><name>yarn.resourcemanager.address</name><value>master:8032</value></property>
        <property><name>yarn.resourcemanager.scheduler.address</name><value>master:8030</value></property>
        <property><name>yarn.resourcemanager.resource-tracker.address</name><value>master:8031</value></property>
        <property><name>yarn.resourcemanager.admin.address</name><value>master:8033</value></property>
        <property><name>yarn.resourcemanager.webapp.address</name><value>master:8088</value></property>
        <property><name>yarn.nodemanager.resource.memory-mb</name><value>2048</value></property>
        <property><name>yarn.nodemanager.resource.cpu-vcores</name><value>1</value></property>
        <property><name>yarn.application.classpath</name><value>/usr/local/hadoop/etc/hadoop, /usr/local/hadoop/share/hadoop/common/*, /usr/local/hadoop/share/hadoop/common/lib/*, /usr/local/hadoop/share/hadoop/hdfs/*, /usr/local/hadoop/share/hadoop/hdfs/lib/*, /usr/local/hadoop/share/hadoop/mapreduce/*, /usr/local/hadoop/share/hadoop/mapreduce/lib/*, /usr/local/hadoop/share/hadoop/yarn/*, /usr/local/hadoop/share/hadoop/yarn/lib/*</value></property>
        <property>
          <description>Number of seconds after an application finishes before the nodemanager's DeletionService will delete the application's localized file directory and log directory. To diagnose Yarn application problems, set this property's value large enough (for example, to 600 = 10 minutes) to permit examination of these directories. After changing the property's value, you must restart the nodemanager in order for it to have an effect. The roots of Yarn applications' work directories is configurable with the yarn.nodemanager.local-dirs property (see below), and the roots of the Yarn applications' log directories is configurable with the yarn.nodemanager.log-dirs property (see also below).</description>
          <name>yarn.nodemanager.delete.debug-delay-sec</name>
          <value>600</value>
        </property>
      </configuration>
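    Because the NameNode and DataNode directories configured above are brand new and empty, HDFS will most likely need to be formatted once before the first start (an extra step not spelled out in the original; run it on the master only, and only once, since reformatting wipes HDFS):

      cd /usr/local/hadoop-2.6.0/
      ./bin/hdfs namenode -format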

    6. Start the Spark cluster

    6.1 Start the Hadoop cluster from the master

      cd /usr/local/hadoop-2.6.0/sbin/
      ./start-all.sh

      Check with jps on the master:

      bash-4.1# jps
      2441 NameNode
      2611 SecondaryNameNode
      2746 ResourceManager
      3980 Jps

      And on each worker:

      bash-4.1# jps
      1339 DataNode
      1417 NodeManager
      2290 Jps

      A quick HDFS health check is sketched below.
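    A hedged way to confirm HDFS is healthy before moving on (not in the original):

      /usr/local/hadoop-2.6.0/bin/hdfs dfsadmin -report
      # both workers should appear as live datanodes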

    6.2 Start the Spark cluster from the master

      cd /usr/local/spark-1.6.0-bin-hadoop2.6/sbin/
      ./start-all.sh

      Check with jps on the master:

      bash-4.1# jps
      2441 NameNode
      2611 SecondaryNameNode
      2746 ResourceManager
      3392 Master
      3980 Jps

      And on each worker:

      bash-4.1# jps
      1339 DataNode
      1417 NodeManager
      2290 Jps

    7. Run the cluster

    If you configured the slaves file on the master, you only need to run ./sbin/start-all.sh from the Spark directory on the master; otherwise you have to start the master and each slave individually. Open http://<master IP>:8080 in a browser (or port 8085 on the Docker host, given the -p 8085:8080 mapping above) to see the status of each Spark node. To confirm that jobs actually run, you can submit the bundled SparkPi example, as sketched below.
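    A hedged smoke test using the SparkPi example that ships with Spark 1.6.0 (standalone mode; the examples jar name below is what the prebuilt spark-1.6.0-bin-hadoop2.6 package normally contains):

      cd /usr/local/spark-1.6.0-bin-hadoop2.6
      ./bin/spark-submit \
        --class org.apache.spark.examples.SparkPi \
        --master spark://master:7077 \
        lib/spark-examples-1.6.0-hadoop2.6.0.jar 10
      # a successful run prints a line like "Pi is roughly 3.14..."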

    8. Deployment problems and solutions

    8.1 Workers keep failing to start

    Cause: the system's bundled JDK was not fully removed.

      rpm -qa | grep java
      rpm -e --nodeps java_cup-0.10k-5.el6.x86_64
      rpm -e --nodeps java-1.5.0-gcj-1.5.0.0-29.1.el6.x86_64

    8.2 JAVA_HOME is reported as not set

      Run echo $JAVA_HOME to find your Java path, then add export JAVA_HOME=<your Java path> to /usr/local/spark-1.6.0-bin-hadoop2.6/conf/spark-env.sh on every node. If the variable is empty, see the sketch below.
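    One hedged way to locate the actual JDK directory when $JAVA_HOME is empty:

      # resolve the real java binary and strip the trailing /bin/java
      readlink -f "$(which java)" | sed 's|/bin/java$||'
      # then append the result to spark-env.sh on every node, e.g.:
      # export JAVA_HOME=/usr/java/default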

    8.3 For every problem, check the logs: reading the error messages and logs will solve almost anything.