What is YARN
Part of the core Hadoop project, YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform, unlocking an entirely new approach to analytics.
YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture.
What YARN Does
YARN is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters.
YARN also extends the power of Hadoop to incumbent and new technologies found within the data center so that they can take advantage of cost effective, linear-scale storage and processing. It provides ISVs and developers a consistent framework for writing data access applications that run in Hadoop.
Docker on YARN
Traditional YARN applications like Spark and MapReduce require a complex set of dependencies. Spark applications may require specific versions of these dependencies to be installed on all the cluster hosts where Spark executors run, sometimes with conflicting versions. Installing such dependencies on the cluster machines poses package isolation challenges as well as organizational challenges, since the cluster machines are typically maintained by specialized operations teams.
Docker support in Apache Hadoop 3 can be leveraged by Apache Spark to address these long-standing package isolation challenges by packaging an application's dependencies in Docker images. With this solution, users can bring their own versions of Python and libraries without heavy involvement of admins, and Docker image layer caching keeps the solution efficient.
Leveraging Docker for Spark on YARN
To run Dockerized Spark on YARN, the Hadoop YARN cluster must be Docker enabled. The remainder of this discussion describes YARN Docker support in Apache Hadoop 3.1.0 and later.
Note that YARN containerization support enables applications to optionally run inside Docker containers. That is, on the same Hadoop cluster, applications with and without Docker can run side by side.
1. Install Docker on CentOS 7
The Docker installation package available in the official CentOS 7 repository may not be the latest version. To get the latest and greatest version, install Docker from the official Docker repository. This section shows you how to do just that.
But first, let’s update the package database:
sudo yum check-update
Now run this command. It will add the official Docker repository, download the latest version of Docker, and install it:
curl -sSL https://get.docker.com/ | sh
Next, configure the DNS servers that Docker containers will use for name resolution.
Edit /etc/docker/daemon.json and add the following options:
{
  "live-restore": true,
  "debug": true,
  "dns": ["8.8.4.4", "8.8.8.8"]
}
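When several hosts need the identical file, the same options can be written from the shell. The following is a minimal sketch, not a required step: write_daemon_json is a hypothetical helper name, and its optional path argument exists only so the function can be tried against a scratch location first. It overwrites the target file, so merge by hand if the host already has a daemon.json.

```shell
# Hypothetical helper: writes the daemon options shown above to a file.
# The target path is an optional argument so the function can be tried
# against a scratch location before touching /etc/docker.
write_daemon_json() (
    target="${1:-/etc/docker/daemon.json}"
    mkdir -p "$(dirname "$target")" || exit 1
    # Quoted heredoc delimiter: the JSON is written literally, no expansion.
    cat > "$target" <<'EOF'
{
  "live-restore": true,
  "debug": true,
  "dns": ["8.8.4.4", "8.8.8.8"]
}
EOF
)
```

On a real host the function would be called as root with no argument, before the Docker daemon is started.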
After installation has completed, start the Docker daemon:
sudo systemctl start docker
Verify that it’s running:
sudo systemctl status docker
The output should be similar to the following, showing that the service is active and running:
● docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2016-05-01 06:53:52 CDT; 1 weeks 3 days ago
     Docs: https://docs.docker.com
 Main PID: 749 (docker)
Lastly, make sure it starts at every server reboot:
sudo systemctl enable docker
Installing Docker gives you not just the Docker service (daemon) but also the docker command line utility, or the Docker client.
2. Enable Cgroup
On an Ambari cluster, you can turn on cgroups by enabling CPU Scheduling. On a non-Ambari cluster, you must configure certain properties in yarn-site.xml on the ResourceManager and NodeManager hosts to enable cgroups.
Cgroups is a Linux kernel feature. It is supported on the following Linux operating systems:
- CentOS 6.9, 7.3
- RHEL 6.9, 7.3
- SUSE 12
- Ubuntu 16
The following commands must be run after every reboot of the NodeManager hosts to set up the cgroup hierarchy. Note that operating systems use different mount points for the cgroup interface. Replace /sys/fs/cgroup with your operating system equivalent.
mkdir -p /sys/fs/cgroup/cpu/hadoop-yarn
chown -R yarn /sys/fs/cgroup/cpu/hadoop-yarn
mkdir -p /sys/fs/cgroup/memory/hadoop-yarn
chown -R yarn /sys/fs/cgroup/memory/hadoop-yarn
mkdir -p /sys/fs/cgroup/blkio/hadoop-yarn
chown -R yarn /sys/fs/cgroup/blkio/hadoop-yarn
mkdir -p /sys/fs/cgroup/net_cls/hadoop-yarn
chown -R yarn /sys/fs/cgroup/net_cls/hadoop-yarn
mkdir -p /sys/fs/cgroup/devices/hadoop-yarn
chown -R yarn /sys/fs/cgroup/devices/hadoop-yarn
mkdir -p /sys/fs/cgroup/cpu/yarn
chown -R yarn /sys/fs/cgroup/cpu/yarn
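Because these commands have to be repeated after each reboot, it can help to wrap them in a function. The sketch below assumes the same controllers and the same yarn owner as the commands above; setup_yarn_cgroups is a hypothetical name, and the two optional arguments (mount point and owner) are additions for illustration so the mount point can be swapped per operating system.

```shell
# Hypothetical helper: creates the hadoop-yarn cgroup directories for each
# controller listed above and hands them to the YARN user.
# $1 = cgroup mount point (default /sys/fs/cgroup), $2 = owner (default yarn).
setup_yarn_cgroups() (
    root="${1:-/sys/fs/cgroup}"
    owner="${2:-yarn}"
    for controller in cpu memory blkio net_cls devices; do
        dir="$root/$controller/hadoop-yarn"
        mkdir -p "$dir" || exit 1
        chown -R "$owner" "$dir" || exit 1
    done
    # The commands above also create a separate cpu/yarn directory.
    mkdir -p "$root/cpu/yarn" || exit 1
    chown -R "$owner" "$root/cpu/yarn" || exit 1
)
```

Run as root with no arguments, the function performs the same steps as the command list. How it gets triggered at boot (for example from a startup script) is left to the operations team.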
3. Configure YARN for running Docker
Make sure YARN cgroups are enabled before configuring YARN for running Docker containers.
To leverage YARN cgroup support, the NodeManager must be configured to use LinuxContainerExecutor. The Docker YARN integration also requires this container executor.
Set the following properties in the yarn-site.xml file:
<property>
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user</name>
  <value>nobody</value>
</property>
<property>
  <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
  <value>default,docker</value>
</property>
<property>
  <name>yarn.nodemanager.runtime.linux.docker.capabilities</name>
  <value>CHOWN,DAC_OVERRIDE,FSETID,FOWNER,MKNOD,NET_RAW,SETGID,SETUID,SETFCAP,SETPCAP,NET_BIND_SERVICE,SYS_CHROOT,KILL,AUDIT_WRITE</value>
</property>
<property>
  <name>yarn.nodemanager.runtime.linux.docker.privileged-containers.allowed</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.runtime.linux.docker.privileged-containers.acl</name>
  <value> </value>
</property>
<property>
  <name>yarn.nodemanager.runtime.linux.docker.allowed-container-networks</name>
  <value>host,bridge</value>
</property>
<property>
  <name>yarn.nodemanager.runtime.linux.docker.default-container-network</name>
  <value>host</value>
</property>
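After editing yarn-site.xml, a quick sanity check can confirm that the docker runtime is actually listed. The following is a rough sketch only: yarn_site_value and docker_runtime_allowed are hypothetical helpers, and the lookup assumes each name and its value sit on consecutive lines, as in the properties above. It is not a real XML parser.

```shell
# Hypothetical helper: prints the <value> that follows a given <name> in a
# Hadoop-style XML file. Assumes <name> and <value> are on consecutive lines,
# each on a single line; it is not a real XML parser.
yarn_site_value() (
    file="$1"
    name="$2"
    awk -v want="<name>$name</name>" '
        found {
            # <value> is 7 characters, </value> is 8: strip both tags.
            if (match($0, /<value>.*<\/value>/))
                print substr($0, RSTART + 7, RLENGTH - 15)
            exit
        }
        index($0, want) { found = 1 }
    ' "$file"
)

# Hypothetical check: succeeds only if "docker" is among the allowed runtimes.
docker_runtime_allowed() (
    runtimes=$(yarn_site_value "$1" yarn.nodemanager.runtime.linux.allowed-runtimes)
    case ",$runtimes," in
        *,docker,*) exit 0 ;;
        *) exit 1 ;;
    esac
)
```

For example, docker_runtime_allowed followed by the path to your yarn-site.xml returns success when the value is default,docker. The location of yarn-site.xml depends on the installation and is not assumed here.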
Set the following properties in the container-executor.cfg file:
yarn.nodemanager.local-dirs=<yarn.nodemanager.local-dirs from yarn-site.xml>
yarn.nodemanager.log-dirs=<yarn.nodemanager.log-dirs from yarn-site.xml>
yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred,bin
min.user.id=50
[docker]
  module.enabled=true
  docker.binary=/usr/bin/docker
  docker.allowed.capabilities=CHOWN,DAC_OVERRIDE,FSETID,FOWNER,MKNOD,NET_RAW,SETGID,SETUID,SETFCAP,SETPCAP,NET_BIND_SERVICE,SYS_CHROOT,KILL,AUDIT_WRITE,DAC_READ_SEARCH,SYS_PTRACE,SYS_ADMIN
  docker.allowed.devices=
  docker.allowed.networks=bridge,host,none
  docker.allowed.ro-mounts=/sys/fs/cgroup,<yarn.nodemanager.local-dirs from yarn-site.xml>
  docker.allowed.rw-mounts=<yarn.nodemanager.local-dirs from yarn-site.xml>,<yarn.nodemanager.log-dirs from yarn-site.xml>
  docker.privileged-containers.enabled=false
  docker.trusted.registries=local,centos,hortonworks
  docker.allowed.volume-drivers=
6. Image management
Images must be preloaded on all NodeManager hosts, or they can be implicitly pulled at runtime if they are available in a public Docker registry such as Docker Hub. If the image does not exist on the NodeManager host and cannot be pulled, the container fails.
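One way to preload an image is to run docker pull on each NodeManager host over SSH. The sketch below assumes passwordless SSH to those hosts and a remote user that is allowed to run docker; prepull_image, the DRY_RUN switch, and the host names in the usage note are all hypothetical.

```shell
# Hypothetical helper: pre-pulls one image on each NodeManager host over SSH.
# $1 = image reference, remaining arguments = host names.
# Set DRY_RUN=1 to print the commands instead of running them.
prepull_image() (
    image="$1"
    shift
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ssh $host docker pull $image"
        else
            # Keep going if one host fails, but report it on stderr.
            ssh "$host" docker pull "$image" || echo "pull failed on $host" >&2
        fi
    done
)
```

A call such as prepull_image centos:7 nm1.example.com nm2.example.com would pull the image on both hosts. Setting DRY_RUN=1 first prints the commands so the host list can be reviewed.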
7. Run Docker on YARN using the YARN services API
Create a Yarnfile and save it to /tmp/recognition.json:
{
  "name": "recognition",
  "version": "1.0.0",
  "lifetime": "3600",
  "configuration": {
    "properties": {
      "docker.network": "bridge"
    }
  },
  "components": [
    {
      "name": "recognition",
      "number_of_containers": 2,
      "artifact": {
        "id": "<your-repo>",
        "type": "DOCKER"
      },
      "launch_command": "",
      "resource": {
        "cpus": 1,
        "memory": "1024"
      }
    }
  ]
}
Launch the YARN application:
yarn app -launch recognition /tmp/recognition.json
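Once the service is launched, the same yarn app command can report its status and tear it down. A small sketch follows, assuming the service name recognition from the launch command above. YARN_CMD, service_status, and service_teardown are names introduced here for illustration; pointing YARN_CMD at echo previews the commands without contacting the cluster.

```shell
# YARN_CMD is a variable introduced for this sketch; it defaults to the
# yarn CLI and can be set to echo to preview the commands.
YARN_CMD="${YARN_CMD:-yarn}"

# Print the current state of a launched service.
service_status() {
    "$YARN_CMD" app -status "$1"
}

# Stop a service, then remove its saved definition.
# The destroy step runs only if the stop step succeeded.
service_teardown() {
    "$YARN_CMD" app -stop "$1" && "$YARN_CMD" app -destroy "$1"
}
```

For example, service_status recognition shows the state of the service, and service_teardown recognition stops and then destroys it.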
References:
https://blog.cloudera.com/containerized-apache-spark-yarn-apache-hadoop-3-1/