What is Spark?
Spark is a general purpose cluster computing system. It can deploy and run parallel applications on clusters ranging from a single node to thousands of distributed nodes. Spark was originally designed to run Scala applications, but also supports Java, Python and R.
Spark can run as a standalone cluster manager, or by taking advantage of dedicated cluster management frameworks like Apache Hadoop YARN or Apache Mesos.
Before You Begin
- Follow our guide on how to install and configure a three-node Hadoop cluster to set up your YARN cluster. The master node (HDFS NameNode and YARN ResourceManager) is called node-master and the slave nodes (HDFS DataNode and YARN NodeManager) are called node1 and node2. Run the commands in this guide from node-master unless otherwise specified.
- Be sure you have a `hadoop` user that can access all cluster nodes with SSH keys without a password.
- Note the path of your Hadoop installation. This guide assumes it is installed in `/home/hadoop/hadoop`. If it is not, adjust the path in the examples accordingly.
- Run `jps` on each of the nodes to confirm that HDFS and YARN are running. If they are not, start the services with the commands shown below.
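A minimal sketch, assuming the Hadoop `sbin` scripts from the companion Hadoop guide are on the `hadoop` user's `PATH`:

```
start-dfs.sh
start-yarn.sh
```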
Note
This guide is written for a non-root user. Commands that require elevated privileges are prefixed with `sudo`. If you're not familiar with the `sudo` command, see the Users and Groups guide.

Download and Install Spark Binaries
Spark binaries are available from the Apache Spark download page. Adjust each command below to match the correct version number.
- Get the download URL from the Spark download page, download the package, and uncompress it. For Spark 2.2.0 with Hadoop 2.7 or later, log on to `node-master` as the `hadoop` user and run the download commands shown after this list.
- Add the Spark binaries directory to your `PATH`. Edit `/home/hadoop/.profile` and add the line appropriate to your distribution (Debian/Ubuntu or RedHat/Fedora/CentOS), as shown after this list.
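A sketch of both steps, assuming the Spark 2.2.0 build for Hadoop 2.7 from the Apache archive (adjust the version and mirror to match the download page):

```
cd /home/hadoop
wget https://archive.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
tar -xvf spark-2.2.0-bin-hadoop2.7.tgz
mv spark-2.2.0-bin-hadoop2.7 spark
```

For Debian/Ubuntu systems, add this line to `/home/hadoop/.profile`:

```
PATH=/home/hadoop/spark/bin:$PATH
```

For RedHat/Fedora/CentOS systems, export the variable explicitly:

```
export PATH=$PATH:/home/hadoop/spark/bin
```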
Integrate Spark with YARN
To communicate with the YARN Resource Manager, Spark needs to be aware of your Hadoop configuration. This is done via the `HADOOP_CONF_DIR` environment variable. The `SPARK_HOME` variable is not mandatory, but is useful when submitting Spark jobs from the command line.

- Edit the hadoop user profile `/home/hadoop/.profile` and add the environment variables shown in the sketch after this list.
- Restart your session by logging out and logging in again.
- Rename the spark default template config file, as shown below.
- Edit `$SPARK_HOME/conf/spark-defaults.conf` and set `spark.master` to `yarn`, as shown below.
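Sketches for steps 1, 3, and 4 above, assuming the Hadoop and Spark paths used so far in this guide.

Environment variables for `/home/hadoop/.profile`:

```
export HADOOP_CONF_DIR=/home/hadoop/hadoop/etc/hadoop
export SPARK_HOME=/home/hadoop/spark
```

Rename the template:

```
mv $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
```

Set the master in `$SPARK_HOME/conf/spark-defaults.conf`:

```
spark.master    yarn
```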
Spark is now ready to interact with your YARN cluster.
Understand Client and Cluster Mode
Spark jobs can run on YARN in two modes: cluster mode and client mode. Understanding the difference between the two modes is important for choosing an appropriate memory allocation configuration, and to submit jobs as expected.
A Spark job consists of two parts: Spark Executors that run the actual tasks, and a Spark Driver that schedules the Executors.
- Cluster mode: everything runs inside the cluster. You can start a job from your laptop and the job will continue running even if you close your computer. In this mode, the Spark Driver is encapsulated inside the YARN Application Master.
- Client mode: the Spark Driver runs on a client, such as your laptop. If the client is shut down, the job fails. Spark Executors still run on the cluster, and to schedule everything, a small YARN Application Master is created.
Client mode is well suited for interactive jobs, but applications will fail if the client stops. For long running jobs, cluster mode is more appropriate.
Configure Memory Allocation
A Spark job running in YARN containers may fail to start if memory allocation is not configured properly. For nodes with less than 4G of RAM, the default configuration is not adequate and may trigger swapping and poor performance, or even the failure of application initialization due to lack of memory.
Be sure to understand how Hadoop YARN manages memory allocation before editing Spark memory settings so that your changes are compatible with your YARN cluster’s limits.
Note
See the memory allocation section of the Install and Configure a 3-Node Hadoop Cluster guide for more details on managing your YARN cluster's memory.
Give Your YARN Containers Maximum Allowed Memory
If the memory requested is above the maximum allowed, YARN will reject creation of the container, and your Spark application won’t start.
- Get the value of `yarn.scheduler.maximum-allocation-mb` in `$HADOOP_CONF_DIR/yarn-site.xml`. This is the maximum allowed value, in MB, for a single container.
- Make sure that values for Spark memory allocation, configured in the following section, are below the maximum.
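For reference, a sketch of how this property appears in `yarn-site.xml`, using the sample value adopted below:

```
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>1536</value>
</property>
```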
This guide will use a sample value of `1536` for `yarn.scheduler.maximum-allocation-mb`. If your settings are lower, adjust the samples with your configuration.

Configure the Spark Driver Memory Allocation in Cluster Mode
In cluster mode, the Spark Driver runs inside the YARN Application Master. The amount of memory requested by Spark at initialization is configured either in `spark-defaults.conf` or through the command line.

From `spark-defaults.conf`

- Set the default amount of memory allocated to the Spark Driver in cluster mode via `spark.driver.memory` (this value defaults to `1G`). To set it to `512MB`, edit the file as shown below.
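A sketch of the line to add to `$SPARK_HOME/conf/spark-defaults.conf`:

```
spark.driver.memory    512m
```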
From the Command Line

- Use the `--driver-memory` parameter to specify the amount of memory requested by `spark-submit`. See the following section about application submission for examples.

Note
Values given from the command line will override whatever has been set in `spark-defaults.conf`.
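For instance, a sketch assuming the SparkPi sample application introduced later in this guide (adjust the jar version to your Spark release):

```
spark-submit --deploy-mode cluster \
    --driver-memory 512m \
    --class org.apache.spark.examples.SparkPi \
    $SPARK_HOME/examples/jars/spark-examples_2.11-2.2.0.jar 10
```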
Configure the Spark Application Master Memory Allocation in Client Mode
In client mode, the Spark driver will not run on the cluster, so the above configuration will have no effect. A YARN Application Master still needs to be created to schedule the Spark executor, and you can set its memory requirements.
Set the amount of memory allocated to the Application Master in client mode with `spark.yarn.am.memory` (the default is `512M`), in `$SPARK_HOME/conf/spark-defaults.conf`:
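A sketch of the corresponding line:

```
spark.yarn.am.memory    512m
```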
This value cannot be set from the command line.
Configure Spark Executors’ Memory Allocation
The Spark Executors' memory allocation is calculated based on two parameters inside `$SPARK_HOME/conf/spark-defaults.conf`:

- `spark.executor.memory`: sets the base memory used in the calculation.
- `spark.yarn.executor.memoryOverhead`: is added to the base memory. It defaults to 7% of the base memory, with a minimum of `384MB`.

Make sure that the Executor's requested memory, including overhead memory, is below the YARN container maximum size, or the Spark application won't initialize.

Example: for a `spark.executor.memory` of 1GB, the required memory is 1024+384=1408MB. For 512MB, the required memory is 512+384=896MB.

To set the executor memory to `512MB`, edit `$SPARK_HOME/conf/spark-defaults.conf` and add the following line:
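A sketch of the line to add:

```
spark.executor.memory    512m
```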
How to Submit a Spark Application to the YARN Cluster
Applications are submitted with the `spark-submit` command. The Spark installation package contains sample applications, like the parallel calculation of Pi, that you can run to practice starting Spark jobs.

To run the sample Pi calculation, use the following command:
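A sketch assuming the examples jar that ships with Spark 2.2.0 under `$SPARK_HOME/examples/jars` (adjust the file name to your version):

```
spark-submit --deploy-mode client \
    --class org.apache.spark.examples.SparkPi \
    $SPARK_HOME/examples/jars/spark-examples_2.11-2.2.0.jar 10
```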
The first parameter, `--deploy-mode`, specifies which mode to use, `client` or `cluster`. To run the same application in cluster mode, replace `--deploy-mode client` with `--deploy-mode cluster`.

Monitor Your Spark Applications
When you submit a job, the Spark Driver automatically starts a web UI on port `4040` that displays information about the application. However, when execution is finished, the web UI is dismissed with the application driver and can no longer be accessed.

Spark provides a History Server that collects application logs from HDFS and displays them in a persistent web UI. The following steps will enable log persistence in HDFS:
- Edit `$SPARK_HOME/conf/spark-defaults.conf` and add the event-log lines shown in the sketch after this list, to enable Spark jobs to log in HDFS.
- Create the log directory in HDFS (see below).
- Configure the History Server related properties in `$SPARK_HOME/conf/spark-defaults.conf` (see below). You may want to use a different update interval than the default `10s`. If you specify a larger interval, you will have some delay between what you see in the History Server and the real-time status of your application. If you use a shorter interval, you will increase I/O on HDFS.
- Run the History Server (see below).
- Repeat the steps from the previous section to start a job with `spark-submit` that will generate some logs in HDFS.
- Access the History Server by navigating to http://node-master:18080 in a web browser.
Run the Spark Shell
The Spark shell provides an interactive way to examine and work with your data.
- Put some data into HDFS for analysis. This example uses the text of Alice in Wonderland from the Gutenberg project, as in the sketch after this list.
- Start the Spark shell (see below).
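A sketch of both steps, assuming Project Gutenberg still serves Alice's Adventures in Wonderland (ebook #11) at the URL below:

```
wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
hdfs dfs -mkdir inputs
hdfs dfs -put alice.txt inputs
```

Then start the shell:

```
spark-shell
```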
The Scala Spark API is beyond the scope of this guide. You can find the official documentation on the Apache Spark documentation site.
Where to Go Next?
Now that you have a running Spark cluster, you can:
- Learn any of the Scala, Java, Python, or R APIs to create Spark applications from the Apache Spark Programming Guide
- Interact with your data with Spark SQL
- Add machine learning capabilities to your applications with Apache MLlib
Install Apache Spark on Ubuntu 17.10

Apache Spark is a data analytics tool that can be used to process data from HDFS, S3, or other data sources in memory. In this post, we will install Apache Spark on an Ubuntu 17.10 machine.

For this guide, we will use Ubuntu version 17.10 (GNU/Linux 4.13.0-38-generic x86_64).

Apache Spark is part of the Hadoop ecosystem for Big Data. Try installing Apache Hadoop and making a sample application with it.
Updating existing packages
To start the installation of Spark, we need to update our machine with the latest software packages available. We can do this with:
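A sketch using apt (assuming a sudo-capable user):

```
sudo apt-get update && sudo apt-get -y upgrade
```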
As Spark is based on Java, we need to install it on our machine. We can use any Java version above Java 6. Here, we will be using Java 8:
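For example, the OpenJDK 8 headless package (a hedged choice; any JDK 8 works):

```
sudo apt-get -y install openjdk-8-jdk-headless
```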
Downloading Spark files
All the necessary packages now exist on our machine. We’re ready to download the required Spark TAR files so that we can start setting them up and run a sample program with Spark as well.
In this guide, we will be installing Spark v2.3.0, available from the Spark download page.

Download the corresponding files with this command:

```
wget http://www-us.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
```
Depending upon your network speed, this can take up to a few minutes, as the file is big in size.

Now that we have the TAR file downloaded, we can extract it in the current directory:
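A sketch matching the archive downloaded above:

```
tar xvzf spark-2.3.0-bin-hadoop2.7.tgz
```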
This will take a few seconds to complete due to the big file size of the archive.
When it comes to upgrading Apache Spark in the future, path updates can create problems. These issues can be avoided by creating a softlink to Spark. Run this command to make a softlink:
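A sketch, assuming the extracted directory sits where you want the link (adjust both paths to your layout, for example the `/LinuxHint/spark` path used in `.bashrc` below):

```
ln -s spark-2.3.0-bin-hadoop2.7 spark
```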
Adding Spark to Path
To execute Spark scripts, we will be adding it to the path now. To do this, open the bashrc file:
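For example, with nano (any editor works):

```
nano ~/.bashrc
```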
Add these lines to the end of the `.bashrc` file so that the `PATH` can contain the Spark executable file path:
```
export SPARK_HOME=/LinuxHint/spark
export PATH=$SPARK_HOME/bin:$PATH
```
To activate these changes, run the following command for the bashrc file:
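This re-reads `.bashrc` in the current session:

```
source ~/.bashrc
```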
Launching Spark Shell
Now that we are right outside the spark directory, run the following command to open a Spark shell:
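A sketch, assuming `spark` is on your `PATH` from the previous step:

```
spark-shell
```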
We will see that the Spark shell is now open.
We can see in the console that Spark has also opened a web console on port 4040. Let's give it a visit.
Though we will be operating in the console itself, the web environment is an important place to look at when you execute heavy Spark jobs, so that you know what is happening in each Spark job you execute.
Check the Spark shell version with a simple command:
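At the shell's Scala prompt:

```
scala> sc.version
```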
We will get back something like:
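Assuming the 2.3.0 build installed above:

```
res0: String = 2.3.0
```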
Making a sample Spark Application with Scala
Now, we will make a sample Word Counter application with Apache Spark. To do this, first load a text file into Spark Context on Spark shell:
```
scala> var Data = sc.textFile("/root/LinuxHint/spark/README.md")
Data: org.apache.spark.rdd.RDD[String] = /root/LinuxHint/spark/README.md MapPartitionsRDD[1] at textFile at <console>:24
```
Now, the text present in the file must be broken into tokens which Spark can manage:
```
scala> var tokens = Data.flatMap(s => s.split(" "))
tokens: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:25
```
Now, initialise the count for each word to 1:
```
scala> var tokens_1 = tokens.map(s => (s, 1))
tokens_1: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:25
```
Finally, calculate the frequency of each word of the file:
```
scala> var sum_each = tokens_1.reduceByKey((a, b) => a + b)
```
Time to look at the output for the program. Collect the tokens and their respective counts:
```
scala> sum_each.collect()
res1: Array[(String, Int)] = Array((package,1), (For,3), (Programs,1), (processing.,1), (Because,1), (The,1), (page](http://spark.apache.org/documentation.html).,1), (cluster.,1), (its,1), ([run,1), (than,1), (APIs,1), (have,1), (Try,1), (computation,1), (through,1), (several,1), (This,2), (graph,1), (Hive,2), (storage,1), (['Specifying,1), (To,2), ('yarn',1), (Once,1), (['Useful,1), (prefer,1), (SparkPi,2), (engine,1), (version,1), (file,1), (documentation,1), (processing,1), (the,24), (are,1), (systems.,1), (params,1), (not,1), (different,1), (refer,2), (Interactive,2), (R,1), (given.,1), (if,4), (build,4), (when,1), (be,2), (Tests,1), (Apache,1), (thread,1), (programs,1), (including,4), (./bin/run-example,2), (Spark.,1), (package.,1), (1000).count(),1), (Versions,1), (HDFS,1), (D..
```
Excellent! We were able to run a simple Word Counter example using the Scala programming language with a text file already present in the system.
Conclusion
In this lesson, we looked at how to install and start using Apache Spark on an Ubuntu 17.10 machine, and ran a sample application on it as well.