Spark Scala Cheat Sheet

This chapter explains how Scala supports regular expressions through the Regex class, available in the scala.util.matching package.

Try the following example program, where we find the word Scala in a statement.
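A minimal sketch of such a program (the sample sentence is illustrative):

```scala
import scala.util.matching.Regex

object Demo {
   def main(args: Array[String]): Unit = {
      // "Scala".r turns the String into a Regex via an implicit conversion
      val pattern = "Scala".r
      val str = "Scala is Scalable and cool"

      // findFirstIn returns an Option[String] with the first match, if any
      println(pattern.findFirstIn(str))   // prints Some(Scala)
   }
}
```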


Save the above program in Demo.scala. The following commands are used to compile and execute this program.



We create a String and call the r() method on it. Scala implicitly converts the String to a StringOps and invokes that method to get an instance of Regex. To find the first match of the regular expression, simply call the findFirstIn() method. If, instead of only the first occurrence, we would like to find all occurrences of the matching word, we can use the findAllIn() method; in case there are multiple Scala words in the target string, this returns a collection of all matching words.

You can make use of the mkString() method to concatenate the resulting list, you can use a pipe (|) to search for both the lowercase and capitalized forms of Scala, and you can use the Regex constructor instead of the r() method to create a pattern.

Try the following example program.
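A sketch combining the three ideas above (Regex constructor, pipe, mkString):

```scala
import scala.util.matching.Regex

object Demo {
   def main(args: Array[String]): Unit = {
      // Regex constructor instead of r(); the pipe matches "S" or "s"
      val pattern = new Regex("(S|s)cala")
      val str = "Scala is scalable and cool"

      // findAllIn returns every match; mkString concatenates them
      println(pattern.findAllIn(str).mkString(","))   // prints Scala,scala
   }
}
```

Note that "scalable" also contains the substring "scala", which is why it contributes a match.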


Save the above program in Demo.scala. The following commands are used to compile and execute this program.



If you would like to replace matching text, use replaceFirstIn() to replace the first match or replaceAllIn() to replace all occurrences.
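A minimal sketch of the replacement methods:

```scala
object Demo {
   def main(args: Array[String]): Unit = {
      val pattern = "(S|s)cala".r
      val str = "Scala is scalable and cool"

      // replaceFirstIn replaces only the first match;
      // replaceAllIn would replace every occurrence
      println(pattern.replaceFirstIn(str, "Java"))   // prints Java is scalable and cool
   }
}
```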


Save the above program in Demo.scala. The following commands are used to compile and execute this program.



Forming Regular Expressions

Scala inherits its regular expression syntax from Java, which in turn inherits most of the features of Perl. Here are just some examples that should be enough as refreshers −

The following table lists the regular-expression metacharacter syntax available in Java.

^              Matches beginning of line.
$              Matches end of line.
.              Matches any single character except newline. Using m option allows it to match newline as well.
[...]          Matches any single character in brackets.
[^...]         Matches any single character not in brackets.
\A             Beginning of entire string.
\z             End of entire string.
\Z             End of entire string except allowable final line terminator.
re*            Matches 0 or more occurrences of preceding expression.
re+            Matches 1 or more occurrences of preceding expression.
re?            Matches 0 or 1 occurrence of preceding expression.
re{n}          Matches exactly n occurrences of preceding expression.
re{n,}         Matches n or more occurrences of preceding expression.
re{n,m}        Matches at least n and at most m occurrences of preceding expression.
a|b            Matches either a or b.
(re)           Groups regular expressions and remembers matched text.
(?:re)         Groups regular expressions without remembering matched text.
(?>re)         Matches independent pattern without backtracking.
\w             Matches word characters.
\W             Matches nonword characters.
\s             Matches whitespace. Equivalent to [\t\n\r\f].
\S             Matches nonwhitespace.
\d             Matches digits. Equivalent to [0-9].
\D             Matches nondigits.
\A             Matches beginning of string.
\Z             Matches end of string. If a newline exists, it matches just before newline.
\z             Matches end of string.
\G             Matches point where last match finished.
\n             Back-reference to capture group number 'n'.
\b             Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets.
\B             Matches nonword boundaries.
\n, \t, etc.   Matches newlines, carriage returns, tabs, etc.
\Q             Escape (quote) all characters up to \E.
\E             Ends quoting begun with \Q.

Regular-Expression Examples

.                 Match any character except newline
[Rr]uby           Match 'Ruby' or 'ruby'
rub[ye]           Match 'ruby' or 'rube'
[aeiou]           Match any one lowercase vowel
[0-9]             Match any digit; same as [0123456789]
[a-z]             Match any lowercase ASCII letter
[A-Z]             Match any uppercase ASCII letter
[a-zA-Z0-9]       Match any of the above
[^aeiou]          Match anything other than a lowercase vowel
[^0-9]            Match anything other than a digit
\d                Match a digit: [0-9]
\D                Match a nondigit: [^0-9]
\s                Match a whitespace character: [ \t\r\n\f]
\S                Match nonwhitespace: [^ \t\r\n\f]
\w                Match a single word character: [A-Za-z0-9_]
\W                Match a nonword character: [^A-Za-z0-9_]
ruby?             Match 'rub' or 'ruby': the y is optional
ruby*             Match 'rub' plus 0 or more ys
ruby+             Match 'rub' plus 1 or more ys
\d{3}             Match exactly 3 digits
\d{3,}            Match 3 or more digits
\d{3,5}           Match 3, 4, or 5 digits
\D\d+             No group: + repeats \d
(\D\d)+           Grouped: + repeats the \D\d pair
([Rr]uby(, )?)+   Match 'Ruby', 'Ruby, ruby, ruby', etc.

Note − every backslash appears twice in the strings above. This is because, in Java and Scala, a single backslash is an escape character in a string literal, not a regular character that shows up in the string. So instead of '\', you need to write '\\' to get a single backslash in the string.

Try the following example program.
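A sketch exercising one of the patterns from the table (the input string is illustrative):

```scala
import scala.util.matching.Regex

object Demo {
   def main(args: Array[String]): Unit = {
      // abl[ae]\d+ : "abl", then "a" or "e", then one or more digits
      val pattern = new Regex("abl[ae]\\d+")
      val str = "ablaw is able1 and cool"

      println(pattern.findAllIn(str).mkString(","))   // prints able1
   }
}
```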


Save the above program in Demo.scala. The following commands are used to compile and execute this program.


Spark Scala Cheat Sheet


This page contains a bunch of Spark pipeline transformation methods, which we can use for different problems. Use this as a quick cheat sheet on how to do a particular operation on a Spark DataFrame or in pyspark.

These code snippets are tested on spark-2.4.x; most also work on spark-2.3.x, but older versions are untested.

Read the partitioned json files from disk

(applicable to all supported file types)
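A sketch, assuming an active SparkSession named `spark`; the path is hypothetical:

```scala
// Read all partitioned json files under a directory.
// spark.read.parquet / csv / text work the same way for other formats.
val df = spark.read.json("data/events/*.json")
```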

Save partitioned files into a single file.

Here we merge all the partitions into one file and dump it to disk. This happens at the driver node, so be careful with the size of the data set you are dealing with; otherwise, the driver node may run out of memory.

Use the coalesce method to adjust the partition count of the RDD based on our needs.
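A sketch of the merge-and-write step, assuming a DataFrame `df` and a hypothetical output path:

```scala
// Merge all partitions into a single one, then write it out as one file.
// coalesce(1) funnels all data through one task, so watch the data-set size.
df.coalesce(1)
  .write
  .mode("overwrite")
  .json("out/merged")
```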

Filter rows which meet particular criteria
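A sketch; the `age` column is hypothetical:

```scala
import org.apache.spark.sql.functions.col

// Keep only the rows satisfying a predicate
val adults = df.filter(col("age") >= 21)
```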

Map with case class

Use case class if you want to map on multiple columns with a complexdata structure.

OR using Row class.
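A sketch of both variants, assuming a DataFrame `df` with hypothetical name/age columns and an active SparkSession `spark`:

```scala
import org.apache.spark.sql.Row
import spark.implicits._

// Variant 1: map via a case class mirroring the row structure
case class Person(name: String, age: Long)
val upper = df.as[Person].map(p => p.copy(name = p.name.toUpperCase))

// Variant 2: pattern-match directly on Row
val pairs = df.map { case Row(name: String, age: Long) => (name.toUpperCase, age) }
```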

Use selectExpr to access inner attributes

This provides easy access to nested data structures like json, letting you filter them using any existing UDFs, or write your own UDF for more flexibility.
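A sketch; the nested column names (`payload.user.name`, `payload.ts`) are hypothetical:

```scala
// Reach into a nested struct with SQL expressions
val users = df.selectExpr("payload.user.name as user_name", "payload.ts as event_time")
```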

How to access RDD methods from pyspark side

Using standard RDD operations via the pyspark API isn't straightforward; to get at them we need to invoke .rdd to convert the DataFrame and access these features.

For example, here we convert a sparse vector to a dense one and sum it column-wise.

Pyspark Map on multiple columns

Filtering a DataFrame column of type Seq[String]
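A sketch; the `tokens` array column and search word are hypothetical:

```scala
import org.apache.spark.sql.functions.{array_contains, col}

// Keep rows whose tokens array contains the given word
val hits = df.filter(array_contains(col("tokens"), "spark"))
```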

Filter a column with custom regex and udf
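A sketch; the pattern and the `ticket` column are hypothetical:

```scala
import org.apache.spark.sql.functions.{col, udf}

// A UDF wrapping a custom regex check
val looksLikeTicket = udf { s: String =>
  s != null && s.matches("""[A-Za-z]+-\d+""")   // e.g. "SPARK-12345"
}
val matched = df.filter(looksLikeTicket(col("ticket")))
```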

Sum a column elements
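A sketch; the `amount` column is hypothetical:

```scala
import org.apache.spark.sql.functions.sum

// Aggregate the whole column down to a single value on the driver
val total = df.agg(sum("amount")).first.getDouble(0)
```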

Remove Unicode characters from tokens

Sometimes we only need to work with ASCII text, so it's better to clean out other characters.
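A sketch; the `token` column is hypothetical:

```scala
import org.apache.spark.sql.functions.{col, udf}

// Strip everything outside the ASCII range from a string column
val asciiOnly = udf { s: String =>
  if (s == null) null else s.replaceAll("[^\\x00-\\x7F]", "")
}
val cleaned = df.withColumn("token", asciiOnly(col("token")))
```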

Connecting to jdbc with partition by integer column

Spark Rdd Cheat Sheet Scala

When using Spark to read data from a SQL database and then run the rest of the pipeline on it, it is recommended to partition the data according to the natural segments in the data, or at least on an integer column, so that Spark can fire multiple SQL queries to read data from the SQL server and operate on it in parallel, with the results going to separate Spark partitions.

The commands below are in pyspark, but the APIs are the same for the Scala version.
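A Scala-side sketch; the url, table, credentials, bounds, and partition column are all placeholders:

```scala
// Partitioned jdbc read: Spark issues numPartitions parallel range queries
// over the partitionColumn between lowerBound and upperBound.
val jdbcDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "events")
  .option("user", "dbuser")
  .option("password", "dbpass")
  .option("partitionColumn", "id")   // must be numeric (or date/timestamp)
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "10")
  .load()
```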

Parse nested json data

This will be very helpful when working with pyspark and you want to pass very nested json data between the JVM and Python processes. Lately the Spark community relies on the Apache Arrow project to avoid multiple serialization/deserialization costs when sending data from Java memory to Python memory or vice versa.

So, to process the inner objects, you can make use of the getItem method to filter out the required parts of the object and pass them over to Python memory via Arrow. In the future Arrow might support arbitrarily nested data, but right now it won't support complex nested formats. The general recommendation is to go without nesting.
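A sketch of getItem; the nested column names are hypothetical:

```scala
import org.apache.spark.sql.functions.col

// Pull an inner field out of a nested column
val userNames = df.select(
  col("payload").getItem("user").getItem("name").as("user_name")
)
```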

'string ⇒ array<string>' conversion

The type annotation .as[String] avoids relying on an assumed implicit conversion.
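A sketch of the conversion using split; the `text` column and delimiter are hypothetical:

```scala
import org.apache.spark.sql.functions.{col, split}

// string => array<string>: split on runs of whitespace
val tokens = df.select(split(col("text"), "\\s+").as("tokens"))
```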

A crazy string collection and groupby

This is a chain of operations on a column of type Array[String] that collects the tokens and counts the n-gram distribution over all the tokens.
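A simplified sketch of that chain (unigram counts; the `tokens` column is hypothetical):

```scala
import org.apache.spark.sql.functions.{col, explode}

// Flatten the Array[String] column into one token per row, then count tokens
val tokenCounts = df
  .select(explode(col("tokens")).as("token"))
  .groupBy("token")
  .count()
  .orderBy(col("count").desc)
```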

How to access AWS s3 on spark-shell or pyspark

Most of the time we need a cloud storage provider like s3 / gs etc. to read and write the data for processing. Very few keep an in-house hdfs to handle the data themselves; for the majority, cloud storage is easy to start with, and you don't need to worry about size limitations.

Supply the aws credentials via environment variable
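A sketch; the key values are placeholders, and the hadoop-aws version is chosen to match the hadoop 2.7.x jars bundled with spark 2.4.x:

```shell
# Credentials via environment variables (picked up by the s3a filesystem)
export AWS_ACCESS_KEY_ID="AKIA_EXAMPLE"
export AWS_SECRET_ACCESS_KEY="secret_example"

# Pull in the s3a filesystem implementation
spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.7
```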

Supply the credentials via default aws ~/.aws/config file

Recent versions of awscli expect their configuration to be kept in the ~/.aws/credentials file, but old versions look at the ~/.aws/config path. Spark 2.4.x looks at the ~/.aws/config location, since spark 2.4.x ships with default hadoop jars of version 2.7.x.

Set spark scratch space or tmp directory correctly

This might be required when working with a huge dataset that your machine can't hold in memory for the given pipeline steps; in those cases the data will be spilled over to disk and saved in the tmp directory.

Set the properties below to ensure you have enough space in the tmp location.
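A sketch; the scratch path is a placeholder:

```
# In conf/spark-defaults.conf (or pass via --conf on the command line):
# point Spark's scratch space at a disk with plenty of free space
spark.local.dir /mnt/bigdisk/spark-tmp
```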

Pyspark doesn’t support all the data types.

When using Arrow to transport data between JVM and Python memory, Arrow may throw the error below if the types aren't compatible with the existing converters. Fixes may come in the future from the Arrow project. I'm keeping this here to show how pyspark gets data from the JVM and what can go wrong in that process.

Work with spark standalone cluster manager

Start the spark clustering in standalone mode

Once you have downloaded the same version of the Spark binary across the machines, you can start the Spark master and slave processes to form the standalone Spark cluster. You could also run both these services on the same machine.
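A sketch, run from the Spark install directory; the master hostname is a placeholder:

```shell
# On the master machine (web UI on :8080, master URL spark://<host>:7077)
./sbin/start-master.sh

# On each worker machine (spark 2.4.x still names workers "slaves")
./sbin/start-slave.sh spark://master-host:7077
```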

In standalone mode:

  1. Worker can have multiple executors.

  2. Worker is like a node manager in yarn.

  3. We can set worker max core and memory usage settings.

  4. When defining the spark application via spark-shell or so, define the executor memory and cores.

When submitting the job, to get 10 executors with 1 CPU and 2 GB RAM each:
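A sketch; the master URL is a placeholder. Standalone mode has no --num-executors flag, so cap the total cores instead: 10 total cores at 1 core per executor yields 10 executors.

```shell
spark-shell --master spark://master-host:7077 \
  --executor-cores 1 \
  --executor-memory 2g \
  --total-executor-cores 10
```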

This page will be updated as and when I see some reusable snippets of code for Spark operations.


