We will practice on our continents data.frame from module 2 and the gapminder data.frame. Note how these are tidy data: we have observations at the level of continent and at the level of country, so they go in different tables. The continent column in the gapminder data.frame allows us to link them now. If the continents data.frame isn’t in your Environment, load it and recall what it consists of:
We can join the two data.frames using any of the dplyr functions. We will pass the results to str to avoid printing more than we can read, and to get more high-level information on the resulting data.frames.
These operations produce slightly different results, either 1704 or 1705 observations. Can you figure out why? Antarctica contains no countries so doesn’t appear in the gapminder data.frame. When we use left_join it gets filtered from the results, but when we use right_join it appears, with missing values for all of the country-level variables:
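The difference between the two joins can be reproduced on a small scale. The sketch below uses made-up stand-in tables (countries_toy and continents_toy are hypothetical names, and the values are illustrative only), assuming the country-level table is passed as the first argument:

```r
library(dplyr)

# Toy stand-ins for the lesson's tables; names and values are illustrative only
continents_toy <- data.frame(
  continent = c("Africa", "Asia", "Antarctica"),
  area_km2  = c(30370000, 44579000, 14000000),
  stringsAsFactors = FALSE
)
countries_toy <- data.frame(
  country   = c("Kenya", "Japan", "India"),
  continent = c("Africa", "Asia", "Asia"),
  stringsAsFactors = FALSE
)

# left_join keeps every row of the first table (countries): Antarctica is dropped
left <- left_join(countries_toy, continents_toy, by = "continent")

# right_join keeps every row of the second table (continents): Antarctica appears
right <- right_join(countries_toy, continents_toy, by = "continent")

nrow(left)   # 3
nrow(right)  # 4
right[right$continent == "Antarctica", ]  # country is NA for this row
```

The one-row difference here mirrors the 1704 versus 1705 observations in the full data.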
In the previous R syntax, I applied an inner join, but of course you could also use a right, left, or full join in this step-by-step approach. Also note that there is a very smooth way to merge multiple data frames simultaneously by combining these data frames in a list. You can learn more about this approach in this tutorial. I want to left join in Stata, but due to multiple identifiers all other options except m:m don't work. Joining by 'joinby' also led to wrong merging. Is there an equivalent of R's joinby in Stata?
- The R code implementation of these additional joins:
  - Outer join: jointdataset <- merge(ChickWeight, LabResults, by = 'Diet', all = TRUE)
  - Left join: jointdataset <- merge(ChickWeight, LabResults, by = 'Diet', all.x = TRUE)
  - Right join: jointdataset <- merge(ChickWeight, LabResults, by = 'Diet', all.y = TRUE)
  - Cross join: jointdataset <- merge(ChickWeight, LabResults, by = NULL)
- In SQL database terminology, the default value of all = FALSE gives a natural join, a special case of an inner join. Specifying all.x = TRUE gives a left (outer) join, all.y = TRUE a right (outer) join, and both (all = TRUE) a (full) outer join. DBMSes do not match NULL records, equivalent to incomparables = NA in R.
There’s another problem in this data.frame – it has two population measures, one by continent and one by country and it’s not clear which is which! Let’s rename a couple of columns.
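A minimal sketch of such a rename with dplyr, assuming the country-level measure is called pop and the continent-level one population (both the column names and the new names are assumptions made for illustration):

```r
library(dplyr)

# Hypothetical joined table with two ambiguous population columns
joined_toy <- data.frame(
  country    = c("Kenya", "Japan"),
  continent  = c("Africa", "Asia"),
  pop        = c(35610177, 127467972),  # assumed to be country-level
  population = c(1.0e9, 4.0e9),         # assumed to be continent-level
  stringsAsFactors = FALSE
)

# rename() takes new_name = old_name pairs
renamed <- rename(joined_toy, country_pop = pop, continent_pop = population)
names(renamed)
```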
Challenge – Putting the pieces together
A colleague suggests that the more land area an individual has, the greater their GDP will be and that this relationship will be observable at any scale of observation. You chuckle and mutter “Not at the continental scale,” but your colleague insists. Test your colleague’s hypothesis by:
- Calculating the total GDP of each continent,
- Hint: Use
- Hint: Use
- Joining the resulting data.frame to the
- Calculating the per-capita GDP for each continent, and
- Plotting per-capita GDP versus population density.
Solution to Challenge – Putting the pieces together
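One possible shape for a solution is sketched below on toy data. This is not the lesson's own answer: the table names, the column names (gdpPercap, pop, area_km2, population) and all values are assumptions, and with the real gapminder data you would probably filter to a single year first.

```r
library(dplyr)

# Toy country-level and continent-level tables (all values are made up)
gap_toy <- data.frame(
  country   = c("A", "B", "C"),
  continent = c("X", "X", "Y"),
  pop       = c(100, 300, 200),
  gdpPercap = c(10, 20, 5),
  stringsAsFactors = FALSE
)
cont_toy <- data.frame(
  continent  = c("X", "Y"),
  area_km2   = c(40, 100),
  population = c(400, 200),
  stringsAsFactors = FALSE
)

# 1. Total GDP per continent: country GDP = per-capita GDP * population
continent_gdp <- gap_toy %>%
  mutate(gdp = gdpPercap * pop) %>%
  group_by(continent) %>%
  summarize(total_gdp = sum(gdp))

# 2. Join to the continent table, then 3. derive per-capita GDP and density
result <- continent_gdp %>%
  left_join(cont_toy, by = "continent") %>%
  mutate(
    gdp_per_capita = total_gdp / population,
    pop_density    = population / area_km2
  )

# 4. Plot per-capita GDP against population density
plot(result$pop_density, result$gdp_per_capita,
     xlab = "Population density", ylab = "Per-capita GDP")
```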
This lesson is adapted from the Software Carpentry: R for Reproducible Scientific Analysis Multi-Table Joins materials and Brandon Hurr’s dplyr II: Joins and Set Ops presentation to the Davis R Users Group on February 2, 2016.
October 27, 2018
In this post in the R:case4base series we will look at one of the most common operations on multiple data frames - merge, also known as JOIN in SQL terms.
We will learn how to do the 4 basic types of join - inner, left, right and full join with base R and show how to perform the same with tidyverse’s dplyr and data.table’s methods. A quick benchmark will also be included.
To showcase the merging, we will use a very slightly modified dataset provided by Hadley Wickham’s nycflights13 package, mainly the
weather data frames. Let’s get right into it and simply show how to perform the different types of joins with base R.
First, we prepare the data and store the columns we will merge by (join on) into
Now, we show how to perform the 4 merges (joins):
Left (outer) join
Full (outer) join
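As a rough illustration of the four join types with base merge, here is a sketch on two tiny made-up data frames (flights_toy and weather_toy are hypothetical stand-ins, not the nycflights13 data):

```r
# Toy data: "LGA" has no weather row, "XXX" has no flight row
flights_toy <- data.frame(
  origin    = c("EWR", "JFK", "LGA"),
  dep_delay = c(2, 11, -3),
  stringsAsFactors = FALSE
)
weather_toy <- data.frame(
  origin = c("EWR", "JFK", "XXX"),
  temp   = c(39.0, 41.2, 50.0),
  stringsAsFactors = FALSE
)

# Inner join: only origins present in both data frames
inner <- merge(flights_toy, weather_toy, by = "origin")

# Left (outer) join: all flights, NA temp where no weather matches
left <- merge(flights_toy, weather_toy, by = "origin", all.x = TRUE)

# Right (outer) join: all weather rows, NA dep_delay where no flight matches
right <- merge(flights_toy, weather_toy, by = "origin", all.y = TRUE)

# Full (outer) join: every row from both sides
full <- merge(flights_toy, weather_toy, by = "origin", all = TRUE)

sapply(list(inner = inner, left = left, right = right, full = full), nrow)
```

On this toy data the row counts come out as 2, 3, 3 and 4 respectively.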
The key arguments of the base merge data.frame method are:
- x, y - the 2 data frames to be merged
- by - names of the columns to merge on. If the column names are different in the two data frames to merge, we can specify by.x and by.y with the names of the columns in the respective data frames. The by argument can also be specified by number, logical vector or left unspecified, in which case it defaults to the intersection of the names of the two data frames. As a best practice, it is advisable to always specify the argument explicitly, ideally by column names.
- all, all.x, all.y - default to FALSE and can be used to specify the type of join we want to perform:
  - all = FALSE (the default) - gives an inner join - combines the rows in the two data frames that match on the by columns
  - all.x = TRUE - gives a left (outer) join - adds rows that are present in x, even though they do not have a matching row in y, to the result for all = FALSE
  - all.y = TRUE - gives a right (outer) join - adds rows that are present in y, even though they do not have a matching row in x, to the result for all = FALSE
  - all = TRUE - gives a full (outer) join. This is a shorthand for all.x = TRUE and all.y = TRUE
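To make the by.x and by.y arguments concrete, the following sketch merges two made-up data frames whose key columns have different names (the data frames and column names are hypothetical):

```r
# Key column is "tail_number" on one side and "tailnum" on the other
flights_toy <- data.frame(
  tail_number = c("N1", "N3"),
  distance    = c(500, 1200),
  stringsAsFactors = FALSE
)
planes_toy <- data.frame(
  tailnum = c("N1", "N2"),
  seats   = c(150, 200),
  stringsAsFactors = FALSE
)

# Left join on differently named key columns; the result keeps x's key name
merged <- merge(
  flights_toy, planes_toy,
  by.x = "tail_number", by.y = "tailnum",
  all.x = TRUE
)
merged
```

The unmatched flight ("N3") is kept with NA for seats, and the unmatched plane ("N2") is dropped, as expected for all.x = TRUE.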
Other arguments include:
- sort - if TRUE (default), results are sorted on the by columns
- suffixes - length 2 character vector, specifying the suffixes used to make the names of columns in the result which are not used for merging unique
- incomparables - for single-column merging only, a vector of values that cannot be matched. Any value in x matching a value in this vector is assigned the nomatch value (which can be passed using
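A short sketch of suffixes and sort on made-up data, where both data frames carry a non-key column named year (the suffix strings are arbitrary choices for illustration):

```r
x <- data.frame(id = c(2, 1), year = c(2013, 2013))
y <- data.frame(id = c(1, 2), year = c(1999, 2004))

# Both inputs have a non-key "year" column, so merge must disambiguate them.
# The default suffixes are ".x" and ".y"; here we choose more descriptive ones.
m <- merge(x, y, by = "id", suffixes = c("_flight", "_plane"))
names(m)  # "id" "year_flight" "year_plane"
m$id      # sorted on the by column because sort = TRUE is the default

# With sort = FALSE the row order is left unspecified by the documentation
u <- merge(x, y, by = "id", sort = FALSE)
nrow(u)
```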
For this example, let us have a list of all the data frames included in the nycflights13 package, slightly updated such that they can be merged with the default value for by, purely for this exercise, and store them into a list called
Since merge is designed to work with 2 data frames, merging multiple data frames can of course be achieved by nesting the calls to merge:
We can however achieve this same goal much more elegantly, taking advantage of base R’s
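One base R function that fits this description is Reduce. The function name is missing from the text above, so treat that choice as an assumption. The sketch below compares nested merge calls with a Reduce over a hypothetical list of toy data frames:

```r
# A hypothetical list of data frames sharing an "id" column
dfs <- list(
  data.frame(id = 1:3, a = c("x", "y", "z"), stringsAsFactors = FALSE),
  data.frame(id = 2:4, b = c(10, 20, 30)),
  data.frame(id = 1:4, flag = c(TRUE, FALSE, TRUE, FALSE))
)

# Nested calls: readable for 3 data frames, unwieldy for many
nested <- merge(merge(dfs[[1]], dfs[[2]], by = "id"), dfs[[3]], by = "id")

# Reduce folds merge over the list: merge(merge(df1, df2), df3), and so on
reduced <- Reduce(function(x, y) merge(x, y, by = "id"), dfs)

identical(nested, reduced)
```

Passing by explicitly inside the anonymous function avoids relying on the default intersection of column names.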
Note that this example is oversimplified and the data was updated such that the default values for by give meaningful joins. For example, in the original planes data frame the column year would have been matched onto the year column of the flights data frame, which is nonsensical as the years have different meanings in the two data frames. This is why we renamed the year column in the planes data frame to yearmanufactured for the above example.
Using the tidyverse
The dplyr package comes with a set of very user-friendly functions that seem quite self-explanatory:
We can also use the “forward pipe” operator %>% that becomes very convenient when merging multiple data frames:
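A brief sketch of the dplyr join functions and a piped chain, using made-up stand-ins for the flights, weather and airlines tables (names and values are illustrative only):

```r
library(dplyr)

flights_toy <- data.frame(
  carrier   = c("UA", "AA", "B6"),
  origin    = c("EWR", "JFK", "LGA"),
  dep_delay = c(2, 11, -3),
  stringsAsFactors = FALSE
)
weather_toy <- data.frame(
  origin = c("EWR", "JFK"),
  temp   = c(39.0, 41.2),
  stringsAsFactors = FALSE
)
airlines_toy <- data.frame(
  carrier = c("UA", "AA", "DL"),
  name    = c("United", "American", "Delta"),
  stringsAsFactors = FALSE
)

# Individual joins; the function name states the join type
inner <- inner_join(flights_toy, weather_toy, by = "origin")
full  <- full_join(flights_toy, airlines_toy, by = "carrier")

# Chaining several joins with the pipe reads top to bottom
chained <- flights_toy %>%
  left_join(weather_toy, by = "origin") %>%
  left_join(airlines_toy, by = "carrier")

dim(chained)
```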
The data.table package provides an S3 method for the merge generic that has a very similar structure to the base method for data frames, meaning its use is very convenient for those familiar with that method. In fact, the code is exactly the same as the base one for our example use.
One important difference worth noting is that the by argument is by default constructed differently with data.table. However, we provide it explicitly, so this difference does not directly affect our example:
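A sketch of the same merge calls applied to data.table objects, on made-up data (flights_dt and weather_dt are hypothetical names):

```r
library(data.table)

flights_dt <- data.table(
  origin    = c("EWR", "JFK", "LGA"),
  dep_delay = c(2, 11, -3)
)
weather_dt <- data.table(
  origin = c("EWR", "JFK", "XXX"),
  temp   = c(39.0, 41.2, 50.0)
)

# Same call shape as base merge; dispatch picks the data.table method
inner_dt <- merge(flights_dt, weather_dt, by = "origin")
left_dt  <- merge(flights_dt, weather_dt, by = "origin", all.x = TRUE)
right_dt <- merge(flights_dt, weather_dt, by = "origin", all.y = TRUE)
full_dt  <- merge(flights_dt, weather_dt, by = "origin", all = TRUE)

class(full_dt)  # the result is still a data.table
```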
Alternatively, we can write data.table joins as subsets:
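In the subset form, X[Y, on = ...] looks up each row of Y in X, so every row of Y is kept. The sketch below uses made-up data and assumes a data.table version that accepts nomatch = NULL:

```r
library(data.table)

flights_dt <- data.table(
  origin    = c("EWR", "JFK", "LGA"),
  dep_delay = c(2, 11, -3)
)
weather_dt <- data.table(
  origin = c("EWR", "JFK", "XXX"),
  temp   = c(39.0, 41.2, 50.0)
)

# Keeps every row of flights_dt (the table inside the brackets):
# effectively a left join of flights onto weather
left_sub <- weather_dt[flights_dt, on = "origin"]

# nomatch = NULL drops rows without a match, giving an inner join
inner_sub <- weather_dt[flights_dt, on = "origin", nomatch = NULL]

left_sub
```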
For a quick overview, let's look at a basic benchmark without package loading overhead for each of the mentioned packages:
Full (outer) join
Visualizing the results in this case shows base R comes way behind the two alternatives, even with sort = FALSE.
Note: The benchmarks were run on a standard droplet by DigitalOcean, with 2 GB of memory and 2 vCPUs.
- Animated inner join, left join, right join and full join by Garrick Aden-Buie for an easier understanding
- Joining Data in R with dplyr by William Surles
- Join (SQL) Wikipedia page
- The nycflights13 package on CRAN
Exactly 100 years ago tomorrow, October 28th, 1918 the independence of Czechoslovakia was proclaimed by the Czechoslovak National Council, resulting in the creation of the first democratic state of Czechs and Slovaks in history.