Filtering joins keep cases from the left-hand data frame. semi_join() returns all rows from x where there are matching values in y, keeping just the columns from x. A semi join differs from an inner join because an inner join returns one row of x for each matching row of y, whereas a semi join never duplicates rows of x.
I am trying to join two data frames using dplyr. Neither data frame has a unique key column. The closest equivalent of the key column is the dates variable of monthly data. Each df has multiple entries per month, so the dates column has lots of duplicates.
I was able to find a solution from Stack Overflow, but I am having a really difficult time understanding that solution. Can you help me find a simpler solution that is easier for beginner level users to understand?
Here is a simple reproducible example:
Notice that rows 2 & 3 in df_1 both refer to '2018-06-01' (i.e. a duplicate in the key column, other columns have different data)
If I do a simple left_join, I get this:
I want a joined data frame that is something like this:
Here is the Stack Overflow solution that seems to match exactly what I am looking for:
Is it possible to create a solution that (a) is a bit easier for beginners to understand and (b) uses the purrr package or some other tidyverse solution?
Thanks in advance for any comments and guidance.
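One approach that is often suggested for this kind of problem is to number the rows within each date and join on the date plus that number. The sketch below assumes the goal is to pair duplicate dates in order of appearance; the tables df_1 and df_2, their columns a and b, and the helper add_idx() are all hypothetical, since the original example data is not shown.

```r
library(dplyr)

# Hypothetical data: two tables with a duplicated date in the key column.
df_1 <- tibble(
  dates = as.Date(c("2018-05-01", "2018-06-01", "2018-06-01")),
  a     = c(10, 20, 30)
)
df_2 <- tibble(
  dates = as.Date(c("2018-05-01", "2018-06-01", "2018-06-01")),
  b     = c("p", "q", "r")
)

# Hypothetical helper: number the rows within each date,
# so that (dates, idx) is unique in each table.
add_idx <- function(df) {
  df %>%
    group_by(dates) %>%
    mutate(idx = row_number()) %>%
    ungroup()
}

# Join on both columns, then drop the helper index.
joined <- left_join(add_idx(df_1), add_idx(df_2), by = c("dates", "idx")) %>%
  select(-idx)
```

Because the pair (dates, idx) is unique, each row of df_1 matches at most one row of df_2, so the join no longer multiplies the duplicated dates.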
It’s rare that a data analysis involves only a single table of data. In practice, you’ll normally have many tables that contribute to an analysis, and you need flexible tools to combine them. In dplyr, there are three families of verbs that work with two tables at a time:
Mutating joins, which add new variables to one table from matching rows in another.
Filtering joins, which filter observations from one table based on whether or not they match an observation in the other table.
Set operations, which combine the observations in the data sets as if they were set elements.
(This discussion assumes that you have tidy data, where the rows are observations and the columns are variables. If you’re not familiar with that framework, I’d recommend reading up on it first.)
All two-table verbs work similarly. The first two arguments are x and y, and provide the tables to combine. The output is always a new table with the same type as x.
Mutating joins allow you to combine variables from multiple tables. For example, take the nycflights13 data. In one table we have flight information with an abbreviation for carrier, and in another we have a mapping between abbreviations and full names. You can use a join to add the carrier names to the flight data:
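A minimal sketch of that join, using small hypothetical stand-ins (flights_small and airlines_small) rather than the full nycflights13 tables:

```r
library(dplyr)

# Hypothetical stand-ins for the nycflights13 flights and airlines tables.
flights_small <- tibble(
  flight  = c(1545L, 1714L, 725L),
  carrier = c("UA", "UA", "B6")
)
airlines_small <- tibble(
  carrier = c("UA", "B6", "AA"),
  name    = c("United Air Lines Inc.", "JetBlue Airways", "American Airlines Inc.")
)

# Add the full carrier name to every flight; all rows of flights_small are kept.
with_names <- flights_small %>%
  left_join(airlines_small, by = "carrier")
```

The result has one row per flight, with the new name column looked up from the carrier abbreviation.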
Controlling how the tables are matched
As well as x and y, each mutating join takes an argument by that controls which variables are used to match observations in the two tables. There are a few ways to specify it, as I illustrate below with various tables from nycflights13:
NULL, the default. dplyr will use all variables that appear in both tables, a natural join. For example, the flights and weather tables match on their common variables: year, month, day, hour and origin.
A character vector, by = 'x'. Like a natural join, but uses only some of the common variables. For example, flights and planes have year columns, but they mean different things so we only want to join by tailnum.
Note that the year columns in the output are disambiguated with a suffix.
A named character vector: by = c('x' = 'a'). This will match variable x in table x to variable a in table y. The variables from x will be used in the output.
Each flight has an origin and destination airport, so we need to specify which one we want to join to:
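The three forms of by can be sketched as follows. The tables here are small hypothetical miniatures whose column names mirror nycflights13; they are not the real data.

```r
library(dplyr)

# Hypothetical miniature tables; column names mirror nycflights13.
flights_small <- tibble(
  year    = 2013L,
  tailnum = c("N14228", "N24211"),
  origin  = c("EWR", "LGA"),
  dest    = c("IAH", "IAH")
)
planes_small <- tibble(
  tailnum = c("N14228", "N24211"),
  year    = c(1999L, 1998L)  # year of manufacture, not year of flight
)
airports_small <- tibble(
  faa  = c("EWR", "LGA", "IAH"),
  name = c("Newark Liberty Intl", "La Guardia", "George Bush Intercontinental")
)

# by = NULL (the default): natural join on every shared column (year and tailnum).
# Nothing matches here, because the two year columns mean different things.
natural <- suppressMessages(inner_join(flights_small, planes_small))

# Character vector: join on tailnum only; the clashing year columns get suffixes.
by_tail <- left_join(flights_small, planes_small, by = "tailnum")

# Named character vector: match dest in flights_small to faa in airports_small.
by_dest <- left_join(flights_small, airports_small, by = c("dest" = "faa"))
```

In by_tail the two year columns appear as year.x and year.y, and in by_dest the key keeps the name from the left table (dest), not faa.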
Types of join
There are four types of mutating join, which differ in their behaviour when a match is not found. We’ll illustrate each with a simple example:
inner_join(x, y) only includes observations that match in both x and y.
x  y  a   b
1  2  10  a
left_join(x, y) includes all observations in x, regardless of whether they match or not. This is the most commonly used join because it ensures that you don’t lose observations from your primary table.
right_join(x, y) includes all observations in y. It’s equivalent to left_join(y, x), but the columns and rows will be ordered differently.
full_join() includes all observations from x and y.
The left, right and full joins are collectively known as outer joins. When a row doesn’t match in an outer join, the new variables are filled in with missing values.
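To compare the four joins side by side, here is a sketch with two small tibbles. The inputs df1 and df2 are assumed; they are chosen so that the inner join yields the single row x = 1, y = 2, a = 10, b = a.

```r
library(dplyr)

# Assumed inputs: only x = 1 appears in both tables.
df1 <- tibble(x = c(1, 2), y = 2:1)
df2 <- tibble(x = c(3, 1), a = 10, b = "a")

inner <- inner_join(df1, df2, by = "x")  # only the matching key, x = 1
left  <- left_join(df1, df2, by = "x")   # all rows of df1: x = 1 and x = 2
right <- right_join(df1, df2, by = "x")  # all rows of df2: x = 1 and x = 3
full  <- full_join(df1, df2, by = "x")   # every key from either table

# In the outer joins, unmatched rows get NA in the columns from the other table,
# for example a and b for x = 2 in the left join.
```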
While mutating joins are primarily used to add new variables, they can also generate new observations. If a match is not unique, a join will add all possible combinations (the Cartesian product) of the matching observations:
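A small sketch of this effect, using two hypothetical tibbles in which the key x = 1 appears twice on each side:

```r
library(dplyr)

df1 <- tibble(x = c(1, 1, 2), y = 1:3)
df2 <- tibble(x = c(1, 1, 2), z = c("a", "b", "a"))

# Each x = 1 row in df1 matches both x = 1 rows in df2, giving 2 * 2 = 4 rows,
# plus one row for x = 2: five rows in total.
# Recent dplyr versions may warn about a many-to-many relationship here,
# so the warning is suppressed in this sketch.
combos <- suppressWarnings(left_join(df1, df2, by = "x"))
```

The output has more rows than either input, which is worth checking for whenever the join key is not unique.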
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
semi_join(x, y) keeps all observations in x that have a match in y.
anti_join(x, y) drops all observations in x that have a match in y.
These are most useful for diagnosing join mismatches. For example, there are many flights in the nycflights13 dataset that don’t have a matching tail number in the planes table:
If you’re worried about what observations your joins will match, start with a semi_join() or anti_join(). semi_join() and anti_join() never duplicate; they only ever remove observations.
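As an illustration of both filtering joins, the sketch below uses hypothetical miniature tables (flights_small and planes_small) in place of the nycflights13 data:

```r
library(dplyr)

# Hypothetical miniature tables; tail number "N9" has no entry in planes_small.
flights_small <- tibble(flight = 1:4, tailnum = c("N1", "N2", "N2", "N9"))
planes_small  <- tibble(tailnum = c("N1", "N2"), seats = c(150L, 180L))

# Flights whose tail number is known: rows are filtered, no columns are added.
matched <- semi_join(flights_small, planes_small, by = "tailnum")

# Flights whose tail number is missing from planes_small.
unmatched <- anti_join(flights_small, planes_small, by = "tailnum")
```

Together, matched and unmatched partition flights_small: every row lands in exactly one of them, and neither result contains the seats column.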
The final type of two-table verb is set operations. These expect the x and y inputs to have the same variables, and treat the observations like sets:
intersect(x, y): return only observations in both x and y.
union(x, y): return unique observations in x and y.
setdiff(x, y): return observations in x, but not in y.
Given this simple data:
The four possibilities are:
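A sketch of the set operations on two hypothetical tibbles with the same columns. The data is assumed for illustration, and setdiff() is shown in both directions because it is not symmetric:

```r
library(dplyr)

# Hypothetical data: the tables share the row (1, 1) and differ in the second row.
df1 <- tibble(x = 1:2, y = c(1L, 1L))
df2 <- tibble(x = 1:2, y = 1:2)

both     <- intersect(df1, df2)  # rows present in both tables
either   <- union(df1, df2)      # unique rows from the two tables combined
only_df1 <- setdiff(df1, df2)    # rows in df1 that are not in df2
only_df2 <- setdiff(df2, df1)    # rows in df2 that are not in df1
```

With dplyr loaded, these calls dispatch to the data frame methods, which compare whole rows rather than individual values.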
dplyr does not provide any functions for working with three or more tables. Instead use Reduce(), as described in Advanced R, to iteratively combine the two-table verbs to handle as many tables as you need.
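A minimal sketch of that pattern, assuming a hypothetical list of three tables that share an id column:

```r
library(dplyr)

# Hypothetical tables sharing the key column id.
tables <- list(
  tibble(id = 1:3, a = c("x", "y", "z")),
  tibble(id = 1:3, b = c(10, 20, 30)),
  tibble(id = c(1L, 3L), c = c(TRUE, FALSE))
)

# Reduce() applies the two-table join pairwise, from left to right:
# left_join(left_join(tables[[1]], tables[[2]]), tables[[3]]).
combined <- Reduce(function(acc, tbl) left_join(acc, tbl, by = "id"), tables)
```

The same idea works with any two-table verb; only the function passed to Reduce() changes.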