2020/05/04

Motivation

I use R to extract data held in Microsoft SQL Server databases on a daily basis.

When I first started, I was confused by all the different ways to accomplish this task. I was a bit overwhelmed trying to choose the “best” option for the specific job at hand.

I want to share the approaches I’ve landed on, to help others who want a simple list of options to get started with.

Scope

This post is about reading data from a database, not writing to one.

I prefer to use packages in the tidyverse so I’ll focus on those packages.

While it’s possible to generalize many of the concepts I write about here to other database management systems, I will focus exclusively on Microsoft SQL Server. I hope this will provide simple, prescriptive guidance for those working in a similar configuration.

The data for these examples is stored using Microsoft SQL Server Express. Free download available here.

One last thing - these are a few options I populated my toolbox with. They have served me well over the past two years as an analyst in an enterprise environment, but are definitely not the only options available.

Setup

Connect to the server

I use the keyring package to keep my credentials out of my R code. You can use the great documentation available from RStudio to learn how to do the same.
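A minimal sketch of what the connection step might look like, not necessarily the exact code used for this post. The helper name `connect_mssql`, the driver string, the server and database names, and the keyring service name are all placeholders chosen for illustration. It assumes the password was stored beforehand with `keyring::key_set()`.

```r
# Hypothetical helper that opens a DBI connection to SQL Server via odbc.
# Every default below is a placeholder; substitute your own values.
connect_mssql <- function(server   = "localhost\\SQLEXPRESS",
                          database = "my_database",
                          user     = "my_user",
                          service  = "mssql") {
  DBI::dbConnect(
    odbc::odbc(),
    Driver   = "ODBC Driver 17 for SQL Server",
    Server   = server,
    Database = database,
    UID      = user,
    # key_get() reads the secret from the OS credential store, so the
    # password never appears in the script itself
    PWD      = keyring::key_get(service = service, username = user)
  )
}

# Usage (needs a reachable server and a stored credential):
# con <- connect_mssql()
```

Wrapping the call in a function keeps the connection details in one place and makes it easy to reconnect after a dropped session.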

Write some sample data

Note that I set the temporary argument to TRUE so that the data is written to tempdb on SQL Server, which means it is deleted on disconnection.

This results in dplyr prefixing the table name with “##”.
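As a sketch of this step, the helper below copies two tables from the nycflights13 package to the server with `copy_to()`. The helper name `write_sample_tables` is hypothetical, and the choice of the `flights` and `planes` tables is an assumption based on the tables mentioned in the examples later in this post. The prefix that gets added to temporary table names may differ between dbplyr versions.

```r
library(dplyr)

# Hypothetical helper: write sample data as temporary tables.
# `con` is assumed to be an open DBI connection to SQL Server.
write_sample_tables <- function(con) {
  # temporary = TRUE asks for temp tables that disappear on disconnect
  copy_to(con, nycflights13::flights, name = "flights", temporary = TRUE)
  copy_to(con, nycflights13::planes,  name = "planes",  temporary = TRUE)
  invisible(con)
}

# Usage, given an open connection:
# write_sample_tables(con)
```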

SOURCE: https://db.rstudio.com/dplyr/#connecting-to-the-database

Option 1: Use dplyr syntax and let dbplyr handle the rest

When I use this option

This is my default option.

I do almost all of my analysis in R and this avoids fragmenting my work and thoughts across different tools.

Examples

Example 1: filter rows, and retrieve selected columns
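A minimal sketch of this pattern. So that it runs without a live server, it uses `lazy_frame()` with `simulate_mssql()` from dbplyr as a stand-in for `tbl(con, "flights")`; the column values and the filter thresholds are made up for illustration. With a real connection you would start from `tbl()` and finish with `collect()` to pull the rows into R.

```r
library(dplyr)
library(dbplyr)

# Stand-in for tbl(con, "flights"): a lazy table backed by a simulated
# SQL Server connection, so SQL can be generated without a server.
flights_db <- lazy_frame(
  month = 1L, carrier = "UA", tailnum = "N14228", dep_delay = 2,
  con = simulate_mssql()
)

# Nothing is executed here; dbplyr only builds up a query
query <- flights_db %>%
  filter(month == 1, dep_delay > 60) %>%
  select(carrier, tailnum, dep_delay)

# Print the SQL that would be sent to the server
show_query(query)
```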

Example 2: join across tables and retrieve selected columns
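A sketch of a join pushed down to the database, again using simulated tables so it runs anywhere. The join key `tailnum` comes from the nycflights13 schema; the other columns and values are illustrative assumptions.

```r
library(dplyr)
library(dbplyr)

# One simulated SQL Server connection shared by both stand-in tables
sim_con <- simulate_mssql()

flights_db <- lazy_frame(
  tailnum = "N14228", carrier = "UA", dep_delay = 2,
  con = sim_con
)
planes_db <- lazy_frame(
  tailnum = "N14228", manufacturer = "BOEING", model = "737-824",
  con = sim_con
)

# The join and the column selection are both translated to SQL
query <- flights_db %>%
  inner_join(planes_db, by = "tailnum") %>%
  select(carrier, tailnum, manufacturer, model)

show_query(query)
```

Because the join runs on the server, only the selected columns of the matching rows travel over the network once the result is collected.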

Example 3: Summarize and count
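One way such a summary might be written, assuming the goal is to count flights whose `tailnum` has no match in `planes`. This is a guess at the shape of the query, not the original listing, and the tables are simulated stand-ins with made-up values.

```r
library(dplyr)
library(dbplyr)

sim_con <- simulate_mssql()
flights_db <- lazy_frame(tailnum = "N14228", carrier = "UA", con = sim_con)
planes_db  <- lazy_frame(tailnum = "N14228", model = "737-824", con = sim_con)

# anti_join() keeps the flights rows that have no matching plane;
# summarise() then aggregates them on the server
query <- flights_db %>%
  anti_join(planes_db, by = "tailnum") %>%
  summarise(
    unmatched_tailnums = n_distinct(tailnum),
    unmatched_flights  = n()
  )

show_query(query)
```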

Quite a few tailnum values in flights are not present in planes. Interesting!

Option 2: Write SQL syntax and have dplyr and dbplyr run the query

When I use this option

I use this option when I am reusing a fairly short, existing SQL query with minor modifications.

Example 1: Simple selection of records using SQL syntax
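A sketch of passing raw SQL through `tbl()` and `sql()`. To keep it runnable without SQL Server, it uses an in-memory SQLite database as a stand-in (an assumption made only for this illustration) and a few made-up rows. Against SQL Server, only the `dbConnect()` call and the table name would change.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Stand-in database with made-up rows (SQLite instead of SQL Server)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "flights", data.frame(
  month     = c(1L, 1L, 2L),
  carrier   = c("UA", "AA", "UA"),
  tailnum   = c("N14228", "N619AA", "N24211"),
  dep_delay = c(2, 75, 90)
))

# sql() marks the string as literal SQL; tbl() wraps it as a lazy table
january <- tbl(con, sql(
  "SELECT carrier, tailnum, dep_delay FROM flights WHERE month = 1"
))

# collect() runs the query and returns the rows as a tibble
result <- collect(january)
dbDisconnect(con)
```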

Example 2: Use dplyr syntax to enhance a raw SQL query
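A sketch of layering dplyr verbs on top of a raw SQL query. As an assumption for illustration, it uses an in-memory SQLite database with made-up rows in place of SQL Server. dbplyr wraps the raw query as a subquery, so the SQL string should not end with a semicolon.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Stand-in database with made-up rows (SQLite instead of SQL Server)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "flights", data.frame(
  carrier   = c("UA", "AA", "UA"),
  tailnum   = c("N14228", "N619AA", "N24211"),
  dep_delay = c(2, 75, 90)
))

# Start from an existing SQL query ...
base_query <- tbl(con, sql("SELECT carrier, tailnum, dep_delay FROM flights"))

# ... then refine it with dplyr; the verbs are translated to SQL and
# applied around the raw query on the database side
result <- base_query %>%
  filter(dep_delay > 60) %>%
  count(carrier) %>%
  collect()

dbDisconnect(con)
```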

Option 3: Store the SQL query in a text file and have dplyr and dbplyr run the query

When I use this option

I use this approach under the following conditions:

  1. I’m reusing existing SQL code, or collaborating with someone who will be writing new code in SQL
  2. The SQL code is longer than a line or two

I prefer to “modularize” my R code. Having an extremely long SQL statement in my R code doesn’t abstract away the complexity of the SQL query. Putting the query into its own file helps achieve my desired level of abstraction.

In conjunction with source control, it makes tracking changes to the definition of a data set simple.

More importantly, it’s a really useful way to collaborate with others who are comfortable with SQL but don’t use R. For example, I recently used this approach on a project involving aggregation of multiple data sets. Another team member focused on building out the data collection logic for some of the data sets in SQL. Once he had them built and validated, he handed off the query to me and I pasted it into a text file.

Step 1: Put your SQL code into a text file

Here is some example SQL code that might be in a file

Let’s say that SQL code was stored in a text file called flights.sql.

Step 2: Use the SQL code in the file to execute the query and retrieve the data.
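A sketch of the file-based workflow under a few assumptions: the query text below is hypothetical, the file is written to a temporary directory so the example is self-contained, and an in-memory SQLite database with made-up rows stands in for SQL Server.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Stand-in database with made-up rows (SQLite instead of SQL Server)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "flights", data.frame(
  carrier   = c("UA", "AA", "UA"),
  tailnum   = c("N14228", "N619AA", "N24211"),
  dep_delay = c(2, 75, 90)
))

# Step 1 stand-in: create flights.sql with a hypothetical query.
# In practice this file would already exist in the project.
sql_path <- file.path(tempdir(), "flights.sql")
writeLines(c(
  "SELECT carrier, tailnum, dep_delay",
  "FROM flights",
  "WHERE dep_delay > 60"
), sql_path)

# Step 2: read the file into a single string and hand it to dbplyr
query_text <- paste(readLines(sql_path), collapse = "\n")
result <- tbl(con, sql(query_text)) %>%
  collect()

dbDisconnect(con)
```

Keeping the query in its own file means the R script only changes when the processing logic changes, while edits to the SQL show up as separate diffs in source control.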