VizWhiz 1
Before we begin, a preamble…
Why plot raw data?
Until I1 started learning R about a year ago, I had only ever plotted averaged data. My plots looked like this…
Or this….
In 2016, I remember seeing #barbarplots t-shirts and tote bags at a conference, but back then I didn’t know that the figures were made in R.
Check out this video for more info about the #barbarplots kickstarter campaign. And learn how to explode bar plots using code from here.
So, if not bar plots, then what?
Bar plots with standard error bars obviously have their limitations, but what is the alternative? In this lesson, we’re going to learn about how to plot raw data using ggplot2
. As in prior units, we’ll be using the sydneybeaches dataset, so be sure to have that on hand!
Lesson Outcomes
By the end of this lesson, you should:
- 1.1 Be able to use
geom_point
,geom_jitter
, andgeom_quasirandom
to plot raw data - 1.2 Be able to use colour to differentiate subsets of data
- 1.3 Be able to use
facet_wrap
to plot subsets of data separately - 1.4 Be able to combine
dpylr
functions likefilter
withggplot
- 1.5 Be able to use
ggsave
to export plots
1.1 Plotting bug levels by year
If the councils that look after our beaches are doing their job, beaches should be less contaminated than they were 5 years ago, right? Let’s plot bug levels over time and see whether things seem to be getting better.
In this screencast, we’ll review:
- How to write cleaned data to a new csv file
- How to plot raw data using
geom_point
,geom_jitter
, andgeom_quasirandom
Here’s the plot for reference:
Watch the video and then carry out the following steps:
- Use
geom_point
to plot buglevels by year - Play with
geom_jitter
andgeom_quasirandom
; which one is your favourite? - Check out the
ggbeeswarm
vignette and try to make your quasirandom plot “smiley” or “frowny” - Try
geom_beeswarm
too!
1.2 Using colour to plot bug levels by site
There doesn’t seem to be an obvious improvement in bug levels in the past 5 years. I wonder whether some beaches are just more variable than others. Let’s use colour to differentiate between different sites.
In this screencast, we’ll review:
- How to plot raw data using
geom_jitter
- How to deal with NAs in your data
- How to use
coord_flip
to make axis labels visible - How to differentiate subsets within your data by changing the point colour
- How to coerce variables into a different data format
Here’s the plot for reference:
Watch the video and then carry out the following steps:
- Use
geom_jitter
to plot bug levels for each site, differentiating between values from 2013-2018 with different coloured points - Try colouring the points by another variable (perhaps council or month). How does the visualisation change?
1.3 Using facet_wrap() to plot sites separately
Using colour is one way to differentiate between data points associated with different variables. Alternatively, you can group the data into different mini-plots using facet_wrap. In this screencast, we’ll review:
- How to use
facet_wrap
to visualise beachbug levels across years, separately for each site.
Here’s the plot for reference:
Watch the video and perform the following steps:
- Use
geom_jitter
andfacet_wrap
to plot buglevels by year, separately for each site - Try plotting buglevels by site, separately for each year. Hint: maybe using
coord_flip
is a good idea!
1.4 Putting it all together (dplyr + ggplot)
What if you only want to compare a couple of sites? Or restrict the range of scores to exclude obvious outliers? You can combine dyplyr functions like filter
with ggplot
using the pipe %>%. (The pipe was covered in #RYouWithMe Unit Clean It Up Lesson 1! Need a refresher? Click here!)
In this screencast, we’ll review:
- How to use combine
filter
andggplot
to plot subsets of your data
Here’s the plot for reference:
Watch the video and perform the following steps:
- Pipe a
filter
function and aggplot
together to plot only values that are less than 1000 - Try changing the filter value to only plot values less than 500
- Use
filter
andggplot
to compare data from Coogee and Bondi - Pick your two favorite beaches and make a plot that compares the variability in beachbug values
1.5 How to get your plots out of R Studio
Of course, these beautiful plots are not much use if they are stuck in R Studio. The easiest way to export your plots and save them elsewhere on your computer is by using ggsave
.
In this screencast, we’ll review:
- How to export your plots using
ggsave
andhere
Watch the video and perform the following steps:
- Add a
ggsave
call after each chunk of code in your script to save your plots to a folder in your RYouWithMe project called “output” - Use the source button to run your code from top to bottom.
- For Sydney-based R-Ladies: Pick your favourite exported plot and post it to the #ryouwithme_3_vizwhiz channel on Slack!
Challenge - plot your own data
The next step is to plot your own data as raw values.
Steps to plotting your own raw data:
- Remember that ggplot only likes “long” data, so if you have observations across several columns, go back to Clean it up Lesson 4 and brush up on how to convert your wide data into long format
- Pick a categorical variable for your x axis
- Pick a continuous variable for your y axis
- Try out
geom_point
,geom_jitter
, orgeom_quasirandom
and see which one makes the most sense for your data - Export using
ggsave
On to Lesson 2
If you are one of our Sydney-based RLadies, share your success (and /or your frustration!) in our Slack channel #ryouwithme_3_vizwhiz!
Looking for more?
Here are some additional resources:
- Paul van der Laken makes a good argument against bar graphs in his blog post here
- Hadley Wickham’s book R for Data Science is a good place to start re data visualisation.
- Nick Tierney and Saskia Freytag were talking data visualisation with Di Cook on the Credibly Curious podcast recently. Check out the episode here.
- Jen Richmond, R-Ladies Sydney co-founder and #RYouWithMe screencaster! (Twitter: @JenRichmondPhD) ^