Content from Before we start
Last updated on 2023-11-21
Overview
Questions
- What is R and RStudio?
- What is a working directory?
- How should files be set up to import into R?
- How can I look for help with R functions?
Objectives
- Explain what R and RStudio are, what they are used for, and how they relate to each other.
- Describe the purpose of the RStudio Script, Console, Environment, and Plots panes.
- Organize files and directories for a set of analyses as an R Project, and understand the purpose of the working directory.
- Use the built-in RStudio help interface to search for more information on R functions.
- Demonstrate how to provide sufficient information for troubleshooting with the R user community.
What is R? What is RStudio?
The term “R” is used to refer to both the programming language and the software that interprets the scripts written using it.
RStudio is a popular way to write R scripts and interact with the R software. To function correctly, RStudio needs R and therefore both need to be installed on your computer.
Why learn R?
R does not involve lots of pointing and clicking, and that’s a good thing
In R, the results of your analysis rely on a series of written commands, and not on remembering a succession of pointing and clicking. That is a good thing! So, if you want to redo your analysis because you collected more data, you don’t have to remember which button you clicked in which order to obtain your results. With a stored series of commands in an R script, you can repeat running them and R will process the new dataset exactly the same way as before.
Working with scripts makes the steps you used in your analysis clear, and the code you write can be inspected by someone else who can give you feedback and spot mistakes.
Working with scripts forces you to have a deeper understanding of what you are doing, and facilitates your learning and comprehension of the methods you use.
R code is great for reproducibility
Reproducibility is when someone else, including your future self, can obtain the same results from the same dataset when using the same analysis.
R integrates with other tools to generate manuscripts from your code. If you collect more data, or fix a mistake in your dataset, the figures and the statistical tests in your manuscript are updated automatically.
R is widely used in academia and in industries such as pharma and biotech. These organisations expect analyses to be reproducible, so knowing R will give you an edge with these requirements.
R is interdisciplinary and extensible
With 10,000+ packages that can be installed to extend its capabilities, R provides a framework that allows you to combine statistical approaches from many scientific disciplines to best suit the analytical framework you need to analyze your data. For instance, R has packages for image analysis, GIS, time series, population genetics, and a lot more.
R works on data of all shapes and sizes
The skills you learn with R scale easily with the size of your dataset. Whether your dataset has hundreds or millions of lines, it won’t make much difference to you.
R is designed for data analysis. It comes with special data structures and data types that make handling of missing data and statistical factors convenient.
R can connect to spreadsheets, databases, and many other data formats, on your computer or on the web.
R produces high-quality graphics
The plotting functionalities in R are endless, and allow you to adjust any aspect of your graph to visualize your data more effectively.
R has a large and welcoming community
Thousands of people use R daily. Many of them are willing to help you through mailing lists and websites such as Stack Overflow, RStudio community, and Slack channels such as the R for Data Science online community (https://www.rfordatasci.com/). In addition, there are numerous online and in person meetups organised globally through organisations such as R Ladies Global (https://rladies.org/).
Knowing your way around RStudio
Let’s start by learning about RStudio, which is an Integrated Development Environment (IDE) for working with R.
The RStudio IDE open-source product is free under the Affero General Public License (AGPL) v3. The RStudio IDE is also available with a commercial license and priority email support from RStudio, PBC.
We will use RStudio IDE to write code, navigate the files on our computer, inspect the variables we are going to create, and visualize the plots we will generate. RStudio can also be used for other things (e.g., version control, developing packages, writing Shiny apps) that we will not cover during the workshop.
RStudio is divided into 4 “panes”:
- The Source for your scripts and documents (top-left, in the default layout)
- Your Environment/History (top-right) which shows all the objects in your working space (Environment) and your command history (History)
- Your Files/Plots/Packages/Help/Viewer (bottom-right)
- The R Console (bottom-left)
The placement of these panes and their content can be customized (see menu, Tools -> Global Options -> Pane Layout). For ease of use, settings such as background color, font color, font size, and zoom level can also be adjusted in this menu (Global Options -> Appearance).
One of the advantages of using RStudio is that all the information you need to write code is available in a single window. Additionally, with many shortcuts, autocompletion, and highlighting for the major file types you use while developing in R, RStudio will make typing easier and less error-prone.
Getting set up
It is good practice to keep a set of related data, analyses, and text self-contained in a single folder, called the working directory. All of the scripts within this folder can then use relative paths to files that indicate where inside the project a file is located (as opposed to absolute paths, which point to where a file is on a specific computer). Working this way allows you to move your project around on your computer and share it with others without worrying about whether or not the underlying scripts will still work.
RStudio provides a helpful set of tools to do this through its “Projects” interface, which not only creates a working directory for you, but also remembers its location (allowing you to quickly navigate to it) and optionally preserves custom settings and reopens files to help you resume work after a break. Go through the steps below to create an “R Project” for this tutorial.
- Start RStudio.
- Under the File menu, click on New Project. Choose New Directory, then New Project.
- Enter a name for this new folder (or “directory”), and choose a convenient location for it. This will be your working directory for the rest of the day (e.g., ~/data-carpentry).
- Click on Create Project.
- Download the code handout, place it in your working directory and rename it (e.g., data-carpentry-script.R).
- (Optional) Set Preferences to ‘Never’ save workspace in RStudio.
A workspace is your current working environment in R, which includes any user-defined objects. By default, all of these objects will be saved, and automatically loaded, when you reopen your project. Saving a workspace to .RData can be cumbersome, especially if you are working with larger datasets, and it can lead to hard-to-debug errors by having objects in memory you forgot you had. Therefore, it is often a good idea to turn this off. To do so, go to Tools -> ‘Global Options’ and select the ‘Never’ option for ‘Save workspace to .RData on exit’.
Organizing your working directory
Using a consistent folder structure across your projects will help keep things organized, and will help you to find/file things in the future. This can be especially helpful when you have multiple projects. In general, you may create directories (folders) for scripts, data, and documents.
- data_raw/ & data/: Use these folders to store raw data and intermediate datasets you may create for the need of a particular analysis. For the sake of transparency and provenance, you should always keep a copy of your raw data accessible and do as much of your data cleanup and preprocessing programmatically (i.e., with scripts, rather than manually) as possible. Separating raw data from processed data is also a good idea. For example, you could have files data_raw/tree_survey.plot1.txt and ...plot2.txt kept separate from a data/tree.survey.csv file generated by the scripts/01.preprocess.tree_survey.R script.
- documents/: This would be a place to keep outlines, drafts, and other text.
- scripts/: This would be the location to keep your R scripts for different analyses or plotting, and potentially a separate folder for your functions (more on that later).
- Additional (sub)directories depending on your project needs.
For this workshop, we will need a data_raw/ folder to store our raw data, and we will use data/ for when we learn how to export data as CSV files, and a fig/ folder for the figures that we will save.
- Under the Files tab on the right of the screen, click on New Folder and create a folder named data_raw within your newly created working directory (e.g., ~/data-carpentry/). (Alternatively, type dir.create("data_raw") at your R console.) Repeat these operations to create a data and a fig folder.
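The same three folders can also be created from the console in one go. The following is a minimal sketch, assuming the working directory is already set to the root of your project; the showWarnings = FALSE argument only keeps dir.create() quiet if a folder already exists.

```r
# Create the three folders used in this workshop inside the working directory.
# showWarnings = FALSE keeps dir.create() quiet if a folder is already there.
dir.create("data_raw", showWarnings = FALSE)
dir.create("data", showWarnings = FALSE)
dir.create("fig", showWarnings = FALSE)

# Confirm that the folders now exist (one TRUE/FALSE per folder)
dir.exists(c("data_raw", "data", "fig"))
```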
We are going to keep the script in the root of our working directory because we are only going to use one file. Later, when you start to create more complex projects, it might make sense to organize scripts in sub-directories.
Your working directory should now look like this:
The working directory
The working directory is an important concept to understand. It is the place where R looks for files and saves them. When you write code for your project, it should refer to files in relation to the root of your working directory and only need files within this structure.
RStudio assists you in this regard and automatically sets the working directory to the directory where you have placed your project. If you need to check it, you can use getwd(). If for some reason your working directory is not what it should be, you can change it in the RStudio interface by navigating in the file browser to where your working directory should be, clicking on the blue gear icon “More”, and selecting “Set As Working Directory”. Alternatively, you can use setwd("/path/to/working/directory") to reset your working directory. However, your scripts should not include this line because it will fail on someone else’s computer.
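As a minimal sketch of what a relative path looks like in practice, the snippet below checks the working directory and builds a path inside it. The file name surveys.csv is only a placeholder chosen for illustration, not a file supplied by this lesson.

```r
# Where is R currently looking for files?
getwd()

# Build a path relative to the working directory.
# "surveys.csv" is a hypothetical file name used only for illustration.
survey_path <- file.path("data_raw", "surveys.csv")
survey_path

# Check whether such a file exists before trying to read it
file.exists(survey_path)
```

Because the path starts from the project root rather than from a location on one particular computer, the same line works unchanged for a collaborator who opens the project elsewhere.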
Interacting with R
The basis of programming is that we write down instructions for the computer to follow, and then we tell the computer to follow those instructions. We write, or code, instructions in R because it is a common language that both the computer and we can understand. We call the instructions commands and we tell the computer to follow the instructions by executing (also called running) those commands.
There are two main ways of interacting with R: by using the console or by using script files (plain text files that contain your code). The console pane (in RStudio, the bottom left panel) is the place where commands written in the R language can be typed and executed immediately by the computer. It is also where the results will be shown for commands that have been executed. You can type commands directly into the console and press Enter to execute those commands, but they will be forgotten when you close the session.
Because we want our code and workflow to be reproducible, it is better to type the commands we want in the script editor, and save the script. This way, there is a complete record of what we did, and anyone (including our future selves!) can easily replicate the results on their computer.
RStudio allows you to execute commands directly from the script editor by using the Ctrl + Enter shortcut (on Macs, Cmd + Return will work, too). The command on the current line in the script (indicated by the cursor) or all of the commands in the currently selected text will be sent to the console and executed when you press Ctrl + Enter. You can find other keyboard shortcuts in this RStudio cheatsheet about the RStudio IDE.
At some point in your analysis you may want to check the content of a variable or the structure of an object, without necessarily keeping a record of it in your script. You can type these commands and execute them directly in the console. RStudio provides the Ctrl + 1 and Ctrl + 2 shortcuts, which allow you to jump between the script and the console panes.
If R is ready to accept commands, the R console shows a > prompt. If it receives a command (by typing, copy-pasting or sent from the script editor using Ctrl + Enter), R will try to execute it, and when ready, will show the results and come back with a new > prompt to wait for new commands.
If R is still waiting for you to finish a command, the console will show a + prompt. It means that you haven’t finished entering a complete command. This is usually because you have not ‘closed’ a parenthesis or quotation, i.e. you don’t have the same number of left-parentheses as right-parentheses, or the same number of opening and closing quotation marks. When this happens, and you thought you had finished typing your command, click inside the console window and press Esc; this will cancel the incomplete command and return you to the > prompt.
Seeking help
Searching function documentation with ? and ??
If you need help with a specific function, let’s say mean(), you can type ?mean or press F1 while your cursor is on the function name. If you are looking for a function to do a particular task, but don’t know the function name, you can use the double question mark ??, for example ??kruskal. Both commands will open matching help files in RStudio’s help panel in the lower right corner. You can also use the help panel to search help directly, as seen in the screenshot.
Automatic code completion
When you write code in RStudio, you can use its automatic code completion to remind yourself of a function’s name or arguments. Start typing the function name and pay attention to the suggestions that pop up. Use the up and down arrows to select a suggested code completion and Tab to apply it. You can also use code completion to complete function argument names, object names, and file names. It even works if you don’t get the spelling 100% correct.
Package vignettes and cheat sheets
In addition to the documentation for individual functions, many packages have vignettes: instructions for how to use the package to do certain tasks. Vignettes are great for learning by example. Vignettes are accessible via the package help and by using the function browseVignettes().
There is also a Help menu at the top of the RStudio window, that has cheat sheets for popular packages, RStudio keyboard shortcuts, and more.
Finding more functions and packages
RStudio’s help only searches the packages that you have installed on your machine, but there are many more available on CRAN and GitHub. To search across all available R packages, you can use the website rdocumentation.org. Often, a generic Google or internet search “R <task>” will send you to the appropriate package documentation or a forum where someone else has already asked your question. Many packages also have websites with additional help, tutorials, news and more (for example tidyverse.org).
Dealing with error messages
Don’t get discouraged if your code doesn’t run immediately! Error messages are common when programming, and fixing errors is part of any programmer’s daily work. Often, the problem is a small typo in a variable name or a missing parenthesis. Watch for the red x’s next to your code in RStudio. These may provide helpful hints about the source of the problem.
If you can’t fix an error yourself, start by googling it. Some error messages are too generic to diagnose a problem (e.g. “subscript out of bounds”). In that case it might help to include the name of the function or package you’re using in your query.
Asking for help
If your Google search is unsuccessful, you may want to ask other R users for help. There are different places where you can ask for help. During this workshop, don’t hesitate to talk to your neighbor, compare your answers, and ask for help. You might also be interested in organizing regular meetings following the workshop to keep learning from each other. If you have a friend or colleague with more experience than you, they might also be able and willing to help you.
Besides that, there are a few places on the internet that provide help:
- Stack Overflow: Many questions have already been answered, but the challenge is to use the right words in your search to find them. If your question hasn’t been answered before and is well crafted, chances are you will get an answer in less than 5 min. Remember to follow their guidelines on how to ask a good question.
- The R-help mailing list: it is used by a lot of people (including most of the R core team). If your question is valid (read its Posting Guide), you are likely to get an answer very fast, but the tone can be pretty dry and it is not always very welcoming to new users.
- If your question is about a specific package rather than a base R function, see if there is a mailing list for the package. Usually it’s included in the DESCRIPTION file of the package that can be accessed using packageDescription("<package-name>").
- You can also try to contact the package author directly, by emailing them or opening an issue on the code repository (e.g., on GitHub).
- There are also some topic-specific mailing lists (GIS, phylogenetics, etc…). The complete list is on the R mailing lists website.
The key to receiving help from someone is for them to rapidly grasp your problem. Thus, you should be as precise as possible when describing your problem and help others to pinpoint where the issue might be. Try to…
- Use the correct words to describe your problem. Otherwise you might get an answer pointing to the misuse of your words rather than answering your question.
- Generalize what you are trying to do, so people outside your field can understand the question.
- Reduce what does not work to a simple reproducible example. For instance, instead of using your real data set, create a small generic one. For more information on how to write a reproducible example see this article from the reprex package. Learning how to use the reprex package is also very helpful for this.
- Include the output of sessionInfo() in your question. It provides information about your platform, the versions of R and the packages that you are using. As an example, here you can see the versions of R and all the packages that we are using to run the code in this lesson:
R
sessionInfo()
OUTPUT
#> R version 4.3.2 (2023-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] RSQLite_2.3.1 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
#> [5] dplyr_1.1.3 purrr_1.0.2 readr_2.1.4 tidyr_1.3.0
#> [9] tibble_3.2.1 ggplot2_3.4.3 tidyverse_2.0.0 knitr_1.43
#>
#> loaded via a namespace (and not attached):
#> [1] bit_4.0.5 gtable_0.3.4 compiler_4.3.2 renv_1.0.3
#> [5] highr_0.10 tidyselect_1.2.0 blob_1.2.4 scales_1.2.1
#> [9] fastmap_1.1.1 yaml_2.3.7 R6_2.5.1 generics_0.1.3
#> [13] munsell_0.5.0 DBI_1.1.3 pillar_1.9.0 tzdb_0.4.0
#> [17] rlang_1.1.1 utf8_1.2.3 cachem_1.0.8 stringi_1.7.12
#> [21] xfun_0.40 bit64_4.0.5 memoise_2.0.1 timechange_0.2.0
#> [25] cli_3.6.1 withr_2.5.0 magrittr_2.0.3 grid_4.3.2
#> [29] rstudioapi_0.15.0 hms_1.1.3 lifecycle_1.0.3 vctrs_0.6.3
#> [33] evaluate_0.21 glue_1.6.2 fansi_1.0.4 colorspace_2.1-0
#> [37] tools_4.3.2 pkgconfig_2.0.3
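The advice above on reducing a problem to a small reproducible example can be sketched as follows. This is only an illustration with made-up values and hypothetical column names (plot_id, weight_g); it is not taken from the lesson’s dataset.

```r
# A small, made-up dataset that stands in for the real one
example_data <- data.frame(
  plot_id = c(1, 2, 3),
  weight_g = c(40, NA, 52)
)

# The single call whose behaviour you want to ask about
mean(example_data$weight_g)

# Version information to paste alongside the question
R.version.string
```

Someone answering the question can copy these few lines, run them, and see the same result without needing access to your files.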
How to learn more after the workshop?
The material we cover during this workshop will give you a taste of how you can use R to analyze data for your own research. However, to do advanced operations such as cleaning your dataset, using statistical methods, or creating beautiful graphics you will need to learn more.
The best way to become proficient and efficient at R, as with any other tool, is to use it to address your actual research questions. As a beginner, it can feel daunting to have to write a script from scratch. Given that many people make their code available online, modifying existing code to suit your purpose might give you your first hands-on experience using R for your own work and help you become comfortable eventually creating your own scripts.
More resources
More about R
- The Introduction to R can also be dense for people with little programming experience but it is a good place to understand the underpinnings of the R language.
- The R FAQ is dense and technical but it is full of useful information.
- To stay up to date, follow #rstats on Twitter. Twitter can also be a way to get questions answered and learn about useful R packages and tips (e.g., @RLangTips).
How to ask good programming questions?
- The rOpenSci community call “How to ask questions so they get answered”, (rOpenSci site and video recording) includes a presentation of the reprex package and of its philosophy.
- blog.Revolutionanalytics.com and this blog post by Jon Skeet have comprehensive advice on how to ask programming questions.
Keypoints
- R is a programming language and RStudio is the IDE that assists in using R.
- There are many benefits to learning R, including writing reproducible code, ability to use a variety of datasets, and a broad, open-source community of practitioners.
- Files related to analysis should be organized within a single working directory.
- R uses commands containing functions to tell the computer what to do.
- Documentation for each function is available within RStudio, or users can ask for help from one of many online forums, cheatsheets, or email lists.
Content from Introduction to R
Last updated on 2023-11-21
Overview
Questions
- How do you create objects in R?
- How do you save R code for later use?
- How do you manipulate data in R?
Objectives
- Define the following terms as they relate to R: object, assign, call, function, arguments, options.
- Create objects and assign values to them in R.
- Learn how to name objects.
- Save a script file for later use.
- Use comments to annotate a script.
- Solve simple arithmetic operations in R.
- Call functions and use arguments to change their default options.
- Inspect the content of vectors and manipulate their content.
- Subset and extract values from vectors.
- Analyze vectors with missing data.
Creating objects in R
You can get output from R simply by typing math in the console:
R
3 + 5
12 / 7
However, to do useful and interesting things, we need to assign values to objects. To create an object, we need to give it a name followed by the assignment operator <-, and the value we want to give it:
R
weight_kg <- 55
<- is the assignment operator we will use in this course. It assigns values on the right to objects on the left. So, after executing x <- 3, the value of x is 3. For historical reasons, you can also use = for assignments, but not in every context. Because of the slight differences in syntax, it is good practice to always use <- for assignments.
In RStudio, typing Alt + - (push Alt at the same time as the - key) will write <- in a single keystroke on a PC, while typing Option + - (push Option at the same time as the - key) does the same on a Mac.
Objects can be given almost any name such as x, current_temperature, or subject_id. Here are some further guidelines on naming objects:
- You want your object names to be explicit and not too long.
- They cannot start with a number (2x is not valid, but x2 is).
- R is case sensitive, so for example, weight_kg is different from Weight_kg.
- There are some names that cannot be used because they are the names of fundamental functions in R (e.g., if, else, for, see here for a complete list). In general, even if it’s allowed, it’s best to not use other function names (e.g., c, T, mean, data, df, weights). If in doubt, check the help to see if the name is already in use.
- It’s best to avoid dots (.) within names. Many function names in R itself have them and dots also have a special meaning (methods) in R and other programming languages. To avoid confusion, don’t include dots in names.
- It is recommended to use nouns for object names and verbs for function names.
- Be consistent in the styling of your code, such as where you put spaces, how you name objects, etc. Styles can include “lower_snake”, “UPPER_SNAKE”, “lowerCamelCase”, “UpperCamelCase”, etc. Using a consistent coding style makes your code clearer to read for your future self and your collaborators. In R, three popular style guides come from Google, Jean Fan and the tidyverse. The tidyverse style is very comprehensive and may seem overwhelming at first. You can install the lintr package to automatically check for issues in the styling of your code.
Objects vs. variables
What are known as objects in R are known as variables in many other programming languages. Depending on the context, object and variable can have drastically different meanings. However, in this lesson, the two words are used synonymously. For more information see: https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects
When assigning a value to an object, R does not print anything. You can force R to print the value by using parentheses or by typing the object name:
R
weight_kg <- 55 # doesn't print anything
(weight_kg <- 55) # but putting parentheses around the call prints the value of `weight_kg`
weight_kg # and so does typing the name of the object
Now that R has weight_kg in memory, we can do arithmetic with it. For instance, we may want to convert this weight into pounds (weight in pounds is 2.2 times the weight in kg):
R
2.2 * weight_kg
We can also change an object’s value by assigning it a new one:
R
weight_kg <- 57.5
2.2 * weight_kg
This means that assigning a value to one object does not change the values of other objects. For example, let’s store the animal’s weight in pounds in a new object, weight_lb:
R
weight_lb <- 2.2 * weight_kg
and then change weight_kg to 100.
R
weight_kg <- 100
What do you think is the current content of the object weight_lb? 126.5 or 220?
Saving your code
Up to now, your code has been in the console. This is useful for quick queries but not so helpful if you want to revisit your work for any reason. A new script can be opened by pressing Ctrl + Shift + N. It is wise to save your script file immediately. To do this press Ctrl + S. This will open a dialogue box where you can decide where to save your script file, and what to name it. The .R file extension is added automatically and ensures your file will open with RStudio.
Don’t forget to save your work periodically by pressing Ctrl + S.
Comments
The comment character in R is #. Anything to the right of a # in a script will be ignored by R. It is useful to leave notes and explanations in your scripts. For convenience, RStudio provides a keyboard shortcut to comment or uncomment a paragraph: after selecting the lines you want to comment, press at the same time on your keyboard Ctrl + Shift + C. If you only want to comment out one line, you can put the cursor at any location of that line (i.e. no need to select the whole line), then press Ctrl + Shift + C.
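As a short illustration of both placements, the sketch below reuses the weight conversion from earlier in this episode, with one comment on its own line and one at the end of a line of code.

```r
# Convert a weight from kilograms to pounds
weight_kg <- 55
weight_lb <- 2.2 * weight_kg  # 2.2 pounds per kilogram
weight_lb
```

R runs the code to the left of each # and ignores everything to the right of it.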
Functions and their arguments
Functions are “canned scripts” that automate more complicated sets of commands including operations, assignments, etc. Many functions are predefined, or can be made available by importing R packages (more on that later). A function usually takes one or more inputs called arguments. Functions often (but not always) return a value. A typical example would be the function sqrt(). The input (the argument) must be a number, and the return value (in fact, the output) is the square root of that number. Executing a function (‘running it’) is called calling the function. An example of a function call is:
R
weight_kg <- sqrt(10)
Here, the value of 10 is given to the sqrt() function, the sqrt() function calculates the square root, and returns the value which is then assigned to the object weight_kg. This function takes one argument; other functions might take several.
The return ‘value’ of a function need not be numerical (like that of sqrt()), and it also does not need to be a single item: it can be a set of things, or even a dataset. We’ll see that when we read data files into R.
Arguments can be anything, not only numbers or filenames, but also other objects. Exactly what each argument means differs per function, and must be looked up in the documentation (see below). Some functions take arguments which may either be specified by the user, or, if left out, take on a default value: these are called options. Options are typically used to alter the way the function operates, such as whether it ignores ‘bad values’, or what symbol to use in a plot. However, if you want something specific, you can specify a value of your choice which will be used instead of the default.
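As an illustration of an option that controls how ‘bad values’ are handled, consider mean(), whose na.rm argument defaults to FALSE. This is a minimal sketch with made-up values and a hypothetical object name; missing data is treated in more detail later in the lesson.

```r
# Made-up weights with one missing value (NA)
weights_with_gap <- c(50, NA, 70)

# Default behaviour: na.rm = FALSE, so the missing value propagates
default_result <- mean(weights_with_gap)
default_result

# Overriding the default: drop missing values before averaging
adjusted_result <- mean(weights_with_gap, na.rm = TRUE)
adjusted_result
```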
Let’s try a function that can take multiple arguments: round().
R
round(3.14159)
OUTPUT
#> [1] 3
Here, we’ve called round() with just one argument, 3.14159, and it has returned the value 3. That’s because the default is to round to the nearest whole number. If we want more digits we can see how to do that by getting information about the round function. We can use args(round) to find what arguments it takes, or look at the help for this function using ?round.
R
args(round)
OUTPUT
#> function (x, digits = 0)
#> NULL
R
?round
We see that if we want a different number of digits, we can type digits = 2 or however many we want.
R
round(3.14159, digits = 2)
OUTPUT
#> [1] 3.14
If you provide the arguments in the exact same order as they are defined you don’t have to name them:
R
round(3.14159, 2)
OUTPUT
#> [1] 3.14
And if you do name the arguments, you can switch their order:
R
round(digits = 2, x = 3.14159)
OUTPUT
#> [1] 3.14
It’s good practice to put the non-optional arguments (like the number you’re rounding) first in your function call, and to then specify the names of all optional arguments. If you don’t, someone reading your code might have to look up the definition of a function with unfamiliar arguments to understand what you’re doing.
Vectors and data types
A vector is the most common and basic data type in R, and is pretty much the workhorse of R. A vector is composed of a series of values, which can be either numbers or characters. We can assign a series of values to a vector using the c() function. For example we can create a vector of animal weights and assign it to a new object weight_g:
R
weight_g <- c(50, 60, 65, 82)
weight_g
A vector can also contain characters:
R
animals <- c("mouse", "rat", "dog")
animals
The quotes around “mouse”, “rat”, etc. are essential here. Without the quotes R will assume objects have been created called mouse, rat and dog. As these objects don’t exist in R’s memory, there will be an error message.
There are many functions that allow you to inspect the content of a vector. length() tells you how many elements are in a particular vector:
R
length(weight_g)
length(animals)
An important feature of a vector is that all of the elements are the same type of data. The function class() indicates what kind of object you are working with:
R
class(weight_g)
class(animals)
The function str() provides an overview of the structure of an object and its elements. It is a useful function when working with large and complex objects:
R
str(weight_g)
str(animals)
You can use the c()
function to add other elements to
your vector:
R
weight_g <- c(weight_g, 90) # add to the end of the vector
weight_g <- c(30, weight_g) # add to the beginning of the vector
weight_g
In the first line, we take the original vector weight_g,
add the value 90 to the end of it, and save the result back
into weight_g. Then we add the value 30 to the
beginning, again saving the result back into weight_g.
We can do this over and over again to grow a vector, or assemble a dataset. As we program, this may be useful to add results that we are collecting or calculating.
An atomic vector is the simplest R data
type and is a linear vector of a single type. Above, we saw 2
of the 6 main atomic vector types that R uses:
"character" and "numeric" (or "double"). These are the basic
building blocks that all R objects are built from. The other 4
atomic vector types are:
- "logical" for TRUE and FALSE (the boolean data type)
- "integer" for integer numbers (e.g., 2L, the L indicates to R that it’s an integer)
- "complex" to represent complex numbers with real and imaginary parts (e.g., 1 + 4i) and that’s all we’re going to say about them
- "raw" for bitstreams that we won’t discuss further
You can check the type of your vector using the typeof()
function and inputting your vector as the argument.
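As a small illustration of typeof(), the sketch below checks four short vectors. The object names and values are made up for this example and are not part of the lesson data.

```r
# Hypothetical example vectors, one per atomic type discussed above
dbl_vec <- c(50, 60, 65, 82)      # plain numbers are stored as doubles
int_vec <- c(2L, 5L, 7L)          # the L suffix requests integers
chr_vec <- c("mouse", "rat")
lgl_vec <- c(TRUE, FALSE)

typeof(dbl_vec)  # "double"
typeof(int_vec)  # "integer"
typeof(chr_vec)  # "character"
typeof(lgl_vec)  # "logical"

# class() reports "numeric" for a double vector, while typeof() reports "double"
class(dbl_vec)
```

The difference between class() and typeof() for numbers is why the text refers to "numeric" (or "double").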
Vectors are one of the many data structures that R
uses. Other important ones are lists (list), matrices
(matrix), data frames (data.frame), factors
(factor) and arrays (array).
R implicitly converts them to all be the same type
Vectors can be of only one data type. R tries to convert (coerce) the content of this vector to find a “common denominator” that doesn’t lose any information.
Only one. There is no memory of past data types, and the coercion
happens the first time the vector is evaluated. Therefore, the
TRUE in num_logical gets converted into a 1 before it gets
converted into "1" in combined_logical.
Challenge (continued)
- You’ve probably noticed that objects of different types get converted into a single, shared type within a vector. In R, we call converting objects from one class into another class coercion. These conversions happen according to a hierarchy, whereby some types get preferentially coerced into other types. Can you draw a diagram that represents the hierarchy of how these data types are coerced?
logical → numeric → character ← logical
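The coercion hierarchy can be tried out directly. In the sketch below the object names num_logical and combined_logical follow the answer text above, but the values assigned to them (and the other two vectors) are assumptions made for illustration.

```r
# Assumed example values; only the object names come from the text above
num_char     <- c(1, 2, 3, "a")        # numeric + character -> character
num_logical  <- c(1, 2, 3, TRUE)       # numeric + logical   -> numeric
char_logical <- c("a", "b", "c", TRUE) # character + logical -> character

class(num_char)      # "character"
class(num_logical)   # "numeric"
class(char_logical)  # "character"

# TRUE was already turned into 1 inside num_logical, so it appears as "1" here,
# while the TRUE inside char_logical appears as "TRUE"
combined_logical <- c(num_logical, char_logical)
combined_logical
```

Only the last element of combined_logical still reads "TRUE", which is the "only one" the answer refers to.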
Subsetting vectors
If we want to extract one or several values from a vector, we must provide one or several indices in square brackets. For instance:
R
animals <- c("mouse", "rat", "dog", "cat")
animals[2]
OUTPUT
#> [1] "rat"
R
animals[c(3, 2)]
OUTPUT
#> [1] "dog" "rat"
We can also repeat the indices to create an object with more elements than the original one:
R
more_animals <- animals[c(1, 2, 3, 2, 1, 4)]
more_animals
OUTPUT
#> [1] "mouse" "rat" "dog" "rat" "mouse" "cat"
R indices start at 1. Programming languages like Fortran, MATLAB, Julia, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.
Conditional subsetting
Another common way of subsetting is by using a logical vector.
TRUE
will select the element with the same index, while
FALSE
will not:
R
weight_g <- c(21, 34, 39, 54, 55)
weight_g[c(TRUE, FALSE, FALSE, TRUE, TRUE)]
OUTPUT
#> [1] 21 54 55
Typically, these logical vectors are not typed by hand, but are the output of other functions or logical tests. For instance, if you wanted to select only the values above 50:
R
weight_g > 50 # will return logicals with TRUE for the indices that meet the condition
OUTPUT
#> [1] FALSE FALSE FALSE TRUE TRUE
R
## so we can use this to select only the values above 50
weight_g[weight_g > 50]
OUTPUT
#> [1] 54 55
You can combine multiple tests using &
(both
conditions are true, AND) or |
(at least one of the
conditions is true, OR):
R
weight_g[weight_g > 30 & weight_g < 50]
OUTPUT
#> [1] 34 39
R
weight_g[weight_g <= 30 | weight_g == 55]
OUTPUT
#> [1] 21 55
R
weight_g[weight_g >= 30 & weight_g == 21]
OUTPUT
#> numeric(0)
Here, > stands for “greater than”, < for “less than”,
<= for “less than or equal to”, >= for “greater than or
equal to”, and == for “equal to”. The double equal sign == is
a test for equality between the left and right hand sides, and
should not be confused with the single = sign, which
performs variable assignment (similar to <-).
A common task is to search for certain strings in a vector. One could
use the “or” operator | to test for equality to multiple
values, but this can quickly become tedious. The %in% operator
tests, for each element of the vector on its left, whether it is found
in the search vector on its right:
R
animals <- c("mouse", "rat", "dog", "cat", "cat")
# return both rat and cat
animals[animals == "cat" | animals == "rat"]
OUTPUT
#> [1] "rat" "cat" "cat"
R
# return a logical vector that is TRUE for the elements within animals
# that are found in the character vector and FALSE for those that are not
animals %in% c("rat", "cat", "dog", "duck", "goat", "bird", "fish")
OUTPUT
#> [1] FALSE TRUE TRUE TRUE TRUE
R
# use the logical vector created by %in% to return elements from animals
# that are found in the character vector
animals[animals %in% c("rat", "cat", "dog", "duck", "goat", "bird", "fish")]
OUTPUT
#> [1] "rat" "dog" "cat" "cat"
When using “>” or “<” on strings, R compares their alphabetical order. Here “four” comes after “five”, and therefore is “greater than” it.
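The comparison mentioned above can be checked in the console. This is a minimal sketch; the vector of number words is made up for the example.

```r
# "four" and "five" share the first letter, so the second letter decides:
# "o" comes after "i" in the alphabet
four_after_five <- "four" > "five"
four_after_five  # TRUE

# sort() uses the same ordering
number_words <- sort(c("four", "five", "six"))
number_words     # "five" "four" "six"
```

Note that the ordering of upper-case versus lower-case letters and of accented characters depends on the locale R is running in, so comparisons on mixed-case strings may differ between machines.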
Missing data
As R was designed to analyze datasets, it includes the concept of
missing data (which is uncommon in other programming languages). Missing
data are represented in vectors as NA
.
When doing operations on numbers, most functions will return
NA
if the data you are working with include missing values.
This feature makes it harder to overlook the cases where you are dealing
with missing data. You can add the argument na.rm = TRUE
to
calculate the result as if the missing values were removed
(rm
stands for ReMoved) first.
R
heights <- c(2, 4, 4, NA, 6)
mean(heights)
max(heights)
mean(heights, na.rm = TRUE)
max(heights, na.rm = TRUE)
If your data include missing values, you may want to become familiar
with the functions is.na()
, na.omit()
, and
complete.cases()
. See below for examples.
R
## Extract those elements which are not missing values.
heights[!is.na(heights)]
## Return the object with incomplete cases removed.
## The result is still an atomic vector of type "double" (class "numeric"),
## with an extra "na.action" attribute recording which elements were dropped.
na.omit(heights)
## Extract those elements which are complete cases.
## The returned object is an atomic vector of type "double" (class "numeric").
heights[complete.cases(heights)]
Recall that you can use the typeof()
function to find
the type of your atomic vector.
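A short sketch of what these functions return, using a copy of the example values above under the hypothetical name heights_demo:

```r
heights_demo <- c(2, 4, 4, NA, 6)  # same values as the heights example above

# Count and locate the missing values
n_missing <- sum(is.na(heights_demo))
which(is.na(heights_demo))         # position 4

# na.omit() keeps the type but attaches bookkeeping about what was removed
cleaned <- na.omit(heights_demo)
typeof(cleaned)                    # "double"
attr(cleaned, "na.action")         # the dropped position

# as.vector() strips that attribute if a plain vector is needed
plain <- as.vector(cleaned)
plain
```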
Challenge
- Using this vector of heights in inches, create a new vector, heights_no_na, with the NAs removed.
R
heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65)
- Use the function median() to calculate the median of the heights vector.
- Use R to figure out how many people in the set are taller than 67 inches.
R
heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65)
# 1.
heights_no_na <- heights[!is.na(heights)]
# or
heights_no_na <- na.omit(heights)
# or
heights_no_na <- heights[complete.cases(heights)]
# 2.
median(heights, na.rm = TRUE)
# 3.
heights_above_67 <- heights_no_na[heights_no_na > 67]
length(heights_above_67)
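As an alternative to the third part of the solution, one possible shortcut is to count the TRUE values of the logical test directly. This sketch relies on TRUE being counted as 1 by sum(); the object name n_taller is made up here.

```r
heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65)

# heights > 67 is a logical vector (NA where the height is missing);
# na.rm = TRUE drops those NAs before the TRUE values are summed
n_taller <- sum(heights > 67, na.rm = TRUE)
n_taller  # 6
```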
Now that we have learned how to write scripts, and the basics of R’s data structures, we are ready to start working with the Portal dataset we have been using in the other lessons, and learn about data frames.
Keypoints
- <- is used to assign values on the right to objects on the left.
- Code should be saved within the Source pane in RStudio to help you return to your code later.
- ‘#’ can be used to add comments to your code.
- Functions can automate more complicated sets of commands, and require arguments as inputs.
- Vectors are composed of a series of values and can take many forms.
- Data structures in R include ‘vector’, ‘list’, ‘matrix’, ‘data.frame’, ‘factor’, and ‘array’.
- Vectors can be subset by indexing or through logical vectors.
- Many functions exist to remove missing data from data structures.
Content from Starting with data
Last updated on 2023-11-21
Overview
Questions
- What is a data.frame?
- How can I read a complete csv file into R?
- How can I get basic summary information about my dataset?
- How can I extract specific information from a data frame?
- What are factors, and how are they different from other datatypes?
- How can I rename factors?
- How are dates represented in R and how can I change the format?
Objectives
- Load external data from a .csv file into a data frame.
- Install and load packages.
- Describe what a data frame is.
- Summarize the contents of a data frame.
- Use indexing to subset specific portions of data frames.
- Describe what a factor is.
- Convert between strings and factors.
- Reorder and rename factors.
- Change how character strings are handled in a data frame.
- Format dates.
Loading the survey data
We are investigating the animal species diversity and weights found within plots at our study site. The dataset is stored as a comma separated value (CSV) file. Each row holds information for a single animal, and the columns represent:
Column | Description |
---|---|
record_id | Unique id for the observation |
month | month of observation |
day | day of observation |
year | year of observation |
plot_id | ID of a particular experimental plot of land |
species_id | 2-letter code |
sex | sex of animal (“M”, “F”) |
hindfoot_length | length of the hindfoot in mm |
weight | weight of the animal in grams |
genus | genus of animal |
species | species of animal |
taxa | e.g. Rodent, Reptile, Bird, Rabbit |
plot_type | type of plot |
Downloading the data
We created the folder that will store the downloaded data
(data_raw
) in the chapter “Before
we start”. If you skipped that part, it may be a good idea to have a
look now, to make sure your working directory is set up properly.
We are going to use the R function download.file()
to
download the CSV file that contains the survey data from Figshare, and
we will use read_csv()
to load the content of the CSV file
into R.
Inside the download.file
command, the first entry is a
character string with the source URL (“https://ndownloader.figshare.com/files/2292169”). This
source URL downloads a CSV file from figshare. The text after the comma
(“data_raw/portal_data_joined.csv”) is the destination of the file on
your local machine. You’ll need to have a folder on your machine called
“data_raw” where you’ll download the file. So this command downloads a
file from Figshare, names it “portal_data_joined.csv” and adds it to a
preexisting folder named “data_raw”.
R
download.file(url = "https://ndownloader.figshare.com/files/2292169",
destfile = "data_raw/portal_data_joined.csv")
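Because download.file() cannot write into a folder that does not exist, it can help to create the folder from R first. The sketch below is only an illustration: it builds the folder inside a temporary directory (an assumption made so the example does not touch your project), whereas in the lesson the folder is simply "data_raw" in your working directory.

```r
# Stand-in for the lesson's "data_raw" folder, placed in a temporary
# directory for this sketch (hypothetical location)
data_dir <- file.path(tempdir(), "data_raw")

# Create the folder only if it is missing
if (!dir.exists(data_dir)) {
  dir.create(data_dir, recursive = TRUE)
}

# Build the destination path that would be passed as destfile
dest_file <- file.path(data_dir, "portal_data_joined.csv")
dir.exists(data_dir)
```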
Reading the data into R
The file has now been downloaded to the destination you specified,
but R has not yet loaded the data from the file into memory. To do this,
we can use the read_csv()
function from the
tidyverse
package.
Packages in R are basically sets of additional functions that let you
do more stuff. The functions we’ve been using so far, like
round(), sqrt(), or c(), come
built into R. Packages give you access to additional functions beyond
base R. A similar function to read_csv() from the tidyverse
package is read.csv() from base R. We don’t have time to
cover their differences, but notice that the exact spelling determines
which function is used. Before you use a package for the first time you
need to install it on your machine, and then you need to load it in
every subsequent R session in which you use it.
To install the tidyverse
package, we
can type install.packages("tidyverse")
straight into the
console. In fact, it’s better to write this in the console than in our
script for any package, as there’s no need to re-install packages every
time we run the script. Then, to load the package type:
R
## load the tidyverse packages, incl. dplyr
library(tidyverse)
Now we can use the functions from the
tidyverse
package. Let’s use
read_csv()
to read the data into a data frame (we will
learn more about data frames later):
R
surveys <- read_csv("data_raw/portal_data_joined.csv")
OUTPUT
#> Rows: 34786 Columns: 13
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (6): species_id, sex, genus, species, taxa, plot_type
#> dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
When you execute read_csv() on a data file, it looks
through the first 1000 rows of each column and guesses its data type.
For example, in this dataset, read_csv() reads
weight as col_double (a numeric data type),
and species as col_character. You have the
option to specify the data type for a column manually by using the
col_types argument in read_csv().
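A minimal sketch of the col_types argument, assuming the readr package (installed with the tidyverse) is available. The tiny CSV file and its values are made up for this example and are written to a temporary file so that the lesson data are not touched.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file (hypothetical data)
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("species_id,plot_id,weight",
             "NL,2,218",
             "DM,3,40"), tmp_csv)

# Ask for plot_id as an integer instead of the default guess (double)
demo <- read_csv(
  tmp_csv,
  col_types = cols(
    species_id = col_character(),
    plot_id = col_integer(),
    weight = col_double()
  )
)

class(demo$plot_id)  # "integer"
```

Spelling out the column types also silences the column specification message, since nothing is left to guess.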
Note
read_csv() assumes that fields are delimited by commas.
However, in several countries, the comma is used as a decimal separator
and the semicolon (;) is used as a field delimiter. If you want to read
this type of file into R, you can use the read_csv2()
function. It behaves like read_csv() but uses different
parameters for the decimal and the field separators. There is also
read_tsv() for tab-separated data files and
read_delim() for less common formats. Check out the help
for read_csv() by typing ?read_csv to learn
more.
In addition to the above versions of the CSV format, you should develop the habit of looking at and recording some parameters of your CSV files. For instance, the character encoding, the control characters used for line endings, the date format (if the date is not split into three variables), and the presence of unexpected newlines are important characteristics of your data files. Knowing these parameters will ease the import of your data into R.
We can see the contents of the first few lines of the data by typing
its name: surveys. By default, this will show you as many
rows and columns of the data as fit on your screen. If you wanted the
first 50 rows, you could type print(surveys, n = 50).
We can also extract the first few lines of this data using the
function head()
:
R
head(surveys)
OUTPUT
#> # A tibble: 6 × 13
#> record_id month day year plot_id species_id sex hindfoot_length weight
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 7 16 1977 2 NL M 32 NA
#> 2 72 8 19 1977 2 NL M 31 NA
#> 3 224 9 13 1977 2 NL <NA> NA NA
#> 4 266 10 16 1977 2 NL <NA> NA NA
#> 5 349 11 12 1977 2 NL <NA> NA NA
#> 6 363 11 12 1977 2 NL <NA> NA NA
#> # ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>
Unlike the print()
function, head()
returns
the extracted data. You could use it to assign the first 100 rows of
surveys
to an object using
surveys_sample <- head(surveys, 100)
. This can be useful
if you want to try out complex computations on a subset of your data
before you apply them to the whole data set. There is a similar function
that lets you extract the last few lines of the data set. It is called
(you might have guessed it) tail()
.
To open the dataset in RStudio’s Data Viewer, use the
view()
function:
R
view(surveys)
What are data frames?
When we loaded the data into R, it got stored as an object of class
tibble
, which is a special kind of data frame (the
difference is not important for our purposes, but you can learn more
about tibbles here). Data
frames are the de facto data structure for most tabular data,
and what we use for statistics and plotting. Data frames can be created
by hand, but most commonly they are generated by functions like
read_csv()
; in other words, when importing spreadsheets
from your hard drive or the web.
A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors). For example, here is a figure depicting a data frame comprising a numeric, a character, and a logical vector.
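A data frame of that kind can be built by hand. The sketch below uses made-up column names and values, and sets stringsAsFactors = FALSE explicitly so that the character column stays a character vector in older versions of R as well.

```r
# Hypothetical data frame with one numeric, one character and one logical column
example_df <- data.frame(
  id = c(1, 2, 3),
  species = c("mouse", "rat", "dog"),
  captured = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Each column is a vector with a single type, and all columns have the same length
column_classes <- sapply(example_df, class)
column_classes
```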
We can see this also when inspecting the structure of a data
frame with the function str()
:
R
str(surveys)
Inspecting data frames
We already saw how the functions head()
and
str()
can be useful to check the content and the structure
of a data frame. Here is a non-exhaustive list of functions to get a
sense of the content/structure of the data. Let’s try them out!
- Size:
  - dim(surveys) - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
  - nrow(surveys) - returns the number of rows
  - ncol(surveys) - returns the number of columns
- Content:
  - head(surveys) - shows the first 6 rows
  - tail(surveys) - shows the last 6 rows
- Names:
  - names(surveys) - returns the column names (synonym of colnames() for data.frame objects)
  - rownames(surveys) - returns the row names
- Summary:
  - str(surveys) - structure of the object and information about the class, length and content of each column
  - summary(surveys) - summary statistics for each column
Note: most of these functions are “generic”; they can be used on
other types of objects besides data.frame.
R
str(surveys)
OUTPUT
#> spc_tbl_ [34,786 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ record_id : num [1:34786] 1 72 224 266 349 363 435 506 588 661 ...
#> $ month : num [1:34786] 7 8 9 10 11 11 12 1 2 3 ...
#> $ day : num [1:34786] 16 19 13 16 12 12 10 8 18 11 ...
#> $ year : num [1:34786] 1977 1977 1977 1977 1977 ...
#> $ plot_id : num [1:34786] 2 2 2 2 2 2 2 2 2 2 ...
#> $ species_id : chr [1:34786] "NL" "NL" "NL" "NL" ...
#> $ sex : chr [1:34786] "M" "M" NA NA ...
#> $ hindfoot_length: num [1:34786] 32 31 NA NA NA NA NA NA NA NA ...
#> $ weight : num [1:34786] NA NA NA NA NA NA NA NA 218 NA ...
#> $ genus : chr [1:34786] "Neotoma" "Neotoma" "Neotoma" "Neotoma" ...
#> $ species : chr [1:34786] "albigula" "albigula" "albigula" "albigula" ...
#> $ taxa : chr [1:34786] "Rodent" "Rodent" "Rodent" "Rodent" ...
#> $ plot_type : chr [1:34786] "Control" "Control" "Control" "Control" ...
#> - attr(*, "spec")=
#> .. cols(
#> .. record_id = col_double(),
#> .. month = col_double(),
#> .. day = col_double(),
#> .. year = col_double(),
#> .. plot_id = col_double(),
#> .. species_id = col_character(),
#> .. sex = col_character(),
#> .. hindfoot_length = col_double(),
#> .. weight = col_double(),
#> .. genus = col_character(),
#> .. species = col_character(),
#> .. taxa = col_character(),
#> .. plot_type = col_character()
#> .. )
#> - attr(*, "problems")=<externalptr>
R
## * class: data frame
## * how many rows: 34786, how many columns: 13
Indexing and subsetting data frames
Our survey data frame has rows and columns (it has 2 dimensions). If we want to extract some specific data from it, we need to specify the “coordinates” we want. Row numbers come first, followed by column numbers. However, note that different ways of specifying these coordinates lead to results with different classes.
R
# We can extract specific values by specifying row and column indices
# in the format:
# data_frame[row_index, column_index]
# For instance, to extract the first row and column from surveys:
surveys[1, 1]
# First row, sixth column:
surveys[1, 6]
# We can also use shortcuts to select a number of rows or columns at once
# To select all columns, leave the column index blank
# For instance, to select all columns for the first row:
surveys[1, ]
# The same shortcut works for rows --
# To select the first column across all rows:
surveys[, 1]
# An even shorter way to select first column across all rows:
surveys[1] # No comma!
# To select multiple rows or columns, use vectors!
# To select the first three rows of the 5th and 6th column
surveys[c(1, 2, 3), c(5, 6)]
# We can use the : operator to create those vectors for us:
surveys[1:3, 5:6]
# This is equivalent to head_surveys <- head(surveys)
head_surveys <- surveys[1:6, ]
# As we've seen, when working with tibbles
# subsetting with single square brackets ("[]") always returns a data frame.
# If you want a vector, use double square brackets ("[[]]")
# For instance, to get the first column as a vector:
surveys[[1]]
# To get the first value in our data frame:
surveys[[1, 1]]
: is a special operator that creates numeric vectors of
integers in increasing or decreasing order; try 1:10 and
10:1, for instance.
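A quick sketch of the : operator on its own, with made-up object names. The last two lines show why parentheses matter when a sequence is combined with a minus sign.

```r
increasing <- 1:10
decreasing <- 10:1
class(increasing)   # "integer"

# A unary minus binds more tightly than the colon
drop_idx  <- -(2:4) # -2 -3 -4: suitable for excluding positions 2 to 4
not_a_drop <- -2:4  # -2 -1 0 1 2 3 4: a sequence from -2 up to 4
```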
You can also exclude certain indices of a data frame using the
“-” sign:
R
surveys[, -1] # The whole data frame, except the first column
surveys[-(7:nrow(surveys)), ] # Equivalent to head(surveys)
Data frames can be subset by calling indices (as shown previously), but also by calling their column names directly:
R
# As before, using single brackets returns a data frame:
surveys["species_id"]
surveys[, "species_id"]
# Double brackets returns a vector:
surveys[["species_id"]]
# We can also use the $ operator with column names instead of double brackets
# This returns a vector:
surveys$species_id
In RStudio, you can use the autocompletion feature to get the full and correct names of the columns.
Challenge
1. Create a data.frame (surveys_200) containing only the data in row 200 of the surveys dataset.
2. Notice how nrow() gave you the number of rows in a data.frame?
   - Use that number to pull out just that last row from the surveys dataset.
   - Compare that with what you see as the last row using tail() to make sure it’s meeting expectations.
   - Pull out that last row using nrow() instead of the row number.
   - Create a new data frame (surveys_last) from that last row.
3. Use nrow() to extract the row that is in the middle of the data frame. Store the content of this row in an object named surveys_middle.
4. Combine nrow() with the - notation above to reproduce the behavior of head(surveys), keeping just the first through 6th rows of the surveys dataset.
R
## 1.
surveys_200 <- surveys[200, ]
## 2.
# Saving `n_rows` to improve readability and reduce duplication
n_rows <- nrow(surveys)
surveys_last <- surveys[n_rows, ]
## 3.
surveys_middle <- surveys[n_rows / 2, ]
## 4.
surveys_head <- surveys[-(7:n_rows), ]
Factors
When we did str(surveys) we saw that several of the
columns consist of numbers. The columns genus,
species, sex, plot_type, …
however, are of the class character. Arguably, these
columns contain categorical data, that is, they can only take on a
limited number of values.
R has a special class for working with categorical data, called
factor
. Factors are very useful and actually contribute to
making R particularly well suited to working with data. So we are going
to spend a little time introducing them.
Once created, factors can only contain a pre-defined set of values, known as levels. Factors are stored as integers associated with labels and they can be ordered or unordered. While factors look (and often behave) like character vectors, they are actually treated as integer vectors by R. So you need to be very careful when treating them as strings.
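The integer storage behind a factor can be inspected directly. This is a minimal sketch with a made-up factor; the name size_fct and its values are assumptions for illustration.

```r
# Hypothetical factor with three levels
size_fct <- factor(c("low", "high", "medium", "low"))

levels(size_fct)                     # "high" "low" "medium" (sorted alphabetically)
typeof(size_fct)                     # "integer"

# The integers point into the levels: "low" is level 2, "high" is level 1, ...
size_codes <- as.integer(size_fct)
size_codes                           # 2 1 3 2

# The labels are recovered with as.character()
size_labels <- as.character(size_fct)
size_labels
```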
When importing a data frame with read_csv()
, the columns
that contain text are not automatically coerced (=converted) into the
factor
data type, but once we have loaded the data we can
do the conversion using the factor()
function:
R
surveys$sex <- factor(surveys$sex)
We can see that the conversion has worked by using the
summary()
function again. This produces a table with the
counts for each factor level:
R
summary(surveys$sex)
By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:
R
sex <- factor(c("male", "female", "female", "male"))
R will assign 1
to the level "female"
and
2
to the level "male"
(because f
comes before m
, even though the first element in this
vector is "male"
). You can see this by using the function
levels()
and you can find the number of levels using
nlevels()
:
R
levels(sex)
nlevels(sex)
Sometimes the order of the levels does not matter. Other times you
might want to specify the order because it is meaningful (e.g., “low”,
“medium”, “high”), it improves your visualization, or it is required by
a particular type of analysis. Here, one way to reorder our levels in
the sex vector would be:
R
sex # current order
OUTPUT
#> [1] male female female male
#> Levels: female male
R
sex <- factor(sex, levels = c("male", "female"))
sex # after re-ordering
OUTPUT
#> [1] male female female male
#> Levels: male female
In R’s memory, these factors are represented by integers (1, 2, 3),
but are more informative than integers because factors are
self-describing: "female", "male" is more
descriptive than 1, 2. Which one is “male”?
You wouldn’t be able to tell just from the integer data. Factors, on the
other hand, have this information built in. It is particularly helpful
when there are many levels (like the species names in our example
dataset).
R
surveys$taxa <- factor(surveys$taxa)
surveys$genus <- factor(surveys$genus)
summary(surveys)
nlevels(surveys$genus)
## * how many genera: There are 26 unique genera in the `genus` column.
## * how many rabbits: There are 75 rabbits in the `taxa` column.
Converting factors
If you need to convert a factor to a character vector, you use
as.character(x)
.
R
as.character(sex)
In some cases, you may have to convert factors where the levels
appear as numbers (such as concentration levels or years) to a numeric
vector. For instance, in one part of your analysis the years might need
to be encoded as factors (e.g., comparing average weights across years)
but in another part of your analysis they may need to be stored as
numeric values (e.g., doing math operations on the years). This
conversion from factor to numeric is a little trickier. The
as.numeric()
function returns the index values of the
factor, not its levels, so it will result in an entirely new (and
unwanted in this case) set of numbers. One method to avoid this is to
convert factors to characters, and then to numbers.
Another method is to use the levels()
function.
Compare:
R
year_fct <- factor(c(1990, 1983, 1977, 1998, 1990))
as.numeric(year_fct) # Wrong! And there is no warning...
as.numeric(as.character(year_fct)) # Works...
as.numeric(levels(year_fct))[year_fct] # The recommended way.
Notice that in the levels()
approach, three important
steps occur:
- We obtain all the factor levels using
levels(year_fct)
- We convert these levels to numeric values using
as.numeric(levels(year_fct))
- We then access these numeric values using the underlying integers of
the vector
year_fct
inside the square brackets
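The three steps can be run one at a time to see the intermediate values. The intermediate object names below (year_levels, year_levels_num, year_codes, year_num) are made up for this sketch.

```r
year_fct <- factor(c(1990, 1983, 1977, 1998, 1990))

# Step 1: all the factor levels, as character strings
year_levels <- levels(year_fct)            # "1977" "1983" "1990" "1998"

# Step 2: the levels converted to numbers
year_levels_num <- as.numeric(year_levels) # 1977 1983 1990 1998

# Step 3: the underlying integer codes select the matching level for each element
year_codes <- as.integer(year_fct)         # 3 2 1 4 3
year_num <- year_levels_num[year_codes]
year_num                                   # 1990 1983 1977 1998 1990
```

Indexing with year_fct itself, as in the one-line version, gives the same result because a factor used as an index is treated as its integer codes.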
Renaming factors
When your data is stored as a factor, you can use the
plot()
function to get a quick glance at the number of
observations represented by each factor level. Let’s look at the number
of males and females captured over the course of the experiment:
R
## bar plot of the number of females and males captured during the experiment:
plot(surveys$sex)
However, as we saw when we used summary(surveys$sex)
,
there are about 1700 individuals for which the sex information hasn’t
been recorded. To show them in the plot, we can turn the missing values
into a factor level with the addNA()
function. We will also
have to give the new factor level a label. We are going to work with a
copy of the sex
column, so we’re not modifying the working
copy of the data frame:
R
sex <- surveys$sex
levels(sex)
OUTPUT
#> [1] "F" "M"
R
sex <- addNA(sex)
levels(sex)
OUTPUT
#> [1] "F" "M" NA
R
head(sex)
OUTPUT
#> [1] M M <NA> <NA> <NA> <NA>
#> Levels: F M <NA>
R
levels(sex)[3] <- "undetermined"
levels(sex)
OUTPUT
#> [1] "F" "M" "undetermined"
R
head(sex)
OUTPUT
#> [1] M M undetermined undetermined undetermined
#> [6] undetermined
#> Levels: F M undetermined
Now we can plot the data again, using plot(sex)
.
R
levels(sex)[1:2] <- c("female", "male")
sex <- factor(sex, levels = c("undetermined", "female", "male"))
plot(sex)
Challenge
- We have seen how data frames are created when using read_csv(), but they can also be created by hand with the data.frame() function. There are a few mistakes in this hand-crafted data.frame. Can you spot and fix them? Don’t hesitate to experiment!
R
animal_data <- data.frame(
  animal = c(dog, cat, sea cucumber, sea urchin),
  feel = c("furry", "squishy", "spiny"),
  weight = c(45, 8 1.1, 0.8)
)
- Can you predict the class for each of the columns in the following example? Check your guesses using str(country_climate):
  - Are they what you expected? Why? Why not?
  - What would you need to change to ensure that each column had the accurate data type?
R
country_climate <- data.frame(
country = c("Canada", "Panama", "South Africa", "Australia"),
climate = c("cold", "hot", "temperate", "hot/temperate"),
temperature = c(10, 30, 18, "15"),
northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"),
has_kangaroo = c(FALSE, FALSE, FALSE, 1)
)
The automatic conversion of data type is sometimes a blessing, sometimes an annoyance. Be aware that it exists, learn the rules, and double check that data you import in R are of the correct type within your data frame. If not, use it to your advantage to detect mistakes that might have been introduced during data entry (for instance, a letter in a column that should only contain numbers).
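As a small sketch of using coercion to spot data-entry mistakes, the vectors below are made up: one holds a number typed with quotes, the other contains a capital letter O where a zero was intended.

```r
# A single quoted value turns the whole vector into character
temperature <- c(10, 30, 18, "15")
class(temperature)                   # "character"
temperature_num <- as.numeric(temperature)

# A stray letter cannot be converted and becomes NA (R also issues a warning,
# suppressed here), which makes the faulty entry easy to locate
entered <- c("12", "7", "1O", "3")   # "1O" contains the letter O, not a zero
parsed <- suppressWarnings(as.numeric(entered))
bad_positions <- which(is.na(parsed))
bad_positions                        # 3
```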
Learn more in this RStudio tutorial
Formatting dates
A common issue that new (and experienced!) R users have is converting
date and time information into a variable that is suitable for analyses.
One way to store date information is to store each component of the date
in a separate column. Using str(), we can confirm that our
data frame does indeed have a separate column for day, month, and year,
and that each of these columns contains numeric values.
R
str(surveys)
We are going to use the ymd() function from the package
lubridate (which belongs to the tidyverse; learn more here).
lubridate gets installed as part of the tidyverse installation.
When you load the tidyverse (library(tidyverse)), the core packages
(the packages used in most data analyses) get loaded.
In tidyverse versions before 2.0.0, lubridate does not belong to the
core tidyverse, so you have to load it explicitly with
library(lubridate). From tidyverse 2.0.0 onwards lubridate is a core
package, and loading it again explicitly is harmless.
Start by loading the required package:
R
library(lubridate)
The lubridate
package has many useful
functions for working with dates. These can help you extract dates from
different string representations, convert between timezones, calculate
time differences and more. You can find an overview of them in the lubridate
cheat sheet.
Here we will use the function ymd(), which takes a
vector representing year, month, and day, and converts it to a
Date vector. Date is a class of data
recognized by R as being a date and can be manipulated as such. The
function is flexible about the format of its argument, but, as a best
practice, pass it a character vector formatted as “YYYY-MM-DD”.
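To illustrate that flexibility, the sketch below passes the same made-up date to ymd() in three spellings, assuming the lubridate package is installed.

```r
library(lubridate)

# Three spellings of the same (made-up) date
same_day <- ymd(c("2015-01-01", "2015/1/1", "20150101"))
same_day

class(same_day)  # "Date"
```

All three strings are parsed as the first of January 2015, but sticking to “YYYY-MM-DD” keeps the input unambiguous for human readers too.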
Let’s create a date object and inspect the structure:
R
my_date <- ymd("2015-01-01")
str(my_date)
Now let’s paste the year, month, and day separately - we get the same result:
R
# sep indicates the character to use to separate each component
my_date <- ymd(paste("2015", "1", "1", sep = "-"))
str(my_date)
Now we apply this function to the surveys dataset. Create a character
vector from the year
, month
, and
day
columns of surveys
using
paste()
:
R
paste(surveys$year, surveys$month, surveys$day, sep = "-")
This character vector can be used as the argument for
ymd()
:
R
ymd(paste(surveys$year, surveys$month, surveys$day, sep = "-"))
WARNING
#> Warning: 129 failed to parse.
There is a warning telling us that some dates could not be parsed
(understood) by the ymd()
function. For these dates, the
function has returned NA
, which means they are treated as
missing values. We will deal with this problem later, but first we add
the resulting Date
vector to the surveys
data
frame as a new column called date
:
R
surveys$date <- ymd(paste(surveys$year, surveys$month, surveys$day, sep = "-"))
WARNING
#> Warning: 129 failed to parse.
R
str(surveys) # notice the new column, with 'date' as the class
Let’s make sure everything worked correctly. One way to inspect the
new column is to use summary()
:
R
summary(surveys$date)
OUTPUT
#>         Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
#> "1977-07-16" "1984-03-12" "1990-07-22" "1990-12-15" "1997-07-29" "2002-12-31"
#>         NA's
#>        "129"
Let’s investigate why some dates could not be parsed.
We can use the functions we saw previously to deal with missing data
to identify the rows in our data frame that are failing. If we combine
them with what we learned about subsetting data frames earlier, we can
extract the columns “year”, “month”, and “day” from the records that have
NA
in our new column date
. We will also use
head()
so we don’t clutter the output:
R
missing_dates <- surveys[is.na(surveys$date), c("year", "month", "day")]
head(missing_dates)
OUTPUT
#> # A tibble: 6 × 3
#>    year month   day
#>   <dbl> <dbl> <dbl>
#> 1  2000     9    31
#> 2  2000     4    31
#> 3  2000     4    31
#> 4  2000     4    31
#> 5  2000     4    31
#> 6  2000     9    31
Why did these dates fail to parse? If you had to use these data for your analyses, how would you deal with this situation?
These dates failed to parse because the dates provided as input to the ymd() function do not actually exist. In the output above, the day is given as 31 for September and April, but both months have only 30 days.
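A minimal sketch of this behavior, using two made-up date strings rather than the surveys data: a valid date parses normally, while a day that does not exist in the calendar comes back as NA.

```r
library(lubridate)

# September has 30 days, so the second string is not a real calendar date
candidates <- c("2000-09-30", "2000-09-31")

# suppressWarnings() hides the "failed to parse" warning for this demonstration
parsed <- suppressWarnings(ymd(candidates))

parsed
is.na(parsed)
```

The is.na() result is the same kind of logical vector we used above to pull the failing rows out of surveys.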
There are several ways you could deal with this situation:
- If you have access to the raw data (e.g., field sheets) or supporting information (e.g., field trip reports/logs), check them and ensure the electronic database matches the information in the original data source.
- If you are able to contact the person responsible for collecting the data, you could refer to them and ask for clarification.
- You could also check the rest of the dataset for clues about the correct value for the erroneous dates.
- If your project has guidelines on how to correct this sort of error, refer to them and apply any recommendations.
- If it is not possible to ascertain the correct value for these observations, you may want to leave them as missing data.
Regardless of the option you choose, it is important that you document the error and the corrections (if any) that you apply to your data.
Keypoints
- Use read.csv to read tabular data in R.
- A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length.
- dplyr provides many methods for inspecting and summarizing data in data frames.
- Use factors to represent categorical data in R.
- The lubridate package has many useful functions for working with dates.
Content from Manipulating, analyzing and exporting data with tidyverse
Last updated on 2023-11-21
Overview
Questions
- What are dplyr and tidyr?
- How can I select specific rows and/or columns from a dataframe?
- How can I combine multiple commands into a single command?
- How can I create new columns or remove existing columns from a dataframe?
Objectives
- Describe the purpose of the dplyr and tidyr packages.
- Select certain columns in a data frame with the dplyr function select.
- Extract certain rows in a data frame according to logical (boolean) conditions with the dplyr function filter.
- Link the output of one dplyr function to the input of another function with the ‘pipe’ operator %>%.
- Add new columns to a data frame that are functions of existing columns with mutate.
- Use the split-apply-combine concept for data analysis.
- Use summarize, group_by, and count to split a data frame into groups of observations, apply summary statistics for each group, and then combine the results.
- Describe the concept of a wide and a long table format and for which purpose those formats are useful.
- Describe what key-value pairs are.
- Reshape a data frame from long to wide format and back with the pivot_wider and pivot_longer commands from the tidyr package.
- Export a data frame to a .csv file.
Data manipulation using dplyr
and
tidyr
Bracket subsetting is handy, but it can be cumbersome and difficult
to read, especially for complicated operations. Enter
dplyr
. dplyr
is a package for helping with tabular data manipulation. It pairs nicely
with tidyr
which enables you to swiftly
convert between different data formats for plotting and analysis.
The tidyverse
package is an
“umbrella-package” that installs tidyr
,
dplyr
, and several other useful packages
for data analysis, such as ggplot2
,
tibble
, etc.
The tidyverse
package tries to address
3 common issues that arise when doing data analysis in R:
- The results from a base R function sometimes depend on the type of data.
- R expressions are used in a non-standard way, which can be confusing for new learners.
- The existence of hidden arguments having default operations that new learners are not aware of.
You should already have installed and loaded the
tidyverse
package. If you haven’t already
done so, you can type install.packages("tidyverse")
straight into the console. Then, type library(tidyverse)
to
load the package.
What are dplyr
and
tidyr
?
The package dplyr
provides helper tools
for the most common data manipulation tasks. It is built to work
directly with data frames, with many common tasks optimized by being
written in a compiled language (C++). An additional feature is the
ability to work directly with data stored in an external database. The
benefits of doing this are that the data can be managed natively in a
relational database, queries can be conducted on that database, and only
the results of the query are returned.
This addresses a common problem with R in that all operations are conducted in-memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can connect to a database of many hundreds of GB, conduct queries on it directly, and pull back into R only what you need for analysis.
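The database workflow described above can be sketched as follows. This is only an illustration under stated assumptions: it assumes the DBI, RSQLite, and dbplyr packages are installed, and it uses a tiny in-memory SQLite table named surveys_toy (a made-up name with made-up values) as a stand-in for a large external database.

```r
library(dplyr)

# Assumption: the DBI, RSQLite, and dbplyr packages are installed.
# An in-memory SQLite database stands in for a large external database.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Toy table with made-up values (hypothetical name: surveys_toy)
toy <- data.frame(species_id = c("NL", "DM", "DM", "PF"),
                  weight = c(180, 40, 44, 8))
DBI::dbWriteTable(con, "surveys_toy", toy)

# tbl() creates a reference to the remote table; no rows are loaded yet
remote <- tbl(con, "surveys_toy")

# The filter is translated to SQL and run by the database;
# collect() brings only the matching rows back into R
heavy <- remote %>%
  filter(weight > 30) %>%
  collect()

DBI::dbDisconnect(con)

heavy
```

Only the rows that satisfy the filter are held in R's memory, which is the point of running the query on the database side.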
The package tidyr
addresses the common
problem of wanting to reshape your data for plotting and usage by
different R functions. For example, sometimes we want data sets where we
have one row per measurement. Other times we want a data frame where
each measurement type has its own column, and rows are instead more
aggregated groups (e.g., a time period, an experimental unit like a plot
or a batch number). Moving back and forth between these formats is
non-trivial, and tidyr
gives you tools for
this and more sophisticated data manipulation.
To learn more about dplyr
and
tidyr
after the workshop, you may want to
check out this handy
data transformation with dplyr
cheatsheet and this one
about tidyr
.
As before, we’ll read in our data using the read_csv()
function from the tidyverse package
readr
.
R
surveys <- read_csv("data_raw/portal_data_joined.csv")
OUTPUT
#> Rows: 34786 Columns: 13
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (6): species_id, sex, genus, species, taxa, plot_type
#> dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
R
## inspect the data
str(surveys)
R
## preview the data
view(surveys)
Next, we’re going to learn some of the most common
dplyr
functions:
- select(): subset columns
- filter(): subset rows on conditions
- mutate(): create new columns by using information from other columns
- group_by() and summarize(): create summary statistics on grouped data
- arrange(): sort results
- count(): count discrete values
Selecting columns and filtering rows
To select columns of a data frame, use select()
. The
first argument to this function is the data frame
(surveys
), and the subsequent arguments are the columns to
keep.
R
select(surveys, plot_id, species_id, weight)
To select all columns except certain ones, put a “-” in front of the variable to exclude it.
R
select(surveys, -record_id, -species_id)
This will select all the variables in surveys
except
record_id
and species_id
.
To choose rows based on a specific criterion, use
filter()
:
R
filter(surveys, year == 1995)
Pipes
What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.
With intermediate steps, you create a temporary data frame and use that as input to the next function, like this:
R
surveys2 <- filter(surveys, weight < 5)
surveys_sml <- select(surveys2, species_id, sex, weight)
This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.
You can also nest functions (i.e. one function inside of another), like this:
R
surveys_sml <- select(filter(surveys, weight < 5), species_id, sex, weight)
This is handy, but can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, filtering, then selecting).
The last option, pipes, are a recent addition to R. Pipes
let you take the output of one function and send it directly to the
next, which is useful when you need to do many things to the same
dataset. Pipes in R look like %>%
and are made available
via the magrittr
package, installed
automatically with dplyr
. If you use
RStudio, you can type the pipe with Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.
R
surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight)
In the above code, we use the pipe to send the surveys
dataset first through filter()
to keep rows where
weight
is less than 5, then through select()
to keep only the species_id
, sex
, and
weight
columns. Since %>%
takes the object
on its left and passes it as the first argument to the function on its
right, we don’t need to explicitly include the data frame as an argument
to the filter()
and select()
functions any
more.
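To make the "first argument" rule concrete, here is a small sketch on a made-up three-row tibble (not the surveys data). It compares a piped call with the equivalent direct call; under the rule described above, the two should produce the same result.

```r
library(dplyr)

# Made-up toy data, not the surveys dataset
toy <- tibble(species_id = c("PF", "DM", "PF"),
              weight = c(4, 40, 7))

# With the pipe: toy is passed as the first argument of filter()
piped <- toy %>% filter(weight < 5)

# Without the pipe: the data frame is written out explicitly
direct <- filter(toy, weight < 5)

# Compare the two results
identical(piped, direct)
```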
Some may find it helpful to read the pipe like the word “then.” For instance, in the example above, we took the data frame surveys, then we filtered for rows with weight < 5, then we selected columns species_id, sex, and weight. The dplyr functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe we can accomplish more complex manipulations of data frames.
If we want to create a new object with this smaller version of the data, we can assign it a new name:
R
surveys_sml <- surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight)
surveys_sml
Note that the final data frame is the leftmost part of this expression.
R
surveys %>%
filter(year < 1995) %>%
select(year, sex, weight)
Mutate
Frequently you’ll want to create new columns based on the values in
existing columns, for example to do unit conversions, or to find the
ratio of values in two columns. For this we’ll use
mutate()
.
To create a new column of weight in kg:
R
surveys %>%
mutate(weight_kg = weight / 1000)
You can also create a second new column based on the first new column
within the same call of mutate()
:
R
surveys %>%
mutate(weight_kg = weight / 1000,
weight_lb = weight_kg * 2.2)
If this runs off your screen and you just want to see the first few
rows, you can use a pipe to view the head()
of the data.
(Pipes work with non-dplyr
functions, too,
as long as the dplyr
or
magrittr
package is loaded).
R
surveys %>%
mutate(weight_kg = weight / 1000) %>%
head()
The first few rows of the output are full of NA
s, so if
we wanted to remove those we could insert a filter()
in the
chain:
R
surveys %>%
filter(!is.na(weight)) %>%
mutate(weight_kg = weight / 1000) %>%
head()
is.na()
is a function that determines whether something
is an NA
. The !
symbol negates the result, so
we’re asking for every row where weight is not an
NA
.
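A quick way to see what is.na() and ! do is to try them on a short vector. The values below are made up for illustration.

```r
# A made-up weight vector with one missing value
weights <- c(40, NA, 48)

# TRUE where the value is missing
is.na(weights)

# ! flips each TRUE/FALSE, so this is TRUE where a value is present
!is.na(weights)

# Using the negated result to keep only the non-missing values
weights[!is.na(weights)]
```

filter(!is.na(weight)) applies the same logical test row by row to the weight column.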
Challenge
Create a new data frame from the surveys
data that meets
the following criteria: contains only the species_id
column
and a new column called hindfoot_cm
containing the
hindfoot_length
values (currently in mm) converted to
centimeters. In this hindfoot_cm
column, there are no
NA
s and all values are less than 3.
Hint: think about how the commands should be ordered to produce this data frame!
R
surveys_hindfoot_cm <- surveys %>%
filter(!is.na(hindfoot_length)) %>%
mutate(hindfoot_cm = hindfoot_length / 10) %>%
filter(hindfoot_cm < 3) %>%
select(species_id, hindfoot_cm)
Split-apply-combine data analysis and the summarize()
function
Many data analysis tasks can be approached using the
split-apply-combine paradigm: split the data into groups, apply
some analysis to each group, and then combine the results. Key functions
of dplyr
for this workflow are
group_by()
and summarize()
.
The group_by()
and summarize()
functions
group_by()
is often used together with
summarize()
, which collapses each group into a single-row
summary of that group. group_by()
takes as arguments the
column names that contain the categorical variables for
which you want to calculate the summary statistics. So to compute the
mean weight
by sex:
R
surveys %>%
group_by(sex) %>%
summarize(mean_weight = mean(weight, na.rm = TRUE))
You may also have noticed that the output from these calls doesn’t
run off the screen anymore. It’s one of the advantages of
tbl_df
over data frame.
You can also group by multiple columns:
R
surveys %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight, na.rm = TRUE)) %>%
tail()
OUTPUT
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.
Here, we used tail()
to look at the last six rows of our
summary. Before, we had used head()
to look at the first
six rows. We can see that the sex
column contains
NA
values because some animals had escaped before their sex
and body weights could be determined. The resulting
mean_weight
column does not contain NA
but
NaN
(which refers to “Not a Number”) because
mean()
was called on a vector of NA
values
while at the same time setting na.rm = TRUE
. To avoid this,
we can remove the missing values for weight before we attempt to
calculate the summary statistics on weight. Because the missing values
are removed first, we can omit na.rm = TRUE
when computing
the mean:
R
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight))
OUTPUT
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.
Here, again, the output from these calls doesn’t run off the screen
anymore. If you want to display more data, you can use the
print()
function at the end of your chain with the argument
n
specifying the number of rows to display:
R
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight)) %>%
print(n = 15)
OUTPUT
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.
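As an aside on the NaN values seen earlier in this section, the following base R sketch reproduces the situation on a made-up vector that contains only missing values. With na.rm = TRUE every value is dropped before averaging, so mean() is left with nothing to average.

```r
# A made-up group in which every weight is missing
all_missing <- c(NA_real_, NA_real_)

# Without na.rm, the missing values propagate to the result
mean_with_na <- mean(all_missing)

# With na.rm = TRUE, all values are removed first, leaving an empty vector
mean_without_na <- mean(all_missing, na.rm = TRUE)

is.na(mean_with_na)
is.nan(mean_without_na)
```

Filtering out rows with missing weight before grouping avoids creating such empty groups in the first place.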
Once the data are grouped, you can also summarize multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum weight for each species for each sex:
R
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
min_weight = min(weight))
OUTPUT
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.
It is sometimes useful to rearrange the result of a query to inspect
the values. For instance, we can sort on min_weight
to put
the lighter species first:
R
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
min_weight = min(weight)) %>%
arrange(min_weight)
OUTPUT
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.
To sort in descending order, we need to add the desc()
function. If we want to sort the results by decreasing order of mean
weight:
R
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
min_weight = min(weight)) %>%
arrange(desc(mean_weight))
OUTPUT
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.
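The message printed under each of the grouped summaries above mentions a .groups argument. As a hedged sketch on made-up data (not the surveys dataset), setting .groups = "drop" in summarize() returns an ungrouped result and, in recent dplyr versions, also quiets the message.

```r
library(dplyr)

# Made-up toy data, not the surveys dataset
toy <- tibble(sex = c("F", "F", "M", "M"),
              species_id = c("DM", "PF", "DM", "DM"),
              weight = c(40, 8, 44, 46))

# .groups = "drop" removes all grouping from the summarized result
summary_tbl <- toy %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight), .groups = "drop")

summary_tbl

# FALSE means no grouping is left on the result
is_grouped_df(summary_tbl)
```

Dropping the grouping is a matter of preference here; it mainly matters when later steps in a chain should operate on the whole table rather than per group.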
Counting
When working with data, we often want to know the number of
observations found for each factor or combination of factors. For this
task, dplyr
provides count()
.
For example, if we wanted to count the number of rows of data for each
sex, we would do:
R
surveys %>%
count(sex)
The count()
function is shorthand for something we’ve
already seen: grouping by a variable, and summarizing it by counting the
number of observations in that group. In other words,
surveys %>% count(sex)
is equivalent to:
R
surveys %>%
group_by(sex) %>%
summarize(count = n())
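The equivalence can be checked on a small made-up tibble. One detail to keep in mind: count() names its result column n, so the sketch below names the summarize() column n as well to make the two results directly comparable.

```r
library(dplyr)

# Made-up toy data with one missing sex value
toy <- tibble(sex = c("F", "M", "F", NA, "M", "F"))

# Shorthand version
with_count <- toy %>%
  count(sex)

# Long version: group, then count the rows in each group
with_group <- toy %>%
  group_by(sex) %>%
  summarize(n = n())

# Compare the two results as plain data frames
all.equal(as.data.frame(with_count), as.data.frame(with_group))
```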
For convenience, count()
provides the sort
argument:
R
surveys %>%
count(sex, sort = TRUE)
The previous example shows the use of count() to count the number of rows/observations for one factor (i.e., sex). If we wanted to count combinations of factors, such as sex and species, we would specify the first and the second factor as the arguments of count():
R
surveys %>%
count(sex, species)
With the above code, we can proceed with arrange()
to
sort the table according to a number of criteria so that we have a
better comparison. For instance, we might want to arrange the table
above in (i) an alphabetical order of the levels of the species and (ii)
in descending order of the count:
R
surveys %>%
count(sex, species) %>%
arrange(species, desc(n))
From the table above, we may learn that, for instance, there are 75 observations of the albigula species for which the sex is not specified (i.e., NA).
R
surveys %>%
count(plot_type)
R
surveys %>%
filter(!is.na(hindfoot_length)) %>%
group_by(species_id) %>%
summarize(
mean_hindfoot_length = mean(hindfoot_length),
min_hindfoot_length = min(hindfoot_length),
max_hindfoot_length = max(hindfoot_length),
n = n()
)
R
surveys %>%
filter(!is.na(weight)) %>%
group_by(year) %>%
filter(weight == max(weight)) %>%
select(year, genus, species, weight) %>%
arrange(year)
Reshaping with pivot_longer and pivot_wider
In the spreadsheet lesson, we discussed how to structure our data leading to the four rules defining a tidy dataset:
- Each variable has its own column
- Each observation has its own row
- Each value must have its own cell
- Each type of observational unit forms a table
Here we examine the fourth rule: Each type of observational unit forms a table.
In surveys
, the rows of surveys
contain the
values of variables associated with each record (the unit), values such
as the weight or sex of each animal associated with each record. What if
instead of comparing records, we wanted to compare the different mean
weight of each genus between plots? (Ignoring plot_type
for
simplicity).
We’d need to create a new table where each row (the unit) is
comprised of values of variables associated with each plot. In practical
terms this means the values in genus
would become the names
of column variables and the cells would contain the values of the mean
weight observed on each plot.
Having created a new table, it is therefore straightforward to explore the relationship between the weight of different genera within, and between, the plots. The key point here is that we are still following a tidy data structure, but we have reshaped the data according to the observations of interest: average genus weight per plot instead of recordings per date.
The opposite transformation would be to transform column names into values of a variable.
We can do both of these transformations with two tidyr
functions, pivot_wider()
and
pivot_longer()
.
These may sound like dramatically different data layouts, but there are some tools that make transitions between these layouts more straightforward than you might think! The gif below shows how these two formats relate to each other, and gives you an idea of how we can use R to shift from one format to the other.
Pivoting from long to wide format
pivot_wider()
takes three principal arguments:
- the data
- the names_from column variable whose values will become new column names.
- the values_from column variable whose values will fill the new column variables.
Further arguments include values_fill
which, if set,
fills in missing values with the value provided.
Let’s use pivot_wider()
to transform surveys to find the
mean weight of each genus in each plot over the entire survey period. We
use filter()
, group_by()
and
summarize()
to filter our observations and variables of
interest, and create a new variable for the
mean_weight
.
R
surveys_gw <- surveys %>%
filter(!is.na(weight)) %>%
group_by(plot_id, genus) %>%
summarize(mean_weight = mean(weight))
OUTPUT
#> `summarise()` has grouped output by 'plot_id'. You can override using the
#> `.groups` argument.
R
str(surveys_gw)
This yields surveys_gw
where the observations for each
plot are distributed across multiple rows, 196 observations of 3
variables. Using pivot_wider()
with the names from
genus
and with values from mean_weight
this
becomes 24 observations of 11 variables, one row for each plot.
R
surveys_wide <- surveys_gw %>%
pivot_wider(names_from = genus, values_from = mean_weight)
str(surveys_wide)
We could now plot comparisons between the weight of genera (one is called a genus, multiple are called genera) in different plots, although we may wish to fill in the missing values first.
R
surveys_gw %>%
pivot_wider(names_from = genus, values_from = mean_weight, values_fill = 0) %>%
head()
Pivoting from wide to long format
The opposing situation could occur if we had been provided with data
in the form of surveys_wide
, where the genus names are
column names, but we wish to treat them as values of a genus variable
instead.
In this situation we are reshaping the column names and turning them into a pair of new variables. One variable represents the column names as values, and the other variable contains the values previously associated with the column names.
pivot_longer()
takes four principal arguments:
- the data
- the names_to column variable we wish to create from column names.
- the values_to column variable we wish to create and fill with values.
- cols are the names of the columns we use to make this pivot (or to drop).
To recreate surveys_gw
from surveys_wide
we
would create a names variable called genus
and value
variable called mean_weight
.
In pivoting longer, we also need to specify what columns to reshape.
If the columns are directly adjacent as they are here, we don’t even
need to list them all out: we can just use the :
operator!
R
surveys_long <- surveys_wide %>%
pivot_longer(names_to = "genus", values_to = "mean_weight", cols = -plot_id)
str(surveys_long)
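As a small sketch of the : operator in the cols argument, the example below uses a made-up wide table with three adjacent genus columns (the numbers are invented). Selecting the range Baiomys:Neotoma should give the same result as excluding plot_id with a minus sign.

```r
library(tidyr)

# Made-up wide table: one row per plot, one column per genus
toy_wide <- tibble::tibble(plot_id = c(1, 2),
                           Baiomys = c(7, NA),
                           Dipodomys = c(60, 55),
                           Neotoma = c(156, 169))

# Select the adjacent genus columns as a range with :
toy_long <- toy_wide %>%
  pivot_longer(cols = Baiomys:Neotoma,
               names_to = "genus",
               values_to = "mean_weight")

# Same reshape, written by excluding plot_id instead
toy_long_minus <- toy_wide %>%
  pivot_longer(cols = -plot_id,
               names_to = "genus",
               values_to = "mean_weight")

toy_long
identical(toy_long, toy_long_minus)
```

The range form depends on column order, so the minus form is the safer choice when columns might be rearranged.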
Note that now the NA
genera are included in the long
format data frame. Pivoting wider and then longer can be a useful way to
balance out a dataset so that every replicate has the same
composition.
We could also have used a specification for what columns to exclude.
In this example, we will use all columns except
plot_id
for the names variable. By using the minus sign in
the cols
argument, we omit plot_id
from being
reshaped.
R
surveys_wide %>%
pivot_longer(names_to = "genus", values_to = "mean_weight", cols = -plot_id) %>%
head()
Challenge
- Reshape the surveys data frame with year as columns, plot_id as rows, and the number of genera per plot as the values. You will need to summarize before reshaping, and use the function n_distinct() to get the number of unique genera within a particular chunk of data. It’s a powerful function! See ?n_distinct for more.
R
surveys_wide_genera <- surveys %>%
group_by(plot_id, year) %>%
summarize(n_genera = n_distinct(genus)) %>%
pivot_wider(names_from = year, values_from = n_genera)
OUTPUT
#> `summarise()` has grouped output by 'plot_id'. You can override using the
#> `.groups` argument.
R
head(surveys_wide_genera)
R
surveys_wide_genera %>%
pivot_longer(names_to = "year", values_to = "n_genera", cols = -plot_id)
Challenge (continued)
- The surveys data set has two measurement columns: hindfoot_length and weight. This makes it difficult to do things like look at the relationship between mean values of each measurement per year in different plot types. Let’s walk through a common solution for this type of problem. First, use pivot_longer() to create a dataset where we have a names column called measurement and a value column that takes on the value of either hindfoot_length or weight. Hint: You’ll need to specify which columns will be part of the reshape.
R
surveys_long <- surveys %>%
pivot_longer(names_to = "measurement", values_to = "value", cols = c(hindfoot_length, weight))
- With this new data set, calculate the average of each measurement in each year for each different plot_type. Then pivot_wider() them into a data set with a column for hindfoot_length and weight. Hint: You only need to specify the names and values columns for pivot_wider().
R
surveys_long %>%
group_by(year, measurement, plot_type) %>%
summarize(mean_value = mean(value, na.rm=TRUE)) %>%
pivot_wider(names_from = measurement, values_from = mean_value)
OUTPUT
#> `summarise()` has grouped output by 'year', 'measurement'. You can override
#> using the `.groups` argument.
Exporting data
Now that you have learned how to use
dplyr
to extract information from or
summarize your raw data, you may want to export these new data sets to
share them with your collaborators or for archival.
Similar to the read_csv()
function used for reading CSV
files into R, there is a write_csv()
function that
generates CSV files from data frames.
Before using write_csv()
, we are going to create a new
folder, data
, in our working directory that will store this
generated dataset. We don’t want to write generated datasets in the same
directory as our raw data. It’s good practice to keep them separate. The
data_raw
folder should only contain the raw, unaltered
data, and should be left alone to make sure we don’t delete or modify
it. In contrast, our script will generate the contents of the
data
directory, so even if the files it contains are
deleted, we can always re-generate them.
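The folder can be created from within R as well as from a file browser. The sketch below uses base R only and assumes the working directory is the project folder; it creates data only when it is not already there, so it is safe to run repeatedly.

```r
# Create the "data" folder in the working directory only if it does not exist yet
if (!dir.exists("data")) {
  dir.create("data")
}

# Confirm that the folder is now present
dir.exists("data")
```

Keeping this step in the script means the whole analysis, including the output folder, can be regenerated from scratch.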
In preparation for our next lesson on plotting, we are going to prepare a cleaned up version of the data set that doesn’t include any missing data.
Let’s start by removing observations of animals for which
weight
and hindfoot_length
are missing, or the
sex
has not been determined:
R
surveys_complete <- surveys %>%
filter(!is.na(weight), # remove missing weight
!is.na(hindfoot_length), # remove missing hindfoot_length
!is.na(sex)) # remove missing sex
Because we are interested in plotting how species abundances have changed through time, we are also going to remove observations for rare species (i.e., that have been observed fewer than 50 times). We will do this in two steps: first we are going to create a data set that counts how often each species has been observed, and filter out the rare species; then, we will extract only the observations for these more common species:
R
## Extract the most common species_id
species_counts <- surveys_complete %>%
count(species_id) %>%
filter(n >= 50)
## Only keep the most common species
surveys_complete <- surveys_complete %>%
filter(species_id %in% species_counts$species_id)
To make sure that everyone has the same data set, check that
surveys_complete
has 30463 rows and 13 columns by typing
dim(surveys_complete)
.
Now that our data set is ready, we can save it as a CSV file in our
data
folder.
R
write_csv(surveys_complete, file = "data/surveys_complete.csv")
Content from Data visualization with ggplot2
Last updated on 2023-11-21
Overview
Questions
- How do you make plots using R?
- How do you customize and modify plots?
Objectives
- Produce scatter plots, boxplots, and time series plots using ggplot.
- Set universal plot settings.
- Describe what faceting is and apply faceting in ggplot.
- Modify the aesthetics of an existing ggplot plot (including axis labels and color).
- Build complex and customized plots from data in a data frame.
We start by loading the required packages.
ggplot2
is included in the
tidyverse
package.
R
library(tidyverse)
If not still in the workspace, load the data we saved in the previous lesson.
R
surveys_complete <- read_csv("data/surveys_complete.csv")
Plotting with ggplot2
ggplot2
is a plotting package that
provides helpful commands to create complex plots from data in a data
frame. It provides a more programmatic interface for specifying what
variables to plot, how they are displayed, and general visual
properties. Therefore, we only need minimal changes if the underlying
data change or if we decide to change from a bar plot to a scatterplot.
This helps in creating publication quality plots with minimal amounts of
adjustments and tweaking.
ggplot2 refers to the name of the package itself. When using the package we use the function ggplot() to generate the plots, and so references to using the function will be referred to as ggplot() and the package as a whole as ggplot2.
ggplot2 plots work best with data in the ‘long’ format, i.e., a column for every variable, and a row for every observation. Well-structured data will save you lots of time when making figures with ggplot2.
ggplot graphics are built layer by layer by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.
To build a ggplot, we will use the following basic template that can be used for different types of plots:
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
- use the ggplot() function and bind the plot to a specific data frame using the data argument
R
ggplot(data = surveys_complete)
- define an aesthetic mapping (using the aesthetic (aes) function), by selecting the variables to be plotted and specifying how to present them in the graph, e.g., as x/y positions or characteristics such as size, shape, color, etc.
R
ggplot(data = surveys_complete, mapping = aes(x = weight, y = hindfoot_length))
- add ‘geoms’: graphical representations of the data in the plot (points, lines, bars). ggplot2 offers many different geoms; we will use some common ones today, including:
  - geom_point() for scatter plots, dot plots, etc.
  - geom_boxplot() for, well, boxplots!
  - geom_line() for trend lines, time series, etc.
To add a geom to the plot use the +
operator. Because we
have two continuous variables, let’s use geom_point()
first:
R
ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) +
geom_point()
The +
in the ggplot2
package is particularly useful because it allows you to modify existing
ggplot
objects. This means you can easily set up plot
“templates” and conveniently explore different types of plots, so the
above plot can also be generated with code like this:
R
# Assign plot to a variable
surveys_plot <- ggplot(data = surveys_complete,
mapping = aes(x = weight, y = hindfoot_length))
# Draw the plot
surveys_plot +
geom_point()
Notes
- Anything you put in the ggplot() function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x- and y-axis you set up in aes().
- You can also specify aesthetics for a given geom independently of the aesthetics defined globally in the ggplot() function.
- The + sign used to add layers must be placed at the end of each line containing a layer. If, instead, the + sign is placed at the start of the line containing the new layer, ggplot2 will not add the new layer and will return an error message.
- You may notice that we sometimes reference ‘ggplot2’ and sometimes ‘ggplot’. To clarify, ‘ggplot2’ is the name of the most recent version of the package. However, any time we call the function itself, it’s just called ‘ggplot’.
- The previous version of the ggplot2 package, called ggplot, which also contained the ggplot() function is now unsupported and has been removed from CRAN in order to reduce accidental installations and further confusion.
R
# This is the correct syntax for adding layers
surveys_plot +
geom_point()
# This will not add the new layer and will return an error message
surveys_plot
+ geom_point()
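The second note above, about giving one geom its own aesthetics, can be sketched as follows. This is only an illustration: the toy data frame and its values are hypothetical stand-ins for surveys_complete.

```r
library(ggplot2)

# Hypothetical toy data standing in for surveys_complete
toy <- data.frame(
  weight = c(10, 20, 30, 40),
  hindfoot_length = c(15, 22, 30, 36),
  species_id = c("A", "A", "B", "B")
)

# x and y are set globally in ggplot(), so both layers inherit them
layered_plot <- ggplot(data = toy, mapping = aes(x = weight, y = hindfoot_length)) +
  geom_line() +                          # uses only the global x/y mapping
  geom_point(aes(color = species_id))    # adds a color mapping for this layer only

layered_plot
```

Here the line stays a single color because the color mapping was given to geom_point() alone.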
Challenge (optional)
Scatter plots can be useful exploratory tools for small datasets. For
data sets with large numbers of observations, such as the
surveys_complete
data set, overplotting of points can be a
limitation of scatter plots. One strategy for handling such settings is
to use hexagonal binning of observations. The plot space is tessellated
into hexagons. Each hexagon is assigned a color based on the number of
observations that fall within its boundaries. To use hexagonal binning
with ggplot2
, first install the R package
hexbin
from CRAN:
R
install.packages("hexbin")
library(hexbin)
Then use the geom_hex()
function:
R
surveys_plot +
geom_hex()
- What are the relative strengths and weaknesses of a hexagonal bin plot compared to a scatter plot? Examine the above scatter plot and compare it with the hexagonal bin plot that you created.
Building your plots iteratively
Building plots with ggplot2
is
typically an iterative process. We start by defining the dataset we’ll
use, lay out the axes, and choose a geom:
R
ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) +
geom_point()
Then, we start modifying this plot to extract more information from
it. For instance, we can add transparency (alpha
) to avoid
overplotting:
R
ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.1)
We can also add colors for all the points:
R
ggplot(data = surveys_complete, mapping = aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.1, color = "blue")
Or, to color each species in the plot differently, you could map a variable to the color aesthetic inside aes(). ggplot2 will assign a different color to each distinct value of that variable. Here is an example where we color by species_id:
R
ggplot(data = surveys_complete, mapping = aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.1, aes(color = species_id))
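The difference between setting a color (a constant outside aes()) and mapping a color (a variable inside aes()) can be checked with a small sketch. The toy data frame is hypothetical; layer_data() is the ggplot2 helper that returns the values computed for one layer.

```r
library(ggplot2)

# Hypothetical toy data standing in for surveys_complete
toy <- data.frame(
  weight = c(10, 20, 30, 40),
  hindfoot_length = c(15, 22, 30, 36),
  species_id = c("A", "A", "B", "B")
)

# Setting: a constant given outside aes() applies to every point
p_set <- ggplot(toy, aes(x = weight, y = hindfoot_length)) +
  geom_point(color = "blue")

# Mapping: a variable given inside aes() gets one color per distinct value
p_map <- ggplot(toy, aes(x = weight, y = hindfoot_length)) +
  geom_point(aes(color = species_id))

# Inspect the colors ggplot2 computed for the first layer of each plot
unique(layer_data(p_set, 1)$colour)
length(unique(layer_data(p_map, 1)$colour))
```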
R
ggplot(data = surveys_complete,
mapping = aes(x = species_id, y = weight)) +
geom_point(aes(color = plot_type))
Boxplot
We can use boxplots to visualize the distribution of weight within each species:
R
ggplot(data = surveys_complete, mapping = aes(x = species_id, y = weight)) +
geom_boxplot()
By adding points to the boxplot, we can have a better idea of the number of measurements and of their distribution. Because the boxplot shows outliers by default, these points would be plotted twice – once by geom_boxplot and once by geom_jitter. To avoid this, we tell the boxplot not to draw outliers by specifying outlier.shape = NA.
R
ggplot(data = surveys_complete, mapping = aes(x = species_id, y = weight)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(alpha = 0.3, color = "tomato")
Notice how the boxplot layer is behind the jitter layer? What do you need to change in the code to put the boxplot in front of the points such that it’s not hidden?
Challenges
Boxplots are useful summaries, but hide the shape of the distribution. For example, if there is a bimodal distribution, it would not be observed with a boxplot. An alternative to the boxplot is the violin plot (sometimes known as a beanplot), where the shape (of the density of points) is drawn.
- Replace the box plot with a violin plot; see
geom_violin()
.
R
ggplot(data = surveys_complete, mapping = aes(x = species_id, y = weight)) +
geom_jitter(alpha = 0.3, color = "tomato") +
geom_violin()
Challenges (continued)
In many types of data, it is important to consider the scale of the observations. For example, it may be worth changing the scale of the axis to better distribute the observations in the space of the plot. Changing the scale of the axes is done similarly to adding/modifying other components (i.e., by incrementally adding commands). Try making these modifications:
- Represent weight on the log10 scale; see
scale_y_log10()
.
R
ggplot(data = surveys_complete, mapping = aes(x = species_id, y = weight)) +
scale_y_log10() +
geom_jitter(alpha = 0.3, color = "tomato") +
geom_boxplot(outlier.shape = NA)
R
ggplot(data = surveys_complete, mapping = aes(x = species_id, y = hindfoot_length)) +
geom_jitter(alpha = 0.3, color = "tomato") +
geom_boxplot(outlier.shape = NA)
Plotting time series data
Let’s calculate the number of records per year for each genus. First we need to group the data and count records within each group:
R
yearly_counts <- surveys_complete %>%
count(year, genus)
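As a rough sketch of what count() does here, the two pipelines below should give the same table: count() is shorthand for grouping and then counting the rows in each group. The toy data frame is hypothetical.

```r
library(dplyr)

# Hypothetical toy data: one row per captured animal
toy <- data.frame(
  year  = c(1977, 1977, 1977, 1978, 1978),
  genus = c("Dipodomys", "Dipodomys", "Neotoma", "Dipodomys", "Neotoma")
)

# Short form
counts_short <- toy %>%
  count(year, genus)

# Long form: group, then count the rows in each group
counts_long <- toy %>%
  group_by(year, genus) %>%
  summarize(n = n(), .groups = "drop")

counts_short
```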
Time series data can be visualized as a line plot with years on the x-axis and counts on the y-axis:
R
ggplot(data = yearly_counts, aes(x = year, y = n)) +
geom_line()
Unfortunately, this does not give a useful plot: because the data for all the genera are plotted together, a single line zigzags through every count. We need to tell ggplot to draw a line for each genus by modifying the aesthetic function to include group = genus:
R
ggplot(data = yearly_counts, aes(x = year, y = n, group = genus)) +
geom_line()
We will be able to distinguish genera in the plot if we add colors
(using color
also automatically groups the data):
R
ggplot(data = yearly_counts, aes(x = year, y = n, color = genus)) +
geom_line()
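One way to confirm that color also groups the data is to count the groups ggplot2 computes for the line layer. This is a sketch on a hypothetical toy table shaped like yearly_counts.

```r
library(ggplot2)

# Hypothetical toy table with the same columns as yearly_counts
yearly_toy <- data.frame(
  year  = c(1977, 1978, 1979, 1977, 1978, 1979),
  genus = c("Dipodomys", "Dipodomys", "Dipodomys", "Neotoma", "Neotoma", "Neotoma"),
  n     = c(10, 12, 9, 3, 4, 6)
)

p_single  <- ggplot(yearly_toy, aes(x = year, y = n)) + geom_line()
p_grouped <- ggplot(yearly_toy, aes(x = year, y = n, color = genus)) + geom_line()

# Number of distinct groups, i.e. separate lines, in each plot
length(unique(layer_data(p_single, 1)$group))
length(unique(layer_data(p_grouped, 1)$group))
```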
Integrating the pipe operator with ggplot2
In the previous lesson, we saw how to use the pipe operator
%>%
to use different functions in a sequence and create
a coherent workflow. We can also use the pipe operator to pass the
data
argument to the ggplot()
function. The
hard part is to remember that to build your ggplot, you need to use
+
and not %>%
.
R
yearly_counts %>%
ggplot(mapping = aes(x = year, y = n, color = genus)) +
geom_line()
The pipe operator can also be used to link data manipulation with subsequent data visualization.
R
yearly_counts_graph <- surveys_complete %>%
count(year, genus) %>%
ggplot(mapping = aes(x = year, y = n, color = genus)) +
geom_line()
yearly_counts_graph
Faceting
ggplot
has a special technique called faceting
that allows the user to split one plot into multiple plots based on a
factor included in the dataset. We will use it to make a time series
plot for each genus:
R
ggplot(data = yearly_counts, aes(x = year, y = n)) +
geom_line() +
facet_wrap(facets = vars(genus))
Now we would like to split the line in each plot by the sex of each
individual measured. To do that we need to make counts in the data frame
grouped by year
, genus
, and
sex
:
R
yearly_sex_counts <- surveys_complete %>%
count(year, genus, sex)
We can now make the faceted plot by splitting further by sex using
color
(within a single plot):
R
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(facets = vars(genus))
We can also facet both by sex and genus:
R
ggplot(data = yearly_sex_counts,
mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_grid(rows = vars(sex), cols = vars(genus))
You can also organise the panels only by rows (or only by columns):
R
# One column, facet by rows
ggplot(data = yearly_sex_counts,
mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_grid(rows = vars(genus))
R
# One row, facet by column
ggplot(data = yearly_sex_counts,
mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_grid(cols = vars(genus))
Note: ggplot2 before version 3.0.0 used formulas to specify how plots are faceted. If you encounter facet_grid() or facet_wrap() code containing ~, please read https://ggplot2.tidyverse.org/news/#tidy-evaluation.
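For readers who meet the older formula style, the sketch below puts it next to the vars() style used in this lesson. Both forms are accepted by current ggplot2; the toy data frame is hypothetical.

```r
library(ggplot2)

# Hypothetical toy data: two years for each combination of sex and genus
toy <- data.frame(
  year  = rep(c(1977, 1978), times = 4),
  n     = c(5, 7, 3, 4, 6, 8, 2, 1),
  sex   = rep(c("F", "M"), each = 4),
  genus = rep(rep(c("Dipodomys", "Neotoma"), each = 2), times = 2)
)

base_plot <- ggplot(toy, aes(x = year, y = n)) + geom_line()

# Older formula interface
wrap_formula <- base_plot + facet_wrap(~ genus)
grid_formula <- base_plot + facet_grid(sex ~ genus)

# Equivalent vars() interface
wrap_vars <- base_plot + facet_wrap(vars(genus))
grid_vars <- base_plot + facet_grid(rows = vars(sex), cols = vars(genus))
```

In the formula sex ~ genus, the left-hand side gives the rows and the right-hand side gives the columns.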
ggplot2
themes
Plots with a white background usually look more readable when printed. Every single component of a ggplot graph can be customized using the generic theme() function, as we will see below.
However, there are pre-loaded themes available that change the overall
appearance of the graph without much effort.
For example, we can change our previous graph to have a simpler white
background using the theme_bw()
function:
R
ggplot(data = yearly_sex_counts,
mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
theme_bw()
In addition to theme_bw()
, which changes the plot
background to white, ggplot2
comes with
several other themes which can be useful to quickly change the look of
your visualization. The complete list of themes is available at https://ggplot2.tidyverse.org/reference/ggtheme.html.
theme_minimal()
and theme_light()
are popular,
and theme_void()
can be useful as a starting point to
create a new hand-crafted theme.
The ggthemes package provides a wide variety of options.
R
yearly_weight <- surveys_complete %>%
group_by(year, species_id) %>%
summarize(avg_weight = mean(weight))
OUTPUT
#> `summarise()` has grouped output by 'year'. You can override using the
#> `.groups` argument.
R
ggplot(data = yearly_weight, mapping = aes(x=year, y=avg_weight)) +
geom_line() +
facet_wrap(vars(species_id)) +
theme_bw()
Customization
Take a look at the ggplot2
cheat sheet, and think of ways you could improve the plot.
Now, let’s change the names of the axes to something more informative than ‘year’ and ‘n’ and add a title to the figure:
R
ggplot(data = yearly_sex_counts, aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of individuals") +
theme_bw()
The axes have more informative names, but their readability can be
improved by increasing the font size. This can be done with the generic
theme()
function:
R
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of individuals") +
theme_bw() +
theme(text=element_text(size = 16))
Note that it is also possible to change the fonts of your plots. If
you are on Windows, you may have to install the extrafont
package, and follow the instructions included in the README for this
package.
After our manipulations, you may notice that the values on the x-axis
are still not properly readable. Let’s change the orientation of the
labels and adjust them vertically and horizontally so they don’t
overlap. You can use a 90 degree angle, or experiment to find the
appropriate angle for diagonally oriented labels. We can also modify the
facet label text (strip.text
) to italicize the genus
names:
R
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of individuals") +
theme_bw() +
theme(axis.text.x = element_text(colour = "grey20", size = 12, angle = 90, hjust = 0.5, vjust = 0.5),
axis.text.y = element_text(colour = "grey20", size = 12),
strip.text = element_text(face = "italic"),
text = element_text(size = 16))
If you like the changes you created better than the default theme, you can save them as an object to be able to easily apply them to other plots you may create:
R
grey_theme <- theme(axis.text.x = element_text(colour="grey20", size = 12,
angle = 90, hjust = 0.5,
vjust = 0.5),
axis.text.y = element_text(colour = "grey20", size = 12),
text=element_text(size = 16))
ggplot(surveys_complete, aes(x = species_id, y = hindfoot_length)) +
geom_boxplot() +
grey_theme
Challenge
With all of this information in hand, please take another five
minutes to either improve one of the plots generated in this exercise or
create a beautiful graph of your own. Use the RStudio ggplot2
cheat sheet for inspiration.
Here are some ideas:
- See if you can change the thickness of the lines.
- Can you find a way to change the name of the legend? What about its labels?
- Try using a different color palette (see https://r-graphics.org/chapter-colors).
Arranging plots
Faceting is a great tool for splitting one plot into multiple plots,
but sometimes you may want to produce a single figure that contains
multiple plots using different variables or even different data frames.
The patchwork
package allows us to combine
separate ggplots into a single figure while keeping everything aligned
properly. Like most R packages, we can install patchwork
from CRAN, the R package repository:
R
install.packages("patchwork")
After you have loaded the patchwork
package you can use
+
to place plots next to each other, /
to
arrange them vertically, and plot_layout()
to determine how
much space each plot uses:
R
library(patchwork)
plot_weight <- ggplot(data = surveys_complete, aes(x = species_id, y = weight)) +
geom_boxplot() +
labs(x = "Species", y = expression(log[10](Weight))) +
scale_y_log10()
plot_count <- ggplot(data = yearly_counts, aes(x = year, y = n, color = genus)) +
geom_line() +
labs(x = "Year", y = "Abundance")
plot_weight / plot_count + plot_layout(heights = c(3, 2))
You can also use parentheses ()
to create more complex
layouts. There are many useful examples on the patchwork website
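A minimal sketch of a parenthesised layout is shown here. It assumes patchwork's | operator, which places plots side by side, in addition to the / operator described above; the toy data and plot names are hypothetical.

```r
library(ggplot2)
library(patchwork)

# Hypothetical toy data
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5), group = c("a", "a", "b", "b", "b"))

p_points <- ggplot(toy, aes(x = x, y = y)) + geom_point()
p_lines  <- ggplot(toy, aes(x = x, y = y)) + geom_line()
p_boxes  <- ggplot(toy, aes(x = group, y = y)) + geom_boxplot()

# Two plots side by side on top, one wide plot underneath
combined <- (p_points | p_lines) / p_boxes

combined
```

The parentheses make the top row a unit of its own, so the bottom plot spans the full width.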
Exporting plots
After creating your plot, you can save it to a file in your favorite format. The Export tab in the Plot pane in RStudio will save your plots at low resolution, which will not be accepted by many journals and will not scale well for posters. The ggplot2 extensions website provides a list of packages that extend the capabilities of ggplot2, including additional themes.
Rather than relying on the Export tab, use the ggsave() function, which allows you to easily change the dimension and resolution of your plot by adjusting the appropriate arguments (width, height and dpi):
R
my_plot <- ggplot(data = yearly_sex_counts,
aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of individuals") +
theme_bw() +
theme(axis.text.x = element_text(colour = "grey20", size = 12, angle = 90,
hjust = 0.5, vjust = 0.5),
axis.text.y = element_text(colour = "grey20", size = 12),
text = element_text(size = 16))
ggsave("name_of_file.png", my_plot, width = 15, height = 10)
## This also works for plots combined with patchwork
plot_combined <- plot_weight / plot_count + plot_layout(heights = c(3, 2))
ggsave("plot_combined.png", plot_combined, width = 10, dpi = 300)
Note: The parameters width
and height
also
determine the font size in the saved plot.
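The relationship between width, height and dpi can be tried out on a throwaway file. This sketch writes a hypothetical toy plot to the session's temporary directory, so nothing in your project is touched.

```r
library(ggplot2)

# Hypothetical toy plot
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))
toy_plot <- ggplot(toy, aes(x = x, y = y)) + geom_point()

out_file <- file.path(tempdir(), "toy_plot.png")

# 6 x 4 inches at 300 dpi gives an image of 1800 x 1200 pixels
ggsave(out_file, toy_plot, width = 6, height = 4, units = "in", dpi = 300)

file.exists(out_file)
```

Because text size is defined in points, the same plot saved at a smaller width and height will show relatively larger text.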
Keypoints
- start simple and build your plots iteratively
- the ggplot() function initiates a plot, and geom_ functions add representations of your data
- use aes() when mapping a variable from the data to a part of the plot
- use facet_ to partition a plot into multiple plots based on a factor included in the dataset
- use premade theme_ functions to broadly change appearance, and the theme() function to fine-tune
- the patchwork library can combine separate plots into a single figure
- use ggsave() to save plots in your favorite format and dimensions
Content from SQL databases and R
Last updated on 2023-11-21
Overview
Questions
- How can I connect and get data from a database in R?
- How can I retrieve data from multiple database tables?
- How can I create an SQLite database?
Objectives
- Access a database from R.
- Run SQL queries in R using RSQLite and dplyr.
- Describe the lazy behavior of dplyr on data stored in a database outside of R.
- Prototype queries and retrieve all final results.
- Create complex queries across one or multiple database tables.
- Create an SQLite database from existing .csv files.
Introduction
So far, we have dealt with small datasets that easily fit into your computer’s memory. But what about datasets that are too large for your computer to handle as a whole? In this case, storing the data outside of R and organizing it in a database is helpful. Connecting to the database allows you to retrieve only the chunks needed for the current analysis.
Even better, many large datasets are already available in public or private databases. You can query them without having to download the data first.
R can connect to almost any existing database type. Most common database types have R packages that allow you to connect to them (e.g., RSQLite, RMySQL, etc.). Furthermore, the dplyr package you used in the previous chapter, in conjunction with dbplyr, supports connecting to the widely-used open source databases SQLite, MySQL and PostgreSQL, as well as Google’s BigQuery, and it can also be extended to other database types (a vignette in the dplyr package explains how to do it). RStudio has created a website that provides documentation and best practices to work on database interfaces.
Interfacing with databases using dplyr
focuses on retrieving and analyzing datasets by generating
SELECT
SQL statements, but it doesn’t modify the database
itself. dplyr
does not offer functions to
UPDATE
or DELETE
entries. If you need these
functionalities, you will need to use additional R packages (e.g.,
RSQLite
). Here we will demonstrate how to
interact with a database using dplyr
,
using both the dplyr
’s verb syntax and the
SQL syntax.
The portal_mammals database
We will continue to explore the surveys data you are already familiar with from previous lessons. First, we are going to install the dbplyr and RSQLite packages:
R
install.packages(c("dbplyr", "RSQLite"))
The SQLite database is contained in a single file
portal_mammals.sqlite
that you generated during the
SQL lesson. If you don’t have it, you can download it from Figshare
into the data_raw
subdirectory using:
R
dir.create("data_raw", showWarnings = FALSE)
download.file(url = "https://ndownloader.figshare.com/files/2292171",
destfile = "data_raw/portal_mammals.sqlite", mode = "wb")
Connecting to databases
We can point R to this database using:
R
library(dplyr)
library(dbplyr)
OUTPUT
#>
#> Attaching package: 'dbplyr'
OUTPUT
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
R
mammals <- DBI::dbConnect(RSQLite::SQLite(), "data_raw/portal_mammals.sqlite")
This command uses two packages that help dbplyr and dplyr talk to the SQLite database. DBI is not something that you’ll normally use directly as a user, apart from opening the connection. It allows R to send commands to databases irrespective of the database management system used. The RSQLite package allows R to interface with SQLite databases.
This command does not load the data into the R session (as the
read_csv()
function did). Instead, it merely instructs R to
connect to the SQLite
database contained in the
portal_mammals.sqlite
file.
Using a similar approach, you could connect to many other database management systems that are supported by R including MySQL, PostgreSQL, BigQuery, etc.
Let’s take a closer look at the mammals
database we just
connected to:
R
src_dbi(mammals)
OUTPUT
#> src: sqlite 3.41.2 [/home/runner/work/R-ecology-lesson/R-ecology-lesson/site/built/data_raw/portal_mammals.sqlite]
#> tbls: plots, species, surveys
Just like a spreadsheet with multiple worksheets, a SQLite database
can contain multiple tables. In this case three of them are listed in
the tbls
row in the output above:
- plots
- species
- surveys
Now that we know we can connect to the database, let’s explore how to get the data from its tables into R.
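If you want to experiment without touching portal_mammals.sqlite, the connection step can be rehearsed on a throwaway database. This is a sketch under that assumption: the special name ":memory:" asks SQLite for a temporary in-memory database, and the toy table contents are hypothetical.

```r
library(DBI)

# ":memory:" asks SQLite for a temporary database that lives only in RAM
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy table
toy_plots <- data.frame(
  plot_id   = 1:3,
  plot_type = c("Control", "Rodent Exclosure", "Control")
)
dbWriteTable(con, "plots", toy_plots)

# List the tables the database now contains
tables <- dbListTables(con)
tables

# Close the connection when finished
dbDisconnect(con)
```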
Querying the database with the SQL syntax
To connect to tables within a database, you can use the
tbl()
function from dplyr
.
This function can be used to send SQL queries to the database. To
demonstrate this functionality, let’s select the columns “year”,
“species_id”, and “plot_id” from the surveys
table:
R
tbl(mammals, sql("SELECT year, species_id, plot_id FROM surveys"))
With this approach you can use any of the SQL queries we have seen in the database lesson.
Querying the database with the dplyr syntax
One of the strengths of dplyr
is that
the same operation can be done using
dplyr
’s verbs instead of writing SQL.
First, we select the table on which to do the operations by creating the
surveys
object, and then we use the standard
dplyr
syntax as if it were a data
frame:
R
surveys <- tbl(mammals, "surveys")
surveys %>%
select(year, species_id, plot_id)
In this case, the surveys
object behaves like a data
frame. Several functions that can be used with data frames can also be
used on tables from a database. For instance, the head()
function can be used to check the first 10 rows of the table:
R
head(surveys, n = 10)
OUTPUT
#> # Source: SQL [10 x 9]
#> # Database: sqlite 3.41.2 [/home/runner/work/R-ecology-lesson/R-ecology-lesson/site/built/data_raw/portal_mammals.sqlite]
#> record_id month day year plot_id species_id sex hindfoot_length weight
#> <int> <int> <int> <int> <int> <chr> <chr> <int> <int>
#> 1 1 7 16 1977 2 NL M 32 NA
#> 2 2 7 16 1977 3 NL M 33 NA
#> 3 3 7 16 1977 2 DM F 37 NA
#> 4 4 7 16 1977 7 DM M 36 NA
#> 5 5 7 16 1977 3 DM M 35 NA
#> 6 6 7 16 1977 1 PF M 14 NA
#> 7 7 7 16 1977 2 PE F NA NA
#> 8 8 7 16 1977 1 DM M 37 NA
#> 9 9 7 16 1977 1 DM F 34 NA
#> 10 10 7 16 1977 6 PF F 20 NA
This output of the head command looks just like a regular data.frame: the table has 9 columns and the head() command shows us the first 10 rows. Note that the columns plot_type, taxa, genus, and species are missing. These are now located in the tables plots and species, which we will join together in a moment.
However, some functions don’t work quite as expected. For instance,
let’s check how many rows there are in total using
nrow()
:
R
nrow(surveys)
OUTPUT
#> [1] NA
That’s strange: R doesn’t know how many rows the surveys table contains, so it returns NA instead. For the same reason, the first line of the output of many database queries shows ?? in place of the number of rows.
The reason for this behavior highlights a key difference between
using dplyr
on datasets in memory
(e.g. loaded into your R session via read_csv()
) and those
provided by a database. To understand it, we take a closer look at how
dplyr
communicates with our SQLite
database.
SQL translation
Relational databases typically use a special-purpose language, Structured Query Language (SQL), to manage and query data.
For example, the following SQL query returns the first 10 rows from
the surveys
table:
SQL
SELECT *
FROM `surveys`
LIMIT 10
Behind the scenes, dplyr
:
- translates your R code into SQL
- submits it to the database
- translates the database’s response into an R data frame
To lift the curtain, we can use dplyr
’s
show_query()
function to show which SQL commands are
actually sent to the database:
R
show_query(head(surveys, n = 10))
The output shows the actual SQL query sent to the database; it
matches our manually constructed SELECT
statement
above.
Instead of having to formulate the SQL query ourselves - and having
to mentally switch back and forth between R and SQL syntax - we can
delegate this translation to dplyr
. (You
don’t even need to know SQL to interact with a database via
dplyr
!)
dplyr
, in turn, doesn’t do the real
work of subsetting the table, either. Instead, it merely sends the query
to the database, waits for its response and returns it to us.
That way, R never gets to see the full surveys
table -
and that’s why it could not tell us how many rows it contains. On the
bright side, this allows us to work with large datasets - even too large
to fit into our computer’s memory.
dplyr
can translate many different
query types into SQL allowing us to, e.g., select()
specific columns, filter()
rows, or join tables.
To see this in action, let’s compose a few queries with
dplyr
.
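Before running queries against the real database, the translation step can be watched on a small scale. The sketch below builds a hypothetical three-row surveys table in an in-memory SQLite database and prints the SQL generated for a filter() plus select() pipeline; sql_render() is the dbplyr function that returns that SQL as an object rather than printing it.

```r
library(dplyr)
library(dbplyr)

# Hypothetical toy table in a temporary in-memory database
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "surveys", data.frame(
  species_id = c("PF", "RM", "DM"),
  sex        = c("M", "F", "M"),
  weight     = c(4L, 4L, 40L)
))

light_query <- tbl(con, "surveys") %>%
  filter(weight < 5) %>%
  select(species_id, weight)

# Print the SQL that dplyr would send; nothing is fetched yet
show_query(light_query)

# The same SQL as a character string
sql_text <- as.character(sql_render(light_query))

DBI::dbDisconnect(con)
```

The exact formatting of the generated SQL may differ between dbplyr versions, but it should contain a WHERE clause for the filter.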
Simple database queries
First, let’s only request rows of the surveys
table in
which weight
is less than 5 and keep only the species_id,
sex, and weight columns.
R
surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight)
OUTPUT
#> # Source: SQL [?? x 3]
#> # Database: sqlite 3.41.2 [/home/runner/work/R-ecology-lesson/R-ecology-lesson/site/built/data_raw/portal_mammals.sqlite]
#> species_id sex weight
#> <chr> <chr> <int>
#> 1 PF M 4
#> 2 PF F 4
#> 3 PF <NA> 4
#> 4 PF F 4
#> 5 PF F 4
#> 6 RM M 4
#> 7 RM F 4
#> 8 RM M 4
#> 9 RM M 4
#> 10 RM M 4
#> # ℹ more rows
Executing this command will return a table with 10 rows and the requested species_id, sex and weight columns. Great!
… but wait, why are there only 10 rows?
The last line:
# ℹ more rows
indicates that there are more results that fit our filtering criterion. Why was R lazy, retrieving only 10 of them?
Laziness
Hadley Wickham, the author of dplyr, explains:
When working with databases, dplyr
tries to be as lazy as possible:
- It never pulls data into R unless you explicitly ask for it.
- It delays doing any work until the last possible moment - it collects together everything you want to do and then sends it to the database in one step.
When you construct a dplyr
query, you
can connect multiple verbs into a single pipeline. For example, we
combined the filter()
and select()
verbs using
the %>%
pipe.
If we wanted to, we could add on even more steps, e.g. remove the
sex
column in an additional select
call:
R
data_subset <- surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight)
data_subset %>%
select(-sex)
OUTPUT
#> # Source: SQL [?? x 2]
#> # Database: sqlite 3.41.2 [/home/runner/work/R-ecology-lesson/R-ecology-lesson/site/built/data_raw/portal_mammals.sqlite]
#> species_id weight
#> <chr> <int>
#> 1 PF 4
#> 2 PF 4
#> 3 PF 4
#> 4 PF 4
#> 5 PF 4
#> 6 RM 4
#> 7 RM 4
#> 8 RM 4
#> 9 RM 4
#> 10 RM 4
#> # ℹ more rows
Just like the first select(species_id, sex, weight)
call, the select(-sex)
command is not executed by R. It is
sent to the database instead. Only the final result is
retrieved and displayed to you.
Of course, we could always add on more steps, e.g., we could filter
by species_id
or minimum weight
. That’s why R
doesn’t retrieve the full set of results - instead it only retrieves the
first 10 results from the database by default. (After all, you might
want to add an additional step and get the database to do more
work…)
To instruct R to stop being lazy, i.e. to retrieve all of the query results from the database, we add the collect() command to our pipe. It indicates that our database query is finished: time to get the final results and load them into the R session.
R
data_subset <- surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight) %>%
collect()
Now we have all 17 rows that match our query in a
data.frame
and can continue to work with them exclusively
in R, without communicating with the database.
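The effect of collect() on nrow() can be seen in a small sketch. It uses a hypothetical four-row table in an in-memory SQLite database, and also shows one way to ask the database itself for a row count without collecting the rows.

```r
library(dplyr)

# Hypothetical toy table in a temporary in-memory database
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "surveys", data.frame(
  species_id = c("PF", "PF", "RM", "DM"),
  weight     = c(4L, 4L, 4L, 40L)
))

lazy_subset <- tbl(con, "surveys") %>%
  filter(weight < 5)

# NA: the rows are still in the database
lazy_rows <- nrow(lazy_subset)

# Ask the database to count, then fetch only the one-row answer
n_in_db <- lazy_subset %>%
  summarize(n = n()) %>%
  pull(n)

# Bring the matching rows into R; nrow() now works as usual
local_subset <- collect(lazy_subset)
local_rows <- nrow(local_subset)

DBI::dbDisconnect(con)
```

Counting inside the database is usually preferable for large tables, since only a single number travels back to R.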
Complex database queries
dplyr
enables database queries across
one or multiple database tables, using the same single- and
multiple-table verbs you encountered previously. This means you can use
the same commands regardless of whether you interact with a remote
database or local dataset! This is a really useful feature if you work
with large datasets: you can first prototype your code on a small subset
that fits into memory, and when your code is ready, you can change the
input dataset to your full database without having to change the
syntax.
On the other hand, being able to use SQL queries directly can be useful if your collaborators have already put together complex queries to prepare the dataset that you need for your analysis.
To illustrate how to use dplyr
with
these complex queries, we are going to join the plots
and
surveys
tables. The plots
table in the
database contains information about the different plots surveyed by the
researchers. To access it, we point the tbl()
command to
it:
R
plots <- tbl(mammals, "plots")
plots
OUTPUT
#> # Source: table<plots> [?? x 2]
#> # Database: sqlite 3.41.2 [/home/runner/work/R-ecology-lesson/R-ecology-lesson/site/built/data_raw/portal_mammals.sqlite]
#> plot_id plot_type
#> <int> <chr>
#> 1 1 Spectab exclosure
#> 2 2 Control
#> 3 3 Long-term Krat Exclosure
#> 4 4 Control
#> 5 5 Rodent Exclosure
#> 6 6 Short-term Krat Exclosure
#> 7 7 Rodent Exclosure
#> 8 8 Control
#> 9 9 Spectab exclosure
#> 10 10 Rodent Exclosure
#> # ℹ more rows
The plot_id
column also features in the
surveys
table:
R
surveys
OUTPUT
#> # Source: table<surveys> [?? x 9]
#> # Database: sqlite 3.41.2 [/home/runner/work/R-ecology-lesson/R-ecology-lesson/site/built/data_raw/portal_mammals.sqlite]
#> record_id month day year plot_id species_id sex hindfoot_length weight
#> <int> <int> <int> <int> <int> <chr> <chr> <int> <int>
#> 1 1 7 16 1977 2 NL M 32 NA
#> 2 2 7 16 1977 3 NL M 33 NA
#> 3 3 7 16 1977 2 DM F 37 NA
#> 4 4 7 16 1977 7 DM M 36 NA
#> 5 5 7 16 1977 3 DM M 35 NA
#> 6 6 7 16 1977 1 PF M 14 NA
#> 7 7 7 16 1977 2 PE F NA NA
#> 8 8 7 16 1977 1 DM M 37 NA
#> 9 9 7 16 1977 1 DM F 34 NA
#> 10 10 7 16 1977 6 PF F 20 NA
#> # ℹ more rows
Because plot_id
is listed in both tables, we can use it
to look up matching records, and join the two tables.
If we have two tables named x and y with a common column called “ID”, we can join them using ‘join’ functions, two of which are described and illustrated below.
inner_join() : This returns all rows from x where there are matching values in y, and all columns from x and y.
left_join() : This returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns.
In both forms of join, if there are multiple matches between x and y, all combinations of the matches are returned. For the full list of ‘join’ functions, check out the tidyverse join page.
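The two joins can be compared on a pair of small local data frames. This is a sketch with hypothetical tables x and y that share the column ID; the same verbs work on database tables.

```r
library(dplyr)

# Hypothetical tables sharing the column ID
x <- data.frame(ID = c(1, 2, 3), x_val = c("a", "b", "c"))
y <- data.frame(ID = c(1, 2, 4), y_val = c("k", "l", "m"))

inner <- inner_join(x, y, by = "ID")   # IDs 1 and 2 only
left  <- left_join(x, y, by = "ID")    # IDs 1, 2 and 3; y_val is NA for ID 3

inner
left
```

Passing by = "ID" explicitly also silences the "Joining with" message that appears when dplyr has to guess the key.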
In our example, the two tables we want to join are ‘plots’ and ‘surveys’.
For example, to extract all surveys for the first plot, which has
plot_id
1, we can do:
R
plots %>%
filter(plot_id == 1) %>%
inner_join(surveys) %>%
collect()
OUTPUT
#> Joining with `by = join_by(plot_id)`
OUTPUT
#> # A tibble: 1,995 × 10
#> plot_id plot_type record_id month day year species_id sex
#> <int> <chr> <int> <int> <int> <int> <chr> <chr>
#> 1 1 Spectab exclosure 6 7 16 1977 PF M
#> 2 1 Spectab exclosure 8 7 16 1977 DM M
#> 3 1 Spectab exclosure 9 7 16 1977 DM F
#> 4 1 Spectab exclosure 78 8 19 1977 PF M
#> 5 1 Spectab exclosure 80 8 19 1977 DS M
#> 6 1 Spectab exclosure 218 9 13 1977 PF M
#> 7 1 Spectab exclosure 222 9 13 1977 DS M
#> 8 1 Spectab exclosure 239 9 13 1977 DS M
#> 9 1 Spectab exclosure 263 10 16 1977 DM M
#> 10 1 Spectab exclosure 270 10 16 1977 DM F
#> # ℹ 1,985 more rows
#> # ℹ 2 more variables: hindfoot_length <int>, weight <int>
Important Note: Without the collect() statement, only the first 10 matching rows are returned. By adding collect(), the full set of 1,995 rows is retrieved.
Challenge
Write a query that returns the number of rodents observed in each plot in each year.
Hint: Connect to the species table and write a query that joins the species and surveys tables together to exclude all non-rodents. The query should return counts of rodents by plot and year.
Optional: Write a query in SQL that will produce the same result. You
can join multiple tables together using the following syntax where
foreign key refers to your unique id (e.g.,
species_id
):
SQL
SELECT table.col, table.col
FROM table1 JOIN table2
ON table1.key = table2.key
JOIN table3 ON table2.key = table3.key
R
## with dplyr syntax
species <- tbl(mammals, "species")
left_join(surveys, species) %>%
filter(taxa == "Rodent") %>%
group_by(taxa, year, plot_id) %>%
tally() %>%
collect()
OUTPUT
#> Joining with `by = join_by(species_id)`
R
## with SQL syntax
query <- paste("
SELECT a.year, a.plot_id, b.taxa, count(*) AS count
FROM surveys a
JOIN species b
ON a.species_id = b.species_id
AND b.taxa = 'Rodent'
GROUP BY b.taxa, a.year, a.plot_id",
sep = "" )
tbl(mammals, sql(query))
R
species <- tbl(mammals, "species")
genus_counts <- left_join(surveys, plots) %>%
left_join(species) %>%
filter(taxa == "Rodent") %>%
group_by(plot_type, genus) %>%
tally() %>%
collect()
This is useful if we are interested in estimating the number of individuals belonging to each genus found in each plot type. But what if we were interested in the number of genera found in each plot type? Using tally() gives the number of individuals; instead, we need to use n_distinct() to count the number of unique values found in a column.
R
species <- tbl(mammals, "species")
unique_genera <- left_join(surveys, plots) %>%
  left_join(species) %>%
  group_by(plot_type) %>%
  summarize(
    n_genera = n_distinct(genus)
  ) %>%
  collect()
OUTPUT
#> Joining with `by = join_by(plot_id)`
#> Joining with `by = join_by(species_id)`
n_distinct(), like the other dplyr functions we have used in this lesson, works not only on database connections but also on regular data frames.
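As a small illustration of that point, the sketch below contrasts tally() and n_distinct() on an ordinary data frame. The captures data are invented for this example and are not taken from the lesson's dataset.

```r
library(dplyr)

# Hypothetical capture records held in an ordinary (local) data frame
captures <- tibble(
  plot_type = c("Control", "Control", "Control", "Exclosure", "Exclosure"),
  genus     = c("Dipodomys", "Dipodomys", "Perognathus", "Dipodomys", "Dipodomys")
)

# tally() counts rows (individuals) per group
individuals <- captures %>%
  group_by(plot_type) %>%
  tally()

# n_distinct() counts unique values (genera) per group
genera <- captures %>%
  group_by(plot_type) %>%
  summarize(n_genera = n_distinct(genus))
```

Here the Control plots hold three individuals from two genera, while the Exclosure plots hold two individuals from a single genus.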
Creating a new SQLite database
So far, we have used a previously prepared SQLite database. But we can also use R to create a new database, e.g. from existing CSV files. Let’s recreate the mammals database that we’ve been working with, in R. First let’s download and read in the CSV files. We’ll load tidyverse to gain access to the read_csv() function.
R
download.file("https://ndownloader.figshare.com/files/3299483",
              "data_raw/species.csv")
download.file("https://ndownloader.figshare.com/files/10717177",
              "data_raw/surveys.csv")
download.file("https://ndownloader.figshare.com/files/3299474",
              "data_raw/plots.csv")
library(tidyverse)
species <- read_csv("data_raw/species.csv")
OUTPUT
#> Rows: 54 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (4): species_id, genus, species, taxa
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
R
surveys <- read_csv("data_raw/surveys.csv")
OUTPUT
#> Rows: 35549 Columns: 9
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): species_id, sex
#> dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
R
plots <- read_csv("data_raw/plots.csv")
OUTPUT
#> Rows: 24 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): plot_type
#> dbl (1): plot_id
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
You can also create a new SQLite database with dplyr by adding an argument to the same command we used above to open an existing .sqlite file. The create = TRUE argument instructs R to create a new, empty database instead.
Caution: When create = TRUE is added, any existing database at the same location is overwritten without warning.
R
my_db_file <- "data/portal-database-output.sqlite"
my_db <- src_sqlite(my_db_file, create = TRUE)
WARNING
#> Warning: `src_sqlite()` was deprecated in dplyr 1.0.0.
#> ℹ Please use `tbl()` directly with a database connection
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
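The deprecation warning suggests working with a database connection directly. A minimal sketch of that route, assuming the DBI and RSQLite packages are installed, might look like the following; db_path is a hypothetical temporary location chosen so the sketch does not touch the lesson's data folder.

```r
library(DBI)

# Hypothetical path; tempfile() keeps this sketch away from the lesson's data/ folder
db_path <- tempfile(fileext = ".sqlite")

# Open a connection to the SQLite database at db_path
con <- dbConnect(RSQLite::SQLite(), db_path)
tables_at_start <- dbListTables(con)   # character(0): no tables yet

dbDisconnect(con)
```

Once a database contains tables, a connection like con can be passed to tbl(), for example tbl(con, "surveys"), in place of an object created by src_sqlite().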
Currently, our new database is empty; it doesn’t contain any tables:
R
my_db
OUTPUT
#> src: sqlite 3.41.2 [/home/runner/work/R-ecology-lesson/R-ecology-lesson/site/built/data/portal-database-output.sqlite]
#> tbls:
To add tables, we copy the existing data frames into the database one by one:
R
copy_to(my_db, surveys)
copy_to(my_db, plots)
my_db
If you check the location of our database you’ll see that data is automatically being written to disk. R and dplyr not only provide ways to query existing databases, they also provide functionality to create your own databases from flat files!
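One detail worth checking when you rely on the tables being on disk: to the best of our knowledge, copy_to() creates temporary tables by default for database destinations, and such tables are tied to the connection that created them. The sketch below, which uses a hypothetical temporary file and an invented toy_plots data frame, passes temporary = FALSE and then reconnects to confirm the table is still there.

```r
library(dplyr)
library(DBI)

db_path <- tempfile(fileext = ".sqlite")   # hypothetical location
con <- dbConnect(RSQLite::SQLite(), db_path)

# Hypothetical stand-in for the plots data frame
toy_plots <- data.frame(
  plot_id   = 1:3,
  plot_type = c("Control", "Rodent Exclosure", "Control")
)

# temporary = FALSE asks for a permanent table rather than a connection-local one
copy_to(con, toy_plots, name = "plots", temporary = FALSE)
dbDisconnect(con)

# Reconnect to confirm the table survived the first connection
con <- dbConnect(RSQLite::SQLite(), db_path)
plots_found <- "plots" %in% dbListTables(con)
plots_back <- tbl(con, "plots") %>% collect()
dbDisconnect(con)
```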
Note: In this example, we first loaded all of the data into the R session by reading the three CSV files. Because all the data has to flow through R, this is not suitable for very large datasets.
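One possible workaround for larger files, offered here as a sketch rather than as part of the lesson's workflow, is to read the CSV in chunks and append each chunk to the database so that only a few rows are in memory at a time. The example assumes the readr, DBI, and RSQLite packages; the tiny CSV, the chunk size of 4, and the append_chunk helper are all invented for illustration.

```r
library(readr)
library(DBI)

# Hypothetical small CSV standing in for a file too large to load at once
csv_path <- tempfile(fileext = ".csv")
write_csv(data.frame(record_id = 1:10, weight = 31:40), csv_path)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Called once per chunk; appends that chunk's rows to the surveys table
append_chunk <- function(chunk, pos) {
  dbWriteTable(con, "surveys", as.data.frame(chunk), append = TRUE)
  invisible(NULL)
}

# Fixing col_types keeps every chunk's columns consistent with the table
read_csv_chunked(
  csv_path,
  callback   = SideEffectChunkCallback$new(append_chunk),
  chunk_size = 4,
  col_types  = cols(record_id = col_integer(), weight = col_integer())
)

rows_loaded <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM surveys")$n
dbDisconnect(con)
```

With a real dataset you would point csv_path at the large file, connect to a database file instead of ":memory:", and pick a much larger chunk_size.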
Note: Finally, to close the connection to the mammals database you may use DBI::dbDisconnect(mammals); this discards all pending work and frees resources, e.g. memory.