---
title: \vspace{3.5in} "A Beginner's Workshop in R ^[Adapted some code and content from Neil William's Welcome to R Workshop]"
author: "Meridith LaVelle ^[University of Georgia, malavell@uga.edu]"
date: "19 August 2022"
output:
  word_document:
    toc: yes
    toc_depth: '4'
  pdf_document:
    toc: yes
    toc_depth: '4'
  html_document:
    df_print: paged
    highlight: kate
    theme: cerulean
    toc: yes
    toc_depth: 4
    toc_float: yes
---

\newpage

```{r, echo = FALSE}
library(knitr)
opts_chunk$set(tidy.opts = list(width.cutoff = 65), tidy = TRUE)
```

# Introduction

Welcome! Today I'll be teaching you the basics of the R language as used within RStudio. Before we begin, it's important to talk about what RStudio is and why learning R matters for our various graduate programs in SPIA.

RStudio is an "integrated development environment (IDE) for R"[^1] that political scientists routinely use for statistical analysis. One of the major benefits of RStudio is that it is free, which makes it more accessible than many other data analytics software options. Learning to use R does come with a somewhat steeper learning curve, but we hope that today's workshop will help with that.

[^1]:

Why is this software important for graduate students in SPIA? Whether you're an international relations major studying human rights or an American politics student interested in law and courts, the graduate programs at SPIA are designed to help students learn how to conduct their own empirical research, which often includes quantitative data analysis. Further, the methods sequence required of many of our graduate students in SPIA focuses heavily on conducting research using statistical modelling techniques. Don't worry if you don't know what all of that means yet. If you don't, you're still in the right place.

# Downloading and installing

Before you can install RStudio, you will first need to download R.
You can download the current version of R at . Once you've reached this page, select the download for your operating system and follow the instructions given on the site.

![R Website](/Users/meridithlavelle/Desktop/Intro%20to%20R%20-%20Fall%202022/R_website.png)

Once you have R installed, you can open the program and see the following:

Working directly in R isn't always ideal. Many academics and data scientists prefer RStudio instead. To download it, go to the following website: and select the RStudio download for your operating system. Once you have installed RStudio and opened the program, you should see the following:

![What RStudio looks like when opened](/Users/meridithlavelle/Desktop/Intro%20to%20R%20-%20Fall%202022/RStudio_image.png)

# Getting started

Now that you've installed and opened RStudio, we'll talk about some of its basics. You may notice that the RStudio screen is composed of a few different boxes: the console, the environment, and Files/Plots/Packages/Help/Viewer.

## Console

The console contains what we call our "command lines". These are the lines that begin with `>`. While you can write lines of code directly here, it's not considered best practice. The console is still critical: this is where we will see the output of the code that we run using scripts (more on that later). Additionally, if you run some code and there is a problem with it, the console will return error messages, which will help you debug (i.e., fix) your code.

## Environment

Once we get started working with actual data, this section will hopefully make more sense. One feature that separates R from other statistical software is that R is an object-oriented language.
What this means is that the data you store in RStudio is stored as an object, which can belong to one of many classes (e.g., data frames, numeric, dates, characters, etc.). Once you get to the point of working with actual data, you will "call on" these objects to perform different tasks/functions. All of the objects that you work with (data frames, vectors, etc.) will be stored in the environment. This allows you to see what you have already loaded into RStudio.

## Files/Plots/Packages/Help

### Files

The files section allows you to navigate among the files within your current working directory, add new files, delete files, rename files, and perform other tasks such as setting your working directory. More on that soon.

### Plots

Once you start creating plots, which we will do later today, you'll be able to view them here.

### Packages

The packages tab allows you to install and update packages written for R. There are many packages that you will undoubtedly use when working with data in RStudio. Once you've installed a package, you can click on its name (shown in blue), which will take you to the R Documentation guide to learn more about the package and how to use it.

Note: Packages are collections of R functions, previously written code, and/or sample data that are created and stored for use in R/RStudio. Using these will make your life much easier when using R.

### Help

The help tab provides manuals, RStudio support, documentation guides, and more.

# R commands

In this section, we'll go through running some basic commands in RStudio. Earlier I mentioned that we will not be writing code into the console. Instead, we will open up a new R script, which is where you will write all of your code when working in RStudio.
To do this, you will click on the top left-hand corner of RStudio and select "R Script".

![Opening a new script](/Users/meridithlavelle/Desktop/Intro%20to%20R%20-%20Fall%202022/R%20Script_image.png)

Once you've opened a new script, RStudio will look like this:

![New script](/Users/meridithlavelle/Desktop/Intro%20to%20R%20-%20Fall%202022/New%20Script_image.png)

Before we get started with learning how to perform basic mathematical operations, there's one more critical topic to talk about: the working directory.

## Working directory

The working directory is the location on your device where RStudio is operating. This is important to know for several reasons. First, if you save an R script file (.R) and you don't know where your working directory is set, you may have trouble finding your R files. Similarly, if you're loading in data that's located in one folder and your working directory is set elsewhere on your device, you'll get error messages when you try to load your data in. R will act as if the data file doesn't exist.

You can find out where your working directory is set by running the following code:

```{r}
getwd()
```

You can see that my working directory is set to my Desktop. What if I wanted to change it? There are a couple of ways to do this. For this step, you're going to practice by changing your working directory to the folder called "Intro to R - Fall 2022". This should have been sent to you before the workshop. If you were not able to get it, you can just create a new folder called "Intro to R - Fall 2022".

**Option 1:** In option one, you write out the file path to the working directory manually in your script file.
```{r}
setwd("/Users/meridithlavelle/Desktop/Intro to R - Fall 2022")
```

Note: If you're working on a PC and are writing out the file path to change your working directory, your code will look slightly different:

```{r eval = FALSE}
setwd("C:/Users/meridithlavelle/Desktop/Intro to R - Fall 2022")
```

**Option 2:** Under the files tab, find the "Intro to R - Fall 2022" folder. Click on the folder. You should see "Home \> Desktop \> Intro to R - Fall 2022" underneath the File/Plots/Packages tabs. Once you're in the folder, you can click on the blue cog next to "Rename" and select "Set As Working Directory" from the drop-down menu.

## Saving your script files

You should save your script files often to avoid losing your work. To do this, you can either select "Save" from the File drop-down menu at the top of your screen or click the floppy disk icon at the top of your RStudio screen.

![Saving a script file](/Users/meridithlavelle/Desktop/Intro%20to%20R%20-%20Fall%202022/Saving_image.png)

## Writing code

Now you can start running some code. We'll start by writing some very basic mathematical expressions.

```{r}
1 + 1
2 * 3
10 / 3
```

The code above is pretty straightforward: it is similar to how you would type these basic expressions into a calculator. Write exactly what is seen above into your script file and press the "Run" button to execute your code. See the image below:

![Writing and executing code from a script file](/Users/meridithlavelle/Desktop/Intro%20to%20R%20-%20Fall%202022/WRiting%20code_image.png)

## Error messages

You will inevitably receive error messages when you use RStudio. What do they look like?

```{r, eval = FALSE, message= TRUE, echo = TRUE}
1 + a
```

Note: The error message that is returned should say `Error: object 'a' not found`.

Error messages will get much more complex than this, but here's another simple example of what an error message can look like.
Suppose that I want to add 1 to the product of `2 * 4`, but I forget to add the closing parenthesis:

```{r, eval = FALSE, message= TRUE, echo = TRUE}
1 + (2 * 4
```

Note: The error message that is returned should say `Error: unexpected numeric constant in: 1 + (2*4` or `Error: Incomplete expression: 1 + (2 * 4`.

You may also notice that the output on your console shows a `+` instead of the usual `>`. This indicates that an operation could not be completed because something is missing. If you add the final parenthesis and re-run the line of code, you'll get the correct answer and the `>` will return.

RStudio is also pretty good about showing you exactly which line of code contains your error. You'll likely see a red circle or a red circle with an X inside. You can even hover over these with your cursor, and RStudio will tell you what the issue is.

# Installing and loading packages

Earlier I talked a little bit about packages: what they are, where you can find installed packages in RStudio, and how you can find more information on each package via the Documentation Guides. Now I'll go over how to install packages that you may not already have and how to use them.

As a reminder, packages are collections of R functions, previously written code, and/or sample data that are created and stored for use in R/RStudio. Using these will make your life much easier when using R.

First, we'll install a package called "foreign". This package will allow us to read in file types that R isn't automatically set up to read. While there are many file types that R can read without installing additional packages (e.g. ".csv" files or data sets created in RStudio), this can be limiting. For example, you may come across some data in the future that was created in another statistical software program called Stata. Stata produces .dta files, which RStudio can't load unless "foreign" is installed.
```{r eval = FALSE}
install.packages("foreign")
```

This line of code tells RStudio to use the function `install.packages` to install `foreign`. When you install packages, this is always the syntax: `install.packages("package name")`. You have to include the quotation marks around the name of the package itself. This is very important to remember!

Once the package is installed, we have to load the package to use it. To do this, you would write the following code:

```{r eval = FALSE}
library(foreign)
```

If the package was successfully installed, loading it will simply return a blank command line in the console: `>`. Sometimes you may get warning messages or other messages about loading additional packages. These messages are standard and usually nothing to worry about.

What happens if you try to load a package that you don't have installed? You would receive output like this:

```{r eval = FALSE}
library(random)
```

The console would show an error message saying: `Error in library(random) : there is no package called 'random'`

Let's install and load a few more important packages used for loading data. The packages you'll need to install are "readxl", "readr", and "rio". These will allow you to read in Excel files, load in rectangular data structures, and guess file formats, respectively.

```{r eval = FALSE}
install.packages("readxl")
```

```{r eval = FALSE}
install.packages("readr")
```

```{r eval = FALSE}
install.packages("rio")
```

```{r include=FALSE}
library(readxl)
library(readr)
library(rio)
```

You can also install packages that are made up of data sets rather than functions.
These work the same way:

```{r eval = FALSE}
install.packages("carData")
```

Then you would load the package the same way as other packages that are geared more towards functions/programs:

```{r include=FALSE}
library(carData)
```

## Help with packages

I mentioned earlier that you can look at the `Help` tab in the bottom right corner of RStudio to get more information on packages. You can also write the following line of code to learn more:

```{r}
help(package = "carData")
```

Running the line of code above gives us the R Documentation Guide for the carData package. We can read the description file, package news, or learn about the individual data sets that make up the carData package.

You may also want to learn more about certain features within a package. For example, the carData package has a long list of "Help Pages", one for each data set. To learn more about a specific data set within carData, we can write the following:

```{r}
?WVS
```

This will pull up the R Documentation for the WVS data set included in the `carData` package. If you don't see anything pop up, you may need to click on the `Help` tab in the bottom right section of RStudio.

# Objects and object types

I talked earlier about how R is object oriented. Again, this means that you can create objects or use pre-made objects and work with them. There are many different types of objects, and this is important to understand upfront. Depending on the type of object you're working with, you may or may not be able to perform certain functions/operations on that object. Below is a list of common object types and a brief definition:

- **Scalars** - these are items that contain single values/elements
- **Vectors** (several types) - these are "one dimensional (one column and/or one row) arrays of numbers, character strings, or logical values. Single numbers, character strings, and logical values in R are treated as vector length one."[^2]; these have elements of the same type (e.g.
[3, 2, 1]); usually only one row and/or column but can be a single element
- **Numeric** - numbers
- **Characters** - words or letters; these are always written with quotation marks around them (e.g. think about when we installed packages)
- **Factors** - numbers with labels/levels
- **Logical** - object is coded as either TRUE or FALSE
- **Matrices** - two-dimensional arrays (two or more columns and/or rows) of elements, all of which are of the same class (e.g. character, numeric, logical)[^3]
- **Data frames** - "two-dimensional data tables (i.e., more than one column and/or one row), with rows defining observations and columns defining variables. Data frames are heterogenous, in the sense that some columns may be numeric, some may be factors, and some may have character or logical data"[^4]
- **Lists** - data structures that can be made up of different types of elements

[^2]: Fox and Weisberg 2011

[^3]: Fox and Weisberg 2011

[^4]: Fox and Weisberg 2011

Note: There are other types of objects, but these are some of the most common at the introductory level.

Let's look at some examples to see how the different types of objects differ:

## Numbers/numeric objects

```{r}
x <- 1
x
```

```{r}
y <- 2
```

Note: Remember earlier when we went over the environment (top right-hand corner of RStudio)? Notice that you now see `x` with a value of 1 and `y` with a value of 2 listed there.

In the code above, you assigned x a value of one and y a value of two. The syntax `<-` means that you are assigning some information (on the right-hand side of the arrow; in our case 1) to an object on the left-hand side of the arrow, x. This means that x and y are now objects on which we can perform various functions, so we don't have to keep typing 1 or 2. This might not seem like a lot right now, but using objects to store entire data sets or other information becomes crucial very quickly.

Now that you've assigned x and y numerical values, which makes x and y numeric class objects, you can perform mathematical operations with them.
```{r}
x + y
```

Let's look at a few more examples below. Using `*` allows us to perform multiplication:

```{r}
x * y
```

Using `/` allows us to divide:

```{r}
x / y
```

Using `^` allows us to raise objects to a certain power:

```{r}
y^2
```

`log()` is a function that calculates the natural log of the object within the parentheses.

```{r}
log(x)
```

Finally, `exp()` is the function that allows us to exponentiate an object.

```{r}
exp(x)
```

How can you check whether x or y is actually numeric? You can run the following:

```{r}
class(x)
```

The output in the console shows us that x is numeric. Let's continue looking at some more examples.

## Vectors

Remember, vectors are one-dimensional arrays of elements. This means you will only be working with one row or column. Let's create a vector:

```{r}
xvec <- c(1, 2, 3, 4, 5)
```

This line of code assigns a new object named `xvec` as a vector containing the elements 1, 2, 3, 4 and 5. `c()` means to "combine". You will use this very frequently when working through assignments and on your own research. To see the object and what it looks like, you can simply type the name of the object.

```{r}
xvec
```

Here's another way to create an object that looks like xvec:

```{r}
xvec2 <- seq(from = 1, to = 5, by = 1)
xvec2
```

You can see that this code produces an object that looks the same as xvec. One thing you'll learn quickly is that you can write code in several different ways to produce the same outcome. In the code above, we achieve this by using the `seq()` function, which is used to generate sequences. One way to "translate" this line of code is to say that we want to create an object, xvec2, and assign it values that are sequenced from 1 to 5 in increments of 1.

Now let's create another vector:

```{r}
yvec <- rep(1, 5)
```

What does the `rep()` function do? Let's use something you learned earlier to find out more.
```{r}
?rep()
```

If you run the code above, you can find out quickly that the function `rep()` replicates the values passed to it (in our case, the value 1). This means that instead of creating a vector with the values 1-5, our code creates an object called yvec that contains the value 1, repeated 5 times. Let's check `yvec` to see if that's the case.

```{r}
yvec
```

You have now created a few vectors. What can you do with them? First, you can perform a few mathematical operations on them and create a new vector:

```{r}
zvec_add <- xvec + yvec
zvec_add
```

The resulting vector, zvec_add, contains the values produced by adding xvec and yvec together element by element. Here's another example:

```{r}
zvec_sub <- xvec - yvec
zvec_sub
```

## Matrices

Now you'll learn some basics about creating and working with matrices. Remember, matrices are two-dimensional arrays (two or more columns and/or rows) made of elements that are of the same class (i.e., you can't have both numeric and character elements in a matrix).

```{r}
mat1 <- matrix(data = c(1, 2, 3, 4, 5, 6), nrow = 3, byrow = TRUE)
mat1
```

You probably noticed right away that this code is a little more involved than the code used to create vectors. First, I'll talk about what the code above is doing. When you created the matrix above, you created a new object (mat1) by using `<-`. This tells RStudio you want to assign whatever comes on the right-hand side to our new object. On the right-hand side of the arrow, you first call on the `matrix()` function. Inside of the `()`, you define what data you want to include within the matrix using the `c()` (combine) command. Within the `c()`, you include the elements/values that you would like to use. Next, you include the portion of code `nrow = 3`. This segment of the code tells RStudio that you want to divide the elements into 3 rows.
Finally, the `byrow = TRUE` argument tells RStudio that you want the elements (1-6) to be written sequentially across rows. In other words, you want RStudio to create a matrix where the elements are filled in, row by row, in the order you have written them.

Try changing the code so `byrow = FALSE` and see the difference. When you do this, you'll see that RStudio produces a matrix that looks different: instead of listing the elements by rows in the order you coded them, it lists them sequentially going down the columns.

Now let's make another matrix.

```{r}
mat2 <- matrix(data = seq(from = 6, to = 3.5, by = -0.5), nrow = 2, byrow = TRUE)
mat2
```

Now, let's perform a basic operation on the two matrices:

```{r}
mat1 %*% mat2
```

Note: There are rules for performing many basic operations on vectors and matrices. We won't talk about those today.

## Data frames

The final object type that we'll look at in more depth, which will be critical to the work you do in your graduate program, is the data frame (or data set). In this section you'll create a basic data frame. Most of the time, you'll be working with data that's already been created, but knowing how to create a data frame yourself is important. In this example, you'll create three different vectors with different types of information and combine them to create a data set.

```{r}
grade <- c("A", "D", "A-", "B+", "A", "A")
days_absent <- c(1, 9, 2, 3, 0, 1)
name <- c("Student 1", "Student 2", "Student 3", "Student 4", "Student 5", "Student 6")
mydata <- data.frame(name, grade, days_absent)
mydata
```

# Relational and Logical Operators

Often, we use relational operators to compare two or more expressions. For example, we may want to know if x is larger than another value.
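As a quick illustration of this kind of comparison, here is a minimal sketch. The object names `x_val`, `threshold`, and `scores` are made up for this example and are not part of the workshop materials.

```{r}
# Hypothetical values used only for this illustration
x_val <- 7
threshold <- 5

x_val > threshold    # is x_val larger than threshold?
x_val == threshold   # is x_val exactly equal to threshold?
x_val != threshold   # is x_val different from threshold?

# Relational operators also work element by element on vectors
scores <- c(2, 5, 9)
scores >= threshold
```

Each comparison returns a logical value (TRUE or FALSE), or a logical vector when one side is a vector.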
Below is a short list of commonly used relational operators:[^5]

[^5]:

![Relational Operators](/Users/meridithlavelle/Desktop/Intro%20to%20R%20-%20Fall%202022/Relational_operators.png){width="75%"}

Logical operators, on the other hand, generally combine two or more conditions and return whether a statement is true or false. For example, we may want to know if specific conditions about A and B are both true. Below is a short list of commonly used logical operators:[^6]

[^6]:

- `&` - and - returns TRUE when both conditions are met, comparing element by element - ex: `c(20, 30) & c(30, 10)`
- `&&` - and - returns TRUE when both conditions are met, but it works on a single element - ex: `if (age > 18 && age <= 25)`
- `|` - or - returns TRUE when at least one of the conditions is met, comparing element by element - ex: `c(20, 30) | c(30, 10)`
- `||` - or - same as above, but it works on a single element - ex: `if (age == 35 || age < 60)`
- `!` - not - if the condition is TRUE, the logical NOT operator returns FALSE - ex: if `age` is 18, then `!(age == 18)` returns FALSE

# Indexing

Indexing is the term we often use for the process of extracting specific elements from a data object. For example, assume you have the following vector:

```{r}
vec1 <- c(2, 4, 6, 8)
```

Now, assume that you're interested in extracting only the third element of the vector. You would do this by writing the following:

```{r}
vec1[3]
```

Indexing can get more complicated, just like all of the other topics we've discussed. In this next example, we'll use indexing on the data frame that we created earlier. If you've forgotten what the data frame object was called, you can look at the `Environment` to find out what we named it.

Now, assume that you want to index the elements from the first row of the data frame. To do this:

```{r}
mydata[1,]
```

By indexing this first row, we can see all of the data related to Student 1. Now let's index the third column.
```{r}
mydata[,3]
```

When we index the third column, the returned output shows the number of days each student was absent.

You may be interested in locating a specific element within a matrix or data frame. For example, you may want to see what grade Student 2 received.

```{r}
mydata[2,2]
```

You might also be interested in seeing how many days Students 2, 3, and 4 missed. In this case, you would use a `:` to indicate that you are interested in all rows from 2 through 4. You'll notice in the code below that I've also included `"days_absent"` in quotation marks to call on the specific variable (or column).

```{r}
mydata[2:4, "days_absent"]
```

```{r}
mydata$name
```

With data frames, the code above allows us to look at the contents of a specific column/variable by using the following syntax: `name of the data frame object + $ + the variable of interest`. This will become extremely important as you work more with data frames (i.e., data sets).

# Functions

We've already used a few functions, including `matrix()`, `data.frame()`, `seq()`, and `rep()`, to name a few. What exactly are functions? These are essentially black boxes: you enter some kind of input based on previously set parameters, and the function does all of the work for you by producing the output. Functions consist of a function name, its arguments, and the specific values that are passed into the arguments. As a general example:

```{r eval = FALSE}
function_name(argument1 = value, argument2 = value)
```

In some cases, functions have default settings. You can change these by writing out the argument when calling the function and giving it a different value. Think back to when you created your own matrices. When we wrote out the `byrow = TRUE` segment of our code, this was a case of changing a default setting.
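To make the idea of a default setting concrete, the sketch below calls `matrix()` twice with the same data, once relying on the default and once overriding it. The object names `m_default` and `m_byrow` are invented for this illustration.

```{r}
# Relying on the default: elements fill the matrix column by column
m_default <- matrix(data = 1:6, nrow = 3)
m_default

# Overriding the default: elements fill the matrix row by row
m_byrow <- matrix(data = 1:6, nrow = 3, byrow = TRUE)
m_byrow
```

Both matrices contain the same six values; only the order in which they are placed differs.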
You can see that `byrow = FALSE` is the default setting when looking at the R documentation on the `matrix()` function.

There are countless functions already built into R and even more that are part of packages, and you can also build functions yourself. For now, I'll provide a few examples of common functions in R to help get you started:[^7]

[^7]:

- as.Date() - converts a character string into a Date class object
- as.factor() - converts a data object into the class factor
- as.numeric() - converts a data object to a numeric class object
- boxplot() - creates a boxplot
- data.frame() - creates a data frame
- dim() - returns the dimensions of arrays, matrices, data frames, etc.
- dnorm() - returns the density of the normal distribution (standard normal by default)
- is.na() - returns a logical (TRUE/FALSE) vector or matrix that indicates which elements are missing
- ls() - returns the names of the objects currently loaded into the environment
- max() - returns the maximum value of a vector, column, matrix, etc.
- median() - computes the median value
- min() - returns the minimum value of a vector, column, matrix, etc.
- na.omit() - removes missing data
- nchar() - returns the number of characters (i.e., letters) in a character object
- ncol() - returns the number of columns of a matrix or data frame
- nrow() - returns the number of rows of a matrix or data frame
- plot() - produces a scatterplot or density plot
- print() - prints the data object to the RStudio console
- quantile() - produces the sample quantiles
- range() - returns the minimum and maximum of a data object
- rbind() - combines vectors, matrices, and/or data frames by row
- rbinom() - draws random numbers from a binomial distribution
- rlogis() - draws random numbers from a logistic distribution
- rnbinom() - draws random numbers from a negative binomial distribution
- rnorm() - draws random numbers from a normal distribution
- row.names() - gets or sets the row names of a data frame or matrix
- row() - returns the row indices of a matrix
- rowMeans() - produces the mean of each row of a numeric matrix or data frame
- sd() - returns the standard deviation
- set.seed() - sets the seed of the random number generator so that random draws can be reproduced
- sqrt() - computes the square root of a numeric data object
- sum() - computes the sum of a numeric input vector
- summary() - computes summary statistics of data and model objects
- t() - transposes a matrix or data frame

# Random numbers and distributions

In the methods sequence of SPIA's graduate programs, you'll learn a lot about statistical inference, as this is at the core of much of the work that we do as social scientists. I won't go into the statistics behind what we're about to do, but I want to briefly introduce you to a few lines of code that will help you get started with drawing from distributions and plotting these random draws on density plots.
```{r}
draws <- rnorm(1000, mean = 5, sd = 10)
summary(draws)
```

## Density plot

```{r}
draws <- rnorm(1000, mean = 5, sd = 10)
plot(density(draws), main = "Title of plot", xlab = "X-axis", ylab = "Y-Axis")
```

## Histograms

```{r}
draws <- rnorm(1000, mean = 5, sd = 10)
hist(draws, main = "Histogram", xlab = "X-axis", ylab = "Y-axis")
```

# Working with data sets

## Importing data from your device

In many cases, you will be working with data sets that have already been created. Unless you're building your own data from scratch, there are countless ready-made data sets that we can easily work with in RStudio. There are many ways you can work with data and data sets (e.g., merging data sets/data management, regression analysis, forecasting, etc.). Entire courses could be taught on these topics, but we're going to stick with the basics here.

To start, you need to import a data set into RStudio. There are a few ways you can do this:

**Option 1**

```{r}
setwd("~/Desktop/Intro to R - Fall 2022")
hr_conflict <- read.csv("hr_conflict.csv")
```

**Option 2**

You can also load in data by using the `Files` tab in the bottom right-hand section of the RStudio screen. This assumes your working directory is correctly set: for this workshop, it should be set to the "Intro to R - Fall 2022" folder, whether that is the folder you were sent prior to the workshop or the one you made at the beginning of the workshop.

If you were able to access the folder prior to the workshop, you should see a file called "hr_conflict.csv" in the list of contents in this folder. Click on the file and select "Import Dataset...". You should see the following screen:

![Importing Data](/Users/meridithlavelle/Desktop/Intro%20to%20R%20-%20Fall%202022/Import_image.png)

You'll notice there are a few settings you can adjust.
For now, you can just ignore these and select "Import" at the bottom right-hand corner of the screen. Immediately, you should see the data loaded in a separate tab. If you click out of that tab and want to view the data again later, you can write the following:

```{r eval = FALSE}
View(hr_conflict)
```

**Option 3**

A third option for loading in data is to select "File" at the top of your screen, select "Import Data" from the drop-down menu, and then select the data type. For our example, we would select `From Text (readr)`. You would then click on the `Browse` button in the top right-hand corner of the box, select the "hr_conflict.csv" file, and import it.

Note: This may be different on a PC.

## Importing online data

There are other ways to import data. You can also import data from a website.

```{r}
online.data <- import("http://www.jkarreth.net/files/mydata.csv")
```

Note: To use this method of importing data from online, you need to make sure that you have loaded the `rio` package.

## The data

The data set you loaded contains data related to human rights and conflict. It includes information on many countries over time: the country, year, COW code, population, GDP per capita, polity scores, CIRI human rights scores, whether the country is at war, and some other indicators on legal systems and colonial legacies.

One of the first things you can do is look at some summary statistics of the data.

```{r}
summary(hr_conflict)
```

The output produced from running the summary function on the data provides us with information on each variable: the minimum and maximum values, the mean and median values, the first and third quartile values, and the number of missing observations.

We can use some of the other functions to calculate other quantities of interest from the data.
For example, you might be interested in obtaining the standard deviation (a measure of how dispersed the values of a variable are around their mean) of a particular variable.

```{r}
sd(hr_conflict$population)
```

The output we get is NA. Why? When you ran the summary statistics, you saw that the values vary across a range, but there were 171 missing observations. The missing observations are what prevent R from calculating the standard deviation. You can look at the R Documentation to see if there's some kind of revision you can make to the code.

```{r}
?sd
```

Under usage, you should see that the function takes a value (x), which in our case is the variable `population`, but there's also a default setting of `na.rm = FALSE`. This default setting means that the calculation includes the NAs (i.e., the missing observations are not removed), so you need to change this in your code. Instead, you want the function to remove the missing observations.

```{r}
sd(hr_conflict$population, na.rm = TRUE)
```

Other common values of interest are quantiles. You can set these to be at the 5th and 95th percentiles and use the `quantile()` function to calculate these values.

```{r eval = FALSE}
quantile(hr_conflict$population, probs = c(0.05, 0.95), na.rm = TRUE)
```

Notice here how I included the `na.rm = TRUE` portion of the code. If you look at the R documentation for the `quantile()` function, you'll notice that, like `sd()`, the default setting in the function is `na.rm = FALSE`, so you'll need to change it whenever you have missing data.

# Plots and figures

There are several ways that users can create plots and figures. I'll provide some basic information on "base graphics", but the majority of this section will focus on data visualization using the increasingly popular package `ggplot2`.
One of the benefits of using R for data visualization - whether base R or ggplot2 - is that you have greater ability to control and customize your plots and figures.

## Data visualization in base R

One of the benefits of base R's graphics is that they are useful for quickly spot-checking results from the various types of analyses you perform. First, we'll start with graphing a basic density plot using random draws from a normal distribution. We'll use the `set.seed()` function to ensure that the simulation starts with the same values each time. This is important so we can replicate our results.

```{r}
set.seed(321)
dist1 <- rnorm(n = 1000, mean = 0, sd = 1)
set.seed(321)
dist2 <- rnorm(n = 1000, mean = 0, sd = 2)
# Draw the first density, then add the second on top in red
plot(density(dist1))
lines(density(dist2), col = "red")
```

### Scatter plots

```{r}
set.seed(321)
x <- 1:50
y <- rnorm(n = 50, mean = 0, sd = 2)
plot(x, y, pch = 1, col = "blue")
```

In the code for the scatter plot, you'll notice a couple of new attributes. The first two elements are the x and y coordinates to plot. Next, you'll notice `pch`. This attribute refers to the type of point used on the scatter plot. You can change the number and see the differences in plot points. Finally, `col` refers to the color of the points. For this attribute, you can either write out the color that you want (like in the example above), or you can use a number that's programmed to be a certain color. Note: if you write out the color, it needs to be in quotation marks.

### Box plots

```{r}
set.seed(123)
x1 <- rnorm(100)
boxplot(x1, col = "grey", main = "Box plot")
```

For the box plot example, you may notice that I included the segment of code `main = "Box plot"`. This is how you add a title to base R plots. Additionally, you can change the default x and y axis labels.

```{r}
set.seed(123)
x1 <- rnorm(100)
boxplot(x1, col = "grey", main = "Box plot", xlab = "X", ylab = "Y")
```

### Exporting base R plots

Finally, you may want to export your data visualization.
To do this in base R, you can use the `pdf()` function.

```{r}
set.seed(321)
dist1 <- rnorm(n = 1000, mean = 0, sd = 1)
set.seed(321)
dist2 <- rnorm(n = 1000, mean = 0, sd = 2)
# Open a PDF file, draw the plot into it, then close the file
pdf("density_plot.pdf", width = 5, height = 5)
plot(density(dist1))
lines(density(dist2), col = "red")
dev.off()
```

You'll notice two new lines of code added to our first density plot example. The `pdf()` function is what tells RStudio to save a pdf version of our density plot. The first element within the `pdf()` function is what we want to call the pdf file that we're saving. The next two elements give the width and height of the saved plot in inches. `dev.off()`, the final line of code, closes the newly saved file. The pdf file of your density plot is saved to your current working directory.

## Data Visualization with ggplot2

Before you can get started with using ggplot, you need to install and load the package.

```{r}
#install.packages("ggplot2")
library(ggplot2)
library(reshape)
```

For this section, you'll use data from the `gapminder` package. You may need to install this first. Make sure to also load the gapminder package once you've installed it.

```{r}
#install.packages("gapminder")
library(gapminder)
# Copy the packaged data set into your environment
gapminder <- gapminder
```

Now that you've got the `gapminder` data loaded, let's run a quick summary to learn a little more about the data.

```{r eval=FALSE}
summary(gapminder)
```

We can tell pretty quickly that this data set contains information on life expectancy for populations of countries over time. We also have data on the population size and GDP per capita for each country. To do data visualization using `ggplot2`, the syntax is slightly different from base R. We'll start with a basic example.

```{r}
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
```

From this example, you can see what life expectancy is given the level of GDP per capita.
In terms of the code, I'll break down what is included: first you want to call on the `ggplot()` function. Inside the function, you'll first include the data set object (`gapminder`), then you'll include an `aes()` (short for aesthetic) parameter. Within the aesthetic parameter, you'll include your x and y variables of interest for ggplot2 to map (`gdpPercap` and `lifeExp`). Once those parentheses are closed, you'll include a `+` symbol. On the next line, you'll include the line of code `geom_point()`. This tells ggplot2 that you want to draw a scatter plot.

There are several ways that you can customize plots from `ggplot2`. For example, you may want to display the same plot from above, but with the x-axis (GDP per capita) on a log scale.

```{r}
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()
```

Like in base R, you can also customize your plots in `ggplot2` with different colors.

```{r}
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point()
```

In this example, rather than selecting a certain color following the `color` attribute, I included a variable. By putting `continent` after the color attribute, we can now learn more about the relationships among GDP, life expectancy, and continent. `ggplot2` differentiates each continent with different colors automatically. We can learn even more by including additional attributes in the `aes()` portion of the `ggplot()` function. For example, we may want to size each point by population:

```{r}
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  geom_point()
```

Another thing we can do with the `ggplot2` package for data visualization is to "facet". This means that we can split a plot into multiple, disaggregated plots. In the code for the faceted plot, notice the line `facet_wrap(~continent)`. The `~` (along with the `facet_wrap()` function) tells ggplot2 to divide up the plot by the variable provided.
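
Before running the faceted plot, it may help to see that `~continent` is an ordinary one-sided R formula rather than anything special to `ggplot2`. The sketch below uses only base R; the object name `facet_formula` is made up for illustration.

```{r}
# A one-sided formula: nothing to the left of ~, one variable to the right
facet_formula <- ~continent

class(facet_formula)     # the class of the object
all.vars(facet_formula)  # the variable name(s) the formula refers to
```

`facet_wrap()` reads the variable name from the formula and draws one panel for each value of that variable.
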
```{r}
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~continent)
```

### Bar plots

Scatter plots are only one of the many types of data visualization users can create using the `ggplot2` package. Below is an example of how we can create bar plots. This is done by swapping out the line `geom_point()` with `geom_col()`.

```{r}
ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_col()
```

Like in the scatter plot example, we can add colors to the bars. For bars, the `fill` attribute colors the inside of each bar, while `color` only changes the outline.

```{r}
ggplot(gapminder, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_col()
```

### Density functions

Going back to the example in the base R section, you can also draw density plots using `ggplot`.

```{r}
set.seed(321)
dist1 <- rnorm(n = 1000, mean = 0, sd = 1)
set.seed(321)
dist2 <- rnorm(n = 1000, mean = 0, sd = 2)
# ggplot() needs a data frame, so combine the two vectors into one
dist.data <- data.frame(dist1, dist2)
ggplot(dist.data, aes(dist1)) +
  geom_density()
```

Notice in the code above where I made a new object called `dist.data`. Remember, because `ggplot` requires users to include the name of a data frame object in the first position of the function, we need to make sure we've created a data frame object.

### Other plot types and functions

Below are some of the more common types of ggplots. To use these, you can swap out the second line in each of the examples (i.e., swap out `geom_col()` or `geom_point()`, etc.):

- `geom_bar` - creates bar charts
- `geom_histogram` - creates a histogram
- `geom_boxplot` - creates a boxplot
- `geom_smooth` - creates smoothed conditional means

Finally, you might be wondering how you can add titles and change the names of your axes. To do this, we'll add additional lines to tell ggplot to carry out these changes in our plots.
```{r}
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~continent) +
  xlab("GDP per capita") +
  ylab("Life Expectancy") +
  ggtitle("GDP's effect on life expectancy by continent")
```

For more information on the `ggplot2` package, see the package's R Documentation Guide.

### Exporting plots in ggplot2

In ggplot2, you can use the `ggsave()` function to save a plot. The easiest way to do this is to turn your plot into an object, then pass that object to `ggsave()`.

```{r}
final.plot <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~continent) +
  xlab("GDP per capita") +
  ylab("Life Expectancy") +
  ggtitle("GDP's Effect on Life Expectancy by Continent")
# Name the plot object explicitly so the intended plot is the one saved
ggsave("final.plot.pdf", plot = final.plot)
```

# Conventions and workflow

## Conventions

There are some other topics that are worth addressing as we finish up. First I'll go over some suggestions for creating an efficient workflow and other common conventions for using RStudio. There are several online and physical resources out there that offer suggestions for good practices in writing clean, easy-to-read code in RStudio. I've included a list below of some "best practices" for using RStudio:

- Begin each RStudio session with `rm(list = ls())` as your first line of code. This will clear out your environment, allowing you to start a new session.
- Also early in your R script, you should specify your working directory.
- Avoid typing commands directly into the R command line (R) or the console (RStudio); use a script file instead. You can save your script files to refer back to later, but you may lose your work if you work directly within the console.
- Save your script files frequently!
- Avoid using the `attach()` command.
- Create simple names for your variables/objects.
- Be consistent in how you write your code.
Here's a good [style guide for beginners](http://adv-r.had.co.nz/Style.html).
- Try to keep your code to 80 characters per line, max.
- Commenting: we didn't talk about this, but if you want to leave notes in your code, you can use a `#`. Any text or code that follows the `#` on the same line will not run when you tell RStudio to run that line of code.

Here are some additional resources to learn more about best practices in writing clean and efficient R code:

-
-
- Long, J. Scott. 2009. *The Workflow of Data Analysis Using Stata*. Stata Press. [^8]

[^8]: Conventions should be similar even though the book is written for Stata, which is a different statistical analysis software with a different language.

As you use RStudio more, you'll learn about more functions and packages that will expand what you can do with it. Some of these packages will likely include `dplyr`, `tidyr`, `lubridate`, `rmarkdown`, `shiny`, and many more. RStudio has several "cheatsheets", which are available [here](https://www.rstudio.com/resources/cheatsheets/).

## Workflow

As you work more with R, you'll learn about a few options in RStudio that improve your workflow and help you stay organized. Two ways to do this are through RStudio Projects and RMarkdown.

### RStudio projects

When you create an RStudio Project, you can start from a new directory, work from an existing directory, or use version control. While you don't have to use any of these features, they help users stay organized with their R scripts (and other associated files). Below are a couple of useful links to learn more about RStudio Projects:

-
-

You'll find that everyone develops their own preferences when writing code, regardless of platform. With RStudio, many people don't like the `setwd()` approach that we talked about today and opt for RStudio Projects instead. Both methods work well, and which one to use boils down to preference. One of the resources above puts it well: your project will likely dictate the needs of your workflow.
If you're working alone on a short script, `setwd()` may suffice, but if you're working on a collaborative project, then RStudio Projects may be preferable. I encourage you to try both and see what works best for you.

### RMarkdown

Another option for workflow is RMarkdown. This approach is incredibly useful for project management, collaboration (especially via GitHub), and replication. RMarkdown allows users to combine data analysis and writing in a single document. I've included a couple of resources below to learn more about RMarkdown. One example of using RMarkdown is the document I've been using for the workshop. It's an incredibly useful and versatile way to improve your workflow, and one that I highly recommend! Sweave and knitr are also recommended solutions for workflow.

-
-

# Additional resources

- [Common error messages](https://blog.revolutionanalytics.com/2015/03/the-most-common-r-error-messages.html)
- [WeAreRLadies (Twitter Community)](https://twitter.com/WeAreRLadies)
- [R for Data Science (R4DS) Online Text Book](https://r4ds.had.co.nz/)
- [R4DS (Twitter Community - based on the book)](https://twitter.com/R4DScommunity)
- [R4DS (Slack Channel - based on the book)](https://rfordatascience.slack.com/join/shared_invite/zt-n46lijeb-2RRzQ70U34eH530~PyZsmg#/shared-invite/email)
- [R Markdown Cheat Sheets](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf)
-
- Fox and Weisberg, *An R Companion to Applied Regression* (2011, print).
- This website offers well-explained computer code to complete most of the data analysis tasks we use in this workshop.
- Website that focuses on the use of R in Political Science