Working with Data

Joshua F. Wiley

2020-03-24

1 Intro

Monash has asked that we watch this video.

Core ideas and code are in the main document. Extra information that may help understanding but is not the big picture will go in margin notes, like this. Remember that if you use GitHub Desktop, you should already have the Rmarkdown file on your computer (see Week 1).

To start off the lecture, please make sure that you:

Have downloaded the Rmarkdown file for the lecture here: https://jwiley.github.io/MonashHonoursStatistics/WorkData.rmd. Make sure that you have RStudio open, have opened the project you are using for this class, and have the Rmarkdown file for this lecture open on your own laptop (or are following along with someone else).

2 Basic R Use

As a brief refresher, we’ll start just by using R as a basic calculator.

3 + 4 # addition
## [1] 7
2 * 6 # multiplication
## [1] 12
4 / 2 # division
## [1] 2
3^2 # powers [3 squared]
## [1] 9

Objects in R can be almost anything. They can be a single number, a vector or set of numbers, the results from a regression, a graph, or a dataset. You can have many objects at the same time in your workspace, which can be a useful way to save results or store something for later use. In R, many things you do with an individual number can be done with a vector (a set of numbers). Here’s a quick refresher on creating vectors using c() and on storing them in an object; here I use x.

Once you have created an object, it should show up in RStudio in the “Environment” pane. In addition, we can view or operate on all the elements of an object by referring to it by name.

c(1, 5, 4) ## create a vector
## [1] 1 5 4
x <- c(1, 3, 5) ## assign vector to x

x ## view x
## [1] 1 3 5
x + 2 ## add 2 to every element of x
## [1] 3 5 7

Computers do not work quite like people or regular math. Usually, numbers are represented using floating-point arithmetic (https://en.wikipedia.org/wiki/Floating-point_arithmetic), often in a binary representation, as much of the computing world is based on bits stored as 0s and 1s. For example, in binary, 1/5 cannot be represented exactly. An implication of this is that sometimes numbers are not represented in R quite like you might think. For example, 1/5 == 0.2000000000000001 evaluates to FALSE, but 1/5 == 0.20000000000000001 evaluates to TRUE: the default precision for floating points in R cannot distinguish those two numbers. Finally, here are a few quick tips on formatting R output. If we have a number with many decimal places, we might not always need all that precision.
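These limits are easy to demonstrate directly in base R; all.equal() is the usual way to compare numbers within a tolerance rather than exactly:

```r
## comparisons at the limits of double precision
1/5 == 0.2000000000000001   ## differs by a few ulp: FALSE
## [1] FALSE
1/5 == 0.20000000000000001  ## rounds to the same double: TRUE
## [1] TRUE
## all.equal() compares within a tolerance (about 1.5e-8 by default)
all.equal(1/5, 0.2000000000000001)
## [1] TRUE
```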

You can use R to round for you, using the round() function. You can also use the options() function to control the default options for how R prints output. This is helpful if you want everything printed to your screen to be rounded more. One difference is that options() actually uses significant figures, so you will not get a fixed number of decimal places; rather, you get a fixed number of significant figures.

The round() function has two main arguments (i.e., pieces of input it needs from you). The first is the number or vector to be rounded; it can be a single number or a whole vector of numbers. The second is how many digits you want R to use in the rounding. The result may not always print the number of digits you requested, which can be confusing. This happens when the final digit rounds to 0 (e.g., 1.199 rounds to 1.20 and the trailing 0 is dropped, so 1.2 is printed).

1.214294254
## [1] 1.214294
round(1.214294254, digits = 2)
## [1] 1.21
options(digits = 2) ## set default for the R session

1.214294254
## [1] 1.2

3 Working with Data

To continue on, we’re going to open a few of the packages that you should have already installed from Week 1.

library(data.table)
## data.table 1.12.8 using 24 threads (see ?getDTthreads).  Latest news: r-datatable.com
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
## 
##     dcast, melt
library(JWileymisc)
library(extraoperators)

Next, we are going to load some sample data. This is simulated data based on a 12-day study where people reported a few measures at baseline along with stress and affect three times per day. After loading it, we convert it to a data.table format, using the as.data.table() function so that we can use many of the useful tools from the data.table package for data management.

data(aces_daily)
d <- as.data.table(aces_daily)

3.1 Operators

In software, logical values and logical operators are often used to refer to things that return a boolean value. Booleans are one of two values, conventionally: TRUE or FALSE. By convention if you do any arithmetic on them, TRUE is treated as 1 and FALSE is treated as 0.
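Because TRUE counts as 1 and FALSE as 0, arithmetic on logical vectors is a handy trick: sum() counts how many tests passed and mean() gives the proportion. A small base R illustration:

```r
vals <- c(1, 5, 10, 15)
vals > 4         ## a logical vector of TRUE / FALSE
## [1] FALSE  TRUE  TRUE  TRUE
sum(vals > 4)    ## how many values exceed 4
## [1] 3
mean(vals > 4)   ## proportion of values exceeding 4
## [1] 0.75
```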

A lot of data management involves using logical operators. An operator takes data on its left-hand side and a value or set of values on its right-hand side. Logical operators return TRUE or FALSE.

We use logical operators when working with data for lots of reasons. You might want to find outliers, values that are greater than or less than a specific score, check if values fall within the expected range, recode continuous variables into categorical variables (low, medium, high, etc.). Much of this work is underpinned by logical operators. Here is a list of some operators from R and expanded by the extraoperators package.

Operator         What it does
<-               assign whatever is on the right to the left in R
:=               assign the right hand side to the left in data.table
==               are values / vectors equal
!=               are values / vectors NOT equal
<  OR %l%        less than
<= OR %le%       less than or equal
>  OR %g%        greater than
>= OR %ge%       greater than or equal
%gl%             greater than AND less than
%gel%            greater than or equal AND less than
%gle%            greater than AND less than or equal
%gele%           greater than or equal AND less than or equal
%in%             in
%!in% OR %nin%   not in
%c%              chain operations on the RHS together
%e%              set operator, to use set notation
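A few of these in action on a small, made-up vector; %gele% comes from the extraoperators package (loaded below), while the rest are base R:

```r
library(extraoperators)

ages <- c(18, 19, 20, 25, 30)
ages < 20                ## less than
## [1]  TRUE  TRUE FALSE FALSE FALSE
ages %in% c(18, 25)      ## in a set of values
## [1]  TRUE FALSE FALSE FALSE  TRUE
ages %gele% c(20, 25)    ## >= 20 AND <= 25
## [1] FALSE FALSE  TRUE  TRUE FALSE
ages >= 20 & ages <= 25  ## base R equivalent of %gele%
## [1] FALSE FALSE  TRUE  TRUE FALSE
```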

3.2 Subsetting Data

A common task in analyses is to subset the data. There are many reasons for this, from excluding outliers to selecting only participants who meet some criteria you want to use. Sometimes, the order in which you subset matters too.

Most often, we subset data by using logical operators to pick specific rows of a dataset or specific values from a single variable. Below, we select cases/observations where Age is 18. This is done using the logical operator ==: R tests whether the variable “Age” equals 18, returning TRUE if it does and FALSE otherwise, and rows where the test returns TRUE are kept. So we don’t see a giant dataset, we just pick two variables: UserID and Age.

d[Age == 18, .(UserID, Age)]
##      UserID Age
##   1:     20  18
##   2:     20  18
##   3:     20  18
##   4:     20  18
##   5:     20  18
##  ---           
## 343:    163  18
## 344:    163  18
## 345:    163  18
## 346:    163  18
## 347:    163  18

Suppose we decide that under 20y or above 25y are extreme ages for our data. We might want to run analyses on data excluding people outside that range. We could create a copy of the dataset, d.noout, with “outliers” removed, where we only include cases that are between 20 and 25y. Here we use %gele% to capture people greater than or equal to 20y and less than or equal to 25y.

d.noout <- d[Age %gele% c(20, 25), .(UserID, Age)]

## frequency table for ages
table(d.noout$Age)
## 
##   20   21   22   23   24   25 
##  889 1278  821  583  396 1036
## view a few rows of the data
head(d.noout)
##    UserID Age
## 1:      1  21
## 2:      1  21
## 3:      1  21
## 4:      1  21
## 5:      1  21
## 6:      1  21

Sometimes, a single variable is not sufficient. You can chain conditions together using the operators & and |, which represent “and” and “or”, respectively.

## select only women who are 18y
d[Age == 18 & Female == 1, .(UserID, Age)]
##      UserID Age
##   1:     20  18
##   2:     20  18
##   3:     20  18
##   4:     20  18
##   5:     20  18
##  ---           
## 191:    163  18
## 192:    163  18
## 193:    163  18
## 194:    163  18
## 195:    163  18
## select anyone who is either a woman OR 18y
d[Age == 18 | Female == 1, .(UserID, Age)]
##       UserID Age
##    1:      2  23
##    2:      2  23
##    3:      2  23
##    4:      2  23
##    5:      2  23
##   ---           
## 4042:    191  21
## 4043:    191  21
## 4044:    191  21
## 4045:    191  21
## 4046:    191  21

If you want to select several values, the %in% operator is helpful. It tests whether the variable/data on the left is in the set on the right hand side. This is much faster than writing many “or”s.

## select anyone whose age is in 18, 19, or 20
d[Age %in% c(18, 19, 20), .(UserID, Age)]
##       UserID Age
##    1:     11  20
##    2:     11  20
##    3:     11  20
##    4:     11  20
##    5:     11  20
##   ---           
## 2140:    189  19
## 2141:    189  19
## 2142:    189  19
## 2143:    189  19
## 2144:    189  19

You can chain even more logical requirements together by using parentheses. Operations occur within parentheses first.

## 19 year old women or 18 year old men
d[(Age == 19 & Female == 1) | (Age == 18 & Female == 0),
  .(UserID, Age)]
##      UserID Age
##   1:     24  18
##   2:     24  18
##   3:     24  18
##   4:     24  18
##   5:     24  18
##  ---           
## 684:    189  19
## 685:    189  19
## 686:    189  19
## 687:    189  19
## 688:    189  19

Other common operators are greater than or less than, > and < or greater than or equal to and less than or equal to, >= and <=.

## anyone under age 20
d[Age < 20, .(UserID, Age)]
##       UserID Age
##    1:     20  18
##    2:     20  18
##    3:     20  18
##    4:     20  18
##    5:     20  18
##   ---           
## 1251:    189  19
## 1252:    189  19
## 1253:    189  19
## 1254:    189  19
## 1255:    189  19
## anyone age 20 or under
d[Age <= 20, .(UserID, Age)]
##       UserID Age
##    1:     11  20
##    2:     11  20
##    3:     11  20
##    4:     11  20
##    5:     11  20
##   ---           
## 2140:    189  19
## 2141:    189  19
## 2142:    189  19
## 2143:    189  19
## 2144:    189  19
## anyone who is not age 20
d[Age != 20, .(UserID, Age)]
##       UserID Age
##    1:      1  21
##    2:      1  21
##    3:      1  21
##    4:      1  21
##    5:      1  21
##   ---           
## 5486:    191  21
## 5487:    191  21
## 5488:    191  21
## 5489:    191  21
## 5490:    191  21
## anyone who is not age 18, 19, or 20
d[Age %!in% c(18, 19, 20), .(UserID, Age)]
##       UserID Age
##    1:      1  21
##    2:      1  21
##    3:      1  21
##    4:      1  21
##    5:      1  21
##   ---           
## 4451:    191  21
## 4452:    191  21
## 4453:    191  21
## 4454:    191  21
## 4455:    191  21

Finally, let us see why the order of operations may matter. Suppose that we want to look at people who have at least 30 non-missing stress values and look at just observations where STRESS > 3. First, we’ll remove cases that are missing on STRESS. We do this using the is.na() function, which returns TRUE if a value is NA (NA in R means not available, missing) and FALSE if a value is not missing. Then we negate that using ! to get “not missing”.
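As a tiny illustration of is.na() and its negation with ! before applying them to the dataset (using a made-up vector):

```r
vec <- c(2, NA, 5)
is.na(vec)     ## TRUE where missing
## [1] FALSE  TRUE FALSE
!is.na(vec)    ## TRUE where NOT missing
## [1]  TRUE FALSE  TRUE
vec[!is.na(vec)] ## keep only the non-missing values
## [1] 2 5
```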

## remove missing stress observations
d2 <- d[!is.na(STRESS)]

Next, we can create a new variable in the dataset, which we’ll call “Count” and we’ll assign this the number of observations that occur by ID.

In R, data.tables have three main parts: DT[i, j, by]. DT is the name of the data table. The i represents which row(s) we want to select; if we leave it blank, all rows are selected. After the first comma, the j represents columns/variables: selecting only certain columns/variables to display, or creating / modifying a column or variable. The last part, the by, represents a grouping variable or some way of organising our operations; for example, we might want to perform the same operation for each ID in the dataset. Lastly, if there are NO commas, that means: give me the rows/cases that match my criteria, but give me all columns/variables.

Here we take all cases (we are not using any subsetting yet) and then create a new variable, Count, using the assignment operator in data.table, :=, which means: assign to the thing on the left, in this case a new variable, whatever the value on the right is. Here the value on the right is .N, which means the “n” or count / sample size. Finally, we have data.table do this by UserID, so that rather than the sample size of the entire dataset, we get the number of surveys completed by each ID or person.

d2[, Count := .N, by = UserID]
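To see exactly what := and .N do, here is the same operation on a small, made-up data.table (the values are purely illustrative):

```r
library(data.table)

toy <- data.table(
  UserID = c("a", "a", "a", "b"),
  STRESS = c(2, 4, 1, 3))

## count rows per person and store the result in a new column
toy[, Count := .N, by = UserID]
toy
##    UserID STRESS Count
## 1:      a      2     3
## 2:      a      4     3
## 3:      a      1     3
## 4:      b      3     1
```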

Now we can take people who have at least 30 non-missing stress values, and look at high stress scores.

d2 <- d2[Count >= 30]

d2 <- d2[STRESS > 3]

d2[, .(UserID, STRESS)] ## IDs and number of obs
##       UserID STRESS
##    1:      1      5
##    2:      1      7
##    3:      1      4
##    4:      1      4
##    5:      1      4
##   ---              
## 1608:    191      5
## 1609:    191      4
## 1610:    191      4
## 1611:    191      4
## 1612:    191      4

This whole sequence could be collapsed or chained together to write less code. With data.tables in R, you can chain operations simply by adding more brackets, as below. This accomplishes the same thing as the above.

d2 <- d[!is.na(STRESS)][, Count := .N, by = UserID][
  Count >= 30][STRESS > 3]

d2[, .(UserID, STRESS)] ## IDs and number of obs
##       UserID STRESS
##    1:      1      5
##    2:      1      7
##    3:      1      4
##    4:      1      4
##    5:      1      4
##   ---              
## 1608:    191      5
## 1609:    191      4
## 1610:    191      4
## 1611:    191      4
## 1612:    191      4

Here, the order is: first, remove missing stress observations; second, count how many non-missing observations each ID has; third, keep only people with at least 30; fourth, keep only surveys/rows where stress > 3. Now look what happens if we instead take stress values greater than 3 before counting. That is, we change the order of the operations. We end up with far fewer people.

d2 <- d[!is.na(STRESS)][STRESS > 3][, 
        Count := .N, by = UserID][Count >= 30]

d2[, .(UserID, STRESS)] ## IDs and number of obs
##      UserID STRESS
##   1:     69      6
##   2:     69      8
##   3:     69      7
##   4:     69      8
##   5:     69     10
##  ---              
## 159:    157      6
## 160:    157      7
## 161:    157      5
## 162:    157      8
## 163:    157      6

In this new format, we only keep people with at least 30 non-missing stress values that are > 3, whereas before we kept people with at least 30 non-missing stress values (of any value) and then took only observations where stress > 3.

3.3 You Try It - Subsetting

Select observations where:

Use the template below filling in values where appropriate. You should end up with 2,455 observations.

d3 <- d[!is.na(  )][, Count := .N, by = UserID][   ][   ]

d3[, .(UserID, PosAff)] ## IDs and number of obs

3.4 Finding Specific Cases or Outliers

Suppose that based on graphs or other information, we had decided that negative affect scores above 4 were outliers and that participant 56 was an outlier overall. We can exclude ID 56 and select only observations with negative affect at or below 4 as below.

d[UserID != 56 & NegAff <= 4, 
  .(UserID, NegAff)]
##       UserID NegAff
##    1:      1    1.7
##    2:      1    1.0
##    3:      1    1.4
##    4:      1    1.0
##    5:      1    1.7
##   ---              
## 6285:    191    1.8
## 6286:    191    1.7
## 6287:    191    1.2
## 6288:    191    1.2
## 6289:    191    1.2

4 Data Types

There are many types of data. In R, you can find what type of data a particular variable is by asking for its class().

class(d$Age)
## [1] "numeric"
class(d$BornAUS)
## [1] "integer"
class(names(d))
## [1] "character"

Although these are the basic building blocks of data, there also are special formats that can be built off of these, often off of numeric type data.

5 Date Variables

Often when dates are read into R, they are character data. We will work with some different character formats and convert them to date class in R.

## sample character data
x1 <- c("2019-03-12", "2017-09-15")
x2 <- c("2019-Mar-12", "2017-Sep-15")
x3 <- c("12/3/2019", "15/9/2017")
x4 <- c("12/3/19", "15/9/17")

# these are character strings
class(x1)
## [1] "character"
## convert character strings to Date classes
as.Date(x1, format = "%Y-%m-%d")
## [1] "2019-03-12" "2017-09-15"
as.Date(x2, format = "%Y-%b-%d")
## [1] "2019-03-12" "2017-09-15"
as.Date(x3, format = "%d/%m/%Y")
## [1] "2019-03-12" "2017-09-15"
as.Date(x4, format = "%d/%m/%y")
## [1] "2019-03-12" "2017-09-15"
## save results
d1 <- as.Date(x1, format = "%Y-%m-%d")

## look at the class once converted
class(d1)
## [1] "Date"
## arithmetic on dates
d1 + 5 ## addition
## [1] "2019-03-17" "2017-09-20"
d1 - 7 ## subtract a week
## [1] "2019-03-05" "2017-09-08"
d1 * 5 ## multiplication does not work: error
## Error in Ops.Date(d1, 5): * not defined for "Date" objects
## extract just one date
d1[1]
## [1] "2019-03-12"
d1[2]
## [1] "2017-09-15"
## compare two dates
d1[1] - d1[2]
## Time difference of 543 days
## compare two dates and convert to a number
as.numeric(d1[1] - d1[2])
## [1] 543
## compare two dates and convert to a number
## and then to years
as.numeric(d1[1] - d1[2])/365.25
## [1] 1.5

To convert dates into the Date class in R, we need to tell R the format the dates are written in. This is done by specifying a “template”.

These pieces can be re-ordered and combined and can use any separator. For example, “%Y-%m-%d” for YYYY-mm-dd, “%Y/%m/%d” for YYYY/mm/dd, “%Y %m %d” for YYYY mm dd, or “%d %m-%Y” for dd mm-YYYY. This flexibility is needed because different people and countries commonly write dates in many different formats.
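For example, here is the same date parsed from three different written forms, each matched by its own template:

```r
## same date, three different written formats
as.Date("2019 03 12", format = "%Y %m %d")
## [1] "2019-03-12"
as.Date("12 03-2019", format = "%d %m-%Y")
## [1] "2019-03-12"
as.Date("2019/03/12", format = "%Y/%m/%d")
## [1] "2019-03-12"
```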

5.1 You Try It - Dates

Now you try to convert the following character strings into date class objects in R by filling in the correct template in quotes for the format = argument. If you do it incorrectly, you will get errors or NA, indicating the text could not be converted to a date using the template you supplied.

y1 <- c("5/28/2018", "6/30/2019")
as.Date(y1, format = "") ## you complete

y2 <- c("28 01 2018", "15 06 2019")
dy2 <- as.Date(y2, format = "") ## you complete
dy2 ## show results

y3 <- c("28-1-18", "15-6-19")
as.Date(y3, format = "") ## you complete

## how many days are there between 
## 28 January 2018 and 15 June 2019?
## use the object dy2 to complete like 
## we did above

5.2 Date Helper Functions

Dates have a variety of helpful functions that let you operate on or learn more about them. We can use the weekdays(), months(), and quarters() functions to find out the day of the week, month, or quarter a particular date falls in. They return character strings.

weekdays(d1)
## [1] "Tuesday" "Friday"
months(d1)
## [1] "March"     "September"
quarters(d1)
## [1] "Q1" "Q3"

Dates also can be compared to determine whether one is greater than or less than another. This can be useful for selecting observations that meet specific criteria. For example, suppose we wanted to exclude any dates that occurred before the semester started. Not everyone in the daily diary study is a student, but if they were, we might expect positive affect to differ on break versus during semester. Suppose semester started 1 March 2017; we might want to include only observations that fell on or after that date. Note that here we do not need to specify a template because R will assume YYYY-mm-dd when a date is written this way.

## summary of dates
summary(d$SurveyDay)
##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "2017-02-22" "2017-03-01" "2017-03-04" "2017-03-04" "2017-03-07" "2017-03-14"
## select surveys where the survey date 
## is greater than or equal to 2017-03-01
d2 <- d[SurveyDay >= as.Date("2017-03-01")]

## summary of dates
summary(d2$SurveyDay)
##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "2017-03-01" "2017-03-03" "2017-03-06" "2017-03-06" "2017-03-09" "2017-03-14"
d2 <- d[weekdays(SurveyDay) %in% c("Monday")]

## frequency table of days of week
table(weekdays(d2$SurveyDay))
## 
## Monday 
##   1058

5.3 You Try It - Subset Dates

Now you try to create a dataset that excludes any weekends.

## complete to make a dataset that 
## excludes any weekend days
d3 <- d[ ]

## there should now be no surveys on weekends
table(weekdays(d3$SurveyDay))

6 Factor Variables

Factor variables can be made from numeric, integer, or character data, which are converted into the factor format to tell R that they are discrete variables. This is done using the factor() function.

## some sample data
x <- c(0, 1, 2, 1, 1, 2, 0)

## convert to a factor in R
## adding specific labels
y <- factor(x, 
  levels = c(1, 0, 2),
  labels =c("Medication", "Control", "Psychotherapy"))
y
## [1] Control       Medication    Psychotherapy Medication    Medication   
## [6] Psychotherapy Control      
## Levels: Medication Control Psychotherapy
## if you do not specify levels or labels
## it will assume you want the levels 
## ordered numerically / alphabetically
## and the labels should be the same as the levels

## factor with labels, sorted according to 
## the level order (1, 0, 2)
table(y)
## y
##    Medication       Control Psychotherapy 
##             3             2             2
## default factor() approach, labels based on 
## levels and sorted numerically: 0, 1, 2
table(factor(x)) ## automatic conversion to factor 
## 
## 0 1 2 
## 2 3 2

6.1 You Try It - Factors

Now try using factors for dummy coding. In the daily study, there is a variable, SurveyInteger, where 1 = morning, 2 = afternoon, and 3 = evening. Make it a factor and then create a frequency table using table().

d[, SurveyIntegerF := factor(
  SurveyInteger,
  levels = c(    ),
  labels = c(    ))]

table(d$SurveyIntegerF)

7 Merging Data

Often data come from multiple sources and are in separate files. For example, a baseline and post intervention questionnaire may be collected separately, or you may have questionnaire data and sleep data from a FitBit.

Data can be combined by merging, and merging in R is typically done using the merge() function.

Merging in R is the same as joining, which is the more common term for databases such as SQL.

There are four main types of merges in R using the merge() function. Assume we are merging two datasets: x and y.

Merging or joining always involves two datasets. Even if you need to merge more than two datasets (e.g., you have baseline, post, and follow-up data), you always merge / join them two at a time: for example, first joining baseline and post into one bp dataset, and then joining bp and your follow-up data. In R, the x dataset is always the left and the y dataset is always the right.
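The two-at-a-time pattern can be sketched with three tiny, hypothetical datasets (the names baseline, post, and followup are invented for illustration):

```r
library(data.table)

baseline <- data.table(ID = c(1, 2), Pre  = c(10, 12))
post     <- data.table(ID = c(1, 2), Post = c(8, 9))
followup <- data.table(ID = c(1, 2), FU   = c(7, 9))

## first join baseline and post into one dataset, bp
bp <- merge(x = baseline, y = post, by = "ID")

## then join bp with the follow-up data
all3 <- merge(x = bp, y = followup, by = "ID")
all3
##    ID Pre Post FU
## 1:  1  10    8  7
## 2:  2  12    9  9
```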

The type of merge performed is controlled by arguments to the merge() function.

This small, hypothetical study collected data via surveys and actigraphy. The surveys have people’s ages and the actigraphy has their sleep duration, which we wish to join together for analysis.

We will explore joins using two, small datasets. ID is used to indicate participant ID.

surveys <- data.table(
  ID = c(1, 2, 3),
  Age = c(19, 18, 20))

acti <- data.table(
  ID = c(2, 3, 4),
  Sleep = c(8, 6, 7))

## view the two small datasets
print(surveys)
##    ID Age
## 1:  1  19
## 2:  2  18
## 3:  3  20
print(acti)
##    ID Sleep
## 1:  2     8
## 2:  3     6
## 3:  4     7

7.1 Natural Join in R

Natural joins keep only the rows / observations that are present in both datasets, in this example the surveys and the actigraphy. You can pick whichever dataset you want to be x and which to be y by the order you list them.

Only rows present in both datasets.

When we write by = "ID", what is meant is that the merge should be by the variable called ID. The word ID is not special; it’s just that we happened to call the variable containing our IDs, ID. If we had called the variable containing IDs Email or Name, then we would write by = "Email" or by = "Name". In other words, match the text in quotes to an actual variable name in the data.

Joining data requires specifying how R should know which rows match each other. This is typically done by an ID variable using the code: by = "ID". The argument, all = FALSE is what tells R this should be a natural join.

merge(
  x = surveys,
  y = acti,
  by = "ID",
  all = FALSE)
##    ID Age Sleep
## 1:  2  18     8
## 2:  3  20     6

7.2 Full Outer Join

Full outer joins include cases that are present in either dataset, in this case either the surveys or the actigraphy. You can pick whichever dataset you want to be x and y by which you list first, although with a full outer join it does not make much difference, other than which columns/variables come first and which come second. As before, we must specify what variable(s) to merge by so R knows how to match rows together, accomplished using the code by = "ID". Note that R fills in missing values for rows / observations and variables that are present in one dataset but not the other. So if two datasets do not have identical rows / observations, a full outer join will result in some missing values, shown in the printout as NA.

All rows present in either dataset.

The argument, all = TRUE is what tells R this should be a full outer join. Full outer joins include all rows or observations.

merge(
  x = surveys,
  y = acti,
  by = "ID",
  all = TRUE)
##    ID Age Sleep
## 1:  1  19    NA
## 2:  2  18     8
## 3:  3  20     6
## 4:  4  NA     7

7.3 Left Outer Join

As before, we must specify what variable(s) to merge by so R knows how to match rows together, accomplished using the code by = "ID". Also as before, R fills in missing values for the variables. In this case, there are missing values for rows / observations that are present in surveys but not in acti (i.e., the variable Sleep for ID = 1).

All rows / observations that are present in left dataset (surveys).

The argument, all.x = TRUE is what tells R this should be a left outer join.

merge(
  x = surveys,
  y = acti,
  by = "ID",
  all.x = TRUE)
##    ID Age Sleep
## 1:  1  19    NA
## 2:  2  18     8
## 3:  3  20     6

7.4 Right Outer Join

As before, we must specify what variable(s) to merge by so R knows how to match rows together, accomplished using the code by = "ID". Note that R fills in missing values for the variables in surveys that are not in acti, for rows / observations that are present in acti but not in surveys (i.e., the variable Age for ID = 4).

All rows / observations that are present in right dataset (acti). The argument, all.y = TRUE is what tells R this should be a right outer join.

merge(
  x = surveys,
  y = acti,
  by = "ID",
  all.y = TRUE)
##    ID Age Sleep
## 1:  2  18     8
## 2:  3  20     6
## 3:  4  NA     7

7.5 Rows and missing after joining

Knowing how many rows you expect gives you a quick way to check whether a join is working as you want or something unexpected may be happening. In practice, joins can easily contain errors with unintended consequences, like missing data or unwanted repetitions of rows.

It is helpful to know how many rows you expect in the results.

This behaviour is sometimes referred to as cross or cartesian joining. When there are multiple matches for a row/ID in both datasets, all possible combinations are returned. In this case, rows 2 and 3 are both ID 2 in surveys2 and rows 1 and 2 are both ID 2 in acti2, so those are crossed so you have all possible pairings of rows from each dataset associated with ID 2.

In each of these two new datasets, ID 2 is repeated twice. One might anticipate that a full outer join would return 5 rows (IDs 1, 2, 2, 3, 4). R actually returns 7 rows, with four rows for ID 2, because all rows for ID 2 in surveys2 are crossed with all rows for ID 2 in acti2.

Merging with any duplicates can behave in unexpected ways. In a small dataset, it is easy to see what has happened. In a large dataset, checking whether the merge worked as expected requires using summaries or other checks, such as evaluating whether the number of rows matched what you wanted or needed.

surveys2 <- data.table(
  ID = c(1, 2, 2, 3),
  Age = c(19, 18, 18, 20))

acti2 <- data.table(
  ID = c(2, 2, 3, 4),
  Sleep = c(8, 7, 6, 7))

## view small datasets
print(surveys2)
##    ID Age
## 1:  1  19
## 2:  2  18
## 3:  2  18
## 4:  3  20
print(acti2)
##    ID Sleep
## 1:  2     8
## 2:  2     7
## 3:  3     6
## 4:  4     7
merge(surveys2, acti2, by = "ID", all = TRUE)
##    ID Age Sleep
## 1:  1  19    NA
## 2:  2  18     8
## 3:  2  18     7
## 4:  2  18     8
## 5:  2  18     7
## 6:  3  20     6
## 7:  4  NA     7

7.6 You Try It - Merging

Here are two small datasets, created in code below and also shown as tables.

ds1 <- data.table(
  ID = c("A1", "B2", "C3"),
  Stress = c(4, 5, 6))

ds2 <- data.table(
  ID = c("A1", "B2", "D4", "E5"),
  Steps = c(9524, 15303, 7432, 4035))

Dataset: ds1

ID Stress
A1 4
B2 5
C3 6

Dataset: ds2

ID Steps
A1 9524
B2 15303
D4 7432
E5 4035

Before you try joining those datasets, try to answer these questions, which are similar to what might be on the exam. The next section has an answer guide. For these questions, ds1 is the “left” dataset and ds2 is the “right” dataset and you are joining by ID.

## using the datasets ds1 and ds2, do a natural join
merge(   )
## using the datasets ds1 and ds2, do a full outer join
merge(   )
## using the datasets ds1 and ds2, do a left outer join
merge(   )
## using the datasets ds1 and ds2, do a right outer join
merge(   )

7.6.0.1 Answer Key

These are the sort of questions that would be on the exam to assess understanding of joining / merging.

The key to these is to know what each join does and to be able to predict the expected behaviour. A natural join keeps only those rows / IDs present in both datasets, in this case A1 and B2. Because a natural join only includes rows present in all datasets, you will not have any missing values on any variable, unless there were missing values to begin with. So in this case, you will have 2 rows and 0 missing values.

A full outer join will include all rows from either dataset. That means you will have IDs A1, B2, C3, D4, and E5 (5 rows). Stress values are only present for A1, B2, and C3, so you will have two missing stress values. Steps values are present for A1, B2, D4, and E5, so there will be one missing steps value. Together you will have 5 rows, and 3 missing values.
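Following the answer key, the natural join and full outer join can be checked in code (try the left and right outer joins yourself):

```r
library(data.table)

ds1 <- data.table(
  ID = c("A1", "B2", "C3"),
  Stress = c(4, 5, 6))

ds2 <- data.table(
  ID = c("A1", "B2", "D4", "E5"),
  Steps = c(9524, 15303, 7432, 4035))

## natural join: 2 rows, 0 missing values
merge(x = ds1, y = ds2, by = "ID", all = FALSE)
##    ID Stress Steps
## 1: A1      4  9524
## 2: B2      5 15303

## full outer join: 5 rows, 3 missing values
merge(x = ds1, y = ds2, by = "ID", all = TRUE)
##    ID Stress Steps
## 1: A1      4  9524
## 2: B2      5 15303
## 3: C3      6    NA
## 4: D4     NA  7432
## 5: E5     NA  4035
```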

7.7 Merging Data by Different IDs

Often data are under our control, but sometimes they are not; even when they are, ID variables may not be named consistently across datasets. As long as the actual ID values match, the variable names do not need to be the same. For example, in Qualtrics, if you create custom links, IDs may be called “RecipientFirstName”, while in the actigraphy data you may have called IDs “ID”. R can handle this easily via the arguments by.x and by.y, which give the variable names in the x and y datasets, respectively.

surveys2 <- data.table(
  FirstName = c(1, 2, 2, 3),
  Age = c(19, 18, 18, 20))

acti2 <- data.table(
  ID = c(2, 2, 3, 4),
  Sleep = c(8, 7, 6, 5))

merge(
  x = surveys2,
  y = acti2,
  by.x = "FirstName",
  by.y = "ID",
  all = TRUE)
##    FirstName Age Sleep
## 1:         1  19    NA
## 2:         2  18     8
## 3:         2  18     7
## 4:         2  18     8
## 5:         2  18     7
## 6:         3  20     6
## 7:         4  NA     5

7.8 Checking and Preparing Data for Merging

Because merging data can have unintended consequences if rows are matched incorrectly, it is often good practice to check the data in advance to make sure they are what you expect.

Generally, one-to-one merges, where each row of one dataset is matched to one and only one row of another dataset, are easiest. You can have duplicates, and these merge well in two cases: (1) the left, x, dataset has repeated measures but the right, y, dataset only has one row of data per ID, or (2) the left, x, dataset only has one row of data per ID but the right, y, dataset has repeated measures per ID. Many-to-many merges, which involve repeated IDs in both datasets being merged, are often not what is intended. Note that repeated IDs do not necessarily mean repeated measures. You can merge two repeated measures datasets without repeated IDs by using two variables as your ID variable or by creating unique IDs that combine ID and time, for example.
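Here is a sketch of case (1): a left dataset with repeated measures joined to a right dataset with one row per ID (toy data; the names daily and person are invented for illustration):

```r
library(data.table)

daily  <- data.table(ID = c(1, 1, 2, 2), Mood = c(3, 4, 5, 2))
person <- data.table(ID = c(1, 2), Age = c(19, 20))

## each person-level row is recycled across that ID's daily rows
merge(x = daily, y = person, by = "ID", all.x = TRUE)
##    ID Mood Age
## 1:  1    3  19
## 2:  1    4  19
## 3:  2    5  20
## 4:  2    2  20
```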

To check data, there are a few helpful functions in R. The anyDuplicated() function returns 0 if no values are duplicated and the position where the first duplicate occurs if any values are duplicated.

## there are duplicate IDs, starting in position 2
anyDuplicated(acti2$ID)
## [1] 2
## there are no duplicated sleep values
anyDuplicated(acti2$Sleep)
## [1] 0
## there are duplicated IDs here too
anyDuplicated(acti$ID)
## [1] 2
## however if we paste / concatenate
## both ID and time, there are no duplicates
anyDuplicated(paste0(acti$ID, acti$Time))
## [1] 0

If you do not expect duplicates, it is a good idea to check for them before merging, as R will not tell you if there were duplicates that impacted the merge.

Relatedly, if there are duplicates, you can count how many there are, using the duplicated() function and making a frequency table using table().

table(duplicated(acti2$ID))
## 
## FALSE  TRUE 
##     3     1

Another way to approach this is to create a frequency table of IDs and then create a frequency table of those frequencies. If every ID shows up only once, you will see many frequencies of 1. If any ID shows up more than once, you will see frequencies greater than 1. The frequency table below shows that two IDs showed up once and one ID showed up twice.

table(table(acti2$ID))
## 
## 1 2 
## 2 1

Another check is the number of unique IDs in each dataset. This is accomplished by using the unique() function to exclude any duplicated values and then finding the length of the result.

length(unique(surveys2$FirstName))
## [1] 3
length(unique(acti2$ID))
## [1] 3

You can also check how many IDs from one dataset appear in another by combining the unique() function with the %in% logical operator, which returns TRUE or FALSE for whether each ID from one dataset is in the other. In large datasets this may be many results, so we can summarise them in a frequency table using table(). Note that which set of IDs is on the left or the right makes a difference. If you expect one dataset to contain all IDs (e.g., you are merging a new variable into a large “master” dataset that you believe has every single one of your participants), this is a very helpful way to check whether that is true.

table(unique(surveys2$FirstName) %in% unique(acti$ID))
## 
## FALSE  TRUE 
##     1     2
table(unique(acti$ID) %in% unique(surveys2$FirstName))
## 
## FALSE  TRUE 
##     1     2
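If the counts suggest some IDs do not match, a natural next step is to see which ones. Base R’s setdiff() returns the values in the first set that are absent from the second; here it is applied to surveys2 and acti2 (redefined so the chunk runs on its own):

```r
library(data.table)

surveys2 <- data.table(
  FirstName = c(1, 2, 2, 3),
  Age = c(19, 18, 18, 20))

acti2 <- data.table(
  ID = c(2, 2, 3, 4),
  Sleep = c(8, 7, 6, 5))

## IDs in the surveys but not in the actigraphy data
setdiff(unique(surveys2$FirstName), unique(acti2$ID))
## [1] 1

## IDs in the actigraphy but not in the survey data
setdiff(unique(acti2$ID), unique(surveys2$FirstName))
## [1] 4
```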

7.9 Checking Data You Try It

Using our daily data:

## you try to write the checking code here!

8 Reshaping Data

Another common data management task is reshaping data. This typically occurs when data are stored in one format, such as different columns for different time points, and you need data reshaped into another format, such as multiple rows for each ID for mixed models. What follows is a small sample of a “wide” dataset. That is, a dataset where each timepoint is stored in a separate variable.

dwide <- data.table(
  ID = c(1, 2, 3),
  STRESS1 = c(2, 1, 3),
  STRESS2 = c(4, 5, 6),
  STRESS3 = c(1, 1, 2),
  Happy1 = c(4, 5, 3),
  Happy2 = c(3, 2, 1),
  Happy3 = c(5, 5, 4))

Example ‘wide’ dataset where different time points are stored in separate variables.

ID STRESS1 STRESS2 STRESS3 Happy1 Happy2 Happy3
 1       2       4       1      4      3      5
 2       1       5       1      5      2      5
 3       3       6       2      3      1      4

R has a function, reshape(), that allows reshaping data from wide to long. It is quite complicated, so it may not all be clear at first, but it is easiest to learn through examples. For now, we will focus on five arguments to reshape().

reshape(
  data = dwide,
  varying = list(
    STRESS = c("STRESS1", "STRESS2", "STRESS3")
  ),
  timevar = "time",
  idvar = "ID",
  direction = "long")
##    ID Happy1 Happy2 Happy3 time STRESS1
## 1:  1      4      3      5    1       2
## 2:  2      5      2      5    1       1
## 3:  3      3      1      4    1       3
## 4:  1      4      3      5    2       4
## 5:  2      5      2      5    2       5
## 6:  3      3      1      4    2       6
## 7:  1      4      3      5    3       1
## 8:  2      5      2      5    3       1
## 9:  3      3      1      4    3       2

In this case, you can see that only STRESS was actually reshaped long. To get Happy reshaped long as well, we need to list all the variable names for Happy, in order, in another vector using c() inside the varying = list() section, like this:

reshape(
  data = dwide,
  varying = list(
    STRESS = c("STRESS1", "STRESS2", "STRESS3"),
## extra line for happy
    Happy = c("Happy1", "Happy2", "Happy3") 
  ),
  timevar = "time",
  idvar = "ID",
  direction = "long")
##    ID time STRESS1 Happy1
## 1:  1    1       2      4
## 2:  2    1       1      5
## 3:  3    1       3      3
## 4:  1    2       4      3
## 5:  2    2       5      2
## 6:  3    2       6      1
## 7:  1    3       1      5
## 8:  2    3       1      5
## 9:  3    3       2      4

Right now, R is just using the first variable name for each variable in the long format, so we have “STRESS1” and “Happy1”. If we want, we can customise this, for example to get rid of the “1”.

reshape(
  data = dwide,
  varying = list(
    STRESS = c("STRESS1", "STRESS2", "STRESS3"),
    Happy = c("Happy1", "Happy2", "Happy3") 
  ),
## extra line to make it just STRESS and Happy without the "1"
  v.names = c("STRESS", "Happy"),
  timevar = "time",
  idvar = "ID",
  direction = "long")
##    ID time STRESS Happy
## 1:  1    1      2     4
## 2:  2    1      1     5
## 3:  3    1      3     3
## 4:  1    2      4     3
## 5:  2    2      5     2
## 6:  3    2      6     1
## 7:  1    3      1     5
## 8:  2    3      1     5
## 9:  3    3      2     4

Finally, by default, R makes the new “time” variable integers from 1 to the number of time points. Instead, we can set the exact times using the times argument. For example, suppose that the assessments occurred at week 0, week 6, and week 24. We might want to use 0, 6, 24 instead of the default 1, 2, 3.

dlong <- reshape(
  data = dwide,
  varying = list(
    STRESS = c("STRESS1", "STRESS2", "STRESS3"),
    Happy = c("Happy1", "Happy2", "Happy3") 
  ),
  v.names = c("STRESS", "Happy"),
  timevar = "weeks",
## extra line for customized times
  times = c(0, 6, 24),
  idvar = "ID",
  direction = "long")

print(dlong)
##    ID weeks STRESS Happy
## 1:  1     0      2     4
## 2:  2     0      1     5
## 3:  3     0      3     3
## 4:  1     6      4     3
## 5:  2     6      5     2
## 6:  3     6      6     1
## 7:  1    24      1     5
## 8:  2    24      1     5
## 9:  3    24      2     4

8.1 Complications in Reshaping

Not all reshaping is straightforward. Here is another wide dataset. Some variables are time varying; others (e.g., Age) are not. Even the time varying variables were not measured at all time points: at the final time point, only Happy was measured, and coping was only measured twice.

dwide2 <- data.table(
  ID = 1:3,
  Age = c(18, 19, 22),
  STRESS1 = c(2, 1, 3),
  STRESS2 = c(4, 5, 6),
  STRESS3 = c(1, 1, 2),  
  COPE1 = c(1, 2, 3),
  COPE2 = c(6, 3, 4),
  Happy1 = c(4, 5, 3),
  Happy2 = c(3, 2, 1),
  Happy3 = c(5, 5, 4),
  Happy4 = c(1, 2, 3))

Following the previous template, we might try something like this:

reshape(
  data = dwide2,
  varying = list(
    STRESS = c("STRESS1", "STRESS2", "STRESS3"),
    COPE = c("COPE1", "COPE2"),    
    Happy = c("Happy1", "Happy2", "Happy3", "Happy4") 
  ),
  v.names = c("STRESS", "COPE", "Happy"),
  timevar = "weeks",
  times = c(0, 6, 12, 24),
  idvar = "ID",
  direction = "long")
## Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying, : 'varying' arguments must be the same length

This results in an error because the vectors in varying have unequal lengths: four happy measures, but only three stress and two coping measures. For reshape() to work, each element of varying must have the same length. One way to address this is to create a “Pad” variable that is entirely missing and repeat it as needed (e.g., once for stress, twice for coping). Age is not time varying, so we can ignore it; it will be carried along as a between person variable.

dwide2[, Pad := NA]

reshape(
  data = dwide2,
  varying = list(
    STRESS = c("STRESS1", "STRESS2", "STRESS3", "Pad"),
    COPE = c("COPE1", "COPE2", "Pad", "Pad"),    
    Happy = c("Happy1", "Happy2", "Happy3", "Happy4") 
  ),
  v.names = c("STRESS", "COPE", "Happy"),
  timevar = "weeks",
  times = c(0, 6, 12, 24),
  idvar = "ID",
  direction = "long")
##     ID Age weeks STRESS COPE Happy
##  1:  1  18     0      2    1     4
##  2:  2  19     0      1    2     5
##  3:  3  22     0      3    3     3
##  4:  1  18     6      4    6     3
##  5:  2  19     6      5    3     2
##  6:  3  22     6      6    4     1
##  7:  1  18    12      1   NA     5
##  8:  2  19    12      1   NA     5
##  9:  3  22    12      2   NA     4
## 10:  1  18    24     NA   NA     1
## 11:  2  19    24     NA   NA     2
## 12:  3  22    24     NA   NA     3
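One small cleanup worth sketching: the “Pad” helper column stays in the wide data after the reshape, so once you have stored the long result you may want to remove it. This repeats the setup above so the chunk runs on its own:

```r
library(data.table)

dwide2 <- data.table(
  ID = 1:3,
  Age = c(18, 19, 22),
  STRESS1 = c(2, 1, 3), STRESS2 = c(4, 5, 6), STRESS3 = c(1, 1, 2),
  COPE1 = c(1, 2, 3), COPE2 = c(6, 3, 4),
  Happy1 = c(4, 5, 3), Happy2 = c(3, 2, 1),
  Happy3 = c(5, 5, 4), Happy4 = c(1, 2, 3))
dwide2[, Pad := NA]

## store the long result this time
dlong2 <- reshape(
  data = dwide2,
  varying = list(
    STRESS = c("STRESS1", "STRESS2", "STRESS3", "Pad"),
    COPE = c("COPE1", "COPE2", "Pad", "Pad"),
    Happy = c("Happy1", "Happy2", "Happy3", "Happy4")
  ),
  v.names = c("STRESS", "COPE", "Happy"),
  timevar = "weeks",
  times = c(0, 6, 12, 24),
  idvar = "ID",
  direction = "long")

## drop the helper column from the wide data once done
dwide2[, Pad := NULL]
```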

8.2 You Try It Reshape Long

Using this dataset, you try reshaping a wide dataset into a long format.

dwtry <- data.table(
  Name = c("Jane", "John"), 
  SLEEP1 = c(2, 1),
  SLEEP2 = c(4, 5),
  Dep1 = c(4, 5),
  Dep2 = c(3, 2),
  Dep3 = c(5, 5))

Wide dataset for you to try reshaping.

Name SLEEP1 SLEEP2 Dep1 Dep2 Dep3
Jane      2      4    4    3    5
John      1      5    5    2    5
dltry <- reshape(   )
## Error in c("reshapeLong", "reshapeWide") %in% names(attributes(data)): argument "data" is missing, with no default
print(dltry)
## Error in print(dltry): object 'dltry' not found

8.3 Reshape Data Wide

We can also reshape data from long to wide. To do that, we will begin with the long dataset we made earlier, dlong. Again we use the reshape() function. This time, we do not need to indicate varying and we set direction = "wide".

reshape(
  data = dlong,
  v.names = c("STRESS", "Happy"),
  timevar = "weeks",
  idvar = "ID",
  direction = "wide")
##    ID STRESS.0 Happy.0 STRESS.6 Happy.6 STRESS.24 Happy.24
## 1:  1        2       4        4       3         1        5
## 2:  2        1       5        5       2         1        5
## 3:  3        3       3        6       1         2        4

To create the different variables for each time point, R has used the base variable name, a period, and then the time value (0, 6, 24). We can customize the separator so that instead of a period there is nothing, an underscore, or whatever we want.

reshape(
  data = dlong,
  v.names = c("STRESS", "Happy"),
  timevar = "weeks",
  idvar = "ID",
  direction = "wide",
  sep = "_T")
##    ID STRESS_T0 Happy_T0 STRESS_T6 Happy_T6 STRESS_T24 Happy_T24
## 1:  1         2        4         4        3          1         5
## 2:  2         1        5         5        2          1         5
## 3:  3         3        3         6        1          2         4

8.4 You Try It Reshape Wide

Using the long dataset from the previous You Try It, reshape dltry wide.

reshape(   )
## Error in c("reshapeLong", "reshapeWide") %in% names(attributes(data)): argument "data" is missing, with no default

9 Bonus Content - Merging by Multiple IDs

Sometimes a single variable is not enough to merge two datasets. This is particularly true in longitudinal studies with repeated measures. R can handle this by using multiple variables as IDs for merging. Alternatively, a unique ID can be created by combining two or more pieces of information. The following shows an example.

surveys <- data.table(
  ID = c(1, 1, 2, 2),
  Time = c(1, 2, 1, 2),
  Stress = c(6, 4, 3, 4))

acti <- data.table(
  ID = c(1, 1, 1, 3, 3),
  Time = c(1, 2, 3, 1, 2),
  Sleep = c(7, 8, 9, 8, 8))

## incorrect
merge(surveys, acti,
      by = "ID",
      all = TRUE)
##     ID Time.x Stress Time.y Sleep
##  1:  1      1      6      1     7
##  2:  1      1      6      2     8
##  3:  1      1      6      3     9
##  4:  1      2      4      1     7
##  5:  1      2      4      2     8
##  6:  1      2      4      3     9
##  7:  2      1      3     NA    NA
##  8:  2      2      4     NA    NA
##  9:  3     NA     NA      1     8
## 10:  3     NA     NA      2     8
## correct
merge(surveys, acti,
      by = c("ID", "Time"),
      all = TRUE)
##    ID Time Stress Sleep
## 1:  1    1      6     7
## 2:  1    2      4     8
## 3:  1    3     NA     9
## 4:  2    1      3    NA
## 5:  2    2      4    NA
## 6:  3    1     NA     8
## 7:  3    2     NA     8

9.1 Merging by Multiple IDs - You Try It

Let’s say that one participant had technical problems with the surveys from the daily data we have worked with, and so filled out the last few on paper. Now we want to merge these data into our master daily dataset, d, using a full outer join. The resulting dataset, d2, should have 6,602 rows.

pencil <- data.table(
  UserID = 1L,
  SurveyDay = as.Date("2017-3-08"),
  Survey = c("Morning", "Afternoon", "Evening"),
  STRESS = c(1, 3, 1))

## there is no survey integer in pencil, so we need to make that
## to merge with the d dataset.
pencil[Survey == "Morning", SurveyInteger := 1L]
pencil[Survey == "Afternoon", SurveyInteger := 2L]
pencil[Survey == "Evening", SurveyInteger := 3L]

d2 <- merge(   )
## Error in as.data.frame(x): argument "x" is missing, with no default