When working with data in R, it’s essential to keep it organized in a clear and consistent format, because this makes analysis easier and more reliable. The tidyr package helps us achieve this by providing simple functions to tidy up our data.
What is Tidy Data?
Tidy data means that:
- Each variable has its own column. For example, if you’re recording information about students, variables could be “Name,” “Age,” and “Grade,” each in separate columns.
- Each observation has its own row. Each student would have their own row with their respective information.
- Each value has its own cell. The intersection of a row and column should contain a single value, like a student’s age or grade.
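As a minimal sketch, the three rules above can be seen in a small data frame. The student records here are made up purely for illustration.

```r
# Hypothetical student records, invented for illustration
students_info <- data.frame(
  Name  = c("Alice", "Bob"),   # each variable is its own column
  Age   = c(14, 15),
  Grade = c("9th", "10th")
)

# Each row is one student, i.e. one observation
nrow(students_info)

# Each cell holds a single value, e.g. Bob's age
students_info$Age[students_info$Name == "Bob"]
```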
Installing and Loading tidyr
Before using tidyr, you need to install it once and then load it in each R session:
install.packages("tidyr") # Install tidyr
library(tidyr) # Load tidyr
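If you rerun a script often, one possible variation is to install the package only when it is missing. This is a sketch of a common pattern rather than a required step.

```r
# Install tidyr only if it is not already available on this machine
if (!requireNamespace("tidyr", quietly = TRUE)) {
  install.packages("tidyr")
}

library(tidyr) # Load tidyr for the current session
```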
Common Functions in tidyr
Here are some basic functions in tidyr that help in tidying data:
1. pivot_longer()
This function transforms data from a wide format to a long format. It’s useful when several columns hold values of the same variable, such as one score column per subject.
Example:
Suppose you have a dataset of students’ scores in different subjects:
Name | Math | Science | English |
---|---|---|---|
Alice | 85 | 90 | 88 |
Bob | 78 | 82 | 85 |
To tidy this data:
library(tidyr)
# Original data
students <- data.frame(
Name = c("Alice", "Bob"),
Math = c(85, 78),
Science = c(90, 82),
English = c(88, 85)
)
# Use pivot_longer to tidy the data
tidy_students <- pivot_longer(students, cols = Math:English, names_to = "Subject", values_to = "Score")
print(tidy_students)
The result will be:
Name | Subject | Score |
---|---|---|
Alice | Math | 85 |
Alice | Science | 90 |
Alice | English | 88 |
Bob | Math | 78 |
Bob | Science | 82 |
Bob | English | 85 |
Now, each row represents a single observation of a student’s score in a subject.
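One reason the long format is convenient is that grouped summaries become a single call. The sketch below rebuilds the same student data, selects the columns with `!Name` (every column except Name, which is equivalent to `Math:English` here), and uses base R's `aggregate()` to compute the mean score per subject.

```r
library(tidyr)

students <- data.frame(
  Name = c("Alice", "Bob"),
  Math = c(85, 78),
  Science = c(90, 82),
  English = c(88, 85)
)

# !Name selects every column except Name
tidy_students <- pivot_longer(students, cols = !Name, names_to = "Subject", values_to = "Score")

# Mean score per subject, computed on the long data
subject_means <- aggregate(Score ~ Subject, data = tidy_students, FUN = mean)
print(subject_means)
```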
2. pivot_wider()
This function does the opposite of pivot_longer(). It transforms data from a long format back to a wide format.
Example:
Using the tidy_students data from above:
# Use pivot_wider to spread the data
wide_students <- pivot_wider(tidy_students, names_from = Subject, values_from = Score)
print(wide_students)
The result will be:
Name | Math | Science | English |
---|---|---|---|
Alice | 85 | 90 | 88 |
Bob | 78 | 82 | 85 |
This returns the data to its original wide format.
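When the long data does not contain every combination, pivot_wider() leaves the gaps as NA. The sketch below uses a hypothetical variant of the data in which Bob has no English score, and shows the `values_fill` argument as one way to supply a replacement value. Whether 0 is a sensible fill depends on what the data means.

```r
library(tidyr)

# Hypothetical long data in which Bob has no English score
long_scores <- data.frame(
  Name = c("Alice", "Alice", "Alice", "Bob", "Bob"),
  Subject = c("Math", "Science", "English", "Math", "Science"),
  Score = c(85, 90, 88, 78, 82)
)

# By default, the missing Name/Subject combination becomes NA
wide_na <- pivot_wider(long_scores, names_from = Subject, values_from = Score)

# values_fill replaces those gaps with a chosen value instead
wide_filled <- pivot_wider(long_scores, names_from = Subject, values_from = Score, values_fill = 0)
print(wide_filled)
```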
3. separate()
This function splits a single column into multiple columns based on a separator.
Example:
Suppose you have a dataset with full names:
FullName |
---|
Alice Johnson |
Bob Smith |
To separate the full names into first and last names:
# Original data (named full_names so it does not reuse the name of base R's names() function)
full_names <- data.frame(
FullName = c("Alice Johnson", "Bob Smith")
)
# Use separate to split the FullName column at the space
separated_names <- separate(full_names, col = FullName, into = c("FirstName", "LastName"), sep = " ")
print(separated_names)
The result will be:
FirstName | LastName |
---|---|
Alice | Johnson |
Bob | Smith |
Now, the full names are split into two separate columns.
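Real names do not always split into exactly two parts. The sketch below uses hypothetical names and the `extra = "merge"` argument of separate(), which keeps any leftover pieces together in the last column.

```r
library(tidyr)

# Hypothetical names; the second one has three space-separated parts
people <- data.frame(
  FullName = c("Alice Johnson", "Mary Ann Lee")
)

# With two target columns, extra = "merge" splits only at the first space,
# so everything after it stays together in LastName
split_people <- separate(people, col = FullName, into = c("FirstName", "LastName"),
                         sep = " ", extra = "merge")
print(split_people)
```

With the default `extra = "warn"`, separate() would emit a warning and drop the extra piece instead. Note also that in tidyr 1.3.0 and later, separate() is marked as superseded in favor of separate_wider_delim(), so newer code may prefer that function.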
4. unite()
This function combines multiple columns into a single column.
Example:
Using the separated_names data from above:
# Use unite to combine FirstName and LastName
united_names <- unite(separated_names, col = FullName, FirstName, LastName, sep = " ")
print(united_names)
The result will be:
FullName |
---|
Alice Johnson |
Bob Smith |
This combines the first and last names back into a single column.
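By default, unite() drops the columns it combines. If you want to keep them, a sketch using the `remove = FALSE` argument looks like this (the data frame is rebuilt here so the example runs on its own):

```r
library(tidyr)

name_parts <- data.frame(
  FirstName = c("Alice", "Bob"),
  LastName = c("Johnson", "Smith")
)

# remove = FALSE keeps FirstName and LastName alongside the new FullName column
kept_names <- unite(name_parts, col = "FullName", FirstName, LastName, sep = " ", remove = FALSE)
print(kept_names)
```

If `sep` is omitted, unite() joins the values with an underscore, so setting `sep = " "` is what produces a natural-looking full name.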