Descriptive statistics summarize and describe the main features of a dataset. They provide insights into the distribution, central tendency, variability, and relationships between variables. In this tutorial, we will use R to compute summary statistics and create visualizations to explore the data.
2. Setting Up R
Make sure R is installed on your machine. You can download it from CRAN. RStudio is an optional IDE that offers a friendlier interface, but it still requires R to be installed.
3. Loading Necessary Libraries
We’ll be using several libraries to assist with data manipulation and visualization:
# Install necessary libraries (only run once)
# install.packages("dplyr")
# install.packages("ggplot2")
# Load libraries
library(dplyr) # For data manipulation
library(ggplot2) # For data visualization
4. Importing Data
To work with data, we need to load it into R. For this tutorial, we’ll assume you have a CSV file, but the process is similar for other formats like Excel, JSON, etc.
# Import CSV data
data <- read.csv("your_data.csv")
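Before computing anything, it usually helps to check that the import worked as expected. The sketch below is self-contained: it writes a tiny made-up CSV to a temporary file (the column names id and score are illustrative, not from any real dataset), reads it back, and inspects the result.

```r
# Write a small illustrative CSV to a temporary file (made-up values)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,10.5", "2,NA", "3,7.25"), csv_path)

# Read it back; by default read.csv() treats the string "NA" as missing
example_data <- read.csv(csv_path)

str(example_data)             # column names and types
head(example_data)            # first few rows
colSums(is.na(example_data))  # number of missing values per column
```

With your own file, replace csv_path with the path to your CSV and run the same three checks.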
5. Summary Statistics
Summary statistics help us understand key features of the dataset. Let’s break it down into several important measures:
a) Mean (Average)
The mean is the sum of the values divided by the number of values.
# Calculate mean
mean_value <- mean(data$your_variable)
print(mean_value)
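Real datasets often contain missing values, and mean() returns NA if any value is missing. A minimal sketch, using a made-up vector named scores, shows the na.rm argument that tells R to drop missing values first:

```r
# Illustrative vector with one missing value
scores <- c(4, 8, NA, 6)

mean_with_na <- mean(scores)                # NA, because one value is missing
mean_no_na   <- mean(scores, na.rm = TRUE)  # 6, computed from 4, 8 and 6

print(mean_with_na)
print(mean_no_na)
```

The same na.rm = TRUE argument works with median(), sd(), min(), max(), quantile() and IQR().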
b) Median (Middle Value)
The median is the middle value when the data are sorted in ascending order. With an even number of values, it is the average of the two middle values.
# Calculate median
median_value <- median(data$your_variable)
print(median_value)
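The median is less sensitive to extreme values than the mean. The sketch below uses a made-up vector (incomes, with one very large value) to compare the two, and also shows the even-length case:

```r
# Illustrative data: four similar values and one extreme value
incomes <- c(30, 32, 35, 38, 400)

mean_income   <- mean(incomes)    # 107, pulled upward by the extreme value
median_income <- median(incomes)  # 35, the middle of the sorted values

# With an even number of values, the two middle values are averaged
even_median <- median(c(1, 2, 3, 4))  # 2.5

print(c(mean = mean_income, median = median_income))
print(even_median)
```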
c) Standard Deviation (Spread of Data)
The standard deviation measures how far the values typically lie from the mean. R's sd() computes the sample standard deviation, which divides by n - 1 rather than n.
# Calculate standard deviation
sd_value <- sd(data$your_variable)
print(sd_value)
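To make the definition concrete, here is a small sketch that computes the sample standard deviation by hand and compares it with sd(). The vector values is made up for illustration.

```r
# Illustrative data
values <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(values)

# Sample standard deviation: squared deviations from the mean,
# summed, divided by (n - 1), then square-rooted
manual_sd <- sqrt(sum((values - mean(values))^2) / (n - 1))

# Should agree with the built-in function
print(manual_sd)
print(all.equal(manual_sd, sd(values)))
```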
d) Minimum and Maximum
The minimum and maximum are the smallest and largest observed values. Together they give the range of the data.
# Minimum and Maximum
min_value <- min(data$your_variable)
max_value <- max(data$your_variable)
print(c(min_value, max_value))
e) Summary Statistics
For a numeric variable, the summary() function reports the minimum, first quartile, median, mean, third quartile, and maximum in a single call.
# Summary statistics
summary(data$your_variable)
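The dplyr package loaded earlier can produce the same kind of statistics per group. The following is a sketch only: it uses R's built-in mtcars dataset in place of your own data, and the output column names (mean_mpg and so on) are chosen for illustration.

```r
library(dplyr)

# Summary statistics of mpg for each number of cylinders in mtcars
grouped_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    n          = n(),                        # rows in each group
    mean_mpg   = mean(mpg, na.rm = TRUE),
    median_mpg = median(mpg, na.rm = TRUE),
    sd_mpg     = sd(mpg, na.rm = TRUE)
  )

print(grouped_summary)
```

For your own data, swap mtcars, cyl and mpg for your data frame, grouping column and numeric variable.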
6. Visual Exploration
Visualization helps reveal patterns and relationships within the data. Below are some common plot types:
a) Histograms
A histogram shows the distribution of a single variable.
# Histogram of one numeric variable; adjust binwidth to the scale of your data
ggplot(data, aes(x = your_variable)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  theme_minimal() +
  labs(title = "Histogram of Your Variable", x = "Your Variable", y = "Frequency")
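A bin width of 1 only suits variables measured on a matching scale. One common rule of thumb for picking a starting value is the Freedman-Diaconis rule, which this tutorial does not otherwise prescribe. The sketch below applies it to the built-in mtcars dataset as a stand-in for your own variable.

```r
# Stand-in for data$your_variable
x <- mtcars$mpg

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
fd_binwidth <- 2 * IQR(x) / length(x)^(1 / 3)

print(fd_binwidth)
```

The resulting number can be passed as the binwidth argument of geom_histogram(). Treat it as a starting point and adjust by eye.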
b) Boxplots
Boxplots visualize the spread and summary statistics such as median, interquartile range (IQR), and outliers.
# Boxplot of one numeric variable
ggplot(data, aes(y = your_variable)) +
  geom_boxplot(fill = "purple") +
  theme_minimal() +
  labs(title = "Boxplot of Your Variable", y = "Your Variable")
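By default, geom_boxplot() draws as separate points any values lying more than 1.5 times the IQR beyond the quartiles. To list those values rather than just see them, a sketch like the following can help. The vector readings is made up for illustration.

```r
# Illustrative data with one unusually large value
readings <- c(10, 12, 12, 13, 14, 15, 15, 16, 40)

# First and third quartiles
q <- unname(quantile(readings, probs = c(0.25, 0.75)))

# Fences at 1.5 * IQR beyond the quartiles
fence_low  <- q[1] - 1.5 * IQR(readings)
fence_high <- q[2] + 1.5 * IQR(readings)

# Values outside the fences are flagged as potential outliers
outliers <- readings[readings < fence_low | readings > fence_high]
print(outliers)
```

A flagged value is only a candidate for closer inspection, not necessarily an error.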
c) Scatter Plots
Scatter plots are useful for understanding relationships between two variables.
# Scatter plot of two numeric variables
ggplot(data, aes(x = your_variable1, y = your_variable2)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Scatter Plot", x = "Your Variable 1", y = "Your Variable 2")
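A scatter plot pairs naturally with a numeric measure of linear association. As a sketch, the Pearson correlation coefficient can be computed with cor(). The built-in mtcars dataset stands in for your own two variables here.

```r
# Pearson correlation between car weight and fuel efficiency in mtcars;
# use = "complete.obs" drops rows where either value is missing
r <- cor(mtcars$wt, mtcars$mpg, use = "complete.obs")

print(r)  # negative: heavier cars tend to have lower mpg
```

Values near -1 or 1 indicate a strong linear relationship, while values near 0 indicate a weak one. Correlation only captures linear patterns, so it is worth looking at the plot as well.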
7. Additional Summary Statistics
Beyond the measures above, a few other statistics describe position and spread:
- Quantiles (e.g., quartiles, percentiles)
- Variance – the average squared deviation from the mean; the standard deviation is its square root.
- Interquartile Range (IQR) – the range between the first and third quartiles.
Here’s how to calculate quantiles and IQR:
# Quantiles (e.g., quartiles)
quantiles <- quantile(data$your_variable, probs = c(0.25, 0.5, 0.75))
print(quantiles)
# IQR
iqr_value <- IQR(data$your_variable)
print(iqr_value)
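Variance can be computed in the same way with var(). The sketch below uses the built-in mtcars dataset as a stand-in for your own variable, and checks two relationships: the variance equals the squared standard deviation, and the IQR equals the third quartile minus the first.

```r
# Stand-in for data$your_variable
mpg <- mtcars$mpg

# Variance and its relationship to the standard deviation
variance_value <- var(mpg, na.rm = TRUE)
sd_value <- sd(mpg, na.rm = TRUE)
print(variance_value)
print(all.equal(variance_value, sd_value^2))

# IQR as the difference between the third and first quartiles
quartiles <- quantile(mpg, probs = c(0.25, 0.75), na.rm = TRUE)
iqr_from_quartiles <- unname(quartiles[2] - quartiles[1])
print(all.equal(iqr_from_quartiles, IQR(mpg, na.rm = TRUE)))
```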