Descriptive Statistics - Sangy Academy

Descriptive statistics summarize and describe the main features of a dataset. They provide insights into the distribution, central tendency, variability, and relationships between variables. In this tutorial, we will use R to compute summary statistics and create visualizations to explore the data.

2. Setting Up R

Ensure R is installed on your machine. You can download it from CRAN. RStudio is an optional editor that runs on top of an existing R installation and offers a more user-friendly interface.

3. Loading Necessary Libraries

We’ll be using two packages to assist with data manipulation and visualization:

# Install necessary libraries (only run once)
# install.packages("dplyr")
# install.packages("ggplot2")

# Load libraries
library(dplyr)   # For data manipulation
library(ggplot2) # For data visualization

4. Importing Data

To work with data, we need to load it into R. For this tutorial, we’ll assume you have a CSV file. Other formats such as Excel or JSON can be loaded in a similar way, though they need additional packages.

# Import CSV data
data <- read.csv("your_data.csv")
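
To try the import step without a file of your own, the sketch below writes a small, made-up table to a temporary CSV and reads it back. The column names (height_cm, group) and the values are hypothetical and chosen only for illustration. str() and head() are base R functions for checking what was actually loaded.

```r
# Hypothetical example data, written to a temporary file for illustration only
tmp <- tempfile(fileext = ".csv")
example <- data.frame(
  height_cm = c(160, 172, 181, 168),
  group     = c("a", "b", "a", "b")
)
write.csv(example, tmp, row.names = FALSE)

# Read the file back, as in the import step above
data <- read.csv(tmp)

# Inspect column types and the first rows before computing anything
str(data)
head(data)
```

Checking the column types at this point helps catch numbers that were read in as text.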

5. Summary Statistics

Summary statistics help us understand key features of the dataset. Let’s break it down into several important measures:

a) Mean (Average)

The mean is the sum of all values divided by the number of values. It is sensitive to extreme values.

# Calculate mean
mean_value <- mean(data$your_variable)
print(mean_value)
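
As a small worked example, the sketch below computes the mean by hand and compares it with mean(). It also shows how a missing value (NA) affects the result. The vectors x and y are made-up values, not part of the tutorial's dataset.

```r
# Illustrative values (not from any real dataset)
x <- c(2, 4, 6, 8)

# The mean is the sum of the values divided by how many there are
manual_mean <- sum(x) / length(x)
manual_mean == mean(x)   # TRUE

# A single missing value makes mean() return NA ...
y <- c(2, 4, 6, NA)
mean(y)                  # NA

# ... unless missing values are dropped first
mean(y, na.rm = TRUE)    # 4
```

If your own column contains missing values, adding na.rm = TRUE to the call is one common way to handle them.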

b) Median (Middle Value)

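The median is the middle value when the data is sorted in ascending order. With an even number of values, it is the average of the two middle values. Unlike the mean, it is barely affected by extreme values.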

# Calculate median
median_value <- median(data$your_variable)
print(median_value)
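
The sketch below illustrates the odd and even cases and compares the median with the mean when one extreme value is present. All vectors are made-up values for illustration.

```r
odd_values  <- c(7, 1, 5)        # sorted: 1 5 7
even_values <- c(7, 1, 5, 3)     # sorted: 1 3 5 7

median(odd_values)    # 5, the single middle value
median(even_values)   # 4, the average of 3 and 5

# One extreme value moves the mean far more than the median
with_outlier <- c(1, 3, 5, 7, 100)
mean(with_outlier)    # 23.2
median(with_outlier)  # 5
```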

c) Standard Deviation (Spread of Data)

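The standard deviation measures how far the values in the dataset typically lie from the mean. R's sd() function computes the sample standard deviation, which divides by n - 1 rather than n.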

# Calculate standard deviation
sd_value <- sd(data$your_variable)
print(sd_value)
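
To make the formula concrete, the sketch below rebuilds the sample standard deviation step by step and checks it against sd(). The vector x is made up for illustration.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # illustrative values
n <- length(x)

# Sample standard deviation: squared deviations from the mean,
# summed, divided by n - 1, then square-rooted
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))

all.equal(manual_sd, sd(x))      # TRUE

# Dividing by n instead gives the population version, which is smaller
population_sd <- sqrt(sum((x - mean(x))^2) / n)
population_sd                    # 2
```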

d) Minimum and Maximum

Minimum and maximum values show the smallest and largest values in the dataset.

# Minimum and Maximum
min_value <- min(data$your_variable)
max_value <- max(data$your_variable)
print(c(min_value, max_value))
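
Base R also offers range(), which returns both extremes in one call, plus which.min() and which.max() for locating them. A brief sketch on made-up values:

```r
x <- c(12, 5, 9, 20, 7)   # illustrative values

range(x)                  # 5 20, the minimum and maximum together
diff(range(x))            # 15, the distance between them

# Positions of the extreme values within the vector
which.min(x)              # 2
which.max(x)              # 4
```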

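e) Summary Statistics at a Glance

For a numeric variable, the summary() function reports the minimum, first quartile, median, mean, third quartile, and maximum in one call. It does not include the standard deviation.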

# Summary statistics
summary(data$your_variable)
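
When statistics are needed per group, or summary() leaves out a measure you want (such as the standard deviation), one option is dplyr's group_by() and summarise(). The sketch below is a minimal illustration. The scores data frame and its columns are hypothetical.

```r
library(dplyr)

# Hypothetical data frame used only for illustration
scores <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  score = c(10, 12, 14, 20, 25, 30)
)

# One row of summary statistics per group
group_stats <- scores %>%
  group_by(group) %>%
  summarise(
    n      = n(),
    mean   = mean(score),
    median = median(score),
    sd     = sd(score)
  )

print(group_stats)
```

Here group "a" has mean 12 and standard deviation 2, and group "b" has mean 25 and standard deviation 5.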

6. Visual Exploration

Visualization helps reveal patterns and relationships within the data that summary numbers alone can hide. Below are some common visualizations:

a) Histograms

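A histogram shows the distribution of a single numeric variable by grouping its values into bins and counting how many fall into each. The binwidth argument is expressed in the variable's own units, so adjust it to suit your data.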

ggplot(data, aes(x = your_variable)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  theme_minimal() +
  labs(title = "Histogram of Your Variable", x = "Your Variable", y = "Frequency")
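
For a runnable version of the template above, the sketch below uses mtcars, a dataset that ships with R, and plots its mpg column. The bin width of 2.5 is an arbitrary choice for this example, not a recommendation.

```r
library(ggplot2)

# mtcars ships with R; mpg is miles per gallon for 32 cars
p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2.5, fill = "blue", color = "black") +
  theme_minimal() +
  labs(title = "Histogram of mpg", x = "Miles per gallon", y = "Frequency")

print(p_hist)
```

Trying a few different binwidth values is a quick way to see how much the apparent shape depends on that choice.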

b) Boxplots

Boxplots visualize the spread of a variable through its median and interquartile range (IQR), and they mark potential outliers as individual points.

ggplot(data, aes(y = your_variable)) +
  geom_boxplot(fill = "purple") +
  theme_minimal() +
  labs(title = "Boxplot of Your Variable", y = "Your Variable")
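
By default, geom_boxplot() draws as separate points the values lying more than 1.5 times the IQR beyond the quartiles. The sketch below reproduces that rule by hand on made-up values, so you can see which points would be flagged.

```r
x <- c(2, 4, 5, 6, 7, 8, 30)   # illustrative values with one extreme point

q1  <- unname(quantile(x, 0.25))   # first quartile: 4.5
q3  <- unname(quantile(x, 0.75))   # third quartile: 7.5
iqr <- q3 - q1                     # 3

# Fences at 1.5 * IQR beyond the quartiles
lower_fence <- q1 - 1.5 * iqr      # 0
upper_fence <- q3 + 1.5 * iqr      # 12

# Values outside the fences are the ones a boxplot shows as points
outliers <- x[x < lower_fence | x > upper_fence]
outliers                           # 30
```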

c) Scatter Plots

Scatter plots are useful for understanding relationships between two variables.

ggplot(data, aes(x = your_variable1, y = your_variable2)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Scatter Plot", x = "Your Variable 1", y = "Your Variable 2")
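
A scatter plot pairs naturally with a correlation coefficient, which summarizes the linear relationship in a single number. The sketch below uses two columns of the built-in mtcars dataset (wt and mpg) as stand-ins for your own variables.

```r
library(ggplot2)

# Built-in mtcars data: car weight (wt, in 1000 lbs) against fuel economy (mpg)
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Weight vs. fuel economy",
       x = "Weight (1000 lbs)", y = "Miles per gallon")

print(p_scatter)

# Pearson correlation: a value between -1 and 1 for the linear relationship
r <- cor(mtcars$wt, mtcars$mpg)
r   # negative: heavier cars tend to have lower mpg
```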

7. Additional Summary Statistics

Beyond the measures above, a few other descriptive statistics are worth knowing:

Quantiles (e.g., quartiles, percentiles): cut points that split the sorted data into parts of equal size.
Variance: the average squared deviation from the mean, equal to the square of the standard deviation.
Interquartile Range (IQR): the distance between the first and third quartiles.

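Here’s how to calculate quantiles, variance, and IQR:

# Quantiles (here the quartiles: 25th, 50th, and 75th percentiles)
quantiles <- quantile(data$your_variable, probs = c(0.25, 0.5, 0.75))
print(quantiles)

# Variance (the square of the standard deviation)
var_value <- var(data$your_variable)
print(var_value)

# IQR (third quartile minus first quartile)
iqr_value <- IQR(data$your_variable)
print(iqr_value)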
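
To see how these measures relate to one another, the sketch below runs them on the integers 1 to 9, a made-up example where the results are easy to verify by hand. It relies on R's default quantile method.

```r
x <- 1:9   # illustrative values

# Quartiles: 3, 5, and 7
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
q

# The IQR is the third quartile minus the first
IQR(x)                              # 4
unname(q["75%"] - q["25%"])         # 4

# The variance is the square of the standard deviation
var(x)                              # 7.5
all.equal(var(x), sd(x)^2)          # TRUE

# Any percentile can be requested, for example the 90th
quantile(x, probs = 0.9)            # 8.2
```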
