Reading Excel Files in R
Excel files remain a common data format you’ll encounter. The readxl package handles both .xls and .xlsx formats efficiently without external dependencies, making it the standard choice for Excel import in R.
Installation
Install the package once per R environment:
install.packages("readxl")
Load it in your session or script:
library(readxl)
Basic Import
The read_excel() function handles both xls and xlsx formats automatically:
data <- read_excel("path/to/file.xlsx")
The function returns a tibble by default. To convert to a data frame if needed:
data <- as.data.frame(read_excel("path/to/file.xlsx"))
Working with Multiple Sheets
Excel workbooks often contain multiple sheets. Specify which sheet to read by name or position:
# By sheet name
budget_data <- read_excel("financials.xlsx", sheet = "Budget")
# By position (1-indexed)
first_sheet <- read_excel("financials.xlsx", sheet = 1)
second_sheet <- read_excel("financials.xlsx", sheet = 2)
# List all sheet names
excel_sheets("financials.xlsx")
Column Type Control
By default, read_excel() guesses column types. You can specify them explicitly to avoid type conversion issues:
data <- read_excel(
"sales.xlsx",
col_types = c("text", "numeric", "date", "logical", "skip")
)
Available types: “skip”, “guess”, “logical”, “numeric”, “date”, “text”, “list”.
The “skip” type excludes a column from the import entirely, useful for ignoring unwanted columns.
Handling Ranges and Headers
Read a specific cell range:
# Read range B3:D10
data <- read_excel("report.xlsx", range = "B3:D10")
# Include sheet name in range
data <- read_excel("workbook.xlsx", range = "Budget!B2:G14")
Control column naming:
# Use first row as column names (default)
data <- read_excel("file.xlsx", col_names = TRUE)
# No column names, use X1, X2, X3 etc.
data <- read_excel("file.xlsx", col_names = FALSE)
# Provide custom column names
data <- read_excel("file.xlsx", col_names = c("ID", "Name", "Score"))
Skipping Rows and Limiting Data
Skip header rows before actual data:
# Skip first 5 rows before reading
data <- read_excel("file.xlsx", skip = 5)
Limit the number of rows read:
# Read only first 1000 data rows
data <- read_excel("file.xlsx", n_max = 1000)
Missing Values
Specify how missing values are represented in the file:
# Treat "NA", "N/A", and empty cells as missing
data <- read_excel("file.xlsx", na = c("", "NA", "N/A"))
Whitespace Handling
Control leading/trailing whitespace:
# Strip whitespace (default)
data <- read_excel("file.xlsx", trim_ws = TRUE)
# Keep whitespace
data <- read_excel("file.xlsx", trim_ws = FALSE)
Column Name Repair
Handle problematic column names automatically:
# Default: ensures unique, non-empty names
data <- read_excel("file.xlsx", .name_repair = "unique")
# Alternative strategies
data <- read_excel("file.xlsx", .name_repair = "minimal")
Interactive File Selection
Choose a file via GUI dialog:
data <- read_excel(file.choose())
RStudio will open a file browser for selection.
Complete Example
A practical workflow combining multiple features:
library(readxl)
# Check available sheets
sheets <- excel_sheets("quarterly_report.xlsx")
print(sheets)
# Read specific sheet with type control and row limit
sales <- read_excel(
"quarterly_report.xlsx",
sheet = "Sales",
col_types = c("text", "date", "numeric", "numeric", "skip"),
skip = 2,
n_max = 500,
na = c("", "N/A", "-")
)
# View first few rows
head(sales)
# Check structure
str(sales)
Performance Considerations
For large files (100K+ rows), read_excel() performs adequately but consider:
- Use
col_typesexplicitly to avoid type guessing overhead - Use
n_maxto read in chunks if processing the entire file isn’t necessary - Use
rangeto import only needed cell ranges - For repeated large imports, consider exporting data to CSV or Parquet for faster loading
Alternatives
If you need more advanced Excel manipulation (writing, modifying sheets), use openxlsx:
library(openxlsx)
write.xlsx(data, "output.xlsx")
For reading from remote URLs or complex scenarios, readxlsx and other packages exist, but readxl covers the vast majority of import use cases efficiently.
