Reading Excel xls/xlsx Files in R

Many useful data are stored in Excel xls or xlsx format files. R is good at doing analysis on top of data. We would then need to import the data to R by reading the xls/xlsx files to be R data frames. In this post, we check how to import data from Excel files to R programs.

The package we will use for reading Excel files is readxl.

Install readxl package in R

If the R environment used does not have the readxl package yet, we need to install the package first by running the following R statements in the R environment, such as in an RStudio.

install.packages("readxl")

Reading Excel files

After the readxl package is installed, we can load it in our R program/session.

library("readxl")

Then we can read Excel xls/xlsx files using the “readxl” package’s read_excel function in R.

Import xls file

To read an xls file /path/to/data.xls, use

xlsdata <- read_excel("/path/to/data.xls")

Import xlsx file

To read an xlsx file /path/to/data.xlsx, use

xlsdata <- read_excel("/path/to/data.xlsx")

Import an xls/xlsx file chosen from GUI dialog

We can interactively choose the file to be read and imported to R by using the file.choose() function.

mydata <- read_excel(file.choose())

A file chosen dialog (using RStudio) will open when this statement is executed to choose an Excel file as input.

Import a specific sheet from an Excel file

An Excel file may have multiple sheets. The read_excel function also has options to specify which sheet to read.

We can choose the sheet by its name (e.g. with a sheet named “SheetName”):

mydata <- read_excel(file.choose(), sheet = "SheetName")

Or choose the sheet by its index

mydata <- read_excel(file.choose(), sheet = 2)

Reference: read_excel() usage

For a full reference, the usage of the read_excel() function usage and the possible options are as follows.

Usage

read_excel(path, sheet = NULL, range = NULL, col_names = TRUE,
  col_types = NULL, na = "", trim_ws = TRUE, skip = 0,
  n_max = Inf, guess_max = min(1000, n_max),
  progress = readxl_progress(), .name_repair = "unique")

Arguments

path
Path to the xls/xlsx file.

sheet
Sheet to read. Either a string (the name of a sheet), or an integer (the position of the sheet). Ignored if the sheet is specified via range. If neither argument specifies the sheet, defaults to the first sheet.

range
A cell range to read from, as described in cell-specification. Includes typical Excel ranges like “B3:D87”, possibly including the sheet name like “Budget!B2:G14”, and more. Interpreted strictly, even if the range forces the inclusion of leading or trailing empty rows or columns. Takes precedence over skip, n_max and sheet.

col_names
TRUE to use the first row as column names, FALSE to get default names, or a character vector giving a name for each column. If user provides col_types as a vector, col_names can have one entry per column, i.e. have the same length as col_types, or one entry per unskipped column.

col_types
Either NULL to guess all from the spreadsheet or a character vector containing one entry per column from these options: “skip”, “guess”, “logical”, “numeric”, “date”, “text” or “list”. If exactly one col_type is specified, it will be recycled. The content of a cell in a skipped column is never read and that column will not appear in the data frame output. A list cell loads a column as a list of length 1 vectors, which are typed using the type guessing logic from col_types = NULL, but on a cell-by-cell basis.

na
Character vector of strings to interpret as missing values. By default, readxl treats blank cells as missing data.

trim_ws
Should leading and trailing whitespace be trimmed?

skip
Minimum number of rows to skip before reading anything, be it column names or data. Leading empty rows are automatically skipped, so this is a lower bound. Ignored if range is given.

n_max
Maximum number of data rows to read. Trailing empty rows are automatically skipped, so this is an upper bound on the number of rows in the returned tibble. Ignored if range is given.

guess_max
Maximum number of data rows to use for guessing column types.

progress
Display a progress spinner? By default, the spinner appears only in an interactive session, outside the context of knitting a document, and when the call is likely to run for several seconds or more. See readxl_progress() for more details.

.name_repair
Handling of column names. By default, readxl ensures column names are not empty and are unique. If the tibble package version is recent enough, there is full support for .name_repair as documented in tibble::tibble(). If an older version of tibble is present, readxl falls back to name repair in the style of tibble v1.4.2.

Leave a Reply

Your email address will not be published. Required fields are marked *