Many useful data are stored in Excel xls or xlsx format files. R is good at doing analysis on top of data. We would then need to import the data to R by reading the xls/xlsx files to be R data frames. In this post, we check how to import data from Excel files to R programs.
The package we will use for reading Excel files is
readxl package in R
If the R environment used does not have the
readxl package yet, we need to install the package first by running the following R statements in the R environment, such as in an RStudio.
Reading Excel files
readxl package is installed, we can load it in our R program/session.
Then we can read Excel xls/xlsx files using the “readxl” package’s
read_excel function in R.
Import xls file
To read an xls file
xlsdata <- read_excel("/path/to/data.xls")
Import xlsx file
To read an xlsx file
xlsdata <- read_excel("/path/to/data.xlsx")
Import an xls/xlsx file chosen from GUI dialog
We can interactively choose the file to be read and imported to R by using the
mydata <- read_excel(file.choose())
A file chosen dialog (using RStudio) will open when this statement is executed to choose an Excel file as input.
Import a specific sheet from an Excel file
An Excel file may have multiple sheets. The
read_excel function also has options to specify which sheet to read.
We can choose the sheet by its name (e.g. with a sheet named “SheetName”):
mydata <- read_excel(file.choose(), sheet = "SheetName")
Or choose the sheet by its index
mydata <- read_excel(file.choose(), sheet = 2)
For a full reference, the usage of the
read_excel() function usage and the possible options are as follows.
read_excel(path, sheet = NULL, range = NULL, col_names = TRUE, col_types = NULL, na = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max), progress = readxl_progress(), .name_repair = "unique")
Path to the xls/xlsx file.
Sheet to read. Either a string (the name of a sheet), or an integer (the position of the sheet). Ignored if the sheet is specified via range. If neither argument specifies the sheet, defaults to the first sheet.
A cell range to read from, as described in cell-specification. Includes typical Excel ranges like “B3:D87”, possibly including the sheet name like “Budget!B2:G14”, and more. Interpreted strictly, even if the range forces the inclusion of leading or trailing empty rows or columns. Takes precedence over skip, n_max and sheet.
TRUE to use the first row as column names, FALSE to get default names, or a character vector giving a name for each column. If user provides col_types as a vector, col_names can have one entry per column, i.e. have the same length as col_types, or one entry per unskipped column.
Either NULL to guess all from the spreadsheet or a character vector containing one entry per column from these options: “skip”, “guess”, “logical”, “numeric”, “date”, “text” or “list”. If exactly one col_type is specified, it will be recycled. The content of a cell in a skipped column is never read and that column will not appear in the data frame output. A list cell loads a column as a list of length 1 vectors, which are typed using the type guessing logic from col_types = NULL, but on a cell-by-cell basis.
Character vector of strings to interpret as missing values. By default, readxl treats blank cells as missing data.
Should leading and trailing whitespace be trimmed?
Minimum number of rows to skip before reading anything, be it column names or data. Leading empty rows are automatically skipped, so this is a lower bound. Ignored if range is given.
Maximum number of data rows to read. Trailing empty rows are automatically skipped, so this is an upper bound on the number of rows in the returned tibble. Ignored if range is given.
Maximum number of data rows to use for guessing column types.
Display a progress spinner? By default, the spinner appears only in an interactive session, outside the context of knitting a document, and when the call is likely to run for several seconds or more. See readxl_progress() for more details.
Handling of column names. By default, readxl ensures column names are not empty and are unique. If the tibble package version is recent enough, there is full support for .name_repair as documented in tibble::tibble(). If an older version of tibble is present, readxl falls back to name repair in the style of tibble v1.4.2.