When was the last time you used stats::reshape? Probably never, in the presence of {tidyr}’s pivot_* and {data.table}’s melt/dcast reshaping (aka pivoting) functions.
A while ago, while perusing through the {data.table} vignettes, I learned about the reshape function. The {data.table} core team described it as “It is an extremely useful and often underrated function.”. As a member of the “dependency-frets” (as reshape is part of base R), I’ve been using it fairly since. However, it has some annoyances that might lead you to think that the function isn’t yet quite “fully ripe”, which could explain its “underratting”.
First, it’s very slow; no further comment is needed, but that could be a deal breaker for some.1
Args of the reshape function are also a bit cryptic. I have to hit ?reshape time and again because it is not easy to remember what is what. What (or not) to pass to varying (a vector or a list (or a matrix)) and how to order its values? Like varying=c("gdp_2020", "gdp_2021", "pop_2020", "pop_2021"), or c("gdp_2020", "pop_2020", "gdp_2021", "pop_2021")? What does v.names (and frankly most args) do in reshape to wide vs. long? Would not it be helpful if reshape was broken down into two separate functions (reshape_{long,wide})?2 Also it is hard to tell which args are optional and which get ignored silently even when you explicitly provide them.3
Futhermore, you have to “nicely” format your column names to unlock reshape to long’s potential: that is use a sep (defaults to .) between the time varying variables and the timevar (like gdp.2020, pop.2020).
Warning
Do you really think about renaming your columns when reshaping? Don’t! 🙃
Even if you “nicely” format your column names say with sep="_", your’re doomed if you have “_” somewhere in the column names as you can’t use perl-compatible regex to force the location of the sep (e.g., you have gdp_ppp_2020 and you’re not allowed to use sep="_(?=\\d{4})" or split(regexp = "_(?=\\d{4})", include = FALSE, fixed = FALSE)). If you look at the source, this could have been done as split and regexpr (where sep and split are used) can handle perl-compatible regex with perl = TRUE.
While reshape is fundamentally built to deal with panel data, it does not handle “unbalanced” (panel) data. A special case that traps reshape (to long) is when a variable (say population size) is available for 2019 but missing for 2020 (for all units), while another variable (GDP) is measured for both 2019 and 2020 (for some or all units).4 You may think this is a useless edge scenario, but both {data.table} and {tidyr} support it. This rare edge case may frequently creep into typical use cases, doesn’t it?
Reshaping with stats::reshape
Let’s quickly look at how we can use reshape to convert data from wide to long format and vice versa and touch on some of the issues along the way, with sample datasets from the World Development Indicators (WDI) database. The WDI datasets provide a perfect use case for reshaping because you can download the data in wide or long format.
# import data (source WBG WDI)wdi_data_wide =read.csv("wdi_data_wide.csv")wdi_data_long =read.csv("wdi_data_long.csv")# constantsdata_url ="https://databank.worldbank.org/id/8dfced2d"varying_patt ="^((?:pop|gdp).*)[_](\\d{4})$"# regex for measure varsvarying_patt_vec =c("^pop", "^gdp")
About the sample WDI datasets
I downloaded the data in two formats: “table” and “list”, which roughly correspond to wide and long formats, respectively. The data contains information on 2 indicators (“Population, total” and “GDP per capita, PPP (constant 2021 international $)”), for countries Korea, Rep., Ethiopia, and Germany for the years 2019-2020. Here is a direct link to the original dataset.
To pivot this into a long format, we just collect the suffixed years in the (series) columns into a single column “year”, and each series makes (or gets collapsed into) a single column.
Reshape to long with other packages
IMO, all three solutions are straighforward. We get a proper long format, a row represents a record for a country at a point-in-time, that is, this is in a typical panel (repeated-crossection or longitudinal) data format.
duckdb<< EOFcreate or replace table wdi_data_wide asfrom read_csv('wdi_data_wide.csv');set variable measure_patt = '^([a-z_]+)_(\d{4})$';.mode markdownwith unpivoted as ( select country, regexp_extract(column_name, getvariable('measure_patt'), 2) as year, regexp_extract(column_name, getvariable('measure_patt'), 1) as measure, column_value as value from ( unpivot wdi_data_wide on COLUMNS(getvariable('measure_patt')) into name column_name value column_value ))pivot unpivotedon measureusing first(value)order by country, year;EOF
According to the doc, the easiest way to reshape to long with reshape is by using “nicely” formatted column names: that is, use a sep (which defaults to .) between the time varying variables and the timevar, like gdp_pc_ppp.2020 and pop_total.2020. This nice formatting has the added benefit that v.names (gdp_pc_ppp, pop_total) and times (i.e., the values for timevar) (2020), will be auto-guessed; otherwise, we need to supply them ourselves (which is error prone).
### Typical usage for converting from wide to long format:### If names of wide-format variables are in a 'nice' format# reshape(data, direction = "long",# varying = c(___), # vector# sep) # to help guess 'v.names' and 'times'
So let’s rename our varying columns using a special sep and try reshaping to long as follows:
Code
sep ="."# let's replace "_" (just before the year) with the sep to "nicely" format the column namesnicely_format_names =function(nms) {sub("_(?=\\d)", sep, nms, perl =TRUE)}nice_nms =nicely_format_names(names(wdi_data_wide))wdi_data_wide_long1 =reshape(setNames(wdi_data_wide, nice_nms),direction ="long",varying =grepv(sprintf("^[a-z_]+%s\\d{4}$", sep), nice_nms),idvar ="country", # if ommitted, an `id` column will be added `seq_along(data)`timevar ="year", # otherwise `time`sep = sep # the default is `.`)wdi_data_wide_long1 |>head()
That is a hell of unnatural detour for reshaping. What if we want to go along with our column names? Enter explicit-specification:
### Typical usage for converting from wide to long format:# ...### To specify long-format variable names explicitly# reshape(data, direction = "long",# varying = ___, # list / matrix / vector (use with care)# v.names = ___, # vector of variable names in long format# timevar, times, # name / values of constructed time variable# idvar, ids) # name / values of constructed id variable
That is very verbose and error prone: (1) we need to specify varying “with care” (choosing the list form), as urged by the doc, (2) the values of timevar (named as year) which are controlled by times will be auto-generated instead of the actual values we see from the data, had we not provided them like so times = c("2019", "2020"), (3) not only do we need to know the unique values of timevar, but also their correct order.
Importantly, were we should note that we provided a list to varying. What if you accidentially gave a vector like c(c("pop_total_2019", "pop_total_2020"), c("gdp_pc_ppp_2019", "gdp_pc_ppp_2020"))? Well, that would be wrong because if varying takes on a vector, we should rather provide it in this order c("pop_total_2019", "gdp_pc_ppp_2019", "pop_total_2020", "gdp_pc_ppp_2020").
The correct order requires all columns corresponding to the same time to be contiguous; this is the same intrinsic column-major ordering in the matrix [ or list] […] form. It is best to avoid the atomic vector form of varying unless v.names is being omitted.
— Using the reshape function (The R Core Team, April 2025)
Looks like passing a list to varying is safe and straighforward when it comes to the order of the time-varying variables. But as we saw we need to manually define the names/values of several variables (v.names, timevar, times). Before abandoning passing a vector for varying, since we actually want names/values auto-guessed, what if we tell reshape to split at sep = "_"? However, since our varying variables have _, this would be impossible to do without a perl-compatible regex option, which reshape hides from us. If you edit the function definition of reshape to allow for this (i.e, add perl = TRUE), we can use sep = "_(?=\\d{4})" and get the same result as the nicely formatted naming option above.
We can also play with split to provide a custom sep (regexp) as shown below but the names will have trailing _s (because include = FALSE fails), but we need to pass a vector of names for varying (to take advantage of auto-guessing, so much complication!).
I would argue that this is a better approach (than the nicely renaming columns we saw above) and is more flexible and usable: we can force the location of the split without perl-compatible regex, and have auto-guessing of names/values.
# A tibble: 6 × 4
country year pop_total gdp_pc_ppp
<chr> <chr> <int> <dbl>
1 Korea, Rep. 2019 51764822 46757.
2 Korea, Rep. 2020 51836239 NA
3 Ethiopia 2019 115737383 2438.
4 Ethiopia 2020 118917671 NA
5 Germany 2019 83092962 62971.
6 Germany 2020 83160871 NA
Code
# duckdb solutionduck_cmd = r"{duckdb << EOF create or replace table wdi_data_wide as from read_csv('wdi_data_wide.csv'); set variable measure_patt = '^([a-z_]+)_(\d{4})$'; pivot ( select country, regexp_extract(column_name, getvariable('measure_patt'), 2) as year, regexp_extract(column_name, getvariable('measure_patt'), 1) as measure, column_value as value from ( unpivot wdi_data_wide on COLUMNS( * exclude(country, gdp_pc_ppp_2020)) into name column_name value column_value ) ) on measure using first(value) order by country, year;EOF}"cat(system(duck_cmd, intern =TRUE), sep ="\n")
Reshaping from long to wide is supposed to be straightforward. However, reshape handles one timevar at a time. This is because reshape is designed to fundamentally deal with longitudinal data. However, the sample long dataset (see Table 1 (b)) is not quite in that form as we have series and year as timevars. So we need to transform the dataset into a longitudinal form (country, year, pop_total, gdp_pc_ppp), which means we reshape to wide with timevar="series". Then, we reshape this again with timevar="year". Thus, the solution involves reshaping twice along the two dimensions (series, year). However, data.table::dcast, tidyr::pivot_wider, and duckdb::pivot handle this in one step.
See https://duckdb.org/docs/stable/sql/statements/pivot
Code
duckdb<< EOFcreate or replace table wdi_data_long asfrom read_csv('wdi_data_long.csv');.mode markdownpivot wdi_data_long-- with a custom sep-- on series || '.' || yearon series, yearusing first(value);EOF
To conclude, this is not a diss at reshape. In fact, it is great even with the abovementioned idiosyncriasies, if we use it with care. But, I hope it gets better and more usable in future versions of R.
When you download the data in “table” format, it is kinda in wide format (because the indicators are stacked in the Series Name column), as shown in Table 1. However, I cleaned (including shortening the names of the series) the original dataset to transform it into “true” wide format shown in Table 2.↩︎