Underwhelmed by stats::reshape

r
base-r
tutorial
programming
Author
Published

August 11, 2025

When was the last time you used stats::reshape? Probably never, in the presence of {tidyr}’s pivot_* and {data.table}’s melt/dcast reshaping (aka pivoting) functions.

A while ago, while perusing through the {data.table} vignettes, I learned about the reshape function. The {data.table} core team described it as “It is an extremely useful and often underrated function.”. As a member of the “dependency-frets” (as reshape is part of base R), I’ve been using it fairly since. However, it has some annoyances that might lead you to think that the function isn’t yet quite “fully ripe”, which could explain its “underratting”.

First, it’s very slow; no further comment is needed, but that could be a deal breaker for some.1

Args of the reshape function are also a bit cryptic. I have to hit ?reshape time and again because it is not easy to remember what is what. What (or not) to pass to varying (a vector or a list (or a matrix)) and how to order its values? Like varying=c("gdp_2020", "gdp_2021", "pop_2020", "pop_2021"), or c("gdp_2020", "pop_2020", "gdp_2021", "pop_2021")? What does v.names (and frankly most args) do in reshape to wide vs. long? Would not it be helpful if reshape was broken down into two separate functions (reshape_{long,wide})?2 Also it is hard to tell which args are optional and which get ignored silently even when you explicitly provide them.3

Futhermore, you have to “nicely” format your column names to unlock reshape to long’s potential: that is use a sep (defaults to .) between the time varying variables and the timevar (like gdp.2020, pop.2020).

Warning

Do you really think about renaming your columns when reshaping? Don’t! 🙃

Even if you “nicely” format your column names say with sep="_", your’re doomed if you have “_” somewhere in the column names as you can’t use perl-compatible regex to force the location of the sep (e.g., you have gdp_ppp_2020 and you’re not allowed to use sep="_(?=\\d{4})" or split(regexp = "_(?=\\d{4})", include = FALSE, fixed = FALSE)). If you look at the source, this could have been done as split and regexpr (where sep and split are used) can handle perl-compatible regex with perl = TRUE.

While reshape is fundamentally built to deal with panel data, it does not handle “unbalanced” (panel) data. A special case that traps reshape (to long) is when a variable (say population size) is available for 2019 but missing for 2020 (for all units), while another variable (GDP) is measured for both 2019 and 2020 (for some or all units).4 You may think this is a useless edge scenario, but both {data.table} and {tidyr} support it. This rare edge case may frequently creep into typical use cases, doesn’t it?

Reshaping with stats::reshape

Let’s quickly look at how we can use reshape to convert data from wide to long format and vice versa and touch on some of the issues along the way, with sample datasets from the World Development Indicators (WDI) database. The WDI datasets provide a perfect use case for reshaping because you can download the data in wide or long format.

Tip

The official documentation and vignette for reshape are great. I also find this blog post by Jozef Hajnala excellent.

Code
# import data (source WBG WDI)
wdi_data_wide = read.csv("wdi_data_wide.csv")
wdi_data_long = read.csv("wdi_data_long.csv")

# constants
data_url = "https://databank.worldbank.org/id/8dfced2d"
varying_patt = "^((?:pop|gdp).*)[_](\\d{4})$" # regex for measure vars
varying_patt_vec = c("^pop", "^gdp")

I downloaded the data in two formats: “table” and “list”, which roughly correspond to wide and long formats, respectively. The data contains information on 2 indicators (“Population, total” and “GDP per capita, PPP (constant 2021 international $)”), for countries Korea, Rep., Ethiopia, and Germany for the years 2019-2020. Here is a direct link to the original dataset.

Code
read.csv("sample_wdi_data_table.csv", check.names = FALSE) |>
  head() |>
  knitr::kable()
read.csv("sample_wdi_data_list.csv", check.names = FALSE) |>
  head() |>
  knitr::kable()
Table 1: Sample Dataset from WDI Database (Raw)
(a) Head of the data in table (wide) format
Country Name Series Name 2019 2020
Korea, Rep. Population, total 51764822.00 5.183624e+07
Korea, Rep. GDP per capita, PPP (constant 2021 international $) 46757.26 4.636160e+04
Ethiopia Population, total 115737383.00 1.189177e+08
Ethiopia GDP per capita, PPP (constant 2021 international $) 2437.97 2.516549e+03
Germany Population, total 83092962.00 8.316087e+07
Germany GDP per capita, PPP (constant 2021 international $) 62970.72 6.034266e+04
(b) Head of the data in list (long) format
Country Name Series Name Time Value
Korea, Rep. Population, total 2019 51764822.00
Korea, Rep. Population, total 2020 51836239.00
Korea, Rep. GDP per capita, PPP (constant 2021 international $) 2019 46757.26
Korea, Rep. GDP per capita, PPP (constant 2021 international $) 2020 46361.60
Ethiopia Population, total 2019 115737383.00
Ethiopia Population, total 2020 118917671.00

Reshape to long: reshape(..., direction = "long")

The dataset in wide format looks like:5

Table 2: The data in true wide format (cleaned)
country pop_total_2019 pop_total_2020 gdp_pc_ppp_2019 gdp_pc_ppp_2020
Korea, Rep. 51764822 51836239 46757.26 46361.596
Ethiopia 115737383 118917671 2437.97 2516.549
Germany 83092962 83160871 62970.72 60342.663

To pivot this into a long format, we just collect the suffixed years in the (series) columns into a single column “year”, and each series makes (or gets collapsed into) a single column.

IMO, all three solutions are straighforward. We get a proper long format, a row represents a record for a country at a point-in-time, that is, this is in a typical panel (repeated-crossection or longitudinal) data format.

Code
wdi_data_wide |>
as.data.table() |>
melt(
  measure.vars = measure(value.name, year, pattern = "^([a-z_]+)_(\\d{4})$"),
  id.vars = "country"
)
       country   year pop_total gdp_pc_ppp
        <char> <char>     <int>      <num>
1: Korea, Rep.   2019  51764822  46757.261
2:    Ethiopia   2019 115737383   2437.970
3:     Germany   2019  83092962  62970.716
4: Korea, Rep.   2020  51836239  46361.596
5:    Ethiopia   2020 118917671   2516.549
6:     Germany   2020  83160871  60342.663
Code
wdi_data_wide |>
  tidyr::pivot_longer(
    cols = -country,
    names_pattern = "^([a-z_]+)_(\\d{4})$",
    names_to = c(".value", "year")
  )
# A tibble: 6 × 4
  country     year  pop_total gdp_pc_ppp
  <chr>       <chr>     <int>      <dbl>
1 Korea, Rep. 2019   51764822     46757.
2 Korea, Rep. 2020   51836239     46362.
3 Ethiopia    2019  115737383      2438.
4 Ethiopia    2020  118917671      2517.
5 Germany     2019   83092962     62971.
6 Germany     2020   83160871     60343.
Code
duckdb << EOF
create or replace table wdi_data_wide as
from read_csv('wdi_data_wide.csv');

set variable measure_patt = '^([a-z_]+)_(\d{4})$';

.mode markdown
with unpivoted as (
  select
    country,
    regexp_extract(column_name, getvariable('measure_patt'), 2) as year,
    regexp_extract(column_name, getvariable('measure_patt'), 1) as measure,
    column_value as value
  from (
    unpivot wdi_data_wide
    on COLUMNS(getvariable('measure_patt'))
    into
      name column_name
      value column_value
  )
)
pivot unpivoted
on measure
using first(value)
order by country, year;
EOF
|   country   | year |    gdp_pc_ppp    |  pop_total  |
|-------------|------|-----------------:|------------:|
| Ethiopia    | 2019 | 2437.97046055952 | 115737383.0 |
| Ethiopia    | 2020 | 2516.54907642544 | 118917671.0 |
| Germany     | 2019 | 62970.7160913279 | 83092962.0  |
| Germany     | 2020 | 60342.6627037391 | 83160871.0  |
| Korea, Rep. | 2019 | 46757.2614496182 | 51764822.0  |
| Korea, Rep. | 2020 | 46361.5957811718 | 51836239.0  |

According to the doc, the easiest way to reshape to long with reshape is by using “nicely” formatted column names: that is, use a sep (which defaults to .) between the time varying variables and the timevar, like gdp_pc_ppp.2020 and pop_total.2020. This nice formatting has the added benefit that v.names (gdp_pc_ppp, pop_total) and times (i.e., the values for timevar) (2020), will be auto-guessed; otherwise, we need to supply them ourselves (which is error prone).

### Typical usage for converting from wide to long format:

### If names of wide-format variables are in a 'nice' format

# reshape(data, direction = "long",
#         varying = c(___), # vector
#         sep)              # to help guess 'v.names' and 'times'

So let’s rename our varying columns using a special sep and try reshaping to long as follows:

Code
sep = "."
# let's replace "_" (just before the year) with the sep to "nicely" format the column names
nicely_format_names = function(nms) {
  sub("_(?=\\d)", sep, nms, perl = TRUE)
}

nice_nms = nicely_format_names(names(wdi_data_wide))

wdi_data_wide_long1 = reshape(
  setNames(wdi_data_wide, nice_nms),
  direction = "long",
  varying = grepv(sprintf("^[a-z_]+%s\\d{4}$", sep), nice_nms),
  idvar = "country", # if ommitted, an `id` column will be added `seq_along(data)`
  timevar = "year", # otherwise `time`
  sep = sep # the default is `.`
)

wdi_data_wide_long1 |> head()
                     country year pop_total gdp_pc_ppp
Korea, Rep..2019 Korea, Rep. 2019  51764822  46757.261
Ethiopia.2019       Ethiopia 2019 115737383   2437.970
Germany.2019         Germany 2019  83092962  62970.716
Korea, Rep..2020 Korea, Rep. 2020  51836239  46361.596
Ethiopia.2020       Ethiopia 2020 118917671   2516.549
Germany.2020         Germany 2020  83160871  60342.663

That is a hell of unnatural detour for reshaping. What if we want to go along with our column names? Enter explicit-specification:

### Typical usage for converting from wide to long format:

# ...

### To specify long-format variable names explicitly

# reshape(data, direction = "long",
#         varying = ___,  # list / matrix / vector (use with care)
#         v.names = ___,  # vector of variable names in long format
#         timevar, times, # name / values of constructed time variable
#         idvar, ids)     # name / values of constructed id variable
Code
wdi_data_wide_long2 = wdi_data_wide |>
  reshape(
    direction = "long",
    # varying = lapply(c("^pop", "^gdp"), \(p) grepv(p, names(wdi_data_wide))),
    varying = list(
      c("pop_total_2019", "pop_total_2020"),
      c("gdp_pc_ppp_2019", "gdp_pc_ppp_2020")
    ),
    v.names = c("pop", "gdp_pc_ppp"),
    idvar = "country", # otherwise column `ids` will be created
    timevar = "year", # otherwise "time"
    times = c("2019", "2020") # otherwise auto-generated
  )

wdi_data_wide_long2 |> head(10)
                     country year       pop gdp_pc_ppp
Korea, Rep..2019 Korea, Rep. 2019  51764822  46757.261
Ethiopia.2019       Ethiopia 2019 115737383   2437.970
Germany.2019         Germany 2019  83092962  62970.716
Korea, Rep..2020 Korea, Rep. 2020  51836239  46361.596
Ethiopia.2020       Ethiopia 2020 118917671   2516.549
Germany.2020         Germany 2020  83160871  60342.663

That is very verbose and error prone: (1) we need to specify varying “with care” (choosing the list form), as urged by the doc, (2) the values of timevar (named as year) which are controlled by times will be auto-generated instead of the actual values we see from the data, had we not provided them like so times = c("2019", "2020"), (3) not only do we need to know the unique values of timevar, but also their correct order.

Importantly, were we should note that we provided a list to varying. What if you accidentially gave a vector like c(c("pop_total_2019", "pop_total_2020"), c("gdp_pc_ppp_2019", "gdp_pc_ppp_2020"))? Well, that would be wrong because if varying takes on a vector, we should rather provide it in this order c("pop_total_2019", "gdp_pc_ppp_2019", "pop_total_2020", "gdp_pc_ppp_2020").

The correct order requires all columns corresponding to the same time to be contiguous; this is the same intrinsic column-major ordering in the matrix [ or list] […] form. It is best to avoid the atomic vector form of varying unless v.names is being omitted.
Using the reshape function (The R Core Team, April 2025)

Looks like passing a list to varying is safe and straighforward when it comes to the order of the time-varying variables. But as we saw we need to manually define the names/values of several variables (v.names, timevar, times). Before abandoning passing a vector for varying, since we actually want names/values auto-guessed, what if we tell reshape to split at sep = "_"? However, since our varying variables have _, this would be impossible to do without a perl-compatible regex option, which reshape hides from us. If you edit the function definition of reshape to allow for this (i.e, add perl = TRUE), we can use sep = "_(?=\\d{4})" and get the same result as the nicely formatted naming option above.

We can also play with split to provide a custom sep (regexp) as shown below but the names will have trailing _s (because include = FALSE fails), but we need to pass a vector of names for varying (to take advantage of auto-guessing, so much complication!).

Code
wdi_data_wide_long3 = wdi_data_wide |>
  reshape(
    direction = "long",
    varying = c("pop_total_2019", "gdp_pc_ppp_2019", "pop_total_2020",  "gdp_pc_ppp_2020"),
    timevar = "year",
    idvar = "country",
    split = list(regexp = "_\\d{4}$", include = TRUE, fixed = FALSE)
  )

wdi_data_wide_long3 |> head(10)
                     country year pop_total_ gdp_pc_ppp_
Korea, Rep..2019 Korea, Rep. 2019   51764822   46757.261
Ethiopia.2019       Ethiopia 2019  115737383    2437.970
Germany.2019         Germany 2019   83092962   62970.716
Korea, Rep..2020 Korea, Rep. 2020   51836239   46361.596
Ethiopia.2020       Ethiopia 2020  118917671    2516.549
Germany.2020         Germany 2020   83160871   60342.663

I would argue that this is a better approach (than the nicely renaming columns we saw above) and is more flexible and usable: we can force the location of the split without perl-compatible regex, and have auto-guessing of names/values.

Code
wdi_data_wide_unbalanced = wdi_data_wide[, setdiff(names(wdi_data_wide), "gdp_pc_ppp_2020")]

wdi_data_wide_unbalanced |>
  reshape(
    direction = "long",
    varying = grepv("^[a-z_]+_\\d{4}$", names(wdi_data_wide_unbalanced)),
    timevar = "year",
    idvar = "country",
    split = list(regexp = "_\\d{4}$", include = TRUE, fixed = FALSE)
  )
Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying, : 'varying' arguments must be the same length
Code
# data.table solution
wdi_data_wide_unbalanced |>
  as.data.table() |>
  melt(
    id.vars = "country",
    measure.vars = measure(value.name, year, pattern = "^([a-z_]+)_(\\d{4})$")
  )
       country   year pop_total gdp_pc_ppp
        <char> <char>     <int>      <num>
1: Korea, Rep.   2019  51764822   46757.26
2:    Ethiopia   2019 115737383    2437.97
3:     Germany   2019  83092962   62970.72
4: Korea, Rep.   2020  51836239         NA
5:    Ethiopia   2020 118917671         NA
6:     Germany   2020  83160871         NA
Code
# tidyr solution
wdi_data_wide_unbalanced |>
  tidyr::pivot_longer(
    cols = -country,
    names_pattern = "^([a-z_]+)_(\\d{4})$",
    names_to = c(".value", "year")
  )
# A tibble: 6 × 4
  country     year  pop_total gdp_pc_ppp
  <chr>       <chr>     <int>      <dbl>
1 Korea, Rep. 2019   51764822     46757.
2 Korea, Rep. 2020   51836239        NA 
3 Ethiopia    2019  115737383      2438.
4 Ethiopia    2020  118917671        NA 
5 Germany     2019   83092962     62971.
6 Germany     2020   83160871        NA 
Code
# duckdb solution

duck_cmd = r"{
duckdb << EOF
  create or replace table wdi_data_wide as
  from read_csv('wdi_data_wide.csv');

  set variable measure_patt = '^([a-z_]+)_(\d{4})$';

  pivot (
    select
    country,
    regexp_extract(column_name, getvariable('measure_patt'), 2) as year,
    regexp_extract(column_name, getvariable('measure_patt'), 1) as measure,
    column_value as value
    from (
      unpivot wdi_data_wide
      on COLUMNS( * exclude(country, gdp_pc_ppp_2020))
      into
        name column_name
        value column_value
    )
  )
  on measure
  using first(value)
  order by country, year;
EOF
}"

cat(system(duck_cmd, intern = TRUE), sep = "\n")
┌─────────────┬─────────┬──────────────────┬─────────────┐
│   country   │  year   │    gdp_pc_ppp    │  pop_total  │
│   varchar   │ varchar │      double      │   double    │
├─────────────┼─────────┼──────────────────┼─────────────┤
│ Ethiopia    │ 2019    │ 2437.97046055952 │ 115737383.0 │
│ Ethiopia    │ 2020    │             NULL │ 118917671.0 │
│ Germany     │ 2019    │ 62970.7160913279 │  83092962.0 │
│ Germany     │ 2020    │             NULL │  83160871.0 │
│ Korea, Rep. │ 2019    │ 46757.2614496182 │  51764822.0 │
│ Korea, Rep. │ 2020    │             NULL │  51836239.0 │
└─────────────┴─────────┴──────────────────┴─────────────┘

Reshape to wide: reshape(..., direction = "wide")

Reshaping from long to wide is supposed to be straightforward. However, reshape handles one timevar at a time. This is because reshape is designed to fundamentally deal with longitudinal data. However, the sample long dataset (see Table 1 (b)) is not quite in that form as we have series and year as timevars. So we need to transform the dataset into a longitudinal form (country, year, pop_total, gdp_pc_ppp), which means we reshape to wide with timevar="series". Then, we reshape this again with timevar="year". Thus, the solution involves reshaping twice along the two dimensions (series, year). However, data.table::dcast, tidyr::pivot_wider, and duckdb::pivot handle this in one step.

Code
# step 1 -> reshape into longitudinal
(wdi_data_longitudinal = reshape(
  wdi_data_long,
  direction = "wide",
  idvar = c("country", "year"),
  timevar = "series",
  v.names = "value",
  sep = "_"
))
       country year value_pop_total value_gdp_pc_ppp
1  Korea, Rep. 2019        51764822        46757.261
2  Korea, Rep. 2020        51836239        46361.596
5     Ethiopia 2019       115737383         2437.970
6     Ethiopia 2020       118917671         2516.549
9      Germany 2019        83092962        62970.716
10     Germany 2020        83160871        60342.663
Code
# step 2 -> wide format
wdi_data_longitudinal |>
  reshape(
    direction = "wide",
    idvar = "country",
    timevar = "year",
    sep = "_"
  ) -> wdi_data_long_wide

wdi_data_long_wide # |> setNames(sub("^value_", "", names(wdi_data_long_wide)))
      country value_pop_total_2019 value_gdp_pc_ppp_2019 value_pop_total_2020
1 Korea, Rep.             51764822              46757.26             51836239
5    Ethiopia            115737383               2437.97            118917671
9     Germany             83092962              62970.72             83160871
  value_gdp_pc_ppp_2020
1             46361.596
5              2516.549
9             60342.663
Code
wdi_data_long |>
  as.data.table() |>
  dcast(country ~ series + year)
Key: <country>
       country gdp_pc_ppp_2019 gdp_pc_ppp_2020 pop_total_2019 pop_total_2020
        <char>           <num>           <num>          <num>          <num>
1:    Ethiopia         2437.97        2516.549      115737383      118917671
2:     Germany        62970.72       60342.663       83092962       83160871
3: Korea, Rep.        46757.26       46361.596       51764822       51836239
Code
wdi_data_long |>
  tidyr::pivot_wider(names_from = c(series, year), values_from = value)
# A tibble: 3 × 5
  country     pop_total_2019 pop_total_2020 gdp_pc_ppp_2019 gdp_pc_ppp_2020
  <chr>                <dbl>          <dbl>           <dbl>           <dbl>
1 Korea, Rep.       51764822       51836239          46757.          46362.
2 Ethiopia         115737383      118917671           2438.           2517.
3 Germany           83092962       83160871          62971.          60343.

See https://duckdb.org/docs/stable/sql/statements/pivot

Code
duckdb << EOF
create or replace table wdi_data_long as
from read_csv('wdi_data_long.csv');

.mode markdown
pivot wdi_data_long
-- with a custom sep
-- on series || '.' || year
on series, year
using first(value);
EOF
|   country   | gdp_pc_ppp_2019  | gdp_pc_ppp_2020  | pop_total_2019 | pop_total_2020 |
|-------------|-----------------:|-----------------:|---------------:|---------------:|
| Korea, Rep. | 46757.2614496182 | 46361.5957811718 | 51764822.0     | 51836239.0     |
| Germany     | 62970.7160913279 | 60342.6627037391 | 83092962.0     | 83160871.0     |
| Ethiopia    | 2437.97046055952 | 2516.54907642544 | 115737383.0    | 118917671.0    |

To conclude, this is not a diss at reshape. In fact, it is great even with the abovementioned idiosyncriasies, if we use it with care. But, I hope it gets better and more usable in future versions of R.


—— Thank you for stopping by! 💚💛❤️ ——


Footnotes

  1. Consider using data.table!↩︎

  2. Oh wait, that was the case but removed at some point.↩︎

  3. For example, times is ignored if v.names is not provided in reshape to long (see here).↩︎

  4. Panel data is unbalanced when some units are not measured for at least one time period.↩︎

  5. When you download the data in “table” format, it is kinda in wide format (because the indicators are stacked in the Series Name column), as shown in Table 1. However, I cleaned (including shortening the names of the series) the original dataset to transform it into “true” wide format shown in Table 2.↩︎