ARCHIVED: How do I merge two data files in R?

This content has been archived, and is no longer maintained by Indiana University. Information here may no longer be accurate, and links may no longer be available or reliable.

Suppose you have two data files, dataset1 and dataset2, that need to be merged into a single data set. First, read both data files in R. Then, use the merge() function to join the two data sets based on a unique id variable that is common to both data sets:

  > merged.data <- merge(dataset1, dataset2, by="countryID")

merged.data is an R object, which contains the two merged data sets. The data files were joined based on the id variable countryID.

It is possible to merge data files by more than one id variable:

  > merged.data <- merge(dataset1, dataset2, by=c("regionID", "countryID"))

It is also possible to merge the two files if the unique id variable has a different name in each data set. For example, the id variable may be called countryID in dataset1, but called stateID in dataset2:

  > merged.data <- merge(dataset1, dataset2, by.x="countryID", by.y="stateID")

In this case, by.x calls the name of the id variable in dataset1, and by.y calls the name of the id variable in dataset2.

Note: The default setting of the merge() function drops all unmatched cases. If you want to keep all cases in the new data set, include the option all=TRUE in the merge() function:

  > merged.data.all <- merge(dataset1, dataset2, by="countryID", all=TRUE)

To keep unmatched cases only from dataset1, use the all.x option. Conversely, to keep unmatched cases only from dataset2, use the all.y option:

> merged.data.all <- merge(dataset1, dataset2, by="countryID", all.x=TRUE)
> merged.data.all <- merge(dataset1, dataset2, by="countryID", all.y=TRUE)

When all.x=TRUE, an extra row will be added to the output for each case in dataset1 that has no matching cases in dataset2. Cases that do not have values from dataset2 will be labeled as missing. Conversely, when all.y=TRUE, an extra row will be added to the output for each case in dataset2 that has no matching cases in dataset1. Cases that do not have values from dataset1 will be labeled as missing.

If you have questions about using statistical and mathematical software at Indiana University, contact the UITS Research Applications and Deep Learning team.