Using the US Census Geocoder in R

R
Statistics
Census
API
Using the US Census batch geocoding service to retrieve coordinates for address lists
Author

Kent Orr

Published

May 23, 2023

The US Census Bureau is kind enough to provide its geocoding service to the public. The service lets you retrieve coordinates for a given postal address and vice versa. In this post we're going to use the batch endpoint to retrieve coordinates for a list of addresses. I'll be using the curl package in R, which provides the low-level plumbing underneath the more popular and easier-to-use httr.
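As a quick aside before the batch workflow: the service also exposes a single-address endpoint that returns JSON rather than a file. A minimal sketch against the `onelineaddress` endpoint with the `Public_AR_Current` benchmark (the address here is made up for illustration):

```r
library(curl)
library(jsonlite)

# Geocode one address; this endpoint returns JSON instead of a CSV
url <- paste0(
  "https://geocoding.geo.census.gov/geocoder/locations/onelineaddress",
  "?address=", curl_escape("100 E Main St, McArthur, OH 45651"),
  "&benchmark=Public_AR_Current&format=json"
)
resp <- curl_fetch_memory(url)
matches <- fromJSON(rawToChar(resp$content))$result$addressMatches
matches$coordinates  # x is longitude, y is latitude
```

For anything beyond a handful of lookups, though, the batch endpoint below is the better fit.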

We'll start by gathering a set of addresses. We'll use the voter rolls from the Vinton County, Ohio Board of Elections.

# Example voter data structure - you would load your actual data here
voters <- data.frame(
  sosidnum = c("12345", "67890", "11111"),
  stnum = c("123", "456", "789"),
  stname = c("Main St", "Oak Ave", "Pine Rd"),
  city = c("Athens", "Nelsonville", "McArthur"),
  zip = c("45701", "45764", "45651"),
  stringsAsFactors = FALSE
)

voters$street = paste(voters$stnum, voters$stname)
voters <- voters[,c("sosidnum", "street", "city", "zip")]

for (col in c("street", "city")) {
  voters[[col]] <- stringr::str_to_title(voters[[col]])
}

head(voters)

Now that we've got some address data, let's look at the Census API documentation:

Geocoding can be accomplished in batch mode with submission of a .CSV, .TXT, .DAT, .XLS, or .XLSX formatted file. The file needs to be included as part of the HTTP request. The file must be formatted in the following way: Unique ID, Street address, City, State, ZIP If a component is missing from the dataset, it must still retain the delimited format with a null value. Unique ID and Street address are required fields. If there are commas that are part of one of the fields, the whole field needs to be enclosed in quote marks for proper parsing. There is currently an upper limit of 10,000 records per batch file.
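Concretely, a batch file following that positional layout looks like this (made-up rows; note the quoting on the street field that contains a comma):

```
1,123 Main St,Athens,OH,45701
2,456 Oak Ave,Nelsonville,OH,45764
3,"789 Pine Rd, Apt 2",McArthur,OH,45651
```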

So, we want to keep each request under 10,000 records, and each request is an uploaded CSV with those particular columns in that order. Let's put that into action with our voter registration data.

library(curl)

# Extracting relevant columns from the 'voters' dataset
to_send = voters[,c("sosidnum", "street", "city", "zip")]

# Renaming the columns
to_send = setNames(to_send, c("Unique ID", "Street Address", "City", "Zip"))

# Adding a new column for the state
to_send$State = "OH"

# Rearranging the columns
to_send = to_send[, c("Unique ID", "Street Address", "City", "State", "Zip")]

# Creating a temporary file to store the data in CSV format
t = tempfile(tmpdir = getwd(), fileext = ".csv")

# Writing the data to the CSV file
write.csv(to_send, t, row.names = FALSE)

# Creating a new handle for making a request
h <- new_handle() |>
  handle_setform(
    addressFile = form_file(t),
    benchmark = "Public_AR_Current"
  )

# Sending a POST request to geocode the addresses
x = curl_fetch_memory("https://geocoding.geo.census.gov/geocoder/locations/addressbatch", handle = h)

# Removing the temporary file
file.remove(t) |> invisible()

# Creating a new temporary file to store the response
t = tempfile(tmpdir = getwd(), fileext = ".csv")

# Writing the response content to the temporary file
x$content |> rawToChar() |> cat(file = t)

# Reading the response into the 'results' dataframe
results <- read.csv(t, header = F, col.names = c("uid", "address", "match", "match_type", "matched_address", "latlon", "tigerid", "tigerside"))

# Removing the temporary file
file.remove(t) |> invisible()

head(results)
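The request above fits comfortably in a single upload, but the 10,000-record cap means longer lists need to be split across requests. A minimal sketch of that, assuming a `geocode_batch` helper of my own invention (not a Census or curl API) that takes a data frame already in Unique ID / Street Address / City / State / Zip order:

```r
library(curl)

# Hypothetical helper: split the input into chunks of at most 10,000 rows,
# run the same batch POST against each chunk, and row-bind the results.
geocode_batch <- function(df, chunk_size = 10000) {
  chunks <- split(df, ceiling(seq_len(nrow(df)) / chunk_size))
  out <- lapply(chunks, function(chunk) {
    t <- tempfile(fileext = ".csv")
    on.exit(file.remove(t), add = TRUE)
    write.csv(chunk, t, row.names = FALSE)
    h <- new_handle() |>
      handle_setform(
        addressFile = form_file(t),
        benchmark = "Public_AR_Current"
      )
    x <- curl_fetch_memory(
      "https://geocoding.geo.census.gov/geocoder/locations/addressbatch",
      handle = h
    )
    read.csv(text = rawToChar(x$content), header = FALSE,
             col.names = c("uid", "address", "match", "match_type",
                           "matched_address", "latlon", "tigerid", "tigerside"))
  })
  do.call(rbind, out)
}

# e.g. geocode_batch(to_send) issues one request per 10,000 rows
```

A 25,000-row list would go out as three requests (10,000 + 10,000 + 5,000), which also keeps any one failure from sinking the whole job.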

And that’s that. From there we can use this data for analysis.

# Parse longitude and latitude from the results
# (the latlon field holds "longitude,latitude")
coords <- strsplit(results$latlon, ",")
results$lon <- as.numeric(sapply(coords, `[`, 1))
results$lat <- as.numeric(sapply(coords, `[`, 2))

# Visualize the geocoded addresses
library(ggplot2)
usmap::plot_usmap(regions = "counties", include = "39163") +  # Vinton County, OH FIPS
  geom_point(data = usmap::usmap_transform(results[!is.na(results$lat),]), aes(x = x, y = y)) +
  labs(title = "Vinton County")