Pawpawanalytics.com - Using the US Census Geocoder in R

The US Census Bureau is kind enough to offer its geocoding service to the public. The service lets you retrieve coordinates for a given postal address and vice versa. In this post we’ll use the batch endpoint to retrieve coordinates for a list of addresses. I’ll be using the curl package in R, which supplies the nuts and bolts underneath the more popular and easier-to-use httr.

We’ll start by gathering a set of addresses. We’ll use the voter rolls from the Vinton County, Ohio Board of Elections.

voters <- read.csv(here::here("posts/us-census-geocoder/voterfile.csv")) |>
  janitor::clean_names()

voters$street = paste(voters$stnum, voters$stname)
voters$lastvote = lubridate::mdy(voters$lastvote)

Warning: 35 failed to parse.

Now that we’ve got some address data, let’s look at the Census API documentation:

Geocoding can be accomplished in batch mode with submission of a .CSV, .TXT, .DAT, .XLS, or .XLSX formatted file. The file needs to be included as part of the HTTP request. The file must be formatted in the following way: Unique ID, Street address, City, State, ZIP. If a component is missing from the dataset, it must still retain the delimited format with a null value. Unique ID and Street address are required fields. If there are commas that are part of one of the fields, the whole field needs to be enclosed in quote marks for proper parsing. There is currently an upper limit of 10,000 records per batch file.
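To make those formatting rules concrete, here is a minimal sketch of writing a file in that layout with base R. The two rows are made up for illustration, and omitting the header row is an assumption on my part that the service treats every line as an address record. Missing components are written as empty fields, and quoting keeps embedded commas inside a single field.

```r
# Made-up rows: one street with an embedded comma, one with missing city and ZIP
batch <- data.frame(
  id     = c("A1", "A2"),
  street = c("100 MAIN ST, APT 2", "200 OAK AVE"),
  city   = c("MCARTHUR", NA),
  state  = c("OH", "OH"),
  zip    = c("45651", NA)
)

path <- tempfile(fileext = ".csv")

# quote = TRUE wraps character fields in quotes, na = "" leaves missing values
# empty, and col.names = FALSE skips the header (assumption, see above)
write.table(batch, path, sep = ",", na = "", quote = TRUE,
            row.names = FALSE, col.names = FALSE)

lines <- readLines(path)
cat(lines, sep = "\n")
```

Each line keeps its five comma-delimited positions even when a value is missing.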

So, we want to limit each request to at most 10,000 records, and each request uploads a CSV with these particular columns. Let’s put that into action using our .csv of voter registrations.
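If your address list is longer than the limit, one way to stay under it is to cut the data frame into pieces before uploading. The sketch below uses a hypothetical helper, split_batches(), and a made-up data frame; neither is part of the curl package or the Census service.

```r
# Hypothetical helper: split a data frame into a list of pieces,
# each holding at most `size` rows
split_batches <- function(df, size = 10000) {
  if (nrow(df) == 0) return(list())
  groups <- ceiling(seq_len(nrow(df)) / size)
  unname(split(df, groups))
}

# Made-up data: 25,000 rows should yield three batches
demo <- data.frame(id = seq_len(25000), street = "1 MAIN ST")
batches <- split_batches(demo)
sizes <- vapply(batches, nrow, integer(1))
sizes
# 10000 10000 5000
```

Each element of the list can then be written to its own file and sent as a separate request.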

library(curl)

Using libcurl 7.81.0 with OpenSSL/3.0.2

# Extracting relevant columns from the 'voters' dataset
to_send = voters[,c("sosidnum", "street", "city", "zip")]

# Renaming the columns
to_send = setNames(to_send, c("Unique ID", "Street Address", "City", "Zip"))

# Adding a new column for the state
to_send$State = "OH"

# Rearranging the columns
to_send = to_send[, c("Unique ID", "Street Address", "City", "State", "Zip")]

# Creating a temporary file to store the data in CSV format
t = tempfile(tmpdir = getwd(), fileext = ".csv")

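# Writing the data to the CSV file without a header row, so the column names
# are not submitted as an address record; missing values become empty fields
write.table(to_send, t, sep = ",", na = "", qmethod = "double",
            row.names = FALSE, col.names = FALSE)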

# Creating a new handle for making a request
h <- new_handle() |>
  handle_setform(
    addressFile = form_file(t),
    benchmark = "Public_AR_Current"
  )

# Sending a POST request to geocode the addresses
x = curl_fetch_memory("https://geocoding.geo.census.gov/geocoder/locations/addressbatch", handle = h)

# Removing the temporary file
file.remove(t) |> invisible()

# Creating a new temporary file to store the response
t = tempfile(tmpdir = getwd(), fileext = ".csv")

# Writing the response content to the temporary file
x$content |> rawToChar() |> cat(file = t)

# Reading the response into the 'results' dataframe
results <- read.csv(
  t,
  header = FALSE,
  col.names = c("uid", "address", "match", "match_type",
                "matched_address", "latlon", "tigerid", "tigerside")
)

# Removing the temporary file
file.remove(t) |> invisible()

head(results)

           uid                                        address    match
1 OH0012504958     72852 TWO MILE RD, NEW PLYMOUTH, OH, 45654    Match
2 OH0012503629 23080 CHERRY RIDGE RD, NEW PLYMOUTH, OH, 45654 No_Match
3 OH0021484097       24266 ST RT 328, NEW PLYMOUTH, OH, 45654 No_Match
4 OH0018793466       27172 ST RT 328, NEW PLYMOUTH, OH, 45654    Match
5 OH0012506696 23283 CHERRY RIDGE RD, NEW PLYMOUTH, OH, 45654    Match
6 OH0023093879     70416 TWO MILE RD, NEW PLYMOUTH, OH, 45654    Match
  match_type                                matched_address
1      Exact     72852 TWO MILE RD, NEW PLYMOUTH, OH, 45654
2                                                          
3                                                          
4      Exact   27172 STATE RTE 328, NEW PLYMOUTH, OH, 45654
5      Exact 23283 CHERRY RIDGE RD, NEW PLYMOUTH, OH, 45654
6      Exact     70416 TWO MILE RD, NEW PLYMOUTH, OH, 45654
                                 latlon  tigerid tigerside
1  -82.30831659575568,39.38436598976523 37605135         R
2                                             NA          
3                                             NA          
4 -82.40047367526898,39.325429702275244 37622682         R
5  -82.39873357836785,39.38559864463582 37604899         L
6   -82.3586789372356,39.36734046653169 37618271         R
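The round trip through a second temporary file is optional, since read.csv() can also parse text held in memory. Below is a sketch that also checks the HTTP status first. The helper parse_batch_response() is hypothetical, the fake response object only mimics the status_code and content fields that curl_fetch_memory() returns, and the quoting of the raw lines is my assumption, reconstructed from the rows printed above.

```r
result_cols <- c("uid", "address", "match", "match_type",
                 "matched_address", "latlon", "tigerid", "tigerside")

# Hypothetical helper: stop on a non-200 status, otherwise parse the body
parse_batch_response <- function(res) {
  if (res$status_code != 200) {
    stop("Geocoder request failed with HTTP status ", res$status_code)
  }
  read.csv(text = rawToChar(res$content), header = FALSE,
           col.names = result_cols)
}

# Stand-in for a real response: one matched and one unmatched row
response_text <- paste(
  '"OH0012504958","72852 TWO MILE RD, NEW PLYMOUTH, OH, 45654","Match","Exact","72852 TWO MILE RD, NEW PLYMOUTH, OH, 45654","-82.30831659575568,39.38436598976523","37605135","R"',
  '"OH0012503629","23080 CHERRY RIDGE RD, NEW PLYMOUTH, OH, 45654","No_Match"',
  sep = "\n"
)
fake <- list(status_code = 200L, content = charToRaw(response_text))

parsed <- parse_batch_response(fake)

# Count matched versus unmatched addresses
table(parsed$match)
```

With the real response, the call would be parse_batch_response(x).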

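And that’s that. From here we can use this data for analysis. Despite its name, the latlon column holds longitude first and latitude second, so we split it into separate lon and lat columns before mapping the matched addresses.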

results$lon = apply(results, 1, \(x) {
  as.numeric(strsplit(x[["latlon"]], ",")[[1]][1])
})

results$lat = apply(results, 1, \(x) {
  as.numeric(strsplit(x[["latlon"]], ",")[[1]][2])
})
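The same split can be done without looping over rows of the data frame. This is an alternative sketch, not a replacement for the code above; the input vector is a stand-in built from values printed earlier, with an empty string for an unmatched address.

```r
# Stand-in for results$latlon: two matches and one unmatched (empty) entry
latlon <- c("-82.30831659575568,39.38436598976523",
            "",
            "-82.3586789372356,39.36734046653169")

# Split every entry once, then take the first and second pieces;
# an empty entry has no pieces, so indexing it yields NA
parts <- strsplit(latlon, ",", fixed = TRUE)
lon <- vapply(parts, \(p) as.numeric(p[1]), numeric(1))
lat <- vapply(parts, \(p) as.numeric(p[2]), numeric(1))

data.frame(lon, lat)
```

Unmatched addresses come out as NA in both columns, which is what the is.na() filter in the plotting step relies on.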

library(ggplot2)
usmap::plot_usmap(regions = "counties", include = c(39163)) +
  geom_point(data = usmap::usmap_transform(results[!is.na(results$lat),]), aes(x = x, y = y)) +
  labs(title = "Vinton County")