3 Methodology

In this section, we describe the methodology we follow for this analysis. We first describe how we created a dataset on which to estimate park activity location choices for a sample of observed trips in Alameda County, California. Then we provide an overview of destination choice modeling and using such models to derive utility-based accessibility.

3.1 Study area

Alameda County is one of nine counties that constitute the San Francisco Bay Area metropolitan region in California. Alameda is the seventh most populous county in California with a population of 1.5 million (U.S. Census Bureau 2019), and has 14 incorporated cities and several unincorporated communities. It is an economically and ethnically diverse county and hence it was an attractive area to use for this study. The racial makeup of Alameda County was 49.7% White, 11.2% African American, 1.0% Native American, 38.7% Asian, 1.0% Pacific Islander, and 22.4% Hispanic or Latino of any race. Alameda County has a diverse set of parks, ranging from local small community parks, urban and transit-accessible parks like the Lake Merritt Recreational area, coastal access points, and suburban recreational areas like Lake Chabot.

3.2 Data

We constructed an estimation data set from a publicly-available parks polygons layer, a commercially acquired passive device origin-destination table representing visitors to parks and inferred residential block groups for these visitors, and American Community Survey data for the residence block groups. We also constructed a dataset representing open street installations that were implemented in response to the COVID-19 pandemic.

3.2.1 Model estimation data

We obtained a polygons shapefile layer representing open spaces in Alameda County, California from the California Protected Areas Database (GreenInfo Network 2019). This dataset was selected because it included multiple different types of open space including local and state parks, traditional green spaces as well as wildlife refuges and other facilities that can be used for recreation. We removed facilities that did not allow open access to the public (such as the Oakland Zoo) and facilities whose boundaries conflated with freeway right-of-way – this prevents trips through the park from being conflated with park trips in the passive origin-destination data.

From this initial parks polygon dataset, we obtained park attribute information through OpenStreetMap (OSM) using the osmdata package for R (Padgham et al. 2017). Three attribute elements are considered in this analysis. First, we identify playgrounds using OSM facilities given a leisure = playground tag. The tagged facilities can be either polygon or point objects; in the former case we use the polygon centroid to determine the point location of the playground.

Second, we consider sport fields of various kinds identified with the OSM leisure = pitch tag. This tag has an additional attribute describing the sport the field is designed for, which we retain in a consolidated manner. Soccer and American football fields are considered in a single category, and baseball and softball fields are combined as well. Basketball, tennis, and volleyball courts are kept as distinct categories, with all other sport-specific fields combined into a single “other.” Golf courses are discarded. As with playgrounds, polygon field and court objects are converted into points at the polygon centroid.

Finally, we identified trails and footpaths using the path, cycleway, and footway values of the highway tag. A visual inspection of the resulting data revealed that the large preponderance of sidewalks and cycling trails within parks in Alameda County are identified properly with these variables. Trails are represented in OSM as polylines, or as polygons if they form a complete circle. In the latter case, we converted the polygon boundary into an explicit polyline object.

It is possible for each of these facilities to exist outside the context of a public park. For example, many private apartment complexes have playgrounds and high schools will have sports facilities that are not necessarily open to the general public. We spatially matched the OSM amenities data and retained only those facilities that intersected with the CPAD open spaces database identified earlier.

tar_load(parks)
tar_load(playgrounds)
tar_load(trails)
tar_load(pitches)
dj1 <- wesanderson::wes_palette("Darjeeling1")
if(knitr::is_latex_output()) {
  ggplot() +
    annotation_map_tile("cartolight", zoom = 11) + 
    coord_sf(crs = st_crs(3857), 
             xlim = c(-122.29674 , -122.16358), 
             ylim = c( 37.74098, 37.87028), expand = FALSE) + 
    geom_sf(data = parks %>% st_transform(3857), inherit.aes = FALSE,
            fill = dj1[2], lwd = 0)  +
    theme(axis.line = element_line(color = NA),
          axis.title.x=element_blank(),
          axis.text.x=element_blank(),
          axis.ticks.x=element_blank(),
          axis.title.y=element_blank(),
          axis.text.y=element_blank(),
          axis.ticks.y=element_blank())
} else {
  leaflet() %>%
    addProviderTiles(providers$Esri.WorldGrayCanvas) %>%
    addPolygons(data = parks %>% st_transform(4326), group = "Parks", color = dj1[2]) %>%
    addCircleMarkers(data = playgrounds %>% st_transform(4326), 
                     group = "Playgrounds", color = dj1[3], radius = 2) %>%
    addCircleMarkers(data = pitches %>% st_transform(4326), 
                     group = "Sport Fields", color = dj1[1], radius = 2) %>%
    addPolylines(data = trails %>% st_transform(4326), 
                 group = "Trails", color = dj1[5]) %>%
    addLayersControl(overlayGroups = c("Parks", "Playgrounds", "Sport Fields", "Trails"))
}

Figure 3.1: Location of parks in Alameda County.

We provided the park boundaries layer to a commercial firm, StreetLight Data Inc., which develops and resells origin-destination matrices derived from passive device location data. The provider employs a proprietary data processing engine (called Route Science) to algorithmically transform observed device location data points (the provider uses in-vehicle GPS units and mobile device LBS) over time into contextualized, normalized, and aggregated travel patterns. From these travel patterns, the Route Science processing algorithms infer likely home Census block group locations for composite groups of people and converts raw location data points into trip origin and destination points (Pan et al. 2006; Friedrich et al. 2010).

For each park polygon, the firm returned a population-weighted estimate of how many devices were observed by home location block group over several months in the period between May 2018 and October 2018. We transformed this table such that it represented the weighted unique devices traveling between block groups and parks. We discarded home location block groups outside of Alameda County; the imputed home locations can be far away from the study area for a small amount of trips and are unlikely to represent common or repeated park activities. Table 3.1 presents descriptive statistics on the 500 parks assembled for this study, grouped according to the park type as defined on CPAD. A little more than half of the parks have identified paths, while around 40% of the identified parks have playgrounds and sport fields.

tar_load(attributed_parks)
tar_load(park_flows)
attributed_parks <- attributed_parks %>% group_by(type) %>%
  left_join(park_flows %>% group_by(park) %>% slice(1) %>% 
              select(park, attractions), by = c("id" = "park")) %>%
  mutate(attractions = ifelse(is.na(attractions), 0, attractions)) 

datasummary_balance(
  ~ Type,  
  data = attributed_parks %>% 
    transmute(Access = access, Acres = acres, 
           Type = ifelse(grepl("Recreation Area", type), "Recreation Area" ,type),
           Playground = playground, 
           `Any Sport Field` = pitch, 
           `Football / Soccer` = `football / soccer`,
           `Baseball` = baseball, `Basketball` = basketball,
           `Tennis` = tennis, `Volleyball` = volleyball,
           Trail = trail, 
           `Mobile Devices` =attractions), 
  dinm = FALSE, caption = "Park Summary Statistics", booktabs = TRUE)
Table 3.1: Park Summary Statistics
Local Park (N=441)
Recreation Area (N=59)
Mean Std. Dev. Mean Std. Dev.
Acres 59.8 370.5 125.6 505.1
Mobile Devices 1450.0 6685.4 2659.0 6161.0
N Pct. N Pct.
type Local Park 441 100.0 0 0.0
Local Recreation Area 0 0.0 57 96.6
State Recreation Area 0 0.0 2 3.4
Access Open Access 441 100.0 59 100.0
No Public Access 0 0.0 0 0.0
Restricted Access 0 0.0 0 0.0
Playground FALSE 213 48.3 43 72.9
TRUE 228 51.7 16 27.1
Any Sport Field FALSE 270 61.2 38 64.4
TRUE 171 38.8 21 35.6
Football / Soccer FALSE 414 93.9 51 86.4
TRUE 27 6.1 8 13.6
Baseball FALSE 363 82.3 45 76.3
TRUE 78 17.7 14 23.7
Basketball FALSE 337 76.4 52 88.1
TRUE 104 23.6 7 11.9
Tennis FALSE 387 87.8 52 88.1
TRUE 54 12.2 7 11.9
Volleyball FALSE 433 98.2 57 96.6
TRUE 8 1.8 2 3.4
Trail FALSE 143 32.4 21 35.6
TRUE 298 67.6 38 64.4

In order to understand the demographic makeup of the home block groups and potentially the characteristics of the people who make each trip, we obtained 2013-2017 five-year data aggregations from the American Community Survey using the tidycensus (Walker 2019) interface to the Census API for several key demographic and built environment variables: the share of individuals by race, the share of households by income level, household median income, the share of households with children under 6 years old, and the household density. An important attribute of the choice model is the distance from the home block group to the park boundary. Because we have no information on where in the block group a home is actually located, we use the population-weighted block group centroid published by the Census Bureau as the location for all homes in the block group. We then measured the network-based distance between the park and the home block group centroid using a walk network derived from OpenStreetMap via the networkx package for Python (Hagberg, Schult, and Swart 2008),

tar_load(acs)
acs_for_table <- acs %>% select(
    "Density: Households per square kilometer" = density,
    "Income: Median tract income" = income,
    "Low Income: Share of households making less than $35k" = lowincome,
    "High Income: Share of households making more than $125k" = highincome,
    "Children: Share of households with children under 6" = children,
    "Black: Share of population who is Black" = black,
    "Asian: Share of population who is Asian" = asian,
    "Hispanic: Share of population who is Hispanic*" = hispanic,
    "White: Share of population who is White" = white)

datasummary_skim(acs_for_table, title = "Block Group Summary Statistics",
  histogram = !knitr::is_latex_output()) %>%
  kableExtra::kable_styling(latex_options = c("scale_down")) %>%
  kableExtra::column_spec(1, width = "4cm") %>%
  kableExtra::footnote(symbol = "Hispanic indicates Hispanic individuals of all races; non-Hispanic individuals report a single race alone.") 
Table 3.2: Block Group Summary Statistics
Unique (#) Missing (%) Mean SD Min Median Max
Density: Households per square kilometer 1040 0 1727.2 1577.7 0.0 1374.0 21943.1
Income: Median tract income 978 3 106026.7 48909.9 13472.0 98206.5 250001.0
Low Income: Share of households making less than $35k 977 0 16.2 13.2 0.0 12.9 100.0
High Income: Share of households making more than $125k 1022 0 38.6 20.8 0.0 36.9 89.2
Children: Share of households with children under 6 976 0 14.5 8.7 0.0 13.2 48.7
Black: Share of population who is Black 914 0 11.6 14.0 0.0 6.3 80.8
Asian: Share of population who is Asian 1022 0 26.7 20.6 0.0 20.8 93.9
Hispanic: Share of population who is Hispanic* 1031 0 22.2 18.7 0.0 16.0 88.2
White: Share of population who is White 1030 0 33.5 22.5 0.0 28.8 93.6
* Hispanic indicates Hispanic individuals of all races; non-Hispanic individuals report a single race alone.

3.2.2 Model application data

tar_load(slowstreets)
tar_load(street_parks)

We compiled a list of streets in Berkeley, Oakland, and Alameda that were converted to public open space from each city’s respective websites (City of Alameda 2020; City of Oakland 2020; City of Berkeley 2020) and referred to the “Shifting Streets” COVID-19 mobility dataset (T. Combs et al. 2020) to determine whether other cities and places within Alameda County have similar Open Streets projects (as best as we could determine, they did not). Based on the information we gathered from these sources, 74 individual streets were converted; these streets represent 27.6 total miles across the cities of Berkeley, Oakland, and Alameda. For the purposes of this analysis, we represent each opened street as a single “park” without any sport facilities or playgrounds, but with a trail / walking path. The database provides the opened streets as polyline objects; we assert a 25-foot buffer around the line to represent a polygon with a measurable area. The 25-foot buffer effectively counts one vehicle lane and one shoulder parking lane in each direction as converted to “park” space. Finally, we measure the network-based distance from each population-weighted block group centroid to the nearest boundary of each new open space facility created by this policy.

3.3 Model estimation

In random utility choice theory, if an individual living in block group \(n\) wishes to make a park trip, the probability that the individual will choose park \(i\) from the set of all parks \(J\) can be described as a ratio of the park’s measurable utility \(V_{ni}\) to the sum of the utilities for all parks in the set. In the common destination choice framework we apply a multinomial logit model (D. L. McFadden 1974; Recker and Kostyniuk 1978), \[\begin{equation}\label{eq:p} P_{ni} = \frac{\exp(V_{ni})}{\sum_{j \in J}\exp(V_{nj})} \end{equation}\] where the measurable utility \(V_{ni}\) is a linear-in-parameters function of the destination attributes. \[\begin{equation}\label{eq:V} V_{ni} = X_{ni}\beta \end{equation}\] where \(\beta\) is a vector of estimable coefficients giving the relative utility (or disutility) of that attribute to the choice maker, all else equal. It is possible to add amenities of the park or the journey to the utility equation. However, as the number of alternatives is large, it is impractical to consider alternative-specific constants or coefficients and therefore not possible to include attributes of the home block group or traveler \(n\) directly. We can, however, segment the data and estimate distinct distance and size parameters for different segments to observe heterogeneity in the utility parameters between different socioeconomic groups.

The logarithm of the sum in the denominator of Equation (called the logsum) provides a measure of the consumer surplus of the choice set enjoyed by person \(n\) (Williams 1977), \[\begin{equation} CS_n = \ln{{\sum_{j \in J}\exp(V_{nj})}} + C \tag{3.1} \end{equation}\] where \(C\) is a constant indicating an unknown absolute value; the difference of logsum values in two different scenarios eliminates \(C\). Additionally, dividing the difference in logsum from choice set \(J\) and choice set \(J'\) by a cost coefficient \(\beta\) \[\begin{equation} \delta CS_n = (\ln{\sum_{j \in J'}\exp(V_{nj})} - \ln{\sum_{j \in J}\exp(V_{nj})})/\beta \tag{3.2} \end{equation}\] gives an estimate of the benefit received by person \(n\) in monetary terms. Thus, such a “utility-based” accessibility term is continuously defined, contains multiple dimensions of the attributes of the choice, and can be evaluated in monetary terms (Handy and Niemeier 1997; Dong et al. 2006). Note also that the logsum increases with the size of the choice set: if a new alternative \(q\) is added to \(J\), then \(\ln\sum_{j\in J}\exp(V_j) < \ln(\sum_{j\in J}\exp(V_j) + \exp(V_q))\) for any value of \(V_q\).

set.seed(42)
n_obs <- 20000
n_alts <- 10
n_flow <- sum(park_flows$flow)
ll0e <- sum(n_obs *.8 * log(1/n_alts))

In the most typical cases, researchers estimate the utility coefficients for destination choice models from household travel surveys. For example, the California add-on to the 2017 National Household Travel Survey could be used for this purpose for frequent trips like commutes to work and school. However, as a 24-hour trip diary, it is less useful for recreational trips that may take place less frequently. For better data on park access, we need to synthesize a suitable estimation data set. We do this by sampling 20,000 random discrete device origin-destination pairs from the commercial passive data matrix, weighted by the volume of the flows. This corresponds to a 4.3% sample of all the observed device origin-destination pairs.

The sampled origin-destination pair gives the home location as well as the “chosen” alternative for a synthetic person. In principle the individual’s choice set contains all the parks in our dataset; in practice it can be difficult to estimate choice models with so many alternatives (\(|J| = 500\)). For this reason we randomly sample 10 additional parks to serve as the non-chosen alternatives, with a different set of 10 parks for each synthetic choice maker. Such random sampling of alternatives reduces the efficiency of the estimated coefficients but the coefficients remain unbiased (Train 2009); a more elegant sampling approach might have resulted in smaller estimated standard errors, but the estimation results (presented below) suggest this is not a concern in this application. As the model has no alternative-specific constants, the standard likelihood comparison statistic against the market shares model \(\rho^2\) is not computable. We instead use the likelihood comparison against the equal shares model \(\rho_0^2\).

The resulting analysis dataset therefore contains 20,000 choice makers that select between 11 parks including the park they were observed to choose; the measured distance between the choice maker’s block group and all parks in the choice set; and the acreage of each park in the choice set. We use the mlogit package for R (Croissant 2019; R Core Team 2020) to estimate the multinomial logit models.

3.4 Model application

Using the full set of Alameda County parks — including those added by street conversion — we can apply a destination choice model to calculate the change in park choice utilities and utility-based accessibility values for each block group in Alameda County. As shown with Equation (3.2), the difference in utility-based accessibility values with and without the opened streets is the additional consumer surplus provided by the policy, converted into a monetary value by a cost-utility coefficient. The purpose of converting the logsum into a monetary value is to scale the benefits in terms that analysts and policy makers can compare more easily than raw utility values. The dataset used for this research does not have any information on travel costs or entrance fees, and such data would likely not be relevant in the context of urban parks. As a result, there is no direct link between the utility and a monetary cost in our estimated models.

As a substitution, we use an estimate of the cost coefficient obtained from the Metropolitan Transportation Commission (MTC, San Francisco Bay regional MPO) “Travel Model One” activity-based travel demand model (Metropolitan Transportation Commission 2012). In this model, the utility of destination choice for social and recreational trips uses the mode choice model logsum as an impedance measure. The mode choice model cost coefficient varies with each individual’s value of time. The average value of time in the synthetic population for the calibration scenario is $7.75 per hour. Dividing that value of time by the in-vehicle travel time parameter of \(-0.018\) results in an implied mode choice cost coefficient of \((-0.018 /min * 60 min / hr)/ (7.75 \$/hr) = -0.139 / \$\).