Bicycles as a mode of transportation offer various benefits, including reducing environmental pollution, lowering carbon emissions, being cost-effective, and alleviating urban traffic congestion. While the development of bicycle transportation is increasingly emphasized globally, with many countries and cities considering it an integral part of urban transportation planning, the United States still lags behind in this aspect.
Although some cities, including Washington, D.C., have begun to promote bicycle transportation by constructing bike lanes and implementing bike-sharing systems like Capital Bikeshare, bicycle commuting still accounts for a relatively small proportion in the US. Washington, D.C., faces unique challenges in terms of its infrastructure, road network, and urban layout, which can create additional difficulties for cyclists and contribute to the high crash rate.
Furthermore, bicycle transportation safety is a prominent issue in the US, particularly in Washington, D.C., where cyclists and motor vehicles often share the same roadway, which can easily lead to crashes and casualties. One critical concern is the lack of a comprehensive network of protected bike lanes, which would separate cyclists from motor vehicle traffic and minimize conflicts. Additionally, inadequate signage, insufficient cyclist and driver education, and uneven enforcement of traffic rules contribute to the problem.
To address these issues, Washington, D.C., should consider investing in improved infrastructure, such as expanding protected bike lanes, enhancing signage and visibility at intersections, and conducting public awareness campaigns to educate both cyclists and drivers on safe road-sharing practices. Collaboration between city planners, transportation departments, and local communities is crucial to creating a safer and more inclusive environment for all road users.
Through an in-depth analysis of bicycle crash data in Washington, D.C., and the establishment of a random forest machine learning model, we have effectively predicted the locations of bicycle crashes. Simultaneously, by employing k-fold cross-validation and spatial cross-validation methods, we have ensured the accuracy and reliability of the model. We also rely on rich charts and maps to visualize the data, providing a multi-layered and multi-dimensional display of the results. These findings aim to offer robust data support and intuitive evidence for policy analysis to the government and relevant departments, allowing them to take targeted measures to improve bicycle transportation safety in Washington, D.C.
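To make the difference between the two validation schemes concrete, the following is a minimal sketch in base R, not the modeling code used in this project. The toy data frame `obs`, the `neighborhood` grouping column, and the helper `split_fold` are hypothetical names introduced only for illustration. Random k-fold assigns observations to folds at random, whereas spatial cross-validation holds out one whole spatial unit at a time, so the model is always tested on an area it has not seen.

```r
# Toy data: 12 observations spread over 4 hypothetical neighborhoods
set.seed(42)
obs <- data.frame(id = 1:12,
                  neighborhood = rep(c("A", "B", "C", "D"), each = 3),
                  stringsAsFactors = FALSE)

# Random k-fold: shuffle the fold labels 1..k across the rows
k <- 4
obs$random_fold <- sample(rep(seq_len(k), length.out = nrow(obs)))

# Spatial (leave-one-group-out): each neighborhood forms its own fold
obs$spatial_fold <- as.integer(factor(obs$neighborhood))

# Hold out the rows of one fold and keep the remaining rows for training
split_fold <- function(data, fold_col, fold) {
  held_out <- data[[fold_col]] == fold
  list(train = data[!held_out, ], test = data[held_out, ])
}

spatial_split <- split_fold(obs, "spatial_fold", 1)
# The test set contains only neighborhood "A"; the training set contains none of it
unique(spatial_split$test$neighborhood)
"A" %in% spatial_split$train$neighborhood
```

Looping `split_fold` over every fold value and averaging the prediction error would give the cross-validated error for either scheme.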
In this section, we will focus on describing the data we have collected, and we will explain how we cleaned, filtered, merged, and transformed it into the dataset we needed. During this process, we will visualize the data, which is a significant aspect of presenting persuasive information. This data will reveal potential correlations between the occurrence of bicycle crashes in Washington D.C. and various environmental factors.
Census data are also required for this regression analysis. The layout of bike-sharing sites is closely linked to the socioeconomic characteristics of the areas they serve, and census data describe those characteristics. We first obtained tract-level census data for Washington D.C. from the 2020 American Community Survey (ACS). We selected variables relevant to the project, such as total population, median income, white population, commute time, and usage of different transportation modes. Next, we renamed these variables for easier understanding.
We chose the following variables: 1. Total population; 2. Median income; 3. White population; 4. Commute time; 5. Usage numbers of different transportation modes; 6. Public transportation usage numbers; 7. Car transportation usage numbers; 8. Median age; 9. Female population; 10. Population of males and females in the age groups under 18; 11. Number of vehicles; 12. Education level (number of people with bachelor’s degrees); 13. Number of people who walk to work.
Then, we calculated several derived variables: the proportion of the white population, the average commute time, the proportion of public transportation users, the proportion of the female population, the proportion of the underage population (under 18), the proportion of the population with bachelor’s degrees, and the proportion of car transportation users.
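Some census tracts have no residents, in which case a plain division produces `NaN` or `Inf`. The sketch here shows one way to guard the share calculation on toy data. The data frame `tracts` and the helper `safe_share` are hypothetical and are not part of the project code.

```r
# Hypothetical tract-level counts; tract "003" has no residents
tracts <- data.frame(
  GEOID     = c("001", "002", "003"),
  Total_Pop = c(1000, 2500, 0),
  White_Pop = c(400, 1000, 0),
  Female    = c(520, 1300, 0),
  stringsAsFactors = FALSE
)

# Return the ratio where the denominator is positive, and NA otherwise
safe_share <- function(numerator, denominator) {
  ifelse(denominator > 0, numerator / denominator, NA_real_)
}

tracts$Percent_White  <- safe_share(tracts$White_Pop, tracts$Total_Pop)
tracts$Percent_Female <- safe_share(tracts$Female, tracts$Total_Pop)
tracts
```

Returning `NA` rather than zero keeps empty tracts distinguishable from tracts where the share is genuinely zero.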
Finally, we extracted the geographical information and converted it into simple feature (sf) objects, which serve as the base map of Washington D.C. for geospatial operations in subsequent analyses. These data will help us make targeted recommendations in terms of shared bike station layouts and urban planning.
DCCensus <-
get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B08013_001",
"B08012_001", "B08301_001",
"B08301_010", "B01002_001",
"B01001_026","B01001_003",
"B01001_004","B01001_005",
"B01001_006","B01001_027",
"B01001_028","B01001_029",
"B01001_030","B08015_001",
"B06009_002","B06009_005",
"B08006_015","B08301_002"),
year = 2020,
state = "DC",
geometry = TRUE,
output = "wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Med_Age = B01002_001E,
White_Pop = B02001_002E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_car_Trans = B08301_002E,
Total_Public_Trans = B08301_010E,
Female=B01001_026E,
under5_male = B01001_003E,
between5to9_male = B01001_004E,
between10to14_male = B01001_005E,
between15to17_male = B01001_006E,
under5_female = B01001_027E,
between5to9_female = B01001_028E,
between10to14_female = B01001_029E,
between15to17_female = B01001_030E,
num_vehicles=B08015_001E,
edu_bac = B06009_005E,
walk_to_work = B08006_015E) %>%
select(Total_Pop, Med_Inc, White_Pop, Travel_Time, Num_Commuters,
Means_of_Transport, Total_Public_Trans,Total_car_Trans,
Med_Age,Female, under5_male,between5to9_male,between10to14_male,between15to17_male,under5_female,between5to9_female, between10to14_female,between15to17_female,num_vehicles,edu_bac,walk_to_work,GEOID, geometry) %>%
mutate(Percent_White = White_Pop / Total_Pop,
# Mean commute time: aggregate travel time divided by the number of commuters
Mean_Commute_Time = Travel_Time / Num_Commuters,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport,
Percent_Taking_car_Trans = Total_car_Trans / Means_of_Transport,
Percent_Female= Female/Total_Pop,
# Share of residents under 18, summed over the male and female age groups
Percent_Minor = (under5_male+between5to9_male+between10to14_male+between15to17_male+under5_female+between5to9_female+between10to14_female+between15to17_female)/Total_Pop,
Percent_edu= edu_bac/Total_Pop)%>%
select(Med_Inc, White_Pop, Travel_Time,
Means_of_Transport, Total_Public_Trans,
Med_Age,edu_bac,walk_to_work,Percent_White,Mean_Commute_Time,Percent_Taking_Public_Trans,Percent_Female,Percent_Minor,Percent_edu,Percent_Taking_car_Trans,GEOID, geometry)
DCTracts <-
DCCensus %>%
as.data.frame() %>%
distinct(GEOID, .keep_all = TRUE) %>%
select(GEOID, geometry) %>%
st_sf
We collected all crash data for Washington D.C. from 2011 to 2022 and extracted all bicycle-related crash records, including those involving bicycles and pedestrians, as well as bicycles and motor vehicles. These crash records account for approximately 8% of all crashes, amounting to nearly 10,000 incidents. We will analyze these records, using bicycle crash occurrence as the dependent variable in our regression model, to quantify the bicycle crash situation within specific areas in Washington D.C. and predict the occurrence of bicycle crashes.
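# Read the crash points and project them to the CRS shared by the other layers.
# Note: EPSG:2225 is NAD83 / California zone 1 (ftUS); a Maryland State Plane CRS
# such as EPSG:2248 may suit D.C. better, provided every layer uses the same CRS.
crash <- st_read("C:/UPENN CLASS/capstone/data/Crashes_in_DC.geojson")%>%
st_transform('EPSG:2225')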
## Reading layer `Crashes_in_DC' from data source
## `C:\UPENN CLASS\capstone\data\Crashes_in_DC.geojson' using driver `GeoJSON'
## Simple feature collection with 284571 features and 56 fields
## Geometry type: POINT
## Dimension: XYZ
## Bounding box: xmin: -78.8155 ymin: -9.000001 xmax: 77.00682 ymax: 41.25995
## z_range: zmin: 0 zmax: 0
## Geodetic CRS: WGS 84
# Keep crashes dated 2011 through 2022, dropping records with a missing date
crash <- crash %>%
filter(!is.na(FROMDATE), FROMDATE > "2010-12-31", FROMDATE < "2023-01-01")
# Bicycle-related crashes involve at least one bicycle; the date filter above already applies
bicycle_crash <- crash %>%
filter(TOTAL_BICYCLES > 0)
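Comparing dates as character strings works only when every value follows the same format. As a hedged alternative, the sketch here parses the date part and compares `Date` values. The ISO 8601 timestamp format and the toy data frame `crash_toy` are assumptions made for illustration, and the actual format of `FROMDATE` in the source file may differ.

```r
# Hypothetical timestamps in ISO 8601 form; one value is missing
from_date <- c("2010-12-31T23:00:00Z", "2011-01-01T05:00:00Z",
               "2022-12-31T18:30:00Z", "2023-01-01T00:00:00Z", NA)
crash_toy <- data.frame(id = seq_along(from_date), FROMDATE = from_date,
                        stringsAsFactors = FALSE)

# Keep only the date part, then compare as Date values instead of strings
crash_toy$date <- as.Date(substr(crash_toy$FROMDATE, 1, 10))
in_window <- !is.na(crash_toy$date) &
  crash_toy$date >= as.Date("2011-01-01") &
  crash_toy$date <= as.Date("2022-12-31")

# Rows 2 and 3 fall inside the 2011 to 2022 window
crash_toy[in_window, c("id", "date")]
```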
In this section, we have visualized the first chart: a bar chart of the total number of crashes per year. Our goal is to analyze the number of crashes each year to identify any potential trends or patterns. By grouping the crash data by year and summarizing the counts, we created a dataset showing the number of crashes per year from 2011 to 2022.
The following code snippet groups the crash dataset by the occurrence year (FROMDATE), calculates the number of crashes per year, and stores the results in the “crash_by_year” dataset. The column name is then updated to “year” for easier understanding.
crash_by_year <- crash %>%
group_by(year(FROMDATE)) %>%
summarise(count = n())
colnames(crash_by_year)[1] <- "year"
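The same group-and-count step can be illustrated on a handful of made-up dates in base R. The vector `dates` and the result `crash_by_year_toy` are hypothetical and serve only to show the shape of the output.

```r
# Six made-up crash dates across three years
dates <- as.Date(c("2011-03-02", "2011-07-19", "2012-01-05",
                   "2012-05-30", "2012-11-11", "2013-08-08"))

# Extract the year and count the records that fall in each one
years  <- as.integer(format(dates, "%Y"))
counts <- table(years)

# Store the result as a two-column data frame: year and count
crash_by_year_toy <- data.frame(year  = as.integer(names(counts)),
                                count = as.vector(counts))
crash_by_year_toy
```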
Next, we visualized the crash data by creating a bar chart displaying the number of crashes per year. The chart helps us see the trend in the number of crashes and whether certain years show a marked increase or decrease. We used a gradient fill for the bars, with darker colors (closer to dark red) for years with more crashes and lighter colors (closer to light blue) for years with fewer crashes. Every recorded year has a large number of crashes, and 2019 has the most, at 26,790.
ggplot(crash_by_year, aes(x=year, y=count, fill=count)) +
geom_bar(stat="identity") +
scale_fill_gradient(low = "#d1e0ff", high = "#561C26", guide = guide_colorbar(title = "Count", direction = "vertical", barheight = 10, barwidth = 1)) +
geom_text(aes(label=count), vjust=-0.5, color="#232310", size=3) +
labs(title = "Number of Crashes Per Year",
subtitle = "Based on all recorded crashes from 2011 to 2022",
x = "Year",
y = "Number of Crashes") +
plotTheme+
scale_x_continuous(breaks = seq(min(crash_by_year$year), max(crash_by_year$year), by = 1))
In this section, our goal is to display the number of bicycle crashes per year to identify potential trends or patterns. Similar to the previous analysis and chart for all crashes, we grouped the bicycle crash data by year and summarized the counts, creating a dataset showing the number of bicycle crashes per year from 2011 to 2022.
Next, we grouped the bicycle crash dataset by the occurrence year (FROMDATE), calculated the number of bicycle crashes per year, and stored the results in the “bicycle_crash_by_year” dataset. The column name is then updated to “year” for easier understanding.
bicycle_crash_by_year <- bicycle_crash %>%
group_by(year(FROMDATE)) %>%
summarise(count = n())
colnames(bicycle_crash_by_year)[1] <- "year"
Next, we visualized the bicycle crash data by creating a bar chart displaying the number of bicycle crashes for each year from 2011 to 2022. We used a color gradient to represent the number of bicycle crashes and added a label with the crash count to each bar.
ggplot(bicycle_crash_by_year, aes(x = year, y = count, fill=count)) +
geom_bar(stat="identity") +
scale_fill_gradient(low = "#d1e0ff", high = "#561C26", guide = guide_colorbar(title = "Count", direction = "vertical", barheight = 10, barwidth = 1)) +
geom_text(aes(label=count), vjust=-0.5, color="#232310", size=3.4) +
labs(title = "Number of Bicycle Crashes Per Year",
subtitle = "Based on all recorded bicycle crashes from 2011 to 2022",
x = "Year",
y = "Number of Crashes") +
plotTheme+
scale_x_continuous(breaks = seq(min(bicycle_crash_by_year$year), max(bicycle_crash_by_year$year), by = 1))
In this section, our goal is to identify the share of bicycle crashes among all crashes in each year. To do this, we first summarize both the total crash data and the bicycle crash data by year. Next, we merge these two summaries into a single data frame and calculate the proportion of bicycle crashes among all crashes for each year. This shows how much of the overall crash burden involves bicycles.
crash_by_year_df <- st_drop_geometry(crash_by_year)
bicycle_crash_by_year_df <- st_drop_geometry(bicycle_crash_by_year)
combined_data <- merge(crash_by_year_df, bicycle_crash_by_year_df, by = "year")
colnames(combined_data)[2] <- "all"
colnames(combined_data)[3] <- "bicycle"
combined_data$bike_prop <- combined_data$bicycle / combined_data$all
combined_data$other <- combined_data$all - combined_data$bicycle
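As a small worked example of the merge and share calculation, the sketch here uses made-up yearly counts. The data frames `all_toy` and `bike_toy` are hypothetical, and the `suffixes` argument of `merge` is used here so that the two count columns are named at the join instead of being renamed afterwards.

```r
# Made-up yearly counts of all crashes and of bicycle crashes
all_toy  <- data.frame(year = 2011:2013, count = c(200, 250, 300))
bike_toy <- data.frame(year = 2011:2013, count = c(10, 20, 15))

# Join on year; the suffixes label which table each count column came from
combined_toy <- merge(all_toy, bike_toy, by = "year",
                      suffixes = c("_all", "_bicycle"))

# Share of bicycle crashes and the remaining non-bicycle crashes
combined_toy$bike_prop <- combined_toy$count_bicycle / combined_toy$count_all
combined_toy$other     <- combined_toy$count_all - combined_toy$count_bicycle
combined_toy
```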
Next, we use the ggplot2 package to create a stacked bar chart to display the proportion of bicycle crashes in all crashes per year. In this chart, the light blue portion represents bicycle crashes, and the dark red portion represents other types of crashes. Additionally, we added percentage labels to each bar to make the proportion of bicycle crashes in each year easy to read. This chart lays the foundation for further analysis and modeling.
ggplot(combined_data, aes(x = year, y = bike_prop)) +
# Full-height bar for all other crashes, drawn first as the background
geom_col(aes(y = 1, fill = "other"), position = "stack") +
# Bicycle share drawn on top of the background bar
geom_col(aes(fill = "bicycle"), position = "stack") +
geom_text(aes(label = paste0("---", round(bike_prop * 100, 2), "%")), vjust = 0.5, angle = 90, color = "#a3c4f3", size = 5, nudge_y = 0.13) +
scale_fill_manual(name = "Crash Type", values = c("bicycle" = "#a3c4f3", "other" = "#561C26")) +
labs(x = "Year", y = "Proportion of Bicycle Crashes", subtitle = "A stacked bar chart to show the size of the share") +
ggtitle("Proportion of Bicycle Crashes by Year")+
plotTheme+
scale_x_continuous(breaks = seq(min(combined_data$year), max(combined_data$year), by = 1))
In this section, we examine injuries in bicycle crashes. We first keep the bicycle crash records in which at least one person (cyclist, driver, or pedestrian) was injured or killed. We then group these injury crashes by year and calculate, for each year, the share in which a cyclist was among the injured. Finally, we create a stacked bar chart to display this share. In the chart, the light blue portion represents injury crashes in which a cyclist was injured, and the dark red portion represents injury crashes in which only other road users were injured. Percentage labels on each bar make the share easy to read for each year.
filtered_data <- bicycle_crash %>%
filter(MAJORINJURIES_BICYCLIST > 0 |
MINORINJURIES_BICYCLIST > 0 |
UNKNOWNINJURIES_BICYCLIST > 0 |
FATAL_BICYCLIST > 0 |
MAJORINJURIES_DRIVER > 0 |
MINORINJURIES_DRIVER > 0 |
UNKNOWNINJURIES_DRIVER > 0 |
FATAL_DRIVER > 0 |
MAJORINJURIES_PEDESTRIAN > 0 |
MINORINJURIES_PEDESTRIAN > 0 |
UNKNOWNINJURIES_PEDESTRIAN > 0 |
FATAL_PEDESTRIAN > 0)
bicycle_crash_injured <- filtered_data %>%
group_by(year = as.numeric(format(as.Date(FROMDATE), "%Y"))) %>%
summarise(total_crashes = n(),
bike_crashes = sum(MAJORINJURIES_BICYCLIST > 0 |
MINORINJURIES_BICYCLIST > 0 |
UNKNOWNINJURIES_BICYCLIST > 0 |
FATAL_BICYCLIST > 0)) %>%
mutate(bike_prop = bike_crashes / total_crashes,
other_prop = 1 - bike_prop)
ggplot(bicycle_crash_injured, aes(x = year)) +
geom_col(aes(y = 1, fill = "other"), position = "stack") +
geom_col(aes(y = bike_prop, fill = "bicycle"), position = "stack") +
geom_text(aes(y = bike_prop, label = paste0(round(bike_prop * 100, 2), "% ---")), vjust = 0.5, angle = 90, color = "#561C26", size = 5, nudge_y = -0.14) +
scale_fill_manual(name = "Crash Type", values = c("bicycle" = "#a3c4f3", "other" = "#561C26")) +
labs(x = "Year", y = "Proportion", subtitle = "A stacked bar chart to show the size of the share") +
ggtitle("Share of Injury Crashes in Which a Cyclist Was Injured, by Year") +
scale_x_continuous(breaks = seq(min(bicycle_crash_by_year$year), max(bicycle_crash_by_year$year), by = 1)) +
scale_y_continuous(labels = scales::percent, limits = c(0, 1))+
plotTheme
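Two different injury shares are easy to confuse: (a) the share of all bicycle crashes with at least one injury, and (b) among crashes with an injury, the share in which a cyclist was injured. The code above groups the already filtered injury crashes, so it corresponds to (b). The sketch here computes both on toy data to show that they can differ. The data frame `toy` and its logical columns are hypothetical simplifications of the injury count fields.

```r
# Six made-up bicycle crashes with simplified injury flags
toy <- data.frame(
  year            = c(2020, 2020, 2020, 2020, 2021, 2021),
  cyclist_injured = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
  other_injured   = c(FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)
)
toy$any_injury <- toy$cyclist_injured | toy$other_injured

# (a) Share of all bicycle crashes with at least one injured person
share_any <- tapply(toy$any_injury, toy$year, mean)

# (b) Among crashes with an injury, share in which a cyclist was injured
injury_only   <- toy[toy$any_injury, ]
share_cyclist <- tapply(injury_only$cyclist_injured, injury_only$year, mean)

share_any      # 2020: 0.75, 2021: 0.50
share_cyclist  # 2020: 2/3,  2021: 1
```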
# Reproject the crash points to the CRS of the tract polygons
dc_tracts_crs <- st_crs(DCTracts)
bicycle_crash_transformed <- st_transform(bicycle_crash, dc_tracts_crs)
# Base layer: census tracts; overlay: crash points taken from the LONGITUDE/LATITUDE attribute columns
map_plot <- ggplot() +
geom_sf(data = DCTracts, fill = "#FAF0F1", color = "#A0898F", size = 0.1, alpha = 1) +
geom_point(data = bicycle_crash, aes(x = LONGITUDE, y = LATITUDE), color = "#003049", alpha = 0.1, show.legend = FALSE) +
theme_minimal() +
labs(title = "Bicycle Crash Locations") +
mapTheme
# Restrict the view to the extent of the crash points and label the axes
map_plot <- map_plot +
coord_sf(xlim = c(min(bicycle_crash_transformed$LONGITUDE), max(bicycle_crash_transformed$LONGITUDE)), ylim = c(min(bicycle_crash_transformed$LATITUDE), max(bicycle_crash_transformed$LATITUDE))) + xlab("longitude") + ylab("latitude") +
mapTheme
print(map_plot)
Based on the above two bar charts and data displaying crash proportions, we can observe the following key points:
Although bicycle crashes make up a relatively small proportion of all crashes, they do pose significant safety risks. This suggests that bicycle safety should be given adequate attention in research and policy development.
From the second bar chart, we can see that, among bicycle crashes with at least one recorded injury, a cyclist was among the injured in nearly 98% of cases. This indicates that cyclists bear most of the harm when such crashes cause injuries, which can be severe or even fatal. Therefore, improving bicycle road safety is crucial for protecting the lives of cyclists and other road users.
These observations are closely related to our research. Our goal is to establish a predictive model to understand the likelihood and associated environmental factors of bicycle crashes occurring in Washington D.C. By analyzing this data, we can better comprehend the severity and urgency of bicycle crashes, providing strong support for policymakers. Furthermore, by identifying the key factors affecting the occurrence of bicycle crashes, our model can help develop targeted policies and measures to reduce the incidence of bicycle crashes, improve road safety, and minimize the harm caused by bicycle crashes to cyclists and other road users.
In this section, we imported road data for Washington D.C. and created an animated map that cycles through the different road types. With this map, we can see how each road type is distributed throughout Washington D.C. This helps us better understand the impact of the road environment on bicycle crashes, providing more comprehensive information for our predictive model.
road <- st_read("C:/UPENN CLASS/capstone/data/Roads_2015.geojson")%>%
st_transform('EPSG:2225')
## Reading layer `Roads_2015' from data source
## `C:\UPENN CLASS\capstone\data\Roads_2015.geojson' using driver `GeoJSON'
## Simple feature collection with 39536 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -77.11669 ymin: 38.79234 xmax: -76.90947 ymax: 38.99532
## Geodetic CRS: WGS 84
animation <- ggplot() +
geom_sf(data = DCTracts %>% st_transform(crs=4326), fill = "#FAF0F1", color = "#A0898F", size = 0.1) +
geom_sf(data = road, aes(color = DESCRIPTION), linetype = "solid", size = 0.5) +
scale_color_manual(values = c("Paved Drive" = "#001219",
"Alley" = "#005F73",
"Parking Lot" = "#0A9396",
"Intersection" = "#94D2BD",
"Paved Median Island"="#E9D8A6",
"Paved Traffic Island"="#EE9B00",
"Unpaved Traffic Island"="#CA6702",
"Hidden Median"="#BB3E03",
"Hidden Road"="#AE2012",
"Road"="#9B2226",
"Unpaved Median Island"="#9c9515")) +
guides(color = guide_legend(override.aes = list(linetype = "solid", size = 1.5, shape = NA))) +
coord_sf(datum = NA) +
labs(title = "Road Types in Washington, D.C.",
subtitle = "Road Type: {closest_state}") +
transition_states(
DESCRIPTION,
transition_length = 3,
state_length = 2
) +
enter_fade() +
exit_fade()+
mapTheme
animate(animation, nframes = 200, end_pause = 50, width = 800, height = 600, renderer = gifski_renderer(loop = TRUE))
First, we read the bike lane data and transform its coordinate system to match the coordinate system of the other datasets. Next, we create a basemap that displays the geographic area of Washington D.C., and we draw the bike lanes on the basemap. Bike lanes are represented by blue lines.
bicycle_lanes <- st_read("C:/UPENN CLASS/capstone/data/bicycle_lanes.geojson")%>%
st_transform('EPSG:2225')
## Reading layer `Bicycle_Lanes' from data source
## `C:\UPENN CLASS\capstone\data\Bicycle_Lanes.geojson' using driver `GeoJSON'
## Simple feature collection with 2210 features and 26 fields
## Geometry type: LINESTRING
## Dimension: XYZ
## Bounding box: xmin: -77.08773 ymin: 38.82359 xmax: -76.93066 ymax: 38.98276
## z_range: zmin: 0 zmax: 0
## Geodetic CRS: WGS 84
ggplot() +
geom_sf(data = DCTracts %>% st_transform(crs=4326), fill = "#FAF0F1", color = "#A0898F", size = 0.1, alpha = 1) +
coord_sf(datum = NA) +
geom_sf(data = bicycle_lanes, color = "#0A9396", size = 1) +
labs(title = "Bicycle Lanes in Washington, D.C.",
subtitle = "Blue lines represent bicycle lanes")+
mapTheme
To better showcase the distribution of bike lanes, we will create a zoomed-in sub-map to examine bike lanes in specific areas in greater detail. Finally, we will put together the original map and the zoomed-in sub-map, allowing us to view the distribution of bike lanes throughout Washington D.C. as well as detailed information for specific areas in one view.
base_map <- ggplot() +
geom_sf(data = DCTracts %>% st_transform(crs=4326), fill = "#FAF0F1", color = "#A0898F", size = 0.1, alpha = 1) +
coord_sf(datum = NA) +
mapTheme
bicycle_lanes_plot <- base_map +
geom_sf(data = bicycle_lanes, color = "#0A9396", size = 1) +
labs(title = "Bicycle Lanes in D.C.",
subtitle = "Blue lines represent bicycle lanes")+
mapTheme
zoomed_area <- st_bbox(c(xmin = -77.05, xmax = -77.00, ymin = 38.89, ymax = 38.92), crs = 4326)
bicycle_lanes_transformed <- bicycle_lanes %>% st_transform(crs = st_crs(zoomed_area))
zoomed_map <- base_map +
geom_sf(data = bicycle_lanes_transformed %>% st_crop(zoomed_area), color = "#0A9396", size = 2, alpha = 1) +
coord_sf(xlim = c(zoomed_area["xmin"], zoomed_area["xmax"]), ylim = c(zoomed_area["ymin"], zoomed_area["ymax"])) +
labs(title = "Zoomed Area",
subtitle = "Detailed view of a specific area")+
mapTheme
combined_map <- bicycle_lanes_plot + zoomed_map + plot_layout(ncol = 2)
combined_map
We read data from the traffic sign data file and converted its coordinate system to be consistent with the other datasets. Traffic signs play a crucial guiding role in the transportation system, as they remind drivers, cyclists, and pedestrians to follow traffic rules and ensure road safety. In our predictive model, traffic sign data can help us understand the traffic rule settings in a specific area, allowing us to analyze their relationship with the crash occurrence rate. In terms of transportation policies, by analyzing the connection between traffic signs and crashes, the government can adjust traffic signs when necessary to enhance road safety.
traffic_signs <- st_read("C:/UPENN CLASS/capstone/data/Other_Traffic_Signs_1999.geojson")%>%
st_transform('EPSG:2225')
## Reading layer `Other_Traffic_Signs_1999' from data source
## `C:\UPENN CLASS\capstone\data\Other_Traffic_Signs_1999.geojson'
## using driver `GeoJSON'
## Simple feature collection with 8913 features and 6 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -77.11667 ymin: 38.8049 xmax: -76.91067 ymax: 38.99441
## Geodetic CRS: WGS 84
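One possible way to relate sign locations to crash occurrence is to count both per spatial unit. The sketch here bins made-up projected coordinates into square grid cells in base R. It is an assumption about how such a feature could be built, not a description of the feature engineering used in this project; the coordinates, `cell_size`, and the helper `cell_id` are hypothetical.

```r
# Hypothetical projected coordinates (in feet) for signs and crashes
signs   <- data.frame(x = c(50, 150, 160, 250), y = c(50, 40, 60, 250))
crashes <- data.frame(x = c(155, 170, 30, 260, 240),
                      y = c(55, 45, 20, 240, 260))

# Label each point with the grid cell it falls in, e.g. "1_0"
cell_size <- 100
cell_id <- function(pts) {
  paste(floor(pts$x / cell_size), floor(pts$y / cell_size), sep = "_")
}
sign_cell  <- cell_id(signs)
crash_cell <- cell_id(crashes)

# Count signs and crashes per cell over the union of occupied cells
cells <- sort(union(sign_cell, crash_cell))
grid <- data.frame(
  cell    = cells,
  signs   = as.vector(table(factor(sign_cell, levels = cells))),
  crashes = as.vector(table(factor(crash_cell, levels = cells))),
  stringsAsFactors = FALSE
)
grid
```

With real data, the same counts could be computed per census tract or fishnet cell using a spatial join, and the resulting sign count would enter the model as one predictor among others.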
We read data from the traffic signal data file and converted its coordinate system to be consistent with the other datasets. Traffic signals are essential tools for traffic management, as they help reduce traffic congestion and accident risks by controlling traffic flow. In our predictive model, traffic signal data can help us understand the traffic conditions and distribution of traffic flow. In terms of transportation policy-making, based on the analysis results, the government can adjust traffic signal settings and timing to improve road safety and transportation efficiency.
traffic_signal <- st_read("C:/UPENN CLASS/capstone/data/Traffic_Signal.geojson")%>%
st_transform('EPSG:2225')
## Reading layer `Traffic_Signal' from data source
## `C:\UPENN CLASS\capstone\data\Traffic_Signal.geojson' using driver `GeoJSON'
## Simple feature collection with 1833 features and 9 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -77.11174 ymin: 38.82129 xmax: -76.91055 ymax: 38.99228
## Geodetic CRS: WGS 84