Road safety research in the UK
relies heavily on the “STATS19” dataset—the official record of every
reported road traffic collision. As the Department for Transport (DfT)
has modernized its data delivery, the `stats19` R package has
evolved alongside it.
Today, we are excited to announce `stats19` v4.0.0, a major milestone that refactors the package from the ground up to be faster, cleaner, and more robust for longitudinal research.
In the past, users often had to deal with shifting schemas. For
example, columns like `carriageway_hazards` might appear as
`carriageway_hazards_historic` in older files. In v4.0.0,
we’ve adopted a Unified Longitudinal Schema. The
package now automatically detects these “historic” variants, merges them
into their modern counterparts, and drops the redundant columns.
While this “breaks” scripts that explicitly looked for those
`*_historic` names, it significantly simplifies research: you
can now analyze 45 years of data (1979–2024) using a single, consistent
set of column names.
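To illustrate, here is a minimal sketch of a multi-year workflow. The `get_stats19()` call is the package's established download-and-format entry point, but the year-range usage and the exact output column names shown here are assumptions, not guaranteed API:

```r
library(stats19)

# Download and format collision data spanning several decades
# (assumes get_stats19() accepts a range of years in v4.0.0)
crashes = get_stats19(year = 1985:2024, type = "collision")

# With the Unified Longitudinal Schema, one column name works across
# all years -- no manual coalescing of *_historic variants needed:
table(crashes$carriageway_hazards, useNA = "ifany")
```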
If you’ve used previous versions, you might have been greeted by a
wall of red warnings about unmatched column parsers. No more!

- **Intelligent Parsing**: `read_stats19()` now scans the actual CSV
  header first and builds a custom parser on the fly.
- **Fixed Coordinates**: We caught and fixed a critical bug where 2024
  Latitude/Longitude data was being truncated to integers. v4.0.0
  restores full floating-point precision.
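A quick way to see both fixes in action is to read a raw DfT file and inspect the coordinate columns. The file path and column name below are illustrative assumptions:

```r
library(stats19)

# Hypothetical local copy of a DfT collision CSV
raw_file = "dft-collision-data-2024.csv"

# The header-driven parser builds column specs from the file itself,
# so no "unmatched parser" warnings regardless of vintage:
crashes_2024 = read_stats19(raw_file)

# Latitude should now retain full floating-point precision,
# not integer-truncated values:
summary(crashes_2024$latitude)
```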
Real-world data is messy. DfT files use a mix of `-1`,
“Code deprecated”, and “Data missing or out of range”. We now
aggressively standardize these to `NA` globally during the
formatting phase, so your `is.na()` calls actually work as
expected across all variables.
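Because the missing-value codes are normalized during formatting, a single idiom covers every column. A sketch, assuming a formatted `crashes` data frame from `get_stats19()`:

```r
library(stats19)
crashes = get_stats19(year = 2023, type = "collision")

# Share of missing values per column -- no special-casing of -1 codes
# or "Data missing or out of range" strings required:
sapply(crashes, function(x) mean(is.na(x)))
```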
By defaulting to the `readr` Edition 2 engine, the package
now utilizes multi-threaded parsing. Large files that used to take
minutes now load in seconds, making the exploration of the full
1979–latest dataset much more practical.
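If you need to check or control which engine is in use, `readr` exposes its edition directly (these are standard `readr` functions, not part of `stats19`):

```r
library(readr)

# Edition 2 is the vroom-backed, multi-threaded engine and is the
# default on modern readr installs:
readr::edition_get()

# To temporarily opt back into the single-threaded Edition 1 engine
# (e.g. for debugging a parsing discrepancy):
# readr::local_edition(1)
```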
Beyond the refactor, we’ve added powerful new functions:

- **`match_tag()`**: Directly join government TAG (Transport Analysis
  Guidance) cost estimates to your collision data. This allows you to
  estimate the economic impact of collisions based on severity and road
  type.
- **Vehicle Cleaning**: With `clean_make()` and `clean_model()`, you
  can standardize the 2,400+ unique raw strings in the vehicle dataset,
  making it easier to study trends in vehicle safety and composition.
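A hedged sketch of how these might fit together; the exact argument names for `match_tag()`, `clean_make()`, and `clean_model()`, and the raw column they operate on, are assumptions for illustration:

```r
library(stats19)

crashes  = get_stats19(year = 2023, type = "collision")
vehicles = get_stats19(year = 2023, type = "vehicle")

# Attach TAG cost estimates to each collision record
# (illustrative call; see the function documentation for details):
crashes_cost = match_tag(crashes)

# Standardize free-text make/model strings from the vehicle table
# (column name is an assumption):
vehicles$make_clean  = clean_make(vehicles$generic_make_model)
vehicles$model_clean = clean_model(vehicles$generic_make_model)
```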