Open Data Quality Assessment
Not all open data is equally usable. Quality varies significantly across portals, agencies, and datasets. This guide provides a framework for systematically assessing open government data quality across five dimensions, along with common issues and their fixes.
Quality Scoring Framework
| Dimension | Weight | Description | Key Metrics | Scoring |
|---|---|---|---|---|
| Completeness | 25% | Proportion of expected data values that are present. Measures missing fields, null values, and absent records. | Field fill rate, record completeness ratio, required field coverage | 100% = all expected fields populated; 0% = critical fields missing across most records |
| Accuracy | 25% | Degree to which data values correctly represent the real-world phenomena they describe. | Known-error rate, cross-source validation match rate, outlier detection | 100% = validated against authoritative source; 0% = widespread factual errors found |
| Timeliness | 20% | How current the data is relative to its stated update frequency. Data published on schedule scores higher. | Days since last update vs stated frequency, publication lag (event to publication) | 100% = updated within stated frequency; 0% = data is years out of date |
| Consistency | 15% | Uniformity of data representation across records and over time. Consistent formats, codes, and naming conventions. | Schema conformance rate, code list adherence, format uniformity | 100% = uniform schema and coding throughout; 0% = mixed formats, inconsistent codes across records |
| Metadata Quality | 15% | Completeness and accuracy of the dataset's descriptive metadata (title, description, license, update frequency, contact). | DCAT required field coverage, description length, license presence, contact availability | 100% = full DCAT metadata with accurate descriptions; 0% = title only, no description or license |
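The weighted framework above can be sketched as a simple score aggregation. This is a minimal illustration, not a prescribed implementation: the dimension names and weights come from the table, while the per-dimension scores are hypothetical values you would measure for a given dataset.

```python
# Weights from the scoring framework table (must sum to 1.0).
WEIGHTS = {
    "completeness": 0.25,
    "accuracy": 0.25,
    "timeliness": 0.20,
    "consistency": 0.15,
    "metadata_quality": 0.15,
}

def overall_quality(scores: dict) -> float:
    """Weighted average of per-dimension scores, each on a 0-100 scale."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the five dimensions")
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

# Hypothetical assessment of a single dataset.
example = {
    "completeness": 90,
    "accuracy": 80,
    "timeliness": 70,
    "consistency": 100,
    "metadata_quality": 60,
}
# 0.25*90 + 0.25*80 + 0.20*70 + 0.15*100 + 0.15*60 = 80.5
```

A dataset scoring 80.5 here is strong on structure (consistency) but weakened by thin metadata, which is a common pattern on open data portals.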
Common Data Quality Issues and Fixes
| Issue | Impact | Detection | Fix |
|---|---|---|---|
| Missing values (nulls, empty strings) | Breaks aggregations, produces misleading statistics | Count nulls per column; flag columns with >10% missing | Document missing data policy; use explicit null markers; provide imputation notes if values are estimated |
| Inconsistent date formats | Parsing failures, incorrect chronological ordering | Regex pattern analysis across date columns | Standardize to ISO 8601 (YYYY-MM-DD or YYYY-MM-DDTHH:MM:SSZ); document timezone handling |
| Duplicate records | Inflated counts, double-counting in aggregations | Group by key fields and count; hash-based deduplication | Define unique key constraints; implement deduplication in ETL pipeline; document duplicate resolution rules |
| Unstable identifiers | Broken joins across datasets, lost linkages over time | Track ID changes between dataset versions | Use persistent URIs or stable identifiers (FIPS codes, GEOIDs); provide crosswalk files when IDs change |
| Character encoding issues | Garbled text, broken accented characters, import failures | Check for UTF-8 BOM, mixed encoding indicators | Standardize to UTF-8; strip BOM for CSV files; declare encoding in metadata |
| Stale metadata | Users cannot assess data currency; wrong license applied | Compare metadata timestamps to actual data timestamps | Automate metadata updates in publishing pipeline; validate metadata against data on each release |
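The detection methods in the table can be sketched as lightweight checks using only the standard library. This is a hedged illustration: the null markers, the >10% threshold, and the sample rows are assumptions mirroring the table, and a production pipeline would build these checks into a proper validation framework rather than ad hoc functions.

```python
import re
from collections import Counter

# Assumed set of null-like markers; real datasets vary.
MISSING = {"", "na", "n/a", "null", "none"}
# ISO 8601 date, optionally with a UTC timestamp, as recommended above.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}Z)?$")

def missing_rate(rows, column):
    """Fraction of rows whose value in `column` is a null-like marker."""
    hits = sum(1 for r in rows
               if str(r.get(column, "")).strip().lower() in MISSING)
    return hits / len(rows) if rows else 0.0

def nonstandard_dates(rows, column):
    """Values in `column` that fail the ISO 8601 pattern check."""
    return [r[column] for r in rows if not ISO_DATE.match(str(r[column]))]

def duplicate_keys(rows, key_fields):
    """Key tuples that occur more than once (hash-based grouping)."""
    counts = Counter(tuple(r[k] for k in key_fields) for r in rows)
    return [key for key, n in counts.items() if n > 1]

# Hypothetical sample with one missing value, one bad date, one duplicate ID.
rows = [
    {"id": "1", "date": "2024-01-05", "value": "12"},
    {"id": "2", "date": "01/06/2024", "value": ""},
    {"id": "2", "date": "2024-01-06", "value": "7"},
]
assert missing_rate(rows, "value") > 0.10        # flag: >10% missing
assert nonstandard_dates(rows, "date") == ["01/06/2024"]
assert duplicate_keys(rows, ["id"]) == [("2",)]
```

Running checks like these per column on each release, and publishing the results alongside the data, turns the issue table above into an automated quality gate.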