Open Data Quality Assessment

Not all open data is equally usable. Quality varies significantly across portals, agencies, and datasets. This guide provides a framework for systematically assessing open government data quality across five dimensions, along with common issues and their fixes.

Quality Scoring Framework

| Dimension | Weight | Description | Key Metrics | Scoring |
|---|---|---|---|---|
| Completeness | 25% | Proportion of expected data values that are present. Measures missing fields, null values, and absent records. | Field fill rate, record completeness ratio, required field coverage | 100% = all expected fields populated; 0% = critical fields missing across most records |
| Accuracy | 25% | Degree to which data values correctly represent the real-world phenomena they describe. | Known-error rate, cross-source validation match rate, outlier detection | 100% = validated against authoritative source; 0% = widespread factual errors found |
| Timeliness | 20% | How current the data is relative to its stated update frequency. Data published on schedule scores higher. | Days since last update vs stated frequency, publication lag (event to publication) | 100% = updated within stated frequency; 0% = data is years out of date |
| Consistency | 15% | Uniformity of data representation across records and over time. Consistent formats, codes, and naming conventions. | Schema conformance rate, code list adherence, format uniformity | 100% = uniform schema and coding throughout; 0% = mixed formats, inconsistent codes across records |
| Metadata Quality | 15% | Completeness and accuracy of the dataset's descriptive metadata (title, description, license, update frequency, contact). | DCAT required field coverage, description length, license presence, contact availability | 100% = full DCAT metadata with accurate descriptions; 0% = title only, no description or license |
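The weighted scheme above can be sketched in code. This is a minimal illustration, not a prescribed implementation; the function and variable names are invented for the example.

```python
# Weights from the scoring framework table (sum to 1.0).
WEIGHTS = {
    "completeness": 0.25,
    "accuracy": 0.25,
    "timeliness": 0.20,
    "consistency": 0.15,
    "metadata_quality": 0.15,
}

def overall_quality(scores: dict) -> float:
    """Weighted average of per-dimension scores, each on a 0-100 scale."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# Hypothetical per-dimension scores for one dataset.
scores = {"completeness": 90, "accuracy": 80, "timeliness": 60,
          "consistency": 70, "metadata_quality": 50}
print(overall_quality(scores))  # approximately 72.5
```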

Dimension Details

Completeness

Weight: 25%

Proportion of expected data values that are present. Measures missing fields, null values, and absent records.

Common issues: Null values in required fields, missing geographic identifiers, incomplete time series with gaps
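A field fill rate (one of the completeness metrics above) can be computed directly from a CSV. A minimal sketch using only the standard library; the null-marker list and sample data are assumptions for illustration.

```python
import csv
import io

# Strings commonly used as null markers in published CSVs (assumed list).
NULLS = {"", "na", "n/a", "null", "none"}

def fill_rates(csv_text: str) -> dict:
    """Return the proportion of non-null values for each column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rates = {}
    for col in rows[0].keys():
        filled = sum(1 for r in rows
                     if (r[col] or "").strip().lower() not in NULLS)
        rates[col] = filled / len(rows)
    return rates

# Made-up sample: one empty city, one "N/A" population.
sample = "id,city,population\n1,Springfield,30000\n2,,45000\n3,Shelby,N/A\n"
print(fill_rates(sample))
```

Flagging columns below a threshold (say 90% filled) then gives a per-dataset completeness score.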

Accuracy

Weight: 25%

Degree to which data values correctly represent the real-world phenomena they describe.

Common issues: Stale values not updated, transcription errors, unit mismatches (thousands vs millions), geocoding errors
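Outlier detection, one of the accuracy metrics, can catch unit mismatches like the thousands-vs-millions case above. This sketch uses the modified z-score (median absolute deviation), which stays robust when the outlier itself distorts the mean; the data and threshold are illustrative.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.

    Uses median absolute deviation (MAD) rather than standard deviation,
    so a single extreme value cannot mask itself.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [v for v in values
            if mad and 0.6745 * abs(v - med) / mad > threshold]

# Made-up budget column: the last value was likely entered in units
# instead of thousands.
budgets = [1100, 1200, 1280, 1350, 1_350_000]
print(mad_outliers(budgets))  # flags 1350000
```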

Timeliness

Weight: 20%

How current the data is relative to its stated update frequency. Data published on schedule scores higher.

Common issues: Datasets listed as 'monthly' not updated in years, no last-modified timestamps, broken update pipelines
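The 'monthly but not updated in years' failure can be detected mechanically by comparing the last-modified date against the stated frequency. A sketch; the frequency table and grace period are assumptions, not a standard.

```python
from datetime import date

# Maximum expected days between updates per stated frequency (assumed).
FREQUENCY_DAYS = {"daily": 1, "weekly": 7, "monthly": 31,
                  "quarterly": 92, "annual": 366}

def is_stale(last_modified: date, stated_frequency: str,
             today: date, grace_days: int = 7) -> bool:
    """True if the dataset has missed its stated update window."""
    allowed = FREQUENCY_DAYS[stated_frequency] + grace_days
    return (today - last_modified).days > allowed

# A 'monthly' dataset last touched in 2021 is long overdue.
print(is_stale(date(2021, 3, 1), "monthly", today=date(2024, 1, 15)))
```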

Consistency

Weight: 15%

Uniformity of data representation across records and over time. Consistent formats, codes, and naming conventions.

Common issues: Mixed date formats (MM/DD/YYYY vs YYYY-MM-DD), inconsistent state/country codes, schema changes without versioning
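Mixed date formats in a single column can be surfaced with the regex pattern analysis mentioned in the detection table. A minimal sketch; the pattern list and sample column are illustrative.

```python
import re
from collections import Counter

# Illustrative patterns; extend as needed for other formats.
PATTERNS = [
    ("ISO 8601 (YYYY-MM-DD)", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
    ("US (MM/DD/YYYY)", re.compile(r"^\d{2}/\d{2}/\d{4}$")),
]

def date_format_counts(values):
    """Count how many values match each known date pattern."""
    counts = Counter()
    for v in values:
        label = next((name for name, pat in PATTERNS if pat.match(v)),
                     "unrecognized")
        counts[label] += 1
    return counts

# Made-up column mixing three representations.
column = ["2023-01-15", "01/16/2023", "2023-01-17", "Jan 18 2023"]
print(date_format_counts(column))
```

A fully consistent column yields a single pattern; anything else lowers the format-uniformity score.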

Metadata Quality

Weight: 15%

Completeness and accuracy of the dataset's descriptive metadata (title, description, license, update frequency, contact).

Common issues: Generic descriptions ('Data file'), missing licenses, no contact information, absent data dictionaries
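DCAT field coverage, the first metadata metric listed, reduces to checking which descriptive fields are non-empty. A sketch; this field list is a common DCAT-style subset, but exact requirements depend on the profile a portal follows.

```python
# DCAT-style fields to check (assumed subset; profiles vary).
REQUIRED = ["title", "description", "license",
            "accrualPeriodicity", "contactPoint"]

def metadata_coverage(record: dict) -> float:
    """Fraction of required metadata fields that are non-empty."""
    present = [f for f in REQUIRED if str(record.get(f, "")).strip()]
    return len(present) / len(REQUIRED)

# Made-up record: title and a (generic) description, empty license,
# no update frequency or contact.
record = {"title": "Building Permits", "description": "Data file",
          "license": ""}
print(metadata_coverage(record))  # 0.4
```

A coverage check like this catches missing fields but not generic descriptions; a minimum description length is a common companion heuristic.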

Common Data Quality Issues and Fixes

| Issue | Impact | Detection | Fix |
|---|---|---|---|
| Missing values (nulls, empty strings) | Breaks aggregations, produces misleading statistics | Count nulls per column; flag columns with >10% missing | Document missing-data policy; use explicit null markers; provide imputation notes if values are estimated |
| Inconsistent date formats | Parsing failures, incorrect chronological ordering | Regex pattern analysis across date columns | Standardize to ISO 8601 (YYYY-MM-DD or YYYY-MM-DDTHH:MM:SSZ); document timezone handling |
| Duplicate records | Inflated counts, double-counting in aggregations | Group by key fields and count; hash-based deduplication | Define unique key constraints; implement deduplication in the ETL pipeline; document duplicate resolution rules |
| Unstable identifiers | Broken joins across datasets, lost linkages over time | Track ID changes between dataset versions | Use persistent URIs or stable identifiers (FIPS codes, GEOIDs); provide crosswalk files when IDs change |
| Character encoding issues | Garbled text, broken accented characters, import failures | Check for UTF-8 BOM, mixed encoding indicators | Standardize to UTF-8; strip BOM from CSV files; declare encoding in metadata |
| Stale metadata | Users cannot assess data currency; wrong license applied | Compare metadata timestamps to actual data timestamps | Automate metadata updates in the publishing pipeline; validate metadata against data on each release |
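The key-based deduplication fix from the table can be sketched as a first-occurrence-wins pass over the records. The key fields and sample rows are assumptions for illustration; real pipelines should also document which occurrence is kept and why.

```python
def deduplicate(rows, key_fields):
    """Drop records whose key fields repeat, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Made-up permit records with one exact key duplicate.
rows = [
    {"permit_id": "A1", "date": "2023-01-15"},
    {"permit_id": "A1", "date": "2023-01-15"},
    {"permit_id": "A2", "date": "2023-01-16"},
]
print(len(deduplicate(rows, ["permit_id", "date"])))  # 2
```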