Open Data Quality Assessment
Not all open data is equally usable. Quality varies significantly across portals, agencies, and datasets. This guide provides a framework for systematically assessing open government data quality across five dimensions, along with common issues and their fixes.
Quality Scoring Framework
| Dimension | Weight | Description | Key Metrics | Scoring |
|---|---|---|---|---|
| Completeness | 25% | Proportion of expected data values that are present. Measures missing fields, null values, and absent records. | Field fill rate, record completeness ratio, required field coverage | 100% = all expected fields populated; 0% = critical fields missing across most records |
| Accuracy | 25% | Degree to which data values correctly represent the real-world phenomena they describe. | Known-error rate, cross-source validation match rate, outlier detection | 100% = validated against authoritative source; 0% = widespread factual errors found |
| Timeliness | 20% | How current the data is relative to its stated update frequency. Data published on schedule scores higher. | Days since last update vs stated frequency, publication lag (event to publication) | 100% = updated within stated frequency; 0% = data is years out of date |
| Consistency | 15% | Uniformity of data representation across records and over time. Consistent formats, codes, and naming conventions. | Schema conformance rate, code list adherence, format uniformity | 100% = uniform schema and coding throughout; 0% = mixed formats, inconsistent codes across records |
| Metadata Quality | 15% | Completeness and accuracy of the dataset's descriptive metadata (title, description, license, update frequency, contact). | DCAT required field coverage, description length, license presence, contact availability | 100% = full DCAT metadata with accurate descriptions; 0% = title only, no description or license |
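The weighted framework above can be sketched as a simple score aggregation. This is a minimal illustration, not a prescribed implementation: the dimension names and weights come from the table, while the per-dimension scores are hypothetical values you would measure for a given dataset.

```python
# Weights from the scoring framework table (must sum to 1.0).
WEIGHTS = {
    "completeness": 0.25,
    "accuracy": 0.25,
    "timeliness": 0.20,
    "consistency": 0.15,
    "metadata_quality": 0.15,
}

def overall_quality(scores: dict) -> float:
    """Weighted average of per-dimension scores, each on a 0-100 scale."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the five dimensions")
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

# Hypothetical assessment of a single dataset.
example = {
    "completeness": 90,
    "accuracy": 80,
    "timeliness": 70,
    "consistency": 100,
    "metadata_quality": 60,
}
# 0.25*90 + 0.25*80 + 0.20*70 + 0.15*100 + 0.15*60 = 80.5
```

A dataset scoring 80.5 here is strong on structure (consistency) but weakened by thin metadata, which is a common pattern on open data portals.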
Common Data Quality Issues and Fixes
| Issue | Impact | Detection | Fix |
|---|---|---|---|
| Missing values (nulls, empty strings) | Breaks aggregations, produces misleading statistics | Count nulls per column; flag columns with >10% missing | Document missing data policy; use explicit null markers; provide imputation notes if values are estimated |
| Inconsistent date formats | Parsing failures, incorrect chronological ordering | Regex pattern analysis across date columns | Standardize to ISO 8601 (YYYY-MM-DD or YYYY-MM-DDTHH:MM:SSZ); document timezone handling |
| Duplicate records | Inflated counts, double-counting in aggregations | Group by key fields and count; hash-based deduplication | Define unique key constraints; implement deduplication in ETL pipeline; document duplicate resolution rules |
| Unstable identifiers | Broken joins across datasets, lost linkages over time | Track ID changes between dataset versions | Use persistent URIs or stable identifiers (FIPS codes, GEOIDs); provide crosswalk files when IDs change |
| Character encoding issues | Garbled text, broken accented characters, import failures | Check for UTF-8 BOM, mixed encoding indicators | Standardize to UTF-8; strip BOM for CSV files; declare encoding in metadata |
| Stale metadata | Users cannot assess data currency; wrong license applied | Compare metadata timestamps to actual data timestamps | Automate metadata updates in publishing pipeline; validate metadata against data on each release |
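The detection methods in the table can be sketched as lightweight checks using only the standard library. This is a hedged illustration: the null markers, the >10% threshold, and the sample rows are assumptions mirroring the table, and a production pipeline would build these checks into a proper validation framework rather than ad hoc functions.

```python
import re
from collections import Counter

# Assumed set of null-like markers; real datasets vary.
MISSING = {"", "na", "n/a", "null", "none"}
# ISO 8601 date, optionally with a UTC timestamp, as recommended above.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}Z)?$")

def missing_rate(rows, column):
    """Fraction of rows whose value in `column` is a null-like marker."""
    hits = sum(1 for r in rows
               if str(r.get(column, "")).strip().lower() in MISSING)
    return hits / len(rows) if rows else 0.0

def nonstandard_dates(rows, column):
    """Values in `column` that fail the ISO 8601 pattern check."""
    return [r[column] for r in rows if not ISO_DATE.match(str(r[column]))]

def duplicate_keys(rows, key_fields):
    """Key tuples that occur more than once (hash-based grouping)."""
    counts = Counter(tuple(r[k] for k in key_fields) for r in rows)
    return [key for key, n in counts.items() if n > 1]

# Hypothetical sample with one missing value, one bad date, one duplicate ID.
rows = [
    {"id": "1", "date": "2024-01-05", "value": "12"},
    {"id": "2", "date": "01/06/2024", "value": ""},
    {"id": "2", "date": "2024-01-06", "value": "7"},
]
assert missing_rate(rows, "value") > 0.10        # flag: >10% missing
assert nonstandard_dates(rows, "date") == ["01/06/2024"]
assert duplicate_keys(rows, ["id"]) == [("2",)]
```

Running checks like these per column on each release, and publishing the results alongside the data, turns the issue table above into an automated quality gate.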