Data Loader PR Quality Patterns

lesson patterns active patterns/data-loader-pr-quality.md View on GitHub

Data Loader PR Quality Patterns

Rule

When writing data loader PRs for any data pipeline, apply these six patterns to produce production-quality code that passes automated review on the first cycle.

Context

When submitting a PR that adds a new data loader: reading CSV exports, parsing SQLite databases, querying APIs, or any function that ingests external data into a DataFrame.

Detection

Observable signals that you need to apply these patterns:

Pattern

1. Error handling: FileNotFoundError over assert

# Wrong: assert is stripped with python -O
assert path.exists(), f"File not found: {path}"

# Correct: explicit, always runs
if not path.exists():
    raise FileNotFoundError(f"File not found: {path}")

Why: assert statements are stripped when Python runs in optimized mode (-O flag). Use explicit if not ...: raise for guards that must always be enforced.

2. No no-op string operations

# Wrong: identity replacements have no effect
df.columns = (
    df.columns.str.lower()
    .str.replace(" ", "_")
    .str.replace("energy_kcal", "energy_kcal")  # no-op
    .str.replace("protein_g", "protein_g")      # no-op
)

# Correct: keep only transformations that actually change something
df.columns = df.columns.str.lower().str.replace(" ", "_").str.replace("(", "").str.replace(")", "")

Before committing: verify each replace/transform actually changes at least one value in the real data.

3. Test every loader function

If your module has load_nutrition_df AND load_servings_df, both need tests. The primary function gets tested; the secondary one is often forgotten.

Checklist for each loader function:

4. Timezone assertions: be strict

# Weak: passes if tz is set to anything
assert df.index.tz is not None

# Strong: verifies the actual timezone
assert str(df.index.tz) == "UTC"

Use str(df.index.tz) == "UTC" (works across pandas versions and pytz/zoneinfo).

5. Enum-like parameters: explicit allowlist + ValueError

# Wrong: silently falls through to one branch for unknown inputs
if unit == "mg/dL":
    factor = 18.0
else:  # treats anything unknown as mmol/L — silent failure
    factor = 1.0

# Correct: raise on unsupported values
VALID_UNITS = {"mmol/L", "mg/dL"}
if unit not in VALID_UNITS:
    raise ValueError(f"Unsupported unit: {unit!r}. Expected one of {sorted(VALID_UNITS)}")

6. Hub vs. per-device architecture

Before building a per-device data loader, ask: does the user route this data through a hub?

Data source Common hub
Home sensors (temperature, CO2) Home Assistant
Wearable fitness data Apple Health, Google Fit, Garmin Connect
Smart home devices Home Assistant, SmartThings
Financial accounts Plaid, bank export

A single hub loader covers all current and future devices automatically. Per-device CSV parsers are only appropriate when there is no hub integration.

Anti-pattern: Building a per-device parser when a hub loader already exists or could cover the same data. One hub loader > N per-device parsers.

Review Checklist (before opening PR)

Outcome

Following these patterns results in:

Related

Origin

2026-03-06: Extracted from Sourcery AI review feedback across three data loader PRs (nutrition, glucose CGM, environmental sensors). All six code patterns appeared in at least two of the three PRs, indicating they are systematic gaps rather than one-offs. The hub vs. per-device pattern emerged from a closed PR where a per-device parser was superseded by an existing hub integration.

Match Keywords

data loader csv loader file not found assert path missing file test loader test timezone assertion unit validation