Automating a 37-Location ETL Pipeline (2 Hours to 12 Minutes a Day)

At UMass Amherst Dining, reporting across 37 locations used to eat about two hours a day. Someone pulled numbers from each location, pasted them into a master sheet, reconciled the inevitable mismatches, and only then could anyone ask an actual question of the data. By the time the report was ready, it was already stale. None of that was a modeling problem. It was an ETL problem (extract, transform, load), done by hand. So I automated it.

// The whole point in one sentence

The win was not a clever model. It was deleting two hours of manual reporting a day and making the numbers trustworthy enough to forecast on.

01 The two-hour problem

Thirty-seven locations, one master sheet, and a person stitching it together by hand every morning. The work was slow, easy to get wrong, and stale by the time it landed. Worst of all, nobody trusted the numbers enough to plan on them. The fix was not a fancier dashboard. It was making the boring part disappear.

02 The pipeline at a glance

Three stages, each with one job. The messy reality of 37 locations goes in one end; a single trustworthy reporting layer comes out the other.

Messy in, trustworthy out. Two hours of manual reporting a day became about twelve. Teal = the automated path.

03 Extract: meet the data where it lives

Thirty-seven locations means thirty-seven slightly different habits. Different column orders, different names for the same thing, the occasional missing day. The first job was not transformation, it was building extraction that tolerates that mess instead of breaking on it.

The one rule: standardize on ingest, log what does not fit, and never silently drop a row. A pipeline that hides bad data is worse than one that stops and tells you.

04 Transform: make it boring and repeatable

The transform layer normalizes everything into one shape: consistent keys, consistent units, consistent location IDs. The rule I held onto was idempotency. Running the pipeline twice on the same inputs produces exactly the same output. That is what lets you re-run it without fear when a location submits a correction at 4pm.

Validate at the edges
Row counts, value ranges, and a "does every location actually show up today?" check that fails loudly instead of producing a quietly wrong number.
Keep an audit trail
What ran, when, and what it touched, so the next person (or future me) can actually read what happened.
Write it so a teammate can follow
A pipeline only one person understands is a liability, not an asset.

05 Load: one source of truth

Everything lands in a single reporting layer, so leadership stops reconciling spreadsheets and starts asking questions. The two hours of manual reporting dropped to about twelve minutes, and most of that is now review, not assembly.

2 hours → 12 minutes a day, mostly review now

06 The real payoff: forecasting on data you trust

Once the numbers were consistent and current, they were good enough to forecast on. That opened the door to data-driven inventory forecasting across locations, anticipating demand instead of reacting to last week's stockout. You can't forecast on data you don't trust, and you can't trust data that's reassembled by hand every morning.

The lesson I keep relearning: the highest-leverage data work is often the least glamorous. Remove the toil, make the inputs trustworthy, and the interesting analysis becomes possible.

ETL Idempotency Edge validation Audit trail Inventory forecasting

07 The live tool it became

The same kitchen operation ran another job on a hand-built 39-tab spreadsheet: turning the day's order report into production sheets. I rebuilt that as a tool you can actually use, the same "kill the toil" idea pushed one step further.

// Live tool, try it

CPK Production Planner

Paste a commissary kitchen's daily order report and get the make list and per-station pick sheets across 36 locations, instantly, entirely in your browser. Nothing is sent anywhere.

Try the CPK Production Planner →

↑ back to top