the data is the moat

most enterprise ai projects that fail share one root cause: the folder structure.

12-Mar-26

Most enterprise AI projects fail before the model runs. The folder structure already decided the outcome.

Companies want chatbots, search tools, report writers, RAG systems. The pitch deck talks about models. The real work is the data that goes in.

what the mess actually looks like

Open a typical company drive. PDFs of scans of scans. Slide decks where the table is a screenshot of an Excel cell. Three versions of the same report, two of them outdated. Folders named “FINAL_v4_USE_THIS_ONE.” Sensitive information mixed in with public material. Two languages in the same document. Metadata fields left blank.

This is the input. The model sits downstream.

When you push this corpus into a RAG pipeline, the symptoms are familiar. The chatbot answers confidently from the wrong source. The summary repeats a number that was corrected six months ago. Search returns five copies of the same memo. The customer-facing tool surfaces a file marked confidential.

The model is doing exactly what it should. The data is the problem.

the basics, which sound boring

For AI to work in a business, someone has to do the unglamorous part. Classify documents. Add metadata. Remove duplicates. Flag outdated content. Wall off sensitive material. Build a retrieval layer that returns the right document. Insert human review where judgment matters.
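Three of those steps fit in a short sketch: remove duplicates, flag outdated content, wall off sensitive material. This is a minimal illustration, not a production pipeline. The `Doc` fields, the `report_id` grouping key, and the `prepare` function are names assumed for the example. It catches exact duplicates only (after normalizing case and whitespace), and it treats the most recently modified file in a group as the current one, which a real corpus may not guarantee.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Doc:
    path: str
    text: str
    report_id: str          # hypothetical key that groups versions of one report
    modified: date
    sensitive: bool = False  # assumed to be set by an earlier classification step
    flags: set = field(default_factory=set)


def content_key(text: str) -> str:
    """Hash of case- and whitespace-normalized text, for exact-duplicate detection."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def prepare(docs, include_sensitive=False):
    """Return the documents that are safe to index for retrieval."""
    # 1. Remove duplicates: keep the first document seen for each content hash.
    seen = set()
    unique = []
    for doc in docs:
        key = content_key(doc.text)
        if key in seen:
            continue
        seen.add(key)
        unique.append(doc)

    # 2. Flag outdated content: within a report_id, only the newest version is current.
    newest = {}
    for doc in unique:
        current = newest.get(doc.report_id)
        if current is None or doc.modified > current.modified:
            newest[doc.report_id] = doc
    for doc in unique:
        if newest[doc.report_id] is not doc:
            doc.flags.add("outdated")

    # 3. Wall off sensitive material unless the caller is explicitly cleared for it.
    return [
        doc for doc in unique
        if "outdated" not in doc.flags and (include_sensitive or not doc.sensitive)
    ]
```

Near-duplicates (a rescanned PDF, a report with one changed number) need fuzzier matching than a hash, and deciding which version is authoritative is often where the human review comes in.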

None of this is exciting. None of it ships in a demo. All of it decides whether the system works in production.

why technical industries pay more for this

In oil and gas, mining, water treatment, finance, and health care, documents carry detail that costs money to miss. A skipped table in a frac design report. A scanned lab analysis the OCR mangled. A stale duplicate that overrides the corrected file. The wrong answer here is expensive, sometimes dangerous.

These industries also tend to have the worst data. Decades of reports in formats that changed every five years. Vendor names that changed as companies merged and split. Field shorthand the office never standardized.

That is where the gap between “we use AI” and “AI works for us” is widest.

the part nobody puts on the slide

A strong data pipeline is the actual moat. The model is a commodity. The clean, labeled, deduplicated, searchable version of your company’s own knowledge is the asset.

Better data gives better retrieval. Better retrieval gives better answers. The companies winning this round are the ones who did the data work first.

The engine I built on this idea: document intelligence.