Wiring GPT-4 to COBOL metadata — what broke

At work we have a pile of legacy systems that nobody fully understands any more. There is COBOL that was written in the nineties. There is an ETL layer on top of that, which was written later by a different team. There is a third layer that pulls data out into reports. Every time an analyst asks a question like "where does this field on the dashboard actually come from?", someone has to spend hours reading old code and old documents to find the answer.

So I built a chatbot on top of all this, using GPT-4 and LangChain and a vector store. The idea was simple: index all the source code and metadata once, and let the analyst ask questions in plain English.

The first version took me about a day. I was proud of it. It also did not work at all.

Here is what went wrong, and what fixed it.

The first version

I did what every tutorial says to do:

  1. Read every file in the repository.
  2. Split each file into chunks of about 500 tokens.
  3. Create embeddings for each chunk and save them in a vector database.
  4. When the analyst asks a question, find the top 5 most similar chunks and send them to GPT-4 with the question.
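
The four steps above can be sketched in a few lines. This is only an illustration, not the code I ran: a bag-of-words counter stands in for a real embedding model, whitespace-separated words stand in for tokens, and the names `split_fixed`, `embed`, `cosine`, and `top_k` are made up for the example.

```python
import math
from collections import Counter


def split_fixed(tokens, size):
    """Step 2: cut a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def embed(tokens):
    """Step 3 (toy stand-in): a word-count vector instead of a learned embedding."""
    return Counter(t.lower() for t in tokens)


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question, chunks, k=5):
    """Step 4: return the k chunks most similar to the question."""
    q = embed(question.split())
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

With a chunk size of 4, the sentence "compute the premium based on state coverage and age" splits into three chunks, and a question about computing the premium retrieves only the first one. The inputs it depends on land in a different chunk.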

This is the standard shape of a RAG system. It is in every blog post and every demo, which is why it came together so quickly.

Then I tested it with real questions from real analysts. Almost every answer was wrong, was confident nonsense, or was "I don't know" when the answer was clearly sitting in the code.

Problem one: chunking by token count is wrong for code

COBOL is not English. It has divisions, sections, and paragraphs. A PROCEDURE DIVISION paragraph is one logical unit: it does one thing. If you cut it in half at token 500, the first half says "compute the premium" and the second half says "based on state, coverage, and age." When the analyst asks "how do we compute the premium?", the retrieval might pull back the first half and miss the second half. The answer GPT-4 gives back is then technically true but useless.

The fix: chunk by logical source unit, not by tokens. For COBOL, that meant splitting by paragraph. For the ETL layer, that meant splitting by job. For the metadata files, one chunk per table. Each chunk is a complete idea, not a slice of one.
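
A minimal sketch of paragraph-level chunking for COBOL, under some loud assumptions: the source is fixed-format (columns 1-6 sequence area, column 7 indicator), a paragraph header is a single name followed by a period starting in column 8, and section headers are ignored. The `Chunk` class, the `chunk_by_paragraph` function, and the sample program are hypothetical, not lifted from the real codebase.

```python
import re
from dataclasses import dataclass

# Assumption: a paragraph header starts in column 8 (Area A) and is one name plus a period.
# Statements start in Area B (column 12 or later), so they do not match.
PARAGRAPH_HEADER = re.compile(r"^.{6} ([A-Z0-9][A-Z0-9-]*)\.\s*$")


@dataclass
class Chunk:
    file: str
    paragraph: str
    text: str


def chunk_by_paragraph(source: str, file: str) -> list[Chunk]:
    """Return one chunk per PROCEDURE DIVISION paragraph."""
    chunks: list[Chunk] = []
    in_procedure = False
    name = None
    body: list[str] = []
    for line in source.splitlines():
        if not in_procedure:
            # Skip everything up to and including the PROCEDURE DIVISION header.
            in_procedure = "PROCEDURE DIVISION" in line.upper()
            continue
        match = PARAGRAPH_HEADER.match(line)
        if match:
            if name is not None:
                chunks.append(Chunk(file, name, "\n".join(body)))
            name, body = match.group(1), [line]
        elif name is not None:
            body.append(line)
    if name is not None:
        chunks.append(Chunk(file, name, "\n".join(body)))
    return chunks


# Hypothetical sample program, for illustration only.
SAMPLE = """\
       PROCEDURE DIVISION.
       COMPUTE-PREMIUM.
           MOVE BASE-RATE TO WS-PREMIUM.
           PERFORM APPLY-STATE-FACTOR.
       APPLY-STATE-FACTOR.
           MULTIPLY STATE-FACTOR BY WS-PREMIUM.
           EXIT.
"""
```

Calling `chunk_by_paragraph(SAMPLE, "PREMIUM.cbl")` yields two chunks, one per paragraph, each carrying the file and paragraph name that a citation can later point to.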

This one change lifted the quality of answers more than anything else I did later.

Problem two: the model was hallucinating field names that did not exist

When the analyst asked about a field, GPT-4 would sometimes invent a plausible-sounding field name or COBOL paragraph that looked right but did not exist in the code. The analyst would then go search for it, not find it, and lose trust in the whole system.

This is the classic hallucination problem, and the fix was not subtle. I changed the prompt to say:

Only answer based on the provided passages. If the passages do not contain the answer, say so. Do not guess. Every claim must cite the file and paragraph it came from.

And I made the response format require citations: each sentence that claimed something had to end with [file: X.cbl, paragraph: Y]. If the model couldn't cite it, it wasn't allowed to say it.
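
The citation rule can also be checked mechanically after the model replies. The sketch below is an assumption about how such a check might look, not a description of what I actually shipped: it assumes the model puts each claim on its own line, and `find_unsupported_claims` and the `retrieved` set of (file, paragraph) pairs are hypothetical names.

```python
import re

# A claim must end with the citation format from the prompt,
# optionally followed by a full stop.
CITATION_AT_END = re.compile(r"\[file: ([^,\]]+), paragraph: ([^\]]+)\]\.?\s*$")


def find_unsupported_claims(answer: str, retrieved: set[tuple[str, str]]) -> list[str]:
    """Return claims that have no citation, or cite a source that was not retrieved."""
    unsupported = []
    for line in answer.splitlines():
        claim = line.strip()
        if not claim:
            continue
        match = CITATION_AT_END.search(claim)
        if match is None:
            # No citation at the end of the claim.
            unsupported.append(claim)
            continue
        cited = (match.group(1).strip(), match.group(2).strip())
        if cited not in retrieved:
            # The citation points at something the retriever never returned.
            unsupported.append(claim)
    return unsupported
```

An empty result means every claim is backed by a passage that was actually sent to the model. A plain "the passages do not contain the answer" reply would be flagged by this check as written, so it would need to be allowed through separately.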

This cut hallucinations to near zero. Answers became shorter and more boring, which turned out to be exactly what the analysts wanted. They were not here to be impressed. They were here to find facts.

Problem three: retrieval was missing the obvious answer

Some questions the system got wrong were almost embarrassing. An analyst would ask "what table stores the policy holder's address?" and the retrieval would pull back five chunks about address formatting or mailing labels, none of which was the answer. The actual answer was one line in a metadata file that said TABLE_NAME = POLICY_ADDRESS, and that line was sitting in the vector store. It just wasn't being retrieved, because "policy holder address" embedded closer to "address formatting" than to POLICY_ADDRESS.

The fix was to not rely on vector search alone. I added a second pass of plain keyword search on top — if the analyst's question contained a word that also appeared in a chunk, that chunk got a boost. This is called hybrid retrieval, and for this kind of structured source data it is almost always better than pure vector search.
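
Here is a rough sketch of the keyword boost, assuming the vector scores have already been computed elsewhere. It uses simple word overlap rather than a proper ranking function such as BM25, and the stopword list, the `keyword_weight` value, and the example scores are invented for illustration.

```python
import re

# Tiny, hypothetical stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "the", "of", "is", "s", "what", "which", "does"}


def keywords(text):
    """Lowercase, split on anything that is not a letter or digit, drop stopwords."""
    return {t for t in re.split(r"[^a-z0-9]+", text.lower()) if t and t not in STOPWORDS}


def hybrid_rank(question, chunks, vector_scores, keyword_weight=0.5):
    """Rank chunk ids by vector score plus a boost for shared keywords.

    chunks: dict of chunk id -> chunk text
    vector_scores: dict of chunk id -> similarity score from the vector store
    """
    q = keywords(question)
    ranked = []
    for chunk_id, text in chunks.items():
        # Fraction of the question's keywords that also appear in the chunk.
        overlap = len(q & keywords(text)) / len(q) if q else 0.0
        score = vector_scores.get(chunk_id, 0.0) + keyword_weight * overlap
        ranked.append((chunk_id, score))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

Because the tokenizer splits on underscores, POLICY_ADDRESS contributes both "policy" and "address", so the metadata line can overtake an address-formatting chunk even when its vector score is slightly lower.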

After this, "obvious" questions started getting obvious answers.

Problem four: the eval was "I tried five questions and they all worked"

This was not really a technical problem. It was a process problem. I had no way to know, after every change, whether things were getting better or worse. I would make a change, try three questions, and declare victory or defeat based on those three.

So I built a small eval harness. I collected 40 real questions that analysts had asked, along with the correct answer for each. After every change I would run all 40 through the system and measure two numbers:

  • How often did the system retrieve the correct source passage? (Retrieval accuracy.)
  • How often was the final answer factually correct, with a valid citation? (End-to-end accuracy.)
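
A harness like this can be very small. The sketch below assumes each test case records the source that should be retrieved and a key phrase the answer must contain. The correctness check (substring match plus a mention of the expected source) is a crude stand-in for real grading, and `EvalCase`, `run_eval`, and the `retrieve` and `answer` callables are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    expected_source: str  # the file that should be retrieved and cited
    expected_answer: str  # key phrase the final answer must contain


def run_eval(
    cases: list[EvalCase],
    retrieve: Callable[[str], list[str]],
    answer: Callable[[str, list[str]], str],
) -> dict[str, float]:
    """Run every case through the system and return the two accuracy numbers."""
    if not cases:
        raise ValueError("need at least one eval case")
    retrieval_hits = 0
    end_to_end_hits = 0
    for case in cases:
        sources = retrieve(case.question)
        if case.expected_source in sources:
            retrieval_hits += 1
        reply = answer(case.question, sources)
        # Crude check: the right phrase appears and the right source is cited.
        has_answer = case.expected_answer.lower() in reply.lower()
        has_citation = case.expected_source in reply
        if has_answer and has_citation:
            end_to_end_hits += 1
    return {
        "retrieval_accuracy": retrieval_hits / len(cases),
        "end_to_end_accuracy": end_to_end_hits / len(cases),
    }
```

Passing the retriever and the answering step in as plain functions keeps the harness independent of the pipeline, so the same 40 cases can be rerun after every change.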

Now I had numbers. Changes that looked good sometimes made things worse, and I could see it. Changes that felt like polish sometimes turned out to be the single biggest quality gain.

This felt like overkill for a small internal tool. It was not. The eval harness is the thing that made me trust my own work.

What I would do differently

If I started over tomorrow:

  1. Start with the eval harness. Before any chunking, any embeddings, any prompts. A small set of real questions with known answers is the floor you build the rest on.
  2. Chunk by logical units from day one. Do not do the naive "split every 500 tokens" thing just because the tutorial does.
  3. Assume you will need hybrid retrieval. Do not waste a week tuning cosine similarity when BM25 is most of the answer.
  4. Make citations a hard requirement in the prompt. Answers without citations should not be allowed to exist.
  5. Keep the final prompt short. Every word of instruction you add is another word the model can misread.

What I did not do

There are a lot of things I did not do that a bigger project would need. No agentic loops. No tool use. No fine-tuning. No streaming. No cost optimization beyond a basic per-user token budget. This was a single-purpose tool for a single team, and it only needed to do one thing well.

Most of the "advanced" RAG techniques I have read about would make this worse, not better. Simpler almost always wins.

If you are building something similar and want to compare notes — email me.