ChatGPT Outperforms Standard Methods in Core Earnings Analysis
Five Minutes with USC's Matthew Shaffer, PhD
INTERVIEW
Regular readers know that I covered white collar crime for several years at Bloomberg News in New York. And I learned that one of the most challenging types of cases to prove was accounting fraud, in part because so many judgment calls are required.
Tax fraud cases are (generally) more straightforward: ‘You reported earning X dollars last year, but you actually earned Y dollars.’
Prior to ChatGPT, accounting judgment calls had to be made by human analysts. It’s hard to create deterministic rules for that task.
That’s what makes a recent academic paper so interesting: it uses GPT-4o, the OpenAI model that powers ChatGPT, to measure “core earnings.” The research, written by Matthew Shaffer at the University of Southern California and Charles C.Y. Wang at Harvard Business School, showed that ChatGPT could be used to estimate a company’s recurring profitability better than standard measures.
I spoke with Professor Shaffer for this week’s Five Minutes With Interview.
The paper, Scaling Core Earnings Measurement with Large Language Models, shows how AI could bridge the gap between quantitative metrics, like price-to-earnings ratios, and qualitative factors that have historically been hard to analyze at scale.
This interview has been edited for clarity and length.
What did you seek to study in your paper?
Our paper examined whether large language models could effectively quantify companies’ “core earnings”—the recurring, core profitability from their main business activities—at scale. We chose this question because we thought that this is the kind of task that is most amenable to the distinctive new capabilities opened up by large language models—as opposed to the prior “rule-based” symbolic artificial intelligences. This is a task that requires reasoning over a lot of unstructured text, applying common sense judgments contextualized in general background knowledge, about accounting and industries, etc.
With that research question in mind, we approached it in a quite straightforward way, developing two prompting strategies and judging the models’ outputs.
Then we evaluated the quality of the measures we got from these approaches. Now, by its nature, “core earnings” is not directly reported anywhere; there’s no single external benchmark for the “correct” number. So our approach to testing the quality of the measures is to ask, “if this were indeed a good core earnings measure, what properties would it have empirically?” For example, a good core earnings measure should be informative for predicting future Net Income. We have many such tests, and we benchmark the LLM-based measures against widely-used proxies from Compustat, a standardized financial data provider.
Overall, our LLM-based measure from our gold standard approach outperforms the Compustat alternatives in most, though not all, standard tests. For example, Compustat’s OPEPS (operating earnings per share) does slightly better in regressions predicting future Net Income – but, interestingly, our LLM-based measure does better at predicting average Net Income over the next two years.
But, I should emphasize that our goal wasn’t really earnings prediction per se; these are just standard ways of empirically testing whether it’s a good core earnings measure. That’s what we’re most interested in: this concept of core earnings, which matters to investors but typically requires qualitative, firm-specific analysis. We wanted to see if LLMs could be used to quantify this at scale.
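(For readers curious what such a validation test looks like in practice, here is a minimal sketch in Python. The dataset and column names are hypothetical, and the paper’s actual specifications differ in detail.)

```python
# Minimal sketch of a predictive-validity test: does a candidate "core
# earnings" measure help predict future Net Income? All column names are
# hypothetical, and the paper's actual specifications differ in detail.
import pandas as pd
import statsmodels.formula.api as smf

panel = pd.read_csv("firm_year_panel.csv")  # hypothetical firm-year dataset

# Regress next year's Net Income on each candidate measure and compare fit.
fit_llm = smf.ols("net_income_next_year ~ core_earnings_llm", data=panel).fit()
fit_cst = smf.ols("net_income_next_year ~ core_earnings_compustat", data=panel).fit()

print(f"LLM-based measure R^2: {fit_llm.rsquared:.3f}")
print(f"Compustat proxy   R^2: {fit_cst.rsquared:.3f}")
```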
Can you break down what you mean by core earnings?
Sure. Let’s start from fundamentals. Public companies report their financial statements using GAAP (generally accepted accounting principles), as specified by the FASB. But investors don’t really take the bottom-line, GAAP-defined measure of Net Income for granted; that isn’t the number they have in mind when pricing stocks. There are a number of principles in GAAP that investors don’t like, or items affecting Net Income that they don’t value as much—one-time items, revaluations of accounting allowances, etc. Investors care about the “core” profitability, the owners’ earnings accruing from the ongoing, central business activities. And that makes perfect sense: as an equity investor, you’re buying the future profits, so when you evaluate the current earnings, from a valuation / investment perspective, you care about the recurring component.
Could you go into more detail on the different prompting strategies you used?
First, we developed what I’d call a “lazy” approach: just giving the model a definition and a full 10-K. The goal was to benchmark the model’s performance using only its ‘native’ reasoning capacity, rather than expert guidance, and to get a sense of what an analyst should expect if they use these models in a low-effort and credulous way. The outputs from this approach had a number of issues and flaws, which I found interesting and revelatory about the nature of autoregressive language models in a setting like this, as we detail in the paper.
But we also wanted to scope out the potential utility of LLMs in this task. We developed a prompting approach where we broke down the analysis into three sequential steps:
1. Have the model identify unusual losses or expenses in the 10-K
2. Look for unusual gains or income
3. Finally, tabulate these findings to produce an adjusted core earnings measure
So, it’s fairly rote in that sense. This helped the model “stay on task.” Also, notably, we don’t tell the model that we’re trying to measure core earnings, nor do we specify how we plan to validate the results. We just tell it how to proceed. Finally, I’d note that our ‘sequential’ prompt approach is the ‘gold standard’ in the context of our paper, but, in reality, it’s not particularly sophisticated or complex and yet we still got meaningful results. We chose a strategy that we thought would be good enough to be useful, while simple enough for us to manage at scale. A dedicated analyst or quantitative firm could likely do much more with these tools.
For example, just imagine you were a sell-side analyst preparing a report on a company and using ChatGPT to help you analyze its 10-K: you probably wouldn’t stop after just three chats and accept the outputs uncritically. Or, imagine you had the budget of a quant investment firm and a smaller set of companies/industries to cover: you could fine-tune, use multiple models and prompts, break down each 10-K even further, etc. So in that sense our “gold standard” may represent a fairly low benchmark for the potential utility in practical use.
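(To make the sequential approach concrete, here is a rough sketch of what such a workflow could look like with the OpenAI Python SDK. The prompts, model choice, and structure are illustrative assumptions, not the prompts used in the paper.)

```python
# Rough sketch of a three-step sequential prompting workflow, in the spirit of
# the approach described above. Prompts and model name are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(messages: list, prompt: str) -> str:
    """Send one follow-up prompt and keep the running conversation context."""
    messages.append({"role": "user", "content": prompt})
    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    return answer

def analyze_10k(ten_k_text: str) -> str:
    messages = [
        {"role": "system", "content": "You are a careful financial analyst."},
        {"role": "user", "content": "Here is a company's 10-K:\n\n" + ten_k_text},
    ]
    ask(messages, "Step 1: Identify any unusual or non-recurring losses and expenses.")
    ask(messages, "Step 2: Identify any unusual or non-recurring gains and income.")
    # Step 3: tabulate the items into an adjusted earnings figure.
    return ask(messages, "Step 3: Tabulate these items and, starting from reported "
                         "Net Income, compute an adjusted earnings measure.")
```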
What did you find about short-term versus long-term performance?
One of the more intriguing findings was how our LLM-based measure compared to Compustat's OPEPS (operating earnings per share) in predicting future earnings over different horizons. In the short term—one year out—Compustat’s measure actually performed slightly better. But as we extended the horizon to the average profitability over the next two years, the performance of Compustat’s measure decreased, while ours improved. Both parts of that are interesting: intuitively, you might think that by ‘smoothing’ out shocks over two years, the average Net Income should be more predictable, even for the Compustat measure.
I suspect this might have to do with some of the inconsistencies in Compustat’s measures. It could be that our LLM-based analyses may miss smaller items or have hallucinations from time-to-time, meaning it might not predict Net Income as well over short horizons. But, speculating, its conceptual consistency, and perhaps even some of the ‘common sense judgments’ from the language model, might make it a better anchor for the long-term.
What trends are you seeing in AI models?
One trend is toward these new reasoning-specific models, such as OpenAI’s “Orion-class models,” including o1-preview and o1-mini, and new offerings from DeepSeek, etc. The other interesting trend is model distillation, where developers prune large models to make them smaller and faster, saving computing cost, and claim they perform just as well or better on the “benchmarks.” Something I’m thinking about is the trade-off: whether we’re losing something that’s harder to quantify, something missing from the benchmarks used by the developer community. Some of that human qualitative factor might get a bit weaker in a way that’s hard to measure.
For example, anecdotally, in my own work, I find o1-preview and o1-mini to be great for programming and mathematical modeling, but on more qualitative tasks, I sometimes feel they’re missing something that I valued in the original GPT-4 (remember “Sydney”?). What exactly is that something? Well, the whole problem is that it’s hard to pin down. But in future research papers, I’ll be keen to evaluate whether the “reasoning” models that will be the next trend do, in fact, outperform on the more open-ended tasks that folks in finance (rather than software engineering) might use them for.
How do you see this affecting strategy analysis?
Great question. As I mentioned, I think the big opportunity from LLMs is that now we can bring computational speed, breadth, and neutrality to things that previously were the domain of “human-like judgment.” Core earnings analysis is one such task; strategy analysis is another. In some ways, these tasks are similar, but in others, they are different. So it’s an interesting and open question. Strategy analysis is inherently very open-ended, relying heavily on qualitative reasoning. AI could excel here by incorporating vast amounts of background knowledge and reasoning intelligently. But it’s also possible that the task is too unstructured for these models.
The task in our paper—core earnings analysis—involves judgment and lots of text, yes, but it is at least somewhat more anchored: “Start with the reported Net Income, and use the 10-K annual report to ground your adjustments.” Strategy analysis may be a different beast. So it’s an open and important question whether valid strategy analysis is beyond the capability of LLMs. There’s a research agenda in bounding the capabilities of these models in different types of financial applications—identifying tasks where they excel versus those that remain better suited for human judgment.
What's the opportunity you see for bridging quantitative and qualitative analysis?
Broadly, the big opportunity lies in using this new class of AI to bridge the gap between simple quantitative metrics—like price-to-earnings ratios—and qualitative factors that we know matter but haven’t been able to incorporate systematically and at scale, in both empirical research and trading strategies. LLMs can do the context-specific qualitative analysis that used to be the purview of the fundamental analyst covering a small set of firms.
For example, a simple quantitative trading strategy might sort stocks by their P/E ratio, the notion being that high vs. low P/E ratios indicate stocks that are over- vs. under-valued on average. Of course, some high-P/E stocks are overvalued, but others “deserve” the multiple because of the high growth prospects that flow from their strategic advantages. With LLMs, in principle, you can now scale trading strategies that condition the quantitative metric on a “human-like” qualitative evaluation of the business model. That seems like the low-hanging fruit—or rather, fruit that was previously out of reach, but which LLMs provide a ladder towards.
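(As a toy illustration of that idea, here is a sketch of a P/E screen conditioned on a qualitative score. The data, column names, and the stubbed-out LLM scoring step are hypothetical; this is not a strategy from the paper.)

```python
# Toy sketch: condition a simple low-P/E value screen on an LLM-derived
# qualitative score of the business model. Data and column names are hypothetical.
import pandas as pd

stocks = pd.read_csv("universe.csv")  # hypothetical: ticker, pe_ratio, business_summary

def llm_quality_score(summary: str) -> float:
    """Stub: in practice this would prompt an LLM (e.g., GPT-4o) to rate the
    durability of the business model on a 0-to-1 scale."""
    return 0.5  # neutral placeholder so the sketch runs end to end

stocks["quality"] = stocks["business_summary"].map(llm_quality_score)

# Classic value screen: cheapest quintile by P/E...
cheap = stocks[stocks["pe_ratio"] <= stocks["pe_ratio"].quantile(0.2)]
# ...but drop names where the qualitative read suggests the low multiple is
# deserved (i.e., try to avoid "value traps").
candidates = cheap[cheap["quality"] >= 0.5]
print(candidates[["ticker", "pe_ratio", "quality"]].sort_values("pe_ratio"))
```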
Previously, we had to choose between the “depth” of firm-specific analysis, the analyst reading the full 10-K, vs. the “breadth” of quantitative approaches. LLMs will change that. That’s the most fundamental shift.
IN CASE YOU MISSED IT
Recent Five Minutes with Interviews
Moody’s Sergio Gago on scaling AI at the enterprise level.
Ravenpack | Bigdata.com’s Aakarsh Ramchandani on AI and NLP
PhD candidate Alex Kim on executive tone in earnings calls
MDOTM’s Peter Zangari, PhD, on AI For Portfolio Management
Arta’s Chirag Yagnik on AI-powered wealth management
Thanks for reading!
Drop me a line if you have story ideas, research, or upcoming conferences to share. [email protected]