Fine-tuning NER vs. LLM prompting for asset ticker extraction in finance UGC

Posted on January 26, 2025


With the growing multi-task capability of LLMs, it has become easy to handle a specific ML task by prompting an LLM to focus on it: sentiment classification, named entity recognition, document analysis, OCR, and so on. Before LLMs, completing these tasks required collecting domain data and training a dedicated ML model. But quickly building a product by calling an LLM API such as OpenAI, Azure, or Google Gemini does not mean the performance meets requirements in practice. Prompt engineering alone cannot handle failure cases or consistently improve performance, let alone eliminate hallucination.

In this post, I study how well an LLM extracts asset tickers (stock tickers, crypto tickers, forex codes) from user messages in an AI-based fintech product. I compare a prompt-engineered commercial LLM API against a task-specific NER model, GLiNER, which has demonstrated strong performance on many tasks, as well as a version of GLiNER fine-tuned on our internal data.

We collected a few tens of thousands of user messages together with the LLM-extracted tickers, which we treat as ground-truth data when fine-tuning the GLiNER model.
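The post does not show the production prompt or API call, so here is a minimal hypothetical sketch of the LLM side of the pipeline: a prompt template asking for tickers and a defensive parser for the model's JSON reply. The prompt text, the JSON-list output format, and the function names are assumptions for illustration.

```python
import json

# Hypothetical prompt template -- the actual production prompt is not shown in the post.
PROMPT = (
    "Extract all asset tickers (stock, crypto, forex) that appear in the "
    "message below. Reply with a JSON list of ticker strings only.\n"
    "Message: {message}"
)

def build_prompt(message: str) -> str:
    """Fill the extraction prompt with one user message."""
    return PROMPT.format(message=message)

def parse_tickers(llm_reply: str) -> list[str]:
    """Parse the LLM reply; fall back to an empty list on malformed output."""
    try:
        tickers = json.loads(llm_reply)
    except json.JSONDecodeError:
        return []
    return [t.upper() for t in tickers if isinstance(t, str)]
```

For example, `parse_tickers('["btc", "AAPL"]')` returns `["BTC", "AAPL"]`, while a non-JSON reply yields `[]` instead of crashing the service.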

How does the LLM perform on finance UGC?

  • Analysis of messages containing tickers

We randomly selected 50 samples in which the LLM found at least one ticker, then manually checked them and annotated the true tickers. Compared against the human labels, the LLM reaches an F1 of only 62%, with recall of only about 52%.
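The F1 and recall figures above come from comparing predicted ticker sets against human-annotated gold sets per message. A micro-averaged computation over such pairs can be sketched as follows (the numbers in the usage note are toy values, not the post's data):

```python
def micro_prf(pairs):
    """Micro-averaged precision, recall, F1 over (predicted, gold) ticker sets."""
    tp = fp = fn = 0
    for pred, gold in pairs:
        pred, gold = set(pred), set(gold)
        tp += len(pred & gold)   # correctly extracted tickers
        fp += len(pred - gold)   # extracted but wrong
        fn += len(gold - pred)   # present in message but missed
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For instance, `micro_prf([({"AAPL"}, {"AAPL", "TSLA"}), ({"BTC"}, {"BTC"})])` gives precision 1.0, recall ≈0.667, F1 0.8.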

Even in this small random set, we found a few hallucination cases, i.e. the LLM outputs tickers that do not literally appear in the user message. For example, for "provide an in-the-money options trade for one of the companies (UnitedHealth Group Inc, AbbVie Inc) with an expiration date of xxx", the LLM outputs tickers such as "UNH, ABBV".

  • Analysis of messages not containing tickers

We randomly selected 50 samples in which the LLM found no tickers, then manually checked and annotated them. We found that 14 of these messages actually contain tickers that the LLM missed.

Task-specific NER: GLiNER

From our internal user messages and the LLM-generated tickers, we randomly sampled about 10K balanced samples (messages with and without tickers) to study performance. We also sampled a few tens of thousands of messages with tickers to fine-tune GLiNER. On this training data and test set, we compare asset-ticker F1 among:

  • Pretrained model (small & large): the small model works better than the large model on our data set
  • Fine-tuned pretrained model with all layers trainable (small)
  • Fine-tuned pretrained model with frozen layers (small)
  • Fine-tuned pretrained model with frozen layers (small) plus a valid-ticker filter
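GLiNER is a PyTorch model, and the post does not specify exactly which layers were frozen, so here is a sketch of the general freezing pattern on a toy module: disable gradients everywhere, then re-enable them only on the output head. The toy architecture is an assumption; GLiNER's real layers differ.

```python
import torch.nn as nn

# Toy stand-in for an NER model: embedding + "encoder" + output head.
# GLiNER's real architecture differs; this only illustrates the freezing pattern.
model = nn.Sequential(
    nn.Embedding(1000, 64),  # embedding layer, used as a feature extractor
    nn.Linear(64, 64),       # "encoder" layer
    nn.Linear(64, 3),        # output head: the part we keep trainable
)

# Freeze everything, then unfreeze only the output head.
for p in model.parameters():
    p.requires_grad_(False)
for p in model[-1].parameters():
    p.requires_grad_(True)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
# Only the head's parameters remain trainable: ['2.weight', '2.bias']
```

Passing only the trainable parameters to the optimizer (e.g. `torch.optim.AdamW(p for p in model.parameters() if p.requires_grad)`) then updates just the head during fine-tuning.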

Here is an F1 summary.

| GLiNER model                                              | F1  |
| --------------------------------------------------------- | --- |
| Pretrained small                                          | 43% |
| Pretrained large                                          | 18% |
| Fine-tuned small (full layers)                            | 79% |
| Fine-tuned small with frozen layers                       | 85% |
| Fine-tuned small with frozen layers + valid-ticker filter | 91% |

Conclusion

  • Fine-tuning a pretrained model on domain data always beats the pretrained model alone, which is normal and common practice
  • When the training data is not large, training only selected layers works better (normally the layers at or closest to the output; the embedding layer acts as a feature extractor, so we normally freeze it). This is also common practice when fine-tuning.
  • Error analysis shows many wrongly extracted tickers that are not actually valid given domain knowledge, so using domain knowledge as a filter improves precision
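The valid-ticker filter in the last conclusion can be as simple as a whitelist lookup over model predictions. The ticker set below is a hypothetical stand-in; in practice it would be built from exchange listings and other domain sources, which the post does not detail.

```python
# Hypothetical whitelist -- in production this would be loaded from
# exchange listings / symbol databases, not hard-coded.
VALID_TICKERS = {"AAPL", "TSLA", "UNH", "ABBV", "BTC", "ETH", "EURUSD"}

def filter_valid(predicted: list[str]) -> list[str]:
    """Drop predicted tickers that are not known valid symbols.

    Removing invalid predictions reduces false positives, which
    improves precision (and hence F1) without touching recall
    on valid tickers.
    """
    return [t for t in predicted if t.upper() in VALID_TICKERS]
```

For example, `filter_valid(["AAPL", "FOO", "btc"])` keeps `["AAPL", "btc"]` and discards the invalid symbol `FOO`.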

How GLiNER works on messages where the LLM found no tickers

On the 50 samples above where the LLM returned no tickers, GLiNER achieves an F1 of 59% with recall above 90%. Thus GLiNER increases recall as expected.

More importantly, the small GLiNER model can be deployed in a CPU-only service. This improves response latency (calling a commercial LLM API, e.g. OpenAI, may take a few seconds), eliminates hallucination, keeps user data private, and delivers better performance.