Building my first AI product: taxi invoice metadata extraction

Posted on August 15, 2025

Background

In the current era, AI is a fancy term, talked about everywhere. But when we visit and learn from SMEs in Singapore, it is a different scenario. For example, the ERP systems in SMEs have no function to auto-process taxi invoices. The admin still needs to upload the receipt and manually key in the date, origin, destination, total payment, and so on, from the invoice. As a tech guy, I can hardly imagine how this happens, because the technology is actually ready for the problem.

There is another case from a company we visited in the marine industry. Their staff need to manually check hundreds of reports submitted by their clients to validate whether there are risk, governance, or policy issues.

Landing AI in a business is not simply calling an LLM API and doing prompt engineering to solve the problem. We need to understand what the problem is, what the current processing flow in the company looks like (understanding input and output), and how humans process it (business rules and domain knowledge). The next step is to transform that business description into data, AI/ML models, and frontend and backend pipelines. Sometimes we also need to think about how to integrate the solution into the customer's existing ERP system.

In this post, I will share a small use case: automating taxi invoice processing. Upload taxi invoices => structured data exported to Excel.

Objective

The requirement to automate taxi invoice processing came from my wife, an admin who needs to key in hundreds of taxi invoices into her company's ERP system every month. The objective is to develop an app where the customer uploads a batch of taxi invoice images and exports structured taxi metadata to Excel.

Architecture

Backend

The backend is built with FastAPI. The core functions include:

  • Process a batch of invoice images uploaded from the frontend. To extract structured data, OCR and an LLM agent are combined. The general processing flow is:
    • OCR extracts the text from the invoice
    • Taxi invoice agent: a task-specific agent that takes the OCR text as context and generates structured output
  • Payment processing: Stripe payment links are used, covering payment URL generation and payment status checks via a Stripe webhook.
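The OCR-then-agent flow above can be sketched as follows. This is a minimal illustration, not the production code: the prompt wording and the field names (date, pickup, dropoff, total) are assumptions, and the actual OCR engine and LLM API calls are omitted — only the structured-output parsing step is shown.

```python
import json
import re
from dataclasses import dataclass, asdict

@dataclass
class TaxiInvoice:
    date: str
    pickup: str
    dropoff: str
    total: float

# Hypothetical prompt template; OCR text is injected as context
EXTRACTION_PROMPT = (
    "Extract the trip date, pickup location, dropoff location and total fare "
    "from this taxi receipt text. Reply with JSON only, using the keys "
    '"date", "pickup", "dropoff", "total".\n\nReceipt text:\n{ocr_text}'
)

def parse_agent_reply(reply: str) -> TaxiInvoice:
    """Parse the LLM reply into a validated record.

    LLMs often wrap JSON in prose or code fences, so we pull out the
    first {...} block before parsing.
    """
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in LLM reply")
    data = json.loads(match.group(0))
    return TaxiInvoice(
        date=str(data["date"]),
        pickup=str(data["pickup"]),
        dropoff=str(data["dropoff"]),
        total=float(data["total"]),  # fail early if the fare is not numeric
    )

# Example: a typical (made-up) agent reply wrapped in a code fence
reply = ('```json\n{"date": "2025-08-01", "pickup": "Changi Airport", '
         '"dropoff": "Raffles Place", "total": "32.40"}\n```')
invoice = parse_agent_reply(reply)
print(asdict(invoice))
```

Validating and coercing the fields here, rather than trusting the raw LLM output, keeps malformed replies from reaching the Excel export.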

Frontend

Before this project, Streamlit was my favorite for building proofs of concept (POCs), but it is not suitable for a product. I am not familiar with frontend development, though I know of options like JS, TS, React, and Flutter. As a solo developer, I want to reduce the effort of developing a cross-platform app, so Flutter is the best choice — with a somewhat steep learning curve to pick up Flutter and Dart.

The good news is that I can leverage LLM coders to help me. It is now normal to open Claude, ChatGPT, Qwen, Gemini, and Grok right after starting up the computer every day. I use the free versions and just switch among them.

The core functions in the frontend include:

  • Upload taxi invoice images
  • Trigger payment process
  • Trigger the backend invoice process
  • View, edit, and export the results to Excel
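For the export step, the idea can be sketched with a dependency-free CSV serializer (CSV opens directly in Excel; a real .xlsx export would go through a library such as openpyxl, but the column layout is the same). The field names are assumptions:

```python
import csv
import io

def invoices_to_csv(invoices: list) -> str:
    """Serialize extracted invoice records into CSV text for Excel."""
    columns = ["date", "pickup", "dropoff", "total"]  # assumed field names
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns)
    writer.writeheader()
    for inv in invoices:
        # Missing fields become empty cells rather than raising
        writer.writerow({c: inv.get(c, "") for c in columns})
    return buf.getvalue()

rows = [{"date": "2025-08-01", "pickup": "Changi Airport",
         "dropoff": "Raffles Place", "total": 32.40}]
print(invoices_to_csv(rows))
```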

Payment

I asked AI to recommend a payment system that supports PayNow and cards. It suggested HitPay and Stripe. HitPay requires business registration to use its service. Stripe's policy is a little looser: at least at the development stage, it does not require one.

At the beginning, I tested Stripe payments using real money, which incurred small transaction fees charged by Stripe. Then I found that Stripe has a Test Mode, with mock cards provided for testing payment logic. It is strongly recommended to use Test Mode in Stripe.
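In production, webhook signature checking is normally done with the official stripe library (`stripe.Webhook.construct_event`), but it helps to see what that check does under the hood. Per Stripe's documented scheme, the Stripe-Signature header carries a timestamp and an HMAC-SHA256 of `"{timestamp}.{raw_body}"` keyed by the endpoint's signing secret. A hand-rolled sketch (the secret below is made up):

```python
import hashlib
import hmac

def verify_stripe_signature(payload: bytes, sig_header: str, secret: str) -> bool:
    """Verify a Stripe webhook signature by hand.

    The Stripe-Signature header looks like "t=<timestamp>,v1=<hex digest>",
    where the digest is HMAC-SHA256 over "<timestamp>.<raw body>".
    """
    parts = dict(item.split("=", 1) for item in sig_header.split(","))
    signed_payload = f"{parts['t']}.".encode() + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid timing attacks
    return hmac.compare_digest(expected, parts["v1"])

# Simulate what Stripe would send, using a made-up signing secret
secret = "whsec_testsecret"
body = b'{"type": "checkout.session.completed"}'
ts = "1700000000"
sig = hmac.new(secret.encode(), f"{ts}.".encode() + body, hashlib.sha256).hexdigest()
header = f"t={ts},v1={sig}"
print(verify_stripe_signature(body, header, secret))  # prints True
```

This also makes clear why the webhook handler must read the raw request body: re-serialized JSON would change the bytes and break the signature.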

Deploy

As a solo developer, I do not have a sufficient budget for hosting. Fortunately, a few platforms provide free resources to deploy services. After some surveying, I selected Vercel to deploy the Flutter frontend (serverless) and Railway to deploy the backend API. Both support GitHub CI integration, so you just push updates to your GitHub repos, which triggers updates to your frontend and backend services.

I also added a subdomain to my Vercel frontend service, which looks better than the default Vercel app URL.

Reflection

  • OCR solution: I usually run open-source models on my personal GPU (an RTX 4070). For the first version I tested Docling, PaddleOCR, and multi-modal LLMs extracting structured data directly from the image (no OCR). The multi-modal LLMs performed worst (tested in ChatGPT, Gemini, and Qwen): when processing a taxi image, the extracted text had heavy hallucination, which means a dedicated OCR step is a must. But Docling and PaddleOCR have high computation requirements (e.g. storage, RAM, vCPUs) that the free tiers of Vercel and Railway cannot support. In the end, the Google Vision API is used, which allows 1,000 calls per month (USD 1.50 per 1,000 calls) — enough for testing.

  • Optimize the Docker image: the limited resources forced me to optimize the Dockerfile and remove unnecessary packages.

  • Iteratively refine LLM-generated code:

    • LLM writes code => local tests validate that the logic is as expected and find bugs => fix bugs via the LLM, or add/remove/modify code with a detailed explanation of the feature requirements.
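For the Docker image optimization mentioned above, the usual levers are a slim base image and skipping pip's cache. A sketch of what such a Dockerfile might look like for a FastAPI service on Railway (file names and the `$PORT` convention are assumptions):

```dockerfile
# Slim base image instead of the full python image
FROM python:3.12-slim

WORKDIR /app

# Install only what the API needs; --no-cache-dir keeps pip's cache out of the image
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Railway injects $PORT at runtime
CMD ["sh", "-c", "uvicorn main:app --host 0.0.0.0 --port ${PORT:-8000}"]
```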
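On the Vision API choice above: text detection is a plain REST call against `images:annotate`. A minimal sketch of the request body and response parsing follows (the HTTP call itself, authenticated with an API key, is omitted; the sample response is a truncated, hypothetical one):

```python
import base64

VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"

def build_ocr_request(image_bytes: bytes) -> dict:
    """Vision API request body for TEXT_DETECTION on one image."""
    return {
        "requests": [{
            "image": {"content": base64.b64encode(image_bytes).decode()},
            "features": [{"type": "TEXT_DETECTION"}],
        }]
    }

def extract_full_text(response: dict) -> str:
    """Pull the concatenated OCR text out of a Vision API response."""
    first = response["responses"][0]
    return first.get("fullTextAnnotation", {}).get("text", "")

# Parsing a truncated, made-up API response:
sample = {"responses": [{"fullTextAnnotation": {"text": "COMFORT TAXI\nTOTAL $32.40"}}]}
print(extract_full_text(sample))
```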