Building my first AI product: taxi invoice metadata extraction

Posted on August 15, 2025

Background

In the current era, AI is a fancy term, talked about everywhere. But when we visit and learn from SMEs in Singapore, it is a different scenario. For example, the ERP systems in SMEs have no function to auto-process taxi invoices. The admin still needs to upload the receipt and manually key in the date, origin, destination, total payment, and so on, from the invoice. As a tech guy, I can hardly imagine how this happens, because the technology is actually ready for the problem.

There is another case from a company we visited in the marine industry. Their staff need to manually check hundreds of reports submitted by their clients to validate whether there are risk, governance, or policy issues.

Landing AI in a business is not simply calling an LLM API and doing prompt engineering to solve the problem. We need to understand what the problem is, what the current processing flow in the company looks like (understanding input and output), and how humans process it (business rules and domain knowledge). The next step is to transform that business description into data, AI/ML models, and frontend and backend pipelines. Sometimes we also need to think about how to integrate the solution into the customer's existing ERP system.

In this post, I will share a small use case: automating taxi invoice processing. Upload taxi invoices => structured data exported to Excel.

Objective

The requirement to automate taxi invoice processing came from my wife, an admin who needs to key in hundreds of taxi invoices into her company's ERP system every month. The objective is to develop an app where the customer uploads a batch of taxi invoice images and exports structured taxi metadata to Excel.

Architecture

Backend

The backend is built with FastAPI. The core functions include:

  • Process a batch of invoice images uploaded from the frontend. To extract structured data, OCR and an LLM agent are combined. The general processing flow is:
    • OCR extracts the text from the invoice
    • Taxi invoice agent: a task-specific agent that takes the OCR text as context and generates structured output
  • Payment processing: Stripe payment links are used, covering payment URL generation and payment status checks via a Stripe webhook.
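The OCR-then-agent flow above can be sketched as follows. This is a minimal illustration, not the production code: the prompt wording and the field names (date, pickup, dropoff, total) are assumptions, and the actual OCR engine and LLM API calls are omitted — only the structured-output parsing step is shown.

```python
import json
import re
from dataclasses import dataclass, asdict

@dataclass
class TaxiInvoice:
    date: str
    pickup: str
    dropoff: str
    total: float

# Hypothetical prompt template; OCR text is injected as context
EXTRACTION_PROMPT = (
    "Extract the trip date, pickup location, dropoff location and total fare "
    "from this taxi receipt text. Reply with JSON only, using the keys "
    '"date", "pickup", "dropoff", "total".\n\nReceipt text:\n{ocr_text}'
)

def parse_agent_reply(reply: str) -> TaxiInvoice:
    """Parse the LLM reply into a validated record.

    LLMs often wrap JSON in prose or code fences, so we pull out the
    first {...} block before parsing.
    """
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in LLM reply")
    data = json.loads(match.group(0))
    return TaxiInvoice(
        date=str(data["date"]),
        pickup=str(data["pickup"]),
        dropoff=str(data["dropoff"]),
        total=float(data["total"]),  # fail early if the fare is not numeric
    )

# Example: a typical (made-up) agent reply wrapped in a code fence
reply = ('```json\n{"date": "2025-08-01", "pickup": "Changi Airport", '
         '"dropoff": "Raffles Place", "total": "32.40"}\n```')
invoice = parse_agent_reply(reply)
print(asdict(invoice))
```

Validating and coercing the fields here, rather than trusting the raw LLM output, keeps malformed replies from reaching the Excel export.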

Frontend

Before this project, Streamlit was my favorite for building proofs of concept (POCs), but it is not suitable for a product. I am not familiar with frontend development, though I know of options like JS, TS, React, and Flutter. As a solo developer, I want to reduce the effort of developing a cross-platform app, so Flutter is the best choice — with a somewhat steep learning curve to pick up Flutter and Dart.

The good news is that I can leverage LLM coders to help me. It is now normal to open Claude, ChatGPT, Qwen, Gemini, and Grok right after starting up the computer every day. I use the free versions and just switch among them.

The core functions in the frontend include:

  • Upload taxi invoice images
  • Trigger payment process
  • Trigger the backend invoice process
  • View, edit, and export the results to Excel
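For the export step, the idea can be sketched with a dependency-free CSV serializer (CSV opens directly in Excel; a real .xlsx export would go through a library such as openpyxl, but the column layout is the same). The field names are assumptions:

```python
import csv
import io

def invoices_to_csv(invoices: list) -> str:
    """Serialize extracted invoice records into CSV text for Excel."""
    columns = ["date", "pickup", "dropoff", "total"]  # assumed field names
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns)
    writer.writeheader()
    for inv in invoices:
        # Missing fields become empty cells rather than raising
        writer.writerow({c: inv.get(c, "") for c in columns})
    return buf.getvalue()

rows = [{"date": "2025-08-01", "pickup": "Changi Airport",
         "dropoff": "Raffles Place", "total": 32.40}]
print(invoices_to_csv(rows))
```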

Payment

I asked AI to recommend a payment system that supports PayNow and cards. It suggested HitPay and Stripe. HitPay requires business registration to use its service. Stripe's policy is a little looser: at least at the development stage, it does not require one.

At the beginning, I tested Stripe payments using real money, which incurred small transaction fees charged by Stripe. Then I found that Stripe has a Test Mode, with mock cards provided for testing payment logic. It is strongly recommended to use Test Mode in Stripe.
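In production, webhook signature checking is normally done with the official stripe library (`stripe.Webhook.construct_event`), but it helps to see what that check does under the hood. Per Stripe's documented scheme, the Stripe-Signature header carries a timestamp and an HMAC-SHA256 of `"{timestamp}.{raw_body}"` keyed by the endpoint's signing secret. A hand-rolled sketch (the secret below is made up):

```python
import hashlib
import hmac

def verify_stripe_signature(payload: bytes, sig_header: str, secret: str) -> bool:
    """Verify a Stripe webhook signature by hand.

    The Stripe-Signature header looks like "t=<timestamp>,v1=<hex digest>",
    where the digest is HMAC-SHA256 over "<timestamp>.<raw body>".
    """
    parts = dict(item.split("=", 1) for item in sig_header.split(","))
    signed_payload = f"{parts['t']}.".encode() + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid timing attacks
    return hmac.compare_digest(expected, parts["v1"])

# Simulate what Stripe would send, using a made-up signing secret
secret = "whsec_testsecret"
body = b'{"type": "checkout.session.completed"}'
ts = "1700000000"
sig = hmac.new(secret.encode(), f"{ts}.".encode() + body, hashlib.sha256).hexdigest()
header = f"t={ts},v1={sig}"
print(verify_stripe_signature(body, header, secret))  # prints True
```

This also makes clear why the webhook handler must read the raw request body: re-serialized JSON would change the bytes and break the signature.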

Deploy

As a solo developer, I do not have a sufficient budget for hosting. Fortunately, a few platforms provide free resources to deploy services. After some surveying, I selected Vercel to deploy the Flutter frontend (serverless) and Railway to deploy the backend API. Both support GitHub CI integration, so you just push updates to your GitHub repos, which triggers updates to your frontend and backend services.

I also added a subdomain to my Vercel frontend service, which looks better than the default Vercel app URL.

Reflection

  • OCR solution: I usually run open-source models on my personal GPU (an RTX 4070). For the first version I tested Docling, PaddleOCR, and multi-modal LLMs extracting structured data directly from the image (no OCR). The multi-modal LLMs performed worst (tested in ChatGPT, Gemini, and Qwen): when processing a taxi image, the extracted text had heavy hallucination, which means a dedicated OCR step is a must. But Docling and PaddleOCR have high computation requirements (e.g. storage, RAM, vCPUs) that the free tiers of Vercel and Railway cannot support. In the end, the Google Vision API is used, which allows 1,000 calls per month (USD 1.50 per 1,000 calls) — enough for testing.

  • Optimize the Docker image: the limited resources forced me to optimize the Dockerfile and remove unnecessary packages.

  • Iteratively refine LLM-generated code:

    • LLM writes code => local tests validate that the logic is as expected and find bugs => fix bugs via the LLM, or add/remove/modify code with a detailed explanation of the feature requirements.
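For the Docker image optimization mentioned above, the usual levers are a slim base image and skipping pip's cache. A sketch of what such a Dockerfile might look like for a FastAPI service on Railway (file names and the `$PORT` convention are assumptions):

```dockerfile
# Slim base image instead of the full python image
FROM python:3.12-slim

WORKDIR /app

# Install only what the API needs; --no-cache-dir keeps pip's cache out of the image
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Railway injects $PORT at runtime
CMD ["sh", "-c", "uvicorn main:app --host 0.0.0.0 --port ${PORT:-8000}"]
```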
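On the Vision API choice above: text detection is a plain REST call against `images:annotate`. A minimal sketch of the request body and response parsing follows (the HTTP call itself, authenticated with an API key, is omitted; the sample response is a truncated, hypothetical one):

```python
import base64

VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"

def build_ocr_request(image_bytes: bytes) -> dict:
    """Vision API request body for TEXT_DETECTION on one image."""
    return {
        "requests": [{
            "image": {"content": base64.b64encode(image_bytes).decode()},
            "features": [{"type": "TEXT_DETECTION"}],
        }]
    }

def extract_full_text(response: dict) -> str:
    """Pull the concatenated OCR text out of a Vision API response."""
    first = response["responses"][0]
    return first.get("fullTextAnnotation", {}).get("text", "")

# Parsing a truncated, made-up API response:
sample = {"responses": [{"fullTextAnnotation": {"text": "COMFORT TAXI\nTOTAL $32.40"}}]}
print(extract_full_text(sample))
```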