Monthly Archives: March 2024

Singapore AI government resources

This post is credited to a LinkedIn post (https://www.linkedin.com/feed/update/urn:li:activity:7178299726792978432/), shared here for the record.

1. AI Verify Framework
🔗 https://lnkd.in/dzKG8ycs

2. National AI Strategy
🔗 https://lnkd.in/df7Gcabc

3. Model Governance Framework for Generative AI
🔗 https://lnkd.in/dmWwEKW5

4. Generative AI Sandbox
🔗 https://lnkd.in/djxFTAp9

5. Advisory Guidelines on Use of Personal Data in AI Recommendation and Decision Systems
🔗 https://lnkd.in/duU-S-iN

6. Principles to Promote Fairness, Ethics, Accountability and Transparency (FEAT) in the Use of AI and Data Analytics in Singapore’s Financial Services Sector
🔗 https://lnkd.in/dQj_4DG8

7. Veritas Toolkit 2.0
🔗 https://lnkd.in/dRPK-8ss

8. AI in Healthcare Guidelines
🔗 https://lnkd.in/dsa84NpG

How to draw an animated graph using Draw.io

I often read posts shared on LinkedIn that contain amazing animated graphs, so I wondered: how can I draw them too? Previously, I shared a post, Draw.io, a beautiful tool to draw graph and flow. After some googling, I found that Draw.io has a function that lets us set a property on an arrow to enable animation. Before introducing how to do it, I will first share how to set up Draw.io, focusing on Ubuntu (Windows should also be easy).

4 ways to use Draw.io

  • Draw.io web version
    • It is the simplest option. There is nothing to set up; just visit the URL.
  • Draw.io App
    • If you prefer to install a standalone app, you can visit the URL to download the Windows/Mac/Linux version and install it.
  • Draw.io plugin in Visual Studio Code
    • If you frequently use VS Code for programming and don’t want to switch between VS Code and a separate Draw.io tool, you can install the plugin Draw.io Integration v1.6.6. Just search for “draw.io”; there are many plugins, and I chose the Draw.io extension developed by Henning Dieterich.
    • Issue:
      • The functions are similar to the two options above, but after completing the animation setting, the exported SVG is still a static image, without animation.
      • Thus, for animation, the web version or the App is recommended.
  • Jupyterlab-drawio
    • There is a plugin for JupyterLab, but I found that not all functions are available. For example, I could not find how to load and edit a drawio file, so I do not suggest it.

How to enable animation on an edge?

There are a few differences between the Draw.io App and the web version, due to their different development versions.

In the Draw.io App

  1. Click the edge you want to animate.
  2. Then, in the right-hand menu, find Flow animation under Style. Select it, and you will see the edge flowing.
  3. Note: saving the drawio file with an SVG extension is suggested.

In the Draw.io web version

  1. Click the edge you want to animate.
  2. Then, in the right-hand menu, click and expand Properties under Style. Scroll down to find Flow animation and select it. You will see the edge flowing.
  3. Note: saving the drawio file with an SVG extension is suggested.

Example of an animated graph:


How to read an annual report

The following content is from a LinkedIn post by Pieter Slegers. It is interesting to share.

The 6-step framework from Aswath Damodaran

You can steal it here:

Have clear goals

Always know why you are reading an annual report. What’s your purpose?

Here are 5 essential things:

  • Cash Flows
  • Investments in Future Growth
  • Operational Efficiency
  • Quality of Earnings
  • Risk

1️⃣ Cash Flows

  • How much revenue is translated into cash flow?
  • How much capital does the company need to generate these cash flows?

Free Cash Flow per share growth is one of the main drivers for stock prices.
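
As a back-of-the-envelope illustration (all numbers below are made up), free cash flow per share growth can be computed like this:

# Hypothetical figures for two consecutive years, in millions
ocf = [5_200, 6_100]     # operating cash flow
capex = [1_400, 1_500]   # capital expenditures
shares = [980, 960]      # diluted shares outstanding

# FCF = operating cash flow - CAPEX; divide by shares, then compare year over year
fcf_per_share = [(o - c) / s for o, c, s in zip(ocf, capex, shares)]
growth = fcf_per_share[1] / fcf_per_share[0] - 1
print(f"FCF/share: {fcf_per_share[0]:.2f} -> {fcf_per_share[1]:.2f} ({growth:.1%} growth)")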

2️⃣ Investments in Future Growth

  • How much growth is generated by increased productivity?
  • How much growth CAPEX does the company use?

3️⃣ Operational Efficiency

  • How efficiently does the company allocate capital?
  • Does management put a lot of emphasis on operational efficiency?

The better the capital allocation skills of management, the better for you as an investor.

4️⃣ Quality of Earnings

Not all earnings are created equal

Look for:

  • Amount reinvested
  • Return on invested capital (ROIC)

5️⃣ Risk

There are two primary concerns regarding risk:

  • Operational risk involves issues related to the core business activities of the company.
  • Financing risk pertains to challenges associated with the company’s funding methods.

Now that you know what you’re looking for and why, you can start using Damodaran’s 6-step approach.

Step 1:

Confirm the timing and currency

  • What period is covered?
  • What currency are they reporting in?

Step 2:

Map the business mix:

  • In which segments does the company operate?
  • What does the geographic breakdown look like?

Step 3:

Find the base inputs for valuation

From the Balance Sheet:

  • How much debt does the company have?
  • Does the company have more current assets than current liabilities?
  • Does the company have a lot of goodwill on its balance sheet?

From the Income Statement:

  • Are revenues steadily increasing over time?
  • Does the company need a lot of COGS to sell its products?
  • How much revenue is translated into net income?

From the Cash Flow Statement

  • Are most earnings translated into operating cash flow?
  • Does the company have a positive free cash flow (operating cash flow – CAPEX)?
  • Did the company manage to increase its cash position compared to last year?

Step 4:

Keep digging

In the footnotes, look for:

  • Does the company use a lot of SBCs (stock-based compensation)?
  • When does the company’s debt mature?

Step 5:

Confirm The Units

  • How many shares outstanding does the company have?
  • Does the company have preferred shares?
  • Are acquisitions paid with stocks?

Step 6:

Corporate Governance

  • Do insiders get special privileges?
  • Does management have a lot of skin in the game?

Fine-tuning the Whisper model for speech recognition

Although OpenAI open-sourced the multilingual Whisper model (https://github.com/openai/whisper), which achieved state-of-the-art results on benchmark datasets, there are many scenarios where the pretrained models do not work well, for example languages not covered by the pretrained model. Whisper-V3 supports 100 languages, so the model must be re-trained in order to support a new language. For minority languages, even when they are covered by the pretrained model, the accuracy is often worse, and more data needs to be collected to adapt the pretrained model and reduce the word error rate. In this post, the Hugging Face implementation of Whisper is used to fine-tune on Chinese (this is just to test that the training code works, rather than to train a production model; once computing and data resources are ready, a SOTA model can be trained). You can refer to Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers. The following figure shows the logic flow for model training.

  • Get annotated training data, i.e. a set of speech–text pairs. In the demo code, I get the data from Common Voice (Chinese).
  • Download the pretrained Whisper model (whisper-small here).
  • Transform the speech waveform into log-Mel spectrogram features, which are then fed into the transformer encoder (see the short sketch after this list).
    • For Whisper, speech must have a 16 kHz sample rate; if not, it needs re-sampling.
    • The number of Mel bands is 80, or 128 for Whisper large.
  • The transformer encoder-decoder is trained to learn text–audio cross-attention and audio self-attention. The predicted next token, P(next-token | text, audio) (the probability is computed in the decoder), together with the ground-truth text, is used to calculate the cross-entropy loss; gradients are then computed and the model parameters are updated.
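
As a quick illustration of the feature-extraction step, the minimal sketch below runs the Whisper feature extractor on five seconds of random 16 kHz audio, just to show the shape of the log-Mel features (80 Mel bins for whisper-small):

import numpy as np
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

# 5 seconds of fake 16 kHz audio; real audio must be (re-)sampled to 16 kHz first
audio = np.random.randn(16000 * 5).astype(np.float32)
features = feature_extractor(audio, sampling_rate=16000, return_tensors="pt").input_features
print(features.shape)  # torch.Size([1, 80, 3000]): 80 Mel bins, padded to 30 s of frames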

The training code and the word-error-rate (WER) evaluation code are as follows:

"""
Test codes for fine-tuning Whisper speech-to-text, i.e. speech recognition
"""

from datasets import load_dataset, DatasetDict, Audio
from transformers import (
    WhisperFeatureExtractor,
    WhisperTokenizer,
    WhisperProcessor,
    WhisperForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)
import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union
import evaluate
from evaluate import load


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's appended later anyways
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]
    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

#Step-1: Define model structure & initialization, feature extractor, text tokenizer
model_base_default = "openai/whisper-small"
language = "zh"
save_dir = "whisper-small-zh-me"
max_steps = 500

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_base_default, language=language)
tokenizer = WhisperTokenizer.from_pretrained(model_base_default, task="transcribe", language=language)
processor = WhisperProcessor.from_pretrained(model_base_default, task="transcribe", language=language)

#Step-2: Prepare train/dev/test data
common_voice1 = DatasetDict()
common_voice1["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "zh-CN", split="train", use_auth_token=False).select(range(1000))
common_voice1["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "zh-CN", split="test", use_auth_token=False).select(range(50))
common_voice = common_voice1.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=1)


data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
metric = evaluate.load("wer")

#Step-3: Load the pretrained model and configure generation
model = WhisperForConditionalGeneration.from_pretrained(model_base_default)
model.generation_config.language = language

model.config.forced_decoder_ids = None
model.config.suppress_tokens = []


#Step-4: Define training arguments, build the trainer, and train
training_args = Seq2SeqTrainingArguments(
    output_dir=save_dir,  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=100,  # reduced from 500
    max_steps=max_steps,  # reduced from 4000
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=500,
    eval_steps=500,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,  # set True to push to the Hugging Face Hub
)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)
trainer.train()

# Evaluate the fine-tuned model: compute WER on the Common Voice test split
model_pth = f"{save_dir}/checkpoint-{max_steps}"  # path to the fine-tuned checkpoint
model_token_pth = "openai/whisper-small"  # the processor is loaded from the base model
is_local = True
processor = WhisperProcessor.from_pretrained(model_token_pth)
model = WhisperForConditionalGeneration.from_pretrained(model_pth, local_files_only=is_local).to("cuda")

def map_to_pred(batch):
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    batch["reference"] = processor.tokenizer._normalize(batch["sentence"])

    with torch.no_grad():
        predicted_ids = model.generate(input_features.to("cuda"))[0]
    transcription = processor.decode(predicted_ids)
    batch["prediction"] = processor.tokenizer._normalize(transcription)
    return batch

common_voice2 = common_voice1["test"].cast_column("audio", Audio(sampling_rate=16000))
result = common_voice2.map(map_to_pred)

wer = load("wer")
print(100 * wer.compute(references=result["reference"], predictions=result["prediction"]))

The above code was tested on an RTX 4070.
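
Once training has finished, the checkpoint can also be used for quick transcription through the transformers pipeline API. A minimal sketch, reusing model_pth and processor from the script above (the audio file name and GPU device index are placeholders):

from transformers import pipeline

# Build an ASR pipeline from the fine-tuned checkpoint; the tokenizer and feature extractor
# are passed explicitly since the checkpoint directory may not contain them.
asr = pipeline(
    "automatic-speech-recognition",
    model=model_pth,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=0,  # GPU index; use -1 for CPU
)
# Transcribe a local audio file ("sample.wav" is a placeholder; decoding it requires ffmpeg)
print(asr("sample.wav")["text"])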

Document summarization and question answering on documents: LLM + LangChain

It is known that chat based on large language models (LLMs) can answer many queries with satisfactory accuracy. But when asked about facts, in particular facts reported recently that are not covered by the model's training data, the answer is wrong with high probability. One method is to exploit retrieval-augmented generation (RAG) and provide the latest documents to the LLM as context. Combining the context documents with the query gives satisfactory results. The following figure shows the overall flow of RAG-based question answering.

  • PDF to text:
    • PyPDF is used to extract text from the PDF. I use it rather than the PDF reader in LangChain because I want more control over text post-processing.
  • Web document:
    • WebBaseLoader in LangChain is used to download and extract text from the URL (in the future, this may change to my own crawler if many webpages need to be downloaded).
  • Document chunking and indexing (see the sketch after the prompt template below):
    • Document chunking is done by first extracting paragraphs and then grouping paragraphs until the specified chunk size is reached.
    • Index engine: LangChain provides many index engines, from commercial to open source. At first, vector databases such as FAISS and Chroma were tried, with document embeddings extracted using an LLM (Google Gemma here). Unfortunately, after a few tries, the recall accuracy was very bad. I then switched to traditional indexing such as BM25 or TF-IDF. At least when the query words exist in the document, the accuracy looks good (just OK; semantic vector matching will be investigated in the future to address the semantic gap between the query question and the indexed documents).
  • Prompt template for the question-answering task:
from langchain.prompts import PromptTemplate

qa_prompt_template_cfg = """Answer the question as precisely as possible using the provided context. If the answer is
not contained in the context, say "answer not available in context" \n\n
Context: \n {context}?\n
Question: \n {question} \n
Answer:
"""
qa_prompt_template = PromptTemplate(
    template=qa_prompt_template_cfg,
    input_variables=["context", "question"],
)
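
To make the retrieval step concrete, here is a minimal sketch of the chunk-and-index flow described above, assuming the pypdf package, LangChain's RecursiveCharacterTextSplitter as a stand-in for the paragraph-grouping logic, and the BM25Retriever from langchain_community (the file name and query are placeholders):

from pypdf import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.retrievers import BM25Retriever  # requires the rank_bm25 package

# 1. PDF to text ("report.pdf" is a placeholder file name)
reader = PdfReader("report.pdf")
full_text = "\n".join(page.extract_text() or "" for page in reader.pages)

# 2. Chunk the document (generic recursive splitter used here instead of custom paragraph grouping)
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(full_text)

# 3. Build a keyword (BM25) index over the chunks and retrieve the top-k chunks for a query
retriever = BM25Retriever.from_texts(chunks)
retriever.k = 4
question = "What was the operating cash flow last year?"  # placeholder query
context = "\n\n".join(doc.page_content for doc in retriever.get_relevant_documents(question))

# 4. Fill the prompt template defined above; the resulting prompt is then sent to the LLM
prompt = qa_prompt_template.format(context=context, question=question)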

Currently, processing a large PDF, e.g. a 30-page financial report, is very slow on my 4070, and the document length is much longer than the context token limit (8192). Testing on the financial report looks good, as long as the query words do not suffer from the semantic gap. The quality of the RAG-retrieved documents strongly affects answer quality; for RAG-based chat, building a high-recall retrieval system is critical.
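
For the summarization part, one way to work around the context limit is LangChain's map-reduce summarization chain: each chunk is summarized separately and the partial summaries are then combined. A rough sketch, assuming the `chunks` list from the retrieval sketch above and a LangChain-wrapped local LLM (the Gemma pipeline wrapper and model name below are assumptions, not the exact setup used here):

from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline
from langchain.docstore.document import Document
from langchain.chains.summarize import load_summarize_chain

# Wrap a local text-generation model as a LangChain LLM (model name is illustrative;
# Gemma may require accepting the license on the Hugging Face Hub)
hf_pipe = pipeline("text-generation", model="google/gemma-2b-it", max_new_tokens=512, device=0)
llm = HuggingFacePipeline(pipeline=hf_pipe)

# Summarize each chunk, then merge the partial summaries (map-reduce)
docs = [Document(page_content=c) for c in chunks]
chain = load_summarize_chain(llm, chain_type="map_reduce")
summary = chain.run(docs)
print(summary)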