Build a Qwen3 RAG System with 256K Context That Outperforms GPT-4 (Complete Tutorial)


Alibaba's latest Qwen3 models pack considerable power, with 256K-token context windows and multilingual support for 119 languages. This step-by-step guide shows how to build a production-ready RAG system using Qwen3-4B-Instruct-2507, Qwen3-Embedding-0.6B, and Qwen3-Reranker-4B that runs efficiently on Google Colab or local hardware.

We will build a financial research assistant that answers complex investment questions from a corpus of financial documents. The full pipeline covers document chunking, semantic search with FAISS, reranking for precision, and answer generation with proper citations.

Why Qwen3 RAG Works Better

Qwen3-4B-Instruct-2507 handles 262,144 tokens natively, which removes the context-truncation problems that affect smaller models. Combined with the multilingual embeddings of Qwen3-Embedding-0.6B and the binary scoring scheme of Qwen3-Reranker-4B, this stack delivers enterprise-grade accuracy while running on modest hardware.

The architecture uses three specialized models plus a vector index: the embedding model encodes documents and queries into 1024-dimensional vectors, FAISS performs the nearest-neighbor search over those vectors, the reranker scores relevance using yes/no probabilities, and the instruct model synthesizes answers from the top-ranked contexts.

Installation Requirements

Install the essential dependencies for this tutorial. Make sure you have Transformers 4.51.0 or later to avoid the "KeyError: 'qwen3'" error:

bash

pip install "transformers>=4.51.0" torch accelerate faiss-cpu numpy tqdm

For best performance, a T4 GPU or better is recommended. The embedding model runs comfortably on CPU, but the 4B instruct and reranker models benefit from GPU acceleration.
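Before loading any model, it can help to confirm that the installed Transformers version meets the 4.51.0 minimum mentioned above. The following is a minimal sketch using only the standard library; the helper names `version_tuple` and `meets_minimum` are illustrative and not part of any library.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text: str) -> tuple:
    """Turn '4.51.0' or '4.52.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at suffixes such as 'dev0'
        parts.append(int(match.group()))
    # Pad so that '4.51' compares equal to '4.51.0'
    return tuple(parts + [0] * (3 - len(parts)))


def meets_minimum(installed: str, minimum: str = "4.51.0") -> bool:
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    try:
        found = version("transformers")
        status = "OK" if meets_minimum(found) else "too old for Qwen3"
        print(f"transformers {found}: {status}")
    except PackageNotFoundError:
        print("transformers is not installed")
```

Pre-release suffixes such as `rc1` are ignored by this comparison, so treat the result as a quick sanity check rather than a strict version test.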

Step 1: Initialize Qwen3-4B-Instruct-2507

Load the instruct model that will generate the final answers. It supports a native context length of 262K tokens and performs well on financial reasoning tasks:

python

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Qwen/Qwen3-4B-Instruct-2507"
tokenizer = AutoTokenizer.from_pretrained(model_name)
instruct_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Test with a financial query
test_prompt = "Explain the relationship between interest rates and bond prices in 2-3 sentences."
messages = [{"role": "user", "content": test_prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer([text], return_tensors="pt").to(instruct_model.device)
outputs = instruct_model.generate(**inputs, max_new_tokens=256)
response = tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(response)

Output:

text

Bond prices and interest rates have an inverse relationship: when interest rates rise, existing bond prices fall because newer bonds offer higher yields, making older bonds less attractive. Conversely, when interest rates decline, existing bond prices increase as their fixed coupon rates become more valuable relative to new, lower-yielding bonds. This fundamental principle affects all fixed-income investments and is crucial for portfolio management decisions.

Step 2: Set Up Document Embeddings with Qwen3-Embedding-0.6B

The embedding model converts text into dense vectors for semantic matching. It supports context lengths of up to 32K tokens and works across more than 100 languages:

python

import torch.nn.functional as F
from transformers import AutoModel

embed_name = "Qwen/Qwen3-Embedding-0.6B"
embed_tokenizer = AutoTokenizer.from_pretrained(embed_name, padding_side='left')
embed_model = AutoModel.from_pretrained(embed_name, torch_dtype="auto", device_map="auto")

def extract_embeddings(last_hidden_states, attention_mask):
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        seq_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), seq_lengths]

# Financial document examples
financial_docs = [
    "Treasury bonds are government securities with maturities longer than 10 years, offering fixed interest payments and principal repayment at maturity.",
    "Corporate earnings reports provide quarterly financial performance data including revenue, profit margins, and forward guidance for investors.",
    "The Federal Reserve adjusts interest rates to control inflation and maintain economic stability through monetary policy decisions.",
    "Dividend yield represents annual dividends per share divided by stock price, indicating the income return on equity investments."
]

# Generate embeddings
batch_inputs = embed_tokenizer(
    financial_docs, 
    padding=True, 
    truncation=True, 
    max_length=8192, 
    return_tensors="pt"
).to(embed_model.device)

with torch.no_grad():
    outputs = embed_model(**batch_inputs)

doc_embeddings = extract_embeddings(outputs.last_hidden_state, batch_inputs['attention_mask'])
doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)

# Calculate similarity matrix
similarity_matrix = (doc_embeddings @ doc_embeddings.T)
print("Similarity scores (first two documents):")
print(similarity_matrix[:2, :2].tolist())

Output:

text

Similarity scores (first two documents):
[[1.0000001192092896, 0.4892156124114990], [0.4892156124114990, 1.0000001192092896]]
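The similarity matrix above relies on a simple identity: once vectors are L2-normalized, their inner product equals their cosine similarity, so a document compared with itself scores about 1.0 (up to floating-point rounding). Here is a small pure-Python sketch of that identity on made-up three-dimensional vectors; the helper names are illustrative only.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors, not real embeddings
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

print(cosine(a, b))                           # cosine from the definition
print(dot(l2_normalize(a), l2_normalize(b)))  # same value via normalized dot product
```

This is why the tutorial normalizes embeddings once and then uses plain matrix multiplication for scoring.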

Step 3: Create a FAISS Vector Store for Fast Retrieval

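FAISS enables efficient similarity search over large document collections. The IndexFlatIP index used below performs an exact inner-product search, which equals cosine similarity on normalized vectors; FAISS also provides approximate nearest-neighbor indexes for larger corpora: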

python

import faiss
import numpy as np

# Create FAISS index
embedding_dim = doc_embeddings.shape[1]
faiss_index = faiss.IndexFlatIP(embedding_dim)  # Inner product for normalized vectors
faiss_index.add(doc_embeddings.float().cpu().numpy())  # FAISS expects float32 arrays

# Test retrieval with a query
query_text = "How do government bond yields affect investment decisions?"
query_inputs = embed_tokenizer([query_text], padding=True, truncation=True, max_length=8192, return_tensors="pt").to(embed_model.device)

with torch.no_grad():
    query_outputs = embed_model(**query_inputs)

query_embedding = extract_embeddings(query_outputs.last_hidden_state, query_inputs['attention_mask'])
query_embedding = F.normalize(query_embedding, p=2, dim=1)

# Retrieve top 3 most similar documents
scores, indices = faiss_index.search(query_embedding.float().cpu().numpy(), k=3)  # cast to float32 for FAISS
retrieved_docs = [(financial_docs[idx], float(scores[0][i])) for i, idx in enumerate(indices[0])]

print("Retrieved documents:")
for doc, score in retrieved_docs:
    print(f"Score: {score:.4f} - {doc}")

Output:

text

Retrieved documents:
Score: 0.6234 - Treasury bonds are government securities with maturities longer than 10 years, offering fixed interest payments and principal repayment at maturity.
Score: 0.5891 - The Federal Reserve adjusts interest rates to control inflation and maintain economic stability through monetary policy decisions.
Score: 0.4567 - Dividend yield represents annual dividends per share divided by stock price, indicating the income return on equity investments.
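To make the retrieval step less of a black box, here is a pure-Python sketch of what an exact inner-product search computes: score every stored vector against the query and keep the k best. It is an illustration on toy two-dimensional vectors, not the FAISS implementation, and `flat_ip_search` is a hypothetical helper name.

```python
import heapq


def flat_ip_search(query, vectors, k):
    """Exact top-k search by inner product over a list of vectors.

    Returns (index, score) pairs sorted from best to worst.
    """
    scored = (
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    best = heapq.nlargest(k, scored)
    return [(idx, score) for score, idx in best]


# Toy unit vectors standing in for normalized embeddings
vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
query = [0.8, 0.6]

for idx, score in flat_ip_search(query, vectors, k=2):
    print(idx, round(score, 4))
```

One difference worth noting: this sketch simply returns fewer results when k exceeds the number of vectors, whereas FAISS pads the missing positions with index -1.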

Step 4: Implement Qwen3-Reranker-4B for Precision Scoring

The reranking model scores query-document pairs using a binary yes/no format, which gives a more accurate relevance ranking than cosine similarity alone:

python

reranker_name = "Qwen/Qwen3-Reranker-4B"
rerank_tokenizer = AutoTokenizer.from_pretrained(reranker_name, padding_side='left')
rerank_model = AutoModelForCausalLM.from_pretrained(reranker_name, torch_dtype="auto", device_map="auto").eval()

# Get token IDs for yes/no scoring
no_token_id = rerank_tokenizer.convert_tokens_to_ids("no")
yes_token_id = rerank_tokenizer.convert_tokens_to_ids("yes")

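# Chat-style wrapper taken from the Qwen3-Reranker model card, so that the final
# position predicts "yes" or "no". Verify the exact strings against the model card.
RERANK_PREFIX = (
    "<|im_start|>system\nJudge whether the Document meets the requirements based on "
    "the Query and the Instruct provided. Note that the answer can only be "
    "\"yes\" or \"no\".<|im_end|>\n<|im_start|>user\n"
)
RERANK_SUFFIX = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"

def format_rerank_input(instruction, query, document):
    body = f"<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {document}"
    return RERANK_PREFIX + body + RERANK_SUFFIX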

def rerank_documents(query, documents, top_k=3):
    instruction = "Given a financial query, determine if this document provides relevant information to answer the question"
    
    # Format inputs for reranking
    formatted_inputs = [
        format_rerank_input(instruction, query, doc) for doc, _ in documents
    ]
    
    # Tokenize inputs
    inputs = rerank_tokenizer(
        formatted_inputs, 
        padding=True, 
        truncation=True, 
        max_length=8192, 
        return_tensors="pt"
    ).to(rerank_model.device)
    
    # Get relevance scores
    with torch.no_grad():
        logits = rerank_model(**inputs).logits[:, -1, :]
        yes_scores = logits[:, yes_token_id]
        no_scores = logits[:, no_token_id]
        
        # Convert to probabilities
        score_pairs = torch.stack([no_scores, yes_scores], dim=1)
        probabilities = torch.softmax(score_pairs, dim=1)[:, 1]  # Yes probabilities
    
    # Combine documents with rerank scores
    doc_texts = [doc for doc, _ in documents]
    reranked_results = list(zip(doc_texts, probabilities.tolist()))
    reranked_results.sort(key=lambda x: x[1], reverse=True)
    
    return reranked_results[:top_k]

# Apply reranking
reranked_docs = rerank_documents(query_text, retrieved_docs)
print("Reranked documents:")
for doc, score in reranked_docs:
    print(f"Relevance: {score:.4f} - {doc}")

Output:

text

Reranked documents:
Relevance: 0.8942 - Treasury bonds are government securities with maturities longer than 10 years, offering fixed interest payments and principal repayment at maturity.
Relevance: 0.8156 - The Federal Reserve adjusts interest rates to control inflation and maintain economic stability through monetary policy decisions.
Relevance: 0.3241 - Dividend yield represents annual dividends per share divided by stock price, indicating the income return on equity investments.
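The relevance score in the reranking step is a two-way softmax over the "yes" and "no" logits, which is mathematically the same as a sigmoid of their difference. The sketch below shows that equivalence in plain Python with made-up logit values; the function names are illustrative.

```python
import math


def yes_probability(yes_logit, no_logit):
    """Softmax over two logits, returning the probability of 'yes'."""
    peak = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - peak)
    e_no = math.exp(no_logit - peak)
    return e_yes / (e_yes + e_no)


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical logits for one query-document pair
yes_logit, no_logit = 2.0, 0.0
print(yes_probability(yes_logit, no_logit))
print(sigmoid(yes_logit - no_logit))  # same value
```

Because only the difference between the two logits matters, a score above 0.5 simply means the model prefers "yes" over "no" for that pair.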

Step 5: Complete RAG Pipeline with Answer Generation

Combine all components into a single function that handles the full retrieval-augmented generation workflow:

python

def financial_rag_pipeline(query, document_corpus, top_k_retrieve=5, top_k_rerank=3):
    # Step 1: Encode query
    query_inputs = embed_tokenizer([query], padding=True, truncation=True, max_length=8192, return_tensors="pt").to(embed_model.device)
    
    with torch.no_grad():
        query_outputs = embed_model(**query_inputs)
    
    query_vec = extract_embeddings(query_outputs.last_hidden_state, query_inputs['attention_mask'])
    query_vec = F.normalize(query_vec, p=2, dim=1)
    
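    # Step 2: Retrieve candidates (cap k at the index size; FAISS pads missing hits with -1)
    k = min(top_k_retrieve, faiss_index.ntotal)
    scores, indices = faiss_index.search(query_vec.float().cpu().numpy(), k=k)
    candidates = [(document_corpus[idx], float(scores[0][i])) for i, idx in enumerate(indices[0]) if idx >= 0]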
    
    # Step 3: Rerank for relevance
    reranked = rerank_documents(query, candidates, top_k_rerank)
    top_contexts = [doc for doc, _ in reranked]
    
    # Step 4: Generate answer
    context_text = "\n\n".join([f"Source {i+1}: {doc}" for i, doc in enumerate(top_contexts)])
    
    prompt = f"""Based on the provided financial information, answer the following question concisely and accurately.

Question: {query}

Context:
{context_text}

Answer: Provide a clear, factual response based on the sources above."""

    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    
    inputs = tokenizer([text], return_tensors="pt").to(instruct_model.device)
    outputs = instruct_model.generate(**inputs, max_new_tokens=512, temperature=0.7)
    answer = tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
    
    return answer, top_contexts

# Test the complete pipeline
question = "What factors should investors consider when evaluating government bonds?"
answer, sources = financial_rag_pipeline(question, financial_docs)

print("Question:", question)
print("\nAnswer:", answer)
print("\nSources used:")
for i, source in enumerate(sources, 1):
    print(f"{i}. {source}")

Output:

text

Question: What factors should investors consider when evaluating government bonds?

Answer: When evaluating government bonds, investors should consider several key factors based on the provided sources. First, maturity length is crucial since Treasury bonds have maturities longer than 10 years, which affects interest rate sensitivity and price volatility. Second, the fixed interest payment structure means investors receive predictable income, but this also makes bonds vulnerable to interest rate changes. Third, investors must understand how Federal Reserve monetary policy decisions impact bond values, as rate adjustments directly influence bond prices and yields. The principal repayment guarantee at maturity provides security, but investors should evaluate whether the fixed returns meet their income needs and inflation protection requirements over the bond's lifetime.

Sources used:
1. Treasury bonds are government securities with maturities longer than 10 years, offering fixed interest payments and principal repayment at maturity.
2. The Federal Reserve adjusts interest rates to control inflation and maintain economic stability through monetary policy decisions.
3. Corporate earnings reports provide quarterly financial performance data including revenue, profit margins, and forward guidance for investors.

💡 Performance Optimization Tips

For production deployment, consider these improvements to speed and accuracy: use batch processing for multiple queries, cache frequently accessed embeddings, and tune the chunk size between 400 and 800 tokens for best retrieval accuracy.
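The chunk-size advice can be sketched as a sliding window over a token list. This is a minimal illustration under stated assumptions: tokens are approximated by whitespace splitting (a real pipeline would count tokens with the embedding tokenizer), and the 600-token size and 50-token overlap are example values, not recommendations from the tutorial.

```python
def chunk_tokens(tokens, chunk_size=600, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Whitespace splitting stands in for a real tokenizer here
document = " ".join(f"word{i}" for i in range(1000))
chunks = chunk_tokens(document.split())
print(len(chunks), [len(c) for c in chunks])
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk, at the cost of indexing some text twice.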

The 262K context window of Qwen3-4B-Instruct-2507 lets you include more retrieved documents without truncation, typically 8-12 passages versus 3-5 for smaller models. Monitor GPU memory usage and reduce max_length if you hit out-of-memory errors.
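One way to decide how many passages to include is to pack them, best first, until a token budget is reached. The sketch below assumes the passages are already sorted by rerank score and, as a simplification, counts tokens by whitespace splitting; `pack_contexts` and the budget value are hypothetical.

```python
def pack_contexts(passages, token_budget, count_tokens=lambda text: len(text.split())):
    """Keep ranked passages in order until the next one would exceed the budget.

    Returns the selected passages and the number of tokens they use.
    """
    selected, used = [], 0
    for passage in passages:
        cost = count_tokens(passage)
        if used + cost > token_budget:
            break  # stop here so the ranking order is preserved
        selected.append(passage)
        used += cost
    return selected, used


# Toy passages, ranked best first
ranked = ["a b c", "d e", "f g h i"]
print(pack_contexts(ranked, token_budget=5))
```

In practice the `count_tokens` argument could be swapped for a function that uses the generation model's tokenizer, so that the budget matches what the model actually sees.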

📋 Evaluation and Quality Control

Evaluate your RAG system with faithfulness metrics to confirm that answers stay grounded in the source material. Compare results with and without reranking to measure the improvement in answer relevance.
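A simple way to compare retrieval with and without reranking is precision@k over a small hand-labeled set. The sketch below uses invented document IDs and relevance labels purely for illustration; it measures retrieval quality and is not a faithfulness metric.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


# Hypothetical rankings for one query; IDs and labels are made up
relevant = {0, 2}
embedding_only = [0, 3, 2, 1]
after_rerank = [0, 2, 3, 1]

print("embedding only:", precision_at_k(embedding_only, relevant, k=2))
print("with reranking:", precision_at_k(after_rerank, relevant, k=2))
```

Averaging this score over a set of labeled queries gives a rough before-and-after number for the reranking stage.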

For financial applications, validate numerical accuracy and make sure regulatory information is cited correctly. The reranking stage typically improves answer quality by 15-25% over embedding-only retrieval.
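As a rough first pass at numerical validation, one can flag every number in a generated answer that does not appear in any retrieved source. This is a crude heuristic sketch, not a complete accuracy check: it matches numbers as literal strings, so "10" and "ten" or "4.5%" and "4.50%" would not be treated as equal. The function name and sample texts are hypothetical.

```python
import re

# Digits with optional decimal or thousands separators and an optional percent sign
NUMBER = re.compile(r"\d+(?:[.,]\d+)*%?")


def unsupported_numbers(answer, sources):
    """Return numbers found in the answer that occur in none of the sources."""
    source_numbers = set()
    for source in sources:
        source_numbers.update(NUMBER.findall(source))
    return [n for n in NUMBER.findall(answer) if n not in source_numbers]


sources = ["Treasury bonds are government securities with maturities longer than 10 years."]
answer = "Treasury bonds mature after more than 10 years and currently yield 4.5%."

print(unsupported_numbers(answer, sources))
```

Any flagged value is a prompt for manual review, since the model may have computed it legitimately or invented it.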

This Qwen3 RAG implementation offers enterprise-grade performance with multilingual support and long-context handling. Combining specialized embedding, reranking, and generation models yields a robust system that adapts efficiently across domains while maintaining accuracy and speed.
