
Active learning changes how we train AI models by intelligently selecting the most valuable data for annotation. Combined with powerful LLMs like Google Gemini, it creates efficient annotation pipelines that reduce manual effort while maintaining high data quality.
This guide explores how to use the Adala framework, a powerful but underused tool for autonomous data labeling.
We'll build a medical symptom classifier that leverages Gemini's capabilities through a structured active learning workflow.
Understanding Active Learning for Data Annotation

Active learning tackles a core bottleneck of supervised learning: acquiring large amounts of labeled data. Instead of picking data points for annotation at random, active learning algorithms identify the most informative samples, the ones that contribute most to model improvement.
Why active learning matters: it cuts labeling costs, speeds up model convergence, and focuses human effort on the genuinely ambiguous cases.
The Adala framework brings these advantages into production workflows by providing modular components that streamline the active learning process. Before diving into the implementation, let's look at why Adala pairs especially well with modern LLMs like Google Gemini.
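To make the selection step concrete, here is a minimal, self-contained sketch of one common heuristic, uncertainty sampling, which picks the item the model is least sure about. The helper names and the toy probability function are illustrative, not part of Adala:

```python
# Uncertainty sampling sketch: choose the unlabeled item whose predicted
# probability is closest to 0.5, i.e. where the model is least confident.
def select_most_uncertain(samples, predict_proba):
    """Return the sample with predicted probability closest to 0.5."""
    return min(samples, key=lambda s: abs(predict_proba(s) - 0.5))

# Toy stand-in "model": longer texts get a higher positive probability.
toy_proba = lambda text: min(len(text) / 40, 1.0)

pool = ["ok", "this product mostly worked fine", "mixed feelings overall"]
print(select_most_uncertain(pool, toy_proba))
```

A real pipeline would replace `toy_proba` with the classifier's own confidence scores; the selection logic stays the same.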
What Is Adala? An Introduction to the Framework

Adala (Autonomous Data Labeling Agent) is an open-source framework designed for building specialized agents for data processing. Unlike traditional annotation tools, Adala takes an agent-based approach that combines environments (which supply the data), skills (which define the labeling task), and runtimes (which connect to LLM backends).
Looking at Adala's quickstart example, we can see how it structures a sentiment classification task:
Python
import pandas as pd
from adala.agents import Agent
from adala.environments import StaticEnvironment
from adala.skills import ClassificationSkill
from adala.runtimes import OpenAIChatRuntime
from rich import print

# Train dataset
train_df = pd.DataFrame([
    ["It was the negative first impressions, and then it started working.", "Positive"],
    ["Not loud enough and doesn't turn on like it should.", "Negative"],
    ["I don't know what to say.", "Neutral"],
    ["Manager was rude, but the most important that mic shows very flat frequency response.", "Positive"],
    ["The phone doesn't seem to accept anything except CBR mp3s.", "Negative"],
    ["I tried it before, I bought this device for my son.", "Neutral"],
], columns=["text", "sentiment"])

# Test dataset
test_df = pd.DataFrame([
    "All three broke within two months of use.",
    "The device worked for a long time, can't say anything bad.",
    "Just a random line of text."
], columns=["text"])

agent = Agent(
    # connect to a dataset
    environment=StaticEnvironment(df=train_df),
    # define a skill
    skills=ClassificationSkill(
        name='sentiment',
        instructions="Label text as positive, negative or neutral.",
        labels=["Positive", "Negative", "Neutral"],
        input_template="Text: {text}",
        output_template="Sentiment: {sentiment}"
    ),
    # define runtimes
    runtimes={
        'openai': OpenAIChatRuntime(model='gpt-4o'),
    },
    teacher_runtimes={
        'default': OpenAIChatRuntime(model='gpt-4o'),
    },
    default_runtime='openai',
)

agent.learn(learning_iterations=3, accuracy_threshold=0.95)
predictions = agent.run(test_df)
For our medical symptom classification task, we'll adapt this architecture to integrate Google Gemini while implementing custom active learning strategies.
Setting Up Your Environment
Let's start by installing Adala and the required dependencies:
Python
# Install Adala directly from GitHub
!pip install -q git+https://github.com/HumanSignal/Adala.git
# Verify installation
!pip list | grep adala
# Install additional dependencies
!pip install -q google-generativeai pandas matplotlib numpy
We also need to clone the repository for direct access to its components:
Python
# Clone the repository for access to source files
!git clone https://github.com/HumanSignal/Adala.git
# Ensure the package is in our Python path
import sys
sys.path.append('./Adala')
# Import key components
from Adala.adala.annotators.base import BaseAnnotator
from Adala.adala.strategies.random_strategy import RandomStrategy
from Adala.adala.utils.custom_types import TextSample, LabeledSample
Integrating Google Gemini as a Custom Annotator
Unlike the original implementation's basic wrapper around Google Gemini, we'll build a more robust annotator that follows Adala's design patterns, making the solution more maintainable and extensible.
First, we need to set up the Google Generative AI client:
Python
import google.generativeai as genai
import os
from getpass import getpass

# Set API key from environment or enter manually
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY") or getpass("Enter your Gemini API Key: ")
genai.configure(api_key=GEMINI_API_KEY)
Now we'll create a custom annotator by extending Adala's BaseAnnotator class:
Python
import json
import re
from typing import List, Dict, Any, Optional

class GeminiAnnotator(BaseAnnotator):
    """Custom annotator using Google Gemini for medical symptom classification."""

    def __init__(self,
                 model_name: str = "models/gemini-2.0-flash-lite",
                 categories: List[str] = None,
                 temperature: float = 0.1):
        """Initialize the Gemini annotator.

        Args:
            model_name: The Gemini model to use
            categories: List of valid classification categories
            temperature: Controls randomness in generation (lower = more deterministic)
        """
        self.model = genai.GenerativeModel(
            model_name=model_name,
            generation_config={"temperature": temperature}
        )
        self.categories = categories or ["Cardiovascular", "Respiratory",
                                         "Gastrointestinal", "Neurological"]

    def _build_prompt(self, text: str) -> str:
        """Create a structured prompt for the model.

        Args:
            text: The symptom text to classify

        Returns:
            A formatted prompt string
        """
        return f"""Classify this medical symptom into one of these categories:
{', '.join(self.categories)}.
Return JSON format: {{"category": "selected_category",
"confidence": 0.XX, "explanation": "brief_reason"}}
SYMPTOM: {text}"""

    def _parse_response(self, response: str) -> Dict[str, Any]:
        """Extract structured data from the model response.

        Args:
            response: Raw text response from Gemini

        Returns:
            Dictionary containing the parsed fields
        """
        try:
            # Extract JSON from the response even if surrounded by text
            json_match = re.search(r'(\{.*\})', response, re.DOTALL)
            result = json.loads(json_match.group(1) if json_match else response)
            return {
                "category": result.get("category", "Unknown"),
                "confidence": result.get("confidence", 0.0),
                "explanation": result.get("explanation", "")
            }
        except Exception as e:
            return {
                "category": "Unknown",
                "confidence": 0.0,
                "explanation": f"Error parsing response: {str(e)}"
            }

    def annotate(self, samples: List[TextSample]) -> List[LabeledSample]:
        """Annotate a batch of text samples.

        Args:
            samples: List of TextSample objects

        Returns:
            List of LabeledSample objects with annotations
        """
        results = []
        for sample in samples:
            prompt = self._build_prompt(sample.text)
            try:
                response = self.model.generate_content(prompt).text
                parsed = self._parse_response(response)
                # Create a labeled sample with metadata
                labeled_sample = LabeledSample(
                    text=sample.text,
                    labels=parsed["category"],
                    metadata={
                        "confidence": parsed["confidence"],
                        "explanation": parsed["explanation"]
                    }
                )
            except Exception as e:
                # Graceful error handling
                labeled_sample = LabeledSample(
                    text=sample.text,
                    labels="Unknown",
                    metadata={"error": str(e)}
                )
            # Store a reference to the original sample
            labeled_sample._sample = sample
            results.append(labeled_sample)
        return results
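The response-parsing step is easy to verify in isolation. The snippet below reproduces the same regex-plus-`json.loads` approach outside the class (as a standalone function, for illustration) so you can sanity-check it against typical model outputs:

```python
import json
import re

def extract_json(response: str) -> dict:
    """Pull the first {...} block out of a raw LLM response and parse it."""
    match = re.search(r'(\{.*\})', response, re.DOTALL)
    return json.loads(match.group(1) if match else response)

# Works on a bare JSON reply...
print(extract_json('{"category": "Respiratory", "confidence": 0.91}'))
# ...and when the model wraps the JSON in extra prose.
print(extract_json('Sure! {"category": "Neurological", "confidence": 0.72} Hope that helps.'))
```

This tolerance for surrounding prose matters in practice, since instruction-tuned models often add a sentence or two around the requested JSON.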
This implementation improves significantly on the original:
- It inherits properly from Adala's BaseAnnotator
- It implements private helper methods for prompt construction and response parsing
- It uses structured error handling and type hints
- It is fully documented
Building the Symptom Classification Pipeline
Let's create a dataset of medical symptoms for our classification task. Unlike the original implementation, we'll use a more diverse dataset with balanced representation across categories:
Python
# Create a more comprehensive dataset
symptom_data = [
    # Cardiovascular symptoms
    "Chest pain radiating to left arm during exercise",
    "Heart palpitations when lying down",
    "Swollen ankles and shortness of breath",
    "Dizziness when standing up quickly",
    # Respiratory symptoms
    "Persistent dry cough with occasional wheezing",
    "Shortness of breath when climbing stairs",
    "Coughing up yellow or green mucus",
    "Rapid breathing with chest tightness",
    # Gastrointestinal symptoms
    "Stomach cramps and nausea after eating",
    "Burning sensation in upper abdomen",
    "Frequent loose stools with abdominal pain",
    "Yellowing of skin and eyes",
    # Neurological symptoms
    "Severe headache with sensitivity to light",
    "Numbness in fingers of right hand",
    "Memory loss and confusion",
    "Tremors in hands when reaching for objects"
]

# Convert to TextSample objects
text_samples = [TextSample(text=text) for text in symptom_data]
Implementing Advanced Active Learning Strategies
The original implementation used a simple priority-scoring mechanism. We'll enhance it with multiple strategies to demonstrate Adala's flexibility:
Python
import numpy as np
from typing import Callable, Dict, List

class PrioritizationStrategy:
    """Base class for sample prioritization strategies."""

    def score_samples(self, samples: List[TextSample]) -> np.ndarray:
        """Assign priority scores to samples.

        Args:
            samples: List of samples to score

        Returns:
            Array of scores; higher values indicate higher priority
        """
        raise NotImplementedError("Subclasses must implement this method")

    def select(self, samples: List[TextSample], n: int = 1) -> List[TextSample]:
        """Select the top n highest-scoring samples.

        Args:
            samples: List of samples to select from
            n: Number of samples to select

        Returns:
            List of selected samples
        """
        if not samples:
            return []
        scores = self.score_samples(samples)
        indices = np.argsort(-scores)[:n]  # Descending order
        return [samples[i] for i in indices]

class KeywordPriority(PrioritizationStrategy):
    """Prioritize samples based on medical urgency keywords."""

    def __init__(self, keyword_weights: Dict[str, float]):
        """Initialize with keyword weights.

        Args:
            keyword_weights: Dictionary mapping keywords to priority weights
        """
        self.keyword_weights = keyword_weights

    def score_samples(self, samples: List[TextSample]) -> np.ndarray:
        scores = np.zeros(len(samples))
        for i, sample in enumerate(samples):
            # Base score
            scores[i] = 0.1
            # Add weights for each keyword found
            text_lower = sample.text.lower()
            for keyword, weight in self.keyword_weights.items():
                if keyword in text_lower:
                    scores[i] += weight
        return scores

class UncertaintyPriority(PrioritizationStrategy):
    """Prioritize samples based on model uncertainty."""

    def __init__(self, model_fn: Callable[[List[TextSample]], List[float]]):
        """Initialize with an uncertainty model function.

        Args:
            model_fn: Function that returns uncertainty scores for samples
        """
        self.model_fn = model_fn

    def score_samples(self, samples: List[TextSample]) -> np.ndarray:
        # Higher uncertainty = higher priority
        return np.array(self.model_fn(samples))

# Create a keyword-based strategy
keyword_weights = {
    "chest": 0.5,
    "pain": 0.4,
    "breathing": 0.4,
    "dizz": 0.3,
    "head": 0.2,
    "numb": 0.2
}
keyword_strategy = KeywordPriority(keyword_weights)
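To see the keyword scoring in action without the rest of the pipeline, here is a small standalone run. A simple dataclass stands in for Adala's TextSample, and the scoring function mirrors the logic of KeywordPriority.score_samples above:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Sample:  # stand-in for Adala's TextSample
    text: str

def keyword_scores(samples, weights, base=0.1):
    """Score each sample: a base score plus the weight of every keyword it contains."""
    scores = np.full(len(samples), base)
    for i, s in enumerate(samples):
        low = s.text.lower()
        for kw, w in weights.items():
            if kw in low:
                scores[i] += w
    return scores

weights = {"chest": 0.5, "pain": 0.4, "breathing": 0.4}
pool = [Sample("Chest pain during exercise"), Sample("Mild rash on forearm")]
print(keyword_scores(pool, weights))  # chest + pain + base = 1.0; second sample keeps base 0.1
```

Samples that mention urgent-sounding keywords accumulate weight and float to the top of the selection order.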
Now let's implement our enhanced active learning loop:
Python
from matplotlib import pyplot as plt

def run_active_learning_loop(
    samples: List[TextSample],
    annotator: GeminiAnnotator,
    strategy: PrioritizationStrategy,
    iterations: int = 5,
    batch_size: int = 1,
    visualization_interval: int = 1
):
    """Run an active learning loop with visualization.

    Args:
        samples: Pool of unlabeled samples
        annotator: Annotation system
        strategy: Sample selection strategy
        iterations: Number of learning iterations
        batch_size: Samples to annotate per iteration
        visualization_interval: How often to update visualizations

    Returns:
        List of labeled samples
    """
    labeled_samples = []
    remaining_samples = list(samples)
    print("\nStarting Active Learning Loop:")
    for i in range(iterations):
        print(f"\n--- Iteration {i+1}/{iterations} ---")
        # Filter out already labeled samples
        remaining_samples = [
            s for s in remaining_samples
            if s not in [getattr(l, '_sample', l) for l in labeled_samples]
        ]
        if not remaining_samples:
            print("No more samples to label. Stopping.")
            break
        # Select the most informative samples
        selected = strategy.select(remaining_samples, n=batch_size)
        # Annotate the selected samples
        newly_labeled = annotator.annotate(selected)
        labeled_samples.extend(newly_labeled)
        # Display annotation results
        for sample in newly_labeled:
            print(f"Text: {sample.text}")
            print(f"Category: {sample.labels}")
            print(f"Confidence: {sample.metadata.get('confidence', 0):.2f}")
            explanation = sample.metadata.get('explanation', '')
            if len(explanation) > 100:
                explanation = explanation[:100] + "..."
            print(f"Explanation: {explanation}")
            print()
        # Visualize results periodically
        if (i + 1) % visualization_interval == 0:
            visualize_results(labeled_samples)
    return labeled_samples

def visualize_results(labeled_samples: List[LabeledSample]):
    """Create visualizations of annotation results.

    Args:
        labeled_samples: List of labeled samples to visualize
    """
    if not labeled_samples:
        return
    # Extract data
    categories = [s.labels for s in labeled_samples]
    confidence = [s.metadata.get("confidence", 0) for s in labeled_samples]
    texts = [s.text[:30] + "..." for s in labeled_samples]
    # Set up plots
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    # Plot 1: Count and average confidence by category
    category_counts = {}
    category_confidence = {}
    for cat, conf in zip(categories, confidence):
        if cat not in category_counts:
            category_counts[cat] = 0
            category_confidence[cat] = 0
        category_counts[cat] += 1
        category_confidence[cat] += conf
    for cat in category_confidence:
        category_confidence[cat] /= category_counts[cat]
    cats = list(category_counts.keys())
    counts = list(category_counts.values())
    avg_conf = list(category_confidence.values())
    x = np.arange(len(cats))
    width = 0.35
    ax1.bar(x - width/2, counts, width, label='Count')
    ax1.bar(x + width/2, avg_conf, width, label='Avg Confidence')
    ax1.set_xticks(x)
    ax1.set_xticklabels(cats, rotation=45)
    ax1.set_title('Category Distribution and Confidence')
    ax1.legend()
    # Plot 2: Per-sample confidence, sorted ascending
    sorted_indices = np.argsort(confidence)
    ax2.barh(range(len(texts)), [confidence[i] for i in sorted_indices])
    ax2.set_yticks(range(len(texts)))
    ax2.set_yticklabels([texts[i] for i in sorted_indices])
    ax2.set_title('Sample Confidence')
    ax2.set_xlabel('Confidence')
    plt.tight_layout()
    plt.show()
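The per-category aggregation inside visualize_results can be exercised on its own. This sketch (an extracted helper, not part of the pipeline above) computes the same counts and average confidences from (category, confidence) pairs:

```python
def aggregate(categories, confidence):
    """Return per-category sample counts and mean confidence."""
    counts, totals = {}, {}
    for cat, conf in zip(categories, confidence):
        counts[cat] = counts.get(cat, 0) + 1
        totals[cat] = totals.get(cat, 0.0) + conf
    return counts, {cat: totals[cat] / counts[cat] for cat in counts}

cats = ["Respiratory", "Respiratory", "Neurological"]
confs = [0.8, 0.6, 0.9]
print(aggregate(cats, confs))
```

Separating the aggregation from the plotting like this also makes the statistics reusable in non-notebook contexts.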
Running the End-to-End Pipeline
Now we can run the complete active learning workflow:
Python
# Initialize components
categories = ["Cardiovascular", "Respiratory", "Gastrointestinal", "Neurological"]
annotator = GeminiAnnotator(categories=categories)
strategy = keyword_strategy

# Run the active learning loop
labeled_data = run_active_learning_loop(
    samples=text_samples,
    annotator=annotator,
    strategy=strategy,
    iterations=5,
    visualization_interval=2
)

# Final visualization and analysis
visualize_results(labeled_data)

# Print summary statistics
print("\nAnnotation Summary:")
print(f"Total samples annotated: {len(labeled_data)}")
categories = [s.labels for s in labeled_data]
unique_categories = set(categories)
print(f"Categories found: {len(unique_categories)}")
for category in unique_categories:
    count = categories.count(category)
    print(f" - {category}: {count} samples ({count/len(labeled_data):.1%})")
avg_confidence = sum(s.metadata.get("confidence", 0) for s in labeled_data) / len(labeled_data)
print(f"Average confidence: {avg_confidence:.2f}")
Practical Applications and Extensions
Beyond medical symptom classification, this pipeline has many practical applications:
1. Content moderation
2. Customer feedback analysis
3. Clinical trial document processing
You can extend this implementation with additional prioritization strategies, alternative LLM annotators, or human-in-the-loop review of low-confidence labels.
Conclusion
Integrating Adala with Google Gemini provides a powerful framework for building intelligent annotation pipelines. By applying active learning strategies, we can sharply reduce the manual effort required while maintaining high-quality annotations.
The modular design patterns demonstrated in this tutorial adapt easily to other domains and annotation tasks.
For further exploration, the Adala GitHub repository offers additional examples and documentation for extending these concepts to more complex annotation scenarios.

