数据注释的主动学习:Gemini + Adala 指南

Adala框架——数据注释的主动学习

主动学习改变我们的训练方式 AI 模型 通过智能地选择最有价值的数据进行注释。当与 强大的法学硕士 喜欢 谷歌双子座,它创建了高效的注释管道,在保持高数据质量的同时减少了人工工作量。

本指南探讨如何使用 Adala 框架 – 一个强大但未得到充分利用的工具 自主数据标记.

我们将利用 Gemini 实现医学症状分类器's 通过结构化的主动学习工作流程实现功能。

理解数据注释的主动学习

理解数据注释的主动学习

主动学习解决了 监督学习:获取大量标记数据。与其随机选择数据点进行注释, 主动学习算法 确定对模型改进贡献最大的最具信息量的样本。

为什么主动学习很重要:

通过将人力集中在最重要的地方来降低注释成本。
提高 模型精度 标记示例较少。
通过优先考虑代表性不足的类别来解决类别不平衡问题。
在模型和 注释者.

Adala 框架将这些优势融入 生产工作流程 通过提供模块化组件来简化 主动学习过程。在深入实施之前,'s 研究 Adala 为何特别适合 积分 拥有 Google Gemini 等现代法学硕士学位。

Adala 是什么?框架介绍

Adala(自主数据标记代理) 开源框架 专为实施专门代理而设计 数据处理。与传统的注释工具不同,Adala 采用基于代理的方法,结合了:

基于技能的架构:定义注释代理所需的特定功能。
运行时灵活性:在不同的 LLM 或自定义运行时之间切换。
环境连接:与各种数据源交互。
内置学习循环:培训代理,使其随着时间的推移不断进步。

看着阿达拉's 快速入门示例,我们可以看到它的结构 情感分类:

蟒蛇

import pandas as pd
from adala.agents import Agent
from adala.environments import StaticEnvironment
from adala.skills import ClassificationSkill
from adala.runtimes import OpenAIChatRuntime
from rich import print

# Train dataset
train_df = pd.DataFrame([
    ["It was the negative first impressions, and then it started working.", "Positive"],
    ["Not loud enough and doesn't turn on like it should.", "Negative"],
    ["I don't know what to say.", "Neutral"],
    ["Manager was rude, but the most important that mic shows very flat frequency response.", "Positive"],
    ["The phone doesn't seem to accept anything except CBR mp3s.", "Negative"],
    ["I tried it before, I bought this device for my son.", "Neutral"],
], columns=["text", "sentiment"])

# Test dataset
test_df = pd.DataFrame([
    "All three broke within two months of use.",
    "The device worked for a long time, can't say anything bad.",
    "Just a random line of text."
], columns=["text"])

agent = Agent(
    # connect to a dataset
    environment=StaticEnvironment(df=train_df),
    # define a skill
    skills=ClassificationSkill(
        name='sentiment',
        instructions="Label text as positive, negative or neutral.",
        labels=["Positive", "Negative", "Neutral"],
        input_template="Text: {text}",
        output_template="Sentiment: {sentiment}"
    ),
    # define runtimes
    runtimes = {
        'openai': OpenAIChatRuntime(model='gpt-4o'),
    },
    teacher_runtimes = {
        'default': OpenAIChatRuntime(model='gpt-4o'),
    },
    default_runtime='openai',
)

agent.learn(learning_iterations=3, accuracy_threshold=0.95)
predictions = agent.run(test_df)

对于我们的医学症状分类任务,我们将调整此架构以整合 谷歌双子座 同时实施定制的主动学习策略。

设置您的环境

让's 首先安装 Adala 和所需的依赖项:

蟒蛇

# Install Adala directly from GitHub
!pip install -q git+https://github.com/HumanSignal/Adala.git

# Verify installation
!pip list | grep adala

# Install additional dependencies
!pip install -q google-generativeai pandas matplotlib numpy

我们还需要克隆存储库以便直接访问其组件:

蟒蛇

# Clone the repository for access to source files
!git clone https://github.com/HumanSignal/Adala.git

# Ensure the package is in our Python path
import sys
sys.path.append('./Adala')

# Import key components
from Adala.adala.annotators.base import BaseAnnotator
from Adala.adala.strategies.random_strategy import RandomStrategy
from Adala.adala.utils.custom_types import TextSample, LabeledSample

集成 Google Gemini 作为自定义注释器

与使用 Google Gemini 基本包装器的原始实现不同,我们将构建一个更 健壮的注释器 跟随阿达拉's 设计模式。这使得我们的解决方案更加 可维护的 并且可扩展。

首先,我们需要设置 Google 生成 AI 客户:

蟒蛇

import google.generativeai as genai
import os

# Set API key from environment or enter manually
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY") or getpass("Enter your Gemini API Key: ")
genai.configure(api_key=GEMINI_API_KEY)

现在,我们将通过扩展 Adala 来创建自定义注释器's BaseAnnotator 类:

蟒蛇

import json
import re
from typing import List, Dict, Any, Optional

class GeminiAnnotator(BaseAnnotator):
    """Custom annotator using Google Gemini for medical symptom classification."""
    
    def __init__(self, 
                 model_name: str = "models/gemini-2.0-flash-lite", 
                 categories: List[str] = None,
                 temperature: float = 0.1):
        """Initialize the Gemini annotator.
        
        Args:
            model_name: The Gemini model to use
            categories: List of valid classification categories
            temperature: Controls randomness in generation (lower = more deterministic)
        """
        self.model = genai.GenerativeModel(
            model_name=model_name,
            generation_config={"temperature": temperature}
        )
        self.categories = categories or ["Cardiovascular", "Respiratory", 
                                         "Gastrointestinal", "Neurological"]
    
    def _build_prompt(self, text: str) -> str:
        """Create a structured prompt for the model.
        
        Args:
            text: The symptom text to classify
            
        Returns:
            A formatted prompt string
        """
        return f"""Classify this medical symptom into one of these categories:
        {', '.join(self.categories)}.
        
        Return JSON format: {{"category": "selected_category", 
        "confidence": 0.XX, "explanation": "brief_reason"}}
        
        SYMPTOM: {text}"""
    
    def _parse_response(self, response: str) -> Dict[str, Any]:
        """Extract structured data from model response.
        
        Args:
            response: Raw text response from Gemini
            
        Returns:
            Dictionary containing parsed fields
        """
        try:
            # Extract JSON from response even if surrounded by text
            json_match = re.search(r'(\{.*\})', response, re.DOTALL)
            result = json.loads(json_match.group(1) if json_match else response)
            return {
                "category": result.get("category", "Unknown"),
                "confidence": result.get("confidence", 0.0),
                "explanation": result.get("explanation", "")
            }
        except Exception as e:
            return {
                "category": "Unknown",
                "confidence": 0.0,
                "explanation": f"Error parsing response: {str(e)}"
            }
    
    def annotate(self, samples: List[TextSample]) -> List[LabeledSample]:
        """Annotate a batch of text samples.
        
        Args:
            samples: List of TextSample objects
            
        Returns:
            List of LabeledSample objects with annotations
        """
        results = []
        for sample in samples:
            prompt = self._build_prompt(sample.text)
            try:
                response = self.model.generate_content(prompt).text
                parsed = self._parse_response(response)
                
                # Create labeled sample with metadata
                labeled_sample = LabeledSample(
                    text=sample.text,
                    labels=parsed["category"],
                    metadata={
                        "confidence": parsed["confidence"],
                        "explanation": parsed["explanation"]
                    }
                )
            except Exception as e:
                # Graceful error handling
                labeled_sample = LabeledSample(
                    text=sample.text,
                    labels="Unknown",
                    metadata={"error": str(e)}
                )
            
            # Store reference to original sample
            labeled_sample._sample = sample
            results.append(labeled_sample)
            
        return results

此实现比原始实现有了显著的改进:

  1. 它遵循 Adala 的正确类继承's 基础注释器
  2. 实现用于提示构建和响应解析的私有辅助方法
  3. 使用结构化 错误处理 和类型提示
  4. 提供完整的文档

构建症状分类管道

让's 创建一个数据集 医学症状 用于我们的分类任务。与原始实现不同,我们将使用更加多样化的数据集, 均衡代表制 跨类别:

蟒蛇

# Create a more comprehensive dataset
symptom_data = [
    # Cardiovascular symptoms
    "Chest pain radiating to left arm during exercise",
    "Heart palpitations when lying down",
    "Swollen ankles and shortness of breath",
    "Dizziness when standing up quickly",
    
    # Respiratory symptoms
    "Persistent dry cough with occasional wheezing",
    "Shortness of breath when climbing stairs",
    "Coughing up yellow or green mucus",
    "Rapid breathing with chest tightness",
    
    # Gastrointestinal symptoms
    "Stomach cramps and nausea after eating",
    "Burning sensation in upper abdomen",
    "Frequent loose stools with abdominal pain",
    "Yellowing of skin and eyes",
    
    # Neurological symptoms
    "Severe headache with sensitivity to light",
    "Numbness in fingers of right hand",
    "Memory loss and confusion",
    "Tremors in hands when reaching for objects"
]

# Convert to TextSample objects
text_samples = [TextSample(text=text) for text in symptom_data]

实施先进的主动学习策略

最初的实现使用了简单的优先级评分机制。我们将通过多种策略来增强这一机制,以演示 Adala's 灵活性:

蟒蛇

import numpy as np
from typing import List, Callable

class PrioritizationStrategy:
    """Base class for sample prioritization strategies."""
    
    def score_samples(self, samples: List[TextSample]) -> np.ndarray:
        """Assign priority scores to samples.
        
        Args:
            samples: List of samples to score
            
        Returns:
            Array of scores, higher values indicate higher priority
        """
        raise NotImplementedError("Subclasses must implement this method")
    
    def select(self, samples: List[TextSample], n: int = 1) -> List[TextSample]:
        """Select the top n highest scoring samples.
        
        Args:
            samples: List of samples to select from
            n: Number of samples to select
            
        Returns:
            List of selected samples
        """
        if not samples:
            return []
        
        scores = self.score_samples(samples)
        indices = np.argsort(-scores)[:n]  # Descending order
        return [samples[i] for i in indices]

class KeywordPriority(PrioritizationStrategy):
    """Prioritize samples based on medical urgency keywords."""
    
    def __init__(self, keyword_weights: Dict[str, float]):
        """Initialize with keyword weights.
        
        Args:
            keyword_weights: Dictionary mapping keywords to priority weights
        """
        self.keyword_weights = keyword_weights
    
    def score_samples(self, samples: List[TextSample]) -> np.ndarray:
        scores = np.zeros(len(samples))
        for i, sample in enumerate(samples):
            # Base score
            scores[i] = 0.1
            
            # Add weights for each keyword found
            text_lower = sample.text.lower()
            for keyword, weight in self.keyword_weights.items():
                if keyword in text_lower:
                    scores[i] += weight
        
        return scores

class UncertaintyPriority(PrioritizationStrategy):
    """Prioritize samples based on model uncertainty."""
    
    def __init__(self, model_fn: Callable[[List[TextSample]], List[float]]):
        """Initialize with uncertainty model function.
        
        Args:
            model_fn: Function that returns uncertainty scores for samples
        """
        self.model_fn = model_fn
    
    def score_samples(self, samples: List[TextSample]) -> np.ndarray:
        # Higher uncertainty = higher priority
        return np.array(self.model_fn(samples))

# Create a combined strategy
keyword_weights = {
    "chest": 0.5,
    "pain": 0.4,
    "breathing": 0.4, 
    "dizz": 0.3,
    "head": 0.2,
    "numb": 0.2
}

keyword_strategy = KeywordPriority(keyword_weights)

现在,让's 实施我们增强的主动学习循环:

蟒蛇

from matplotlib import pyplot as plt
from IPython.display import clear_output
import time

def run_active_learning_loop(
    samples: List[TextSample],
    annotator: GeminiAnnotator,
    strategy: PrioritizationStrategy,
    iterations: int = 5,
    batch_size: int = 1,
    visualization_interval: int = 1
):
    """Run an active learning loop with visualization.
    
    Args:
        samples: Pool of unlabeled samples
        annotator: Annotation system
        strategy: Sample selection strategy
        iterations: Number of learning iterations
        batch_size: Samples to annotate per iteration
        visualization_interval: How often to update visualizations
    
    Returns:
        List of labeled samples
    """
    labeled_samples = []
    remaining_samples = list(samples)
    
    print("\nStarting Active Learning Loop:")
    
    for i in range(iterations):
        print(f"\n--- Iteration {i+1}/{iterations} ---")
        
        # Filter out already labeled samples
        remaining_samples = [
            s for s in remaining_samples 
            if s not in [getattr(l, '_sample', l) for l in labeled_samples]
        ]
        
        if not remaining_samples:
            print("No more samples to label. Stopping.")
            break
        
        # Select most important samples
        selected = strategy.select(remaining_samples, n=batch_size)
        
        # Annotate selected samples
        newly_labeled = annotator.annotate(selected)
        labeled_samples.extend(newly_labeled)
        
        # Display annotation results
        for sample in newly_labeled:
            print(f"Text: {sample.text}")
            print(f"Category: {sample.labels}")
            print(f"Confidence: {sample.metadata.get('confidence', 0):.2f}")
            explanation = sample.metadata.get('explanation', '')
            print(f"Explanation: {explanation[:100]}..." if len(explanation) > 100 else explanation)
            print()
        
        # Visualize results periodically
        if (i + 1) % visualization_interval == 0:
            visualize_results(labeled_samples)
            
    return labeled_samples

def visualize_results(labeled_samples: List[LabeledSample]):
    """Create visualizations of annotation results.
    
    Args:
        labeled_samples: List of labeled samples to visualize
    """
    if not labeled_samples:
        return
        
    # Extract data
    categories = [s.labels for s in labeled_samples]
    confidence = [s.metadata.get("confidence", 0) for s in labeled_samples]
    texts = [s.text[:30] + "..." for s in labeled_samples]
    
    # Set up plots
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    
    # Plot 1: Confidence by category
    category_counts = {}
    category_confidence = {}
    
    for cat, conf in zip(categories, confidence):
        if cat not in category_counts:
            category_counts[cat] = 0
            category_confidence[cat] = 0
        category_counts[cat] += 1
        category_confidence[cat] += conf
    
    for cat in category_confidence:
        category_confidence[cat] /= category_counts[cat]
    
    cats = list(category_counts.keys())
    counts = list(category_counts.values())
    avg_conf = list(category_confidence.values())
    
    x = np.arange(len(cats))
    width = 0.35
    
    ax1.bar(x - width/2, counts, width, label='Count')
    ax1.bar(x + width/2, avg_conf, width, label='Avg Confidence')
    ax1.set_xticks(x)
    ax1.set_xticklabels(cats, rotation=45)
    ax1.set_title('Category Distribution and Confidence')
    ax1.legend()
    
    # Plot 2: Individual sample confidence
    sorted_indices = np.argsort(confidence)
    ax2.barh(range(len(texts)), [confidence[i] for i in sorted_indices])
    ax2.set_yticks(range(len(texts)))
    ax2.set_yticklabels([texts[i] for i in sorted_indices])
    ax2.set_title('Sample Confidence')
    ax2.set_xlabel('Confidence')
    
    plt.tight_layout()
    plt.show()

运行端到端管道

现在我们可以运行完整的主动学习流程:

蟒蛇

# Initialize components
categories = ["Cardiovascular", "Respiratory", "Gastrointestinal", "Neurological"]
annotator = GeminiAnnotator(categories=categories)
strategy = keyword_strategy

# Run the active learning loop
labeled_data = run_active_learning_loop(
    samples=text_samples,
    annotator=annotator,
    strategy=strategy,
    iterations=5,
    visualization_interval=2
)

# Final visualization and analysis
visualize_results(labeled_data)

# Print summary statistics
print("\nAnnotation Summary:")
print(f"Total samples annotated: {len(labeled_data)}")

categories = [s.labels for s in labeled_data]
unique_categories = set(categories)
print(f"Categories found: {len(unique_categories)}")
for category in unique_categories:
    count = categories.count(category)
    print(f"  - {category}: {count} samples ({count/len(labeled_data):.1%})")

avg_confidence = sum(s.metadata.get("confidence", 0) for s in labeled_data) / len(labeled_data)
print(f"Average confidence: {avg_confidence:.2f}")

实际应用和扩展

除了医学症状分类之外,该流程还有许多实际应用:

1. 内容审核

优先 用户举报的内容
关注高风险类别
根据内容类型调整置信度阈值

2. 客户反馈分析

确定紧急 客户问题
捕捉新出现的产品问题
将反馈发送给合适的团队

3.临床试验文件处理

提取不良事件报告
分类 患者报告结果
优先考虑安全信号

您可以通过以下方式扩展此实现:

添加反馈循环以改进注释器
实施不同的选择策略(多样性、 集群)
创建用于人机验证的 Web 界面
启用 多标签分类 对于复杂的症状

结语

Adala 与 Google Gemini 的整合提供了 强大的框架 用于构建智能注释流程。通过利用主动 学习策略,我们可以大大减少所需的手动工作量,同时保持 高质量注释.

本教程中演示的模块化设计模式允许 容易适应 适用于不同的领域和注释任务。

对于那些有兴趣进一步探索的人来说, Adala GitHub 存储库 提供额外的示例和文档以将这些概念扩展到更多 复杂注释场景.

发表评论

您的电邮地址不会被公开。 必填项 *

本网站使用Akismet来减少垃圾邮件。 了解您的评论数据是如何被处理的。

即刻加入 Aimojo 部落!

每周加入 76,200 多名会员获取内幕消息! 
🎁 奖金: 获得我们的 200 美元“AI 注册即可免费获得“精通工具包”!

热门 AI 工具
超大规模人工智能

几分钟内即可将任何网址转化为可立即投放的广告活动 此 AI 专为效果营销人员和以增长为导向的品牌打造的广告代理

TL;DV

不要让会议内容白白丢失。要将每次会议的内容都付诸行动。 此 AI 会议记录工具,能够记录对话并将其转化为可执行的成果。

AskYura

将每一次客户对话转化为一项完整的业务行动 无代码 AI 专为运营执行而构建的代理

库伯斯

部署更智能,扩展更快,云成本最多可降低 40%。 专为零配置全栈部署而构建的 AI 代理云 PaaS。

乌扎德

无需任何设计技能,即可将想法转化为交互式原型 AI 用于线框图、模型和应用原型设计的UI设计工具

© 2023 - 2026 版权所有 | 成为 AI 专业版 | 用心打造