Feature: AI Services Integration
Status: ⏳ Planned
Priority: High
Complexity: High
Estimate: 10-16 hours
Assignee: -
Created: May 31, 2025
Target Completion: -
PR: -
Related Features: Story Integration, Vocabulary System, Quiz System, Lesson Management
📌 Overview
Purpose
Integrate three AI services into the application: Mistral-Medium for text generation (stories, feedback), Vosk for speech recognition (speaking exercises), and Coqui TTS for text-to-speech (vocabulary, stories, quizzes).
User Story
As a learner, I want AI-powered features like generated stories, speech recognition for speaking practice, and TTS for audio content so that I can have an immersive and interactive learning experience.
Acceptance Criteria
📋 Requirements
Functional Requirements
| ID | Requirement | Priority |
|----|-------------|----------|
| FR-001 | Generate stories using Mistral-Medium | High |
| FR-002 | Generate writing feedback using Mistral-Medium | High |
| FR-003 | Transcribe speech using Vosk | High |
| FR-004 | Generate audio using Coqui TTS | High |
| FR-005 | Configure all services via configuration | High |
| FR-006 | Handle AI service errors gracefully | High |
| FR-007 | Cache/rate limit AI API calls | Medium |
| FR-008 | Validate AI outputs before use | Medium |
Non-Functional Requirements
- Performance: TTS generation < 2 seconds per sentence
- Performance: Speech recognition < 3 seconds
- Performance: AI API calls < 5 seconds
- Reliability: Services should degrade gracefully on failure
- Cost: Minimize API call costs (caching, batching)
🏗️ Technical Design
Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│                      AI Services Layer                      │
├─────────────────────────────────────────────────────────────┤
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Mistral-Medium  │ │      Vosk       │ │    Coqui TTS    │ │
│ │   (Text Gen)    │ │ (Speech Recog.) │ │   (Audio Gen)   │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
│          │                   │                   │          │
│          ▼                   ▼                   ▼          │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │                  Application Services                   │ │
│ │  - StoryGenerationService                               │ │
│ │  - WritingFeedbackService                               │ │
│ │  - VoskService (Speech Recognition)                     │ │
│ │  - TtsService (Text-to-Speech)                          │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Components Involved
- Backend Services:
  - IMistralService / MistralService - Text generation
  - IVoskService / VoskService - Speech recognition
  - ITtsService / TtsService - Text-to-speech
- Configuration: appsettings.json with AI settings
- External Dependencies:
  - Mistral-Medium API
  - Vosk Python library + German model
  - Coqui TTS Python library + German model
Data Flow
Story Generation Flow
1. StoryGenerationService receives request with vocabulary list and level
2. Service constructs prompt for Mistral-Medium
3. MistralService sends prompt to Mistral API
4. Mistral API returns generated story text
5. StoryGenerationService validates and returns story
6. StoryService saves story and triggers audio generation
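Steps 2 and 5 of this flow (prompt construction and validation) could look roughly like the sketch below. The class name, the prompt wording, and the validation rule are illustrative assumptions; the plan does not specify them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: names, prompt wording, and the validation rule are
// assumptions made for illustration, not part of the planned services.
public static class StoryPromptSketch
{
    // Step 2: build the prompt from the vocabulary list and the learner level.
    public static string BuildPrompt(IReadOnlyCollection<string> vocabulary, string level)
    {
        if (vocabulary.Count == 0)
            throw new ArgumentException("Vocabulary list must not be empty.", nameof(vocabulary));

        return $"Write a short German story for CEFR level {level}. " +
               $"Use each of these words at least once: {string.Join(", ", vocabulary)}.";
    }

    // Step 5: return the vocabulary words that do not occur in the generated story.
    // An empty result means the story passes this deliberately simple check.
    public static IReadOnlyList<string> FindMissingWords(string story, IEnumerable<string> vocabulary)
    {
        return vocabulary
            .Where(word => !story.Contains(word, StringComparison.OrdinalIgnoreCase))
            .ToList();
    }
}
```

A plain substring match will miss inflected German forms (for example plural or case endings), so a real validator would probably need lemmatization or a more tolerant comparison.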
Speech Recognition Flow
1. User records speech in frontend
2. Frontend sends audio file to /api/speech/recognize
3. VoskService receives audio bytes
4. VoskService calls Vosk Python CLI with German model
5. Vosk returns transcribed text
6. Backend validates transcription and returns to frontend
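Before step 4, the backend may want to reject audio that does not match the format the recognizer is configured for. The sketch below checks a canonical 44-byte WAV header for 16-bit mono PCM at the configured sample rate. It assumes the `fmt ` chunk directly follows the RIFF header, which holds for simple recorder output but not for every WAV file; the class and method names are hypothetical.

```csharp
using System;
using System.Buffers.Binary;
using System.Text;

// Hypothetical pre-check; assumes the canonical header layout (RIFF, WAVE, "fmt " at offset 12).
public static class WavHeaderSketch
{
    public static bool IsMonoPcm16(ReadOnlySpan<byte> wav, uint expectedSampleRate = 16000)
    {
        if (wav.Length < 44) return false;
        if (Encoding.ASCII.GetString(wav.Slice(0, 4)) != "RIFF") return false;
        if (Encoding.ASCII.GetString(wav.Slice(8, 4)) != "WAVE") return false;
        if (Encoding.ASCII.GetString(wav.Slice(12, 4)) != "fmt ") return false;

        // All multi-byte fields in a RIFF/WAVE header are little-endian.
        ushort audioFormat = BinaryPrimitives.ReadUInt16LittleEndian(wav.Slice(20, 2)); // 1 = PCM
        ushort channels = BinaryPrimitives.ReadUInt16LittleEndian(wav.Slice(22, 2));
        uint sampleRate = BinaryPrimitives.ReadUInt32LittleEndian(wav.Slice(24, 4));
        ushort bitsPerSample = BinaryPrimitives.ReadUInt16LittleEndian(wav.Slice(34, 2));

        return audioFormat == 1
            && channels == 1
            && sampleRate == expectedSampleRate
            && bitsPerSample == 16;
    }
}
```

Rejecting unsupported audio early returns a clear 400-style error to the frontend instead of spending a Python process launch on input that would transcribe poorly or fail.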
TTS Flow
1. TtsService receives text to synthesize
2. Service calls Coqui TTS Python CLI
3. Coqui generates audio file
4. Audio file saved to filesystem
5. Audio URL returned to caller
🚀 Implementation Plan
Phase 1: Configuration & Interfaces (2 hours)
Phase 2: Mistral-Medium Integration (2-3 hours)
Phase 3: Vosk Speech Recognition (2-3 hours)
Phase 4: Coqui TTS Integration (2-3 hours)
Phase 5: Service Integration (2 hours)
Milestones
| Milestone | Date | Status |
|-----------|------|--------|
| Configuration & Interfaces | - | ⏳ |
| Mistral Integration | - | ⏳ |
| Vosk Integration | - | ⏳ |
| Coqui TTS Integration | - | ⏳ |
| Service Integration | - | ⏳ |
✅ Tasks
Backend - Configuration
Backend - Mistral Service
Backend - Vosk Service
Backend - Coqui TTS Service
Backend - Higher-Level Services
Infrastructure Setup
Frontend Integration
✅ Definition of Done
General Criteria (All Features)
AI-Specific Criteria
🧪 Testing Strategy
Testing Approach
| Test Type | Coverage | Tools | Responsibility |
|-----------|----------|-------|----------------|
| Unit Tests | 80%+ code coverage | MSTest, Moq | Backend Dev |
| Integration Tests | All service interactions | MSTest, Testcontainers | Backend Dev |
| API Tests | All endpoints | MSTest, HttpClient | Backend Dev |
| Frontend Unit Tests | Component logic | Vitest | Frontend Dev |
| Frontend Integration | Service integration | Vitest | Frontend Dev |
| E2E Tests | Critical user journeys | Playwright | QA/Dev |
| Manual Testing | Exploratory, edge cases | BrowserStack | QA |
| Load Testing | AI service performance | k6/JMeter | DevOps |
AI-Specific Tests
Mistral Service Tests
Vosk Service Tests
Coqui TTS Service Tests
Test Data
- Sample audio files for Vosk testing (clear German speech, noisy audio, non-German speech)
- Sample texts for TTS testing (short, long, with special characters, with German umlauts)
- Sample prompts for Mistral testing (A1, A2, B1 levels)
🚨 Risks & Mitigations
Technical Risks
| Risk | Likelihood | Impact | Mitigation | Owner |
|------|------------|--------|------------|-------|
| Python-.NET integration failures | High | High | Use Process class with proper error handling, implement process pooling, add timeouts | Backend Dev |
| Vosk model compatibility issues | Medium | High | Test with vosk-model-de-0.22 before implementation, have fallback to vosk-model-small-de-0.15 | Backend Dev |
| Coqui model quality issues | Medium | Medium | Test with sample German text, have alternative TTS service as fallback | Backend Dev |
| Mistral API rate limits | High | Medium | Implement caching (1h TTL), request queue, exponential backoff | Backend Dev |
| Mistral API costs exceed budget | Medium | High | Set budget alerts, implement cost tracking, cache aggressively | Backend Dev |
| AI services slow performance | High | Medium | Implement async processing, use background jobs for batch operations | Backend Dev |
| Audio files too large | Medium | Medium | Compress audio (16kHz, mono), implement streaming for large files | Backend Dev |
| Model files too large for deployment | Medium | Medium | Use Docker volumes, separate storage for models, consider cloud storage | DevOps |
| Memory leaks in Python processes | Medium | High | Implement process lifecycle management, add memory monitoring, use process pooling | Backend Dev |
| Different Python versions cause issues | Medium | Medium | Use Docker to pin Python version, document exact version in README | DevOps |
Operational Risks
| Risk | Likelihood | Impact | Mitigation | Owner |
|------|------------|--------|------------|-------|
| AI service downtime | Medium | High | Implement health checks, circuit breakers, fallback responses | DevOps |
| Model files corrupted | Low | High | Implement checksum validation, store backups, automated recovery | DevOps |
| API key exposure | Medium | High | Use GitHub secrets, Azure Key Vault, never commit to repo | Security |
| Audio storage fills up | Medium | Medium | Implement cleanup job, set size quotas, use cloud storage | DevOps |
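The checksum validation named as a mitigation for corrupted model files could be as small as the sketch below, run once at startup against a known-good SHA-256 value. The class name and the idea of storing the expected hash next to the model are assumptions, not decisions recorded in this plan.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical startup check: compares a model file's SHA-256 with a known-good value.
public static class ModelChecksumSketch
{
    // Streams the file through SHA-256 so large model files are not loaded into memory at once.
    public static string ComputeSha256(string filePath)
    {
        using var stream = File.OpenRead(filePath);
        using var sha = SHA256.Create();
        return Convert.ToHexString(sha.ComputeHash(stream)).ToLowerInvariant();
    }

    // Hex comparison is case-insensitive because tools differ in how they print digests.
    public static bool Matches(string filePath, string expectedHex) =>
        string.Equals(ComputeSha256(filePath), expectedHex.Trim(), StringComparison.OrdinalIgnoreCase);
}
```

Because the models are directories rather than single files, the check would likely run over the downloaded archive or over a manifest of per-file hashes.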
Business Risks
| Risk | Likelihood | Impact | Mitigation | Owner |
|------|------------|--------|------------|-------|
| User data privacy concerns | Medium | High | Anonymize audio before processing, document data handling policy, comply with GDPR | Legal |
| AI generates inappropriate content | Low | High | Implement content moderation, add user reporting, use system prompts to prevent | Backend Dev |
| AI services become too expensive | Medium | Medium | Monitor costs, set budget caps, evaluate open-source alternatives | Product |
🔗 Dependencies
Feature Dependencies
Technical Dependencies
- Python 3.8+
- Vosk Python library
- vosk-model-de-0.22 (German model)
- Coqui TTS Python library
- Coqui German TTS model
- Mistral-Medium API key
External Services
| Service | Purpose | Configuration |
|---------|---------|---------------|
| Mistral-Medium API | Text generation (stories, feedback) | API key, endpoint URL |
| Vosk | Speech recognition | Python path, model path |
| Coqui TTS | Text-to-speech | Python path, model name |
Blockers
🔧 Technical Deep Dive: Python-.NET Integration
Integration Patterns
Option 1: Process.Start (Recommended for MVP)
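// Simple approach: spawn one Python process per request.
// Requires: using System.Diagnostics; using System.IO;
public async Task<string> RecognizeSpeechAsync(byte[] audioData, CancellationToken ct = default)
{
    // Build the path directly; Path.GetTempFileName() would leave an extra empty .tmp file behind.
    var tempFile = Path.Combine(Path.GetTempPath(), $"{Guid.NewGuid():N}.wav");
    await File.WriteAllBytesAsync(tempFile, audioData, ct);
    try
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "python",
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };
        // ArgumentList quotes each argument, so paths containing spaces are passed intact.
        // NOTE: module name and flags are carried over from this plan; verify them against
        // the installed Vosk package (recent releases ship a vosk-transcriber command).
        foreach (var arg in new[] { "-m", "vosk.transcribe", "--model", _modelPath, "--input", tempFile })
            startInfo.ArgumentList.Add(arg);
        // ProcessStartInfo.Environment is read-only; add entries instead of assigning a dictionary.
        startInfo.Environment["PYTHONPATH"] = "/path/to/vosk";

        using var process = new Process { StartInfo = startInfo };
        process.Start();

        // Read both streams concurrently; reading them one after the other can deadlock
        // when the pipe that is not being read fills up.
        var outputTask = process.StandardOutput.ReadToEndAsync();
        var errorTask = process.StandardError.ReadToEndAsync();

        // Enforce a timeout (30 s is an assumed default) so a hung process cannot block forever.
        using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        timeoutCts.CancelAfter(TimeSpan.FromSeconds(30));
        try
        {
            await process.WaitForExitAsync(timeoutCts.Token);
        }
        catch (OperationCanceledException)
        {
            process.Kill(entireProcessTree: true);
            throw;
        }

        var output = await outputTask;
        var error = await errorTask;
        if (process.ExitCode != 0)
        {
            throw new AiServiceException($"Vosk failed: {error}");
        }
        return output.Trim();
    }
    finally
    {
        // Always remove the temporary audio file, even on failure.
        File.Delete(tempFile);
    }
}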
Pros: Simple, easy to implement, no additional dependencies
Cons: Process startup overhead (~100-500ms per call), resource-intensive
Option 2: Process Pooling (Recommended for Production)
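// Maintain a pool of persistent Python processes.
// Requires: using System.Collections.Concurrent; using System.Diagnostics;
public class PythonProcessPool : IDisposable
{
    private readonly ConcurrentQueue<Process> _pool = new();
    private readonly SemaphoreSlim _semaphore;
    private readonly string _pythonPath;
    private readonly string _scriptPath;

    public PythonProcessPool(int size, string pythonPath, string scriptPath)
    {
        _semaphore = new SemaphoreSlim(size);
        _pythonPath = pythonPath;
        _scriptPath = scriptPath;
        // Pre-warm the pool
        for (int i = 0; i < size; i++)
        {
            _pool.Enqueue(StartProcess());
        }
    }

    public async Task<string> ExecuteAsync(string input)
    {
        await _semaphore.WaitAsync();
        // Replace a missing or crashed worker instead of reusing it.
        if (!_pool.TryDequeue(out var process) || process.HasExited)
        {
            process?.Dispose();
            process = StartProcess();
        }
        try
        {
            // Send input to stdin (protocol: one request per line)
            await process.StandardInput.WriteLineAsync(input);
            await process.StandardInput.FlushAsync();
            // Read response from stdout (protocol: one response per line)
            var response = await process.StandardOutput.ReadLineAsync();
            // null means the worker closed stdout, i.e. it died mid-request.
            return response ?? throw new AiServiceException("Python worker exited before responding");
        }
        finally
        {
            // A dead worker is detected and replaced by the HasExited check on the next call.
            _pool.Enqueue(process);
            _semaphore.Release();
        }
    }

    private Process StartProcess()
    {
        var process = new Process
        {
            StartInfo = new ProcessStartInfo
            {
                FileName = _pythonPath,
                ArgumentList = { _scriptPath },
                RedirectStandardInput = true,
                RedirectStandardOutput = true,
                RedirectStandardError = true,
                UseShellExecute = false,
                CreateNoWindow = true
            }
        };
        // Start() returns bool, so it cannot be chained onto the initializer.
        process.Start();
        // Drain stderr so a chatty worker cannot block on a full pipe.
        process.ErrorDataReceived += (_, _) => { };
        process.BeginErrorReadLine();
        return process;
    }

    public void Dispose()
    {
        foreach (var process in _pool)
        {
            try { process.Kill(); } catch { /* already exited */ }
            process.Dispose();
        }
        _semaphore.Dispose();
    }
}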
Pros: Eliminates process startup overhead, much faster for repeated calls
Cons: More complex, need to handle process lifecycle, stdin/stdout parsing
Option 3: gRPC (Best for Production)
- Create Python gRPC server for AI services
- .NET client calls gRPC methods
- Single persistent Python process
- Type-safe, high-performance
Pros: Best performance, type-safe, production-ready
Cons: Most complex to set up, requires gRPC knowledge
Error Handling Strategy
// Retry wrapper for AI service calls: per-attempt timeout plus exponential backoff.
public async Task<T> ExecuteWithRetryAsync<T>(
    Func<CancellationToken, Task<T>> action,
    string operationName,
    int maxRetries = 3,
    TimeSpan? timeout = null)
{
    var retryCount = 0;
    timeout ??= TimeSpan.FromSeconds(30);
    while (true)
    {
        try
        {
            // The token must reach the action; otherwise the timeout is never enforced.
            using var cts = new CancellationTokenSource(timeout.Value);
            return await action(cts.Token);
        }
        catch (OperationCanceledException) when (retryCount < maxRetries)
        {
            retryCount++;
            var delay = TimeSpan.FromSeconds(Math.Pow(2, retryCount)); // 2 s, 4 s, 8 s, ...
            _logger.LogWarning(
                "{Operation} timed out (attempt {Attempt}), retrying in {Delay}s...",
                operationName, retryCount, delay.TotalSeconds);
            await Task.Delay(delay);
        }
        catch (AiServiceException ex) when (IsRetryable(ex) && retryCount < maxRetries)
        {
            retryCount++;
            var delay = TimeSpan.FromSeconds(Math.Pow(2, retryCount));
            _logger.LogWarning(ex,
                "{Operation} failed (attempt {Attempt}), retrying in {Delay}s...",
                operationName, retryCount, delay.TotalSeconds);
            await Task.Delay(delay);
        }
        catch (Exception ex)
        {
            // Reached for non-retryable errors and once the retry budget is exhausted.
            _logger.LogError(ex, "{Operation} failed permanently after {Attempts} attempts",
                operationName, retryCount + 1);
            throw new AiServiceException($"{operationName} failed: {ex.Message}", ex);
        }
    }
}

// Only transient conditions are worth retrying.
private static bool IsRetryable(AiServiceException ex) =>
    ex.ErrorCode switch
    {
        AiErrorCode.RateLimited => true,
        AiErrorCode.Temporary => true,
        AiErrorCode.Timeout => true,
        _ => false
    };
Health Check Implementation
// Health check for AI services.
// Requires: using Microsoft.Extensions.Diagnostics.HealthChecks;
public class AiServicesHealthCheck : IHealthCheck
{
    private readonly IMistralService _mistral;
    private readonly IVoskService _vosk;
    private readonly ITtsService _tts;

    // Dependencies are injected; without this constructor the fields would stay null.
    public AiServicesHealthCheck(IMistralService mistral, IVoskService vosk, ITtsService tts)
    {
        _mistral = mistral;
        _vosk = vosk;
        _tts = tts;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        var checks = new Dictionary<string, HealthStatus>();
        // Check Mistral
        try
        {
            await _mistral.TestConnectionAsync(cancellationToken);
            checks["Mistral"] = HealthStatus.Healthy;
        }
        catch (Exception)
        {
            checks["Mistral"] = HealthStatus.Unhealthy;
        }
        // Check Vosk
        try
        {
            await _vosk.TestModelAsync(cancellationToken);
            checks["Vosk"] = HealthStatus.Healthy;
        }
        catch (Exception)
        {
            checks["Vosk"] = HealthStatus.Unhealthy;
        }
        // Check Coqui TTS
        try
        {
            await _tts.TestModelAsync(cancellationToken);
            checks["Coqui TTS"] = HealthStatus.Healthy;
        }
        catch (Exception)
        {
            checks["Coqui TTS"] = HealthStatus.Unhealthy;
        }
        var allHealthy = checks.Values.All(s => s == HealthStatus.Healthy);
        var status = allHealthy ? HealthStatus.Healthy : HealthStatus.Unhealthy;
        // HealthCheckResult expects IReadOnlyDictionary<string, object>, so convert the values.
        var data = checks.ToDictionary(kv => kv.Key, kv => (object)kv.Value.ToString());
        return new HealthCheckResult(
            status,
            "AI Services health check",
            data: data);
    }
}
Audio File Management
// Audio file storage service
public class AudioFileService
{
    private readonly string _basePath;
    private readonly ILogger<AudioFileService> _logger;

    public AudioFileService(IConfiguration config, ILogger<AudioFileService> logger)
    {
        _basePath = config["Audio:StoragePath"] ?? "/var/audio";
        _logger = logger;
        Directory.CreateDirectory(_basePath);
    }

    public async Task<string> SaveAudioAsync(byte[] audioData, string category, int entityId)
    {
        // Validate audio data
        if (audioData == null || audioData.Length == 0)
            throw new ArgumentException("Audio data cannot be empty");
        if (audioData.Length > 10 * 1024 * 1024) // 10MB limit
            throw new ArgumentException("Audio file too large");
        // Reject categories that could escape the storage root (path traversal).
        if (string.IsNullOrWhiteSpace(category)
            || category.Contains("..")
            || category.IndexOfAny(Path.GetInvalidFileNameChars()) >= 0)
            throw new ArgumentException("Invalid category", nameof(category));
        // Create category directory
        var categoryPath = Path.Combine(_basePath, category);
        Directory.CreateDirectory(categoryPath);
        // One file per entity: saving again for the same ID replaces the previous audio.
        var extension = ".wav"; // or detect from data
        var filename = $"{entityId}{extension}";
        var fullPath = Path.Combine(categoryPath, filename);
        // WriteAllBytesAsync overwrites an existing file, so no explicit delete is needed.
        await File.WriteAllBytesAsync(fullPath, audioData);
        // Return the URL path under which the file is served
        return $"/audio/{category}/{filename}";
    }

    // Synchronous file I/O; returns a completed Task to keep the async-style signature.
    public Task CleanupOldFilesAsync(TimeSpan olderThan)
    {
        var cutoff = DateTime.UtcNow - olderThan;
        foreach (var categoryDir in Directory.GetDirectories(_basePath))
        {
            foreach (var file in Directory.GetFiles(categoryDir))
            {
                var fileInfo = new FileInfo(file);
                if (fileInfo.LastWriteTimeUtc < cutoff)
                {
                    try
                    {
                        File.Delete(file);
                        _logger.LogInformation("Deleted old audio file: {File}", file);
                    }
                    catch (Exception ex)
                    {
                        _logger.LogError(ex, "Failed to delete audio file: {File}", file);
                    }
                }
            }
        }
        return Task.CompletedTask;
    }
}
Rate Limiting Implementation
// Rate limiter for AI services (sliding window, one window per service name)
public class AiRateLimiter
{
    private readonly ConcurrentDictionary<string, RateLimitEntry> _limits = new();
    private readonly int _maxRequests;
    private readonly TimeSpan _window;

    public AiRateLimiter(int maxRequestsPerWindow, TimeSpan window)
    {
        _maxRequests = maxRequestsPerWindow;
        _window = window;
    }

    public bool TryAcquire(string serviceName)
    {
        var now = DateTime.UtcNow;
        var entry = _limits.GetOrAdd(serviceName, _ => new RateLimitEntry());
        lock (entry)
        {
            // Remove old requests
            entry.Requests.RemoveAll(r => now - r > _window);
            // Check if limit exceeded
            if (entry.Requests.Count >= _maxRequests)
                return false;
            // Add new request
            entry.Requests.Add(now);
            return true;
        }
    }

    private class RateLimitEntry
    {
        public List<DateTime> Requests { get; } = new();
    }
}

// Usage in controller
[HttpPost("recognize")]
public async Task<IActionResult> RecognizeSpeech([FromBody] AudioRequest request)
{
    if (!_rateLimiter.TryAcquire("Vosk"))
    {
        return StatusCode(429, "Too many requests");
    }
    // ... process request
}
📝 Notes & Decisions
| Date | Decision | Rationale |
|------|----------|-----------|
| May 31, 2025 | Use Mistral-Medium | Best balance of quality and cost for this use case |
| May 31, 2025 | Use Vosk for speech recognition | Open-source, supports German, self-hostable |
| May 31, 2025 | Use Coqui TTS | Open-source, good quality, supports German |
| May 31, 2025 | Self-host AI services | More control, no external API dependencies (except Mistral) |
| May 31, 2025 | Use Python CLI wrappers | Easier integration with .NET, well-supported libraries |
Technical Notes
Vosk Configuration
{
  "Vosk": {
    "PythonPath": "/usr/bin/python3",
    "ModelPath": "/models/vosk-model-de-0.22",
    "SampleRate": 16000
  }
}
Coqui TTS Configuration
{
  "Coqui": {
    "PythonPath": "/usr/bin/python3",
    "ModelName": "tts_models/deu/fairseq/vits",
    "AudioOutputFormat": "wav",
    "SampleRate": 22050
  }
}
Mistral Configuration
{
  "Mistral": {
    "ApiKey": "your-api-key",
    "BaseUrl": "https://api.mistral.ai/v1/",
    "DefaultModel": "mistral-medium",
    "TimeoutSeconds": 30,
    "MaxRetries": 3
  }
}
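To fail fast on a bad configuration, each section could be bound to a small options class and validated at startup. The sketch below mirrors the Mistral section; the class name and the validation rules are assumptions, and the Vosk and Coqui sections would follow the same pattern.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical options class mirroring the "Mistral" configuration section.
public sealed class MistralOptionsSketch
{
    public string ApiKey { get; init; } = "";
    public string BaseUrl { get; init; } = "";
    public string DefaultModel { get; init; } = "";
    public int TimeoutSeconds { get; init; } = 30;
    public int MaxRetries { get; init; } = 3;

    // Returns a list of human-readable problems; an empty list means the settings look usable.
    public IReadOnlyList<string> Validate()
    {
        var errors = new List<string>();
        if (string.IsNullOrWhiteSpace(ApiKey) || ApiKey == "your-api-key")
            errors.Add("Mistral:ApiKey is missing or still the placeholder value.");
        if (!Uri.TryCreate(BaseUrl, UriKind.Absolute, out var uri) || uri.Scheme != Uri.UriSchemeHttps)
            errors.Add("Mistral:BaseUrl must be an absolute https URL.");
        if (string.IsNullOrWhiteSpace(DefaultModel))
            errors.Add("Mistral:DefaultModel must be set.");
        if (TimeoutSeconds <= 0)
            errors.Add("Mistral:TimeoutSeconds must be positive.");
        if (MaxRetries < 0)
            errors.Add("Mistral:MaxRetries must not be negative.");
        return errors;
    }
}
```

In an ASP.NET Core host the class would typically be bound with `services.Configure<MistralOptionsSketch>(configuration.GetSection("Mistral"))`, with the real API key supplied from a secret store rather than appsettings.json.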
Error Handling Strategy
- Transient errors: Retry with exponential backoff
- Rate limits: Return 429 to client, suggest retry
- Service unavailable: Return 503, log error
- Invalid response: Validate output, return meaningful error
- Timeout: Return 504, suggest retry
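The mapping above could be centralized in one place so controllers stay consistent. In this sketch the `Unavailable` and `InvalidResponse` members and the 502 status for invalid responses are assumptions; the plan only names the 429, 503, and 504 cases.

```csharp
using System;

// RateLimited, Temporary, and Timeout appear in this plan; the other two members are hypothetical.
public enum AiErrorCode { RateLimited, Temporary, Timeout, Unavailable, InvalidResponse }

public static class AiErrorHttpMappingSketch
{
    // Translates an AI error code into the HTTP status returned to the client.
    public static int ToHttpStatus(AiErrorCode code) => code switch
    {
        AiErrorCode.RateLimited => 429,     // client should retry later
        AiErrorCode.Timeout => 504,         // upstream did not answer in time
        AiErrorCode.Temporary => 503,       // assumed: retries were already exhausted
        AiErrorCode.Unavailable => 503,
        AiErrorCode.InvalidResponse => 502, // assumed status for unusable AI output
        _ => 500
    };
}
```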
Caching Strategy
- Mistral responses: Cache for 1 hour (stories unlikely to change)
- TTS audio: Cache files permanently (regenerate only if text changes)
- Vosk: No caching (each audio is unique)
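One way to detect "text changed" for the TTS cache is to derive a key from the text and the model name and store it next to the audio file. The sketch below is one possible approach, not a decided design; the class name and the choice to include the model name in the key are assumptions.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical cache-key helper: identical text and model always give the same key.
public static class TtsCacheKeySketch
{
    public static string ComputeKey(string text, string modelName)
    {
        // Trim so that incidental leading/trailing whitespace does not force regeneration.
        var normalized = text.Trim();
        var bytes = Encoding.UTF8.GetBytes($"{modelName}\n{normalized}");
        // SHA256.HashData and Convert.ToHexString require .NET 5 or later.
        return Convert.ToHexString(SHA256.HashData(bytes)).ToLowerInvariant();
    }
}
```

If the stored key for `{id}.wav` differs from the key computed for the current text, the audio is stale and should be regenerated; otherwise the existing file can be served as is.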
Gotchas
- ⚠️ Vosk model is ~500MB - ensure enough disk space
- ⚠️ Coqui model is ~1.5GB - ensure enough disk space
- ⚠️ Python processes may have memory leaks - monitor and restart
- ⚠️ AI services may fail silently - implement health checks
- ⚠️ Mistral API has costs - implement budget tracking
- ⚠️ Audio generation can be CPU-intensive - consider separate service
- ⚠️ Different Python versions may have compatibility issues
File Storage Structure
/public/
├── audio/
│   ├── vocabulary/          # Vocabulary word audio
│   │   └── {id}.wav
│   ├── story/               # Story segment audio
│   │   └── {levelId}-{order}.wav
│   └── quiz/                # Quiz question audio
│       └── {questionId}.wav
└── models/                  # AI models
    ├── vosk/
    │   └── vosk-model-de-0.22/
    └── coqui/
        └── tts_models/
Performance Considerations
- TTS generation: ~1-2 seconds per sentence
- Speech recognition: ~1-3 seconds per audio clip
- Mistral API: ~2-5 seconds per request
- Consider async/background processing for batch operations
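For batch operations such as generating audio for a whole story, a bounded in-process queue is one option. The sketch below uses `System.Threading.Channels`; the class name, the string payload, and the single-consumer design are assumptions for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical background queue: producers enqueue texts, one consumer synthesizes them in order.
public sealed class TtsJobQueueSketch
{
    private readonly Channel<string> _jobs;

    public TtsJobQueueSketch(int capacity)
    {
        // A bounded channel applies back-pressure: writers wait when the queue is full.
        _jobs = Channel.CreateBounded<string>(
            new BoundedChannelOptions(capacity) { FullMode = BoundedChannelFullMode.Wait });
    }

    public ValueTask EnqueueAsync(string text, CancellationToken ct = default) =>
        _jobs.Writer.WriteAsync(text, ct);

    // Signals that no more jobs will arrive, which lets RunAsync finish.
    public void Complete() => _jobs.Writer.Complete();

    // Processes jobs one at a time until the queue is completed and drained.
    public async Task RunAsync(Func<string, CancellationToken, Task> synthesize, CancellationToken ct = default)
    {
        await foreach (var text in _jobs.Reader.ReadAllAsync(ct))
        {
            await synthesize(text, ct);
        }
    }
}
```

In the application, `RunAsync` would likely live inside a hosted background service, with the synthesize delegate calling the TTS service. Jobs held only in memory are lost on restart, so a persistent queue may be preferable if that matters.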
📊 Progress History
| Date | Status Change | Notes |
|------|---------------|-------|
| May 31, 2025 | Created | Initial plan based on application-plan.md |
📎 Related Files & Links
Feature created from application-plan.md