BadgerDB For BirdNET-Go: A New Data Model For Pi
Hey guys! Let's dive into this RFC proposing a new way to handle data in BirdNET-Go, especially for those of you rocking Raspberry Pis with micro SD cards. This is all about making things smoother, safer, and more efficient. Buckle up!
Overview
We're tackling the same issues as in issue #874 – you know, data redundancy, performance bottlenecks, and making things easier to scale. But instead of just tweaking the SQLite setup, we're looking at a whole new engine: BadgerDB. This is especially crucial because so many of you are running BirdNET-Go on Raspberry Pis, and those little SD cards have their quirks.
Motivation
As mentioned, this proposal builds on the analysis in #874. The goal? To solve the core issues with a different architectural approach specifically optimized for embedded systems with the constraints of flash storage.
Key Design Principles
Our roadmap is built on these principles:
- Flash Storage Optimization: Our main focus is to minimize write amplification on those poor micro SD cards. Flash storage has limited write cycles, so we want to make every write count.
- Data Safety: We need to reduce the risk of data corruption, especially in environments where storage might not be super reliable.
- Multi-OS Simplicity: We're aiming for zero external dependencies. This makes deployment way easier across different operating systems.
- Operational Simplicity: Backing up and restoring data should be a breeze, even for those who aren't tech wizards.
- Performance: Of course, we want things to be snappy, but it's not our only priority on these resource-constrained devices.
Proposed Architecture: BadgerDB Key-Value Store
So, what's BadgerDB? It's a pure Go, embeddable key-value database. Think of it like a super-efficient filing system. It's built on an LSM (log-structured merge) tree architecture, which gives us some serious advantages for our use case.
Key-Value Schema Design
Instead of the usual relational tables, we'll organize data using hierarchical keys with JSON values. This might sound a bit techy, but it's actually pretty intuitive once you see it in action:
// Species information
"species:{scientific_name}" → {
"common_names": {"en": "American Robin", "es": "Petirrojo", ...},
"taxonomy": {"family": "Turdidae", "order": "Passeriformes"},
"birdnet_labels": ["amrobi", "amrobi1"],
"created_at": "2024-01-01T00:00:00Z"
}
// Primary detections (chronological access pattern)
"detection:{timestamp}:{node_id}:{sequence}" → {
"species": "Turdus migratorius",
"confidence": 0.87,
"audio_file": "20240315_143022_Mic.wav",
"weather": {"temp": 15.5, "humidity": 68},
"source_node": "rpi4_garden",
"processing_time": 1.2
}
// Secondary indexes for common queries
"idx:species:{scientific_name}:{timestamp}" → "detection:{timestamp}:{node_id}:{sequence}"
"idx:node:{node_id}:{timestamp}" → "detection:{timestamp}:{node_id}:{sequence}"
"idx:date:{YYYY-MM-DD}:{timestamp}" → "detection:{timestamp}:{node_id}:{sequence}"
"idx:confidence:{level}:{timestamp}" → "detection:{timestamp}:{node_id}:{sequence}"
// Detection nodes and hardware
"node:{node_id}" → {
"name": "Garden Microphone",
"location": {"lat": 45.5231, "lon": -122.6765, "alt": 50},
"hardware": {"model": "RaspberryPi 4", "mic": "USB Audio"},
"config": {"sensitivity": 1.0, "overlap": 0.0}
}
// Flexible tagging system
"tag:{tag_name}" → {
"category": "behavior",
"description": "Territorial calling",
"color": "#FF5733"
}
"detection_tag:{detection_key}:{tag_name}" → {
"added_by": "user123",
"added_at": "2024-03-15T14:30:22Z"
}
// Reviews and manual verification
"review:{detection_key}:{timestamp}" → {
"reviewer": "expert_birder",
"status": "confirmed|rejected|uncertain",
"confidence_override": 0.95,
"notes": "Clear song pattern, good recording quality"
}
// System configuration and metadata
"config:schema_version" → "2.0.0"
"config:stats" → {"total_detections": 15420, "species_count": 127}
"config:retention" → {"audio_days": 30, "detection_days": 365}
Think of each line as a little fact. We have facts about species, detections, and even where our microphones are located. The keys help us find the information quickly, and the JSON values hold the juicy details.
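To make this concrete, here's a minimal sketch of how a detection write could look against the Badger v1 Go API. The Detection struct, the saveDetection helper, and the exact field set are illustrative assumptions, not actual BirdNET-Go code; the point is that the primary record and its index entries land in one atomic transaction:
import (
	"encoding/json"
	"fmt"
	"time"

	"github.com/dgraph-io/badger"
)

// Detection is a hypothetical struct mirroring the JSON value above.
type Detection struct {
	Species    string  `json:"species"`
	Confidence float64 `json:"confidence"`
	AudioFile  string  `json:"audio_file"`
	SourceNode string  `json:"source_node"`
}

// saveDetection writes the primary record and its secondary index
// entries in a single atomic transaction.
func saveDetection(db *badger.DB, d Detection, ts time.Time, seq int) error {
	tsPart := ts.UTC().Format(time.RFC3339)
	primary := fmt.Sprintf("detection:%s:%s:%d", tsPart, d.SourceNode, seq)
	value, err := json.Marshal(d)
	if err != nil {
		return err
	}
	return db.Update(func(txn *badger.Txn) error {
		if err := txn.Set([]byte(primary), value); err != nil {
			return err
		}
		// Secondary index entries just point back at the primary key.
		idxSpecies := fmt.Sprintf("idx:species:%s:%s", d.Species, tsPart)
		idxDate := fmt.Sprintf("idx:date:%s:%s", ts.UTC().Format("2006-01-02"), tsPart)
		if err := txn.Set([]byte(idxSpecies), []byte(primary)); err != nil {
			return err
		}
		return txn.Set([]byte(idxDate), []byte(primary))
	})
}
If any write inside the closure fails, the whole transaction rolls back, so an index entry can never point at a detection that doesn't exist.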
Advanced Features
BadgerDB also lets us do some cool stuff like:
Time-Based Cleanup with TTL (Time-To-Live):
// Automatically expire old audio file references
"audio_file:{filename}" → value (TTL: 30 days)
// Temporary processing data
"temp:analysis:{session_id}" → value (TTL: 1 hour)
This is like setting an expiration date on certain data. For example, we can automatically delete audio file references after 30 days, which helps keep our storage tidy.
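In Badger v1 this maps directly onto entry-level TTLs. Here's a minimal sketch (the key and JSON payload are just placeholders):
// Expire this audio file reference automatically after 30 days.
if err := db.Update(func(txn *badger.Txn) error {
	e := badger.NewEntry(
		[]byte("audio_file:20240315_143022_Mic.wav"),
		[]byte(`{"path": "clips/20240315_143022_Mic.wav"}`),
	).WithTTL(30 * 24 * time.Hour)
	return txn.SetEntry(e)
}); err != nil {
	return err // or log, depending on the call site
}
Expired keys stop being returned to readers right away and are physically reclaimed later during compaction.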
Versioning for Data Evolution:
// Track changes to species classifications over time
"species:Turdus_migratorius:v1" → original_data
"species:Turdus_migratorius:v2" → updated_data
"species:Turdus_migratorius" → current_data (points to v2)
This lets us track changes over time. Imagine if the scientific classification of a bird changes – we can keep a history of those changes without messing up our existing data.
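This pointer scheme is an application-level pattern built on plain keys, so here's one hedged way to implement it, reusing the imports from the earlier sketch (the helper name and arguments are made up for illustration):
// updateSpecies writes a new immutable version and repoints the
// "current" key at it, all in one transaction.
func updateSpecies(db *badger.DB, name string, version int, data []byte) error {
	versioned := fmt.Sprintf("species:%s:v%d", name, version)
	return db.Update(func(txn *badger.Txn) error {
		if err := txn.Set([]byte(versioned), data); err != nil {
			return err
		}
		// The bare species key stores the name of the live version.
		return txn.Set([]byte("species:"+name), []byte(versioned))
	})
}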
Detailed Analysis
Let's break down the pros and cons for our Raspberry Pi and micro SD card crew.
✅ Advantages for Raspberry Pi + Micro SD Deployments
Data Safety & Storage Health
- LSM Tree Architecture: This is huge! It means sequential writes, which are way kinder to SD cards than SQLite's random write patterns. Think of it like writing in a notebook instead of constantly erasing and rewriting.
- Atomic Transactions: BadgerDB has built-in ACID guarantees, so our data stays consistent even if the power goes out mid-write, and the database recovers cleanly after a crash.
- No WAL Corruption: SQLite's Write-Ahead Log (WAL) can sometimes get corrupted on SD cards. BadgerDB eliminates this risk.
- Value Log Design: This separates keys from values, which further reduces write amplification. Less wear and tear on your SD card!
Multi-OS Support & Deployment
- Pure Go: No external dependencies mean no headaches! It avoids CGO complications.
- Single Binary: BadgerDB compiles right into our application, making deployment super simple.
- Cross-Platform: Works the same on ARM, x86, Windows, Linux, macOS – you name it!
- No Installation: No need to install or configure a separate database server. How easy is that?
Operational Simplicity
- Single Directory Backup: Everything's in one folder, so backups are a piece of cake. Just use `rsync` or `tar`.
- Built-in CLI Tools: BadgerDB comes with its own backup and restore utilities (see the sketch after this list).
- Consistent Snapshots: You can even back up while the system is running!
- No SQL Dependencies: This reduces the support burden, especially for users who aren't SQL experts.
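The same backup mechanism the CLI uses is exposed through Badger's Go API as db.Backup, so we could wire it straight into BirdNET-Go. A minimal sketch of a full online backup (the helper name and path are illustrative; os must be imported):
// backupTo streams a consistent snapshot of the live database to a file.
// since=0 requests a full backup; feed the returned version back in next
// time to make the backup incremental.
func backupTo(db *badger.DB, path string) (uint64, error) {
	f, err := os.Create(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	return db.Backup(f, 0)
}
Restoring is the mirror image, via the badger restore CLI or the DB.Load API.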
Data Flexibility
- Schema Evolution: JSON values can change without needing complicated migrations.
- Versioning Support: Built-in key versioning lets us track data lineage.
- TTL Support: We've already talked about this – automatic cleanup is a lifesaver!
- Hierarchical Organization: Key prefixes give us a natural way to organize our data.
Performance Characteristics
- Read Optimized: LSM trees are read-heavy champions, perfect for those detection queries.
- Configurable Memory: We can tweak BadgerDB to play nice with the Pi's memory constraints.
- Concurrent Access: Multiple readers can access the database at the same time.
- Compression: Automatic value compression saves storage space.
❌ Disadvantages & Trade-offs
Okay, it's not all sunshine and roses. There are some downsides to consider.
Query Complexity
- No SQL: We'll have to write query logic in our application code, which can be more complex than writing SQL queries.
- Manual Indexing: Secondary indexes are crucial for performance, but we'll need to design and maintain them ourselves (see the prefix-scan sketch after this list).
- Learning Curve: Developers will need to wrap their heads around key-value patterns.
- No Relational Constraints: The application needs to enforce data integrity, which SQL databases handle automatically.
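To give a feel for what "manual indexing" means in practice, here's a hedged sketch of the species-by-time query from the schema above, using Badger's standard prefix iteration (the helper name and raw-JSON return type are illustrative):
// detectionsForSpecies walks the species index in time order and
// resolves each index entry to its primary detection record.
func detectionsForSpecies(db *badger.DB, species string) ([][]byte, error) {
	var results [][]byte
	prefix := []byte("idx:species:" + species + ":")
	err := db.View(func(txn *badger.Txn) error {
		it := txn.NewIterator(badger.DefaultIteratorOptions)
		defer it.Close()
		for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
			// The index value holds the primary detection key.
			primaryKey, err := it.Item().ValueCopy(nil)
			if err != nil {
				return err
			}
			item, err := txn.Get(primaryKey)
			if err != nil {
				return err
			}
			v, err := item.ValueCopy(nil)
			if err != nil {
				return err
			}
			results = append(results, v)
		}
		return nil
	})
	return results, err
}
That's roughly SELECT * FROM detections WHERE species = ? ORDER BY time, written by hand — which is exactly the trade-off this section is about.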
Storage & Performance Considerations
- Data Duplication: Secondary indexes mean storing data in multiple places, increasing storage needs.
- Compaction Overhead: BadgerDB runs background compaction, which can sometimes cause CPU spikes on the Pi.
- Memory Usage: LSM trees like RAM, so we'll need to make sure we have enough.
- Query Patterns: Some complex analytical queries might be less efficient than with SQL.
Ecosystem & Tooling
- Limited Tooling: There are fewer third-party tools for BadgerDB compared to SQLite.
- Debugging: No SQL query interface makes troubleshooting a bit trickier.
- Monitoring: We'll need different metrics and monitoring approaches.
- Migration Complexity: Migrating from our existing SQLite database will require custom tools.
Raspberry Pi Specific Optimizations
Here’s an example of how we can tweak BadgerDB's settings for the Raspberry Pi:
// Optimized BadgerDB configuration for Raspberry Pi (Badger v1 API;
// options here is the github.com/dgraph-io/badger/options package)
opts := badger.DefaultOptions(dataDir).
	WithValueLogFileSize(16 << 20).         // 16 MB value log files (smaller for the Pi)
	WithNumMemtables(2).                    // reduce memory usage
	WithNumLevelZeroTables(2).              // fewer L0 tables
	WithNumLevelZeroTablesStall(4).         // prevent excessive stalling
	WithSyncWrites(false).                  // async writes for performance (small durability trade-off)
	WithCompactL0OnClose(true).             // cleanup on shutdown
	WithValueLogLoadingMode(options.FileIO) // FileIO for consistent performance

db, err := badger.Open(opts)
These settings help us balance performance and resource usage on the Pi.
Migration Strategy
Switching databases is a big deal, so we need a solid plan.
Phase 1: Parallel Implementation
- We'll implement the BadgerDB storage layer alongside our existing SQLite setup.
- We'll create a data synchronization layer for testing.
- We'll validate performance and reliability in real-world environments.
Phase 2: Feature Parity
- We'll make sure the BadgerDB layer can do everything our current SQLite setup can.
- We'll create migration tools to move existing data over (a rough sketch follows this list).
- We'll do extensive testing on different Pi models and SD card types.
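As a rough sketch of what those migration tools might look like. The SQLite table and column names here are assumptions, not the actual BirdNET-Go schema, and the driver choice is open (mattn/go-sqlite3 needs CGO; a pure-Go driver would keep the no-CGO story intact):
import (
	"database/sql"
	"encoding/json"
	"fmt"

	"github.com/dgraph-io/badger"
	_ "github.com/mattn/go-sqlite3" // assumed driver; a pure-Go one also works
)

// migrateDetections streams rows out of SQLite and into Badger,
// using a write batch for throughput.
func migrateDetections(sqlitePath string, db *badger.DB) error {
	sq, err := sql.Open("sqlite3", sqlitePath)
	if err != nil {
		return err
	}
	defer sq.Close()

	// Hypothetical source schema for illustration only.
	rows, err := sq.Query(`SELECT timestamp, node_id, species, confidence FROM detections`)
	if err != nil {
		return err
	}
	defer rows.Close()

	wb := db.NewWriteBatch()
	defer wb.Cancel()

	seq := 0
	for rows.Next() {
		var ts, node, species string
		var conf float64
		if err := rows.Scan(&ts, &node, &species, &conf); err != nil {
			return err
		}
		val, err := json.Marshal(map[string]interface{}{
			"species":     species,
			"confidence":  conf,
			"source_node": node,
		})
		if err != nil {
			return err
		}
		key := fmt.Sprintf("detection:%s:%s:%d", ts, node, seq)
		if err := wb.Set([]byte(key), val); err != nil {
			return err
		}
		seq++
	}
	if err := rows.Err(); err != nil {
		return err
	}
	return wb.Flush()
}
Secondary index entries would be rebuilt during the same pass, exactly as in the saveDetection sketch earlier.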
Phase 3: Gradual Transition
- We'll make BadgerDB the default for new installations.
- We'll provide an optional migration path for existing users.
- We'll keep SQLite support around for compatibility during the transition.
Implementation Roadmap
Here’s a rough timeline of how we'll roll this out:
Core Infrastructure (Week 1-2)
- [ ] BadgerDB integration and connection management
- [ ] Key schema implementation and validation
- [ ] Basic CRUD (Create, Read, Update, Delete) operations for all entity types
- [ ] Transaction management wrapper
Data Access Layer (Week 3-4)
- [ ] Detection storage and retrieval optimized for time-based queries
- [ ] Species and taxonomy management
- [ ] Secondary index management and queries
- [ ] Batch operations for bulk data processing
Advanced Features (Week 5-6)
- [ ] TTL-based cleanup for audio files and temporary data
- [ ] Review and tagging system implementation
- [ ] Statistical aggregations and reporting
- [ ] Backup/restore functionality with CLI tools
Migration & Compatibility (Week 7-8)
- [ ] SQLite to BadgerDB migration tools
- [ ] Performance benchmarking suite
- [ ] Raspberry Pi optimization testing
- [ ] Documentation and deployment guides
Performance Expectations
Based on benchmarks and the Pi's capabilities, here's what we're expecting:
Expected Improvements
- Write Performance: We're hoping for a 2-3x improvement thanks to BadgerDB's sequential writes.
- SD Card Longevity: Reduced write amplification should significantly extend SD card life.
- Crash Recovery: Faster recovery times with a simpler log structure.
- Memory Efficiency: More predictable memory usage.
Potential Concerns
- Read Latency: Complex queries might take a bit longer due to multiple key lookups.
- Storage Overhead: We might see a 10-20% increase in storage due to key overhead and indexing.
- Initial Learning: The team will need to get up to speed with BadgerDB.
Risk Assessment
Let's look at the potential risks involved.
Low Risk
- Data Safety: BadgerDB's ACID properties and crash recovery are well-tested.
- Cross-Platform: Pure Go ensures consistent behavior across platforms.
- Community: BadgerDB has an active community with production users like Dgraph and Jaeger.
Medium Risk
- Performance: We'll need to do extensive testing on actual Pi hardware with various SD cards.
- Operational: The support team will need training on BadgerDB-specific troubleshooting.
- Migration: Migrating data from SQLite could be complex.
High Risk
- Query Complexity: Some analytical features might be harder to implement efficiently.
- Ecosystem: Fewer third-party tools and integrations compared to SQLite.
Recommendation
For BirdNET-Go, especially on Raspberry Pis with micro SD cards, BadgerDB offers some compelling advantages. The improvements in data safety, deployment simplicity, and storage optimization might outweigh the increased application complexity.
I recommend:
- Prototype Development: Let's build a small prototype focusing on core detection storage and retrieval.
- Raspberry Pi Testing: We need to test extensively on different Pi models with various SD card types.
- Performance Benchmarking: Let's directly compare performance with our current SQLite implementation.
- Community Feedback: We need your input on the operational complexity trade-offs!
This proposal makes a lot of sense if we prioritize:
- Reliability on unreliable hardware (micro SD cards).
- Simple deployment (single binary, no dependencies).
- Long-term data safety over immediate query convenience.
Discussion Points
Let's chat about these things:
- Query Complexity: How much extra application complexity is acceptable for query implementation?
- Migration Path: How much automatic migration should we provide versus encouraging fresh installations?
- Hybrid Approach: Should we consider using BadgerDB for detections and SQLite for metadata?
- Performance Testing: What benchmarks are most important to the community?
This proposal builds on the excellent analysis in #874 and offers a complementary approach optimized for embedded deployments. Both solutions address the core scalability issues, but with different trade-offs that may suit different deployment scenarios.