Sequelizer Documentation

C toolkit for DNA sequence analysis and nanopore data processing

What is Sequelizer?

Sequelizer is a C-based toolkit for nanopore DNA/RNA sequencing analysis. It provides:

Fast5 file analysis - Comprehensive metadata extraction and format validation
Signal conversion - Raw signal extraction for downstream analysis
Cross-platform support - Clean, portable C implementation
Integration-ready - Foundation for advanced tools like Ciren

Key Features

Robust Fast5 support - Handles standard and non-standard file formats with intelligent fallback detection
Minimal dependencies - Requires only HDF5 and CMake, which keeps the build portable
High performance - Efficient metadata-only processing, memory-conscious design
Clean architecture - Modular subcommand design for easy extension
Battle-tested - Validated against real-world nanopore datasets

Core Commands

sequelizer fast5 - Fast5 Analysis

Comprehensive Fast5 file metadata extraction and validation.

# Single file analysis
./sequelizer fast5 data.fast5
# Dataset analysis with full details
./sequelizer fast5 /path/to/dataset/ --recursive --verbose
# Debug problematic files
./sequelizer fast5 problematic.fast5 --debug

→ Complete Fast5 command guide

sequelizer convert - Signal Extraction

Extract raw signals from Fast5 files for downstream analysis.

# Convert single file to raw signals
./sequelizer convert data.fast5 --to raw
# Batch convert with all reads
./sequelizer convert /path/to/dataset/ --to raw --recursive --all --output signals/

→ Complete convert command guide

Documentation

Getting Started

Getting Started - Installation, build instructions, and first steps

User Guides

Commands Reference - Complete usage guide for all commands
Fast5 Compatibility - Format support and troubleshooting

Technical Documentation

Architecture Overview - Core design and extension points
Integration Guide - Using Sequelizer in pipelines

Architecture

Clean Subcommand Design

src/
├── core/
│   ├── fast5_io.c/h          # Shared Fast5 I/O (used by Ciren)
│   ├── fast5_utils.c/h       # File utilities and metadata
│   └── util.c/h              # Common utilities
├── sequelizer.c              # Minimal main() with routing
├── sequelizer_subcommands.c/h # Command routing and help
├── sequelizer_fast5.c/h      # Fast5 analysis implementation
└── sequelizer_convert.c/h    # Signal conversion implementation

Key APIs

// Fast5 file discovery
char **find_fast5_files(const char *path, bool recursive, int *count);
// Metadata extraction
fast5_metadata_t *read_fast5_metadata(const char *filename, int *count);
// Format detection (automatic and robust)
fast5_format_t detect_fast5_format(hid_t file_id);
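
As an illustration of the file discovery step, the check below is a minimal sketch of how a path could be tested for the Fast5 extension. It assumes discovery filters on a case-sensitive `.fast5` suffix, which this document does not state; `sketch_has_fast5_extension` is a hypothetical helper, not part of Sequelizer's API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Sequelizer's documented API).
 * Returns true when `path` ends in ".fast5" and has at least one
 * character before the extension. The match is case-sensitive,
 * which is an assumption of this sketch. */
bool sketch_has_fast5_extension(const char *path) {
  static const char ext[] = ".fast5";
  const size_t ext_len = sizeof ext - 1; /* length without the NUL */

  if (path == NULL) {
    return false;
  }
  const size_t len = strlen(path);
  if (len <= ext_len) {
    return false; /* too short, or nothing but the extension */
  }
  /* Compare the tail of the path against the extension. */
  return strcmp(path + len - ext_len, ext) == 0;
}
```

A directory walker would call this on each entry name and keep only the matches; a path such as `data.fast5.bak` is rejected because the suffix has to be the last thing in the string.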

Extension Points

Adding new subcommands follows a simple pattern:

Create sequelizer_<command>.c/h files
Add enum entry and detection logic
Update CMakeLists.txt
Follow established error handling and help patterns
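
The enum entry and detection logic from the steps above can be sketched as a small lookup table. Every name here (`sketch_command_t`, `sketch_commands`, `sketch_detect_command`, and the `stats` subcommand) is hypothetical and chosen for illustration; the actual definitions live in sequelizer_subcommands.c/h and may be organized differently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical command identifiers. Adding a subcommand means adding
 * one entry here and one row in the table below. */
typedef enum {
  SKETCH_CMD_INVALID = -1,
  SKETCH_CMD_FAST5,
  SKETCH_CMD_CONVERT,
  SKETCH_CMD_STATS /* new entry for an imagined "stats" subcommand */
} sketch_command_t;

typedef struct {
  const char *name;    /* text expected in argv[1] */
  sketch_command_t id; /* identifier used for dispatch */
} sketch_command_entry_t;

static const sketch_command_entry_t sketch_commands[] = {
  {"fast5", SKETCH_CMD_FAST5},
  {"convert", SKETCH_CMD_CONVERT},
  {"stats", SKETCH_CMD_STATS},
};

/* Map a command-line word to its identifier.
 * Returns SKETCH_CMD_INVALID for NULL or unrecognized input. */
sketch_command_t sketch_detect_command(const char *arg) {
  if (arg == NULL) {
    return SKETCH_CMD_INVALID;
  }
  const size_t n = sizeof sketch_commands / sizeof sketch_commands[0];
  for (size_t i = 0; i < n; i++) {
    if (strcmp(arg, sketch_commands[i].name) == 0) {
      return sketch_commands[i].id;
    }
  }
  return SKETCH_CMD_INVALID;
}
```

In this arrangement `main()` would pass `argv[1]` to `sketch_detect_command` and `switch` on the result to call the matching handler, printing the general help text for `SKETCH_CMD_INVALID`. Keeping names and identifiers in one table means a new subcommand touches a single place in the routing code.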

Integration

Pipeline Integration

# Generate file lists for processing
./sequelizer fast5 dataset/ --recursive > file_list.txt
# Extract signals for analysis
./sequelizer convert dataset/ --to raw --recursive --all -o signals/
# Validate files before processing
./sequelizer fast5 dataset/ --recursive 2> validation.log

Ciren Integration

Sequelizer serves as the open-source foundation for Ciren:

# Use Sequelizer for initial analysis
./sequelizer fast5 dataset/ --recursive --verbose
# Use Ciren for advanced features
../ciren/build/ciren fast5 dataset/ --format json --enhanced-stats

Shared components:

src/core/fast5_io.c/h - Identical Fast5 I/O implementation
Build patterns and architectural principles
Compatible APIs and data structures

Tested Datasets

Sequelizer has been validated against real-world nanopore datasets:

# SquiggleFilter project data
./sequelizer fast5 /path/to/SquiggleFilter/data/lambda/fast5/ --recursive
# slow5tools test data
./sequelizer fast5 /path/to/slow5tools/test/data --recursive
# Various Oxford Nanopore formats
# - Standard multi-read Fast5
# - Legacy single-read Fast5
# - Non-standard variants missing file_type attributes
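
One way the three variants listed above could be told apart is sketched below. This is not Sequelizer's `detect_fast5_format` implementation, which takes an HDF5 `hid_t` and is not shown in this document. The sketch assumes the caller has already read the root `file_type` attribute (if any) and the top-level group names, and it relies on the common Oxford Nanopore layout: multi-read files hold top-level `read_...` groups, single-read files hold a top-level `Raw` group. All `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for Sequelizer's fast5_format_t. */
typedef enum {
  SKETCH_FORMAT_UNKNOWN,
  SKETCH_FORMAT_SINGLE_READ,
  SKETCH_FORMAT_MULTI_READ
} sketch_format_t;

/* Returns true when `name` begins with `prefix`. */
static bool sketch_starts_with(const char *name, const char *prefix) {
  return strncmp(name, prefix, strlen(prefix)) == 0;
}

/* Classify a file from facts already read out of HDF5:
 *   file_type - value of the root "file_type" attribute, NULL if absent
 *   groups    - names of the top-level groups (may be NULL if n_groups is 0)
 *   n_groups  - number of entries in `groups` */
sketch_format_t sketch_detect_format(const char *file_type,
                                     const char *const *groups,
                                     size_t n_groups) {
  assert(groups != NULL || n_groups == 0);

  /* 1. Trust the explicit attribute when it is present and recognized. */
  if (file_type != NULL) {
    if (strcmp(file_type, "multi-read") == 0) return SKETCH_FORMAT_MULTI_READ;
    if (strcmp(file_type, "single-read") == 0) return SKETCH_FORMAT_SINGLE_READ;
  }
  /* 2. Fallback: any top-level "read_..." group implies multi-read. */
  for (size_t i = 0; i < n_groups; i++) {
    if (sketch_starts_with(groups[i], "read_")) return SKETCH_FORMAT_MULTI_READ;
  }
  /* 3. Fallback: a top-level "Raw" group implies legacy single-read. */
  for (size_t i = 0; i < n_groups; i++) {
    if (strcmp(groups[i], "Raw") == 0) return SKETCH_FORMAT_SINGLE_READ;
  }
  return SKETCH_FORMAT_UNKNOWN;
}
```

Checking the attribute first and the group layout second means a file missing `file_type` still gets classified, while a file with neither signal is reported as unknown instead of being guessed at.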

Performance

Characteristics

Metadata-only processing - Signal data never loaded into memory
Efficient file access - Minimal HDF5 operations per file
Scalable design - Runtime grows linearly with file count and does not depend on file size
Memory conscious - < 1 KB overhead per file processed

Benchmarks

Small files (< 1 MB): Near-instant processing
Large files (> 100 MB): Same performance as small files, because only metadata is read
Large directories (1000+ files): Efficient batch processing

Support and Development

Getting Help

Documentation: Start with this index and the commands reference
Troubleshooting: See Fast5 compatibility guide
Issues: Report problems with specific file examples and debug output

Contributing

Code style: 2-space indentation, descriptive names
Testing: Validate against real nanopore datasets
Documentation: Update relevant guides for new features

License

Sequelizer is open-source software. See LICENSE for details.

Quick Reference

Most Common Operations

# Analyze a dataset
./sequelizer fast5 /path/to/data/ --recursive --verbose
# Extract all signals from multi-read files
./sequelizer convert /path/to/data/ --to raw --all --recursive -o signals/
# Debug a problematic file
./sequelizer fast5 problem_file.fast5 --debug
# Get help for any command
./sequelizer <command> --help

Next Steps

Try the commands - Start with Fast5 analysis
Understand compatibility - Learn about format support
Integrate with pipelines - Use in your analysis workflow
Explore Ciren - Advanced features and performance enhancements

For the complete feature set and enhanced performance, consider Ciren, which builds on Sequelizer's foundation.