Foundation Models

Pre-trained models for specialized omics domains.

Overview

Foundation models are large pre-trained models specialized for specific data types. Agentomics-ML can pre-download these models to speed up agent runs.

Available Types

Type	Domain	Use Case
`dna`	Genomics	DNA sequences, variants
`rna`	Transcriptomics	RNA sequences, expression
`protein`	Proteomics	Protein sequences, structure
`molecule`	Chemistry	Small molecules, drugs

Pre-downloading Models

Download foundation models before running the agent:

./run.sh --foundation-model-type dna

This bakes the relevant models into the Docker image, so the agent does not have to download them during execution.

To pre-download every type at once, pass `--foundation-model-type all`.

Multiple Types

To download several types, run the command once per type:

./run.sh --foundation-model-type dna
./run.sh --foundation-model-type protein

In local mode (`--local`), models are downloaded into the workspace instead of being baked into a Docker image.
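Running the command once per type can also be scripted. The following is a minimal sketch, assuming `run.sh` sits in the current directory and that `--local` can be combined with `--foundation-model-type` (the combination is an assumption, not something stated here). `build_download_commands` and `download_all` are hypothetical helpers, not part of Agentomics-ML.

```python
import subprocess


def build_download_commands(types, local=False):
    """Return one ./run.sh invocation (as an argument list) per requested type."""
    commands = []
    for model_type in types:
        command = ["./run.sh", "--foundation-model-type", model_type]
        if local:
            # Assumption: --local may be combined with --foundation-model-type.
            command.append("--local")
        commands.append(command)
    return commands


def download_all(types, local=False):
    """Run the commands one after another, stopping at the first failure."""
    for command in build_download_commands(types, local=local):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only print the commands here; call download_all(...) to actually run them.
    for command in build_download_commands(["dna", "protein"]):
        print(" ".join(command))
```

Passing argument lists to `subprocess.run` avoids shell quoting issues, and `check=True` stops the sequence as soon as one download fails.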

DNA Models

For genomic sequence data:

- Variant effect prediction
- Regulatory element detection
- Sequence classification

Example datasets:

- Gene expression from DNA features
- SNP effect prediction
- Promoter classification

RNA Models

For transcriptomic data:

- RNA sequence analysis
- Secondary structure prediction
- Expression-based classification

Example datasets:

- RNA-seq classification
- Splice site prediction
- RNA modification detection

Protein Models

For protein sequence data:

- Protein function prediction
- Structure-based classification
- Interaction prediction

Example datasets:

- Protein family classification
- Enzyme activity prediction
- Binding site detection

Molecule Models

For small molecule/chemical data:

- Drug property prediction
- Molecular classification
- Activity prediction

Example datasets:

- Drug-target interaction
- Toxicity prediction
- ADMET properties

How the Agent Uses Foundation Models

1. Discovery: The agent queries the available foundation models.
2. Selection: The agent chooses a model appropriate for the data.
3. Embedding: Features are extracted from the data using the model.
4. Training: The embeddings are used as input to an ML model.
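The four steps above can be sketched end to end. This is an illustrative toy, not Agentomics-ML's implementation: the `REGISTRY`, the `kmer_embed` function (a k-mer frequency vector standing in for a real foundation model's embedding), and the nearest-centroid classifier are all hypothetical stand-ins, chosen so the example runs with the standard library only.

```python
from itertools import product
from math import dist


def kmer_embed(sequence, alphabet="ACGT", k=2):
    """Hypothetical stand-in for a foundation model: normalized k-mer counts."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = {kmer: 0 for kmer in kmers}
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:
            counts[window] += 1
    total = sum(counts.values()) or 1
    return [counts[kmer] / total for kmer in kmers]


# 1. Discovery: a hypothetical registry of available models, keyed by type.
REGISTRY = {"dna": kmer_embed}


def select_model(data_type):
    # 2. Selection: pick the model registered for this data type.
    return REGISTRY[data_type]


def fit_centroids(embeddings, labels):
    # 4. Training: a nearest-centroid classifier on top of the embeddings.
    sums, counts = {}, {}
    for vector, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, vector):
    """Return the label whose centroid is closest to the embedding."""
    return min(centroids, key=lambda label: dist(centroids[label], vector))


sequences = ["ATATATATAT", "TATATATATA", "GCGCGCGCGC", "CGCGCGCGCG"]
labels = ["at_rich", "at_rich", "gc_rich", "gc_rich"]

embed = select_model("dna")
# 3. Embedding: turn each sequence into a fixed-length feature vector.
features = [embed(s) for s in sequences]
centroids = fit_centroids(features, labels)
print(predict(centroids, embed("ATATTATATA")))  # prints "at_rich"
```

The point of the pattern is that the downstream model only ever sees fixed-length vectors, so swapping the toy embedding for a real pre-trained model leaves the training step unchanged.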

Configuration

Foundation model configurations are in:

foundation_models/

Each type has a configuration specifying:

- Model names and sources
- Download locations
- Usage instructions for the agent

Without Pre-downloading

If you don't pre-download, the agent can still use foundation models, but it downloads them during execution, which makes the first run slower.

Storage Requirements

Foundation models can be large:

Type	Approximate Size
DNA	1-5 GB
RNA	1-5 GB
Protein	2-10 GB
Molecule	0.5-2 GB

Ensure sufficient disk space in the Docker volume or local environment.
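Before pre-downloading, it can help to compare free disk space against the upper bounds in the table above. This is a minimal sketch using only the standard library; the sizes are the approximate figures from the table, and `required_gb` and `has_room` are hypothetical helpers rather than part of Agentomics-ML.

```python
import shutil

# Upper bounds (GB) taken from the approximate sizes in the table above.
MAX_SIZE_GB = {"dna": 5, "rna": 5, "protein": 10, "molecule": 2}


def required_gb(types):
    """Worst-case space needed for the given foundation model types."""
    return sum(MAX_SIZE_GB[t] for t in types)


def has_room(path, types):
    """True if the filesystem holding `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb(types)


if __name__ == "__main__":
    wanted = ["dna", "protein"]
    print(f"need up to {required_gb(wanted)} GB; enough room: {has_room('.', wanted)}")
```

In Docker mode, point `path` at the directory where Docker stores its data rather than at the workspace, since that is where the image ends up.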

GPU Acceleration

Foundation models benefit significantly from a GPU:

- With GPU: Fast embedding generation
- CPU only: Much slower, but functional

Use `--cpu-only` if no GPU is available, but expect longer run times for foundation model-based approaches.
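One way to choose between the two modes automatically is to look for the NVIDIA driver tool on the `PATH`. This is a heuristic sketch, not how Agentomics-ML detects GPUs: it assumes an NVIDIA GPU and that `--cpu-only` is passed to `run.sh`, and `runtime_flags` is a hypothetical helper.

```python
import shutil


def runtime_flags(gpu_available=None):
    """Return extra flags: ["--cpu-only"] when no GPU is detected, else []."""
    if gpu_available is None:
        # Heuristic: nvidia-smi on PATH suggests an NVIDIA driver is installed.
        gpu_available = shutil.which("nvidia-smi") is not None
    return [] if gpu_available else ["--cpu-only"]


if __name__ == "__main__":
    # Assumption: --cpu-only is an option of run.sh.
    print(" ".join(["./run.sh", *runtime_flags()]))
```

The explicit `gpu_available` argument lets you override the heuristic on machines where `nvidia-smi` is present but the GPU is not usable from inside the container.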

Custom Foundation Models

To add custom foundation models:

1. Add a configuration to `foundation_models/`
2. Update the download script
3. Add usage instructions for the agent

See also:

- Agent Architecture: How models are used
- GPU Settings: GPU configuration