Preparing Datasets¶

Agentomics-ML works with CSV datasets for classification or regression tasks.

Quick Setup¶

Create a folder in datasets/ with your data:

datasets/my_dataset/
├── train.csv              # Required
├── validation.csv         # Optional
├── test.csv               # Optional
├── dataset_description.md # Optional
└── dataset_config.json    # Optional — avoids interactive prompts during dataset preparation
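
Before running preparation, it can help to confirm that the folder matches the layout above. The following is a minimal sketch using only the standard library; `check_dataset_folder` is a hypothetical helper written for this guide, not part of Agentomics-ML.

```python
from pathlib import Path

# File names taken from the layout above.
REQUIRED_FILES = ["train.csv"]
OPTIONAL_FILES = [
    "validation.csv",
    "test.csv",
    "dataset_description.md",
    "dataset_config.json",
]


def check_dataset_folder(folder):
    """Return (missing_required, present_optional) for a dataset folder."""
    folder = Path(folder)
    missing = [name for name in REQUIRED_FILES if not (folder / name).is_file()]
    present = [name for name in OPTIONAL_FILES if (folder / name).is_file()]
    return missing, present


if __name__ == "__main__":
    missing, present = check_dataset_folder("datasets/my_dataset")
    if missing:
        print("Missing required files:", ", ".join(missing))
    else:
        print("Layout OK. Optional files found:", ", ".join(present) or "none")
```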

File Requirements¶

train.csv (Required)¶

Your training data with features and a target column.

feature1,feature2,feature3,target
1.2,3.4,5.6,positive
7.8,9.0,1.2,negative

validation.csv (Optional)¶

Separate validation data. If not provided, the agent creates a train/validation split from train.csv.
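
If you would rather control the split yourself, you can write your own validation.csv before preparation. The sketch below shows one way to do that with the standard library. The 80/20 ratio, the fixed seed, and the function names are assumptions chosen for illustration; they do not describe the split the agent performs.

```python
import csv
import random


def split_rows(rows, validation_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and return (train_rows, validation_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one validation row whenever there is any data.
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def split_csv(source_path, out_train_path, out_val_path,
              validation_fraction=0.2, seed=0):
    """Split a CSV with a header row into two CSVs sharing that header."""
    with open(source_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    train_rows, val_rows = split_rows(rows, validation_fraction, seed)
    for path, part in ((out_train_path, train_rows), (out_val_path, val_rows)):
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(part)
    return len(train_rows), len(val_rows)
```

Write the outputs to new files first and copy them into the dataset folder once you are happy with the result, so the original train.csv is not overwritten by accident.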

test.csv (Optional)¶

Hidden test set for final evaluation. The agent never sees this data during training; it is used only to report final metrics.

dataset_description.md (Optional)¶

Domain information to help the agent understand your data:

# Gene Expression Dataset

This dataset contains RNA-seq expression levels from tumor samples.

## Features
- Columns 1-100: Gene expression values (log2 TPM)
- Samples are from breast cancer patients

## Target
- `class`: tumor subtype (Basal, Her2, LumA, LumB, Normal)

## Notes
- Data is already normalized
- Consider using models that handle high-dimensional data

Dataset Config File (Optional)¶

Add an optional dataset_config.json to your dataset folder to avoid interactive prompts during dataset preparation.

task_type is the most important field — without it you'll be prompted every time you prepare the dataset.

{
    "task_type": "classification",
    "target_col": "label",
    "positive_class": 1,
    "negative_class": 0
}

Fields:

task_type (optional): "classification" or "regression"; if omitted, you will be prompted during dataset preparation.
target_col (optional): column name to predict; auto-detected if omitted.
positive_class (optional): value that counts as "positive"; used only by some binary classification metrics and auto-detected if omitted.
negative_class (optional): value that counts as "negative"; used only by some binary classification metrics and auto-detected if omitted.

Include only the fields you need — at minimum just task_type. Values from this file take precedence over auto-detection, but CLI flags (--task-type, --target-col, etc.) override the config file.
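
If you generate dataset folders from a script, the config file can be written programmatically. This is a minimal sketch: the field names and allowed `task_type` values come from the list above, while `build_config`, `write_config`, and the validation they perform are assumptions of this example. The preparation script may check the file differently.

```python
import json
from pathlib import Path

# Field names and task types as documented above.
ALLOWED_FIELDS = {"task_type", "target_col", "positive_class", "negative_class"}
ALLOWED_TASK_TYPES = {"classification", "regression"}


def build_config(**fields):
    """Return a config dict containing only the fields that were set."""
    unknown = set(fields) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"Unknown fields: {sorted(unknown)}")
    task_type = fields.get("task_type")
    if task_type is not None and task_type not in ALLOWED_TASK_TYPES:
        raise ValueError(f"Unsupported task_type: {task_type!r}")
    # Drop unset fields so the file includes only what is needed.
    return {key: value for key, value in fields.items() if value is not None}


def write_config(dataset_dir, **fields):
    """Write dataset_config.json into dataset_dir and return its path."""
    path = Path(dataset_dir) / "dataset_config.json"
    path.write_text(json.dumps(build_config(**fields), indent=4) + "\n")
    return path
```

For example, `write_config("datasets/my_dataset", task_type="classification")` produces a file containing only `task_type`.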

Target Column Detection¶

The target column is resolved in this order:

1. CLI flag (--target-col)
2. dataset_config.json (target_col field)
3. Auto-detection from common names: class, target, label, y
4. Interactive prompt (if running interactively)

If all of the above fail, preparation will raise an error.
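
The resolution order can be sketched as a small function. This is an illustration of the precedence only, not the project's actual code: `resolve_target_col` is a hypothetical name, and the exact-match, case-sensitive lookup, the priority among the common names, and the `ValueError` are all assumptions.

```python
# Common names listed above, checked in the order shown (an assumption).
COMMON_TARGET_NAMES = ["class", "target", "label", "y"]


def resolve_target_col(columns, cli_value=None, config=None, prompt=None):
    """Pick the target column following the documented precedence."""
    # 1. CLI flag wins over everything else.
    if cli_value:
        return cli_value
    # 2. Then the target_col field from dataset_config.json.
    if config and config.get("target_col"):
        return config["target_col"]
    # 3. Then auto-detection from common names.
    for name in COMMON_TARGET_NAMES:
        if name in columns:
            return name
    # 4. Finally an interactive prompt, if one is available.
    if prompt is not None:
        answer = prompt(columns)
        if answer:
            return answer
    raise ValueError("Could not detect target column")
```

Passing `prompt=None` models a non-interactive run, where a dataset without a recognizable target column ends in an error.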

Manual Dataset Preparation¶

For more control, run preparation separately:

# Create preparation environment
conda env create -f envs/environment_prepare.yaml
conda activate agentomics-prepare-env

# Prepare datasets
python src/prepare_datasets.py --prepare-all

Preparation Options¶

python src/prepare_datasets.py --help

Key options:

Option	Description
`--dataset-dir`	Specific dataset to prepare
`--task-type`	Specify `classification` or `regression`
`--target-col`	Specify target column name
`--positive-class`	Define positive class for binary classification
`--negative-class`	Define negative class for binary classification
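
When preparing several datasets from a script, the command line can be assembled from the options above. In this sketch the flag names come from the table, but `build_prepare_command` is a hypothetical helper, and passing a folder path to `--dataset-dir` is an assumption; check `--help` for the value the script expects.

```python
import shlex


def build_prepare_command(dataset_dir=None, task_type=None, target_col=None,
                          positive_class=None, negative_class=None):
    """Return the argument list for one preparation run."""
    cmd = ["python", "src/prepare_datasets.py"]
    options = [
        ("--dataset-dir", dataset_dir),
        ("--task-type", task_type),
        ("--target-col", target_col),
        ("--positive-class", positive_class),
        ("--negative-class", negative_class),
    ]
    for flag, value in options:
        # Only pass the flags that were actually set.
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    cmd = build_prepare_command(
        dataset_dir="datasets/my_dataset",  # assumed value format
        task_type="classification",
        target_col="label",
    )
    # Print the command for review; pass it to subprocess.run(cmd) to execute.
    print(shlex.join(cmd))
```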

Note: already-prepared datasets are skipped on re-runs, which preserves the --positive-class/--negative-class values chosen earlier. To re-prepare a dataset, delete its folder under prepared_datasets/ (and under prepared_test_sets/ if a test set was provided), then rerun the preparation script.
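
Removing the prepared folders can also be scripted. The sketch below assumes it runs from the repository root and that prepared data lives in the two folders named in the note; `clear_prepared` is a hypothetical helper, and it deletes data permanently, so use it with care.

```python
import shutil
from pathlib import Path


def clear_prepared(dataset_name, root="."):
    """Delete the prepared folders for one dataset; return the parents cleared."""
    # Refuse empty names and anything that is not a plain folder name,
    # so a typo cannot remove a parent directory.
    if dataset_name in ("", ".", "..") or Path(dataset_name).name != dataset_name:
        raise ValueError(f"Invalid dataset name: {dataset_name!r}")
    root = Path(root)
    removed = []
    for parent in ("prepared_datasets", "prepared_test_sets"):
        target = root / parent / dataset_name
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(parent)
    return removed
```

After calling `clear_prepared("my_dataset")`, rerun the preparation script to rebuild the dataset.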

Prepared Dataset Structure¶

After preparation, datasets are stored in:

prepared_datasets/my_dataset/
├── train.csv              # Training data
├── validation.csv         # Validation data (created if not provided)
├── dataset_description.md # Copied/created description
└── metadata.json          # Task type, classes, etc.

prepared_test_sets/my_dataset/
├── test.csv               # Test data (if provided)
└── test.no_label.csv      # Test data without labels
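
To make the relationship between test.csv and test.no_label.csv concrete, the sketch below drops the target column from CSV text. It only illustrates the idea of an unlabeled copy; how the preparation script actually builds the file is not documented here, and `strip_label` is a hypothetical name.

```python
import csv
import io


def strip_label(csv_text, target_col):
    """Return csv_text with the target column removed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    if target_col not in fieldnames:
        raise ValueError(f"Column {target_col!r} not found")
    kept = [name for name in fieldnames if name != target_col]
    out = io.StringIO()
    # extrasaction="ignore" lets us pass full rows and write only kept columns.
    writer = csv.DictWriter(out, fieldnames=kept, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()
```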

Example Datasets¶

Download example datasets:

./scripts/download_example_dataset.sh --all

Data Format Tips¶

Classification¶

Target column should contain class labels (strings or integers)
Binary: positive/negative, 1/0, yes/no
Multi-class: class_a, class_b, class_c
Multi-label classification is not supported (use a single label per row)
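
A quick look at the label distribution shows whether a dataset is binary or multi-class before you prepare it. This is a minimal sketch with hypothetical helper names; it only counts distinct values in the target column and does not try to detect multi-label rows.

```python
import csv
import io
from collections import Counter


def label_counts(csv_text, target_col):
    """Count how often each class label appears in the target column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[target_col] for row in reader)


def describe_task(counts):
    """Classify the label set by its number of distinct classes."""
    n_classes = len(counts)
    if n_classes < 2:
        return "degenerate"  # fewer than two classes, nothing to learn
    return "binary" if n_classes == 2 else "multi-class"
```

The counts also reveal class imbalance, which is worth mentioning in dataset_description.md.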

Regression¶

Target column should contain numeric values
Select regression during preparation or pass --task-type regression
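
For regression, it is worth confirming that every target value parses as a number before preparation. The check below is a sketch with a hypothetical function name. It relies on Python's `float`, so strings such as "nan" and "inf" count as numeric.

```python
def non_numeric_targets(values):
    """Return (index, value) pairs for target values that float() rejects."""
    bad = []
    for index, value in enumerate(values):
        try:
            float(value)
        except (TypeError, ValueError):
            bad.append((index, value))
    return bad
```

An empty result means every value parsed; anything else points at the rows to fix.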

Feature Columns¶

Numeric features work best
Categorical features are supported (encoded automatically)
Missing values are handled, but models generally perform better on clean data
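
To see how clean a file is, you can count missing values per column. In this sketch the set of strings treated as missing is an assumption of the example, not the rule Agentomics-ML applies, and `missing_per_column` is a hypothetical helper.

```python
import csv
import io

# Strings treated as missing in this sketch (an assumption).
MISSING_MARKERS = {"", "NA", "NaN", "nan", "null"}


def missing_per_column(csv_text):
    """Return a dict mapping each column name to its count of missing cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in (reader.fieldnames or [])}
    for row in reader:
        for name in counts:
            value = row.get(name)
            # Short rows yield None for the columns they lack.
            if value is None or value.strip() in MISSING_MARKERS:
                counts[name] += 1
    return counts
```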

Common Issues¶

"Could not detect target column"¶

Solution: Add --target-col your_column_name to the preparation command, or rename your target column to class, target, label, or y.

"Task type required"¶

Solution (preferred): Add a dataset_config.json to your dataset folder with {"task_type": "classification"} or {"task_type": "regression"}.

Alternative: Pass --task-type classification or --task-type regression to the preparation command, or run preparation interactively and select when prompted.

Next Steps¶

Running the Agent - Use your prepared dataset
Understanding Outputs - See what the agent produces