Preparing Datasets¶
Agentomics-ML works with CSV datasets for classification or regression tasks.
Quick Setup¶
Create a folder in datasets/ with your data:
datasets/my_dataset/
├── train.csv # Required
├── validation.csv # Optional
├── test.csv # Optional
└── dataset_description.md # Optional
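Before running preparation, it can help to confirm that a folder matches the layout above. The following is a minimal sketch and not part of Agentomics-ML: the function name `check_dataset_folder` is hypothetical, and only the file names come from the layout shown here.

```python
from pathlib import Path

# File names taken from the folder layout above.
REQUIRED_FILES = ["train.csv"]
OPTIONAL_FILES = ["validation.csv", "test.csv", "dataset_description.md"]


def check_dataset_folder(folder):
    """Return (missing_required, present_optional) for a dataset folder.

    Hypothetical helper for illustration; it only checks that files exist,
    not that their contents are valid CSV.
    """
    folder = Path(folder)
    missing = [name for name in REQUIRED_FILES if not (folder / name).is_file()]
    present = [name for name in OPTIONAL_FILES if (folder / name).is_file()]
    return missing, present
```

Calling `check_dataset_folder("datasets/my_dataset")` returns an empty first list when train.csv is in place.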
File Requirements¶
train.csv (Required)¶
Your training data with features and a target column.
validation.csv (Optional)¶
Separate validation data. If not provided, the agent creates a train/validation split from train.csv.
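The page does not say how the agent performs this split, so the sketch below only illustrates the general idea of a reproducible random split. The 20% validation fraction, the fixed seed, and the function name `split_rows` are all assumptions made for the example; the agent may use a different ratio or a stratified split.

```python
import random


def split_rows(rows, validation_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and return (train_rows, validation_rows).

    Illustrative only: the fraction and seed are assumed values, not
    settings documented for the agent.
    """
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(rows)
    if len(shuffled) < 2:
        raise ValueError("need at least two rows to split")
    random.Random(seed).shuffle(shuffled)
    # Keep at least one row on each side of the split.
    n_val = round(len(shuffled) * validation_fraction)
    n_val = min(max(n_val, 1), len(shuffled) - 1)
    return shuffled[n_val:], shuffled[:n_val]
```

If you need control over which rows end up in validation, providing your own validation.csv avoids the automatic split entirely.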
test.csv (Optional)¶
Hidden test set for final evaluation. The agent never sees this data during training; it is used only to report final metrics.
dataset_description.md (Optional)¶
Domain information to help the agent understand your data:
# Gene Expression Dataset
This dataset contains RNA-seq expression levels from tumor samples.
## Features
- Columns 1-100: Gene expression values (log2 TPM)
- Samples are from breast cancer patients
## Target
- `class`: tumor subtype (Basal, Her2, LumA, LumB, Normal)
## Notes
- Data is already normalized
- Consider using models that handle high-dimensional data
Target Column Detection¶
The agent auto-detects the target column from common names:
class, target, label, y
If auto-detection fails, you'll be prompted to select the target column during preparation.
For non-interactive preparation, pass --target-col to avoid prompts.
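To check ahead of time whether your header would match one of the common names, a lookup like the one below can be used. This is a sketch, not the agent's actual detection code: the case-insensitive comparison and the priority order of the names are assumptions, and `detect_target_column` is a hypothetical name.

```python
import csv

# Common target names listed above; the order used as priority is an assumption.
COMMON_TARGET_NAMES = ("class", "target", "label", "y")


def read_header(csv_path):
    """Return the first row of a CSV file as a list of column names."""
    with open(csv_path, newline="") as handle:
        return next(csv.reader(handle))


def detect_target_column(header):
    """Return the first column matching a common target name, or None.

    Matching ignores case and surrounding whitespace (an assumption made
    for this example).
    """
    for candidate in COMMON_TARGET_NAMES:
        for column in header:
            if column.strip().lower() == candidate:
                return column
    return None
```

When `detect_target_column(read_header("datasets/my_dataset/train.csv"))` returns `None`, passing --target-col explicitly is the safer route.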
Manual Dataset Preparation¶
For more control, run preparation separately:
# Create preparation environment
conda env create -f environment_prepare.yaml
conda activate agentomics-prepare-env
# Prepare datasets
python src/prepare_datasets.py --prepare-all
Preparation Options¶
Key options:
| Option | Description |
|---|---|
| --dataset-dir | Specific dataset to prepare |
| --task-type | Force classification or regression |
| --target-col | Specify target column name |
| --positive-class | Define positive class for binary classification |
| --negative-class | Define negative class for binary classification |
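When preparation is driven from a script, the options in the table can be assembled into an argument list. The sketch below uses only the flags listed above. The helper name `build_prepare_command` is hypothetical, and the value format expected by --dataset-dir (a path versus a folder name) is an assumption, so check the script's help output before relying on it.

```python
import sys


def build_prepare_command(dataset_dir, task_type=None, target_col=None,
                          positive_class=None, negative_class=None):
    """Build an argument list for src/prepare_datasets.py.

    Hypothetical helper: it only assembles flags named in the table above
    and does not run anything.
    """
    if task_type not in (None, "classification", "regression"):
        raise ValueError("task_type must be 'classification' or 'regression'")
    command = [sys.executable, "src/prepare_datasets.py",
               "--dataset-dir", str(dataset_dir)]
    options = {
        "--task-type": task_type,
        "--target-col": target_col,
        "--positive-class": positive_class,
        "--negative-class": negative_class,
    }
    for flag, value in options.items():
        if value is not None:
            command += [flag, str(value)]
    return command

# To execute the command, pass the list to subprocess.run(command, check=True)
# from the repository root with the preparation environment active.
```

Building the command as a list rather than a single string avoids quoting problems when class names or paths contain spaces.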
Prepared Dataset Structure¶
After preparation, datasets are stored in:
prepared_datasets/my_dataset/
├── train.csv # Training data
├── validation.csv # Validation data (created if not provided)
├── train.no_label.csv # Training data without labels (for inference)
├── validation.no_label.csv
├── dataset_description.md # Copied/created description
└── metadata.json # Task type, classes, etc.
prepared_test_sets/my_dataset/
├── test.csv # Test data (if provided)
└── test.no_label.csv
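The `.no_label.csv` files are described above as the same data without labels. How the preparation script produces them is not shown here, so the following is only a sketch of the idea: copy a CSV while dropping the target column. The function name `write_unlabeled_copy` is hypothetical.

```python
import csv


def write_unlabeled_copy(source_path, output_path, target_col):
    """Copy a CSV file to output_path, leaving out the target column.

    Illustrative sketch; not the preparation script's implementation.
    """
    with open(source_path, newline="") as src, \
            open(output_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        if target_col not in fieldnames:
            raise KeyError(f"column {target_col!r} not found in {source_path}")
        kept = [name for name in fieldnames if name != target_col]
        # extrasaction="ignore" silently skips the dropped target column.
        writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)
```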
Example Datasets¶
Download example datasets:
Data Format Tips¶
Classification¶
- Target column should contain class labels (strings or integers)
- Binary: positive/negative, 1/0, yes/no
- Multi-class: class_a, class_b, class_c
- Multi-label classification is not supported (use a single label per row)
Regression¶
- Target column should contain numeric values
- The agent auto-detects regression when the target is continuous
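The exact rule the agent uses to decide that a target is continuous is not documented on this page. The sketch below shows one plausible heuristic so you can anticipate ambiguous cases: non-numeric labels mean classification, fractional numbers mean regression, and integer targets count as classes only when there are few distinct values. The `max_classes` threshold of 20 and the name `guess_task_type` are assumptions for illustration.

```python
def guess_task_type(target_values, max_classes=20):
    """Guess 'classification' or 'regression' from raw target strings.

    One possible heuristic, not the agent's documented behavior.
    """
    values = [v.strip() for v in target_values if v.strip() != ""]
    if not values:
        raise ValueError("target column has no non-empty values")
    try:
        numbers = [float(v) for v in values]
    except ValueError:
        # At least one label is not a number, e.g. "Basal" or "yes".
        return "classification"
    if any(not n.is_integer() for n in numbers):
        return "regression"
    # Integer-valued targets: few distinct values look like class codes.
    if len(set(numbers)) <= max_classes:
        return "classification"
    return "regression"
```

Integer targets with many distinct values, such as counts or ages, are the ambiguous case; passing --task-type removes the guesswork.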
Feature Columns¶
- Numeric features work best
- Categorical features are supported (encoded automatically)
- Missing values are handled, but models trained on clean data perform better
Common Issues¶
"Could not detect target column"¶
Solution: Add --target-col your_column_name to the preparation command, or rename your target column to class, target, label, or y.
"Task type unclear"¶
Solution: Add --task-type classification or --task-type regression to force the task type.
Next Steps¶
- Running the Agent - Use your prepared dataset
- Understanding Outputs - See what the agent produces