MolBox

A molecular database system built on Parquet for storing molecules with multiple conformers and properties


Overview

MolBox provides a Python API for efficient storage and retrieval of molecular structures with their conformers and properties. The system uses PyArrow/Parquet as the storage backend, enabling memory-efficient operations on large molecular datasets.

Key Features:

  • Store molecules with multiple conformers
  • Add molecule-level, atom-level, and bond-level properties
  • Canonical property indexing (independent of atom ordering)
  • Memory-efficient chunked I/O for large datasets
  • Support for RDKit and OpenEye (optional) molecules
  • Coordinate-only loading without full molecule deserialization

This is a beta release. We welcome bug reports, feature requests, and feedback through GitHub Issues.

Installation

Option 1: Using Conda (Recommended)

The recommended way to install MolBox is using conda, which handles all dependencies including RDKit:

# Clone the repository
git clone https://github.com/molML/molbox.git
cd molbox

# Create and activate the conda environment
conda env create -f environment.yaml
conda activate molbox

Verify installation:

python -c "import molbox; print('MolBox installed successfully')"

Option 2: Using pip

If you prefer pip, make sure you have RDKit installed first:

# Install RDKit (if not already installed)
conda install -c conda-forge rdkit

# Clone and install MolBox
git clone https://github.com/molML/molbox.git
cd molbox
pip install -e .

Verify installation:

python -c "import molbox; print('MolBox installed successfully')"

Optional Dependencies

  • OpenEye Toolkit (optional, for OpenEye molecule support):

    Requires a license from OpenEye Scientific. Academic licenses are available for non-commercial research.

    # Install toolkit
    conda install -c openeye openeye-toolkits
    
    # Set license file path
    export OE_LICENSE=/path/to/oe_license.txt

    For licensing information, see OpenEye Academic Licensing.

Requirements: Python ≥3.12.1, pandas, pyarrow, numpy, rdkit, tqdm, joblib

Quick Start

from molbox import MolBox
from rdkit import Chem

# Save molecules to a .box file
molecules = [Chem.MolFromSmiles(smi) for smi in ['CCO', 'c1ccccc1', 'CC(C)O']]
MolBox.save_molecules(molecules, "molecules.box")

# Load molecules
mols = MolBox.load_molecules("molecules.box")

# Load metadata (without deserializing molecules)
df = MolBox.load_database("molecules.box")
print(df[['MolBox-index', 'MolBox-smiles', 'MolBox-conformers']])

Usage Examples

Saving Molecules with Properties

# Save with molecule-level properties
MolBox.save_molecules(
    molecules,
    "database.box",
    energy=[1.2, 3.4, 5.6],
    score=[0.8, 0.9, 0.7]
)

# Load molecules and metadata separately
mols = MolBox.load_molecules("database.box")
df = MolBox.load_database("database.box")
print(df[['energy', 'score']])

Computing and Adding Properties

from rdkit.Chem import Descriptors

# Define a property function
def calc_mol_weight(mol):
    return Descriptors.MolWt(mol)

# Add computed property to existing database
MolBox.add_property("database.box", "mol_weight", property_function=calc_mol_weight)

Memory-Efficient Iteration

For large databases, iterate over molecules without loading everything into memory:

# Iterate over molecules one at a time
for idx, mol in MolBox.iterate_molecules("database.box"):
    result = expensive_computation(mol)  # placeholder for your own per-molecule function

# Iterate over metadata in batches
for batch_df in MolBox.iterate_database("database.box", batch_size=10000):
    batch_mean_energy = batch_df['energy'].mean()  # mean of this batch only
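
The batch loop above produces one statistic per batch. To get a single value for the whole database without loading it all at once, the per-batch results have to be combined together with their row counts. The following is a minimal sketch, assuming only that each batch supports `batch[column]` and returns a sized iterable of numbers (as a pandas DataFrame does). `streaming_mean` and the demo batches are hypothetical helpers for illustration, not part of MolBox:

```python
def streaming_mean(batches, column):
    """Mean of `column` over all batches, holding one batch in memory at a time."""
    total = 0.0
    count = 0
    for batch in batches:
        values = batch[column]
        total += float(sum(values))
        count += len(values)
    if count == 0:
        raise ValueError("no rows to average")
    return total / count


# Stand-in batches of unequal size (plain dicts instead of DataFrames)
demo_batches = [
    {"energy": [1.0, 2.0, 3.0]},
    {"energy": [4.0]},
]

overall = streaming_mean(demo_batches, "energy")

# Averaging the per-batch means weights the small batch too heavily
naive = sum(sum(b["energy"]) / len(b["energy"]) for b in demo_batches) / len(demo_batches)
```

With batches of unequal size, `overall` and `naive` differ, which is why the sketch accumulates sums and counts rather than averaging batch means.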

Loading Coordinates Only

Load coordinate data without deserializing full molecules (useful for RMSD, alignment, etc.):

import numpy as np

# Load coordinates only (returns list of arrays with shape [n_conformers, n_atoms, 3])
coords_list = MolBox.load_coordinates("database.box")

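# Calculate RMSD between the first two conformers of the first molecule
# (no alignment is performed; coordinates are compared as stored)
coords1 = coords_list[0][0]  # Shape: (n_atoms, 3)
coords2 = coords_list[0][1]  # Shape: (n_atoms, 3)
# Sum squared differences over x, y, z per atom, then average over atoms
rmsd = np.sqrt(np.mean(np.sum((coords1 - coords2)**2, axis=1)))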
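
The RMSD in the example above compares the coordinates as stored, without superimposing the two conformers first. If an aligned RMSD is needed, one common approach is the Kabsch algorithm. The following is a minimal NumPy sketch that works on any two `(n_atoms, 3)` arrays with matching atom order. `kabsch_rmsd` is a hypothetical helper for illustration, not a MolBox function:

```python
import numpy as np


def kabsch_rmsd(p, q):
    """RMSD between two (n_atoms, 3) arrays after optimal rigid superposition."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape or p.ndim != 2 or p.shape[1] != 3:
        raise ValueError("expected two arrays of identical shape (n_atoms, 3)")

    # Remove translation by centering both structures on their centroids
    p_centered = p - p.mean(axis=0)
    q_centered = q - q.mean(axis=0)

    # SVD of the 3x3 covariance matrix gives the optimal rotation
    covariance = p_centered.T @ q_centered
    u, _, vt = np.linalg.svd(covariance)

    # Flip the last axis if needed so the result is a proper rotation
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T

    # Rotate p onto q and measure the per-atom deviation
    diff = p_centered @ rotation.T - q_centered
    return float(np.sqrt(np.mean(np.sum(diff**2, axis=1))))


# Demo: a rotated and translated copy should have an aligned RMSD of about zero
rng = np.random.default_rng(0)
p_demo = rng.normal(size=(10, 3))
theta = 0.7
rot_z = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
q_demo = p_demo @ rot_z.T + np.array([1.0, -2.0, 0.5])
aligned_rmsd = kabsch_rmsd(p_demo, q_demo)
```

The sketch does not weight atoms by mass and does not handle symmetry-equivalent atom permutations.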

Input Format Support

MolBox automatically detects input formats:

# From SDF file
MolBox.save_molecules("molecules.sdf", "database.box")

# From DataFrame with SMILES
import pandas as pd
df = pd.DataFrame({
    'smiles': ['CCO', 'c1ccccc1'],
    'energy': [1.2, 3.4]
})
MolBox.save_molecules(
    df,
    "database.box",
    smiles_column='smiles',
    auto_properties=True  # Automatically includes 'energy' column
)

API Reference

Core Methods

  • MolBox.save_molecules(molecules, filepath, **properties) - Save molecules with optional properties
  • MolBox.load_molecules(filepath, indices=None) - Load molecules from database
  • MolBox.load_database(filepath, columns=None) - Load metadata without deserializing molecules
  • MolBox.iterate_molecules(filepath, batch_size=1000) - Memory-efficient molecule iteration
  • MolBox.iterate_database(filepath, batch_size) - Memory-efficient metadata iteration in batches
  • MolBox.load_coordinates(filepath, indices=None) - Load coordinate arrays only

Property Management

  • MolBox.add_property(filepath, property_name, property_function=None) - Add molecule-level property
  • MolBox.add_bond_property(filepath, property_name, property_function=None) - Add bond-level property
  • MolBox.add_atom_property(filepath, property_name, property_function=None) - Add atom-level property
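
The Overview describes atom- and bond-level properties as canonically indexed, that is, independent of the atom ordering of the input molecule. MolBox's internal scheme is not documented in this README, so the following is only a generic sketch of the idea. It assumes a canonical rank per atom is available as a permutation of `0..n-1` (with RDKit, `Chem.CanonicalRankAtoms(mol, breakTies=True)` is one way to obtain such ranks). The helper names, ranks, and charge values are all made up for illustration:

```python
import numpy as np


def to_canonical_order(values, ranks):
    """Reorder per-atom values so that position k holds the atom with canonical rank k."""
    values = np.asarray(values)
    ranks = np.asarray(ranks)
    if sorted(ranks.tolist()) != list(range(len(values))):
        raise ValueError("ranks must be a permutation of 0..n_atoms-1")
    canonical = np.empty_like(values)
    canonical[ranks] = values
    return canonical


def from_canonical_order(canonical, ranks):
    """Map canonically ordered values back onto a molecule's own atom order."""
    return np.asarray(canonical)[np.asarray(ranks)]


# The same three-atom molecule written in two different atom orders
charges_a = np.array([-0.3, 0.1, -0.6])  # order: C, C, O
ranks_a = np.array([0, 1, 2])
charges_b = np.array([-0.6, 0.1, -0.3])  # order: O, C, C
ranks_b = np.array([2, 1, 0])

canonical_a = to_canonical_order(charges_a, ranks_a)
canonical_b = to_canonical_order(charges_b, ranks_b)
restored_b = from_canonical_order(canonical_a, ranks_b)
```

Both orderings produce the same canonical array, so a value stored once can be mapped back onto whichever atom order a loaded molecule uses.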

Technical Details

Storage Format:

  • Dual serialization (native binary + SDF format)
  • Coordinate packing with configurable precision
  • Canonical properties for atom/bond data
  • Chunked I/O using PyArrow
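
To illustrate what "coordinate packing with configurable precision" can mean in general, here is a minimal sketch of fixed-point quantization in NumPy. It is not MolBox's actual encoding, which this README does not specify. The function names and the `decimals` parameter are assumptions made for the example:

```python
import numpy as np


def pack_coordinates(coords, decimals=3):
    """Quantize float coordinates to int32, keeping `decimals` decimal places."""
    scale = 10 ** decimals
    return np.round(np.asarray(coords, dtype=np.float64) * scale).astype(np.int32)


def unpack_coordinates(packed, decimals=3):
    """Convert packed int32 values back to float64 coordinates."""
    return packed.astype(np.float64) / 10 ** decimals


coords = np.array([
    [0.12345, -1.98765, 2.5],
    [10.0004, 0.0, -3.14159],
])
packed = pack_coordinates(coords, decimals=3)
restored = unpack_coordinates(packed, decimals=3)

# Rounding bounds the error by half of the last kept decimal place
max_error = np.abs(restored - coords).max()
```

In this scheme a higher `decimals` value lowers the rounding error but narrows the coordinate range that fits in an int32, which is the usual trade-off behind a configurable precision setting.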

Performance:

  • Parallel property computation
  • Memory-efficient streaming for large datasets
  • Fast coordinate-only loading
  • Efficient metadata queries via columnar format

Contributing & Feedback

This is a public beta release. We actively welcome:

  • 🐛 Bug reports - Found something broken? Let us know!
  • 💡 Feature requests - Have ideas for improvements?
  • 📝 Feedback - How can we make MolBox better?
  • 🔧 Pull requests - Contributions are welcome!

Please open an issue on our GitHub Issues page.

License

MIT License
