MolBox

A molecular database system built on Parquet for storing molecules with multiple conformers and properties

Overview

MolBox provides a Python API for efficient storage and retrieval of molecular structures with their conformers and properties. The system uses PyArrow/Parquet as the storage backend, enabling memory-efficient operations on large molecular datasets.

Key Features:

Store molecules with multiple conformers
Add molecule-level, atom-level, and bond-level properties
Canonical property indexing (independent of atom ordering)
Memory-efficient chunked I/O for large datasets
Support for RDKit and OpenEye (optional) molecules
Coordinate-only loading without full molecule deserialization

This is a beta release. We welcome bug reports, feature requests, and feedback through GitHub Issues.

Installation

Option 1: Using Conda (Recommended)

The recommended way to install MolBox is using conda, which handles all dependencies including RDKit:

```
# Clone the repository
git clone https://github.com/molML/molbox.git
cd molbox

# Create and activate the conda environment
conda env create -f environment.yaml
conda activate molbox
```

Verify installation:

```
python -c "import molbox; print('MolBox installed successfully')"
```

Option 2: Using pip

If you prefer pip, make sure you have RDKit installed first:

```
# Install RDKit (if not already installed)
conda install -c conda-forge rdkit
# or, in a pip-only environment:
# pip install rdkit

# Clone and install MolBox
git clone https://github.com/molML/molbox.git
cd molbox
pip install -e .
```

Verify installation:

```
python -c "import molbox; print('MolBox installed successfully')"
```

Optional Dependencies

OpenEye Toolkit (optional, for OpenEye molecule support):

Requires a license from OpenEye Scientific. Academic licenses are available for non-commercial research.
```
# Install toolkit
conda install -c openeye openeye-toolkits

# Set license file path
export OE_LICENSE=/path/to/oe_license.txt
```
For licensing information, see OpenEye Academic Licensing.

Requirements: Python ≥3.12.1, pandas, pyarrow, numpy, rdkit, tqdm, joblib

Quick Start

```python
from molbox import MolBox
from rdkit import Chem

# Save molecules to a .box file
molecules = [Chem.MolFromSmiles(smi) for smi in ['CCO', 'c1ccccc1', 'CC(C)O']]
MolBox.save_molecules(molecules, "molecules.box")

# Load molecules
mols = MolBox.load_molecules("molecules.box")

# Load metadata (without deserializing molecules)
df = MolBox.load_database("molecules.box")
print(df[['MolBox-index', 'MolBox-smiles', 'MolBox-conformers']])
```

Usage Examples

Saving Molecules with Properties

```python
# Save with molecule-level properties
MolBox.save_molecules(
    molecules,
    "database.box",
    energy=[1.2, 3.4, 5.6],
    score=[0.8, 0.9, 0.7]
)

# Load molecules and metadata separately
mols = MolBox.load_molecules("database.box")
df = MolBox.load_database("database.box")
print(df[['energy', 'score']])
```
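The example above passes one property value per molecule. Whether MolBox itself validates list lengths is not stated here, so a small pre-check can catch mismatches early. The sketch below uses a hypothetical helper, `check_property_lengths`, which is not part of the MolBox API.

```python
def check_property_lengths(n_molecules, **properties):
    """Raise ValueError if any per-molecule property list has the wrong length."""
    bad = {
        name: len(values)
        for name, values in properties.items()
        if len(values) != n_molecules
    }
    if bad:
        details = ", ".join(f"{name} has {n}" for name, n in sorted(bad.items()))
        raise ValueError(f"expected {n_molecules} values per property; {details}")
    return properties


# Three molecules, three values per property: passes and returns the dict unchanged
props = check_property_lengths(3, energy=[1.2, 3.4, 5.6], score=[0.8, 0.9, 0.7])
print(sorted(props))
```

The returned dictionary can then be forwarded as keyword arguments, for example `MolBox.save_molecules(molecules, "database.box", **props)`.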

Computing and Adding Properties

```python
from rdkit.Chem import Descriptors

# Define a property function
def calc_mol_weight(mol):
    return Descriptors.MolWt(mol)

# Add computed property to existing database
MolBox.add_property("database.box", "mol_weight", property_function=calc_mol_weight)
```
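Property functions run once per molecule, so a single bad entry (for instance a `None` where RDKit failed to parse a structure) can raise in the middle of a long run. A minimal sketch of a defensive wrapper follows; `safe_property` is a hypothetical helper, and how `MolBox.add_property` treats a returned default value is an assumption to verify against the library.

```python
import functools


def safe_property(default=None):
    """Decorator factory: return `default` instead of raising when the wrapped function fails."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(mol):
            try:
                return func(mol)
            except Exception:
                return default
        return wrapper
    return decorator


@safe_property(default=None)
def heavy_atom_count(mol):
    # GetNumHeavyAtoms() is an RDKit Mol method; calling it on None raises AttributeError
    return mol.GetNumHeavyAtoms()


# A failed parse (None) yields the default instead of an exception
print(heavy_atom_count(None))
```

The wrapped function keeps its original name and can be passed as `property_function` like any other callable.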

Memory-Efficient Iteration

For large databases, iterate over molecules without loading everything into memory:

```python
# Iterate over molecules one at a time
for idx, mol in MolBox.iterate_molecules("database.box"):
    result = expensive_computation(mol)  # placeholder for your own per-molecule function

# Iterate over metadata in batches
for batch_df in MolBox.iterate_database("database.box", batch_size=10000):
    batch_mean_energy = batch_df['energy'].mean()  # mean energy of this batch only
```
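A per-batch mean is overwritten on every iteration, so a statistic over the whole database needs a running aggregate. The sketch below accumulates a sum and a count across batches. `streaming_mean` is a hypothetical helper, and the demo batches are plain lists standing in for the `energy` column of each batch.

```python
def streaming_mean(batches):
    """Mean over all values in an iterable of batches, holding one batch at a time."""
    total = 0.0
    count = 0
    for batch in batches:
        for value in batch:
            total += value
            count += 1
    if count == 0:
        raise ValueError("no values to average")
    return total / count


# Stand-in batches of energies
demo_batches = [[1.0, 2.0], [3.0], [4.0, 6.0]]
print(streaming_mean(demo_batches))  # 16.0 / 5 = 3.2
```

With MolBox, the same function could presumably be fed a generator such as `(batch_df['energy'] for batch_df in MolBox.iterate_database("database.box", batch_size=10000))`. Missing values are not handled in this sketch.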

Loading Coordinates Only

Load coordinate data without deserializing full molecules (useful for RMSD, alignment, etc.):

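```python
import numpy as np

# Load coordinates only (returns list of arrays with shape [n_conformers, n_atoms, 3])
coords_list = MolBox.load_coordinates("database.box")

# Calculate RMSD between first two conformers of first molecule
# (assumes that molecule has at least two conformers; no alignment is applied)
coords1 = coords_list[0][0]  # Shape: (n_atoms, 3)
coords2 = coords_list[0][1]
# Sum squared deviations over x, y, z for each atom, then average over atoms
rmsd = np.sqrt(np.mean(np.sum((coords1 - coords2)**2, axis=1)))
```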
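The RMSD above compares conformers in whatever frame they were stored. If the conformers may differ by a rigid rotation or translation, an aligned RMSD is usually more meaningful. The following is a minimal sketch of the Kabsch superposition using only NumPy; `kabsch_rmsd` is a hypothetical helper, not a MolBox function, and it assumes both arrays list the same atoms in the same order.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (n_atoms, 3) arrays after optimal rigid superposition of P onto Q."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centering both coordinate sets
    P_c = P - P.mean(axis=0)
    Q_c = Q - Q.mean(axis=0)
    # Covariance matrix and its singular value decomposition
    H = P_c.T @ Q_c
    U, _, Vt = np.linalg.svd(H)
    # Flip one axis if needed so the result is a proper rotation, not a reflection
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = P_c @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Q_c) ** 2, axis=1))))


# Demo: Q rotated by 60 degrees about z and shifted should align back onto Q
Q = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.2, 0.0], [0.3, 1.0, 0.8]])
theta = np.pi / 3
Rz = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
P = Q @ Rz.T + np.array([2.0, -1.0, 0.5])
unaligned = float(np.sqrt(np.mean(np.sum((P - Q) ** 2, axis=1))))
print(kabsch_rmsd(P, Q) < 1e-8, unaligned > 0.1)
```

With coordinates loaded as above, the call would be `kabsch_rmsd(coords_list[0][0], coords_list[0][1])`.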

Input Format Support

MolBox automatically detects input formats:

```python
# From SDF file
MolBox.save_molecules("molecules.sdf", "database.box")

# From DataFrame with SMILES
import pandas as pd
df = pd.DataFrame({
    'smiles': ['CCO', 'c1ccccc1'],
    'energy': [1.2, 3.4]
})
MolBox.save_molecules(
    df,
    "database.box",
    smiles_column='smiles',
    auto_properties=True  # Automatically includes 'energy' column
)
```

API Reference

Core Methods

MolBox.save_molecules(molecules, filepath, **properties) - Save molecules with optional properties
MolBox.load_molecules(filepath, indices=None) - Load molecules from database
MolBox.load_database(filepath, columns=None) - Load metadata without deserializing molecules
MolBox.iterate_molecules(filepath, batch_size=1000) - Memory-efficient molecule iteration
MolBox.iterate_database(filepath, batch_size) - Memory-efficient iteration over metadata in batches
MolBox.load_coordinates(filepath, indices=None) - Load coordinate arrays only

Property Management

MolBox.add_property(filepath, property_name, property_function=None) - Add molecule-level property
MolBox.add_bond_property(filepath, property_name, property_function=None) - Add bond-level property
MolBox.add_atom_property(filepath, property_name, property_function=None) - Add atom-level property
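
Atom-level and bond-level properties are described as canonical, meaning they do not depend on the order in which atoms happen to be listed. The sketch below illustrates the idea with plain lists: per-atom values are reordered by a canonical rank so that two atom orderings of the same molecule produce the same stored vector. The ranks are hand-written stand-ins (in practice something like RDKit's `Chem.CanonicalRankAtoms` could supply them), and both helper functions are hypothetical. MolBox's internal scheme may differ.

```python
def to_canonical_order(values, ranks):
    """Reorder per-atom values so position r holds the value of the atom with canonical rank r."""
    canonical = [None] * len(values)
    for atom_idx, rank in enumerate(ranks):
        canonical[rank] = values[atom_idx]
    return canonical


def from_canonical_order(canonical, ranks):
    """Map canonically ordered values back onto a molecule's own atom ordering."""
    return [canonical[rank] for rank in ranks]


# Ethanol heavy atoms listed two ways: CH3, CH2, O versus O, CH2, CH3.
# Hand-assigned ranks (assumption): CH3 -> 0, CH2 -> 1, O -> 2.
charges_cco = [-0.3, 0.1, -0.6]
ranks_cco = [0, 1, 2]
charges_occ = [-0.6, 0.1, -0.3]
ranks_occ = [2, 1, 0]

stored_a = to_canonical_order(charges_cco, ranks_cco)
stored_b = to_canonical_order(charges_occ, ranks_occ)
print(stored_a == stored_b)  # True: same vector regardless of input ordering
print(from_canonical_order(stored_a, ranks_occ))  # [-0.6, 0.1, -0.3]
```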

Technical Details

Storage Format:

Dual serialization (native binary + SDF format)
Coordinate packing with configurable precision
Canonical properties for atom/bond data
Chunked I/O using PyArrow
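
To give a sense of what coordinate packing with configurable precision can look like, the sketch below quantizes float64 coordinates to int32 fixed-point values with a chosen number of decimal places. This is an illustrative scheme only; the function names and the `decimals` parameter are assumptions, and MolBox's actual packing format is not documented here.

```python
import numpy as np


def pack_coordinates(coords, decimals=3):
    """Quantize float coordinates to int32 fixed-point with `decimals` decimal places."""
    scale = 10 ** decimals
    return np.round(np.asarray(coords, dtype=np.float64) * scale).astype(np.int32)


def unpack_coordinates(packed, decimals=3):
    """Recover approximate float64 coordinates from the packed integers."""
    return packed.astype(np.float64) / (10 ** decimals)


# One conformer, two atoms, xyz in angstroms
coords = np.array([[[0.12345, -1.98765, 2.5], [10.00049, 0.0, -3.14159]]])
packed = pack_coordinates(coords, decimals=3)
restored = unpack_coordinates(packed, decimals=3)
max_error = float(np.max(np.abs(restored - coords)))

print(packed.dtype, packed.nbytes, coords.nbytes)
print(max_error)
```

At three decimals the round-trip error is bounded by about 0.0005 per coordinate, and the packed array takes half the bytes of the float64 original. Fewer decimals trade accuracy for better compressibility.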

Performance:

Parallel property computation
Memory-efficient streaming for large datasets
Fast coordinate-only loading
Efficient metadata queries via columnar format

Contributing & Feedback

This is a public beta release. We actively welcome:

🐛 Bug reports - Found something broken? Let us know!
💡 Feature requests - Have ideas for improvements?
📝 Feedback - How can we make MolBox better?
🔧 Pull requests - Contributions are welcome!

Please open an issue on our GitHub Issues page.

License

MIT License
