A molecular database system built on Parquet for storing molecules with multiple conformers and properties
MolBox provides a Python API for efficient storage and retrieval of molecular structures with their conformers and properties. The system uses PyArrow/Parquet as the storage backend, enabling memory-efficient operations on large molecular datasets.
Key Features:
- Store molecules with multiple conformers
- Add molecule-level, atom-level, and bond-level properties
- Canonical property indexing (independent of atom ordering)
- Memory-efficient chunked I/O for large datasets
- Support for RDKit and OpenEye (optional) molecules
- Coordinate-only loading without full molecule deserialization
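To make the idea of canonical property indexing concrete, here is a minimal sketch of how per-atom values can be stored in an order that does not depend on how the atoms happen to be numbered. It is an illustration under assumptions, not MolBox's internal scheme: the helper names `to_canonical` and `from_canonical` are hypothetical, and the rank lists are written by hand (in practice they could come from something like RDKit's `Chem.CanonicalRankAtoms(mol, breakTies=True)`).

```python
def to_canonical(values, ranks):
    """Reorder per-atom values so that position k holds the atom with canonical rank k.

    `ranks[i]` is the canonical rank of atom i in the input ordering
    (hypothetical input; must be a permutation of 0..n-1).
    """
    if sorted(ranks) != list(range(len(values))):
        raise ValueError("ranks must be a permutation of 0..n-1 matching values")
    canonical = [None] * len(values)
    for atom_idx, rank in enumerate(ranks):
        canonical[rank] = values[atom_idx]
    return canonical


def from_canonical(canonical, ranks):
    """Map canonically ordered values back onto a molecule's own atom ordering."""
    return [canonical[rank] for rank in ranks]


# Ethanol written with two different atom orderings (ranks are made up for illustration).
ranks_cco = [0, 1, 2]  # atoms ordered C, C, O
ranks_occ = [2, 1, 0]  # atoms ordered O, C, C

charges_cco = [-0.1, 0.2, -0.4]           # per-atom values in C, C, O order
stored = to_canonical(charges_cco, ranks_cco)
charges_occ = from_canonical(stored, ranks_occ)
print(charges_occ)
```

The stored list is the same whichever ordering the values arrived in, so a property written from one atom ordering can be read back onto another as long as both orderings produce consistent canonical ranks.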
This is a beta release. We welcome bug reports, feature requests, and feedback through GitHub Issues.
The recommended way to install MolBox is using conda, which handles all dependencies including RDKit:
# Clone the repository
git clone https://github.com/molML/molbox.git
cd molbox
# Create and activate the conda environment
conda env create -f environment.yaml
conda activate molbox
Verify installation:
python -c "import molbox; print('MolBox installed successfully')"
If you prefer pip, make sure you have RDKit installed first:
# Install RDKit (if not already installed)
conda install -c conda-forge rdkit
# Clone and install MolBox
git clone https://github.com/molML/molbox.git
cd molbox
pip install -e .
Verify installation:
python -c "import molbox; print('MolBox installed successfully')"
OpenEye Toolkit (optional, for OpenEye molecule support):
Requires a license from OpenEye Scientific. Academic licenses are available for non-commercial research.
# Install toolkit
conda install -c openeye openeye-toolkits
# Set license file path
export OE_LICENSE=/path/to/oe_license.txt
For licensing information, see OpenEye Academic Licensing.
Requirements: Python ≥3.12.1, pandas, pyarrow, numpy, rdkit, tqdm, joblib
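If an import fails after installation, a quick check of the requirements listed above can narrow down the cause. The script below is a small sketch using only the standard library; the helper names are hypothetical, and the module names are assumed to match the package names in the requirements line.

```python
import importlib.util
import sys

REQUIRED_PYTHON = (3, 12, 1)
REQUIRED_MODULES = ["pandas", "pyarrow", "numpy", "rdkit", "tqdm", "joblib"]


def python_is_supported(version_info=sys.version_info, minimum=REQUIRED_PYTHON):
    """Return True if the interpreter version meets the stated minimum."""
    return tuple(version_info[:3]) >= minimum


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python version OK:", python_is_supported())
print("Missing modules:", missing_modules(REQUIRED_MODULES) or "none")
```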
from molbox import MolBox
from rdkit import Chem
# Save molecules to a .box file
molecules = [Chem.MolFromSmiles(smi) for smi in ['CCO', 'c1ccccc1', 'CC(C)O']]
MolBox.save_molecules(molecules, "molecules.box")
# Load molecules
mols = MolBox.load_molecules("molecules.box")
# Load metadata (without deserializing molecules)
df = MolBox.load_database("molecules.box")
print(df[['MolBox-index', 'MolBox-smiles', 'MolBox-conformers']])
# Save with molecule-level properties
MolBox.save_molecules(
    molecules,
    "database.box",
    energy=[1.2, 3.4, 5.6],
    score=[0.8, 0.9, 0.7]
)
# Load molecules and metadata separately
mols = MolBox.load_molecules("database.box")
df = MolBox.load_database("database.box")
print(df[['energy', 'score']])
from rdkit.Chem import Descriptors
# Define a property function
def calc_mol_weight(mol):
    return Descriptors.MolWt(mol)
# Add computed property to existing database
MolBox.add_property("database.box", "mol_weight", property_function=calc_mol_weight)
For large databases, iterate over molecules without loading everything into memory:
# Iterate over molecules one at a time
for idx, mol in MolBox.iterate_molecules("database.box"):
    result = expensive_computation(mol)  # placeholder for your own function
# Iterate over metadata in batches
for batch_df in MolBox.iterate_database("database.box", batch_size=10000):
    batch_mean_energy = batch_df['energy'].mean()  # mean of this batch only
Load coordinate data without deserializing full molecules (useful for RMSD, alignment, etc.):
import numpy as np
# Load coordinates only (returns list of arrays with shape [n_conformers, n_atoms, 3])
coords_list = MolBox.load_coordinates("database.box")
# Calculate the unaligned RMSD between the first two conformers of the first molecule
coords1 = coords_list[0][0]  # Shape: (n_atoms, 3)
coords2 = coords_list[0][1]
# Sum squared deviations over x, y, z per atom, then average over atoms
rmsd = np.sqrt(np.mean(np.sum((coords1 - coords2)**2, axis=1)))
MolBox automatically detects input formats:
# From SDF file
MolBox.save_molecules("molecules.sdf", "database.box")
# From DataFrame with SMILES
import pandas as pd
df = pd.DataFrame({
    'smiles': ['CCO', 'c1ccccc1'],
    'energy': [1.2, 3.4]
})
MolBox.save_molecules(
    df,
    "database.box",
    smiles_column='smiles',
    auto_properties=True  # Automatically includes 'energy' column
)
- MolBox.save_molecules(molecules, filepath, **properties) - Save molecules with optional properties
- MolBox.load_molecules(filepath, indices=None) - Load molecules from database
- MolBox.load_database(filepath, columns=None) - Load metadata without deserializing molecules
- MolBox.iterate_molecules(filepath, batch_size=1000) - Memory-efficient molecule iteration
- MolBox.load_coordinates(filepath, indices=None) - Load coordinate arrays only
- MolBox.add_property(filepath, property_name, property_function=None) - Add molecule-level property
- MolBox.add_bond_property(filepath, property_name, property_function=None) - Add bond-level property
- MolBox.add_atom_property(filepath, property_name, property_function=None) - Add atom-level property
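The arrays returned by MolBox.load_coordinates can feed geometry calculations directly. As one hedged example, the sketch below computes the RMSD between two conformers after optimal superposition with the Kabsch algorithm. It is not part of the MolBox API: `kabsch_rmsd` is a hypothetical helper, and it assumes both inputs have shape (n_atoms, 3) with atoms in the same order.

```python
import numpy as np


def kabsch_rmsd(coords_a, coords_b):
    """RMSD between two conformers after optimal rigid-body superposition."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape or a.ndim != 2 or a.shape[1] != 3:
        raise ValueError("both inputs must have the same shape (n_atoms, 3)")

    # Remove translation by centering both conformers on their centroids.
    a_centered = a - a.mean(axis=0)
    b_centered = b - b.mean(axis=0)

    # Covariance matrix and its singular value decomposition.
    covariance = a_centered.T @ b_centered
    u, _, vt = np.linalg.svd(covariance)

    # Guard against an improper rotation (reflection).
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T

    # Rotate conformer A onto conformer B and measure the per-atom deviation.
    diff = a_centered @ rotation.T - b_centered
    return float(np.sqrt(np.mean(np.sum(diff**2, axis=1))))
```

With a loaded database this could be called as `kabsch_rmsd(coords_list[0][0], coords_list[0][1])`, where `coords_list` is the result of MolBox.load_coordinates.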
Storage Format:
- Dual serialization (native binary + SDF format)
- Coordinate packing with configurable precision
- Canonical properties for atom/bond data
- Chunked I/O using PyArrow
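To illustrate what coordinate packing with configurable precision can look like, here is a minimal sketch that quantizes coordinates to a fixed number of decimals and stores them as 32-bit integers. This is an assumption-laden illustration, not MolBox's actual on-disk encoding: the function names, the `decimals` parameter, and the int32 layout (in native byte order) are all hypothetical.

```python
import numpy as np


def pack_coordinates(coords, decimals=3):
    """Quantize coordinates to `decimals` places and pack them as int32 bytes."""
    arr = np.asarray(coords, dtype=np.float64)
    scaled = np.rint(arr * 10**decimals)
    if arr.size and np.abs(scaled).max() > np.iinfo(np.int32).max:
        raise ValueError("coordinates too large for int32 at this precision")
    return scaled.astype(np.int32).tobytes(), arr.shape


def unpack_coordinates(buffer, shape, decimals=3):
    """Restore float coordinates from a buffer written by pack_coordinates."""
    quantized = np.frombuffer(buffer, dtype=np.int32).reshape(shape)
    return quantized.astype(np.float64) / 10**decimals


# One molecule, one conformer, two atoms: shape (n_conformers, n_atoms, 3).
coords = np.array([[[0.1234, -1.5, 2.0], [3.14159, 0.0, -0.0004]]])
buffer, shape = pack_coordinates(coords, decimals=3)
restored = unpack_coordinates(buffer, shape, decimals=3)
print(len(buffer), "bytes, max error:", np.abs(restored - coords).max())
```

In this scheme the worst-case rounding error is half of the last retained decimal place, so a higher `decimals` value trades integer range for precision.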
Performance:
- Parallel property computation
- Memory-efficient streaming for large datasets
- Fast coordinate-only loading
- Efficient metadata queries via columnar format
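Streaming matters most when a statistic has to cover the whole dataset: averaging per-batch means gives the wrong answer when batches differ in size, so the sum and the count need to be accumulated separately. The helper below is a small sketch of that pattern; `streaming_mean` is a hypothetical name, and it accepts any iterable of value sequences.

```python
def streaming_mean(batches):
    """Mean over all values in an iterable of batches, holding one batch at a time."""
    total = 0.0
    count = 0
    for values in batches:
        total += float(sum(values))
        count += len(values)
    if count == 0:
        raise ValueError("no values to average")
    return total / count


# Batches of unequal size: the overall mean is 2.0, not the mean of batch means (2.25).
print(streaming_mean([[1.0, 2.0], [3.0]]))
```

With MolBox, the batches could be supplied lazily, for example `streaming_mean(batch_df['energy'] for batch_df in MolBox.iterate_database("database.box", batch_size=10000))`, assuming the column contains no missing values.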
This is a public beta release. We actively welcome:
- 🐛 Bug reports - Found something broken? Let us know!
- 💡 Feature requests - Have ideas for improvements?
- 📝 Feedback - How can we make MolBox better?
- 🔧 Pull requests - Contributions are welcome!
Please open an issue on our GitHub Issues page.
MIT License