From HackerNews

ProteinTensor – a Parquet-like tensor format for protein-structure ML

ProteinTensor is a new Parquet-like tensor format designed for protein-structure machine learning, enabling efficient storage and retrieval of protein data tensors for ML pipelines.

Background

- ProteinTensor is an open-source file format (created by the Moore Neural research group) designed to store 3D protein structure data efficiently for machine learning, similar to how Parquet stores tabular data.
- The "Parquet-like" comparison is key: Parquet is a standard columnar storage format in big-data ML (pandas, Spark, etc.); ProteinTensor aims to do the same for structural biology by packing atomic coordinates, residue types, and other 3D features into compact, fast-to-read tensors.
- Why it matters: Training protein-folding models (AlphaFold, ESMFold) or drug-discovery models requires huge datasets of protein structures (e.g., from the Protein Data Bank). Current formats like PDB/mmCIF are text-based and slow to load at scale. ProteinTensor promises faster I/O and smaller file sizes, making it practical to train larger models on more data.
- The repo is early-stage: it provides a Python library for conversion and a specification for the format.
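
To make the text-versus-tensor contrast concrete, here is a minimal sketch of the general idea, using only the Python standard library. It is not ProteinTensor's actual API or on-disk layout, neither of which is described here: the function names (`parse_atoms`, `pack`, `unpack`) and the byte layout (atom count, then coordinates, then residue indices) are assumptions made purely for illustration. It parses fixed-column PDB `ATOM` records into flat numeric buffers and compares the packed size with the original text.

```python
from array import array

# Three-letter codes for the 20 canonical amino acids.
RESIDUES = ["ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
            "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL"]
RESIDUE_INDEX = {name: i for i, name in enumerate(RESIDUES)}
UNKNOWN = len(RESIDUES)  # index used for any non-standard residue

# A tiny hand-written PDB fragment (illustrative values, not a real entry).
SAMPLE_PDB = "\n".join((
    "HEADER    ILLUSTRATIVE FRAGMENT",
    "ATOM      1  N   MET A   1      27.340  24.430   2.614",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842",
    "ATOM      3  N   GLN A   2      26.913  26.639   3.531",
    "END",
))


def parse_atoms(pdb_text):
    """Extract coordinates and residue-type indices from PDB ATOM records."""
    coords = array("f")    # flat float32 buffer: x0, y0, z0, x1, y1, z1, ...
    restypes = array("B")  # one uint8 residue index per atom
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        # PDB is fixed-column: residue name in columns 18-20,
        # x, y, z in columns 31-38, 39-46, 47-54 (1-based).
        restypes.append(RESIDUE_INDEX.get(line[17:20], UNKNOWN))
        coords.extend(float(line[i:i + 8]) for i in (30, 38, 46))
    return coords, restypes


def pack(coords, restypes):
    """Hypothetical layout: 4-byte atom count, coordinates, residue indices.

    Coordinates are written in the machine's native byte order.
    """
    n_atoms = len(restypes)
    return n_atoms.to_bytes(4, "little") + coords.tobytes() + restypes.tobytes()


def unpack(blob):
    """Inverse of pack(): rebuild both buffers without any text parsing."""
    n_atoms = int.from_bytes(blob[:4], "little")
    split = 4 + n_atoms * 3 * array("f").itemsize
    coords = array("f")
    coords.frombytes(blob[4:split])
    restypes = array("B")
    restypes.frombytes(blob[split:])
    return coords, restypes


coords, restypes = parse_atoms(SAMPLE_PDB)
blob = pack(coords, restypes)
print("text bytes:  ", len(SAMPLE_PDB.encode()))
print("packed bytes:", len(blob))
```

Reading the packed form back is a bulk copy of bytes into typed buffers rather than a per-line string parse, which is the kind of saving a binary tensor format is aiming for. A real format would also need to handle metadata, multiple chains, byte order, and compression, all of which this sketch omits.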