

Dataset Overview

For this project, I used the Open Catalyst 2020 (OC20) dataset.

A few important points:

→ data is stored as PyTorch Geometric objects inside LMDB files

→ for each task, there are training splits of several sizes

→ validation/test splits are broken into four subsplits:

    → in-domain (ID)

    → out-of-domain adsorbate (OOD-Ads)

    → out-of-domain catalyst (OOD-Cat)

    → out-of-domain adsorbate and catalyst (OOD-Both)

Train

S2EF - 200k, 2M, 20M, 134M (all)
IS2RE/IS2RS - 10k, 100k, 460k (all)

Val/test

S2EF - ~1M across all subsplits
IS2RE/IS2RS - ~25k across all subsplits

For tutorial purposes, OC20 also offers much smaller splits (100 training and 20 validation examples for every task), so users can easily store the data, train, and predict across the various tasks.

Data visualization:

import matplotlib
matplotlib.use('Agg')

import os
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

# Default plot styling, applied globally via rcParams below
params = {
   'font.size': 14,
   'font.family': 'DejaVu Sans',
   'legend.fontsize': 20,
   'xtick.labelsize': 20,
   'ytick.labelsize': 20,
   'axes.labelsize': 25,
   'axes.titlesize': 25,
   'text.usetex': False,
   'figure.figsize': [12, 12]
}
matplotlib.rcParams.update(params)

import ase.io
from ase.io.trajectory import Trajectory
from ase.io import extxyz
from ase.calculators.emt import EMT
from ase.build import fcc100, add_adsorbate, molecule
from ase.constraints import FixAtoms
from ase.optimize import LBFGS
from ase.visualize.plot import plot_atoms
from ase import Atoms
from IPython.display import Image

matplotlib.use('Agg') selects the "Agg" backend (short for "Anti-Grain Geometry"). This is a non-interactive backend that renders plots to files rather than displaying them in a window. In a notebook, the later %matplotlib inline magic then switches to the inline backend, so figures still appear in the cell output.

The params dictionary sets default options for matplotlib, such as the font size and family, the size of the axis labels and tick labels, and the size of the figure.

The matplotlib.rcParams.update(params) line overrides matplotlib's defaults with the values specified in the params dictionary.

The rest of the code imports functions and classes from the ase and IPython modules, which are used for reading and writing atomic simulation data, building and optimizing atomic structures, and displaying images in the notebook.
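The effect of the backend choice and the rcParams update can be checked with a small standalone script. This is only a sketch: the plotted values, the reduced set of options, and the output file name are made up for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to files only
import matplotlib.pyplot as plt

# A reduced version of the params dictionary from the notes above
matplotlib.rcParams.update({
    "font.size": 14,
    "axes.labelsize": 25,
    "figure.figsize": [6, 4],
})

# Placeholder data, just to have something to draw
fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0.0, -0.4, -0.6, -0.7], marker="o")
ax.set_xlabel("optimization step")
ax.set_ylabel("energy (eV)")

demo_path = os.path.join(tempfile.mkdtemp(), "demo_plot.png")
fig.savefig(demo_path)  # with Agg, saving is the only way to see the figure
plt.close(fig)
print("saved", demo_path)
```

Since rcParams is global state, every figure created after the update picks up these defaults without any per-plot styling.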

Understanding the data

The Atomic Simulation Environment (ASE) library is used to interact with atomic data.
The OC20 dataset was generated using density functional theory (DFT), a quantum chemistry method.
To generate sample data here, the faster but less accurate effective-medium theory (EMT) is used instead.
Structural relaxations iteratively update atom positions to minimize the energy of the structure, using methods like the limited-memory Broyden–Fletcher–Goldfarb–Shanno (LBFGS) algorithm.
Each optimization step in a structural relaxation is one example for the S2EF task, and the full sequence of steps is called a trajectory.