diff --git a/doc/pypolymlp.md b/doc/pypolymlp.md
index 4a1bd73d..00403198 100644
--- a/doc/pypolymlp.md
+++ b/doc/pypolymlp.md
@@ -15,13 +15,15 @@ The training process involves using a dataset consisting of supercell
displacements, forces, and energies. The trained MLPs are then employed to
compute forces for supercells with specific displacements.
-For more details on the methodology, refer to A. Togo and A. Seko, J. Chem. Phys.
-**160**, 211001 (2024) [[doi](https://doi.org/10.1063/5.0211296)].
+For further details on combining phono3py calculations with pypolymlp, refer to
+A. Togo and A. Seko, J. Chem. Phys. **160**, 211001 (2024)
+[[doi](https://doi.org/10.1063/5.0211296)]
+[[arxiv](https://arxiv.org/abs/2401.17531)].
An example of its usage can be found in the `example/NaCl-pypolymlp` directory
in the distribution from GitHub or PyPI.
-## Requirement
+## Requirements
- [pypolymlp](https://github.com/sekocha/pypolymlp)
- [symfc](https://github.com/symfc/symfc)
@@ -228,7 +230,7 @@ displacement distance of 0.001 Angstrom. The forces for these supercells are
then evaluated using pypolymlp. Both the generated displacements and the
corresponding forces are stored in the `phono3py_mlp_eval_dataset` file.
-### Steps 4-6: Force constants calculation (random displacements in step 5)
+### Steps 4-7: Force constants calculation (random displacements in step 5)
After developing MLPs, random displacements are generated by specifying
{ref}`--rd ` option. To compute force constants
@@ -329,6 +331,14 @@ an additional 200 supercells. In total, 400 supercells are created. The forces
for these supercells are then evaluated. Finally, the force constants are
calculated using symfc.
+## Convergence with respect to dataset size
+
+In general, a larger training dataset yields more accurate force constants.
+It is therefore recommended to check that the target property converges with
+respect to the number of supercells in the training dataset. Lattice thermal
+conductivity may be a convenient property to monitor when assessing
+convergence.
+
## Parameters for developing MLPs
A few parameters can be specified using the `--mlp-params` option for the
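The convergence check recommended in the "Convergence with respect to dataset size" section could be scripted along the following lines. This is a minimal sketch and not part of phono3py: the function `first_converged_size`, the tolerance `rel_tol`, and the numbers in the usage lines are hypothetical placeholders. In practice each value would come from a separate run with a different number of training supercells.

```python
from typing import Optional, Sequence, Tuple


def first_converged_size(
    results: Sequence[Tuple[int, float]],
    rel_tol: float = 0.02,
) -> Optional[int]:
    """Return the smallest dataset size from which the property stays converged.

    `results` holds (number_of_supercells, property_value) pairs, for example
    lattice thermal conductivity computed with MLPs trained on datasets of
    increasing size. A size counts as converged when every later step changes
    the value by at most `rel_tol` (relative to the later value).
    """
    ordered = sorted(results)
    converged = None
    for (n_prev, v_prev), (_, v_next) in zip(ordered, ordered[1:]):
        if abs(v_next - v_prev) <= rel_tol * abs(v_next):
            # Keep the earliest size of the current converged run.
            if converged is None:
                converged = n_prev
        else:
            # A later jump invalidates any earlier apparent convergence.
            converged = None
    return converged


# Placeholder numbers for illustration only, not computed results.
placeholder = [(100, 80.0), (200, 95.0), (400, 99.0), (800, 100.0)]
print(first_converged_size(placeholder))  # -> 400
```

Requiring all later steps to stay within the tolerance guards against two neighbouring dataset sizes agreeing by coincidence.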
diff --git a/phono3py/file_IO.py b/phono3py/file_IO.py
index 0b9066b7..672bc3d6 100644
--- a/phono3py/file_IO.py
+++ b/phono3py/file_IO.py
@@ -413,6 +413,26 @@ def read_fc2_from_hdf5(filename="fc2.hdf5", p2s_map=None):
)
+def write_datasets_to_hdf5(
+ dataset: dict,
+ phonon_dataset: dict = None,
+ filename: str = "datasets.hdf5",
+ compression: str = "gzip",
+):
+ """Write dataset and phonon_dataset in datasets.hdf5."""
+
+ def _write_dataset(w, dataset: dict, group_name: str):
+ dataset_w = w.create_group(group_name)
+ for key in dataset:
+ dataset_w.create_dataset(key, data=dataset[key], compression=compression)
+
+ with h5py.File(filename, "w") as w:
+ w.create_dataset("version", data=np.bytes_(__version__))
+ _write_dataset(w, dataset, "dataset")
+ if phonon_dataset:
+ _write_dataset(w, phonon_dataset, "phonon_dataset")
+
+
def write_grid_address_to_hdf5(
grid_address,
mesh,
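To illustrate the file layout that the added `write_datasets_to_hdf5` produces, the sketch below writes a file with the same structure using h5py directly and reads it back. It is an assumption-laden example rather than phono3py code: the reader `read_datasets_from_hdf5` is hypothetical, the version string is a placeholder, and every dataset entry is assumed to be a NumPy array (such as `displacements` and `forces`).

```python
import os
import tempfile

import h5py
import numpy as np


def read_datasets_from_hdf5(filename):
    """Hypothetical reader mirroring the layout written in the diff."""
    with h5py.File(filename, "r") as f:
        version = f["version"][()].decode()
        dataset = {key: f["dataset"][key][:] for key in f["dataset"]}
        phonon_dataset = None
        # The "phonon_dataset" group is written only when it is provided.
        if "phonon_dataset" in f:
            group = f["phonon_dataset"]
            phonon_dataset = {key: group[key][:] for key in group}
    return version, dataset, phonon_dataset


# Random placeholder arrays: 4 supercells, 8 atoms, 3 Cartesian components.
rng = np.random.default_rng(0)
dataset = {
    "displacements": rng.normal(scale=0.03, size=(4, 8, 3)),
    "forces": rng.normal(size=(4, 8, 3)),
}

# Write the same structure as in the diff: a "version" entry plus one
# group per dataset, with one compressed HDF5 dataset per dictionary key.
path = os.path.join(tempfile.mkdtemp(), "datasets.hdf5")
with h5py.File(path, "w") as w:
    w.create_dataset("version", data=np.bytes_("0.0.0"))  # placeholder version
    group = w.create_group("dataset")
    for key, value in dataset.items():
        group.create_dataset(key, data=value, compression="gzip")

version, loaded, phonon_loaded = read_datasets_from_hdf5(path)
print(version, sorted(loaded), phonon_loaded)
```

Because each dictionary key maps to its own HDF5 dataset, individual arrays can be read back without loading the rest of the file.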