Commit

DSI Documentation

terryturton committed Apr 9, 2024
0 parents commit 5cdaaab
Showing 73 changed files with 7,952 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 7c38a6d79d19c67a234cb3141041178f
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file added .doctrees/backends.doctree
Binary file added .doctrees/core.doctree
Binary file added .doctrees/environment.pickle
Binary file added .doctrees/examples.doctree
Binary file added .doctrees/index.doctree
Binary file added .doctrees/installation.doctree
Binary file added .doctrees/introduction.doctree
Binary file added .doctrees/plugins.doctree
Binary file added .doctrees/tiers.doctree
Empty file added .nojekyll
Binary file added _images/BackendClassHierarchy.png
Binary file added _images/PluginClassHierarchy.png
Binary file added _images/data_lifecycle.png
Binary file added _images/example-pennant-output.png
Binary file added _images/jupyter_frontend.png
Binary file added _images/user_story.png
25 changes: 25 additions & 0 deletions _sources/backends.rst.txt
@@ -0,0 +1,25 @@
Backends
========

Backends connect users to the DSI Core middleware and allow DSI middleware data structures to read from and write to persistent external storage. Backends are modular to support user contribution. Backend contributors are encouraged to offer custom backend abstract classes and backend implementations. A contributed backend abstract class may extend another backend to inherit the properties of the parent. To be compatible with DSI core middleware, backends should create an interface to Python built-in data structures or data structures from the Python ``collections`` library. Backend extensions will be accepted conditional on the extension of ``backends/tests`` to demonstrate the new Backend capability. We cannot accept pull requests that are not tested.

Note that any contributed backends or extensions must include unit tests in ``backends/tests`` to demonstrate the new Backend capability.
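
As a rough illustration, a contributed backend might look like the sketch below. The class and method names (``MyArchive``, ``put_artifacts``, ``get_artifacts``) are hypothetical, and the exact abstract interface is an assumption; see the classes documented below (e.g., ``dsi.backends.filesystem`` and ``dsi.backends.sqlite``) for the actual interface.

.. code-block:: python

   # A sketch of a contributed backend, not a working DSI implementation.
   # Class/method names and the abstract interface are assumptions.
   from collections import OrderedDict

   class MyArchive:  # would extend an existing backend abstract class
       def __init__(self, filename):
           self.filename = filename

       def put_artifacts(self, collection):
           """Write an OrderedDict of metadata columns to external storage."""
           with open(self.filename, 'w') as f:
               for column, values in collection.items():
                   f.write(column + ',' + ','.join(map(str, values)) + '\n')

       def get_artifacts(self):
           """Read stored metadata back into a core-compatible OrderedDict."""
           collection = OrderedDict()
           with open(self.filename) as f:
               for line in f:
                   column, *values = line.rstrip('\n').split(',')
                   collection[column] = values
           return collection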

.. figure:: BackendClassHierarchy.png
:alt: Figure depicting the current backend class hierarchy.
:class: with-shadow
:scale: 100%

Figure depicts the current DSI backend class hierarchy.

.. automodule:: dsi.backends.filesystem
:members:

.. automodule:: dsi.backends.sqlite
:members:

.. automodule:: dsi.backends.gufi
:members:

.. automodule:: dsi.backends.parquet
:members:
87 changes: 87 additions & 0 deletions _sources/core.rst.txt
@@ -0,0 +1,87 @@
Core
====

The DSI Core middleware defines the Terminal concept. An instantiated Terminal is the human/machine DSI interface. The person setting up a Core Terminal only needs to know how they want to ask questions, and what metadata they want to ask questions about. If they don’t see an option to ask questions the way they like, or they don’t see the metadata they want to ask questions about, then they should ask a Driver Contributor or a Plugin Contributor, respectively.

A Core Terminal is a home for Plugins (Readers/Writers), and an interface for Backends. A Core Terminal is instantiated with a set of default Plugins and Backends, but they must be loaded before a user query is attempted. Here's an example of how you might work with DSI using an interactive Python interpreter for your data science workflows::

>>> from dsi.core import Terminal
>>> a=Terminal()
>>> a.list_available_modules('plugin')
>>> # ['Bueno', 'Hostname', 'SystemKernel']
>>> a.load_module('plugin','Bueno','reader',filename='./data/bueno.data')
>>> # Bueno plugin reader loaded successfully.
>>> a.load_module('plugin','Hostname','writer')
>>> # Hostname plugin writer loaded successfully.
>>> a.list_loaded_modules()
>>> # {'writer': [<dsi.plugins.env.Hostname object at 0x7f21232474d0>],
>>> # 'reader': [<dsi.plugins.env.Bueno object at 0x7f2123247410>],
>>> # 'front-end': [],
>>> # 'back-end': []}


At this point, you might decide that you are ready to collect data for inspection. It is possible to utilize DSI Backends to load additional metadata to supplement your Plugin metadata, but you can also sample Plugin data and search it directly.


The process of transforming a set of Plugin writers and readers into a queryable format is called transloading. A DSI Core Terminal has a ``transload()`` method which may be called to execute all Plugins at once::

>>> a.transload()
>>> a.active_metadata
>>> # OrderedDict([('uid', [1000]), ('effective_gid', [1000]), ('moniker', ['qwofford'])...

Once a Core Terminal has been transloaded, no further Plugins may be added. However, the ``transload()`` method can be called to take samples from each Plugin as many times as you like::

>>> a.transload()
>>> a.transload()
>>> a.transload()
>>> a.active_metadata
>>> # OrderedDict([('uid', [1000, 1000, 1000, 1000]), ('effective_gid', [1000, 1000, 1000...

If you perform data science tasks using Python, it is not necessary to create a DSI Core Terminal front-end because the data is already in a Python data structure. If your data science tasks can be completed in one session, it is not required to interact with DSI Backends. However, if you do want to save your work, you can load a DSI Backend with a back-end function::

>>> a.list_available_modules('backend')
>>> # ['Gufi', 'Sqlite', 'Parquet']
>>> a.load_module('backend','Parquet','back-end',filename='parquet.data')
>>> # Parquet backend loaded successfully.
>>> a.list_loaded_modules()
>>> # {'writer': [<dsi.plugins.env.Hostname object at 0x7f21232474d0>],
>>> # 'reader': [<dsi.plugins.env.Bueno object at 0x7f2123247410>],
>>> # 'front-end': [],
>>> # 'back-end': [<dsi.backends.parquet.Parquet object at 0x7f212325a110>]}
>>> a.artifact_handler(interaction_type='put')

The contents of the active DSI Core Terminal metadata storage will be saved to a Parquet object at the path you provided at module loading time.
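
If you want to sanity-check the saved file outside of DSI, any standard Parquet reader should work. The following is a minimal sketch, assuming the back-end writes an ordinary Parquet file (``pandas`` with a Parquet engine is an external assumption, not a DSI requirement)::

    >>> import pandas as pd
    >>> df = pd.read_parquet('parquet.data')  # the path given at module loading time
    >>> df.head()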

It is possible that you prefer to perform data science tasks using a higher-level abstraction than Python itself. This is the purpose of the DSI Driver front-end functionality. Unlike Plugins, Drivers can be added after the initial ``transload()`` operation has been performed::

>>> a.load_module('backend','Parquet','front-end',filename='parquet.data')
>>> # Parquet backend front-end loaded successfully.
>>> a.list_loaded_modules()
>>> # {'writer': [<dsi.plugins.env.Hostname object at 0x7fce3c612b50>],
>>> # 'reader': [<dsi.plugins.env.Bueno object at 0x7fce3c622110>],
>>> # 'front-end': [<dsi.backends.parquet.Parquet object at 0x7fce3c622290>],
>>> # 'back-end': [<dsi.backends.parquet.Parquet object at 0x7fce3c622650>]}

Any front-end may be used, but in this case the Parquet backend has a front-end implementation which builds a Jupyter notebook from scratch, loading your metadata collection into a Pandas DataFrame. The Parquet front-end will then launch the Jupyter notebook to support an interactive data science workflow::

>>> a.artifact_handler(interaction_type='inspect')
>>> # Writing Jupyter notebook...
>>> # Opening Jupyter notebook...

.. image:: jupyter_frontend.png
:scale: 33%

You can then close your Jupyter notebook, ``transload()`` additionally to increase your sample size, and use the interface to explore more data.
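
For example (mirroring the calls shown above)::

    >>> a.transload()                                   # collect another sample from each Plugin
    >>> a.artifact_handler(interaction_type='inspect')  # rebuild and reopen the notebook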

Although this demonstration used only one Plugin of each kind, any number of Plugins can be added to collect an arbitrary amount of queryable metadata::

>>> a.load_module('plugin','SystemKernel','writer')
>>> # SystemKernel plugin writer loaded successfully
>>> a.list_loaded_modules()
>>> # {'writer': [<dsi.plugins.env.Hostname object at 0x7fce3c612b50>, <dsi.plugins.env.SystemKernel object at 0x7fce68519250>],
>>> # 'reader': [<dsi.plugins.env.Bueno object at 0x7fce3c622110>],
>>> # 'front-end': [<dsi.backends.parquet.Parquet object at 0x7fce3c622290>],
>>> # 'back-end': [<dsi.backends.parquet.Parquet object at 0x7fce3c622650>]}

.. automodule:: dsi.core
:members:
107 changes: 107 additions & 0 deletions _sources/examples.rst.txt
@@ -0,0 +1,107 @@

DSI Examples
============

PENNANT mini-app
----------------

`PENNANT`_ is an unstructured mesh physics mini-application developed at Los Alamos National Laboratory
for advanced architecture research.
It contains mesh data structures and a few
physics algorithms from radiation hydrodynamics and serves as an example of
typical memory access patterns for an HPC simulation code.

This DSI PENNANT example is used to show a common use case: create and query a set of metadata derived from an ensemble of simulation runs. The example GitHub directory includes 10 PENNANT runs using the PENNANT *Leblanc* test problem.

In the first step, a Python script is used to parse the Slurm output files and create a CSV (comma-separated values) file with the output metadata.

.. code-block:: unixconfig

   ./parse_slurm_output.py --testname leblanc

.. literalinclude:: ../examples/pennant/parse_slurm_output.py

A second Python script,

.. code-block:: unixconfig

   ./create_and_query_dsi_db.py --testname leblanc

reads in the CSV file and creates a database:

.. code-block:: python

   """
   This script reads in the csv file created from parse_slurm_output.py.
   Then it creates a DSI db from the csv file and performs a query.
   """
   import argparse
   import sys

   from dsi.backends.sqlite import Sqlite, DataType

   isVerbose = True


   def import_pennant_data(test_name):
       """Creates the DSI db from the csv file."""
       csvpath = 'pennant_' + test_name + '.csv'
       dbpath = 'pennant_' + test_name + '.db'
       store = Sqlite(dbpath)
       store.put_artifacts_csv(csvpath, "rundata", isVerbose=isVerbose)
       store.close()
       # No error implies success

Finally, the database is queried:

.. code-block:: python

   def test_artifact_query(test_name):
       """Performs a sample query on the DSI db."""
       dbpath = "pennant_" + test_name + ".db"
       store = Sqlite(dbpath)
       _ = store.get_artifact_list(isVerbose=isVerbose)
       data_type = DataType()
       data_type.name = "rundata"
       query = "SELECT * FROM " + str(data_type.name) + \
           " where hydro_cycle_run_time > 0.006"
       print("Running Query", query)
       result = store.sqlquery(query)
       store.export_csv(result, "pennant_query.csv")
       store.close()


   if __name__ == "__main__":
       # The testname argument is required.
       parser = argparse.ArgumentParser()
       parser.add_argument('--testname', help='the test name')
       args = parser.parse_args()
       test_name = args.testname
       if test_name is None:
           parser.print_help()
           sys.exit(0)
       import_pennant_data(test_name)
       test_artifact_query(test_name)

The query produces the following output:

.. figure:: example-pennant-output.png
:alt: Screenshot of computer program output.
:class: with-shadow


The output of the PENNANT example.



Wildfire Dataset
----------------


.. _PENNANT: https://github.com/lanl/PENNANT
26 changes: 26 additions & 0 deletions _sources/index.rst.txt
@@ -0,0 +1,26 @@
.. DSI documentation master file, created by
   sphinx-quickstart on Fri Apr 14 14:04:07 2023.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

The Data Science Infrastructure Project (DSI)
=============================================

.. toctree::
:maxdepth: 2
:caption: Contents:

introduction
installation
plugins
backends
core
tiers
examples

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
38 changes: 38 additions & 0 deletions _sources/installation.rst.txt
@@ -0,0 +1,38 @@
Quick Start: Installation
=========================

#. If this is the first time using DSI, start by creating a DSI virtual environment with a name of your choice, e.g., **mydsi**:

   .. code-block:: unixconfig

      python -m venv mydsi

#. Then activate the environment (start here if you already have a DSI virtual environment):

   .. code-block:: unixconfig

      source mydsi/bin/activate

#. Go to the project space root and use pip to install dsi:

   .. code-block:: unixconfig

      cd dsi
      pip install .

#. [Optional] If you are running DSI unit tests, you may need other packages:

   .. code-block:: unixconfig

      pip install pytest gitpython coverage-badge pytest-cov .

   Plus ``pip install`` any other packages that your unit tests may need.

#. [Optional] If you are updating the GitHub Pages documentation, see the `DSI Documentation README <https://github.com/lanl/dsi/blob/main/docs/README.rst>`_ for additional Python packages needed.

#. When you've completed work, deactivate the environment with:

   .. code-block:: unixconfig

      deactivate
75 changes: 75 additions & 0 deletions _sources/introduction.rst.txt
@@ -0,0 +1,75 @@



The goal of the Data Science Infrastructure Project (DSI) is to manage data through metadata capture and curation. DSI capabilities can be used to develop workflows to support the management of simulation data, AI/ML approaches, ensemble data, and other sources of data typically found in scientific computing. DSI infrastructure is designed to be flexible and built with these considerations in mind:

- Data management is subject to strict, POSIX-enforced, file security.
- DSI capabilities support a wide range of common metadata queries.
- DSI interfaces with multiple database technologies and archival storage options.
- Query-driven data movement is supported and is transparent to the user.
- The DSI API can be used to develop user-specific workflows.

.. figure:: data_lifecycle.png
:alt: Figure depicting the data life cycle
:class: with-shadow
:scale: 50%

   A depiction of the data life cycle. The Data Science Infrastructure API supports the user in managing the life cycle of their data.

DSI system design has been driven by specific use cases, both AI/ML and more generic usage. These use cases can often be generalized to user stories and needs that can be addressed by specific features, e.g., flexible, human-readable query capabilities. DSI uses object-oriented design principles to encourage modularity and to support contributions by the user community. The DSI API is Python-based.

Implementation Overview
=======================

The DSI API is broken into three main categories:

- Plugins: these are frontend capabilities that will be commonly used by the generic DSI user. These include readers and writers.
- Backends: these are used to interact with storage devices and other ways of moving data.
- DSI Core: the *middleware* that contains the basic functionality to use the DSI API.

Plugin Abstract Classes
-----------------------

Plugins transform an arbitrary data source into a format that is compatible with the DSI core. The parsed and queryable attributes of the data are called *metadata* -- data about the data. Metadata share the same security profile as the source data.

Plugins can operate as data readers or data writers. A simple data reader might parse an application's output file and place it into a core-compatible data structure such as Python built-ins and members of the popular Python ``collections`` module. A simple data writer might execute an application to supplement existing data and queryable metadata, e.g., adding the locations of output data or plots after running an analysis workflow.
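
For instance, a core-compatible metadata collection can be as simple as an ``OrderedDict`` mapping column names to lists of values (the values here are illustrative)::

    >>> from collections import OrderedDict
    >>> metadata = OrderedDict([('uid', [1000]),
    ...                         ('effective_gid', [1000]),
    ...                         ('moniker', ['qwofford'])])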

Plugins are defined by a base abstract class, and support child abstract classes which inherit the properties of their ancestors.

Currently, DSI has the following readers:

- CSV file reader: reads in comma separated value (CSV) files.
- Bueno reader: can be used to capture performance data from `Bueno <https://github.com/lanl/bueno>`_.

.. figure:: PluginClassHierarchy.png
:alt: Figure depicting the current plugin class hierarchy.
:class: with-shadow
:scale: 100%

Figure depicting the current DSI plugin class hierarchy.

Backend Abstract Classes
------------------------

Backends are an interface between the core and a storage medium.
Backends are designed to support user-needed functionality. Given a set of user metadata captured by a DSI frontend, a typical functionality needed by DSI users is to query that metadata with SQL. Because the files associated with the queryable metadata may be spread across filesystems and security domains, a supporting backend is required to assemble query results and present them to the DSI core for transformation and return.

.. figure:: user_story.png
:alt: This figure depicts a user asking a typical query on the user's metadata
:class: with-shadow
:scale: 50%

In this typical **user story**, the user has metadata about their data stored in DSI storage of some type. The user needs to extract all files with the variable **foo** above a specific threshold. DSI backends query the DSI metadata store to locate and return all such files.
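
With the default Sqlite backend, this user story could be served by a handful of calls like the sketch below; the table name ``rundata`` mirrors the PENNANT example elsewhere in these docs, while the column names ``file_path`` and ``foo`` are hypothetical.

.. code-block:: python

   # A sketch only: the table and column names are hypothetical.
   from dsi.backends.sqlite import Sqlite

   store = Sqlite("dsi_metadata.db")
   result = store.sqlquery("SELECT file_path FROM rundata WHERE foo > 0.5")
   store.export_csv(result, "files_above_threshold.csv")
   store.close()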

Current DSI backends include:

- Sqlite: a Python-based SQL database backend; the default DSI API backend.
- GUFI: the `Grand Unified File Index <https://github.com/mar-file-system/GUFI>`_; developed at LANL, GUFI provides fast, secure metadata search across a filesystem, accessible to both privileged and unprivileged users.
- Parquet: a columnar storage format from the `Apache Hadoop <https://hadoop.apache.org>`_ ecosystem.

DSI Core
--------

DSI basic functionality is contained within the middleware known as the *core*. The DSI core is focused on delivering user queries on unified metadata which may be distributed across many files and security domains. DSI currently supports Linux and is tested on RedHat- and Debian-based distributions. The DSI core is a home for DSI Plugins and an interface for DSI Backends.
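
A first interaction with the core might look like this (see the Core section for the full walk-through)::

    >>> from dsi.core import Terminal
    >>> a = Terminal()
    >>> a.list_available_modules('plugin')
    >>> # ['Bueno', 'Hostname', 'SystemKernel']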

Core Documentation
------------------
23 changes: 23 additions & 0 deletions _sources/plugins.rst.txt
@@ -0,0 +1,23 @@
Plugins
=======
Plugins connect data-producing applications to DSI core functionalities. Plugins have *writer* or *reader* functions. A Plugin reader function deals with existing data files or input streams; a Plugin writer deals with generating new data. Plugins are modular to support user contribution.

Plugin contributors are encouraged to offer custom Plugin abstract classes and Plugin implementations. A contributed Plugin abstract class may extend another plugin to inherit the properties of the parent. In order to be compatible with DSI core, Plugins should produce data in Python built-in data structures or data structures sourced from the Python ``collections`` library.

Note that any contributed plugins or extensions must include unit tests in ``plugins/tests`` to demonstrate the new Plugin capability.
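
As a rough illustration, a contributed reader might parse a log file into a ``collections`` structure, as in the sketch below. The class name ``MyLogReader`` and its method are hypothetical; see the abstract classes documented below for the actual interface.

.. code-block:: python

   # A sketch of a contributed Plugin reader, not a working DSI implementation.
   # Names are hypothetical; real base classes live in dsi.plugins.plugin
   # and dsi.plugins.metadata.
   from collections import OrderedDict

   class MyLogReader:  # would extend an existing Plugin abstract class
       def __init__(self, filename):
           self.filename = filename

       def read_rows(self):
           """Parse a comma-separated log into a core-compatible OrderedDict."""
           data = OrderedDict([('run_id', []), ('runtime_s', [])])
           with open(self.filename) as f:
               for line in f:
                   run_id, runtime = line.strip().split(',')
                   data['run_id'].append(run_id)
                   data['runtime_s'].append(float(runtime))
           return data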

.. figure:: PluginClassHierarchy.png
:alt: Figure depicting the current plugin class hierarchy.
:class: with-shadow
:scale: 100%

Figure depicts the current DSI plugin class hierarchy.

.. automodule:: dsi.plugins.plugin
:members:

.. automodule:: dsi.plugins.metadata
:members:

.. automodule:: dsi.plugins.env
:members: