DSI Documentation
qwofford committed Dec 19, 2023
0 parents commit 86fb19e
Showing 84 changed files with 9,581 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 9902b6c654c5a4d9e5c6da7f53d416a7
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file added .doctrees/core.doctree
Binary file not shown.
Binary file added .doctrees/drivers.doctree
Binary file not shown.
Binary file added .doctrees/environment.pickle
Binary file not shown.
Binary file added .doctrees/index.doctree
Binary file not shown.
Binary file added .doctrees/installation.doctree
Binary file not shown.
Binary file added .doctrees/introduction.doctree
Binary file not shown.
Binary file added .doctrees/permissions.doctree
Binary file not shown.
Binary file added .doctrees/plugins.doctree
Binary file not shown.
Empty file added .nojekyll
Empty file.
Binary file added _images/DriverClassHierarchy.png
Binary file added _images/PluginClassHierarchy.png
Binary file added _images/data_lifecycle.png
Binary file added _images/jupyter_frontend.png
Binary file added _images/three_easy_pieces.png
Binary file added _images/user_story.png
86 changes: 86 additions & 0 deletions _sources/core.rst.txt
@@ -0,0 +1,86 @@
Core
===================
The DSI Core middleware defines the Terminal concept. An instantiated Terminal is the human/machine DSI interface. The person setting up a Core Terminal only needs to know how they want to ask questions, and what metadata they want to ask questions about. If they don’t see an option to ask questions the way they like, or they don’t see the metadata they want to ask questions about, then they should ask a Driver Contributor or a Plugin Contributor, respectively.

A Core Terminal is a home for Plugins, and an interface for Drivers. A Core Terminal is instantiated with a set of default Plugins and Drivers, but they must be loaded before a user query is attempted. Here's an example of how you might work with DSI using an interactive Python interpreter for your data science workflows::

    >>> from dsi.core import Terminal
    >>> a=Terminal()
    >>> a.list_available_modules('plugin')
    >>> # ['Bueno', 'Hostname', 'SystemKernel']
    >>> a.load_module('plugin','Bueno','consumer',filename='./data/bueno.data')
    >>> # Bueno plugin consumer loaded successfully.
    >>> a.load_module('plugin','Hostname','producer')
    >>> # Hostname plugin producer loaded successfully.
    >>> a.list_loaded_modules()
    >>> # {'producer': [<dsi.plugins.env.Hostname object at 0x7f21232474d0>],
    >>> # 'consumer': [<dsi.plugins.env.Bueno object at 0x7f2123247410>],
    >>> # 'front-end': [],
    >>> # 'back-end': []}


At this point, you might decide that you are ready to collect data for inspection. It is possible to utilize DSI Drivers to load additional metadata to supplement your Plugin metadata, but you can also sample Plugin data and search it directly.


The process of transforming a set of Plugin producers and consumers into a queryable format is called transloading. A DSI Core Terminal has a ``transload()`` method which may be called to execute all Plugins at once::

    >>> a.transload()
    >>> a.active_metadata
    >>> # OrderedDict([('uid', [1000]), ('effective_gid', [1000]), ('moniker', ['qwofford'])...

Once a Core Terminal has been transloaded, no further Plugins may be added. However, the ``transload()`` method can be called as many times as you like to collect additional samples from each Plugin::

    >>> a.transload()
    >>> a.transload()
    >>> a.transload()
    >>> a.active_metadata
    >>> # OrderedDict([('uid', [1000, 1000, 1000, 1000]), ('effective_gid', [1000, 1000, 1000...
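
The accumulation of samples can be pictured with plain Python. The following is a minimal sketch, not DSI's implementation: ``sample_once`` and ``rows_where`` are hypothetical helpers, and the column names and values are copied from the example output. It assumes that each sample appends one value per column to a column-oriented ``OrderedDict``, and it shows how such a structure can be searched directly without a front-end.

```python
from collections import OrderedDict


def sample_once(store, sample):
    """Append one sample (a mapping of column name to value) to a column store."""
    for column, value in sample.items():
        store.setdefault(column, []).append(value)
    return store


def rows_where(store, column, predicate):
    """Return rows (as dicts) whose value in `column` satisfies `predicate`."""
    columns = list(store)
    rows = zip(*(store[name] for name in columns))
    return [dict(zip(columns, row)) for row in rows
            if predicate(row[columns.index(column)])]


# Hypothetical stand-in for the Terminal's active metadata.
active_metadata = OrderedDict()
for _ in range(4):
    sample_once(active_metadata,
                {"uid": 1000, "effective_gid": 1000, "moniker": "qwofford"})

print(active_metadata["uid"])
matches = rows_where(active_metadata, "moniker", lambda name: name == "qwofford")
print(len(matches))
```

Because every column is a list of equal length, a row is simply the set of values that share an index across columns.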

If you perform data science tasks using Python, it is not necessary to create a DSI Core Terminal front-end because the data is already in a Python data structure. If your data science tasks can be completed in one session, it is not required to interact with DSI Drivers. However, if you do want to save your work, you can load a DSI Driver with a back-end function::

    >>> a.list_available_modules('driver')
    >>> # ['Gufi', 'Sqlite', 'Parquet']
    >>> a.load_module('driver','Parquet','back-end',filename='parquet.data')
    >>> # Parquet driver back-end loaded successfully.
    >>> a.list_loaded_modules()
    >>> # {'producer': [<dsi.plugins.env.Hostname object at 0x7f21232474d0>],
    >>> # 'consumer': [<dsi.plugins.env.Bueno object at 0x7f2123247410>],
    >>> # 'front-end': [],
    >>> # 'back-end': [<dsi.drivers.parquet.Parquet object at 0x7f212325a110>]}
    >>> a.artifact_handler(interaction_type='put')

The contents of the active DSI Core Terminal metadata storage will be saved to a Parquet object at the path you provided at module loading time.

It is possible that you prefer to perform data science tasks using a higher level abstraction than Python itself. This is the purpose of the DSI Driver front-end functionality. Unlike Plugins, Drivers can be added after the initial ``transload()`` operation has been performed::

    >>> a.load_module('driver','Parquet','front-end',filename='parquet.data')
    >>> # Parquet driver front-end loaded successfully.
    >>> a.list_loaded_modules()
    >>> # {'producer': [<dsi.plugins.env.Hostname object at 0x7fce3c612b50>],
    >>> # 'consumer': [<dsi.plugins.env.Bueno object at 0x7fce3c622110>],
    >>> # 'front-end': [<dsi.drivers.parquet.Parquet object at 0x7fce3c622290>],
    >>> # 'back-end': [<dsi.drivers.parquet.Parquet object at 0x7fce3c622650>]}

Any front-end may be used, but in this case the Parquet driver has a front-end implementation which builds a Jupyter notebook from scratch that loads your metadata collection into a pandas DataFrame. The Parquet front-end then launches the Jupyter notebook to support an interactive data science workflow::

    >>> a.artifact_handler(interaction_type='inspect')
    >>> # Writing Jupyter notebook...
    >>> # Opening Jupyter notebook...

.. image:: jupyter_frontend.png
   :scale: 33%

You can then close your Jupyter notebook, call ``transload()`` again to increase your sample size, and use the interface to explore more data.

Although this demonstration used only one Plugin per Plugin functionality, any number of Plugins can be added to collect an arbitrary amount of queryable metadata::

    >>> a.load_module('plugin','SystemKernel','producer')
    >>> # SystemKernel plugin producer loaded successfully
    >>> a.list_loaded_modules()
    >>> # {'producer': [<dsi.plugins.env.Hostname object at 0x7fce3c612b50>, <dsi.plugins.env.SystemKernel object at 0x7fce68519250>],
    >>> # 'consumer': [<dsi.plugins.env.Bueno object at 0x7fce3c622110>],
    >>> # 'front-end': [<dsi.drivers.parquet.Parquet object at 0x7fce3c622290>],
    >>> # 'back-end': [<dsi.drivers.parquet.Parquet object at 0x7fce3c622650>]}

.. automodule:: dsi.core
   :members:
21 changes: 21 additions & 0 deletions _sources/drivers.rst.txt
@@ -0,0 +1,21 @@
Drivers
========================

Drivers have front-end and back-end functions. Drivers connect users to DSI Core middleware (front-end), and Drivers allow DSI middleware data structures to be read from and written to persistent external storage (back-end). Drivers are modular to support user contribution. Driver contributors are encouraged to offer custom Driver abstract classes and Driver implementations. A contributed Driver abstract class may extend another Driver to inherit the properties of the parent. In order to be compatible with DSI Core middleware, Drivers should create an interface to Python built-in data structures or data structures from the Python ``collections`` library. Driver extensions will be accepted on the condition that ``drivers/tests`` is extended to demonstrate the new Driver capability. We cannot accept pull requests that are not tested.
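
To make the back-end role concrete, here is a minimal sketch of an abstract Driver class and one implementation. It is an illustration under stated assumptions rather than the DSI class hierarchy: ``BackendDriver``, ``JsonBackend``, and the method names ``put_artifacts`` and ``get_artifacts`` are hypothetical, and JSON stands in for a real storage format so that the example needs only the standard library.

```python
from abc import ABC, abstractmethod
from collections import OrderedDict
import json
import os
import tempfile


class BackendDriver(ABC):
    """Hypothetical abstract back-end: persists middleware data structures."""

    def __init__(self, filename):
        self.filename = filename

    @abstractmethod
    def put_artifacts(self, collection):
        """Write a column-oriented collection to persistent storage."""

    @abstractmethod
    def get_artifacts(self):
        """Read the collection back into a Python data structure."""


class JsonBackend(BackendDriver):
    """Illustrative implementation that stores columns in a JSON file."""

    def put_artifacts(self, collection):
        with open(self.filename, "w", encoding="utf-8") as handle:
            json.dump(collection, handle)

    def get_artifacts(self):
        with open(self.filename, encoding="utf-8") as handle:
            # Preserve column order by loading into an OrderedDict.
            return json.load(handle, object_pairs_hook=OrderedDict)


with tempfile.TemporaryDirectory() as workdir:
    backend = JsonBackend(os.path.join(workdir, "metadata.json"))
    backend.put_artifacts(OrderedDict(uid=[1000], moniker=["qwofford"]))
    restored = backend.get_artifacts()

print(restored)
```

The round trip through ``put_artifacts`` and ``get_artifacts`` is also the kind of behavior a contributed test would be expected to demonstrate.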


.. image:: DriverClassHierarchy.png

.. automodule:: dsi.drivers.filesystem
   :members:

.. automodule:: dsi.drivers.sqlite
   :members:

.. automodule:: dsi.drivers.gufi
   :members:

.. automodule:: dsi.drivers.parquet
   :members:


26 changes: 26 additions & 0 deletions _sources/index.rst.txt
@@ -0,0 +1,26 @@
.. DSI documentation master file, created by
   sphinx-quickstart on Fri Apr 14 14:04:07 2023.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

=============================================
The Data Science Infrastructure Project (DSI)
=============================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   introduction
   installation
   core
   plugins
   drivers
   permissions

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
19 changes: 19 additions & 0 deletions _sources/installation.rst.txt
@@ -0,0 +1,19 @@
Installation
===================

1. Create or activate a DSI virtual environment.
2. ``cd`` into the project space root
3. ``python -m pip install .``
4. [Optional] If you are running the DSI unit tests: ``python -m pip install pytest gitpython coverage-badge pytest-cov``
5. [Optional] If you are building the HTML documentation: ``python -m pip install sphinx sphinx_rtd_theme``

How to create and activate a virtual environment
--------------------------------------------------
We recommend Miniconda for virtual environment management (`https://docs.conda.io/en/latest/miniconda.html`). To create and activate a Miniconda virtual environment:

1. Download and install the appropriate Miniconda installer for your platform.
2. If this is the first time creating a DSI virtual environment: ``conda create -n 'dsi' python=3.11``. The ``-n`` name argument can be anything you like.
3. Once the virtual environment is created, activate it with ``conda activate dsi``, or whatever name you picked in the preceding step.
4. Proceed with Step 2 in the "Installation" section.
5. When you've completed work, deactivate the conda environment with ``conda deactivate``.
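
After installing, a quick way to confirm that the active environment can see the package is to ask Python's import machinery. This is a small sketch using only the standard library; ``is_importable`` is a hypothetical helper, and the import name ``dsi`` is taken from the Core examples (``from dsi.core import Terminal``).

```python
import importlib.util


def is_importable(package_name):
    """Return True if `package_name` can be found in the active environment."""
    return importlib.util.find_spec(package_name) is not None


# Prints False if the environment that holds DSI is not the active one.
print("dsi importable:", is_importable("dsi"))
```

If this prints ``False``, check that the virtual environment you installed into is the one currently activated.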

39 changes: 39 additions & 0 deletions _sources/introduction.rst.txt
@@ -0,0 +1,39 @@
============
Introduction
============

The goal of the Data Science Infrastructure Project (DSI) is to provide a flexible, AI-ready metadata query capability which returns data subject to strict, POSIX-enforced file security. The data lifecycle for AI/ML requires seamless transitions from data-intensive/AI/ML research activity to long-term archiving and shared data repositories. DSI enables flexible, data-intensive scientific workflows that meet researcher needs.

.. image:: data_lifecycle.png
   :scale: 50%

DSI system design is driven by experiences which satisfy User Stories. DSI uses Object Oriented design principles to encourage modularity and to support contributions by the user community.


Implementation
==============
The DSI system is composed of three fundamental parts:

.. image:: three_easy_pieces.png
   :scale: 33%

DSI Core Middleware
-------------------
DSI's core middleware is focused on delivering user queries over unified metadata which is distributed across many files and security domains. DSI currently supports Linux, and is tested on RedHat- and Debian-based distributions. The DSI Core middleware is a home for DSI Plugins and an interface for DSI Drivers.

Plugin Abstract Classes
-----------------------
Plugins transform an arbitrary data source into a format that is compatible with our middleware. We call the parsed and queryable attributes "metadata" (data about the data). Metadata share the same security profile as the source data.

Plugins can operate as data consumers or data producers. A simple data consumer might parse an application's output file and place it into a middleware-compatible data structure: Python built-ins and members of the popular Python ``collections`` module. A simple data producer might execute an application to supplement existing data and queryable metadata.

Plugins are defined by a base abstract class, and support child abstract classes which inherit the properties of their ancestors.

.. image:: PluginClassHierarchy.png

Driver Abstract Classes
-----------------------
Drivers are an interface between the User and the Core, or an interface between the Core and a storage medium. Drivers can operate as Front-ends or Back-ends, and a Driver contributor can choose to implement one or both. Driver front-ends are built to deliver an experience which is compatible with a User Story. A simple supporting User Story is a need to query metadata by SQL query. Because the set of queryable metadata is spread across filesystems and security domains, a supporting Driver Back-end is required to assemble query results and present them to the DSI core middleware for transformation and return, creating an experience which is compatible with the User Story.

.. image:: user_story.png
   :scale: 50%
55 changes: 55 additions & 0 deletions _sources/permissions.rst.txt
@@ -0,0 +1,55 @@
Permissions
===================
DSI is capable of consuming information from files, environments, and in-situ processes which may or may not have the same permissions authority. To track this information for the purposes of returning user queries into DSI storage, we utilize a permissions handler. The permissions handler bundles the authority by which information is read and adds this to each column data structure. Most relational database systems require that types are enforced by column, and DSI extends this idea to require that permissions are enforced by column. By tracking the permissions associated with each column, DSI can save files using the same POSIX permissions authority that initially granted access to the information, therefore preserving POSIX permissions as files are saved.

By default, DSI will stop users from saving any metadata if the length of the union of the set of column permissions is greater than one, that is, if the columns were read under more than one distinct permissions authority. This prevents users from saving files that might have complex security implications. If a user enables the ``allow_multiple_permissions`` parameter of the ``PermissionsManager``, then the number of files that will be saved is equal to the length of the union of the set of column permissions in the middleware data structures being written. There will be one file for each set of columns read by the same permissions authority.

By default, DSI will always respect the POSIX security information by which information was read. If the user wishes to override this behavior and write all of their metadata to the same file with a unified UID and GID, they can enable the ``squash_permissions`` parameter of the ``PermissionsManager``. The user should be very certain that the information they are writing is protected appropriately in this case.

An example helps illustrate these scenarios:

+----------+----------+----------+
| Col A | Col B | Col C |
+----------+----------+----------+
| *Perm D* | *Perm D* | *Perm F* |
+----------+----------+----------+
| Row A1 | Row B1 | Row C1 |
+----------+----------+----------+
| Row A2 | Row B2 | Row C2 |
+----------+----------+----------+

By default, DSI will refuse to write this data structure to disk because ``len(union({D,D,F})) > 1``.
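
The default check can be sketched in a few lines of Python. This is an illustration of the rule, not the ``PermissionsManager`` implementation: ``distinct_permissions`` and ``check_writable`` are hypothetical functions, and permissions are represented as plain strings taken from the table.

```python
def distinct_permissions(column_permissions):
    """Return the set of distinct permission authorities across all columns."""
    return set(column_permissions.values())


def check_writable(column_permissions, allow_multiple_permissions=False):
    """Raise PermissionError if columns were read under more than one authority."""
    found = distinct_permissions(column_permissions)
    if len(found) > 1 and not allow_multiple_permissions:
        raise PermissionError(
            f"refusing to write: {len(found)} distinct permissions {sorted(found)}"
        )
    return found


# Column-to-permission mapping from the example table.
table_permissions = {"Col A": "Perm D", "Col B": "Perm D", "Col C": "Perm F"}
try:
    check_writable(table_permissions)
except PermissionError as error:
    print(error)
```

Collapsing the per-column permissions into a set is what turns ``{D, D, F}`` into two distinct authorities.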

If a user enables the ``allow_multiple_permissions`` parameter, two files will be saved:

>>> $ cat file1
>>> | Col A | Col B |
>>> ===================
>>> | Perm D | Perm D |
>>> | Row A1 | Row B1 |
>>> | Row A2 | Row B2 |
>>> $ get_perms(file1)
>>> Perm D
>>> $ cat file2
>>> | Col C |
>>> ==========
>>> | Perm F |
>>> | Row C1 |
>>> | Row C2 |
>>> $ get_perms(file2)
>>> Perm F
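
The split into one file per permissions authority amounts to grouping columns by their permission. The sketch below assumes a column-oriented ``OrderedDict`` and string-valued permissions; ``split_by_permission`` is a hypothetical helper and does not write any files.

```python
from collections import OrderedDict


def split_by_permission(columns, column_permissions):
    """Group columns by the permission authority under which they were read."""
    groups = OrderedDict()
    for name, values in columns.items():
        groups.setdefault(column_permissions[name], OrderedDict())[name] = values
    return groups


columns = OrderedDict([
    ("Col A", ["Row A1", "Row A2"]),
    ("Col B", ["Row B1", "Row B2"]),
    ("Col C", ["Row C1", "Row C2"]),
])
column_permissions = {"Col A": "Perm D", "Col B": "Perm D", "Col C": "Perm F"}

# One group per distinct permission; each group would become one file.
files = split_by_permission(columns, column_permissions)
for permission, group in files.items():
    print(permission, list(group))
```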

If a user enables ``allow_multiple_permissions`` and ``squash_permissions``, then a single file will be written with the user's UID and effective GID and 660 access:

>>> $ cat file
>>> | Col A | Col B | Col C |
>>> ============================
>>> | Perm D | Perm D | Perm F |
>>> | Row A1 | Row B1 | Row C1 |
>>> | Row A2 | Row B2 | Row C2 |
>>> $ get_perms(file)
>>> My UID and Effective GID, with 660 access controls.
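
For reference, writing a file with 660 access under the current user can be done with the standard library on a POSIX system. This sketch is not how DSI implements ``squash_permissions``; ``write_squashed`` is a hypothetical helper and the file contents are placeholder text.

```python
import os
import stat
import tempfile


def write_squashed(path, text):
    """Write `text` to `path` as the current user and set 660 access."""
    # A new file is owned by the effective UID of the creating process.
    descriptor = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o660)
    with os.fdopen(descriptor, "w", encoding="utf-8") as handle:
        handle.write(text)
    # The creation mode is filtered by the umask, so set 660 explicitly.
    os.chmod(path, 0o660)
    return stat.S_IMODE(os.stat(path).st_mode)


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "file")
    mode = write_squashed(target, "Col A,Col B,Col C\nRow A1,Row B1,Row C1\n")
    owner_uid = os.stat(target).st_uid

print(oct(mode))
```

The explicit ``os.chmod`` call matters because the mode passed to ``os.open`` is only a request; a restrictive umask would otherwise remove the group bits.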


.. automodule:: dsi.permissions.permissions
   :members:
12 changes: 12 additions & 0 deletions _sources/plugins.rst.txt
@@ -0,0 +1,12 @@
Plugins
===================
Plugins connect data-producing applications to DSI middleware. Plugins have "producer" or "consumer" functions. A Plugin consumer function deals with existing data files or input streams. A Plugin producer deals with generating new data. Plugins are modular to support user contribution. Plugin contributors are encouraged to offer custom Plugin abstract classes and Plugin implementations. A contributed Plugin abstract class may extend another Plugin to inherit the properties of the parent. In order to be compatible with DSI middleware, Plugins should produce data in Python built-in data structures or data structures sourced from the Python ``collections`` library. Plugin extensions will be accepted on the condition that ``plugins/tests`` is extended to demonstrate the new Plugin capability. We cannot accept pull requests that are not tested.
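
The producer and consumer roles can be illustrated with a small class hierarchy. This is a sketch under stated assumptions, not the DSI Plugin classes: ``Plugin``, ``HostnameProducer``, ``KeyValueConsumer``, the ``output_collector`` attribute, and the ``add_row`` method are hypothetical names, and the ``key=value`` input format is invented for the example.

```python
from abc import ABC, abstractmethod
from collections import OrderedDict
import platform


class Plugin(ABC):
    """Hypothetical base class: every Plugin fills an OrderedDict of columns."""

    def __init__(self):
        self.output_collector = OrderedDict()

    @abstractmethod
    def add_row(self):
        """Append one value to each column this Plugin provides."""


class HostnameProducer(Plugin):
    """Producer: generates new data by asking the running system."""

    def add_row(self):
        self.output_collector.setdefault("hostname", []).append(platform.node())


class KeyValueConsumer(Plugin):
    """Consumer: parses existing ``key=value`` text into columns."""

    def __init__(self, text):
        super().__init__()
        self.text = text

    def add_row(self):
        for line in self.text.splitlines():
            key, _, value = line.partition("=")
            if key.strip():
                self.output_collector.setdefault(key.strip(), []).append(value.strip())


producer = HostnameProducer()
producer.add_row()
consumer = KeyValueConsumer("solver=cg\niterations=42")
consumer.add_row()
print(consumer.output_collector)
```

Both classes expose their results through the same column-oriented structure, which is what lets middleware treat producers and consumers uniformly.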

.. image:: PluginClassHierarchy.png

.. automodule:: dsi.plugins.metadata
   :members:

.. automodule:: dsi.plugins.env
   :members:

123 changes: 123 additions & 0 deletions _static/_sphinx_javascript_frameworks_compat.js
@@ -0,0 +1,123 @@
/* Compatability shim for jQuery and underscores.js.
*
* Copyright Sphinx contributors
* Released under the two clause BSD licence
*/

/**
* small helper function to urldecode strings
*
* See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/decodeURIComponent#Decoding_query_parameters_from_a_URL
*/
jQuery.urldecode = function(x) {
if (!x) {
return x
}
return decodeURIComponent(x.replace(/\+/g, ' '));
};

/**
* small helper function to urlencode strings
*/
jQuery.urlencode = encodeURIComponent;

/**
* This function returns the parsed url parameters of the
* current request. Multiple values per key are supported,
* it will always return arrays of strings for the value parts.
*/
jQuery.getQueryParameters = function(s) {
if (typeof s === 'undefined')
s = document.location.search;
var parts = s.substr(s.indexOf('?') + 1).split('&');
var result = {};
for (var i = 0; i < parts.length; i++) {
var tmp = parts[i].split('=', 2);
var key = jQuery.urldecode(tmp[0]);
var value = jQuery.urldecode(tmp[1]);
if (key in result)
result[key].push(value);
else
result[key] = [value];
}
return result;
};

/**
* highlight a given string on a jquery object by wrapping it in
* span elements with the given class name.
*/
jQuery.fn.highlightText = function(text, className) {
function highlight(node, addItems) {
if (node.nodeType === 3) {
var val = node.nodeValue;
var pos = val.toLowerCase().indexOf(text);
if (pos >= 0 &&
!jQuery(node.parentNode).hasClass(className) &&
!jQuery(node.parentNode).hasClass("nohighlight")) {
var span;
var isInSVG = jQuery(node).closest("body, svg, foreignObject").is("svg");
if (isInSVG) {
span = document.createElementNS("http://www.w3.org/2000/svg", "tspan");
} else {
span = document.createElement("span");
span.className = className;
}
span.appendChild(document.createTextNode(val.substr(pos, text.length)));
node.parentNode.insertBefore(span, node.parentNode.insertBefore(
document.createTextNode(val.substr(pos + text.length)),
node.nextSibling));
node.nodeValue = val.substr(0, pos);
if (isInSVG) {
var rect = document.createElementNS("http://www.w3.org/2000/svg", "rect");
var bbox = node.parentElement.getBBox();
rect.x.baseVal.value = bbox.x;
rect.y.baseVal.value = bbox.y;
rect.width.baseVal.value = bbox.width;
rect.height.baseVal.value = bbox.height;
rect.setAttribute('class', className);
addItems.push({
"parent": node.parentNode,
"target": rect});
}
}
}
else if (!jQuery(node).is("button, select, textarea")) {
jQuery.each(node.childNodes, function() {
highlight(this, addItems);
});
}
}
var addItems = [];
var result = this.each(function() {
highlight(this, addItems);
});
for (var i = 0; i < addItems.length; ++i) {
jQuery(addItems[i].parent).before(addItems[i].target);
}
return result;
};

/*
* backward compatibility for jQuery.browser
* This will be supported until firefox bug is fixed.
*/
if (!jQuery.browser) {
jQuery.uaMatch = function(ua) {
ua = ua.toLowerCase();

var match = /(chrome)[ \/]([\w.]+)/.exec(ua) ||
/(webkit)[ \/]([\w.]+)/.exec(ua) ||
/(opera)(?:.*version|)[ \/]([\w.]+)/.exec(ua) ||
/(msie) ([\w.]+)/.exec(ua) ||
ua.indexOf("compatible") < 0 && /(mozilla)(?:.*? rv:([\w.]+)|)/.exec(ua) ||
[];

return {
browser: match[ 1 ] || "",
version: match[ 2 ] || "0"
};
};
jQuery.browser = {};
jQuery.browser[jQuery.uaMatch(navigator.userAgent).browser] = true;
}