Data Access Patterns

The Data Access Patterns (DAP) Framework makes it possible to analyze and predict the behaviour of object-oriented applications in terms of the read and write access operations performed at run-time over domain data within a particular execution context. A context can represent any meaningful unit of application functionality, such as a method, a service, or anything else appropriate for the target application.

What does it do?

The framework is capable of providing answers to questions along the lines of:

  • What is the likelihood of accessing instances of domain data type X when executing method/service F?
  • What types of domain data are most likely to be accessed when executing method G?
  • What is the probability of accessing instances of type Z, globally, at application level?
  • How likely is it to access domain instances of type Y, when the previously accessed domain instance was of type W?
  • etc.

How is this done?

To answer these questions, the framework:

  • At compile-time: generates and injects extra code into the target application. This code contains call-backs to the DAP framework that update the statistical information whenever an access (read/write operation) is performed over a domain instance, as well as code that identifies the context within which the operations take place.
  • At run-time: collects information about the domain data access patterns actually observed.
  • At run-time: once a representative volume of behaviour information has been collected, analyzes it and predicts future behaviour using one of three alternative stochastic model implementations: Bayesian Updating, Markov Chains and Importance Analysis.
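The kind of bookkeeping that the injected call-backs could feed can be sketched as follows. This is only an illustration and not the framework's actual code: the class `AccessCollector`, its method names, and the context and type labels are all hypothetical.

```python
from collections import Counter, defaultdict


class AccessCollector:
    """Hypothetical stand-in for the statistics updated by the injected call-backs."""

    def __init__(self):
        self.per_context = defaultdict(Counter)  # context -> domain type -> count
        self.global_counts = Counter()           # domain type -> count

    def on_access(self, context, domain_type):
        # Called once for every read or write over a domain instance.
        self.per_context[context][domain_type] += 1
        self.global_counts[domain_type] += 1

    def local_probability(self, context, domain_type):
        # Relative frequency of the type among accesses seen in this context.
        counts = self.per_context[context]
        total = sum(counts.values())
        return counts[domain_type] / total if total else 0.0

    def global_probability(self, domain_type):
        # Relative frequency of the type among all accesses in the application.
        total = sum(self.global_counts.values())
        return self.global_counts[domain_type] / total if total else 0.0


collector = AccessCollector()
for ctx, typ in [("F", "X"), ("F", "X"), ("F", "Y"), ("G", "Z")]:
    collector.on_access(ctx, typ)
```

With these counts, `collector.local_probability("F", "X")` estimates how likely type X is to be accessed while executing F, and `collector.global_probability("Z")` gives the application-level estimate for type Z.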

Model Information

Bayesian Updating

Bayesian analysis techniques are used for parameter estimation. They give an estimate of the statistical descriptors of the analyzed parameters (corresponding, in this particular case, to the likelihood of reading/writing attributes of domain application classes) and can update them when new information becomes available.

Two sets of statistical data are used to generate the predictions. The first set, called the prior set, contains data about access patterns observed in the past, up to the moment when the model prediction was last updated. The second set, called the current set, contains data from the point in time when the prior set ends up to the moment when the new updated prediction is to be generated.

Once a representative volume of behavioural data has been collected, the model predictions are updated. First, the probability density functions of the prior and current data sets are defined. These functions describe the behaviour of the target application, in terms of the data accesses performed during the periods of time to which the sets belong. The prior function is then updated using the current probability density function, yielding the so-called posterior probability density function. The posterior function corresponds to the prediction generated by the model, and describes the most likely future behaviour of the target application in terms of the data it manipulates. Based on the posterior function, the actual access probabilities for all of the application domain class fields are calculated.
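The prior/current/posterior update described above can be illustrated with a small sketch. The text does not say which distribution family the framework uses, so this example assumes a Dirichlet-multinomial model over field accesses, where the update reduces to adding counts. The function name, the `pseudo_count` parameter and the field names are hypothetical.

```python
def posterior_access_probabilities(prior_counts, current_counts, pseudo_count=1.0):
    """Posterior mean access probability per field (assumed Dirichlet-multinomial model)."""
    fields = sorted(set(prior_counts) | set(current_counts))
    # Prior parameters: past observations plus a uniform pseudo-count per field.
    alpha = {f: prior_counts.get(f, 0) + pseudo_count for f in fields}
    # Conjugate update: add the accesses observed since the last prediction.
    posterior = {f: alpha[f] + current_counts.get(f, 0) for f in fields}
    total = sum(posterior.values())
    return {f: posterior[f] / total for f in fields}


# Hypothetical access counts for three fields of a domain class.
prior = {"Account.balance": 30, "Account.owner": 10}
current = {"Account.balance": 5, "Account.owner": 15, "Account.iban": 0}
probs = posterior_access_probabilities(prior, current)
```

In this sketch the posterior of one update round can serve as the prior counts of the next one, which mirrors the incremental operation attributed to the Bayesian model.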

Markov Chains

A Markov Chain is a mathematical description of a system capable of navigating from one state to another, out of a finite number of possible states. It is a stochastic process with the memoryless property: the subsequent state depends only on the current state and not on the sequence of steps that led to it.

The Markov Chain model generates a transition matrix. This matrix contains the probabilities of navigating from one system state to another (as in automata theory). In the context of the DAP framework, each state corresponds to the manipulation (read/write access operation) of a given domain class field. In a transition matrix T = [t_ij], the cell t_ij contains the probability of manipulating field i immediately after having manipulated field j.
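As an illustration, a transition matrix with this orientation can be estimated from an observed sequence of field accesses by counting consecutive pairs. This is a minimal sketch and not the framework's implementation: the function name and the trace are hypothetical.

```python
from collections import Counter, defaultdict


def transition_matrix(accesses):
    """Estimate T = [t_ij], where t_ij = P(next field is i | current field is j)."""
    fields = sorted(set(accesses))
    counts = defaultdict(Counter)  # previous field j -> next field i -> count
    for prev, nxt in zip(accesses, accesses[1:]):
        counts[prev][nxt] += 1
    matrix = []
    for i in fields:
        row = []
        for j in fields:
            outgoing = sum(counts[j].values())
            # Fields with no observed successor get a zero column.
            row.append(counts[j][i] / outgoing if outgoing else 0.0)
        matrix.append(row)
    return fields, matrix


trace = ["a", "b", "a", "b", "c", "a"]  # hypothetical sequence of accessed fields
fields, T = transition_matrix(trace)
```

Because the cell t_ij is conditioned on the column field j, each column with at least one observed successor sums to 1 in this sketch.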

Importance Analysis

Importance Analysis is an approach for analyzing the potential failure modes within a system by classifying them according to the severity or the effect of failures on the system. Risk Priority Numbers (RPN) are among the tools used at the design stage to identify failures and determine their consequences. For the current work, this approach was adapted to indicate which groups of fields are more critical or important for the operation of the target application.

The Risk Priority Number (RPN) system is a relative rating system that assigns a numerical value to the issue in each of three different categories. These ratings are multiplied together to determine the overall RPN for the issue. The criteria used in each rating scale are determined based on the particular circumstances for the item that is being analyzed.

For the DAP framework, the RPN ratings for a given domain class attribute are its local access probability (the probability of being accessed within a particular execution context, such as a service or a method), its global access probability (the probability of being accessed at application level) and its impact factor (an expert judgement coefficient whose value is provided by someone with solid knowledge of how the application operates). The RPN value of a domain class attribute indicates how important that attribute is.
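A minimal sketch of this computation, assuming the three ratings are simply multiplied as in the RPN scheme described above. The function name, the attribute names and the rating values are made up for illustration.

```python
def risk_priority_number(local_probability, global_probability, impact_factor):
    # The three ratings are multiplied together to obtain the overall RPN.
    return local_probability * global_probability * impact_factor


# Hypothetical ratings: (local probability, global probability, impact factor).
ratings = {
    "Order.status": (0.60, 0.20, 3.0),
    "Order.notes": (0.10, 0.05, 1.0),
    "Customer.email": (0.30, 0.40, 2.0),
}

# Attributes sorted from most to least important.
ranked = sorted(ratings, key=lambda f: risk_priority_number(*ratings[f]), reverse=True)
```

Since the RPN is a relative rating, only the resulting order of the attributes is meaningful, not the absolute values.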

Pros and Cons

Bayesian Updating:

  • (+) Intermediate performance overheads from data collection
  • (+) Incremental operation (updates old predictions when new data is available, from an observed change in target behaviour)
  • (-) Predictions subject to some instability if the model is not supplied with a large enough input data set (in other words, if the input data used to calculate the predictions is not representative of the actual target application behaviour)

Markov Chains:

  • (+) Possible to extract a notion of logical time from model predictions (e.g. the most likely sequence of accesses to domain data types)
  • (-) High performance overheads from data collection
  • (-) When target behaviour changes, necessary to perform a new analysis

Importance Analysis:

  • (+) Intermediate performance overheads from data collection
  • (+) Conceptually easier to understand model
  • (-) When target behaviour changes, necessary to perform a new analysis

Model Applications

The predictions generated by these models have been employed for a series of engineering applications:

  • Guiding a high-level (domain instance level) software cache policy for achieving high hit ratios.
  • Identifying optimal clusters (in terms of size and composition) of application data and functionality.
  • Improving the memory efficiency of domain data.
  • Improving performance in clustered web server systems through load balancing policies.

How to set-up an application to use DAP

Depending on the backend you wish to employ for your application, some of the configurations shown next need to be changed to match that backend. The sample configurations shown here assume that the backend to be used is backend-infinispan.

It is important to make sure the following JVM property is enabled when booting the application to use DAP: -DautomaticLocalityHints=true

pom.xml

The first thing that needs to be configured in the pom.xml of the application is the appropriate code-generator for the backend. This can be done by having:

<properties>
    ...
    <fenixframework.code.generator>pt.ist.fenixframework.backend.infinispan.InfinispanCodeGenerator</fenixframework.code.generator>
    ...
    ...
</properties>

The versions of the fenix-framework and of the DAP framework must be specified in the <properties>:

<properties>
    ...
    <version.dap-framework>1.0</version.dap-framework>
    <version.fenixframework>2.0-cloudtm-SNAPSHOT</version.fenixframework>
    ...
    ...
</properties>

For a proper <build> you need to have at least two plugins configured, namely:

<build>
    <plugins>
        ...
        <plugin>
            <groupId>pt.ist</groupId>
            <artifactId>dml-maven-plugin</artifactId>
            <version>${version.fenixframework}</version>
            <configuration>
                <codeGeneratorClassName>${fenixframework.code.generator}</codeGeneratorClassName>
                <params>
                    <ptIstDapEnable>true</ptIstDapEnable>
                    <automaticLocalityHints>true</automaticLocalityHints>
                </params>
            </configuration>
            <executions>
                <execution>
                    <goals>
                        <goal>generate-domain</goal>
                        <goal>post-compile</goal>
                    </goals>
                </execution>
            </executions>
       </plugin>
        ...

and

        ...
        <plugin>
            <groupId>org.codehaus.mojo</groupId>
            <artifactId>build-helper-maven-plugin</artifactId>
            <version>${version.maven.build-helper-plugin}</version>
            <executions>
                <execution>
                    <id>add-resource</id>
                    <phase>generate-resources</phase>
                    <goals>
                        <goal>add-resource</goal>
                    </goals>
                    <configuration>
                        <resources>
                            <resource>
                                <directory>target/generated-sources/dml-maven-plugin</directory>
                                <excludes>
                                    <exclude>**/*.java</exclude>
                                </excludes>
                            </resource>
                        </resources>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        ...
    </plugins>
</build>

In terms of dependencies, it is necessary to have:

<dependencies>
    <dependency>
        <groupId>pt.ist</groupId>
        <artifactId>fenix-framework-backend-infinispan</artifactId>
        <version>${version.fenixframework}</version>
    </dependency>
    ...

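Special attention should be paid to the <params> inside the <configuration> of the dml-maven-plugin: <ptIstDapEnable>true</ptIstDapEnable> and <automaticLocalityHints>true</automaticLocalityHints>.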

If the parameter ptIstDapEnable is not present, or its value is different from "true" (not case-sensitive), then the code-generator will not generate any of the code needed to collect statistical data for the DAP framework, effectively disabling it.

Furthermore, the automaticLocalityHints property signals the instrumentation of the domain classes to automatically generate Locality Hints for those classes that lack them (because the programmer did not define them). These automatically inserted Locality Hints serve the purpose of clustering the instances by their classes, which matches the access pattern characterization methodology described in this section. To make this possible, a Locality Hint is generated from the fully qualified class name of the domain class in question. In the case of *-to-many relations, which are backed by domain collections, the class name used is that of the domain class held on the many side of the relation, not that of the collection.

dap.properties

Having a properly configured pom.xml is the first of two steps for activating the DAP framework. The second is to have a /dap.properties configuration file available in the CLASSPATH.

If, for any reason, the dap.properties file is not available when the application is initializing, then the DAP framework will assume default configuration values, which effectively disables all of its functionality (even if ptIstDapEnable is set to true, the generated extra calls will simply return without doing anything). It is possible to activate the DAP framework later via JMX.

The parameters of the dap.properties file are:

dap_enabled=true
dap_persistent_data=false
dap_path=basePathToFolderWithStructuresIfPersistentDataIsOn
dap_read_statistics=true
dap_write_statistics=true
dap_thread_sleep_interval=25

where

  • dap_enabled is a general turn on/off switch for the DAP framework, on = DAP is allowed to operate, off = disabled
  • dap_persistent_data indicates if the statistical data and results collected at run-time should be kept persistently in a file
  • dap_path is the path to the folder where the persistent DAP data should be kept. This makes it possible to resume the access pattern analysis from where it left off, if the application is restarted.
  • dap_read_statistics indicates if we want to keep track of read access operations over domain data
  • dap_write_statistics indicates if we want to keep track of write access operations over domain data
  • dap_thread_sleep_interval is the interval, in seconds, between re-calculating the predictions as well as saving the data persistently, if that option is enabled

It should be noted that even if dap_enabled=true, it is still necessary to indicate explicitly what type of accesses are to be analyzed (through dap_read_statistics and dap_write_statistics). If not specified, the default value for these parameters is "false".
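The interplay between these switches can be illustrated with a small sketch. It is not the framework's own loader: the helper names are hypothetical, the "false" default for dap_enabled and dap_persistent_data is an assumption based on the statement that the default configuration disables all functionality, and treating values as case-insensitive is also an assumption.

```python
# Defaults: the read/write defaults are stated in the text; the other two are assumed.
DEFAULTS = {
    "dap_enabled": "false",           # assumption
    "dap_persistent_data": "false",   # assumption
    "dap_read_statistics": "false",
    "dap_write_statistics": "false",
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def is_true(props, key):
    # Case-insensitive comparison is an assumption, not confirmed for dap.properties.
    return props.get(key, DEFAULTS[key]).lower() == "true"


def tracked_operations(props):
    """Return the access types that would be analyzed under this configuration."""
    if not is_true(props, "dap_enabled"):
        return set()
    ops = set()
    if is_true(props, "dap_read_statistics"):
        ops.add("read")
    if is_true(props, "dap_write_statistics"):
        ops.add("write")
    return ops
```

For example, a file containing only `dap_enabled=true` yields an empty set of tracked operations in this sketch, which is why the read and write switches must be set explicitly.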