Framework Apps

Scott Sievert edited this page Feb 17, 2017 · 1 revision

Summary: We describe how to write the five key functions in every NEXT application.

We have written a YAML interface specifying what the inputs to the major API functions are in myApp.yaml. Now we will discuss the actual app development.

Basic App Development

  • NEXT handles serving requests, load balancing, logging, database management, and most other components of a web server.
  • This allows you to focus on the dataflow and the algorithm development.
  • This page describes the arguments each app function receives. It is your job to glue together the various dataflows in your application.

The main application code belongs in: apps/PoolBasedBinaryClassification/PoolBasedBinaryClassification.py

In this case the function definitions in the file look something like:

import json
import next.apps.SimpleTargetManager

class MyApp(object):
    def __init__(self, db):
        self.app_id = 'PoolBasedBinaryClassification'
        self.TargetManager = next.apps.SimpleTargetManager.SimpleTargetManager(db)

    def initExp(self, butler, alg, args):
        return ...

    def getQuery(self, butler, alg, args):
        return ...

    def processAnswer(self, butler, alg, args):
        return ...

    def getModel(self, butler, alg, args):
        return ...

First, note that the functions in this class correspond to the major API functions. Second, each API function receives the same three arguments (the exception is __init__, which does some basic setup; more on this later):

  • butler. The butler provides a way to store and retrieve application and algorithm variables and targets, effectively hiding explicit database access.

    • Butler-API -- the API docs
    • "Why can I not just save n using self.n = n in the class MyApp?" NEXT will usually be run with many instances of the same code running in multiple processes. To guarantee future access to data, we have to store it in a database that is shared across all these processes. The butler is our (nice) interface to these databases.
    • The butler provides a way to access four main collections:
      • experiment, which stores experiment information
      • queries, which stores all queries made by NEXT
      • algorithms, a collection specific to each algorithm
      • targets, a list of the targets uploaded during experiment initialization
      • there's more; take a look at next/apps/Butler.py
    • The butler also contains a job function which is used to run asynchronous jobs (such as logging or model updates).
  • alg - Refers to an algorithm's implementation of getQuery, processAnswer, getStats or getModel, depending on which app function receives it.

    • each alg is treated like a black box -- we specify the arguments and returns in Algs.yaml.
    • The application must be agnostic to the algorithm; the app only defines the interface
    • In existing NEXT apps, alg only deals with indices, not the actual targets the user sees.
  • args - As demonstrated in Interface, the input of each function contains a dictionary with key args. These parameters are specified in myApp.yaml.

    • If specified in myApp.yaml, these are almost guaranteed to exist (even if optional).
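To make the point about self.n concrete, here is a minimal sketch. It is not NEXT code: SharedStore and Worker are hypothetical names, a plain dictionary stands in for the database behind the butler, and two objects stand in for two processes running the same app code.

```python
# Hypothetical stand-ins for illustration; none of these names are part of NEXT.
class SharedStore(object):
    """Dictionary-backed stand-in for the database behind the butler."""
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class Worker(object):
    """Stands in for one process running a copy of the app code."""
    def __init__(self, store):
        self.store = store

    def init_exp(self, n):
        self.n = n               # visible only to this instance
        self.store.set('n', n)   # visible to every worker sharing the store

    def read_n_from_self(self):
        return getattr(self, 'n', None)

    def read_n_from_store(self):
        return self.store.get('n')


store = SharedStore()
worker_a = Worker(store)
worker_b = Worker(store)

worker_a.init_exp(10)
local_value = worker_b.read_n_from_self()    # worker_b never set self.n
shared_value = worker_b.read_n_from_store()  # read from the shared store
```

Here local_value is None while shared_value is 10: the second worker only sees what went through the shared store, which is the role the butler plays in a real app.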

Experiment initialization (through initExp):

def initExp(self, butler, alg, args):
    # Add a key n to the experiment args, holding the number of targets
    args['n'] = len(args['targets']['targetset'])

    # Get the first target, extract its feature vector and save its length as the dimension
    # This assumes that the feature dimension is consistent across all targets
    args['d'] = len(args['targets']['targetset'][0]['meta']['features'])

    # Save the target set to the TargetManager associated with this app.
    self.TargetManager.set_targetset(butler.exp_uid, args['targets']['targetset'])

    # We do not want the experiment dictionary to contain the targets;
    # this could make its size very large if there are many targets.
    del args['targets']

    # Run the algorithm initExp
    alg({'n': args['n'], 'd': args['d']})

    # The args are now stored in the butler and can be accessed through butler.experiment
    return args

initExp is responsible for setting up the experiment and saving any variables that might be needed later. The return value of initExp is a dictionary that will be saved in the butler.experiment collection. It is best practice to add or remove values in the input args dictionary as needed and then return args. For example, in the code above, we add n (the number of targets) and d (the dimension of the feature vector of a target) to args, and we delete the targets dictionary.

Generally, initExp is also expected to save the experiment's targets using a TargetManager that is available through butler.targets. Note that the specific type of TargetManager (here SimpleTargetManager) is specified in __init__. You can learn more about creating a TargetManager here.

Finally, initExp runs the algorithm's initExp in the line alg({'n': args['n'], 'd': args['d']}), passing on the number of targets and the ambient dimension.
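The effect of initExp on args can be traced with a small standalone sketch. StubTargetManager, StubButler, the 'exp-0' id and the three-target set are made up for illustration; only the body of init_exp follows the listing above, rewritten as a plain function so it runs without NEXT.

```python
# Hypothetical stubs; only the body of init_exp mirrors the listing on this page.
class StubTargetManager(object):
    def __init__(self):
        self.saved = {}

    def set_targetset(self, exp_uid, targetset):
        self.saved[exp_uid] = targetset


class StubButler(object):
    exp_uid = 'exp-0'  # made-up experiment id


def init_exp(target_manager, butler, alg, args):
    args['n'] = len(args['targets']['targetset'])
    args['d'] = len(args['targets']['targetset'][0]['meta']['features'])
    target_manager.set_targetset(butler.exp_uid, args['targets']['targetset'])
    del args['targets']
    alg({'n': args['n'], 'd': args['d']})
    return args


alg_calls = []  # records what the algorithm's initExp would receive
manager = StubTargetManager()
args = {
    'targets': {'targetset': [
        {'target_id': 0, 'meta': {'features': [0.5, -1.0]}},
        {'target_id': 1, 'meta': {'features': [1.5, 2.0]}},
        {'target_id': 2, 'meta': {'features': [1.732, -1.882]}},
    ]},
}
result = init_exp(manager, StubButler(), alg_calls.append, args)
```

With this input, result is {'n': 3, 'd': 2}, the algorithm receives the same two values, and the full target set lives only in the target manager.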

User request for query through getQuery:

def getQuery(self, butler, alg, args):
    # Get the target_id that we wish to label from the algorithm.
    target_id = alg({'participant_uid': args['participant_uid']})

    # Get the associated target and remove the feature vector
    target = butler.targets.get_target_item(butler.exp_uid, target_id)
    del target['meta']

    # Return a dictionary with the target - this will be returned to the user.
    return {'target_indices': target}

The output of getQuery is returned directly to the client that requested a query. In general, getQuery retrieves the index of a target from the algorithm, gets the associated target, manipulates it as needed, and returns it. Every query is also assigned a unique query_uid, and the returned query dictionary is stored in the butler.queries collection.

In this example, getQuery first calls the algorithm's getQuery with the participant_uid as input and receives an integer, the target_id, back. The target_id is then used to retrieve the target from the TargetManager. For example, if the algorithm returns a target_id of 2, we may imagine that a dictionary like

    {
        "meta": {"features": [1.732, -1.882]},
        "alt_type": "text",
        "primary_type": "text",
        "primary_description": "Target 2",
        "alt_description": "2",
        "target_id": 2
    }

is passed back. The meta key is then removed since the user does not need the feature vector to answer the question.
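The lookup-and-strip step can be sketched without NEXT as follows. The in-memory targetset, the 'exp-0' and 'participant-0' ids, and the lambda standing in for the algorithm are assumptions for illustration; the stand-in get_target_item returns a copy, which is a choice of this sketch and not a statement about how SimpleTargetManager behaves.

```python
# Hypothetical in-memory stand-in for the TargetManager lookup.
targetset = {
    2: {
        "meta": {"features": [1.732, -1.882]},
        "alt_type": "text",
        "primary_type": "text",
        "primary_description": "Target 2",
        "alt_description": "2",
        "target_id": 2,
    },
}


def get_target_item(exp_uid, target_id):
    # Return a shallow copy so deleting 'meta' below leaves the stored target intact.
    return dict(targetset[target_id])


def get_query(alg, participant_uid):
    # Ask the algorithm which target to show, then fetch and trim that target.
    target_id = alg({'participant_uid': participant_uid})
    target = get_target_item('exp-0', target_id)
    del target['meta']
    return {'target_indices': target}


# The lambda plays the role of an algorithm that always picks target 2.
query = get_query(lambda alg_args: 2, 'participant-0')
```

The returned query carries the display fields and target_id but no feature vector, which is what the client needs to render the question.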

Processing a user answer through processAnswer:

def processAnswer(self, butler, alg, args):
    # Get the query associated with this answer
    query = butler.queries.get(uid=args['query_uid'])

    # Extract the target that the query was made about
    target = query['target_indices']
    # Get the label returned by the user in the args dictionary.
    target_label = args['target_label']
    # Update the number of reported answers
    num_reported_answers = butler.experiment.increment(key='num_reported_answers_for_' + query['alg_label'])

    # Pass the answer on to the algorithm for processing.
    alg({'target_id': target['target_id'], 'target_label': target_label})

    # Get a copy of the model and log it roughly every n/4 answers. This will later help us
    # track how our algorithm performed as the number of labelled items increased.
    # Floor division (//) keeps the interval an integer under both Python 2 and Python 3.
    experiment = butler.experiment.get()
    if num_reported_answers % ((experiment['n'] + 4) // 4) == 0:
        job_args = {'exp_uid': butler.exp_uid,
                    'args': {'alg_label': query['alg_label'], 'logging': True}}
        butler.job('getModel', json.dumps(job_args))

    # Return the answer. This is added to the associated query entry stored in butler.queries for future reference.
    return {'target_index': target['target_id'], 'target_label': target_label}

processAnswer begins by getting a copy of the original query corresponding to this answer. It then extracts the associated target, and passes the target_id (obtained from the query) and target_label (passed by the client) to the algorithm.
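The dataflow in this first half of processAnswer can be exercised with stubs. StubCollection is a hypothetical stand-in that offers only the get(uid=...) and increment(key=...) calls used on this page; the 'q-0' id and the 'RandomSampling' label are made up.

```python
# Hypothetical stub mimicking only the two butler calls used on this page.
class StubCollection(object):
    def __init__(self, docs=None):
        self.docs = docs or {}
        self.counters = {}

    def get(self, uid=None):
        return self.docs[uid]

    def increment(self, key):
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.counters[key]


def process_answer(queries, experiment, alg, args):
    # Look up the stored query, count the answer, and forward it to the algorithm.
    query = queries.get(uid=args['query_uid'])
    target = query['target_indices']
    target_label = args['target_label']
    count = experiment.increment(key='num_reported_answers_for_' + query['alg_label'])
    alg({'target_id': target['target_id'], 'target_label': target_label})
    return count, {'target_index': target['target_id'], 'target_label': target_label}


queries = StubCollection({
    'q-0': {'alg_label': 'RandomSampling', 'target_indices': {'target_id': 2}},
})
experiment = StubCollection()
alg_calls = []  # records what the algorithm's processAnswer would receive
count, answer = process_answer(
    queries, experiment, alg_calls.append,
    {'query_uid': 'q-0', 'target_label': 1},
)
```

After one answer the counter reads 1, the algorithm has seen target_id 2 with label 1, and the returned dictionary is what would be attached to the stored query.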

Next, processAnswer queues a getModel job in the butler roughly every n/4 answers. As described below and in the algorithms section, the getModel command just returns an internal, algorithm-specific representation of the model. By logging this model every n/4 answers, we can track how the model changed over time and check its performance on a test set. We will delve further into this when we discuss the dashboard.
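As a worked example of the logging cadence, the sketch below evaluates the interval expression from the listing for an assumed pool of n = 100 targets. The helper names are hypothetical; floor division is used to reproduce the integer result for integer n.

```python
def logging_interval(n):
    # Same expression as in processAnswer, with explicit floor division.
    return (n + 4) // 4


def answers_that_trigger_logging(n, total_answers):
    # Answer counts k (1-based) for which k % interval == 0.
    interval = logging_interval(n)
    return [k for k in range(1, total_answers + 1) if k % interval == 0]


interval = logging_interval(100)                   # 104 // 4
triggers = answers_that_trigger_logging(100, 100)  # multiples of the interval up to 100
smallest_interval = logging_interval(0)            # the +4 keeps this at least 1
```

For n = 100 the interval is 26, so the model is logged after answers 26, 52 and 78. The +4 term also keeps the interval at 1 or more for small n, so the modulo never divides by zero.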

Finally, processAnswer returns the answer. NEXT appends the answer to the query dictionary for the specific query_uid in the butler.queries collection.

getModel:

def getModel(self, butler, alg, args):
    # Run the algorithm's getModel and return the result.
    # Note that we are implicitly assuming there are no inputs to the alg.
    return alg()

The getModel command is very simple: it just returns the result of the algorithm's getModel. This agrees with what we saw earlier, when getModel was not overridden in the application-specific YAML interface file.

Note that if algorithms return slightly different types of models and the model output should be uniform across algorithms, this is the place to do the conversion. Otherwise, this function may simply package the algorithm response into a dictionary and return it.
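A minimal sketch of such a conversion is shown below. Everything about the model formats is an assumption: this page does not say what algorithms return, so the sketch pretends one algorithm returns a bare sequence of weights and another returns a dictionary with a 'weights' key, and normalize_model is a hypothetical helper.

```python
def normalize_model(raw_model):
    """Package an algorithm's response as {'weights': [...]} (hypothetical schema)."""
    if isinstance(raw_model, dict):
        weights = raw_model['weights']
    else:
        weights = raw_model
    # Convert to a list of floats so every algorithm yields the same structure.
    return {'weights': [float(w) for w in weights]}


def get_model(alg):
    # Same shape as the app's getModel: call the algorithm, then make the output uniform.
    return normalize_model(alg())


# Two made-up algorithms returning the same model in different shapes.
model_a = get_model(lambda: [1, 2])
model_b = get_model(lambda: {'weights': (1.0, 2.0)})
```

Both calls produce {'weights': [1.0, 2.0]}, so anything downstream that reads the logged models can treat the algorithms interchangeably.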

TODO: ALG_LABEL ISSUES IN GET_MODEL?
