API Reference

kellekai edited this page Sep 28, 2017 · 21 revisions

FTI Datatypes
FTI Constants
FTI_Init
FTI_InitType
FTI_Protect
FTI_Checkpoint
FTI_Status
FTI_Recover
FTI_Snapshot
FTI_Finalize

FTI Datatypes and Constants

FTI Datatypes

FTI_CHAR : FTI data type for chars.
FTI_SHRT : FTI data type for short integers.
FTI_INTG : FTI data type for integers.
FTI_LONG : FTI data type for long integers.
FTI_UCHR : FTI data type for unsigned chars.
FTI_USHT : FTI data type for unsigned short integers.
FTI_UINT : FTI data type for unsigned integers.
FTI_ULNG : FTI data type for unsigned long integers.
FTI_SFLT : FTI data type for single-precision floating-point numbers.
FTI_DBLE : FTI data type for double-precision floating-point numbers.
FTI_LDBE : FTI data type for long double floating-point numbers.

FTI Constants

FTI_BUFS : 256
FTI_DONE : 1
FTI_SCES : 0
FTI_NSCS : -1
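
The constants above double as return codes: most functions on this page report success with FTI_SCES, while FTI_Checkpoint reports it with FTI_DONE. The following is a minimal sketch of a helper that accepts either one. The helper name `fti_call_succeeded` is hypothetical and not part of FTI, and the constants are redefined locally (with the values listed above) only so that the sketch compiles without `fti.h`; a real program would include `fti.h` instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Fallback definitions mirroring the values listed above. They exist only
   so this sketch compiles without FTI; a real program includes fti.h. */
#ifndef FTI_SCES
#define FTI_SCES 0
#endif
#ifndef FTI_DONE
#define FTI_DONE 1
#endif
#ifndef FTI_NSCS
#define FTI_NSCS (-1)
#endif

/* Compile-time check: neither success code may collide with the failure code. */
static_assert(FTI_SCES != FTI_NSCS && FTI_DONE != FTI_NSCS,
              "success and failure codes must differ");

/* Hypothetical helper (not part of FTI): true for either success code. */
bool fti_call_succeeded(int code) {
    return code == FTI_SCES || code == FTI_DONE;
}
```

Wrapping every FTI call in such a check avoids the easy mistake of comparing the result of FTI_Checkpoint against FTI_SCES.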


FTI_Init

  • Reads configuration file.
  • Creates checkpoint directories.
  • Detects topology of the system.
  • Regenerates data upon recovery.

DEFINITION

int FTI_Init ( char * configFile , MPI_Comm globalComm )

INPUT

Variable What for?
char * configFile Path to the config file
MPI_Comm globalComm MPI communicator used for the execution

OUTPUT

Value Reason
FTI_SCES Success
FTI_NSCS No Success

DESCRIPTION

This function initializes the FTI context. It should be called before other FTI functions, right after MPI initialization.

EXAMPLE

int main ( int argc , char **argv ) {
    MPI_Init (&argc , &argv );
    char *path = "config.fti"; // config file path
    FTI_Init ( path , MPI_COMM_WORLD );
.
.
.
    return 0;
}

FTI_InitType

  • Initializes a data type.

DEFINITION

int FTI_InitType ( FTIT_type *type , int size )

INPUT

Variable What for?
FTIT_type * type The data type to be initialized
int size The size of the data type to be initialized

OUTPUT

Value Reason
FTI_SCES Success

DESCRIPTION

This function initializes a data type. If the type of a variable is not one of the types FTI defines by default (see: FTI Datatypes), the type must be initialized with this function before the variable is added to the protected variables.

EXAMPLE

typedef struct A {
    int a;
    int b;
} A;
FTIT_type structAinfo;
FTI_InitType(&structAinfo, sizeof(A)); // sizeof(A) also covers any padding
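
When the type is a struct, the size handed to FTI_InitType has to cover any padding the compiler inserts between members, so applying `sizeof` to the struct itself is the safe choice; adding up the member sizes can come out too small. The sketch below shows the difference in plain C without calling FTI. `Particle` and both helper functions are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical application struct; not part of FTI. */
typedef struct {
    char   tag;   /* 1 byte, usually followed by padding */
    double mass;  /* typically aligned to its own size */
} Particle;

/* The struct can never be smaller than the sum of its members. */
static_assert(sizeof(Particle) >= sizeof(char) + sizeof(double),
              "struct size must cover all members");

/* Size to pass as the `size` argument of FTI_InitType for Particle. */
size_t particle_type_size(void) {
    return sizeof(Particle);
}

/* Sum of the member sizes, which ignores padding and may be too small. */
size_t particle_member_sum(void) {
    return sizeof(char) + sizeof(double);
}
```

On most platforms `particle_type_size()` is larger than `particle_member_sum()`; the exact values depend on the compiler and target.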

FTI_Protect

  • Stores metadata concerning the variable to protect.

DEFINITION

int FTI_Protect ( int id, void *ptr, long count, FTIT_type type )

INPUT

Variable What for?
int id Unique ID of the variable to protect
void * ptr Pointer to memory address of variable
long count Number of elements at memory address
FTIT_type type FTI data type of variable to protect

OUTPUT

Value Reason
FTI_SCES Success
exit(1) Number of protected variables is > FTI_BUFS

DESCRIPTION

This function adds a data structure to the list of protected variables. The variables on this list are the data that is stored during a checkpoint and loaded during a recovery. If the given ID was already registered, the call resets that dataset. When the size of a variable changes during execution, the variable should be registered again with this function before the next checkpoint so that its data is stored properly.

EXAMPLE

int A;
float *B = malloc (sizeof(float) * 10) ;
FTI_Protect(1, &A, 1, FTI_INTG );
FTI_Protect(2, B, 10, FTI_SFLT );
// changing B size
B = realloc(B, sizeof(float) * 20) ;
// updating B size in protected list
FTI_Protect(2, B, 20, FTI_SFLT);
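
The bookkeeping described above (one entry per ID, a repeated ID overwrites the earlier entry, at most FTI_BUFS entries) can be pictured with a toy model. This is a sketch for illustration only, not FTI's implementation: every name starting with `toy_` or `Toy` is hypothetical, and the toy returns -1 when the table is full where the OUTPUT table above documents exit(1).

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_VARS 256  /* same value as FTI_BUFS in the constants list */

/* One registered variable: where it lives and how many elements it has. */
typedef struct {
    int    id;
    void  *ptr;
    long   count;
    size_t elem_size;
} ToyEntry;

typedef struct {
    ToyEntry entries[TOY_MAX_VARS];
    int      used;
} ToyRegistry;

/* Registers a variable, or overwrites the entry that already has this id.
   Returns 0 on success and -1 when the table is full. */
int toy_protect(ToyRegistry *reg, int id, void *ptr, long count, size_t elem_size) {
    assert(reg != NULL);
    for (int i = 0; i < reg->used; i++) {
        if (reg->entries[i].id == id) {
            reg->entries[i].ptr = ptr;
            reg->entries[i].count = count;
            reg->entries[i].elem_size = elem_size;
            return 0;
        }
    }
    if (reg->used == TOY_MAX_VARS) {
        return -1;
    }
    reg->entries[reg->used++] = (ToyEntry){ id, ptr, count, elem_size };
    return 0;
}

/* Total number of bytes a checkpoint of this registry would have to store. */
size_t toy_checkpoint_bytes(const ToyRegistry *reg) {
    assert(reg != NULL);
    size_t total = 0;
    for (int i = 0; i < reg->used; i++) {
        total += (size_t)reg->entries[i].count * reg->entries[i].elem_size;
    }
    return total;
}
```

Registering ID 2 first with 10 floats and then with 20 floats leaves a single entry in the toy registry, and the checkpoint size follows the second registration. This is the reason a resized buffer has to be registered again before the next checkpoint.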

FTI_Checkpoint

  • Writes values of protected runtime variables to a checkpoint file of requested level.

DEFINITION

int FTI_Checkpoint( int id, int level )

INPUT

Variable What for?
int id Unique checkpoint ID
int level Checkpoint level (1=L1, 2=L2, 3=L3, 4=L4)

OUTPUT

Value Reason
FTI_DONE Success
FTI_NSCS Failure

DESCRIPTION

This function stores the current values of the protected variables in a checkpoint file. Depending on the checkpoint level, the file is stored in a local, partner-node, or global directory. The checkpoint ID must be different from 0.

EXAMPLE

int i;
for (i = 0; i < 100; i ++) {
    if (i % 10 == 0) {
        FTI_Checkpoint ( i /10 + 1, 1) ;
    }
.
. // some computations
.
}
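
Because the checkpoint ID must not be 0, the example above derives it as `i / 10 + 1`. A small sketch of that arithmetic as a reusable function is shown below. The function name `checkpoint_id_for` is hypothetical and not part of FTI; it returns 0 to mean "no checkpoint at this iteration", which is unambiguous precisely because 0 is never a valid ID.

```c
#include <assert.h>

/* Hypothetical helper (not part of FTI): returns the checkpoint ID to use at
   this iteration, or 0 when the iteration is not a checkpoint iteration.
   IDs start at 1 because FTI_Checkpoint requires a nonzero ID. */
int checkpoint_id_for(int iteration, int interval) {
    assert(iteration >= 0 && interval > 0);
    if (iteration % interval != 0) {
        return 0;  /* not a multiple of the interval: skip */
    }
    return iteration / interval + 1;
}
```

Inside the loop, a caller would compute `int id = checkpoint_id_for(i, 10);` and call `FTI_Checkpoint(id, 1)` only when `id != 0`.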

FTI_Status

  • Returns the current status of the recovery flag.

DEFINITION

int FTI_Status()

OUTPUT

Value Reason
int 0 No checkpoints taken yet or recovered successfully
int 1 At least one checkpoint is taken. If execution fails, the next start will be a restart
int 2 The execution is a restart from checkpoint level L4 and keep_last_checkpoint was enabled during the last execution

DESCRIPTION

This function returns the current status of the recovery flag.

EXAMPLE

if ( FTI_Status () != 0) {
    .
    . // this section will be executed during restart
    .
}
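
The three status values from the OUTPUT table can be given names so that the calling code reads more clearly. The sketch below assumes the status is read right after FTI_Init, before the run takes its first checkpoint, so that a value of 1 can only stem from an earlier, failed execution. `RunKind` and `classify_status` are hypothetical names and not part of FTI.

```c
#include <assert.h>

/* Hypothetical labels for the documented return values of FTI_Status. */
typedef enum {
    RUN_FRESH,           /* status 0: nothing to recover */
    RUN_RESTART,         /* status 1: checkpoints exist, restart */
    RUN_RESTART_KEPT_L4  /* status 2: restart from a kept L4 checkpoint */
} RunKind;

/* Maps a status value (0, 1 or 2, as listed in the table) to a label. */
RunKind classify_status(int status) {
    assert(status >= 0 && status <= 2);
    switch (status) {
        case 1:  return RUN_RESTART;
        case 2:  return RUN_RESTART_KEPT_L4;
        default: return RUN_FRESH;
    }
}
```

A caller would pass the result of `FTI_Status()` to `classify_status` and branch on the returned label.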

FTI_Recover

  • Loads checkpoint data from the checkpoint file and initializes the runtime variables of the execution.

DEFINITION

int FTI_Recover()

OUTPUT

Value Reason
FTI_SCES Success
FTI_NSCS Failure

DESCRIPTION

This function loads the checkpoint data from the checkpoint file and updates some basic checkpoint information. After a failure, it should be called once the protected variables have been initialized. If a variable changes its size during execution, it must be registered with its latest size before FTI_Recover is called. The easiest way to do so is to add the size of the variable to the protected list as a variable of its own and then call FTI_Recover twice: first to recover the size, then (after updating the protected list) to recover the data of the variable.

EXAMPLE

Basic example:

if ( FTI_Status() == 1 ) {
    FTI_Recover();
}

Example if a variable changes its size during execution:

int *A;
int Asize ;
.
.
.
if ( FTI_Status() != 0 ) {
    FTI_Recover(); // to recover size of variable
    A = realloc( A, sizeof(int)*Asize ) ;
    // updating protected list
    FTI_Protect( 2, A, Asize, FTI_INTG );
    FTI_Recover(); // to recover variable A
}
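
The two-pass pattern above can be tried out without FTI using a toy stand-in for the checkpoint file. The sketch below only illustrates the ordering (recover the size, allocate, then recover the data); it is not how FTI stores or reads checkpoints, and every name starting with `toy_` or `Toy` is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_CAPACITY 8

/* Toy stand-in for a checkpoint file: a saved element count and the data. */
typedef struct {
    int saved_count;
    int saved_data[TOY_CAPACITY];
} ToyCheckpoint;

/* Pass 1: restores only the element count. */
void toy_recover_count(const ToyCheckpoint *ckpt, int *count) {
    assert(ckpt != NULL && count != NULL);
    *count = ckpt->saved_count;
}

/* Pass 2: restores the data; fails with -1 if the buffer is too small. */
int toy_recover_data(const ToyCheckpoint *ckpt, int *buf, int buf_count) {
    assert(ckpt != NULL && buf != NULL);
    if (buf_count < ckpt->saved_count) {
        return -1;
    }
    memcpy(buf, ckpt->saved_data, sizeof(int) * (size_t)ckpt->saved_count);
    return 0;
}

/* Full restart sequence. Returns a newly allocated array that the caller
   must free, or NULL on failure. */
int *toy_restart(const ToyCheckpoint *ckpt, int *count_out) {
    assert(ckpt != NULL && count_out != NULL);
    int count = 0;
    toy_recover_count(ckpt, &count);           /* first pass: size */
    if (count <= 0) {
        return NULL;
    }
    int *buf = malloc(sizeof(int) * (size_t)count);
    if (buf == NULL) {
        return NULL;
    }
    if (toy_recover_data(ckpt, buf, count) != 0) {  /* second pass: data */
        free(buf);
        return NULL;
    }
    *count_out = count;
    return buf;
}
```

Calling `toy_recover_data` with a buffer sized for an older, smaller count fails, which mirrors why the protected list has to be updated between the two FTI_Recover calls.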

FTI_Snapshot

  • Loads checkpoint data and initializes runtime variables upon recovery.
  • Writes multilevel checkpoints regarding their requested frequencies.

DEFINITION

int FTI_Snapshot()

OUTPUT

Value Reason
FTI_SCES Successful call (no checkpoint taken) or successful recovery
FTI_NSCS Failure of FTI_Checkpoint
FTI_DONE Success of FTI_Checkpoint
exit(1) Failure on recovery

DESCRIPTION

This function loads the checkpoint data from the checkpoint file in case of a restart. Otherwise, it checks whether the current iteration requires checkpointing (see e.g.: ckpt_L1) and performs a checkpoint if needed (internal call to FTI_Checkpoint). It should be called after the protected variables have been initialized.

EXAMPLE

int res = FTI_Snapshot();
if ( res == FTI_SCES ) {
    .
    . // executed after successful recover
    . // or when checkpoint is not required
}
else if ( res == FTI_DONE ) {
    .
    . // executed after successful checkpointing
    .
}
else { // res == FTI_NSCS
    .
    . // executed when the checkpoint failed
    .
}
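
The description above says that FTI_Snapshot decides on each call whether a checkpoint is due and at which level. One plausible way to make such a decision is sketched below: each level has its own interval, and the highest level whose interval divides the call counter wins. This is an assumption made for illustration. This page does not specify how FTI schedules checkpoints or in which unit settings such as ckpt_L1 are expressed, and `toy_level_due` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_LEVELS 4

/* intervals[k] is the checkpoint interval of level k+1, counted in calls;
   an interval of 0 disables that level. Returns the highest level that is
   due at this call, or 0 when no checkpoint is due. */
int toy_level_due(long call_count, const int intervals[NUM_LEVELS]) {
    assert(call_count > 0 && intervals != NULL);
    for (int level = NUM_LEVELS; level >= 1; level--) {
        int every = intervals[level - 1];
        if (every > 0 && call_count % every == 0) {
            return level;
        }
    }
    return 0;
}
```

With intervals of 3, 5, 7 and 11 for levels 1 to 4, call 15 is due for levels 1 and 2 and the sketch picks level 2, while call 4 is due for none.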

FTI_Finalize

  • Frees the allocated memory.
  • Communicates the end of the execution to dedicated threads.
  • Cleans checkpoints and metadata.

DEFINITION

int FTI_Finalize()

OUTPUT

Value Reason
FTI_SCES For application process
exit(0) For FTI process

DESCRIPTION

This function notifies the FTI processes that the execution is over, frees some data structures, and closes FTI. If it is not called at the end of the program, the FTI processes never finish (deadlock). It should be called before MPI_Finalize().

EXAMPLE

int main ( int argc , char ** argv ) {
    .
    .
    .
    FTI_Finalize () ;
    MPI_Finalize () ;
    return 0;
}