API Reference
FTI Datatypes
FTI Constants
FTI_Init
FTI_InitType
FTI_Protect
FTI_Checkpoint
FTI_Status
FTI_Recover
FTI_Snapshot
FTI_Finalize
FTI_CHAR
: FTI data type for chars.
FTI_SHRT
: FTI data type for short integers.
FTI_INTG
: FTI data type for integers.
FTI_LONG
: FTI data type for long integers.
FTI_UCHR
: FTI data type for unsigned chars.
FTI_USHT
: FTI data type for unsigned short integers.
FTI_UINT
: FTI data type for unsigned integers.
FTI_ULNG
: FTI data type for unsigned long integers.
FTI_SFLT
: FTI data type for single floating point.
FTI_DBLE
: FTI data type for double floating point.
FTI_LDBE
: FTI data type for long double floating point.
FTI_BUFS
: 256
FTI_DONE
: 1
FTI_SCES
: 0
FTI_NSCS
: -1
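These constants matter when checking return values. According to this reference, FTI_Checkpoint reports success as FTI_DONE (1), while most other calls report it as FTI_SCES (0), so a blanket `rc != 0` test would misread a successful checkpoint as an error. Below is a minimal sketch of helpers that account for both codes. The names `fti_call_succeeded` and `fti_code_name` are hypothetical and not part of FTI, and the macros are redefined locally, with the values listed above, only so the snippet compiles without the FTI header.

```c
#include <stdbool.h>

/* Values copied from the constants listed above. In a real program they
   come from the FTI header; the guards keep this sketch standalone. */
#ifndef FTI_SCES
#define FTI_SCES 0
#endif
#ifndef FTI_DONE
#define FTI_DONE 1
#endif
#ifndef FTI_NSCS
#define FTI_NSCS (-1)
#endif

/* Hypothetical helper: true if rc is one of the two success codes. */
static bool fti_call_succeeded(int rc) {
    return rc == FTI_SCES || rc == FTI_DONE;
}

/* Hypothetical helper: printable name for a return code. */
static const char *fti_code_name(int rc) {
    switch (rc) {
    case FTI_SCES: return "FTI_SCES";
    case FTI_DONE: return "FTI_DONE";
    case FTI_NSCS: return "FTI_NSCS";
    default:       return "unknown";
    }
}
```

A caller could then write `if (!fti_call_succeeded(rc)) { /* handle error */ }` after any FTI call, and use `fti_code_name(rc)` in log messages.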
- Reads configuration file.
- Creates checkpoint directories.
- Detects topology of the system.
- Regenerates data upon recovery.
DEFINITION
int FTI_Init ( char * configFile , MPI_Comm globalComm )
INPUT
Variable | What for? |
---|---|
char * configFile | Path to the config file |
MPI_Comm globalComm | MPI communicator used for the execution |
OUTPUT
Value | Reason |
---|---|
FTI_SCES | Success |
FTI_NSCS | No success |
DESCRIPTION
This function initializes the FTI context. It should be called before other FTI functions, right after MPI initialization.
EXAMPLE
int main ( int argc , char **argv ) {
MPI_Init (&argc , &argv );
char *path = "config.fti"; // config file path
FTI_Init ( path , MPI_COMM_WORLD );
.
.
.
return 0;
}
- Initializes a data type.
DEFINITION
int FTI_InitType ( FTIT_type *type , int size )
INPUT
Variable | What for? |
---|---|
FTIT_type * type | The data type to be initialized |
int size | The size of the data type to be initialized |
OUTPUT
Value | Reason |
---|---|
FTI_SCES | Success |
DESCRIPTION
This function initializes a data type. A type that FTI does not define by default (see FTI Datatypes) must be initialized with this function before a variable of that type is added to the protected variables.
EXAMPLE
typedef struct A {
int a;
int b;
} A;
FTIT_type structAinfo ;
FTI_InitType (& structAinfo , 2 * sizeof ( int ));
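The example passes `2 * sizeof(int)`, which matches `sizeof(A)` on typical platforms because both members have the same type. For a struct with mixed member types the compiler may insert padding, so the sum of the member sizes can be smaller than the struct itself. The sketch below illustrates the difference with a hypothetical struct `Mixed` and hypothetical helper functions, none of which are part of FTI. That FTI_InitType should receive the padded size is an assumption here, not something this reference states.

```c
#include <stddef.h>

/* Hypothetical struct with mixed member types (not part of FTI). */
typedef struct Mixed {
    char   tag;
    double value;
} Mixed;

/* Sum of the member sizes, ignoring any padding. */
static size_t mixed_member_bytes(void) {
    return sizeof(char) + sizeof(double);
}

/* Bytes the compiler added as padding inside Mixed (may be 0). */
static size_t mixed_padding_bytes(void) {
    return sizeof(Mixed) - mixed_member_bytes();
}
```

Under that assumption, `sizeof(Mixed)` would be the safer value to pass as the `size` argument, since it covers the full memory footprint of one element.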
- Stores metadata concerning the variable to protect.
DEFINITION
int FTI_Protect ( int id, void *ptr, long count, FTIT_type type )
INPUT
Variable | What for? |
---|---|
int id | Unique ID of the variable to protect |
void * ptr | Pointer to memory address of variable |
long count | Number of elements at memory address |
FTIT_type type | FTI data type of variable to protect |
OUTPUT
Value | Reason |
---|---|
FTI_SCES | Success |
exit(1) | Number of protected variables is > FTI_BUFS |
DESCRIPTION
This function adds a data structure to the list of protected variables. The variables in this list are the data that is stored during a checkpoint and loaded during a recovery. If the given id was already registered, the dataset with that id is reset. When the size of a variable changes during execution, it must be updated with this function before the next checkpoint so that the data is stored properly.
EXAMPLE
int A;
float *B = malloc (sizeof(float) * 10) ;
FTI_Protect(1, &A, 1, FTI_INTG );
FTI_Protect(2, B, 10, FTI_SFLT );
// changing B size
B = realloc(B, sizeof(float) * 20) ;
// updating B size in protected list
FTI_Protect(2, B, 20, FTI_SFLT);
- Writes values of protected runtime variables to a checkpoint file of requested level.
DEFINITION
int FTI_Checkpoint( int id, int level )
INPUT
Variable | What for? |
---|---|
int id | Unique checkpoint ID |
int level | Checkpoint level (1=L1, 2=L2, 3=L3, 4=L4) |
OUTPUT
Value | Reason |
---|---|
FTI_DONE | Success |
FTI_NSCS | Failure |
DESCRIPTION
This function stores the current values of the protected variables in a checkpoint file. Depending on the checkpoint level, the file is stored in a local, partner-node, or global directory. The checkpoint ID must be different from 0.
EXAMPLE
int i;
for (i = 0; i < 100; i ++) {
if (i % 10 == 0) {
FTI_Checkpoint ( i /10 + 1, 1) ;
}
.
. // some computations
.
}
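The loop above derives the checkpoint ID from the iteration counter and adds 1, so the first ID is 1 rather than the disallowed 0. The same arithmetic can be sketched as a standalone function. `checkpoint_id_for` is a hypothetical helper, not part of FTI, and it uses 0 as a "no checkpoint due" sentinel precisely because 0 is never a valid checkpoint ID.

```c
/* Hypothetical helper (not part of FTI): returns the checkpoint ID to use
   at iteration `iter` when checkpointing every `interval` iterations,
   or 0 if no checkpoint is due. IDs start at 1 because 0 is not valid. */
static int checkpoint_id_for(int iter, int interval) {
    if (interval <= 0 || iter < 0) {
        return 0; /* invalid schedule: never checkpoint */
    }
    if (iter % interval != 0) {
        return 0; /* not a checkpoint iteration */
    }
    return iter / interval + 1;
}
```

Inside the loop this could be used as `int id = checkpoint_id_for(i, 10); if (id != 0) FTI_Checkpoint(id, 1);`, which keeps the schedule and the ID numbering in one place.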
- Returns the current status of the recovery flag.
DEFINITION
int FTI_Status()
OUTPUT
Value | Reason |
---|---|
| | No checkpoints taken yet or recovered successfully |
| | At least one checkpoint is taken. If execution fails, the next start will be a restart |
| | The execution is a restart from checkpoint level L4 and keep_last_checkpoint was enabled during the last execution |
DESCRIPTION
This function returns the current status of the recovery flag.
EXAMPLE
if ( FTI_Status () != 0) {
.
. // this section will be executed during restart
.
}
- Loads checkpoint data from the checkpoint file and initializes the runtime variables of the execution.
DEFINITION
int FTI_Recover()
OUTPUT
Value | Reason |
---|---|
FTI_SCES | Success |
FTI_NSCS | Failure |
DESCRIPTION
This function loads the checkpoint data from the checkpoint file and updates some basic checkpoint information. After a failure, it should be called once the protected variables have been initialized. If a variable changes its size during execution, it must have its latest size before FTI_Recover is called. The easiest way to achieve this is to add the size of the variable to the protected list as another variable and then call FTI_Recover twice: first to recover the size of the variable, then, after updating the protected list, to recover the variable's data.
EXAMPLE
Basic example:
if ( FTI_Status() == 1 ) {
FTI_Recover();
}
Example if a variable changes its size during execution:
int *A;
int Asize ;
.
.
.
if ( FTI_Status() != 0 ) {
FTI_Recover(); // to recover size of variable
A = realloc( A, sizeof(int)*Asize ) ;
// updating protected list
FTI_Protect( 2, A, Asize, FTI_INTG );
FTI_Recover(); // to recover variable A
}
- Loads checkpoint data and initializes runtime variables upon recovery.
- Writes multilevel checkpoints regarding their requested frequencies.
DEFINITION
int FTI_Snapshot()
OUTPUT
Value | Reason |
---|---|
FTI_SCES | Successful call (without checkpointing) or successful recovery |
FTI_NSCS | Failure of FTI_Checkpoint |
FTI_DONE | Success of FTI_Checkpoint |
exit(1) | Failure on recovery |
DESCRIPTION
This function loads the checkpoint data from the checkpoint file in case of a restart. Otherwise, it checks whether the current iteration requires checkpointing (see e.g. ckpt_L1) and performs a checkpoint if needed (internal call to FTI_Checkpoint). It should be called after the protected variables have been initialized.
EXAMPLE
int res = FTI_Snapshot();
if ( res == FTI_SCES ) {
.
. // executed after successful recover
. // or when checkpoint is not required
}
else if ( res == FTI_DONE ) {
.
. // executed after successful checkpointing
.
}
else { // res == FTI_NSCS
.
. // executed when the checkpoint failed
.
}
- Frees the allocated memory.
- Communicates the end of the execution to dedicated threads.
- Cleans checkpoints and metadata.
DEFINITION
int FTI_Finalize()
OUTPUT
Value | Reason |
---|---|
FTI_SCES | For application process |
exit(0) | For FTI process |
DESCRIPTION
This function notifies the FTI processes that the execution is over, frees some data structures, and closes. If this function is not called at the end of the program, the FTI processes will never finish (deadlock). It should be called before MPI_Finalize().
EXAMPLE
int main ( int argc , char ** argv ) {
.
.
.
FTI_Finalize () ;
MPI_Finalize () ;
return 0;
}