Skip to content

Fortran API Reference

Kai Keller edited this page Oct 24, 2017 · 31 revisions

FTI Datatypes
FTI Constants
FTI_Init
FTI_InitType
FTI_Protect
FTI_Checkpoint
FTI_Status
FTI_Recover
FTI_Snapshot
FTI_Finalize

FTI Datatypes and Constants

FTI Datatypes

⬆️ Top

FTI datatypes are used in the C-API function FTI_Protect. With the count parameter and the datatype, FTI is able to determine the size of the allocated memory region at ptr.

The FTI Fortran interface defines a template of FTI_Protect for all intrinsic data types. Hence the datatype definitions are not necessary here and are not avalable for the Fortran interface.

FTI Constants

⬆️ Top

FTI_BUFS : 256
FTI_DONE : 1
FTI_SCES : 0
FTI_NSCS : -1
FTI_NREC : -2


FTI_Init

⬆️ Top

  • Reads configuration file.i
  • Creates checkpoint directories.
  • Detects topology of the system.
  • Regenerates data upon recovery.

DEFINITION

subroutine FTI_Init ( config_file, global_comm, err )

ARGUMENTS

Variable What for?
character config_file IN Path to the config file
integer global_comm IN/OUT MPI communicator used for the execution
integer err OUT Token for FTI error code.

ERROR HANDLING

ierr Reason
FTI_SCES Success
FTI_NSCS No Success
FTI_NREC FTI couldn't recover ckpt files, no recovery possible

DESCRIPTION

This function initializes the FTI context. It should be called before other FTI functions, right after MPI initialization. The MPI communicator passed, must be declared as integer, target.

EXAMPLE

integer, target :: rank, nbProcs, err, FTI_comm_world

call MPI_Init(err)
FTI_comm_world = MPI_COMM_WORLD
call FTI_Init('config.fti', FTI_comm_world, err) ! modifies FTI_comm_world
call MPI_Comm_size(FTI_comm_world, nbProcs, err)
call MPI_Comm_rank(FTI_comm_world, rank, err)

FTI_InitType

⬆️ Top

  • Initializes a data type.

DEFINITION

subroutine FTI_InitType ( type_F, size_F, err )

ARGUMENTS

Variable What for?
type(FTI_type) type_F IN The data type to be initialized
integer size_F IN The size of the data type to be initialized
integer err OUT Token for FTI error code.

ERROR HANDLING

err Reason
FTI_SCES Success
FTI_NSCS No Success

DESCRIPTION

This function initializes a data type. A variable’s type which is not an intrinsic Fortran data-type must be added using this function before adding this variable to the protected variables.

EXAMPLE

!...

type polar
    real :: radius
    real :: phi
end type

type(FTI_Type)              :: FTI_Polar

type(polar), target         :: choord
type(polar), pointer        :: choord_ptr
type(c_ptr)                 :: choord_c_ptr

choord_ptr => choord
choord_c_ptr = c_loc(choord) 

! ...

call FTI_InitType(FTI_Polar, int(sizeof(choord),4), ierr)

! ...

FTI_Protect

⬆️ Top

  • Stores metadata concerning the variable to protect.

In the Fortran interface, FTI_Protect comes with two different function headers. One may be used for intrinsic Fortran types and the other must be used for derived data-types.

DEFINITION

subroutine FTI_Protect ( id, data, err ) !> For intrinsic data-types
subroutine FTI_Protect ( id, data_ptr, count_F, type_F, err ) !> For derived data-types

ARGUMENTS (intrinsic types)

Variable What for?
integer id IN Unique ID of the variable to protect
Fortran type, pointer data IN Pointer to memory address of variable
integer err OUT Token for FTI error code.

ARGUMENTS (derived types)

Variable What for?
integer id IN Unique ID of the variable to protect
type(c_ptr) data_ptr IN Pointer to memory address of variable
integer count_F IN Number of elements.
tape(FTI_Type) type_F IN FTI_Type of Derived data-type.
integer err OUT Token for FTI error code.

ERROR HANDLING

err Reason
FTI_SCES Success
FTI_NSCS Number of protected variables is > FTI_BUFS

DESCRIPTION

This function should be used to add data structures to the list of protected variables. This list of structures is the data that will be stored during a checkpoint and loaded during a recovery. When the size of a variable changes during execution, a subsequent call to FTI_Protect, updates the size for the protected variable. The call is necessary to communicate the change in size to FTI.

EXAMPLE

For Fortran intrinsic data-types:

! ...

integer, target :: nbProcs, iter, row, col, err, FTI_comm_world
integer, pointer  :: ptriter
real(8), pointer :: g(:,:)

call MPI_Init(err)
FTI_comm_world = MPI_COMM_WORLD
call FTI_Init('config.fti', FTI_comm_world, err) ! modifies FTI_comm_world
call MPI_Comm_size(FTI_comm_world, nbProcs, err)

row = sqrt((MEM_MB * 1024.0 * 512.0 * nbProcs)/8)

col = (row / nbProcs)+3

allocate( g(row, col) )
allocate( h(row, col) )

! INIT DATA ...

ptriter => iter
call FTI_Protect(0, ptriter, err)
call FTI_Protect(2, g, err)

! ...

For derived data-types

! ...

use iso_c_binding

type polar
    real :: radius
    real :: phi
end type

type(FTI_Type)              :: FTI_Polar
integer, parameter       	:: N=128*1024*25  !> 25 MB / Process
integer, parameter       	:: N1 = 128       
integer, parameter       	:: N2 = 1024
integer, parameter       	:: N3 = 25
integer, target             :: FTI_COMM_WORLD
integer                     :: ierr, status

type(polar), dimension(:,:,:), pointer :: arr
type(c_ptr)                            :: arr_c_ptr

allocate(arr(N1,N2,N3))

shape = (/ N1, N2, N3 /)
arr_c_ptr = c_loc( arr( & 
    lbound(arr,1), &
    lbound(arr,2), &
    lbound(arr,3)))

!> INITIALIZE MPI AND FTI    
call MPI_Init(ierr)
FTI_COMM_WORLD = MPI_COMM_WORLD
call FTI_Init('config.fti', FTI_COMM_WORLD, ierr)

call FTI_InitType(FTI_Polar, int(2*sizeof(1.0),4), ierr)

!> PROTECT DATA AND ITS SHAPE
call FTI_Protect(0, arr_c_ptr, size(arr), FTI_Polar, ierr)

! ...

FTI_Checkpoint

⬆️ Top

  • Writes values of protected runtime variables to a checkpoint file of requested level.

DEFINITION

subroutine FTI_Checkpoint ( id_F, level, err )

ARGUMENTS

Variable What for?
integer id_F IN Unique checkpoint ID
integer level IN Checkpoint level (1=L1, 2=L2, 3=L3, 4=L4)
integer err OUT Token for FTI error code.

ERROR HANDLING

err Reason
FTI_DONE Success
FTI_NSCS Failure

DESCRIPTION

This function is used to store current values of protected variables into a checkpoint file. Depending on the checkpoint level file is stored in local, partner node or global directory. Checkpoint’s id must be different from 0.

EXAMPLE

The handling is identical to the C case, except that in Fortran it is a subroutine and not a function, hence:

! ...

!> LEVEL 2 CHECKPOINT, ID = 1
call FTI_Checkpoint(1, 2, err) 

! ...

FTI_Realloc

⬆️ Top

  • Provides the reallocation of memory on FTI API side for protected variables upon a restart.

DEFINITION

subroutine FTI_Realloc ( id, data, err ) !> For intrinsic data-types
subroutine FTI_Realloc ( id, data_ptr, err ) !> For derived data-types

ARGUMENTS (intrinsic types)

Variable What for?
integer id IN Unique ID of the variable to protect
Fortran type, pointer data IN/OUT Pointer to memory address of variable
integer err OUT Token for FTI error code.

ARGUMENTS (derived types)

Variable What for?
integer id IN Unique ID of the variable to protect
type(c_ptr) data_ptr IN/OUT Pointer to memory address of variable
integer err OUT Token for FTI error code.

ERROR HANDLING

err Reason
FTI_SCES Success
FTI_NSCS No success

DESCRIPTION

For the case that a protected variable changed its size or dimension, before the invokation of FTI_Recover during the restart, the pointed memory region has to be re-allocated. This may be done using FTI_Realloc.

NOTE: The whole allocated memory is accessable only in the first dimension of the returned array. I.e. A_old -> A_new(1:size(A_old),1,1,...)

EXAMPLE

For intrinsic data-types:

! ...

integer, parameter       :: N1=128*1024*25  !> 25 MB / Process
integer, parameter       :: N2=128*1024*50  !> 50 MB / Process
integer, parameter       :: N11 = 128       
integer, parameter       :: N12 = 1024
integer, parameter       :: N13 = 25
integer, parameter       :: N21 = 128       
integer, parameter       :: N22 = 1024
integer, parameter       :: N23 = 50
integer, target          :: FTI_COMM_WORLD
integer                  :: ierr, status

real(dp), dimension(:,:,:), pointer :: arr
real(dp), dimension(:,:,:), pointer :: tmp
integer, dimension(:), pointer      :: shape

allocate(arr(N11,N12,N13))
allocate(shape(3))

!> INITIALIZE MPI AND FTI    
call MPI_Init(ierr)
FTI_COMM_WORLD = MPI_COMM_WORLD
call FTI_Init('config.fti', FTI_COMM_WORLD, ierr)

!> PROTECT DATA AND ITS SHAPE
call FTI_Protect(0, arr, ierr)
call FTI_Protect(1, shape, ierr)

call FTI_Status(status)

!> EXECUTE ON RESTART
if ( status .eq. 1 ) then
    !> REALLOCATE TO SIZE AT CHECKPOINT
    call FTI_Realloc(0, arr, ierr)
    print *, ierr
    call FTI_recover(ierr)
    !> RESHAPE ARRAY
    arr(1:shape(1),1:shape(2),1:shape(3)) => arr(1:size(arr),1,1)

	! ...

end if

! ...

!> FIRST CHECKPOINT
call FTI_Checkpoint(1, 1, ierr)

! ...

!> CHANGE ARRAY DIMENSION
!> AND STORE IN SHAPE ARRAY
shape = [N21,N22,N23]
allocate(tmp(N21,N22,N23))
tmp(1:N11,1:N12,1:N13) = arr
deallocate(arr)
arr => tmp

!> TELL FTI ABOUT THE NEW DIMENSION
call FTI_Protect(0, arr, ierr)

! ...

!> SECOND CHECKPOINT
call FTI_Checkpoint(2,1, ierr)

! ...

For derived data-types:

! ...

use iso_c_binding

! ...

type polar
    real :: radius
    real :: phi
end type

type(FTI_Type)           :: FTI_Polar
integer, parameter       :: N1=128*102*25  !> 25 MB / Process
integer, parameter       :: N2=128*102*50  !> 50 MB / Process
integer, parameter       :: N11 = 128       
integer, parameter       :: N12 = 102
integer, parameter       :: N13 = 25
integer, parameter       :: N21 = 128       
integer, parameter       :: N22 = 102
integer, parameter       :: N23 = 50
integer, target          :: FTI_COMM_WORLD
integer                  :: ierr, status

type(polar), dimension(:,:,:), pointer :: arr
type(c_ptr)                            :: arr_c_ptr
type(polar), dimension(:,:,:), pointer :: tmp
integer, dimension(:), pointer         :: shape

allocate(arr(N11,N12,N13))
allocate(shape(3))

!> INITIALIZE C POINTER
arr_c_ptr = c_loc( arr( & 
    lbound(arr,1), &
    lbound(arr,2), &
    lbound(arr,3)))

! ...

!> PROTECT DATA AND ITS SHAPE
call FTI_Protect(0, arr_c_ptr, size(arr), FTI_Polar, ierr)
call FTI_Protect(1, shape, ierr)

call FTI_Status(status)

!> EXECUTE ON RESTART
if ( status .eq. 1 ) then
    !> REALLOCATE TO SIZE AT CHECKPOINT
    call FTI_Realloc(0, arr_c_ptr, ierr)
    call FTI_recover(ierr)
    !> RESHAPE ARRAY
    call c_f_pointer(arr_c_ptr, arr, shape)

	! ...

end if

! ...

!> FIRST CHECKPOINT
call FTI_Checkpoint(1, 1, ierr)

! ...

!> CHANGE ARRAY DIMENSION
!> AND STORE IN SHAPE ARRAY
shape = [N21,N22,N23]
allocate(tmp(N21,N22,N23))
tmp(1:N11,1:N12,1:N13) = arr
deallocate(arr)
arr => tmp

! ...

!> UPDATE C POINTER BEFORE CALL TO 'FTI_Protect'
arr_c_ptr = c_loc( arr( & 
    lbound(arr,1), &
    lbound(arr,2), &
    lbound(arr,3)))

!> TELL FTI ABOUT THE NEW DIMENSION
call FTI_Protect(0, arr_c_ptr, size(arr), FTI_Polar, ierr)

! ...

!> SECOND CHECKPOINT
call FTI_Checkpoint(2,1, ierr)

! ...

FTI_Status

⬆️ Top

  • Returns the current status of the recovery flag.

DEFINITION

subroutine FTI_Status ( status )

ARGUMENTS

Variable What for?
integer status OUT Token for status flag.

OUTPUT

Value Reason
0 No checkpoints taken yet or recovered successfully
1 At least one checkpoint is taken. If execution fails, the next start will be a restart
2 The execution is a restart from checkpoint level L4 and keep_last_checkpoint was enabled during the last execution

DESCRIPTION

This function returns the current status of the recovery flag.

EXAMPLE

call FTI_Status(status)

!> EXECUTE ON RESTART
if ( status .eq. 1 ) then
    
	! ...

	call FTI_recover(ierr)

	! ...

end if

FTI_Recover

⬆️ Top

  • Loads checkpoint data from the checkpoint file and initializes the runtime variables of the execution.

DEFINITION

subroutine FTI_Recover ( err )

ARGUMENTS

Variable What for?
integer err OUT Token for FTI error code.

ERROR HANDLING

Value Reason
FTI_SCES Success
FTI_NSCS Failure

DESCRIPTION

This function loads the checkpoint data from the checkpoint file and it up- dates some basic checkpoint information. It should be called after initial- ization of protected variables after a failure. see FTI_Realloc for the case that the variable size vary during execution. In that case FTI_Realloc has to be invoked before the call to FTI_Recover to prevent segmentation faults.

EXAMPLE

see example of FTI_Status.


FTI_Snapshot

⬆️ Top

  • Loads checkpoint data and initializes runtime variables upon recovery.
  • Writes multilevel checkpoints regarding their requested frequencies.

DEFINITION

subroutine FTI_Snapshot ( err )

ARGUMENTS

Variable What for?
integer err OUT Token for FTI error code.

ERROR HANDLING

Value Reason
FTI_SCES Successfull call (without checkpointing) or if recovery successful
FTI_NSCS Failure of FTI_Checkpoint
FTI_DONE Success of FTI_Checkpoint
FTI_NREC Failure on recovery

DESCRIPTION

This function loads the checkpoint data from the checkpoint file in case of restart. Otherwise, it checks if the current iteration requires checkpointing (see e.g.: ckpt_L1) and performs a checkpoint if needed (internal call to FTI_Checkpoint). Should be called after initialization of protected variables.

EXAMPLE

! ...

ptriter => iter
call FTI_Protect(0, ptriter, err)
call FTI_Protect(2, g, err)
call FTI_Protect(1, h, err)

do iter = 1, ITER_TIMES

  call FTI_Snapshot(err)

  call doWork(nbProcs, rank, g, h, localerror)

  ! ...

enddo

if ( rank == 0 ) then
  print '("Execution finished in ",F9.0," seconds.")', MPI_Wtime() - wtime
endif

! ...

FTI_Finalize

⬆️ Top

  • Frees the allocated memory.
  • Communicates the end of the execution to dedicated threads.
  • Cleans checkpoints and metadata.

DEFINITION

subroutine FTI_Finalize ( err )

ARGUMENTS

Variable What for?
integer err OUT Token for FTI error code.

ERROR HANDLING

Value Reason
FTI_SCES For application process
exit(0) For FTI process (only if head == 1)

DESCRIPTION

This function notifies the FTI processes that the execution is over, frees some data structures and it closes. If this function is not called on the end of the program the FTI processes will never finish (deadlock). Should be called before MPI_Finalize().

EXAMPLE

! ...

deallocate(h)
deallocate(g)

call FTI_Finalize(err)
call MPI_Finalize(err)

! ...