This repository has been archived by the owner on Jan 8, 2020. It is now read-only.
Luis Lázaro edited this page Mar 22, 2018 · 13 revisions

Motivation.

Map of processed files.

The name and size of each processed file are saved in a map. This map is written to disk and loaded again when the Flume agent starts. Before processing a file, the Flume source consults the map to find out whether the number of lines in the file has grown. If it has, the file is marked as modified and only the new lines are processed. If it has not changed, the file is not processed again, but it is still moved to the processed directory. The move can throw an exception if a file with the same name already exists in the processed directory; in that case the file is simply not moved.
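The bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the source's actual implementation: the class name `ProcessedFilesMap`, its method names, and the use of a `java.util.Properties` file as the on-disk format are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical sketch: remembers how many lines of each file were already processed. */
public class ProcessedFilesMap {
    private final Map<String, Long> linesByFile = new HashMap<>();

    /** Loads the map from disk at start-up; a missing store yields an empty map. */
    public static ProcessedFilesMap load(Path store) throws IOException {
        ProcessedFilesMap map = new ProcessedFilesMap();
        if (Files.exists(store)) {
            Properties props = new Properties();
            try (InputStream in = Files.newInputStream(store)) {
                props.load(in);
            }
            for (String name : props.stringPropertyNames()) {
                map.linesByFile.put(name, Long.parseLong(props.getProperty(name)));
            }
        }
        return map;
    }

    /** Writes the map to disk (assumed format: one "name=lines" entry per file). */
    public void save(Path store) throws IOException {
        Properties props = new Properties();
        linesByFile.forEach((name, lines) -> props.setProperty(name, Long.toString(lines)));
        try (OutputStream out = Files.newOutputStream(store)) {
            props.store(out, "processed files");
        }
    }

    /** Returns how many lines are new since the file was last recorded; 0 means unchanged. */
    public long newLines(String fileName, long currentLines) {
        long seen = linesByFile.getOrDefault(fileName, 0L);
        return Math.max(0L, currentLines - seen);
    }

    /** Records that the file has been processed up to the given line count. */
    public void markProcessed(String fileName, long currentLines) {
        linesByFile.put(fileName, currentLines);
    }
}
```

A caller would check `newLines(...)` before reading a file, process only that many trailing lines when the result is greater than zero, and then call `markProcessed(...)` followed by `save(...)`.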

Concurrency.

Defined Event States.

Moving processed files.

Deleting files once Flume has processed them is an interesting topic. The difficulty is not implementing a solution that deletes files successfully and safely, but choosing a criterion for when a file can be removed. Logically, the goal is to delete the file once it has been processed, but how can we know that a file processed by Flume is complete, that is, that no more lines will be appended to it? In a data streaming environment, defining which line is the last line of a file is hard. For this reason, moving processed files is allowed (and recommended) on the local file system, once processing reaches the end of the file or the input stream returns null. With remote file systems such as FTP, getting null from the input stream does not imply that the end of the file has been reached, only that one chunk of data has arrived.
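For the local file system case, the move step might look like the sketch below. It is an illustration under assumptions, not the project's code: `ProcessedFileMover`, `processThenMove`, and `moveToProcessed` are hypothetical names, and the line loop only counts lines where a real source would emit events. It relies on `Files.move` without `REPLACE_EXISTING`, which on the default file system throws `FileAlreadyExistsException` when the target name is taken.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: move a local file once it has been read to the end. */
public class ProcessedFileMover {

    /** Reads the file until readLine() returns null, then tries to move it. */
    public static long processThenMove(Path file, Path processedDir) throws IOException {
        long lines = 0;
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            // On a local file, null from readLine() means end of file.
            while (reader.readLine() != null) {
                lines++; // a real source would turn each line into an event here
            }
        }
        moveToProcessed(file, processedDir);
        return lines;
    }

    /** Moves the file; returns false and leaves it in place if the name is already taken. */
    public static boolean moveToProcessed(Path file, Path processedDir) throws IOException {
        Files.createDirectories(processedDir);
        Path target = processedDir.resolve(file.getFileName());
        try {
            Files.move(file, target); // no REPLACE_EXISTING: never overwrite
            return true;
        } catch (FileAlreadyExistsException e) {
            return false; // the move is simply skipped
        }
    }
}
```

The same end-of-stream test would not be safe for an FTP stream, since null there may only mark the end of the current chunk.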
