User Manual
The tool comes with several built-in profiles. Each profile has different options enabled or disabled in IPEDConfig.txt and other advanced configuration files. You can specify a processing profile by running:
iped.exe -d image.dd -o output -profile profile
If you do not specify a profile, the default one will be used. Below is a summary of the built-in profiles.
- default
Default features are enabled, like hash computation, signature analysis, container expansion, file indexing, thumbnail generation and regex scanning. Data carving is not enabled, and file slacks and unallocated space are not processed. However, file recovery based on file system tables is still performed.
- forensic
All features enabled in the default profile, plus data carving, file slack and unallocated space processing. NSRL hash database lookup is also enabled; you need to configure the NSRL hash index database, refer to NIST NSRL Importing.
- pedo
All features of the forensic profile are enabled, plus nudity detection, photoDNA, child porn hash database lookup, child porn hash database based data carving and emule known.met carving. Data carving for videos is also enhanced.
- fastmode
Fastest processing mode, to be used at crime scenes to preview data. All features that need access to file content are disabled, like hash computation, signature analysis, indexing, carving, regex scanning and thumbnail generation. It basically performs an ls of the file system tree. Files are still categorized based on extension, though, and you can preview file content, browse the file system tree, use the image gallery and apply filters based on any file metadata, like name, path, size or MAC times.
- triage
Like fastmode, but it indexes the file content of some document formats (office, pdf, html, emails, txt...) in common user directories. Image and video parsers are disabled. Some folders, like those containing system files, are not included in the case. So you can run some index searches in triage scenarios. Time to finish processing is very unpredictable and depends a lot on the user data volume. This profile is considered experimental for field use and is still unstable when executed on computers with limited resources.
- blind
Profile for automatic data extraction. It automatically extracts data from an evidence, generates an HTML report and a portable case with the file categories configured in profiles/blind/conf/CategoriesToExport.txt OR files containing keywords or regexes defined in profiles/blind/conf/KeywordsToExport.txt.
If you want to send the case (and evidence) to be analyzed by another person (e.g. examiner -> investigator), or if you simply want to analyze the case from another computer, you can use the --portable command line option. IPED will then store relative paths to evidences instead of absolute paths, so they can be found without problems independently of the mount point (drive letter).
This only works if the output folder is set on the same drive letter where the evidence is stored. If they are different, it is impossible to derive a relative path from case -> evidence and absolute paths will be stored.
If you used different drives for the output folder and the evidence at processing time because of performance considerations, you can move the evidence to the case folder or to one of its parents, keeping some (last) part of the original evidence path, and IPED will try to find the evidence using different path combinations. For example, if you used the paths below:
- case: a:\b\c\case
- evidence: d:\e\f\evidence
If not found, the evidence will be searched for in the following paths:
- a:\b\c\case\evidence
- a:\b\c\case\f\evidence
- a:\b\c\case\e\f\evidence
- a:\b\c\evidence
- a:\b\c\f\evidence
- a:\b\c\e\f\evidence
- a:\b\evidence
- a:\b\f\evidence
- a:\b\e\f\evidence
- a:\evidence
- a:\f\evidence
- a:\e\f\evidence
The lookup above applies only to disk image evidence, not to UFDR or AD1 evidence. If your evidence is a UFDR or AD1, or if it is not found, a dialog asking for the new path will be shown when the first actual file item is selected in the UI.
Starting with version 3.18, it is possible to resume a stopped or aborted processing, caused for example by an abrupt power shutdown, a system restart after an update, or an out of memory error or bug that made the processing abort.
By default, the application does a commit every 30 min; this can be changed with the "commitIntervalSeconds" param in the "conf/IndexTaskConfig.txt" file (be careful not to decrease it too much). If a processing aborts after the first commit, the evidence will be added to the case in an incomplete state, with just a subset of the items, and you must continue or restart the processing.
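For example, to commit once per hour instead of the default 30 minutes (1800 seconds), you could set in conf/IndexTaskConfig.txt:
commitIntervalSeconds = 3600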
If you want to resume an aborted processing, just repeat the exact same command line and add --continue to the end. The application will "resume" the processing from the last commit point (actually it will skip already committed items, which can take a few seconds or minutes). Note that if you configured indexTempOnSSD = true and outputOnSSD = false, the case index is created in indexTemp and copied to the output at the end of processing, so DON'T clear the indexTemp folder, or you will lose your case. If you are using --append, you can clear indexTemp, because your index was copied to the output at the end of the first processing and is not copied to indexTemp again.
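For example, assuming the processing was originally started with the command shown earlier in this manual (the profile name here is just illustrative), resuming it would look like:
iped.exe -d image.dd -o output -profile forensic --continue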
If you want to discard the last incomplete processing and restart from the beginning, just repeat the exact same command line and add --restart to the end. This is useful if you are appending a new evidence to an existing case with other evidences processed before and want to restart just the processing of the last evidence. If a single-evidence case aborted and you want to restart, it is better to delete the case and start from scratch instead of using --restart, although you could.
Currently the following hash algorithms are supported: md5, sha-1, sha-256, sha-512, edonkey, photoDNA (if you are law enforcement and investigate child abuse cases, ask for the plugin at [email protected]).
Note that sha-256 must be enabled for WhatsApp chat attachments to be linked to their conversations.
IPED uses the Apache Tika library to do file signature and type checking. Take a look at tika-mimetypes.xml for the default file types and signatures supported by Apache Tika.
Furthermore, IPED has its own custom signature definitions to complement the Tika supported list. It can be found at profiles/[locale]/[profile]/conf/CustomSignatures.xml
Based on the file types (mimetypes) detected by the signature analysis module, IPED classifies case items into a set of categories defined at profiles/[locale]/[profile]/conf/CategoriesByTypeConfig.txt. You can create new categories with new mimetypes defined in conf/CustomSignatures.xml, break existing ones, or merge some if desired. The category hierarchy is defined in the conf/CategoryHierarchy.txt config file. Those two files may be merged into a single xml or json file in the future.
Besides that, you can also create some categorization rules based on file properties like name, path, type, dates, size or a combination of them using javascript language. There are some pre-defined rules in profiles/[locale]/[profile]/conf/scripts/RefineCategoryTask.js, e.g. if the file is in $recycle.bin it is added to "Windows Recycle" category. Check that file for more examples.
To import a NIST NSRL hash database, first you must configure, in the LocalConfig.txt file, the kffDb option, where the hash index database will be stored. It is highly recommended to configure it on an SSD disk, or hash lookup will be very slow. It is the full path to the kff.db file that will be created when you import the database. To import NSRL hashes, download and decompress the NSRL zip file, then use the following command:
iped.exe -importkff NSRLFile_parent
where NSRLFile_parent is the parent folder of NSRLFile.txt file.
You should configure the conf/KFFTaskConfig.txt file, listing the programs that should receive alert status. All other programs not listed will receive ignore status.
The ProjectVIC hashset is also supported. Please contact ProjectVIC to ask for their hash database. You can configure the hashset json in the projectVicHashSetPath option in the LocalConfig.txt file and enable the hash lookup with enableProjectVicHashLookup in the IPEDConfig.txt file. If you have the photoDNA plugin installed, ProjectVIC photoDNA hashes will also be loaded and available for lookup while processing.
To enable photoDNA database lookup, first you need to put all photoDNA jars into the plugin folder (optional_jars). If you are Law Enforcement and investigate child exploitation cases, ask for them at [email protected].
The path to the photoDNA database must be configured in the photoDNAHashDatabase parameter in the LocalConfig.txt file. The database format is simply one photoDNA per line. A line may have other file info (e.g. file name, md5); * should be used as the separator.
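A hypothetical line in that format, with placeholders instead of real hash values:
<photoDNA_hash>*image001.jpg*<md5_hash>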
IPED 3.x versions support a specific child porn hash database format of another software called LED, developed by Wladimir Leite (@tc-wleite), used to triage data in the field. The support works for LED versions up to 1.28, since LED-1.29 changed its internal hash set format. The path to that database must be configured in ledWkffPath parameter in LocalConfig.txt file.
The database is built from one or more txt files stored inside a directory. Each txt file, named as you want, should contain one line of information for each file in the database. Each line contains different hash values and other file info, separated by " *" (1 space and 1 star, without the quotes), as follows:
MD5 *MD5_first_64K *Edonkey *SHA1 *MD5_first_512 *FileSize *FileName *MD5_first_1MB *SHA256
- MD5_first_64K is the md5 of the first 64KB of the file
- MD5_first_512 is the md5 of the first 512 bytes of the file
- MD5_first_1MB is the md5 of the first 1MB of the file
Those partial hashes are used by LED software for fast hash lookup at crime scenes, without the need to compute full file hashes.
If you have the files and want to create your own hash database in this format to use with IPED, you can use the CalcWKFF tool at https://github.com/tc-wleite/CalcWKFF
A single SQLite database with all hash sets is used during IPED processing. So, before processing evidences, the user must import hash sets to be used (or use a SQLite already prepared, with all desired hash sets imported).
In LocalConfig.txt, the full path of the hash database can be set in the hashesDB entry. It is highly recommended to store it on a fast disk, preferably an SSD, and not the same disk used as "indexTemp", if another disk is available. In IPEDConfig.txt it is possible to enable/disable the hash lookup using enableHashDBLookup.
The default input format that can be imported to IPED hashes DB is a simple CSV file, with a header defining hash types and associated properties. Example:
MD5,SHA1,status,set,comment
Hash1,Hash2,alert,Malware,Dangerous exploit
Hash3,Hash4,known,CommonFiles,
...
HashX,HashY,alert,,
There are no mandatory columns, except that at least one hash column and at least one property must be present. Currently supported hashes are MD5, SHA-1, SHA-256, SHA-512 and EDONKEY.
To import (or remove) hash sets from the database, the iped-hashdb.jar command line tool can be used. The very basic command is:
java -jar lib/iped-hashdb.jar -d <input> -o <output>
<input> is the hash set file or folder you want to import and <output> is the sqlite database file where the hashes will be imported to; if it doesn't exist, it will be created. Running with no parameters shows all available options, please take a look at them.
PS: If you get an error like UnsupportedClassVersionError: iped/engine/hashdb/HashDBTool has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0, update to java 11 or use IPED's embedded java 11, replacing java in the command above by jre\bin\java.
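For example, importing a hypothetical CSV hash set with the embedded java (the input and output paths are illustrative):
jre\bin\java -jar lib/iped-hashdb.jar -d myHashSet.csv -o C:\hashdb\hashes.db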
In the example above, if IPED processes a file with MD5=Hash1, its properties are added to the item with the prefix hashDb:. So the following properties would be added (available for searches, in the item's metadata/advanced properties tab, or to be added as a column in the main table of items):
Property Value
-------------- -----------------
hashDb:status alert
hashDb:set Malware
hashDb:comment Dangerous exploit
The NIST NSRL hash database can be imported to IPED hashes DB.
For RDS Version 2.x, the main file "NSRLFile.txt" and the product file "NSRLProd.txt" must be present in the same folder to allow importing of this hash set.
The NSRL hash sets are now released in a SQLite format (RDSv3). These SQLite databases can also be imported to IPED hashes DB.
MD5 and SHA-1 hashes, present in this hash set, are imported, associated with the following properties:
- set = NSRL
- status = known (or other value, see below)
- nsrlProductName = [List of one or more products associated with this hash]
- nsrlSpecialCode (RDS v2 only)
In the configuration file "NSRLConfig.json", a "status" can be defined for a group of products. This is similar to the configuration previously present in the file "KFFTaskConfig.txt", but now it is possible to set any status (not just "alert" as before), and it is also possible to define a default "status" for products not explicitly listed (the default is "known").
The NIST CAID Non-RDS hash set can also be imported to IPED hashes DB.
After downloading the ZIP file and expanding its content (several JSON files) to a folder, the hash set can be imported by processing that folder with the Hash DB tool. This hash set contains multimedia file hashes in the NSRL database, which can be used to filter out known multimedia files.
MD5 and SHA-1 hashes are imported, associated with the following properties:
- set = CAID
- status = known
- caidMediaSize
- caidFileName
ProjectVIC hash set is also supported. Please contact ProjectVIC directly to ask for their hash database.
When importing this file (that uses a JSON format), MD5 and SHA-1 hashes are imported, associated with the following properties:
- set = ProjectVIC
- vicCategory = (category 0, 1, 2 or 3, as defined in the input file)
- status = pedo (set only if vicCategory is 1 or 2), status = known (category = 0)
- vicSeries (if present)
- vicTags (if present)
- vicVictimIdentified (if present)
- vicOffenderIdentified (if present)
- vicIsDistributed (if present)
- vicIsPrecategorized (if present)
- photoDna (if present)
The INTERPOL ICSE database CSV can also be imported directly to IPED hashes DB.
Although it uses a CSV format, there are some specificities which required specialized treatment. Please open a discussion if you have trouble importing this file. The following properties are imported, when present:
- set = ICSE
- status = pedo
- photoDna
- icseImageID
- icseIsDistributed
- icseVictimIdentified
- icseOffenderIdentified
- icseSeriesName
- icseNumSubmissions
- icseFileAvailable
- icseMediaType
- icseBaseline
The original LED hashes database (CSAM) format is no longer supported directly by IPED, since its format changed, but it can be easily converted to a standard CSV format, which can then be imported to IPED's hash database as a generic CSV (see the example after this list). It should have header columns for the contained hashes and also the following headers when converting:
- set = LED
- status = pedo (for "traditional" files, not "drawings", see "ledCategory" below)
- ledGroup = [case number or other reference of the first appearance of this file]
- ledCategory = 1 ("traditional" LED files, including child abuse, child explicit files, sequences, and child "sensual" images), 3 (Child pornography related "drawings")
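A hypothetical converted CSV, following the generic CSV format described earlier (the hash values and group are placeholders):
MD5,SHA1,set,status,ledGroup,ledCategory
Hash1,Hash2,LED,pedo,case-0123,1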
Properties are multivalued fields. For example, items with hashes present in both the LED and ProjectVIC hash sets will have hashDb:set = [LED, ProjectVIC].
To enable photoDNA database lookup, first you need to put all photoDNA jars into the plugin folder (old optional_jars). If you are Law Enforcement and investigate child exploitation cases, ask for them at [email protected].
Then, to enable photoDNA calculation, set enablePhotoDNA in IPEDConfig.txt. Finally, to enable photoDNA lookup on IPED hash database, set enablePhotoDNALookup in the same configuration file.
The tool is able to recursively expand dozens of container formats, including all compressed formats supported by Apache Tika, as well as other container formats. The 'expandContainers' option in the 'IPEDConfig.txt' file controls this behaviour. Below is a summary (possibly out of date) list of supported formats:
Compressed: ZIP, RAR, TAR, 7Z, AR, ARJ, JAR, GZIP, BZIP, BZIP2, XZ, CPIO, UNIX-DUMP, Z, ZSTANDARD, LZ4, SNAPPY, BROTLI, LZMA, PACK200
Mail formats: DBX, PST, OST, MSG, MBOX, EML, EMLX
Documents: MS OFFICE, RTF, PDF, HTML
Other: ISO, SQLITE, MDB, EDB, OLE2
The conf/CategoriesToExpand.txt configuration file controls which categories will be expanded or not. You can comment out or uncomment categories to be expanded in that file. Note that depending on the profile used, different categories will or will not be expanded. For example, embedded files, like images present in documents (e.g. office, rtf, pdf and html), are extracted by default just in the pedo profile. Of course you can change the defaults of other profiles or create your own.
IPED has a regex scan module that searches for built-in or customizable regex patterns in all text extracted after file parsing. Like the data carving module, a unique finite state machine is built with all regex patterns, and scanning usually takes only 2% of total processing time. Currently there are regexes for emails, urls, ip addresses, credit card numbers, swift codes, money values, bitcoin, bitcoin cash, ripple, ethereum and monero wallets, and bitcoin private keys.
It is possible to write regex validators using javascript, to verify checksums and discard false positives found by the regexes. For example, there are validators for bitcoin, bitcoin cash, ripple and ethereum wallets, resulting in very few false positives.
Regex patterns are configured in conf/RegexConfig.txt file and you can look at conf/regex_validators folder for a regex script validator example.
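To illustrate the kind of checksum verification such a validator performs, below is a minimal Python sketch of Base58Check validation for bitcoin addresses. IPED's actual validators are javascript scripts under conf/regex_validators; this sketch only shows the underlying idea:

import hashlib

BASE58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58check_ok(addr):
    # Decode the base58 string into a big integer
    n = 0
    for ch in addr:
        idx = BASE58.find(ch)
        if idx < 0:
            return False  # character outside the base58 alphabet
        n = n * 58 + idx
    # Convert to bytes; each leading '1' encodes a leading zero byte
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    raw = b"\x00" * (len(addr) - len(addr.lstrip("1"))) + raw
    if len(raw) < 5:
        return False
    payload, checksum = raw[:-4], raw[-4:]
    # Valid addresses carry the first 4 bytes of sha256(sha256(payload)) as checksum
    return hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4] == checksum

print(base58check_ok("1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa"))  # True (valid address)
print(base58check_ok("1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNb"))  # False (corrupted last char)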
Currently IPED is able to recover more than 40 file formats:
bmp, emf, gif, png, jpg, cdr, webp, html, pdf, eml, ole2 and derived (doc, xls, ppt, msg, msi, thumbs.db), rar, zip and derived (docx, xlsx, pptx, odt, ods, odp, iwork, xps), wav, wma, cda, midi, avi, wmv, mp4, 3gp, mov, flv, mpg, shareh.dat, sharel.dat, known.met, vcard, kml, gpx, index.dat and sqlite
IPED has its own data carving engine, written from scratch. It builds one unique finite state machine with all configured signatures, the same kind of algorithm used by spell checkers to search huge dictionaries. So the number of configured signatures does not matter: searching for 10 or 1000 file signatures takes the same time. The time complexity is proportional to the amount of data scanned and the number of signatures found, not the number searched for.
So it is very fast, and because of that all bytes are scanned, not only cluster boundaries, so the tool is able to recover files that were contained in other files in the past, for example at odd offsets, or images embedded into deleted executables. Usually the data carving module takes less than 10% of total processing time. Of course, processing the recovered files themselves will take longer.
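IPED's actual engine is written in Java and is more sophisticated, but the toy Python sketch below illustrates the idea: an Aho-Corasick style automaton is built once from all signatures and then matches every signature in a single pass over the data, whatever the number of signatures. The signatures used are real magic numbers; everything else is illustrative:

from collections import deque

def build_automaton(signatures):
    # One trie node per signature prefix; each node has transitions,
    # a failure link and the set of signature names matched at that node
    goto, fail, out = [{}], [0], [set()]
    for name, sig in signatures.items():
        node = 0
        for b in sig:
            if b not in goto[node]:
                goto.append({}); fail.append(0); out.append(set())
                goto[node][b] = len(goto) - 1
            node = goto[node][b]
        out[node].add(name)
    # BFS to compute failure links (longest proper suffix that is also a prefix)
    queue = deque(goto[0].values())
    while queue:
        u = queue.popleft()
        for b, v in goto[u].items():
            queue.append(v)
            f = fail[u]
            while f and b not in goto[f]:
                f = fail[f]
            fail[v] = goto[f].get(b, 0)
            out[v] |= out[fail[v]]  # inherit matches ending at the suffix
    return goto, fail, out

def scan(data, goto, fail, out):
    # Single pass over the data: O(len(data) + number of matches)
    node = 0
    for i, b in enumerate(data):
        while node and b not in goto[node]:
            node = fail[node]
        node = goto[node].get(b, 0)
        for name in out[node]:
            yield i, name  # offset of the last byte of the matched signature

signatures = {"jpeg": b"\xff\xd8\xff", "png": b"\x89PNG", "zip": b"PK\x03\x04"}
automaton = build_automaton(signatures)
data = b"garbage\x89PNG\r\n\x1a\nmorePK\x03\x04bytes"
print(list(scan(data, *automaton)))  # [(10, 'png'), (22, 'zip')]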
It is possible to configure new carvers in the conf/CarverConfig.xml file, in the 'carverTypes' section, using headers, footers and the header offset position to get the file length. It is also possible to write advanced carvers using javascript, with custom rules for file size calculation and validation.
At the top of the 'CarverConfig.xml' file there are two elements, 'typesToProcess' and 'typesToNotProcess'; they specify which files will be scanned (submitted to the carving module), not to be confused with the file types that will be recovered (those are defined in the 'carverTypes' section). The default configuration enables 'typesToNotProcess': all files will be scanned, searching for deleted items inside them, except those listed in 'typesToNotProcess'. It is useless to scan compressed files and other files for which IPED already has a custom parser to extract embedded files from. If 'typesToProcess' is uncommented, just the files listed in 'typesToProcess' will be scanned, e.g. unallocated, pagefile, hiberfil, and so on. So this is a much more restrictive configuration.
You can also find the 'ignoreCorrupted' option in 'CarverConfig.xml'. When enabled, carved files that cause parsing errors are discarded and are not included in the case. This option is disabled by default in the 'forensic' and 'pedo' profiles and is enabled in all other profiles, so 'forensic' and 'pedo' recover more files by default.
If "LED Child Porn database" (explained below) is configured, its partial hashes are used by IPED for the hash based data carving, thanks to its author Wladimir Leite, enabled with enableKFFCarving option in IPEDConfig.txt. Basically, IPED scans unallocated space, computing hashes of blocks of 512 bytes at sector boundaries. When a block hash matches some MD5_first_512 into the database, the hash of the next 64KB block is computed. If it matches a MD5_first_64K into the database, the hash of the 1MB block is computed if the file size stored in the database is greater than 1MB. If it also matches or if file size is less than 1MB, IPED uses the file size info stored in the database to carve the file.
This technique allows recovering file formats not recovered by the default data carving module, if the file is present in the database. Furthermore, it allows recovering partially overwritten files, that had their footer overwritten for example, which is not done by the standard carving module by default.
The tool has encryption detection for some specific file types:
- MS Office (doc, docx, ppt, pptx, xls, xlsx)
- LibreOffice (odt, ods, odp)
- Compressed formats like zip, rar, 7zip
- PST
The detected files can be listed using the "Encrypted Files" filter in the analysis interface.
For other formats there is a generic entropy test module. The results can be checked using "Possibly Encrypted (entropy)" filter. This entropy filter can give false positives e.g. for compressed data with high compression ratio.
The tool uses the Apache Lucene library to index file content and metadata. Please check the search syntax at the Lucene site.
Although diacritics are ignored when indexing, and searches with or without them return the same results, we recommend using them in your searches because some file viewers will highlight the hits better if you include the original diacritics (words without diacritics will be highlighted too).
By default, just numbers and letters are indexed, so only they can be used in a search query. Other chars are considered word separators and are "converted to space" before indexing. So if you want to search for an expression with non alphanumeric chars, you need to surround your query with double quotes.
You can add additional chars to be indexed with the extraCharsToIndex option in the conf/IndexTaskConfig.txt file. Indexing extra chars is useful to export a more powerful case dictionary to be used in password attacks; the default dictionary includes just indexed chars (numbers and letters). Disabling lowercase conversion can also improve your word dictionary. But be very careful: searches will become case sensitive, and if you index, for example, the @ char, searches for john will not return [email protected]; you will have to search for john*
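For example, to also index the @ char (a hypothetical value; check the comments inside conf/IndexTaskConfig.txt for the exact syntax expected):
extraCharsToIndex = @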
Before indexing files, file content is decoded by dozens of file parsers to extract the text to be indexed. The text of unallocated space, file slacks, and unknown, corrupted or unsupported files is extracted using an internal "strings" implementation able to extract Latin1 scripts encoded in windows-1252, UTF-8 or UTF-16, even if those encodings are mixed in the same file. Other scripts are currently not supported by the generic strings extractor.
IPED uses the Tesseract 5 OCR engine to recognize text in images and scanned PDF files. By default, the portuguese dictionary is used, whose character set is a superset of the english one. If you want to add another language dictionary, download it from https://tesseract-ocr.github.io/tessdoc/Data-Files.html and put it in the tools/tesseract/tessdata folder. After that, you can change the language with the OCRLanguage option in the conf/OCRConfig.txt file.
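For example, to switch the OCR to an english dictionary (assuming Tesseract's standard language codes):
OCRLanguage = eng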
The OCR module is very slow. It usually takes about 4s per thread to process 1 image/page, depending on hardware. The processing time of a case with OCR enabled can increase from some hours to days.
The results are very dependent on image quality, resolution, font size and style. If you are not getting the expected results, try to use the ~ wildcard modifier in your search terms, like john~2: it will search for similar words with at most 2 chars mismatching the original term.
There is also a deprecated -ocr category command line option. It is used to restrict the OCR to some file category when processing the case, or to some bookmark when creating the report, instead of running on all supported formats.
By default, the OCR of PDFs processes just PDFs with less than 100 chars per page; the goal is to OCR primarily scanned documents. If you want to OCR every image in PDFs, turn the processImagesInPDFs option on in the conf/ParsingTaskConfig.txt file.
Java supports just bmp, jpg, gif, png and tif, so IPED uses the imagemagick tool to generate thumbnails and render hundreds of other image formats. To speed up the processing of the common JPG format, IPED by default extracts the thumbnail stored inside the EXIF section if it exists. Rarely, the extracted thumbnail can be different from the original image. You can disable that optimization by turning the extractThumb option off in the conf/ImageThumbsConfig.txt file. Carved images of the common formats supported by java, if they are partially overwritten, are partially rendered by the tool up to the point where the image can be decoded.
Frames of videos are extracted using the mplayer tool. Because of license incompatibility, mplayer is distributed side by side with the IPED zip, instead of embedded. Keep the mplayer folder side by side with the IPED folder, so it will be found and used to process videos. If you move the IPED folder and leave mplayer behind, frames of videos will not be extracted.
It is a common scenario in law enforcement to send the seized evidences to the forensic lab or unit just for data recovery and extraction, and the lab returns the extracted data to investigators to analyze the case. A lot of cases do not require the specific computer or cellphone forensic skills of digital forensic examiners to analyze the data, so investigators can do it. For those cases, usually operating system artifacts, program libraries and manuals, games, and known files are not relevant.
So IPED provides an automatic file extraction feature that filters out irrelevant system files and extracts only user activity related data. When enabled, the files are extracted to the case folder and the original evidence (dd/e01 images) is not needed anymore to open and analyze the case, so it can be sent to investigators without the original forensic images.
There are two ways to configure the file extraction feature: based on file categories or based on keywords, detailed below. If both are configured, the union of the results is extracted.
You can configure the conf/CategoriesToExport.txt config file to extract files based on categories. Just uncomment the categories you want to extract, and only they will be indexed and extracted to the case folder. All other file types will be ignored.
The blind profile uses this configuration with some predefined defaults.
You can configure the conf/KeywordsToExport.txt config file to extract files based on keywords found. Just write down your relevant case keywords, one per line, and every file which contains any keyword will be extracted. You can also use powerful regexes instead of keyword rules.
Currently IPED has a language detection module that is able to recognize about 70 different languages. The two languages with the highest probability are saved in the language:detected_1 and language:detected_2 properties, and their probabilities are saved in the language:detected_score_1 and language:detected_score_2 properties respectively, so you can sort or filter by them. There is also a language:all_detected property which stores both of the two most likely languages of a document.
This module uses natural language processing techniques to recognize entities, like person and organization names, places, dates and times, money... In conf/NamedEntityRecognitonConfig.txt you can choose between three different implementations: NLTK, Apache OpenNLP and StanfordCoreNLP. The last one has the best accuracy and is the current default.
Before enabling StanfordCoreNLP NER recognition, you must download the model for your language from StanfordCoreNLP download site and put it in the plugin folder (optional_jars). Currently IPED uses 3.8.0 version.
If your language is different from english, you should also change the langModel_0 = default option in the conf/NamedEntityRecognitonConfig.txt file to point to your downloaded model package. You can also download and configure other models to be applied to other languages detected by the language detection module.
LED is a brazilian federal police tool developed by @tc-wleite to scan computers at crime scenes searching for child sexual abuse related material. Its nudity detection algorithm was adapted to IPED by @tc-wleite.
This is the default nudity detection algorithm of the IPED tool. It was implemented and trained by @tc-wleite and can be enabled by turning enableLedDie on in IPEDConfig.txt. If enabled, for each image and video processed, the tool will create properties (columns) named nudityScore and nudityClass. The first is a nudity score from 1 to 1000; higher scores mean higher probability of nudity. The second is a nudity class ranging from 1 to 5, derived from the score; it is better if you want to apply some secondary sorting in the table tab.
This algorithm was trained using hundreds of thousands of illegal images from child sexual abuse cases, and hundreds of thousands of common images, so the two classes could be learned. The first set of images contains not just explicit images with sex or genitals exposed, but also sensual images of children and images from sequences (where the child begins clothed and gradually undresses). So this algorithm can return high scores for images of children without explicit sex or genitals exposed.
This is an alternative nudity detection algorithm of the tool. It uses a tensorflow implementation of the Yahoo Open NSFW model. Its weights were converted from caffe to a keras H5 file. When enabled, a property named nsfw_nudity_score is created for images (and videos, for IPED versions > 3.18.x) with a probability score ranging from 0 to 1. Please see the requirements to enable this module in Python Modules.
For versions <= 3.18.x you need to manually enable this module in conf/TaskInstaller.xml config file. For newer versions there is an explicit enableYahooNSFWDetection option in IPEDConfig.txt.
Note that this algorithm was trained on a different dataset from the LED algorithm, using adult sex images for example, but without the "images from sequences" explained in the previous section. So it returns much fewer sensual images without explicit sex or genitals exposed. Depending on the use case, one algorithm can be more suitable than the other.
Note this algorithm is about 10 times slower than the LED algorithm on the CPU; using a good GPU is highly recommended.
This module is automatically enabled if file contents are indexed, i.e. in all profiles except fastmode and triage. You can right click on a document in the analysis interface and click on Find Similar Documents. A dialog will ask for the similarity threshold, from 1 to 100; the default value is 70%. That means IPED will look for documents with at least 70% of words in common with the selected document, taking into account just the 50 most representative words of the selected document (a TF/IDF score is used to select the relevant words). If you get many false positives, you can increase the similarity threshold; if you get too few results, decrease it.
If you don't need this feature, you can turn off the storeTermVectors option in the conf/IndexTaskConfig.txt file. This option exists only in the triage profile (where it is disabled); in other profiles, simply include the option and set it to false to disable the feature. When disabled, the index size can decrease by about 25%.
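For example, adding the line below to conf/IndexTaskConfig.txt disables the feature:
storeTermVectors = false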
If the enableImageSimilarity option is enabled in the IPEDConfig.txt file, this module will compute image feature vectors and store them in the index, so you will be able to search for similar images (using euclidean distance). This module is experimental and uses a non state of the art algorithm, but it is very fast in turn. If enabled, click on the Options button in the analysis interface and choose Find Similar Images. You can search for images similar to the highlighted image in the Table tab, or you can provide an external reference image to search for in the case.
This is an experimental module and is not present in versions <= 3.18.x. To enable it, set enableFaceRecognition option to true in IPEDConfig.txt file. This module needs python and other required packages to be installed before, refer to Python Modules. It uses face detection techniques and computes features for faces using deeplearning algorithms, so you will be able to search for similar faces in the case.
If enabled, click on the Options button in the analysis interface and choose Find Similar Faces. You can search for faces similar to the first face detected in the highlighted image in the Table tab (the nearest face to the top, if the image has many faces), or you can provide an external reference image to search for its face in the case.
This feature transcribes audio files to text, so they can be indexed, searched for and scanned by the regex module. Currently there are three implementations: a local one using the VOSK library, and remote ones using MS Azure or Google Cloud. Note that your audios will be sent to the servers of those companies, and they will return the transcriptions.
You can choose the service provider in conf/AudioTranscriptConfig.txt; the current default is VOSK. You must set your language in that file, so the right language model, if supported by the service provider, will be used to transcribe your audios. We currently distribute small portable models for english and portuguese; the first one has a reasonable accuracy, the last one is not so good. If you want to try larger models or models for other languages, download them from the VOSK site and put/replace them in the models/vosk folder. It is possible to change the configuration to also transcribe audios from video files, changing the convertCommand and mimesToProcess options.
Regarding mimesToProcess: if you need to transcribe audio from mp3 files (they are not transcribed by default), you will need to add the specific mime type 'audio/mpeg' to this parameter. The parameter should look like below:
mimesToProcess = audio/3gpp; audio/3gpp2; audio/vnd.3gpp.iufp; audio/x-aac; audio/x-aiff; audio/amr; audio/amr-wb; audio/amr-wb+; audio/mp4; audio/ogg; audio/vorbis; audio/x-oggflac; audio/x-oggpcm; audio/opus; audio/speex; audio/qcelp; audio/vnd.wave; audio/x-caf; audio/x-ms-wma; audio/x-opus+ogg; audio/ilbc; audio/mpeg
Azure and Google need different IPED parameters to specify the credentials to use their services; take a look at conf/AudioTranscriptConfig.txt. Credential creation and setup on the Azure or Google platform is out of the scope of this manual, please check their official documentation. As of 2020, both Azure and Google Cloud used to give a first month credit of $200.00 to test the service, which at that time was enough to transcribe about 200 hours of audio. But be careful and read their service terms; we don't know what they do with your audios if you don't pay for or hire their services.
If the audios were sent by instant message apps (WhatsApp, Skype, Telegram), the transcription is inserted into the chat and the estimated confidence is printed. The confidence score of the transcription can also be used to sort or filter audios in the Table tab using the audio:transcriptConfidence property, and the transcribed text is stored in the audio:transcription property, so you can search for keywords in audios with high transcription confidence or listen to audios with low confidence.
Wav2Vec2 is another local transcription option besides Vosk, since some Vosk models don't have good accuracy (e.g. the portuguese one). It is a state of the art algorithm developed by Facebook that can be fine tuned for different languages. It is a heavy algorithm, so a good GPU is highly recommended; you need a 4GB to 8GB GPU depending on the model used. Accuracy test results of some models on some pt-BR data sets are summarized in a comparison table (not reproduced here), where:
- bold font means the model has better accuracy on the data set than the other models
- yellow background means the model was trained using the data set in the column header
Currently there are a local and a remote implementation of the Wav2Vec2 algorithm. You have to choose the implementation in the conf/AudioTranscriptConfig.txt file.
For the local one, you must set the implementationClass param to iped.engine.task.transcript.Wav2Vec2TranscriptTask and also set the huggingFaceModel parameter. You must also install the huggingsound python lib. On the first run, the selected model will be downloaded.
For the remote one, if you already have access to a transcription service, you just need to set the implementationClass param to iped.engine.task.transcript.RemoteWav2Vec2TranscriptTask and set the wav2vec2Service parameter in the conf/AudioTranscriptConfig.txt file of your IPED instance to the ip:port address of the central/naming node.
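A hypothetical remote configuration in conf/AudioTranscriptConfig.txt (the address is illustrative):
implementationClass = iped.engine.task.transcript.RemoteWav2Vec2TranscriptTask
wav2vec2Service = 192.168.0.10:11111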
If you have to configure the transcription GPU/CPU cluster, follow the steps below. It is composed of 2 types of nodes:
- a central/naming node, responsible for discovering the transcription nodes and pointing the clients to them. It is a lightweight node that doesn't need a powerful machine to run, but it should have some hardware redundancy to stay up and running. To start it: download IPED, switch to the IPED folder, then run:
jre\bin\java -cp lib/* iped.engine.task.transcript.RemoteWav2Vec2Discovery port
It will listen on port for client and transcription node connections.
- transcription nodes, which do the hard job. They should use a good GPU, since this is a heavy algorithm, and the GPU must have 4GB to 8GB of memory depending on the model. To start each of them: download IPED, install the huggingsound python lib, configure the huggingFaceModel parameter, switch to the IPED folder, then run:
jre\bin\java -cp lib\* iped.engine.task.transcript.RemoteWav2Vec2Service ip:port [localPort]
where:
ip:port is the ip and port of the central/naming node
localPort is an optional fixed port to listen for requests from clients
After starting the name node and at least one transcription node, you can run the IPED client instance.
If you want to analyze many cases at the same time, open a terminal inside one of your case folders and start the analysis app with the command below:
IPEDSearch-App.exe -multicases param
where param is the parent folder containing all case folders you want to analyze, or a txt file containing the full paths of all cases, one path per line. Passing a txt file is a bit faster, because IPED doesn't need to scan subfolders looking for cases, and you can open cases located in different mount points.
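For example, with a hypothetical C:\cases\cases.txt containing:
C:\cases\case001
D:\other-cases\case002
the app could be started with:
IPEDSearch-App.exe -multicases C:\cases\cases.txt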
It is possible to configure an external command line tool to parse a specific artifact not supported natively. IPED will automatically run the tool and import its output as a text or html report that will be indexed and searched for regex patterns. You should configure the external tool in conf/ExternalParser.xml configuration file. Below is a commented example extracted from the default configuration:
<parser>
<!--Name of the new external parser-->
<name>PrefetchParser</name>
<!--Windows relative path of the tool. Not needed if on PATH-->
<win-tool-path>tools/sccainfo/</win-tool-path>
<check>
<!--Check command. If it does not work, the parser will be disabled-->
<command>sccainfo -V</command>
<error-codes>1</error-codes>
</check>
<!--Command to be executed. ${INPUT} will be replaced by a temp file path of the artifact-->
<!--IPED will collect the result from stdout. If the tool outputs its result to a file, use the ${OUTPUT} parameter-->
<command>sccainfo ${INPUT}</command>
<!--All artifacts with this mime-type will be parsed by the external tool-->
<mime-types>
<mime-type>application/x-prefetch</mime-type>
</mime-types>
<!--Charset used in the output by the external tool-->
<output-charset>ISO-8859-1</output-charset>
<!--Number of starting lines of output to be discarded-->
<firstLinesToIgnore>0</firstLinesToIgnore>
<!--Set to true if the tool output is a html report, so it will be parsed correctly-->
<outputIsHtml>false</outputIsHtml>
</parser>
If you want the tool output to be rendered as an HTML report by IPED, even if it is a TXT output, configure the artifact mimetype in the supportedMimes parameter in the conf/MakePreviewConfig.txt configuration file.
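A hypothetical entry (check the existing entries in conf/MakePreviewConfig.txt for the exact list syntax):
supportedMimes = application/x-prefetch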
IPED-4.x already comes with a portable python distribution for Windows plus the JEP library (java-python bridge), compiled and included. So you can already run or write basic python tasks or parsers. If you want to run the advanced python modules listed below, you need to install their dependencies. First, you need to install pip into the iped portable python:
cd IPED_ROOT/python
python get-pip.py
set PATH=%PATH%;IPED_ROOT_ABSOLUTE_PATH/python/Scripts
cd Scripts
PS: The steps above will build pip using absolute paths (unfortunately); if you move the iped root folder, pip should be uninstalled and installed again.
Then install numpy:
pip install numpy
To use audio transcription on the GPU, you must have a CUDA-compatible NVIDIA graphics card. To check if your card is compatible, look for the product's official datasheet.
It is recommended to update CUDA drivers to the latest version. Download the CUDA toolkit from the NVIDIA website:
https://developer.nvidia.com/cuda-downloads
In the installer, use the custom installation option.
It is not necessary to install the other components, just "Driver components".
After installation, restart your computer.
Check the "NVidia Control Panel" for the updated version of the driver. Click on "System Information", then "Components". View the version of NVCUDA64.dll ( For example "NVIDIA CUDA 12.4.89 driver").
Another way to see the driver version is via the command line.
Open a command prompt in the windows root directory and run the command below (if more than one dll appears, check both):
c:\> dir /s /b nvcuda64.dll
Copy the dll path and run the following command in powershell (just an example, replace with your path):
get-item C:\Windows\System32\nvcuda64.dll | fl VersionInfo
Now check if your graphics card has a compatible version to run pytorch.
The minimum version of CUDA is 11.3. If, after updating the video card driver, it does not support at least version 11.3, it cannot be used for audio transcription. In this case, audio transcription must be done by the CPU only.
For faster-whisper, the minimum CUDA version is 11.3.
For WhisperX, the minimum CUDA version is 11.7, as it requires version 2.0 of Pytorch.
Source: https://github.com/pytorch/pytorch/blob/main/RELEASE.md
The next step is to install Pytorch. Go to the link below to get the command line that will be used for installation: https://pytorch.org/get-started/locally/
Choose the 'Compute Platform' based on the graphics card's CUDA driver version from the table presented in the installation link.
Copy the installation command, for example:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Open a command prompt in the iped installation directory, enter the "python" folder, and execute the copied command, replacing the 'pip3' part with '.\Scripts\pip.exe'. For example:
c:\iped-4.1.2\python\>.\Scripts\pip.exe install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
PS: For Cuda 11.6 version use:
c:\iped-4.1.2\python\>.\Scripts\pip.exe install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu116
To verify that the CUDA installation worked and is operational, run the IPED python as below:
c:\iped-4.1.2\python\>.\python.exe
Enter the following commands (you can copy and paste them directly):
import torch;
print('CUDA Available: ',torch.cuda.is_available());
print('CUDA FloatTensor: ',torch.cuda.FloatTensor());
print('CUDNN VERSION:', torch.backends.cudnn.version());
print('Number CUDA Devices:', torch.cuda.device_count());
print('CUDA Device Name:',torch.cuda.get_device_name(0));
print('CUDA Device Total Memory [GB]:',torch.cuda.get_device_properties(0).total_memory/1e9);
If the information appears error-free, it is very likely that CUDA is available for use in audio transcription.
Example:
>>> import torch;
>>> print('CUDA Available: ',torch.cuda.is_available());
CUDA Available: True
>>> print('CUDA FloatTensor: ',torch.cuda.FloatTensor());
CUDA FloatTensor: tensor([], device='cuda:0')
>>> print('CUDNN VERSION:', torch.backends.cudnn.version());
CUDNN VERSION: 8801
>>> print('Number CUDA Devices:', torch.cuda.device_count());
Number CUDA Devices: 1
>>> print('CUDA Device Name:',torch.cuda.get_device_name(0));
CUDA Device Name: Quadro P620
>>> print('CUDA Device Total Memory [GB]:',torch.cuda.get_device_properties(0).total_memory/1e9);
CUDA Device Total Memory [GB]: 2.147221504
If you are on Linux, first install python from your default repositories if you don't have it yet. Python 3.9.x was used in tests and is recommended; other versions are not supported and may result in unexpected errors (although they might work).
Then install numpy library:
pip install numpy
Install a Java JDK 11 and set JAVA_HOME environment variable:
export JAVA_HOME=/etc/jdk11.0.3
Then install JEP-4.0.3 (Java Embedded Python), since that version's java library is already linked to and used by iped:
pip install jep==4.0.3
After you have compiled JEP successfully, you need to add the (lib)jep.so parent folder to the LD_LIBRARY_PATH environment variable on Linux. Additional steps may be needed: https://github.com/ninia/jep/wiki/Linux
If you are using a recent version of Ubuntu (newer than 23.04), the Python environment is externally managed (https://askubuntu.com/q/1465218), so you have to create a virtual environment and use it to install the dependencies:
# install python venv
sudo apt install python3-venv
# prepare virtual environment
python3 -m venv ~/.venv/IPED
source ~/.venv/IPED/bin/activate
# check where pip command is located (should be /home/xxxx/.venv/IPED/bin/pip)
which pip
# install dependencies (examples)
pip install jep==4.0.3
pip install numpy
Before invoking the IPED process, you have to set up some environment variables:
source $HOME/.venv/IPED/bin/activate
export LD_LIBRARY_PATH="$VIRTUAL_ENV/lib/python3.12/site-packages/jep:$LD_LIBRARY_PATH"
Different modules need different dependencies to be installed; check below.
For the Yahoo NSFW nudity detection module:
pip install pillow
pip install tensorflow
Tested on Windows with python 3.9.12, pillow 8.1.0, keras 2.4.3, tensorflow 2.4.1.
We recommend running tensorflow on the GPU, see the requirements at the tensorflow site. For example, running on a cheap P620 GPU made this module 5x faster than running on a 48 thread CPU.
For the face recognition module:
pip install face_recognition
pip install opencv-python
Tested on Windows 10 with python 3.9.12, cmake 3.18.4.post1, face_recognition 1.3.0. Please note this module is, for example, slower on a P620 GPU than on a 48 thread CPU. If you install the libs with GPU drivers installed, the module will currently run on the GPU.
For the Wav2Vec2 audio transcription module:
pip install huggingsound
We strongly recommend using a good GPU to run this heavy module. Depending on the model used, your GPU should have GBs of memory, maybe 8GB.
For the faster-whisper audio transcription module:
pip install gputil
pip install faster-whisper
We strongly recommend using a high end GPU to run this heavy module. Depending on the model used, your GPU should have GBs of memory, maybe 8GB.
ALTERNATIVELY, if you have a good GPU with plenty of VRAM, run the commands below instead of the ones above; it can speed up the transcription of long audios a lot:
pip install gputil
pip install git+https://github.com/sepinf-inc/whisperx.git@multi-audio
WhisperX also needs FFmpeg to be installed and put on the PATH.
pip install bs4
To generate a report with bookmarked data, you have to click on the options button (or right click anywhere on the items table) and click on the last option, Create Indexed Report.
You have to select the bookmarks you want to include in the report. For each selected bookmark, it is possible to check the ThumbsOnly checkbox, so only image or video thumbnails will be exported to the report, not the raw file content. This is useful when you have huge amounts of images or videos to report.
By default, the tool automatically exports email and chat attachments. You can disable that behaviour, so only bookmarked attachments will be exported.
You must configure the output folder where the report will be generated. It must not exist or must be an empty folder.
Optionally, you can provide a keyword list you used during the analysis, so it will be included in the report.
You must fill in the case information, like case number, examiner name, requester name, evidence description... You can fill in that information manually or load it from a json file. The json file format can be seen in the example below:
{
reportNumber:"001/2019",
reportDate:"12/12/2019",
reportTitle:"Computer Forensic Examination",
examiners:["Luis Filipe da Cruz Nassif"],
caseNumber:"Case number 001",
requestForm:"Letter 001/2019",
requestDate:"01/12/2019",
requester:"Prosecutor John",
labCaseNumber:"001/2019",
labCaseDate:"01/12/2019",
evidences:[{id:"1",desc:"Computer 001"},{id:"2",desc:"External hard disk 001"},{id:"3",desc:"Thumbdrive 002"}]
}
Finally, just click on the Create button. It will generate an HTML report with the selected bookmarks and a portable case with the items included in those bookmarks, so you can send the portable case to the report requester.
They will be able to open the case and view the analyzed data the same way the examiner or investigator did. It will also be possible to do keyword searches on the portable case and perform additional analysis.