Character Set Validator for C/C++ Code

The cvc tool checks whether C/C++ source files contain only characters from the basic source character set as defined in the C standard.

Motivation

Source code should be restricted to the basic source character set, primarily for reasons of portability, compatibility and security.

Compilers are designed to interpret and process code written using the basic source character set. Deviating from this set may lead to unexpected behavior or errors during compilation. By sticking to the standard, you minimize the risk of encountering compatibility issues with compilers.

Various development tools, such as code editors, IDEs, and static analysis tools, are built to understand and assist with code written according to the C standard. Conforming to the basic source character set ensures better support from these tools.

Unicode defines various control characters that are invisible or have no printable representation, such as Left-to-Right Override, Right-to-Left Isolate, and zero-width joiners. Most text editors do not display these characters, so their presence in source code is easy to miss.

Depending on the context and the specific characters used, these invisible characters can alter the meaning of the code. For example, they might introduce syntax errors, change variable names, or create conditions that are not apparent to the programmer but are interpreted by the compiler or interpreter in unintended ways. Malicious actors can potentially exploit this by inserting trojan code into a program's source code. This code might perform actions that the programmer did not intend, such as granting unauthorized access, leaking sensitive information, or altering the behavior of the program in unexpected ways.

Further reading: https://trojansource.codes/

Example

This example shows a shortened result for cargs.h. The file contains 35 invalid characters in total (validation with default settings). With the verbose output enabled, the file name and a list of lines with the invalid characters are printed to stdout.

$> cvc -f lib/cargs/cargs.h --verbose
file lib/cargs/cargs.h:
line 97: 0x40 (@)
line 103: 0x40 (@)
...
line 161: 0x40 (@)
line 166: 0x40 (@)
line 169: 0x60 (`) 0x60 (`)
line 172: 0x40 (@)
line 173: 0x40 (@)
line 178: 0x40 (@)
line 182: 0x60 (`) 0x60 (`)
line 184: 0x40 (@)
...
line 209: 0x40 (@)
line 210: 0x40 (@)
35

Encoding

cvc assumes that all characters are encoded as single bytes. However, the C23 standard (ISO/IEC 9899:2024) permits multibyte characters to represent members of the extended character set.

The purpose of cvc is to limit source code to the basic source character set, which does not include any locale-specific definition.
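One consequence of a single-byte view is that a multibyte character shows up as several bytes, none of which belongs to the basic source character set. The following C sketch illustrates this; the function name `count_non_ascii_bytes` is hypothetical and not part of cvc, and this README does not state whether cvc itself reports such a character once or once per byte.

```c
#include <stddef.h>

/* Hypothetical helper: counts bytes above 0x7F. Such bytes can never be
 * members of the basic source character set, whatever the encoding. */
size_t count_non_ascii_bytes(const unsigned char *buf, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (buf[i] > 0x7F) {
            count++;
        }
    }
    return count;
}
```

For the UTF-8 text "café", the letter "é" is encoded as the two bytes 0xC3 0xA9, so the function returns 2 for that input.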

The coding of characters in the table below is valid for the following character encodings (incomplete list):

ASCII / US-ASCII
UTF-8
Windows-1252
ISO/IEC 8859-1
ISO/IEC 8859-15
ISO 646 (USA/ASCII)

Basic source character set

hex	dec	char	remarks
09	9	HT	horizontal tab, see --noht option
0A	10	LF	line feed, EOL indicator, see -e/--eol option
0B	11	VT	vertical tab, see --vt option
0C	12	FF	form feed, see --ff option
0D	13	CR	carriage return, EOL indicator, see -e/--eol option
..	..	..	n/a
20	32	SP	space
21	33	!	-
22	34	"	-
23	35	#	-
24	36	$	C23, see -a/--all option
25	37	%	-
26	38	&	-
27	39	'	-
28	40	(	-
29	41	)	-
2A	42	*	-
2B	43	+	-
2C	44	,	-
2D	45	-	-
2E	46	.	-
2F	47	/	-
30..39	48..57	0-9	-
3A	58	:	-
3B	59	;	-
3C	60	<	-
3D	61	=	-
3E	62	>	-
3F	63	?	-
40	64	@	C23, see -a/--all option
41..5A	65..90	A-Z	-
5B	91	[	-
5C	92	\	-
5D	93	]	-
5E	94	^	-
5F	95	_	-
60	96	`	C23, see -a/--all option
61..7A	97..122	a-z	-
7B	123	{	-
7C	124	|	-
7D	125	}	-
7E	126	~	-
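
The table can be read as a per-byte predicate. The following is a minimal sketch in C, not cvc's actual implementation: the struct `cs_options` and the function `is_allowed_byte` are hypothetical names, and the mapping of the fields to --noht, --vt, --ff and -a/--all is an assumption based on the remarks column. LF and CR are accepted here because EOL consistency is a separate check.

```c
#include <stdbool.h>

/* Hypothetical option set mirroring the remarks column of the table. */
struct cs_options {
    bool allow_ht;  /* horizontal tab; assumed to be cleared by --noht */
    bool allow_vt;  /* vertical tab; assumed to be set by --vt */
    bool allow_ff;  /* form feed; assumed to be set by --ff */
    bool allow_c23; /* '$', '@', '`'; assumed to be set by -a/--all */
};

/* Returns true if byte c is permitted under the given options. */
bool is_allowed_byte(unsigned char c, const struct cs_options *opt)
{
    switch (c) {
    case 0x09: return opt->allow_ht;
    case 0x0A: /* LF */
    case 0x0D: /* CR */
        return true;
    case 0x0B: return opt->allow_vt;
    case 0x0C: return opt->allow_ff;
    case 0x24: /* $ */
    case 0x40: /* @ */
    case 0x60: /* ` */
        return opt->allow_c23;
    default:
        /* Remaining printable range from SP (0x20) to ~ (0x7E). */
        return c >= 0x20 && c <= 0x7E;
    }
}
```

With the settings `{ true, false, false, false }`, which correspond to the defaults described in this README, the byte 0x40 (@) is rejected, matching the output shown in the Example section.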

Validation

cvc checks the EOL first. By default, EOL is determined automatically based on the first occurrence of an EOL indicator. However, an expected EOL indicator can be specified with the -e/--eol option. The EOL indicator must be used consistently throughout the file. Otherwise, the validation stops at the first erroneous EOL indicator.
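
The EOL rule described above can be sketched in C as follows. This is an illustration under assumptions, not cvc's source: the names `eol_kind`, `eol_at` and `check_eol` are hypothetical, and the set of recognized indicators (LF, CRLF, CR) is assumed; only CRLF appears as an -e value in this README.

```c
#include <stddef.h>

/* Hypothetical EOL kinds; EOL_NONE doubles as "detect automatically". */
enum eol_kind { EOL_NONE, EOL_LF, EOL_CRLF, EOL_CR };

/* Classifies the EOL indicator starting at buf[i] and stores its length. */
enum eol_kind eol_at(const unsigned char *buf, size_t n, size_t i, size_t *len)
{
    *len = 0;
    if (i >= n) return EOL_NONE;
    if (buf[i] == '\n') { *len = 1; return EOL_LF; }
    if (buf[i] == '\r') {
        if (i + 1 < n && buf[i + 1] == '\n') { *len = 2; return EOL_CRLF; }
        *len = 1;
        return EOL_CR;
    }
    return EOL_NONE;
}

/* Returns 0 if every EOL indicator matches, otherwise the 1-based number of
 * the first line that ends with a different indicator. If expected is
 * EOL_NONE, the first indicator found becomes the expected one. */
size_t check_eol(const unsigned char *buf, size_t n, enum eol_kind expected)
{
    size_t line = 1;
    size_t i = 0;
    while (i < n) {
        size_t len;
        enum eol_kind k = eol_at(buf, n, i, &len);
        if (k == EOL_NONE) { i++; continue; }
        if (expected == EOL_NONE) {
            expected = k;      /* automatic detection */
        } else if (k != expected) {
            return line;       /* stop at the first erroneous indicator */
        }
        line++;
        i += len;
    }
    return 0;
}
```

Called on the bytes "hello\nworld" with `EOL_CRLF` as the expected indicator, `check_eol` returns 1, because line 1 ends with LF rather than CRLF.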

Usage

There are three ways to pass input data to the program:

specify a source file
pipe the data to cvc
type the input manually

$> cvc -f main.c --noht --verbose
file: main.c
0
$> cat main.c | cvc --noht --verbose
0
$> cvc -e CRLF --noht --verbose
hello
world(<CTRL> + <D>)
Unexpected end-of-line indicator in line 1!

Cooperation with other tools

cvc is designed with the UNIX philosophy in mind and is therefore intended to be combined with other command-line tools. The following examples show the basic idea:

Find all .c/.cpp/.h files recursively, starting from the current directory, and forward them to cvc for validation:

find . -regex '.*/.*\.\(c\|cpp\|h\)$' -exec cvc -f {} --verbose \;

Validate all .c/.h files recursively, starting from the current directory, and add up the results:

find . -regex '.*/.*\.\(c\|h\)$' -exec cvc -f {} \; | paste -sd+ - | bc

Show exit code:

cvc -f main.c
echo $?

Defaults

If cvc is used with default settings, the following applies:

horizontal tabs are allowed
form feed and vertical tabs are not allowed
EOL indicator will be detected automatically
characters $, @, ` are not allowed

Open Source Library

This program utilizes libcargs version 1.1.0, which is licensed under the MIT License.
