A clean, safe and flexible implementation of BERT, a data-structure format inspired by Erlang ETF.
This project is in active development, and should not be used in production yet.
Primary features:
- High level implementation of ETF in pure Erlang
- Atoms protection and limitation
- Fine grained filtering based on type
- Callback function or MFA
- Fallback to binary_to_term function on demand
- Drop terms on demand
- Term size limitation
- Custom options for term
- Property based testing
- BERT parser subset
- Depth type protection
- Fully documented
- +90% coverage
- 100% compatible with standard ETF
- 100% compatible with BERT
Secondary features:
- Global or fine grained statistics
- Profiling and benchmarking facilities
- Logging facilities
- Tracing facilities
- ETF path
- ETF schema
- Custom parser subset based on behaviors
- ETF as stream of data
- Usage example with ETF, BERT and/or custom parser
- Low level optimization (optimized module with merl)
Berty was created to easily replace the binary_to_term/1 and
binary_to_term/2 built-in functions. In fact, the implementation is
transparent in many cases. The big idea is to protect your system from
the outside, in particular from atom and memory exhaustion.
% create an atom from scratch
Atom = term_to_binary(test).
% An atom is automatically converted as binary
{ok, <<"test">>}
= berty:decode(Atom).
% different methods can be used to deal with atoms.
{ok, test}
= berty:decode(Atom, #{ atoms => {create, 0.2, warning} }).
% Other terms are supported
Terms = term_to_binary([{ok,1.0,"test",<<>>}]),
{ok, [{ok,1.0,"test",<<>>}]}
= berty:decode(Terms).
More features are present, for example, dropping terms or creating custom callbacks.
Lists = term_to_binary([1024,<<>>,"test"]).
% let's drop all integers
{ok, [<<>>, "test"]}
= berty:decode(Lists, #{ integer_ext => drop
, small_integer_ext => drop
}).
% let's create a custom callback
Callback = fun
(_Term, Rest) ->
{ok, doh, Rest}
end.
{ok, [doh, <<>>, "test"]}
= berty:decode(Lists, #{ integer_ext => {callback, Callback}
, small_integer_ext => {callback, Callback}
}).
% let's create another one.
Callback2 = fun
(Term, Rest) when 1024 =:= Term ->
logger:warning("catch term ~p", [1024]),
{ok, Term, Rest};
(Term, Rest) -> {ok, Term, Rest}
end.
{ok, [1024, <<>>, "test"]}
= berty:decode(Lists, #{ integer_ext => {callback, Callback2}
, small_integer_ext => {callback, Callback2}
}).
Those are simple examples; more features are present and more will be added. Here are the most important functions:
- berty:decode/1: standard BERT decoder with default options
- berty:decode/2: standard BERT decoder with custom options
- berty:decode/3: custom decoder with custom options
- berty:encode/1: standard BERT encoder with default options
- berty:encode/2: standard BERT encoder with custom options
- berty:encode/3: custom encoder with custom options
- berty:binary_to_term/1: wrapper around binary_to_term/1
- berty:term_to_binary/1: wrapper around term_to_binary/1
rebar3 compile
rebar3 shell
rebar3 as test eunit
rebar3 as test shell
Mainly because of atom management. In fact, binary_to_term/1 and
term_to_binary/1 are not safe: if unknown data comes from an
untrusted source, it is quite easy to kill the node by overflowing
the atom table managed by the node itself, and probably a full
cluster as well if this data is shared.
% first erlang shell
file:write_file("atom1", term_to_binary([ list_to_atom("$test-" ++ integer_to_list(X)) || X <- lists:seq(1,1_000_000) ])).
% second erlang shell
file:write_file("atom2", term_to_binary([ list_to_atom("$test-" ++ integer_to_list(X)) || X <- lists:seq(1_000_000,2_000_000) ])).
Now restore those 2 files on another node.
% third erlang shell
f(D), {ok, D} = file:read_file("atom1"), binary_to_term(D).
f(D), {ok, D} = file:read_file("atom2"), binary_to_term(D).
no more index entries in atom_tab (max=1048576)
Crash dump is being written to: erl_crash.dump...done
Doh. The Erlang VM crashed. We can fix that in many different ways; here are a few examples:
- avoid using the binary_to_term/1 and term_to_binary/1 functions, and instead create our own parser based on the ETF specification. When terms are deserialized, atoms can be (1) converted to existing atoms, (2) converted to binaries or lists, or (3) simply dropped or replaced with something that alerts the VM that this part of the data is dangerous.
- keep our own local atom table containing all deserialized atoms. A soft/hard limit can be set.
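The first idea above (convert to an existing atom, otherwise fall back to a binary) can be sketched in a few lines. This is only an illustration, not how Berty is implemented: the module name `atom_decode_sketch` and the function `decode_atom/1` are hypothetical, and only a top-level atom payload is handled.

```erlang
-module(atom_decode_sketch).
-export([decode_atom/1]).

%% Hypothetical helper: decode a top-level atom payload without ever
%% creating a new atom. Tags: 100 = ATOM_EXT, 118 = ATOM_UTF8_EXT,
%% 119 = SMALL_ATOM_UTF8_EXT. 131 is the ETF version byte.
decode_atom(<<131, 100, Len:16, Name:Len/binary, Rest/binary>>) ->
    to_safe_term(Name, latin1, Rest);
decode_atom(<<131, 118, Len:16, Name:Len/binary, Rest/binary>>) ->
    to_safe_term(Name, utf8, Rest);
decode_atom(<<131, 119, Len:8, Name:Len/binary, Rest/binary>>) ->
    to_safe_term(Name, utf8, Rest);
decode_atom(_Other) ->
    {error, unsupported}.

%% Reuse the atom if the node already knows it, otherwise keep the
%% name as a binary so the atom table never grows.
to_safe_term(Name, Encoding, Rest) ->
    try binary_to_existing_atom(Name, Encoding) of
        Atom -> {ok, Atom, Rest}
    catch
        error:badarg -> {ok, Name, Rest}
    end.
```

An atom already known to the node (such as `ok`) comes back as an atom, while an unknown name comes back as a binary, so untrusted input cannot grow the atom table through this function.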
In fact, a simple solution already exists: the safe option of
binary_to_term/2 (its other option, used, only changes the return
value to also report how many bytes were read). It will protect you
from creating non-existing atoms, but how many projects are using
that?
- mojombo/bert.erl: https://github.com/mojombo/bert.erl/blob/master/src/bert.erl#L25

      -spec decode(binary()) -> term().
      decode(Bin) ->
        decode_term(binary_to_term(Bin)).

- mojombo/ernie: https://github.com/mojombo/ernie/blob/master/elib/ernie_server.erl#L178

      receive_term(Request, State) ->
        Sock = Request#request.sock,
        case gen_tcp:recv(Sock, 0) of
          {ok, BinaryTerm} ->
            logger:debug("Got binary term: ~p~n", [BinaryTerm]),
            Term = binary_to_term(BinaryTerm),

- synrc/n2o: https://github.com/synrc/n2o/blob/master/src/services/n2o_bert.erl#L8

      encode(#ftp{}=FTP) -> term_to_binary(setelement(1,FTP,ftpack));
      encode(Term) -> term_to_binary(Term).
      decode(Bin) -> binary_to_term(Bin).

- ferd/bertconf: https://github.com/ferd/bertconf/blob/master/src/bertconf_lib.erl#L10

      decode(Bin) ->
        try validate(binary_to_term(Bin)) of
          Terms -> {ok, Terms}
        catch
          throw:Reason -> {error, Reason}
        end.

- a13x/aberth: https://github.com/a13x/aberth/blob/master/src/bert.erl#L25

      -spec decode(binary()) -> term().
      decode(Bin) ->
        decode_term(binary_to_term(Bin)).

- yuce/bert.erl: https://github.com/yuce/bert.erl/blob/master/src/bert.erl#L24

      -spec decode(binary()) -> term().
      decode(Bin) ->
        decode_term(binary_to_term(Bin)).

- And probably many more, as searches on searchcode.com or github.com suggest.
It's highly probable that many of those functions are hard to call
from the outside, but some of them could be. In situations where
unknown data is coming in, erlang:binary_to_term/1 and even
erlang:binary_to_term/2 should be avoided or used carefully.
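As a minimal sketch of what "used carefully" could look like with the built-in function, the wrapper below always passes the `safe` option and turns the resulting exception into an error tuple. The module name `safe_decode_sketch` is hypothetical. Note that `safe` only addresses atom creation (and a few other resources such as external funs); it does not limit the size of the decoded term.

```erlang
-module(safe_decode_sketch).
-export([decode/1]).

%% Decode an external term without allowing new atoms to be created.
%% binary_to_term/2 raises badarg when the payload is malformed or,
%% with the safe option, when it refers to an atom unknown to the node.
decode(Bin) when is_binary(Bin) ->
    try
        {ok, erlang:binary_to_term(Bin, [safe])}
    catch
        error:badarg -> {error, badarg}
    end.
```

The caller can then pattern match on `{ok, Term}` or `{error, badarg}` instead of letting the process crash on hostile input.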
A few articles[1][2] have been written in the past to explain these problems. On my side, if I were in charge of fixing this issue, I would probably do it in two steps.
In the first step, I would probably create a workaround on atom creation function, with a soft/hard limit. When we reach the soft limit, warnings are displayed saying we reached the soft limit, but we can still create new atoms. When reaching the hard limit, atoms can't be created anymore, and exceptions are raised instead of crashing the host.
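The soft/hard limit idea can be approximated today in user code, without touching the VM, by reading the atom table usage before accepting data. The sketch below is an assumption about how such a check might look, not an existing Berty or OTP feature: the module name `atom_budget_sketch`, the function `check/2` and the return values are all hypothetical. It relies on `erlang:system_info(atom_count)` and `erlang:system_info(atom_limit)`, which are available since OTP 20.

```erlang
-module(atom_budget_sketch).
-export([check/2]).

%% Compare the current atom table usage with a soft and a hard limit.
%% Both limits are ratios (0.0 to 1.0) of the node's atom table size.
check(Soft, Hard) when Soft =< Hard ->
    Count = erlang:system_info(atom_count),
    Limit = erlang:system_info(atom_limit),
    Ratio = Count / Limit,
    if
        %% hard limit reached: the caller should refuse to create atoms
        Ratio >= Hard -> {error, {hard_limit, Count, Limit}};
        %% soft limit reached: atoms can still be created, but warn
        Ratio >= Soft -> {warning, {soft_limit, Count, Limit}};
        true -> ok
    end.
```

A decoder could call `atom_budget_sketch:check(0.8, 0.95)` before converting a name to an atom and fall back to a binary when it gets `{error, _}`. Unlike a fix inside the VM, this check is only advisory: other code on the node can still create atoms freely.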
In a second step, I would probably create a flexible interface to deal with atoms and divide the problem in half:
- create a fixed atom store containing only atoms from source code (Erlang release and project); this one can't grow.
- create a second atom store containing atoms dynamically created at runtime; this one can grow.
What I worry about is Mnesia. What could happen if someone adds more than 2M unwanted atoms to Mnesia or DETS? How would the cluster behave? And how could that be fixed if it's critical?
Unfortunately, I think it will totally break atom performance, but it could be an interesting project to learn how Erlang BEAM works under the hood.
Well, it depends. If you are receiving a (very) long string or list containing terms, it will have a direct impact on memory, and it will eventually lead to memory exhaustion:
% size of the list should be checked
% if not, memory exhaustion can happen
[ $1 || _ <- lists:seq(0,160_000_000) ].
% eheap_alloc: Cannot allocate 3936326656 bytes of memory (of type "heap").
% Crash dump is being written to: erl_crash.dump...
Same behavior can be generated using binaries:
% big binaries can crash the BEAM
binary_to_term(<<131, 111, 4294967294:32/unsigned-integer, 0:8/integer, 255:8, 0:4294967280/unsigned-integer>>).
% binary_alloc: Cannot allocate 4294967293 bytes of memory (of type "binary").
% Crash dump is being written to: erl_crash.dump...
ETF payloads carrying very large integers can also have an impact on CPUs. The following code does not crash the node, but it can lead to a denial of service, especially if many processes decode such payloads at the same time:
% big payload, high cpu usage, no crash.
% size of the big integer must be checked
% size: 2**18-1, binary byte size: 262_150 (~262kB)
_ = binary_to_term(<<131, 111, 262_143:32/unsigned-integer, 0:8/integer, 255:2_097_144/unsigned-integer>>).
% size: 2**19-1, binary byte size: 524_294 (~524kB)
_ = binary_to_term(<<131, 111, 524_287:32/unsigned-integer, 0:8/integer, 255:4_194_296/unsigned-integer>>).
% size: 2**20-1, binary byte size: 1_048_582 (~1MB)
_ = binary_to_term(<<131, 111, 1_048_575:32/unsigned-integer, 0:8/integer, 255:8_388_600/unsigned-integer>>).
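The comments above say that sizes "must be checked". One way to do that before calling a decoder is to read the length declared in the header of the top-level term and reject the payload when it exceeds a limit. The sketch below is only an illustration under that assumption: the module name `size_check_sketch` and the error tuple are hypothetical, nested terms are not inspected, and other tags (including compressed payloads) pass unchecked.

```erlang
-module(size_check_sketch).
-export([check/2]).

%% Read the length declared by the top-level term and refuse it when
%% it is larger than Max. Tags: 109 = BINARY_EXT, 108 = LIST_EXT,
%% 107 = STRING_EXT, 111 = LARGE_BIG_EXT. 131 is the ETF version byte.
check(<<131, 109, Len:32, _/binary>>, Max) -> limit(binary_ext, Len, Max);
check(<<131, 108, Len:32, _/binary>>, Max) -> limit(list_ext, Len, Max);
check(<<131, 107, Len:16, _/binary>>, Max) -> limit(string_ext, Len, Max);
check(<<131, 111, Len:32, _/binary>>, Max) -> limit(large_big_ext, Len, Max);
%% any other tag is accepted unchecked in this sketch
check(<<131, _/binary>>, _Max) -> ok;
check(_, _Max) -> {error, invalid}.

limit(_Type, Len, Max) when Len =< Max -> ok;
limit(Type, Len, Max) -> {error, {too_large, Type, Len, Max}}.
```

Because only the first few header bytes are matched, the check itself stays cheap even for a hostile payload; a real decoder would have to apply the same rule recursively to every nested term.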
Creating a long node name can crash the VM during startup, because the
name of the node is an atom (encoded using an atom_ext term), and an
atom is limited to 255 characters. If the full name of the node is
longer than 255 characters, the VM crashes.
erl -sname $(pwgen -A0 252 1)
# Crash dump is being written to: erl_crash.dump...done
erl -name $(pwgen -A0 246 1)@localhost
# Crash dump is being written to: erl_crash.dump...done
It's highly probable other terms can have a deadly impact on a node or a cluster.
The problem comes from atoms, and at least one paper[3] has discussed it. Fixing the garbage collection issue could help a lot, but if that's not possible for whatever reason, using a high level implementation of ETF with some way to control what kind of data is coming in might be an "okayish" solution.
The "Let it crash" philosophy is quite nice when developing high level
applications interacting in a safe place, but this philosophy can't be
applied in a place where uncontrolled data is coming in. Some
functions, like binary_to_term/1, must be avoided at all costs.
This answer is a draft, a sandbox to design an Erlang ETF Schema feature.
It might be great to have a syntax to create ETF schemas, a bit like protobuf[4], JSON Schema[5], XML[6] (with XSLT[7]) or ASN.1[8]. In fact, when I started looking for something around this feature, I also found the UBF[9] project from Joe Armstrong.
schema1() ->
integer().
schema2() ->
tuple([[atom(ok), integer()]
,[atom(error), string(1024)]]).
% fun ({ok, X}) when is_integer(X) -> true;
% ({error, X}) when is_list(X) andalso length(X) =< 1024 -> is_string(X);
% (_) -> false.
schema3() ->
tuple(
Here is the final representation.
[{tuple, [{atom, [ok]}, {integer, []}]}
,{tuple, [{atom, [error]}, {string, [1024]}]}
]
% or
[[tuple, [2]]
,[atom, [ok,error]]
,[integer, []]
,[string, [1024]]
].
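To make the draft more concrete, here is a minimal sketch of a validator for the first representation above (a list of alternatives, each one a `{Type, Args}` tuple). Everything in it is an assumption made for illustration: the module name `etf_schema_sketch`, the function `validate/2`, and the reading of `{atom, [ok]}` as "one of these atoms" and `{string, [1024]}` as "a printable list of at most 1024 characters".

```erlang
-module(etf_schema_sketch).
-export([validate/2]).

%% A schema is a list of alternatives; a term is valid when at least
%% one alternative matches it.
validate(Alternatives, Term) when is_list(Alternatives) ->
    lists:any(fun(Spec) -> match(Spec, Term) end, Alternatives).

%% {atom, Allowed}: any atom when Allowed is empty, else one of them.
match({atom, Allowed}, Term) when is_atom(Term) ->
    Allowed =:= [] orelse lists:member(Term, Allowed);
match({integer, []}, Term) ->
    is_integer(Term);
%% {string, [Max]}: printable list, checked before calling length/1
%% so that improper lists are rejected instead of raising.
match({string, [Max]}, Term) when is_list(Term) ->
    io_lib:printable_list(Term) andalso length(Term) =< Max;
%% {tuple, Specs}: same arity, and every element matches its spec.
match({tuple, Specs}, Term)
  when is_tuple(Term), tuple_size(Term) =:= length(Specs) ->
    lists:all(fun({Spec, Elem}) -> match(Spec, Elem) end,
              lists:zip(Specs, tuple_to_list(Term)));
match(_Spec, _Term) ->
    false.
```

With the schema shown above, `{ok, 1}` and `{error, "bad input"}` are accepted while `{ok, "x"}` is rejected. This sketch validates an already decoded term; applying the schema during decoding, before any atom or large term is created, would be the safer design.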
Another feature, similar to xmlpath or jsonpath, is required as well: an easy and comprehensible syntax needs to be created. I would like to include:
- pattern matching
% how to create an etf path?
% first example
% ETF = #{ key => #{ key2 => { ok, "test"} } }.
"test" = path(ETF, "#key#key2{ok,@}")
% second example
% ETF = [{ok, "test"}, {error, badarg}, {ok, "data"}].
[{ok, "test"},{ok, "data"}] = path(ETF, "[{ok,_}]")
% or
[]{ok,_}
% third example
% ETF = {ok, #{ <<"data">> => [<<"test">>] }}.
[<<"test">>] = path(ETF, "{ok,@}#!data").
When I wrote Serialization series — Do you speak Erlang ETF or BERT? (part 1) in 2017, someone told me to check another project called jem.js and to read the Replacing JSON when talking to Erlang (archive) blog post. What's funny here is that:
handle_post(Req, State) ->
{ok, Body, Req1} = cowboy_req:body(Req),
Decoded = erlang:binary_to_term(Body),
Reply = do_whatever(Decoded),
{erlang:term_to_binary(Reply), Req1, State}.
Yes, "Faster and more efficient", but it can destroy your whole platform in a few seconds. Don't do that. Please. Unfortunately, inaka.net seems to be down; it would have been funny to play with that.
Probably, but I did not find a lot on that. Here is a short summary of each term, whether it is safe or not, and the associated risk(s).
Terms | Code | Safe? | Risks |
---|---|---|---|
ATOM_CACHE_REF | 82 | no | atom exhaustion |
ATOM_EXT | 100 | no | atom exhaustion |
ATOM_UTF8_EXT | 118 | no | atom exhaustion |
BINARY_EXT | 109 | maybe | dynamic binary length (32bits) |
BIT_BINARY_EXT | 77 | maybe | dynamic bitstring length (32bits) |
EXPORT_EXT | 113 | no | atom exhaustion |
FLOAT_EXT | 99 | yes | 31 bytes float fixed length |
FUN_EXT | 117 | no | atom exhaustion |
INTEGER_EXT | 98 | yes | 4 bytes fixed length |
LARGE_BIG_EXT | 111 | maybe | dynamic integer length (32bits) |
LARGE_TUPLE_EXT | 105 | maybe | dynamic tuple length (32bits) |
LIST_EXT | 108 | maybe | dynamic list length (32bits) |
LOCAL_EXT | 121 | yes | atom exhaustion |
MAP_EXT | 116 | maybe | dynamic pair length (32bits) |
NEWER_REFERENCE_EXT | 90 | no | memory exhaustion |
NEW_FLOAT_EXT | 70 | yes | 8 bytes fixed float |
NEW_FUN_EXT | 112 | no | atom exhaustion |
NEW_PID_EXT | 88 | no | atom exhaustion |
NEW_PORT_EXT | 89 | no | atom exhaustion |
NEW_REFERENCE_EXT | 114 | maybe | dynamic reference length (16bits) |
NIL_EXT | 106 | yes | fixed length |
PID_EXT | 103 | no | atom exhaustion |
PORT_EXT | 102 | no | atom exhaustion |
REFERENCE_EXT | 101 | no | atom exhaustion |
SMALL_ATOM_EXT | 115 | no | atom exhaustion |
SMALL_ATOM_UTF8_EXT | 119 | no | atom exhaustion |
SMALL_BIG_EXT | 110 | maybe | dynamic integer length (8bits) |
SMALL_INTEGER_EXT | 97 | yes | fixed size |
SMALL_TUPLE_EXT | 104 | maybe | dynamic tuple length (8bits) |
STRING_EXT | 107 | maybe | dynamic string length (16bits) |
V4_PORT_EXT | 120 | no | atom exhaustion |
Footnotes
- https://erlef.github.io/security-wg/secure_coding_and_deployment_hardening/atom_exhaustion.html
- Atom garbage collection by Thomas Lindgren, https://dl.acm.org/doi/10.1145/1088361.1088369