-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CENNSO-2033] rework logging #219
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good
{noreply, State}; | ||
|
||
handle_info(Info, State) -> | ||
?LOG(warning, "handle_info: ~p", [Info]), | ||
{noreply, State}. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It used to be recommended in a gen_server to have a cluse handle_info of unknown content. Perhaps it is no longer a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd keep it in notice, like "this is not normal but it's not causing an error". It would avoid a crash. But I don't see this proposal as much better anyway 🤷🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Production code should IMHO not get call/cast/info message that are not handled. Sure, it is always possible to send something though a remote shell, but I would that not call a normal situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@javiermtorres @ebengt I would be ok with a policy to always add catch all clauses for those with logs at level alert
(or even emergency
). That would make them noisy enough to catch them, but still be somewhat safe.
Very high log levels are IMHO justified because missing a info/cast/call could mean that something important was missed and it is impossible so determine if it was important without understanding the message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'd be fine with a high log level. The alternative is a crash (if we omit the catchall clause) that is equally visible.
Since infos catch any sent message, it's easier to stumble upon issues of processes sending messages to the wrong pid than calls/casts. I'd agree to not using catchall clauses for calls/casts.
src/ergw_aaa_static.erl
Outdated
@@ -107,7 +107,7 @@ handle_response({API, _}, #{'Result-Code' := Code}, Session, Events, State) -> | |||
handle_response(Procedure, #{'Result-Code' := Code}, Session, Events, State) -> | |||
{{fail, Code}, Session, [{stop, {Procedure, peer_reject}} | Events], State}; | |||
handle_response(_Procedure, Response, Session, Events, State) -> | |||
?LOG(error, "Response: ~p", [Response]), | |||
?LOG(debug, "unexpected response: ~p", [Response]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would not a log level of warning be better here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd agree to that 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i do not: https://www.erlang.org/doc/apps/kernel/logger_chapter.html#log-level
the audience of the logs is a user of software incorporating ergw_aaa. some external party floods your instance with garbage, the garbage is properly NOT handled (thus dropped). what is the error? what is the call to action to solve that error?
1c9f89c
to
4e06e93
Compare
src/ergw_aaa_static.erl
Outdated
@@ -107,7 +107,7 @@ handle_response({API, _}, #{'Result-Code' := Code}, Session, Events, State) -> | |||
handle_response(Procedure, #{'Result-Code' := Code}, Session, Events, State) -> | |||
{{fail, Code}, Session, [{stop, {Procedure, peer_reject}} | Events], State}; | |||
handle_response(_Procedure, Response, Session, Events, State) -> | |||
?LOG(error, "Response: ~p", [Response]), | |||
?LOG(debug, "unexpected response: ~p", [Response]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd agree to that 🤔
85348b6
to
583b795
Compare
The setting is used by the rate limit checks. However, those tests are designed to check for handling the rate limit only. The outstanding requests limits is not checked and hitting it interferes with the rate limit check, leading to spurious failure. Raise the value to a limit that it will no longer affect the rate limit tests.
The session process can not get an `EXIT` event from the owner since it is never linked to it. This is a left over from very ancient version of the session logic. Removed. Include the session information in the log for a `DOWN` event.
6052df3
to
d11c69f
Compare
* change logs to lebel debug that where intendet for debugging only, but where not using the debug level * remove obsolete debug logs * clarify some non-debug logs
A reference count underflow indicates a logic bug in the reference counting logic. Log it with proper criticallity, but otherwise try to continue.
d11c69f
to
f605150
Compare
@@ -264,19 +264,11 @@ handle_cast(stop, State) -> | |||
|
|||
handle_info({'DOWN', MRef, _Type, Pid, Info}, #state{peers = Peers0} = State) | |||
when is_map_key(MRef, Peers0) -> | |||
?LOG(alert, "got down for pending request ~0p (~p) with ~p", | |||
[maps:get(MRef, Peers0), Pid, Info]), | |||
?LOG(alert, "Diameter request process terminated unexpected with ~0tp (req: ~0tp, pid: ~0p)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://www.erlang.org/doc/apps/kernel/logger_chapter.html#log-level
which "action must be taken immediatly"? is something dead already? is everything about to be shut down?
@@ -285,7 +277,8 @@ handle_info({'DOWN', MRef, _Type, Pid, _Info}, State) -> | |||
{noreply, State}; | |||
|
|||
handle_info(Info, State) -> | |||
?LOG(warning, "handle_info: ~p", [Info]), | |||
?LOG(alert, "unexpected info message, something important might have been missed: ~p", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://www.erlang.org/doc/apps/kernel/logger_chapter.html#log-level
what shall the user of the software incorporating ergw_aaa do with this message?
the component of the software using this library is not visible here, thus the user of the software is unable to deduce in which area a problem is, especially when the description of the loglevel "alert" is "action must be taken immediately".
_ -> | ||
{{error, underflow}, Peers#{OH => Peer}} | ||
?LOG(emergency, "reference counting underflow for Diameter peer ~0tp", [OH]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://www.erlang.org/doc/apps/kernel/logger_chapter.html#log-level
the whole system became just unusable because the some diameter-peers vanished?
_Request, SvcName, {_, #diameter_caps{origin_host = {OH, _}}} = Peer, _CallOpts) | ||
when length(Errors) /= 0 -> | ||
%% the exact content of Errors is a bit unclear, dumping them is best we can do | ||
?LOG(error, "~0tp: decode of Diameter answer ~0tp from ~0tp failed, errors ~0tp", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://www.erlang.org/doc/apps/kernel/logger_chapter.html#log-level
the operator of the software using ergw_aaa is bombarded with faulty encoded diameter messages. ergw_aaa deals with this by … just not accepting those messages. what is the "error"? from the perspective of the operator everything is running just fine.
from a developer perspective i can understand "error", but from a user perspective which needs to see those logs within normal operations i do not see it. in this area, an increase of an error-metric plus a log message on debug-level (which the operator can activate on demand to find out the peer involved, the message, the error itself).
when length(Errors) /= 0 -> | ||
?LOG(error, "~p: decode of answer ~p from ~p failed, errors ~p", [SvcName, Msg, Peer, Errors]), | ||
%% the exact content of Errors is a bit unclear, dumping them is best we can do | ||
?LOG(error, "~0tp: decode of Diameter answer ~0tp from ~0tp failed, errors ~0tp", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my remarks before about error-log for diameter decoding errors.
when length(Errors) /= 0 -> | ||
?LOG(error, "~p: decode of answer from ~p failed, errors ~p", [SvcName, Peer, Errors]), | ||
%% the exact content of Errors is a bit unclear, dumping them is best we can do | ||
?LOG(error, "~0tp: decode of Diameter answer ~0tp from ~0tp failed, errors ~0tp", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my previous comments on that
when length(Errors) /= 0 -> | ||
?LOG(error, "~p: decode of answer from ~p failed, errors ~p", [SvcName, Peer, Errors]), | ||
%% the exact content of Errors is a bit unclear, dumping them is best we can do | ||
?LOG(error, "~0tp: decode of Diameter answer ~0tp from ~0tp failed, errors ~0tp", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my previous comment on that
?LOG(error, "~p:handle_request(~p, ~p, ~p)", | ||
[?MODULE, _Packet, _SvcName, _Caps]), | ||
handle_request(Packet, SvcName, {_, #diameter_caps{origin_host = {OH, _}}}) -> | ||
?LOG(error, "~0tp: unsupported Diameter request from peer ~p, raw request ~0tp", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my previous comment on that
handle_event(Type, Event, _State, _Data) -> | ||
?LOG(warning, "unhandled event ~p:~p", [Type, Event]), | ||
handle_event(info, Info, State, _) -> | ||
?LOG(alert, "unexpected info message in state ~p, something important might have been missed: ~p", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my previous comment on that
@@ -107,7 +107,8 @@ handle_response({API, _}, #{'Result-Code' := Code}, Session, Events, State) -> | |||
handle_response(Procedure, #{'Result-Code' := Code}, Session, Events, State) -> | |||
{{fail, Code}, Session, [{stop, {Procedure, peer_reject}} | Events], State}; | |||
handle_response(_Procedure, Response, Session, Events, State) -> | |||
?LOG(error, "Response: ~p", [Response]), | |||
?LOG(alert, "unexpected Diameter response, something important might have been missed: ~p", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my previous comment on that
No description provided.