The most basic approaches to designing a schema for intrusion detection using
BSM suffer from two flaws:
- BSM logs contain enormous amounts of data. For any database approach to be
efficient it must eliminate redundancy. The redundancy exists at the
token-field level (not at the token or the event level). Both simple
approaches (table per token and table per event) therefore fail to remove
it.
EXAMPLE: consider user IDs. The number of distinct uids seen over a single
day would be on the order of 50-100, but the audit events and corresponding
tokens in a BSM log may number several thousand. Each of those events
repeats one of the same 50-100 uids. If we adopt a table-per-token or a
table-per-event approach, these user IDs will be replicated many times over.
Therefore the idea behind the hybrid schema is to remove redundancy from the
data by giving each field its own table, so that each distinct value of a
field is stored only once.
- BSM logs record very detailed information about each event; that is, they
are exhaustive in their approach. This adds to their bulk. However, when we
are detecting intrusions, we do not need all of this information about an
event to classify it as an intrusion.
Again, both approaches (table per event and table per token) fail to
consider this aspect.
EXAMPLE: to find the pid of a process, the token approach would join the
header token with the subject token. The subject token has 9 fields
(the corresponding table has 9 columns) and the header token has 6 fields
(the corresponding table has 6 columns). We are therefore joining two bulky
tables and then applying a projection. Even if the optimizer pushes the
projections below the join, it still spends resources projecting from huge
tables. We could have done this more efficiently by looking at just 2
fields: the event id of the header token and the pid of the subject token.
Therefore the idea is to have a kind of pseudo-view that looks only at
the data fields pertinent to an intrusion, while blacking out the rest of
the data.
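The redundancy argument in the uid example can be illustrated with a small
sketch. It is not taken from any BSM tooling: the function name
dictionary_encode, the synthetic uid range, and the event count of 5000 are
all assumptions chosen to mirror the "50-100 uids, several thousand events"
figures above.

```python
import random

def dictionary_encode(values):
    """Store each distinct value once and replace occurrences by its id."""
    table = []    # distinct values; list position serves as the id
    ids = {}      # value -> id, used only while encoding
    encoded = []  # one small integer reference per original value
    for v in values:
        if v not in ids:
            ids[v] = len(table)
            table.append(v)
        encoded.append(ids[v])
    return table, encoded

# Synthetic data (assumption): 5000 audit events drawn from at most 75 uids.
random.seed(0)
uids = [random.randrange(1000, 1075) for _ in range(5000)]

uid_table, event_refs = dictionary_encode(uids)

# The original column can be rebuilt from the references, so nothing is lost.
assert [uid_table[i] for i in event_refs] == uids
print(len(uids), "events reference", len(uid_table), "distinct uids")
```

Under these assumptions the uid values themselves are stored at most 75
times instead of 5000; the events keep only a reference to them.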
The key to solving the above two problems lies in making the stored
information finer-grained, so that we can look at the information we need
and ignore the rest. The hybrid schema is based on this philosophy. The
schema looks at things from a token-field perspective. We maintain a table
for each token-field, so that redundancy is removed, and an event table in
which each entry is an index into a token-field table (so that joins become
table lookups). The event table itself is indexed by event type.
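One way such a schema could be laid out is sketched here with Python's
built-in sqlite3 module. The table and column names (uid_field, path_field,
event, uid_ref, path_ref), the choice of only two token-fields, and the
helper functions are hypothetical and serve only to illustrate the layout
described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per token-field; each distinct value is stored once.
CREATE TABLE uid_field  (id INTEGER PRIMARY KEY, uid  INTEGER UNIQUE NOT NULL);
CREATE TABLE path_field (id INTEGER PRIMARY KEY, path TEXT    UNIQUE NOT NULL);

-- The event table holds only references into the token-field tables.
CREATE TABLE event (
    event_id   INTEGER PRIMARY KEY,
    event_type TEXT NOT NULL,
    uid_ref    INTEGER REFERENCES uid_field(id),
    path_ref   INTEGER REFERENCES path_field(id)
);

-- The event table is indexed by event type.
CREATE INDEX event_by_type ON event(event_type);
""")

def field_id(conn, table, column, value):
    """Return the id of value in a token-field table, inserting it if new."""
    conn.execute(f"INSERT OR IGNORE INTO {table} ({column}) VALUES (?)", (value,))
    row = conn.execute(
        f"SELECT id FROM {table} WHERE {column} = ?", (value,)
    ).fetchone()
    return row[0]

def add_event(conn, event_type, uid, path):
    """Insert one event, storing references rather than the field values."""
    conn.execute(
        "INSERT INTO event (event_type, uid_ref, path_ref) VALUES (?, ?, ?)",
        (event_type,
         field_id(conn, "uid_field", "uid", uid),
         field_id(conn, "path_field", "path", path)),
    )

# Made-up sample events (assumption, not real audit data).
add_event(conn, "execve", 1001, "/bin/ls")
add_event(conn, "open",   1001, "/etc/passwd")
add_event(conn, "execve", 1001, "/bin/ls")
```

In this sketch three events share a single uid row and two path rows, so
repeated values cost one integer reference each.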
Now, to find an event such as the execution of a malicious script using the
hybrid schema, we look at all execve events (for which there is an index),
and for each of them look up the path name in the path name table. This way
we do not touch any data that is not strictly required for the
intrusion detection.
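The lookup just described can be sketched in a few lines of plain Python,
with a list standing in for the path name table and a dictionary standing in
for the index on event type. The sample rows, the "execve" type label, the
script path, and the function name find_executions are illustrative
assumptions, not part of the schema itself.

```python
from collections import defaultdict

# Path name table: each distinct path stored once; list position is its id.
path_table = ["/bin/ls", "/tmp/evil.sh", "/usr/bin/vi"]

# Event rows as (event_id, event_type, path_ref); made-up sample data.
events = [
    (1, "execve", 0),
    (2, "open",   2),
    (3, "execve", 1),
    (4, "execve", 0),
]

# Index on event type, so only events of the wanted type are ever visited.
type_index = defaultdict(list)
for row in events:
    type_index[row[1]].append(row)

def find_executions(suspicious_paths):
    """Return (event_id, path) for exec events whose path is suspicious."""
    hits = []
    for event_id, _event_type, path_ref in type_index["execve"]:
        path = path_table[path_ref]  # a table lookup replaces the join
        if path in suspicious_paths:
            hits.append((event_id, path))
    return hits

matches = find_executions({"/tmp/evil.sh"})
print(matches)  # [(3, '/tmp/evil.sh')]
```

Only the event type index and the path name table are read; no other
token-field is touched.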