A Hybrid Design

Intrusion Detection and Database Systems

The most basic approaches to designing a schema for intrusion detection using BSM suffer from two flaws:

1. Redundancy. BSM logs contain enormous amounts of data, so any database approach must eliminate redundancy to be efficient. The redundancy exists at the token-field level, not at the token or event level, so neither simple approach (table per token or table per event) removes it. EXAMPLE: consider user IDs. The number of distinct uids seen in a single day is on the order of 50-100, but the audit events and corresponding tokens in a BSM log may number several thousand. Each of those events repeats one of the same 50-100 uids, so under a table-per-token or table-per-event approach the user IDs are replicated many times over. The idea behind the hybrid schema is therefore to store each field in a table of its own, which eliminates repeated entries for that field.
2. Excess detail. BSM logs record exhaustive information about each event, which adds to their bulk. Classifying an intrusion, however, does not require all of this information. Again, neither approach (table per event or table per token) takes this into account. EXAMPLE: to find the pid of a process, the table-per-token approach joins the header token table with the subject token table. The subject token has 9 fields (so its table has 9 columns) and the header token has 6 fields (so its table has 6 columns). We are therefore joining two bulky tables and then projecting. Even if the optimizer pushes the projections below the join, it still spends resources projecting from huge tables. It would be more efficient to look at just 2 fields: the event id of the header token and the pid of the subject token. The idea is therefore to have a kind of pseudo-view that exposes only the data fields pertinent to an intrusion and blacks out the rest.
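The redundancy argument can be made concrete with a small Python sketch. The records below are hypothetical and heavily simplified (real BSM records carry many more tokens and fields), and the helper name intern_field is an assumption made for illustration. The sketch stores each distinct field value once and lets every event refer to it by a small integer.

```python
# Hypothetical, simplified audit records: (event_type, uid, path).
records = [
    ("execve", 1001, "/bin/ls"),
    ("open", 1001, "/etc/passwd"),
    ("execve", 1002, "/bin/ls"),
    ("open", 1001, "/etc/passwd"),
    ("execve", 1001, "/tmp/run.sh"),
]


def intern_field(values):
    """Store each distinct value once; return the value table and per-event references."""
    table = {}  # value -> small integer id, one entry per distinct value
    refs = []   # one reference per event, pointing into the table
    for value in values:
        refs.append(table.setdefault(value, len(table)))
    return table, refs


uid_table, uid_refs = intern_field(r[1] for r in records)
path_table, path_refs = intern_field(r[2] for r in records)

print(len(records), "events")
print(len(uid_table), "distinct uids,", len(path_table), "distinct paths")
```

With thousands of events and only 50-100 uids, the uid table stays small while the per-event references remain cheap integers.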

The key to solving both problems lies in working at a finer granularity, so that we can look at the information we need and ignore the rest. The hybrid schema is based on this philosophy: it looks at the data from a token-field perspective. We maintain one table for each token field, which removes the redundancy, and an event table in which each entry is an index into the token-field tables, so that joins become table lookups. The event table itself is indexed by event type.
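One possible shape for such a schema is sketched below using Python's built-in sqlite3 module. This is an illustration under assumptions, not the schema actually used: the table names (uid_field, pid_field, path_field, event), the column names, and the helpers field_ref and add_event are all hypothetical, and only three of the many BSM fields are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per token field: each distinct value is stored once.
CREATE TABLE uid_field  (id INTEGER PRIMARY KEY, value INTEGER UNIQUE);
CREATE TABLE pid_field  (id INTEGER PRIMARY KEY, value INTEGER UNIQUE);
CREATE TABLE path_field (id INTEGER PRIMARY KEY, value TEXT UNIQUE);
-- Event table: each column is an index into a token-field table.
CREATE TABLE event (
    id         INTEGER PRIMARY KEY,
    event_type TEXT NOT NULL,
    uid_ref    INTEGER REFERENCES uid_field(id),
    pid_ref    INTEGER REFERENCES pid_field(id),
    path_ref   INTEGER REFERENCES path_field(id)
);
-- The event table is indexed by event type.
CREATE INDEX event_by_type ON event(event_type);
""")


def field_ref(table, value):
    """Return the row id of value in a token-field table, inserting it only once."""
    if value is None:
        return None
    # table is always one of the fixed names above, never user input.
    conn.execute(f"INSERT OR IGNORE INTO {table}(value) VALUES (?)", (value,))
    row = conn.execute(f"SELECT id FROM {table} WHERE value = ?", (value,)).fetchone()
    return row[0]


def add_event(event_type, uid=None, pid=None, path=None):
    """Insert one event as a row of references into the token-field tables."""
    conn.execute(
        "INSERT INTO event(event_type, uid_ref, pid_ref, path_ref) VALUES (?, ?, ?, ?)",
        (event_type, field_ref("uid_field", uid),
         field_ref("pid_field", pid), field_ref("path_field", path)),
    )


add_event("execve", uid=1001, pid=400, path="/bin/ls")
add_event("open", uid=1001, pid=400, path="/etc/passwd")
add_event("execve", uid=1001, pid=401, path="/bin/ls")

uid_rows = conn.execute("SELECT COUNT(*) FROM uid_field").fetchone()[0]
path_rows = conn.execute("SELECT COUNT(*) FROM path_field").fetchone()[0]
event_rows = conn.execute("SELECT COUNT(*) FROM event").fetchone()[0]
print(event_rows, "events,", uid_rows, "uid row(s),", path_rows, "path rows")
```

Three events share a single uid row, and a query for pids only has to touch the event table and pid_field rather than two wide token tables.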

To find an event such as the execution of a malicious script with the hybrid schema, we look at all execve events (for which there is an index) and, for each of them, look up the path name in the path name table. This way we never touch any data that is not strictly required for the intrusion detection.
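The lookup described above might be expressed as follows, again with Python's sqlite3 module. The table layout, the sample rows, the SUSPICIOUS_PATHS set, and the function suspicious_execs are assumptions made for this sketch. The JOIN is on the primary key of the path table, so it behaves as the per-event table lookup described in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE path_field (id INTEGER PRIMARY KEY, value TEXT UNIQUE);
CREATE TABLE event (id INTEGER PRIMARY KEY, event_type TEXT, path_ref INTEGER);
CREATE INDEX event_by_type ON event(event_type);
-- Hypothetical sample data.
INSERT INTO path_field(id, value) VALUES
    (1, '/bin/ls'), (2, '/tmp/evil.sh'), (3, '/etc/passwd');
INSERT INTO event(event_type, path_ref) VALUES
    ('execve', 1), ('open', 3), ('execve', 2), ('open', 2);
""")

# Hypothetical list of script paths considered malicious.
SUSPICIOUS_PATHS = {"/tmp/evil.sh"}


def suspicious_execs(conn, suspicious):
    """Return (event id, path) for execve events whose path is in suspicious."""
    rows = conn.execute(
        "SELECT e.id, p.value FROM event AS e "
        "JOIN path_field AS p ON p.id = e.path_ref "  # primary-key lookup
        "WHERE e.event_type = 'execve' "              # served by event_by_type
        "ORDER BY e.id"
    )
    return [(event_id, path) for event_id, path in rows if path in suspicious]


hits = suspicious_execs(conn, SUSPICIOUS_PATHS)
print(hits)
```

Only the event type and path reference columns are read. The open event on the same path is not reported, because it is filtered out by event type before any path lookup takes place.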