Introduction

AppId appears in one of the first diagrams in the Snort++ manual (although the diagram is a little misleading).
Since AppId is displayed so prominently in that diagram, I thought I'd discuss it in some detail.

Overview

AppId attempts to determine the client and server and, in certain circumstances, also attempts to identify the payload for a given Flow. (We'll talk about payload later when we discuss AppId's HTTP handler.)
Recall that each TCP and UDP packet belongs to a Flow. In the case of a TCP session, all of the packets in the TCP session belong to the same Flow. In the case of a UDP session, all of the packets coming from and going to the same IP addresses/ports belong to the same Flow.
For example, if a PcAnywhere client connects to one of the PcAnywhere servers, AppId will (hopefully) determine that the client is a PcAnywhere client using the PcAnywhere service. As another example, if you are using a Mozilla web browser, AppId will figure out that the client is a Mozilla browser.
AppId is used in rules. Let's look at a rule in which the appid option is used:
alert tcp any any -> any any (msg:"Someone is using PcAnywhere!"; appid: pcanywhere; sid:1000000; rev:1)
This rule will cause an alert to be generated if AppId determines that PcAnywhere clients and servers are communicating. Let's look at another rule:
alert tcp any any -> any any (msg:"Someone is using a Mozilla browser!"; appid: mozilla; sid:1000001; rev:1)
This rule will cause an alert to be generated if AppId determines that a packet was sent by a Mozilla browser. AppId tries to determine the client and/or the service so that the Rule Detection Engine can match a rule with a given packet.

Details

AppId does whatever it can to determine a Flow's client and/or service. If AppId determines the client and/or service, it will store the client or service's unique id (its "appid") in the Flow's application_ids[] array. (We'll discuss the PAYLOAD and MISC elements of this array when we discuss the AppId HTTP handler.)
I use the term "AppId" to describe the component of Snort++ and I use appid (lowercase) to describe the unique integer assigned to each application. We'll see a little later where these unique integers are found.
The Rule Detection Engine will then compare all of the elements of application_ids[] to each rule that contains the appid option. This means that if AppId determines that the client and/or the service is using the PcAnywhere protocol and a rule contains the appid: pcanywhere option, then the Flow matches the option (although the packet must also, of course, match all of the other options of the rule for there to be a match between the packet and the rule).
So there are two questions that still need to be answered. First, where is the appid for each application defined and, secondly, how does AppId determine the appids for each Flow?
The first question is easy. The appids correspond to the first column of the appMapping.data file.
The other columns, if non-zero, are unique as well and correspond to the service, client, and payload (we'll talk about payload later). A zero value indicates that the application is not used as a service, client, or payload. For example, the Internet Explorer browser is (obviously) only a client so its service and payload are 0. The elements of application_ids[] are only set to the value of the first column, the application's appid.
If any of the appids for a given rule (there can be more than one appid in a rule) matches the appid of the service, client, or payload, then the option matches. For example, if AppId determines that the client is a PcAnywhere client, application_ids[APP_PROTO_CLIENT] will be set to 781. If AppId determines that the service is a PcAnywhere service, application_ids[APP_PROTO_SERVICE] will also be set to 781. If AppId determines either that the client is a PcAnywhere client or that the server is a PcAnywhere server, then the following rule:
alert tcp any any -> any any (msg:"Someone is using PcAnywhere!"; appid: pcanywhere; sid:1000000; rev:1)
will match and an alert will be sent.
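The comparison described above can be sketched in a few lines of Python. This is only an illustrative model, not Snort++ code: the ordering of the array indices, the APP_ID_NONE value, and the function name appid_option_matches() are assumptions made for the example; only the appid 781 comes from the text.

```python
# Hypothetical model of a Flow's application_ids[] array. The index order and
# the "nothing identified" value are assumptions; 781 is the PcAnywhere appid.
APP_PROTO_SERVICE, APP_PROTO_CLIENT, APP_PROTO_PAYLOAD, APP_PROTO_MISC = range(4)

PCANYWHERE = 781
APP_ID_NONE = 0


def appid_option_matches(rule_appids, application_ids):
    """Return True if any identified element equals any appid named in the rule."""
    return any(a != APP_ID_NONE and a in rule_appids for a in application_ids)


# A Flow where only the client has been identified as PcAnywhere.
flow_ids = [APP_ID_NONE] * 4
flow_ids[APP_PROTO_CLIENT] = PCANYWHERE

client_only = appid_option_matches({PCANYWHERE}, flow_ids)
unidentified = appid_option_matches({PCANYWHERE}, [APP_ID_NONE] * 4)
```

Identifying the client alone is enough for the option to match, which is the behavior the PcAnywhere rule relies on.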
The second (and much more difficult) question is - how does AppId figure out the service, client, and (if the AppId HTTP handler comes into play) the payload? Before we answer that question, let's talk about how AppId is invoked. One of the things that I don't like about the figure from the manual (shown at the top) is that it only shows AppId as a subscriber (we'll talk about subscriber-publisher relationships soon). However, it can also be an Inspector (more specifically, a network Inspector).
AppId also tries to determine the client and service for UDP Flows. I discuss TCP Flows because they are the more interesting case.
Recall that the network Inspectors (including AppId) are invoked for all of the packets in a Flow until a service Inspector is identified by the Wizard. (Note that an appropriate service Inspector won't necessarily be found for a given Flow. For example, there is no service Inspector that corresponds to PcAnywhere.)
Another network Inspector is binder. By calling Wizard, binder serves a similar purpose to AppId. In a sense, binder and AppId compete with each other.
Once an appropriate service Inspector for the Flow is found (if one indeed exists), this service Inspector can request that AppId look at the various components (header, body, etc.) of the Flow. This is how a subscriber/publisher relationship works. AppId/HttpInspect is, in fact, the most important example in Snort++ of a subscriber/publisher relationship. During Snort++'s initialization, AppId expresses an interest in (i.e., "subscribes to") HTTP header events and HttpInspect then publishes HTTP header events after it has identified an HTTP header in an HTTP session. (Note that only a handful of service Inspectors currently publish events - the most important being HttpInspect.)
So, to recap, AppId can be invoked as a network Inspector and can also be invoked in response to events published by the service Inspectors (if AppId has registered itself as a subscriber to these events).
Returning to the second question - how does AppId identify services, clients, and payloads? It has several different approaches, some of which are used by AppId, the network Inspector, and some of which are used by AppId, the subscriber. Some (but not all) of these approaches rely on Detector objects. We'll talk about Detectors in a little bit.

AppId, the network Inspector

Let's talk first about AppId, the network Inspector. AppId uses various arrays, search engines, and hashes (which we'll loosely refer to as "databases") to make its decisions. (We'll talk a little later about how items are added to these databases - this process is complicated by the fact that these additions are often made using Lua code.) Some of these databases are used whether the packet is coming from the server or the client, some are only used if the packet is coming from the server, and some are only used if the packet is coming from the client.
It's important to understand that AppId, the network Inspector, does not determine the payload - only the AppId HTTP handler can do that.
Like the typical Inspector, AppId is invoked by its eval() method when used as a network Inspector. The most important function called by eval() is AppIdDiscovery::do_application_discovery(). eval() looks in these different databases. In a moment, I will discuss Detectors and the C++ and Lua functions that are associated with each database. Here are the different databases that are used if a packet is coming from a server:
As you can see above, a TCP packet from the server (i.e., a packet with a lower source TCP port number than the destination TCP port number) is compared against 6 databases. If a match is found in a database, the client_app_id, serviceAppId, payload_app_id, portServiceAppId, client_service_app_id, and/or misc_id fields of the packet's Flow's AppIdFlowData are set.
What is the AppIdFlowData class? AppIdFlowData is a FlowData-derived class. A Flow can have multiple objects of FlowData-derived classes but only one of each specific FlowData-derived class. These FlowData objects are "used by various inspectors to store specific data on the Flow for later use". For example, a Flow could have an associated single (i.e., not multiple) ReputationFlowData object, a single AppIdFlowData object, and a single HttpFlowData object. As mentioned above, client_app_id, serviceAppId, payload_app_id, portServiceAppId, client_service_app_id, and/or misc_id are the "specific data" used by AppId.
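The "one object per FlowData-derived class" rule can be pictured with a small Python model. This is only a sketch: the Flow methods below are hypothetical names chosen for the example, and keying the storage by class is an assumption about how the uniqueness is enforced, not a description of the Snort++ implementation.

```python
class FlowData:
    """Base class for per-Flow data owned by an inspector."""


class AppIdFlowData(FlowData):
    def __init__(self):
        # Two of the fields named in the text; None means "not set yet".
        self.client_app_id = None
        self.serviceAppId = None


class ReputationFlowData(FlowData):
    pass


class Flow:
    def __init__(self):
        self._flow_data = {}

    def set_flow_data(self, obj):
        # Keyed by concrete class, so a second object of the same class
        # replaces the first instead of being stored alongside it.
        self._flow_data[type(obj)] = obj

    def get_flow_data(self, cls):
        return self._flow_data.get(cls)


flow = Flow()
flow.set_flow_data(AppIdFlowData())
flow.set_flow_data(ReputationFlowData())
latest = AppIdFlowData()
flow.set_flow_data(latest)  # replaces the earlier AppIdFlowData
```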
I will explain later how these AppIdFlowData fields are used to set the packet's Flow's application_ids[] mentioned above. Let's see now which databases affect which field in AppIdFlowData.
1) The IP address, (TCP or UDP) port, and protocol (TCP or UDP) are looked up in the host_port_cache hash. If an entry is found matching these three parameters, the client_app_id, serviceAppId, and/or payload_app_id fields are set. (I will discuss later how this and other databases are populated.)
2) The port number is looked up in the tcp_port_only[] array. If an element in the array is found, the portServiceAppId is set.
3) The lengthCache is a little unusual and is used in only one circumstance (some weird app named "KeyholeTV"). If the protocol (TCP or UDP), direction (client -> server or server -> client), and packet payload length match up, the portServiceAppId is set. Step 3 is only invoked if the portServiceAppId was not found in step 2.
The first three databases (1-3) in the diagram above do not require validation. In other words, if an entry is found in one of the databases in the first three steps, the packet is not inspected to determine if there really is a match. For example, if an entry of "91.108.56.16", TCP port 443 exists in the host_port_cache hash (and it should if using the default AppId configuration) and AppId sees a packet from "91.108.56.16", TCP port 443, then service_app_id is unconditionally set to 4116 (which corresponds to Telegram). In other words, the packet is not scrutinized further to verify that the service is really Telegram.
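Steps 1-3 can be modelled as three plain lookups with no packet inspection. The Python below is a simplified sketch rather than the AppId code: the dictionary layouts and the function name unvalidated_lookup() are assumptions, step 1 is reduced to setting only the service field, and the only data used are the Telegram entry quoted above and the port 22 to SSH (846) mapping mentioned in this article.

```python
TELEGRAM = 4116  # appid quoted in the text
SSH = 846        # SSH's appid, as given in this article

# Step 1 database: (ip, port, protocol) -> service appid.
host_port_cache = {("91.108.56.16", 443, "tcp"): TELEGRAM}
# Step 2 database: port -> appid.
tcp_port_only = {22: SSH}
# Step 3 database: (protocol, direction, payload length) -> appid (empty here).
length_cache = {}


def unvalidated_lookup(ip, port, proto, direction, payload_len):
    """Fill in the fields that databases 1-3 can set, without any validation."""
    fields = {}
    # Step 1: exact (ip, port, protocol) match.
    fields["serviceAppId"] = host_port_cache.get((ip, port, proto))
    # Step 2: port-only match.
    fields["portServiceAppId"] = tcp_port_only.get(port)
    # Step 3: consulted only when step 2 found nothing.
    if fields["portServiceAppId"] is None:
        key = (proto, direction, payload_len)
        fields["portServiceAppId"] = length_cache.get(key)
    return fields


telegram = unvalidated_lookup("91.108.56.16", 443, "tcp", "server_to_client", 120)
ssh_port = unvalidated_lookup("10.0.0.5", 22, "tcp", "server_to_client", 64)
```

Nothing in this function looks at the packet contents, which is exactly why the results of these three databases are less trustworthy than a validated Detector match.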
The next three databases (4-6) for packets from the server to the client do require validation. At this point, we need to discuss Detectors.
Detectors are also used in steps 1-3. However, they are used in a somewhat kludgey manner to load multiple entries into various databases. In the typical scenario, a Detector is written for a single service or a single client. In the case of the Detectors for databases 1-3, the Detectors are written for multiple services or clients. For example, the payload_group_hootieandtheblowfish Detector was written for an unrelated collection of clients, servers, and payloads (hence the silly and meaningless name of the file).
Detectors are typically written for a single client or a single service (e.g., the PcAnywhere client and service Detectors). Detectors can be written in C++ or Lua. The core service Detectors (e.g., ssh, sip, http, smtp) are written in C++ while the more obscure services (PcAnywhere, BitTorrent) are written in Lua. (The advantage of writing a Detector in C++ is speed. The advantage of writing a Detector in Lua is convenience: you don't need to recompile Snort++ if you change a Detector written in Lua. Most Detectors are written in Lua.) There are two important tasks of a Detector. The first important task of a Detector is to insert entries into the different databases.
The figure above shows the different C++ and Lua functions required to insert entries into the different databases. We'll talk more about this a little later.
The second important task is to validate that the Detector is indeed the appropriate Detector for a given Flow (as explained above, this is unnecessary in databases 1-3). This is typically done after a match with an entry in one of the databases that corresponds to the client or service for which the Detector is responsible. For example, the SSH service Detector during initialization inserts an entry of "SSH-" in the tcp_patterns search engine. Obviously, just finding "SSH-" in a packet isn't enough to be 100% certain that the packet is from an SSH service. So the SSH service Detector will be asked to validate that the packet is indeed from an SSH service.
Now that we understand a little about Detectors, let's look at databases 4 through 6. Here's the diagram again.
Aside from validation, databases 4-6 also differ from databases 1-3 in that databases 4-6 are found in the ServiceDiscovery object. In concrete terms, this means that the next three databases are specific to packets coming from the server; the first three databases are common both to packets coming from the client and packets coming from the server.
4) The packet's source port is looked up in the tcp_services array. The corresponding element, if non-NULL, will point to the Detector object that corresponds to the port. For example, the SshServiceDetector object corresponds to port 22. If a packet is from port 22, this object's validate() method is called to look at the packet and to verify that the packet is indeed originating from an SSH server. (Just because a packet is from port 22 doesn't mean it's an SSH session.) Upon validation, the Detector object's validate() method typically sets the serviceAppId field. (A service Detector could, however, set any of the fields.)
5) The packet is compared to patterns in the tcp_patterns search engine that indicate a specific service. For example, the string "SSH-" indicates that a packet originates from an SSH service. Similar to step 4, if a match is found, the service Detector's validate() method is called to verify and (if appropriate) set the serviceAppId field. (Database 5 is not needed if a Detector is found and successfully validated in step 4.)
6) The last step is desperation. If no Detectors were found in Databases 1-5, the "brute force" approach is then taken. This involves invoking the validate() method for all of the service Detectors. These service Detectors are found in the tcp_detectors hash.
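Steps 4-6 can be sketched as a cascade in which every candidate Detector must validate the packet before its appid is accepted. The Python below is an illustrative model under stated assumptions: the toy validate() check, the container layouts, and the function name discover_service() are invented for the example, and the real SSH Detector's validation is certainly more thorough.

```python
import re

SSH = 846  # SSH's appid, as given in this article


class SshServiceDetector:
    """Toy stand-in for a service Detector."""
    appid = SSH

    def validate(self, payload):
        # Assumed check: the payload starts with an SSH version banner.
        return re.match(rb"SSH-\d+\.\d+-", payload) is not None


ssh = SshServiceDetector()
tcp_services = {22: ssh}         # step 4: source port -> Detector
tcp_patterns = [(b"SSH-", ssh)]  # step 5: pattern -> Detector
tcp_detectors = [ssh]            # step 6: every service Detector


def discover_service(src_port, payload):
    # Step 4: the Detector registered for the source port, if any.
    detector = tcp_services.get(src_port)
    if detector is not None and detector.validate(payload):
        return detector.appid
    # Step 5: Detectors whose pattern occurs in the payload.
    for pattern, detector in tcp_patterns:
        if pattern in payload and detector.validate(payload):
            return detector.appid
    # Step 6: brute force over all service Detectors.
    for detector in tcp_detectors:
        if detector.validate(payload):
            return detector.appid
    return None


on_port_22 = discover_service(22, b"SSH-2.0-OpenSSH_8.9\r\n")
on_port_2222 = discover_service(2222, b"SSH-2.0-OpenSSH_8.9\r\n")
http_on_22 = discover_service(22, b"GET / HTTP/1.1\r\n")
```

The last call shows why validation matters: traffic from port 22 that does not look like SSH is not labelled as SSH.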
If the packet is coming from the client, the approach is similar but abbreviated:
The first three databases are the same as for packets coming from the server. For a client-originated packet, only the tcp_patterns database is searched against. If a match is found and the validation of a client Detector is successful, validate() sets the client_app_id field. The main task of a client Detector is to identify the client application used. However, the client Detector also occasionally attempts to determine the service as well. If the client Detector determines (perhaps incorrectly) that a certain service is being used, it will set the client_service_app_id field. Note that the client_service_app_id field will only be used if serviceAppId field is not set. (The idea behind this precedence is that a service Detector is typically better able to determine the appropriate service for a Flow than a client Detector. We'll talk a little more about that in a moment.)
As mentioned above, AppId's goal is to set the four application_ids[] array elements of Flow so that the detection engine can try to find matches between the elements of this array and the appid options found in the rules. After the different databases are compared with a given packet (from the client or server), the client_app_id, serviceAppId, payload_app_id, portServiceAppId, client_service_app_id, and/or misc_id fields of the packet's Flow's AppIdFlowData may be set. So how do these fields map to application_ids[]?
Choosing the appid for the service is the only tricky one. If serviceAppId was set after looking in the host_port_cache hash or in the tcp_services[] array, tcp_patterns search engine, or the tcp_detectors hash (brute force), then serviceAppId is chosen, regardless of the values of client_service_app_id or portServiceAppId. client_service_app_id is considered a weak candidate because the client Detector determined client_service_app_id, not the service Detector, which is better suited to determining the service. All that is required for portServiceAppId to be set is for a given port to be used. For example, if one of the ports of a Flow is TCP port 22, then portServiceAppId is set to 846, SSH's appid. Since this is so unreliable (it's trivial to switch the service-side port number of SSH as well as most other services), portServiceAppId is considered a weak candidate.
The use of third party modules is still under development. For this reason, I do not discuss the tp_app_id or tp_payload_app_id fields ("tp" stands for "third party"). When this code is finalized, these fields will play a role and I will document them.
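The precedence among the three service candidates can be written as a short fallback chain. This Python sketch assumes that an unset field is represented by None and that client_service_app_id is tried before portServiceAppId; the text only says that both are weak candidates, so the relative order of those two is an assumption, as is the function name.

```python
def pick_service_app_id(service_app_id=None,
                        client_service_app_id=None,
                        port_service_app_id=None):
    """Return the first candidate that is set, strongest first."""
    candidates = (
        service_app_id,         # set by a validated service Detector or host_port_cache
        client_service_app_id,  # a client Detector's guess at the service (weak)
        port_service_app_id,    # based on the port number alone (weak)
    )
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return None


# A Telegram service match (4116) wins over an SSH port guess (846).
strong_wins = pick_service_app_id(service_app_id=4116, port_service_app_id=846)
port_only = pick_service_app_id(port_service_app_id=846)
nothing_set = pick_service_app_id()
```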

Adding Entries To The Databases

I've introduced several databases but I haven't yet explained in detail how the databases are populated. These databases are populated by the initialization routines of the Detectors. As I've already mentioned, Detectors can be written in C++ or Lua. Here are the functions that populate the different databases.
As you can see from the above diagram, the host_port_cache and lengthCache are populated only by Lua Detectors. There would be no advantage to writing a C++ Detector that would add entries to these two databases since these two databases do not require validation (validation functions scrutinize packets in real time so speed is critical).
When one of the Lua functions in the diagram above is called, the Lua function invokes a C++ function. This interaction requires the Lua/C++ API. The Lua/C++ API defines the interface between the Lua code and the C++ code. Calls to the Lua API are made during initialization and validation. During the initialization of a Lua Detector, C++ code calls the Lua Detector's Lua initialization functions and this Lua initialization function, in turn, calls C++ code to insert entries into the various databases that were described above.
This interaction is complex and is discussed in another document.
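The round trip described above (the host calls a Detector's initialization function, which calls back into the host to insert entries) can be illustrated with a language-neutral Python sketch. Every name here (DetectorApi, register_pattern, add_host_port, the two init functions) is hypothetical and is not part of the real Lua/C++ API; the sketch only shows the direction of the calls.

```python
class DetectorApi:
    """Stand-in for the host side that owns the databases."""

    def __init__(self):
        self.tcp_patterns = []
        self.host_port_cache = {}

    def register_pattern(self, pattern, detector_name):
        self.tcp_patterns.append((pattern, detector_name))

    def add_host_port(self, ip, port, proto, appid):
        self.host_port_cache[(ip, port, proto)] = appid


def ssh_detector_init(api):
    # A single-service Detector registers the pattern it will later validate.
    api.register_pattern(b"SSH-", "ssh")


def bulk_detector_init(api):
    # A "group" Detector only loads unvalidated entries (Telegram example from the text).
    api.add_host_port("91.108.56.16", 443, "tcp", 4116)


def load_detectors(api, init_functions):
    # The host drives initialization; each Detector calls back into the API.
    for init in init_functions:
        init(api)


api = DetectorApi()
load_detectors(api, [ssh_detector_init, bulk_detector_init])
```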

AppId Http Handler

We've already discussed the AppId network Inspector. Now it's time to talk about the AppId HTTP handler. As I mentioned earlier, the AppId network Inspector is called when the Binder/Wizard combination is unable to find an appropriate service Inspector for a Flow. If HttpInspect is chosen by the Wizard, it will ask AppId to look at the HTTP request headers to determine the client and payload (the service will be "HTTP"). Don't worry. I'll talk about what "payload" means shortly.
As discussed elsewhere, HttpInspect is able to break an HTTP session into its components (header, body, etc.). From an HTTP request header, AppId can determine the client and the payload. For example, if the User-Agent field contains "Firefox", the client is a Firefox browser. AppId is also able to determine the "payload". I've mentioned "payload" before but I haven't discussed it in detail because the AppId network Inspector is not able to determine the payload. The AppId HTTP handler, on the other hand, is able to determine the payload. It determines this information from the Host and (occasionally) from the path fields of the HTTP request header. For example, if the Host field is set to "facebook.com" and the path is set to "/notes", the application_ids[APP_PROTOID_PAYLOAD] element is set to 1360 (the appid for Facebook Notes).
Since the AppId HTTP handler at this time is only interested in HTTP request headers, the name "payload" is a little misleading. An HTTP header is certainly in a TCP session's payload but the term "payload" is pretty vague. The payload can be derived from the path and Host fields.

Subscriber/Publisher Relationship

Let's talk about how HttpInspect gets this HTTP header information to AppId. During initialization, modules "subscribe" to specific events. Later, as Snort++ is processing packets, any Inspector (and - in theory at least - the Rule Detection Engine as well) can "publish" events.
AppId subscribes to HTTP_REQUEST_HEADER_EVENT_KEY and HTTP_RESPONSE_HEADER_EVENT_KEY events during initialization and registers its HttpEventHandler as the callback class. What that means is that HttpEventHandler's handle() method is invoked whenever HttpInspect publishes an HTTP_REQUEST_HEADER_EVENT_KEY or HTTP_RESPONSE_HEADER_EVENT_KEY event. handle() has two parameters - the event (an HttpEvent object) as well as the Flow corresponding to the pseudo packet that triggered the event.
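The subscribe-then-publish sequence can be modelled with a tiny event bus. This Python sketch is not the Snort++ data bus: the EventBus class, the string values of the two keys, and the dictionary-based event and flow are assumptions chosen to keep the example self-contained.

```python
from collections import defaultdict

# The key names come from the text; their string values here are made up.
HTTP_REQUEST_HEADER_EVENT_KEY = "http_request_header_event"
HTTP_RESPONSE_HEADER_EVENT_KEY = "http_response_header_event"


class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, key, handler):
        self._subscribers[key].append(handler)

    def publish(self, key, event, flow):
        # Every handler registered for this key sees the event and its Flow.
        for handler in self._subscribers[key]:
            handler(event, flow)


def http_event_handler(event, flow):
    """Plays the role of HttpEventHandler::handle(): copy header fields to the Flow."""
    flow["host"] = event.get("host")
    flow["user_agent"] = event.get("user-agent")


bus = EventBus()
# Initialization: AppId subscribes to both header events.
bus.subscribe(HTTP_REQUEST_HEADER_EVENT_KEY, http_event_handler)
bus.subscribe(HTTP_RESPONSE_HEADER_EVENT_KEY, http_event_handler)

# Packet processing: HttpInspect publishes once it has parsed a request header.
flow = {}
bus.publish(HTTP_REQUEST_HEADER_EVENT_KEY,
            {"host": "facebook.com", "user-agent": "Firefox"}, flow)
```

The publisher never refers to AppId by name; it only knows the event key, which is what lets other modules subscribe to the same events.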
Here's an HTTP header again:
All of these items (e.g., the host, the user-agent) are contained within the HttpEvent object. More specifically, they are contained in the norm_heads field (a linked list containing all of the header fields) of the http_msg_header.
The strings from these fields (e.g., the host) are then copied to the Flow's AppIdSession object. (Recall from above that AppIdSession is one of the FlowData-derived objects that are associated with the Flow.) The AppId HTTP handler (HttpEventHandler::handle()) then compares the different HTTP header fields against various "databases" (defined loosely). The most interesting and important case is the CHP search engines. (I have no idea what "CHP" stands for. Perhaps the "HP" is short for "HTTP" but I have no idea what the "C" stands for.)
The CHP search engine group is preferred by AppId above all of its other search engines. In other words, if AppId finds a CHP match, AppId is content with the results and the packet is compared against no other search engines. There is a CHP search engine for the HTTP header fields as well as the body. The most important HTTP header fields are the host (HOST_PT) and uri (URI_PT) fields. These CHP search engines are populated by Lua Detectors. (As mentioned above, this interaction is complex and is discussed in another document.) First, the Lua Detector creates a CHPApp with CHPCreateApp() and then calls the Lua CHPAddAction() method to create CHPActions and populate the CHP search engines. CHPApps link the different CHPActions together. For example, there are 3 CHPActions created for the Facebook Detector during AppId's initialization.
The strings from each CHPAction are added to the appropriate search engine with pointers to their respective CHPActions. Later, when AppId is invoked by the HttpInspect inspector, in order for there to be a match with this Detector, all three strings must be found: "facebook.com" must be found in the HTTP header's host field; "/login.php" and "email=" must be found in the path.
Furthermore, the email address can be extracted from the path (it can be found immediately after the "email=" string). How is the extracted email used by AppId? At least for now, it doesn't appear to be used for anything.
A CHPTallyAndActions struct keeps track of matches with strings in the CHP search engines.
Each time a string from one of the CHP search engines matches, its associated CHPAction is added to the appropriate element in the chp_matches[] array. For example, during initialization CHPAddAction() added "facebook.com" to the CHP HOST_PT search engine. If "facebook.com" exists in the Host field of an HTTP header of a pseudo packet that AppId is inspecting, then the CHPAction is added to the HOST_PT element's linked list. In this way, AppId can determine if all of the required elements exist.
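The "all required strings must match" rule can be sketched as a tally per CHPApp. The Python below is a simplified model, not the CHP implementation: the tuple layout of chp_actions, the substring test, and the function name chp_match() are assumptions, while the three Facebook strings and the appid 629 come from this article.

```python
from collections import Counter

HOST_PT, URI_PT = "HOST_PT", "URI_PT"
FACEBOOK_APP = 629  # first argument to CHPCreateApp() in the Facebook example

# (CHPApp id, field searched, required string): one tuple per CHPAction.
chp_actions = [
    (FACEBOOK_APP, HOST_PT, "facebook.com"),
    (FACEBOOK_APP, URI_PT, "/login.php"),
    (FACEBOOK_APP, URI_PT, "email="),
]


def chp_match(fields):
    """Return the CHPApps for which every CHPAction string was found."""
    required = Counter(app for app, _, _ in chp_actions)
    tally = Counter()
    for app, field, pattern in chp_actions:
        if pattern in fields.get(field, ""):
            tally[app] += 1
    return [app for app, needed in required.items() if tally[app] == needed]


full = chp_match({HOST_PT: "www.facebook.com",
                  URI_PT: "/login.php?email=someone@example.com"})
partial = chp_match({HOST_PT: "www.facebook.com", URI_PT: "/notes"})
```

A request that matches only the host string leaves the tally short, so the CHPApp is not selected.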
In the end, the service_app_id, client_app_id, or payload_app_id field of the AppIdSession is set in accordance with the chosen CHPApp. The second argument to CHPCreateApp() determines which of these three is set. It's a bitwise operation. For example, if the second argument is 6, both the service_app_id (2) and the payload_app_id (4) will be set with the first argument to CHPCreateApp() (629 in our "facebook.com" example above). As discussed previously, the service_app_id, client_app_id, or payload_app_id will ultimately be passed down to the appropriate Flow fields.
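The bitwise interpretation of the second argument can be shown with a short decoder. In this Python sketch the values 2 (service_app_id) and 4 (payload_app_id) come from the text; the value 1 for client_app_id is an assumption made by analogy, and fields_set_by() is a hypothetical helper name.

```python
SERVICE_BIT = 2  # from the text
PAYLOAD_BIT = 4  # from the text
CLIENT_BIT = 1   # assumption: the remaining low bit selects client_app_id


def fields_set_by(mask):
    """List the AppIdSession fields that a CHPCreateApp() type mask would set."""
    names = []
    if mask & SERVICE_BIT:
        names.append("service_app_id")
    if mask & CLIENT_BIT:
        names.append("client_app_id")
    if mask & PAYLOAD_BIT:
        names.append("payload_app_id")
    return names


mask_six = fields_set_by(6)  # 6 == 2 | 4
```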
If no match is found with a CHPApp, the header is compared against client_agent_matcher (User-Agent field) and via_matcher (Via field). The client_agent_matcher search engine is populated by the C++ static_client_agent_patterns[] array and by calls to addHttpPattern() (e.g., for Google Update) while the via_matcher search engine is populated by the static_via_http_detector_patterns[] array.
If no match is found in the CHP search engines, the last thing checked is the URL, which will include the host and the path. The layout of the search engines allows for permutations in the path. For example, "disqus.com" has 2 possible payload_ids - "/" (payload_id=798) and "/woopra" (payload_id=1001).
So if "disqus.com" is found in the search engine corresponding to the Host field, a second search engine will be searched for "/" and "/woopra".
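The two-level host-then-path lookup can be sketched as nested dictionaries. This Python model is illustrative only: the host suffix test, the "longest matching path prefix wins" rule, and the function name payload_for() are assumptions; the host and the two payload ids are the ones from the "disqus.com" example.

```python
# host pattern -> {path prefix -> payload id}
url_patterns = {
    "disqus.com": {"/": 798, "/woopra": 1001},
}


def payload_for(host, path):
    """First match the host, then pick the most specific path prefix."""
    paths = None
    for pattern, candidates in url_patterns.items():
        if host == pattern or host.endswith("." + pattern):
            paths = candidates
            break
    if paths is None:
        return None
    matches = [prefix for prefix in paths if path.startswith(prefix)]
    if not matches:
        return None
    # Assumed tie-break: the longest matching prefix is the most specific one.
    return paths[max(matches, key=len)]


woopra = payload_for("disqus.com", "/woopra/stats")
generic = payload_for("www.disqus.com", "/home")
unknown_host = payload_for("example.com", "/")
```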