
Run Faster Log Searches With InsightIDR

While it could be true that life is more about seeking than finding, log searches are all about getting results: you need the right data back as quickly as possible. In this blog, let’s explore how to make the best use of InsightIDR’s Log Search capabilities to do exactly that.

First, you need an understanding of how Rapid7’s InsightIDR Log Search feature works. You may even want to review the Log Search documentation to familiarize yourself with recently released search functionality and with some of the nuances discussed here.

The basics

Let’s begin by looking at how the Rapid7 InsightIDR search engine extracts data. The search engine processes both structured and unstructured log data, extracting valuable fields into key-value pairs (KVPs) whenever possible. These normalized fields, or KVPs, allow you to search the data more efficiently.
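For instance (an illustrative log line, not from any particular product), a structured entry like:

action=ACCEPT src=10.1.2.3 dst=203.0.113.7 dport=443

would be extracted into the KVPs action, src, dst, and dport, each of which then becomes directly searchable.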

While the normalized fields of data are typically the same for similar types of logs, in InsightIDR they are not normalized across the product. That is, you’ll see the same extracted fields, or keys, pulled out for logs in the same Log Set, but the extracted fields and key names used in other Log Sets may be different.

As everyone who has spent any amount of time looking at log data knows, individual log entries can be all over the place. Some vendors have great logs that contain structured data with all the valuable information that you need, but not all products do this. Sometimes the logs consist, at least in part, of unstructured data. That is, the logs are not in KVP or cannot be easily broken into distinct fields of data.

The Rapid7 search engine automatically identifies keys in both structured and unstructured data and extracts the KVPs to make the data searchable. This allows you to search for any value or text appearing in your log lines without creating a dedicated index. That is, you can search with equal ease by specifying just text, like “ksmith” or “/.*smith.*/”, or by specifying the KVP – for example, “destination_account=ksmith”. However, is one of these searches better than the other? Let’s keep going to answer that question.

As InsightIDR is completely cloud-native, its architecture takes advantage of many cloud-native search optimizations, including shared resources and auto-scaling. For search performance, a number of specialized algorithms are used to search across millions of log lines. These include optimizations to find needle-in-a-haystack entries, statistical algorithms for specific functions (e.g., percentile), parallelization for aggregate operations, and regular expression optimizations. How quickly results are returned can vary based on the number of logs, the number of (matching) log lines in the time range, the particular query, and the nature of the data – e.g., the number of keys and values in a log line.

Did you know that statistics on the last 100 searches performed in your InsightIDR instance are available in the Settings section of InsightIDR? Go to Settings -> Log Search -> Log Search Statistics to view them. In addition to basic information, such as when the query completed and how long it took, you can also use the “index factor” provided there to determine how efficient your query is. The index factor is a value from 0 to 100 that represents how much the indexing technology was used during the search. The higher the index factor, the more efficient the search. The Log Search Statistics page is especially helpful if you want to optimize a query you will be running against a large data set or using frequently.

How to improve your searches

As you can see, the Log Search query performance can be influenced by a number of factors. While we discuss some general considerations in this section, keep in mind that for queries that run frequently, you may want to test out different options to find what works best for your logs.

General recommendations

Here are some of the best ways to speed up your log searches:

  • Specify smaller time ranges. Longer time ranges add to the amount of time the query will take because the search query is analyzing a larger number of logs. This can easily be several hundred million records, such as with firewall logs.
  • Search across a single Log Set at a time. Searching across different Log Sets usually slows down the search, as different types of logs may have different key-value pairs. That is, these searches often cannot be optimally indexed.
  • Add functions only as needed. Searches with only a where() clause are faster than searches with groupby() and calculate() (see the sketch after this list).
  • Use simple queries when possible. Simple queries return data faster. Complex queries with many parts to calculate are often slower.
  • Consider the amount of data being searched. Both the number of log entries that are being searched and the size of the log should be considered. As the logs are stored in key-value pairs, the more keys that the logs have, the slower they are to search.
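To make the “functions only as needed” point concrete, here is a minimal sketch (the key and value names are illustrative). A plain filter such as:

where(destination_account=ksmith)

does less work than the same filter with aggregation added:

where(destination_account=ksmith)groupby(source_asset)calculate(count)

Run the aggregate version only when you actually need the grouped counts.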

The old adage about deciding whether you want your result to be fast, cheap, or good applies here, too — except that with log searches, the triad that influences your results is speed, the amount of data to be searched, and the complexity of the query. If you are searching a Log Set with large logs, such as process start events, then you may have to decide which tradeoff makes the most sense: Should you run your search against a smaller time range but still use a complex query with functions? Would you rather search a longer time range but forgo the groupby() or calculate()? Or would you rather search a long time range using a complex query and simply wait for the search to complete?

If you need to search across Log Sets for a user, computer, IP address, etc., then maybe it makes more sense to build a dashboard with cards for the data points that you need instead of using Log Search. Use the filter option on the dashboard to specify the user, computer, etc. on which you need to filter. In fact, a great dashboard collection might just be your iced dirty chai latte, the combination that solves most of your log search challenges all at once. If you haven’t already done so, you may want to check out the new Dashboard Libraries, as more are being added every month, and they can make building out a useful dashboard almost instantaneous.

Specifying keys vs. free text

It is an interesting paradox of Log Search that specifying a key as part of the search does not always improve the search speed. That is, it is sometimes faster to use a key like “destination_account=ksmith”, but not always. When you specify a key-value pair to search, the log entries must be parsed for the search to complete, and this can be more time-consuming than a plain text search.

In general, when the term appears infrequently, running a text-based search (e.g. /ksmith/) is usually faster.

Also, you may get better results searching for only a text value instead of searching a specific key. That is, this query:

where(FAILED_OTHER)

… might be more efficient and run faster than this query:

where(result=FAILED_OTHER)

Of course, this only applies if the value will not be part of any of the other fields in the log entry. If the value might be part of other fields, then you will need to specify the key in order for the results to be accurate.

Expanding on this further, the more specific you can be with the value, the faster the results will be returned: specify as much of the text as possible. A search that contains a very specific value with no key specified is often the fastest way to search, although you should test this with your particular logs to see what works best.

The corollary is that partial-match searches tend to be slower than searches for a full value. That is, searching for /adm_ksmith/ will be faster than /adm_.*/. Case-insensitive searches, by contrast, are only slightly slower than case-sensitive ones. “Loose” searches — those that are both partial and case-insensitive — are slower, largely because of the partial match. However, these types of searches are usually not so slow that you should try to avoid them.
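For reference, case-insensitivity in LEQL is expressed with the i flag on a regular expression, as in:

where(/adm_ksmith/i)

This matches adm_ksmith, ADM_KSmith, and so on.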

Conversely, it is also sometimes the case that specifying a key rather than free text can greatly improve indexing and therefore reduce search times. This is particularly true if the term you are searching for appears frequently in the logs.

Additional log search tips

Here are some other ways to improve your search.

  • Check to see if a field exists before grouping on it. Some fields (the key part of the key-value pair) do not exist in every log entry. If you run groupby on a field and it doesn’t always exist, the query will run faster if you first verify that the field is part of the logs that are being grouped on. Example:

where(direction)groupby(direction)

  • Which logical operators you use can make a big difference. AND is recommended, as it filters the data, resulting in fewer logs that need to be searched. In other words, AND improves the index factor of the search. OR should be avoided if possible, as it matches more data and slows down the search; in general, less data can be indexed when the search includes an OR. Use common sense, though: depending on your search criteria, you may genuinely need OR.
  • Avoid using a no-equal whenever possible. When you search for specific text, the indexer can skip over chunks of log data and work efficiently, but to evaluate a “not equal to,” every entry must be checked. The no-equal expressions are NOT, !=, and !==. Again, use common sense, because your query may not work unless you use a no-equal.
  • The order that you specify text in the query is not important. That is, the queries are not evaluated left-to-right — rather, the entire query is first evaluated to determine how it can be best indexed.
  • Using a regular expression is usually no slower than a native LEQL search.

For example, a search like

where(/vpn asset.*/i)

… is a perfectly fine search.

However, using logical operators inside the regular expression will make the search slower for exactly the same reason they can slow down a regular search. In addition, logical operators — especially the pipe (“|”), which is a logical OR — can be more impactful in regular expression searches, as they prevent the logs from being indexed. For example, a query like this:

where(geoip_country_name=/China|India/)

… should be avoided if possible. Instead, use this query:

where(geoip_country_name=/China/ OR geoip_country_name=/India/)

You could also use the functions IN or IIN:

where(geoip_country_name IN [China,India])
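IIN is the case-insensitive version of IN, so if the case of the values might vary, this form works as well:

where(geoip_country_name IIN [china,india])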

To summarize how the indexing works, let’s look at a Log Search query that I have constructed:

where(direction=OUTBOUND AND connection_status!=DENY AND destination_port=/21|22|23|25|53|80|110|111|135|139|143|443|445|993|995|1723|3306|3389|5900|8080/)groupby(geoip_organization)

Should this query be optimized? The first thing that the log search evaluator will do is to determine if any of the search can be indexed.

In looking at the components of the search, it has three computations that are all being ANDed together: “direction=OUTBOUND,” “connection_status!=DENY,” and then the port evaluation. Remember, AND is good since it can reduce the amount of data that must be evaluated. “direction=OUTBOUND” can be indexed and will reduce the amount of data against which the other computations must be run. “connection_status!=DENY” cannot be indexed since it contains “not equal” — in other words, every log entry must be checked to determine if it contains this KVP. However, the connection_status computation is vital to how this query works, and it cannot be removed.

Is there a way to optimize this part of the query? The “connection_status” key has only two possible values, so the no-equal can easily be changed to an equal statement. Also, not all firewall logs have this field, so we can add a check that the field exists to the query. Finally, the destination_port search is not optimal, as it contains a long series of OR computations. This computation is also an important criterion for the search and cannot be removed. However, it can be improved by replacing the regular expression with the IN function.

where(direction=OUTBOUND AND connection_status AND connection_status=ACCEPT AND destination_port IN [21,22,23,25,53,80,110,111,135,139,143,443,445,993,995,1723,3306,3389,5900,8080])groupby(geoip_organization)

Will this change improve the search greatly? The best way to find out is to test the searches with your own log data. Keep in mind, however, that “direction=OUTBOUND” will be evaluated first, because it can be indexed. Since in these particular logs (firewall logs) this first computation greatly reduces the number of log entries left to be evaluated, other optimizations to the query will not greatly enhance the speed of the search. That is, in this particular case, both queries take about the same amount of time to complete.

However, the search might run faster without any keys specified. Could I remove them and speed up my search? Given the nature of the search, I do need to keep “connection_status” and “destination_port” as the values in these fields can occur in other parts of the logs. However, I could remove “direction” and run this search:

where(OUTBOUND AND connection_status!=DENY AND destination_port IN [21,22,23,25,53,80,110,111,135,139,143,443,445,993,995,1723,3306,3389,5900,8080])groupby(geoip_organization)

In fact, this query runs about 30% faster than the versions with the “direction=” key specified.

Let’s look at a second example. I want to find all the failed authentications for all the workstations on my 10.0.2.0 subnet. I can run one of these three searches:

where(source_asset_address=/10\.0\.2\..*/ AND result!=SUCCESS)groupby(source_asset_address)

where(source_asset_address=/10\.0\.2\..*/ AND result=/FAILED.*/)groupby(source_asset_address)

where(source_asset_address=/10\.0\.2\..*/ AND result IN [FAILED_BAD_LOGIN,FAILED_BAD_PASSWORD,FAILED_ACCOUNT_LOCKED,FAILED_ACCOUNT_DISABLED,FAILED_OTHER])groupby(source_asset_address)

Which one is better? Since the first one uses a not-equal as part of the computation, the percentage of the search data that can be indexed will be lower than for the other two searches. The second search, however, has a partial match (/FAILED.*/) versus the full match of the first, and partial searches are slower than specifying all the text to be matched. Finally, the third search avoids both the no-equal and the partial match by using the IN function to list all the valid matches.

As you might have guessed, the third search is the winner, completing slightly faster than the first search but more than twice as fast as the second one. If you are searching a large set of data over a long period of time, the third search is definitely the best one to use.

How data is returned to the Log Search UI

Finally, although it is not related to log search speed, you might be curious about how data gets returned to the Log Search UI. As the query runs, as long as there are no errors, it continues to pull back data for the search. For searches that do not contain groupby() or calculate(), results are returned to the UI as the search runs. However, if groupby() or calculate() is part of the query, these functions are evaluated against the entire search period, so partial results are not possible.

If the search results cannot be returned because of an error, such as a search that cannot be computed or a rate-limiting error with a groupby() or calculate() function, then instead of the data being returned, you will see an error in the Log Search UI.

Hopefully, this blog has given you a better sense of how the Log Search search engine works and provided you with some practical tips, so you can start running faster searches.


Introducing the Manual Regex Editor in IDR’s Parsing Tool: Part 2


I have logs on my mind right now, because every spring, as trees that didn’t survive the winter are chopped down, my neighbor has truckloads of them delivered to his house. All the logs are eventually burned up in his sugar house and used to make maple syrup, and it reminds me that I have some logs I’d like to burn to the ground, as well!

If you’re reading this blog, chances are you probably have some of these ugly logs that are messy, unstructured, and consequently, difficult to parse. Rather than destroy your ugly logs, however tempting, you can instead use the Custom Parsing tool in InsightIDR to break them up into usable fields.

Specifically, I will discuss here how to use Regex Editor mode, which assumes a general understanding of regular expression. If you aren’t familiar with regular expression or would like a short refresher, you may want to start with Introducing the Manual Regex Editor in IDR’s Parsing Tool: Part 1.

Many logs follow a standard format and can be easily parsed with the Custom Parsing tool using the default guided mode, which is quite easy to use and requires no regular expression. You can read more about using guided mode in the Custom Parsing Tool documentation.

If the logs parse well with guided mode, you will likely use it to parse your logs. However, for those logs that lack good structure, common fields, or just do not parse correctly using guided mode, you may need to switch to Regex Editor mode to parse them.

Example 1: Key-Value Pair Logs Separated With Keys

Let’s start by looking at some logs in a classic format, which may not be obvious at first glance.

The first part of these logs has the fields separated by pipes (“|”). Logs that have a common field, like a pipe, colon, space, etc., separating the values are usually easy to parse and can be parsed out in guided mode.

However, the problem with these logs is that the last part contains a string that’s separated by literal strings of text rather than a set character type. For example, if you look at the “msg” field, it is an unstructured string that might contain many different characters — the end of the value is determined by the start of the next key name, “rt”.

May 26 08:25:37 SECRETSERVER CEF:0|Thycotic Software|Secret Server|10.9.000033|500|System Log|3|msg=Login Failure - ts2000.com\hfinn - AuthenticationFailed rt=May 26 2021 08:25:37 src=10.15.4.56 Raw Log/Thycotic Secret Server
May 26 08:25:37 SECRETSERVER CEF:0|Thycotic Software|Secret Server|10.9.000033|500|System Log|3|msg=Login attempt on node SECRETSERVER by user hfinn failed: Login failed. rt=May 26 2021 08:25:37 src=10.15.4.56

Let’s see how parsing these logs with the Custom Parsing Tool works. I have followed the instructions at https://docs.rapid7.com/insightidr/custom-parsing-tool/ and started parsing out my fields. Right away, you can see I’m having a problem parsing out the “msg” field in guided mode. It’s not working like I want it to!

[screenshot]

This is where it is a good idea to switch to Regex Editor mode rather than continuing in guided mode. As such, I have selected the Regex Editor, and you can see that displayed here:

[screenshot]

Whoa! This regex might look daunting at first, but just remember from our previous lesson that you can use an online regex cheat sheet or expression builder to make sense of any parts you find confusing or unclear.

Here is what the Regex Editor has so far, based on the two fields I added from guided mode (date and message):

^(?P<date>(?:[^:]*:){2}[^ ]+)[^=]*=(?P<message>[^d]*d)

The regular expression for the field I decided to name “message” is the part that’s not working like I want. So, I need to edit this part: (?P<message>[^d]*d).

Remember, this is a capture group, so the regular expression following the field name, “<message>”, will be used to read the logs and extract the values for each one. In other words, in Log Search, the key-value pair will be extracted out from the logs as follows:

(?P<key>regex that will extract the value)

“[^d]*d” is not working, so let’s figure out how to replace that. There are a lot of ways to go about crafting a regular expression that will extract the field. For example, one option might be to extract every character until you get to the next literal string:

^(?P<date>(?:[^:]*:){2}[^ ]+)[^=]*=(?P<message>.*)rt=

This works, but it is somewhat inefficient for parsing. It’s outside our scope here to discuss why, but in general, you should not use the dot star, “.*”, for parsing rules. It is much more efficient to define what to match rather than what to not match, so whenever possible, use a character class to define what should be matched.

Create the character class and put all possible characters that appear in the field into it:

^(?P<date>(?:[^:]*:){2}[^ ]+)[^=]*=(?P<message>[\w\s.\\-]+)

These Thycotic logs have a lot of different characters appearing in the “msg” field, so defining a character class like I’m doing here, “[\w\s.\\-]” (with the dash last, so it is read as a literal dash rather than a range), is a bit like playing pin the tail on the donkey, in that I hope I get them all!

Let’s look at another way to extract a text string when it does not have a standard format. Remember that “\d” will match any digit character. Its opposite is “\D”, which matches any non-digit character. Therefore, “[\d\D]+” matches any digit or non-digit character:

^(?P<date>(?:[^:]*:){2}[^ ]+)[^=]*=(?P<message>[\d\D]+)rt=

One thing to point out here is that, although defining the specific values in the character class was a bit futzy to extract the “msg” field, this method works very well when parsing out host and user names.

In most cases, host and user names will contain only letters, numbers, and the “-” symbol, so a character class of “[\w\d-]” works well to extract them. If the names also contain a FQDN, such as “hfinn.widgets.com”, then you also need to extract the period: “[\w\d\.-]”.
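Putting this together with the capture group syntax from earlier, a hostname field (the key name “source_host” here is just an example) could be extracted with:

(?P<source_host>[\w\d\.-]+)

The character class covers everything we expect in a hostname or FQDN, and the “+” reads one or more of those characters until something outside the class, such as a space or tab, ends the match.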

Example 2: Unstructured Logs

Let’s look at some logs that are challenging to parse due to the field structure.

Logs that do not have the same fields in every log and that do not have standard breaks to delineate the fields, while nicely human readable, are more difficult for a machine to parse.

<14>May 27 16:25:31 tkxr7san01.widgets.local MSWinEventLog	1	 Microsoft-Windows-PrintService/Operational	5793	Mon May 27 16:25:31 2021 	805	Microsoft-Windows-PrintService	hfinn	User	Information	 tkxr7san01.widgets.local	Print job diagnostics		Rendering job 71.	 2447840
<14>May 27 16:25:31 tkxr7san01.widgets.local MSWinEventLog	1	 Microsoft-Windows-PrintService/Operational	5794	Mon May 27 16:25:31 2021 	307	Microsoft-Windows-PrintService	hfinn	User	Information	 tkxr7san01.widgets.local	Printing a document		Document 71, Print Document owned by hfinn on lpt-hfinn01 was printed on XEROX1001 through port XEROX1001. Size in bytes: 891342. Pages printed: 2. No user action is required.	2447841

Let’s take a look at how we can parse out the fields in this log using guided mode. When you are using the Custom Parsing Tool, one of the first things you need to decide is if you want to create a filter for the rule:

[screenshot]

With logs, the device vendor decides what fields to create and how the logs will be structured. Some vendors create their logs so all have the same fields in every log line, no matter what the event. If your logs look like this, then you will not need to use a filter.

Other vendors define different fields depending on the type of event. In this case, you will probably need to use a filter and create a separate rule for each type of event you want to parse. Another reason to use a filter is if you only want to parse out one type of event.

Looking at my Microsoft print server logs closely, you can see that the second log has quite a few more fields than the first one: document name printed, the document owner, what printer it was printed on, size, pages printed, and if any user action is required. As such, I’m going to need to use filters and create more than one rule here.

The filter should be a literal string that is part of the logs I want to parse. In other words, how can the parsing tool know which logs to apply this rule to? Let’s start with the first type of log:

<14>May 27 16:25:31 tkxr7san01.widgets.local MSWinEventLog	1	 Microsoft-Windows-PrintService/Operational	5793	Mon May 27 16:25:31 2021 	805	Microsoft-Windows-PrintService	hfinn	User	Information	 tkxr7san01.widgets.local	Print job diagnostics		Rendering job 71.	 2447840

Its type is “Print job diagnostics”, so that seems like a good string to match on. I will use that for my filter.

[screenshot]

Still in the default guided mode, I’ll start extracting out the fields I want to parse. I don’t need to parse them all, just the ones I care about. As I am working my way through the log, I find that I am not able to extract the username like I want:

[screenshot]

I am going to continue for the moment, however, using guided mode to define the last field I need to add.

[screenshot]

Let’s pause for a moment. I’ve created four fields. Three of them are fine, but the “source_user” field is not parsing correctly. I am going to switch now to Regex Editor mode to fix it.

[screenshot]

The regex created by the Custom Parsing Tool in guided mode is:

^(?:[^ ]* ){3}(?P<source_host>[^ ]+)(?:[^ ]* ){4}(?P<datetime>[^	]+)[^m]+(?P<source_user>[^ ]+)[^P]+(?P<action>[^	]+)

The only part I need to look at is the capture group for the field I called “source_user”:

(?P<source_user>[^ ]+)

With that said, the issue with the rule could be somewhere else in the parsing rules. However, let’s just start with the one capture group. Let’s interpret the character class first: “[^ ]”.  

When the hat symbol (“^”) appears in a character class, it means “not”. The hat is followed by a space, so the character class is “not a space”. Therefore, “[^ ]+” means “read in values that are not spaces” or “read until you get to a space”.  

Looking at the entire parsing rule, you can see it is counting the number of spaces to define what goes into each field. This would work out fine if spaces were the field delimiters, but that’s not how these logs work. The logs are a bit unstructured in the sense that some of the fields are defined by literal strings and others are just literal strings themselves.

Also, guided mode had a few too many beers while trying to cope with these silly logs and decided that the “source_user” field should always start with the letter “m”:

[^m]+(?P<source_user>[^ ]+)  

Oops! We don’t want that! Let’s get rid of it, plus the “[^P]+”, which means to read everything up to a literal capital P: this is how the rule rolls past everything to “Print job diagnostics”, but we can do better than that.

As humans, we know that we want the “action” field to be the literal string “Print job diagnostics”, which the Custom Parsing Tool doesn’t know. Let’s just fix these few things first. I made these changes, clicked on Apply to test them, and got an error:

[screenshot]

This error means I’ve goofed, and the rule does not match my logs. I know I’m on the right path, though, so I’m going to continue. The problem here is with how the regex is going from “datetime” to “source_user” and then to “action”.

Let’s stop for a moment to look at this regex:

(?:[^ ]* ){3}

The “(?P<keyname>)” structure that we’ve been using is a capture group. The “(?:)” structure is a non-capture group: it means the regex should read this but not actually capture it or do anything with it. It’s also how we are going to skip past the fields we don’t want to parse. The “{3}” means “three times”. Of course, we have already seen that “[^ ]*” means “not a space” or “read until you get to a space”. So, the whole non-capture group “(?:[^ ]* ){3}” means “read everything to the next space, three times” or “skip past everything in the log until you have read past three spaces”.

Now, let’s look at an actual log:

[screenshot]

The last field we read in was “datetime”, and then, we need to skip over to the “source_user” field. Let’s try to do that by skipping past the three spaces until we get there.

Next, from the “source_user” to the last field, “action”, there are four spaces. Here is my regex:

^(?:[^ ]* ){3}(?P<source_host>[^ ]+)(?:[^ ]* ){4}(?P<datetime>[^ ]+)(?:[^ ]* ){3}(?P<source_user>[^ ]+)(?:[^	]* ){4}(?P<action>Print job diagnostics)

I have added “(?:[^ ]* ){3}” to skip past 3 spaces and done the same thing later to skip past 4 spaces, using a “{4}” to denote 4 spaces. Let’s see if it works:

[screenshot]

The tool seems to be happy with this, as all the fields appear correctly, so I will go ahead with applying and saving this rule.

If you are at this spot with your logs but the tool is not happy and you are getting an error, I have a couple of tips for you to try:

  • Sometimes, the tool works better if you do parse every field, even if you do not particularly care about them. Try parsing every field to see if that works better for you.
  • Occasionally, it is easier to just parse out one (or maybe two or three) fields per rule. This is especially true if the logs are very messy and the fields have little structure. If you are really stuck, try to parse out just one field at a time. It is okay to have several parsing rules per log type, if necessary.
  • Try to proceed one field at a time, if possible. Get the one field extracted correctly, and then proceed to the next one.

When you create a parsing rule, it will apply to new logs that are collected. I have waited a bit for more logs to come in, and I can see they are now parsing as expected.

[screenshot]

Now, I need to create a second rule for the next type of log. Here is what those logs look like:

<14>May 27 16:25:31 tkxr7san01.widgets.local MSWinEventLog	1	 Microsoft-Windows-PrintService/Operational	5794	Mon May 27 16:25:31 2021 	307	Microsoft-Windows-PrintService	hfinn	User	Information	 tkxr7san01.widgets.local	Printing a document  Document 71, Print Document owned by hfinn on lpt-hfinn01 was printed on XEROX1001 through port XEROX1001. Size in bytes: 891342. Pages printed: 2. No user action is required.	2447841

When you create a second (or third, fourth, etc.) parsing rule for the same logs, the Custom Parsing Tool does not know about the previous rules. You will need to start any additional rules just as you did the first one.

Also, just like before, as I am creating the parsing rule, I will need to apply a filter to match these logs. The type of log is “Printing a document”, so I will use that as the filter.

Again, I will start in guided mode and define the fields I want to start parsing out — it isn’t required to start in guided mode but, sometimes, that is easier. I defined a few fields, and as you can see, the parsing is not working like I need.

[screenshot]

Now that I have the fields defined, I’ll switch to Regex Editor mode.

[screenshot]

The regex that was generated in guided mode is:

^(?:[^ ]* ){4}(?P<datetime>[^ ]+)[^h]{36}(?P<source_user>[^ ]+)\D{18}(?P<source_host>[^	]+) (?P<action>[^ ]+)	(?P<printed_document>(?:[^ ]* ){3}[^ ]+)(?:[^ ]* ){6}(?P<owner>[^ ]+)

Just like before, I am going to clean this up a bit, starting with the first field and working from left to right, modifying the regex to parse out the keys like I want.

The first part of the regex, “^(?:[^ ]* ){4}(?P<datetime>[^ ]+)”, says to skip past the first four spaces and read the next part into the datetime key until a fifth space is read. This first part is fine and parsing out the field as needed, so let’s move on to the next part, “[^h]{36}(?P<source_user>[^ ]+)”.  

For simplicity and functionality, I am going to swap out the way the regex is skipping over to the username, “[^h]{36}”, with “(?:[^ ]* ){3}”.  This is using the same logic as the first rule: skip past three spaces to find the next value to read in.  

[screenshot]

These first two fields are working, so let’s move to the next one, “source_host”. The regex for skipping over to this field and parsing it is: “\D{18}(?P<source_host>[^ ]+)”.

While this might look odd, the “\D{18}” part is how regex skips over the literal string “User Information”.  We looked at “\D” previously; the “\D” means to match “any non-digit character”, and the “{18}” means “18 times”. In other words, skip forward 18 non-digit characters. This is working well, and there is no need for any tinkering.

The next field is “(?P<action>[^ ]+)”, and just like the previous one, this field is properly parsed out. If a field is properly parsed, then there is no need to make any changes.

Now, I am getting to the messiest field in the logs, which is the name of the document that was printed. You can see that the field should contain all the text between “Printing a document” and “owned by”, and it is not being correctly parsed with the auto-generated regex.

[screenshot]

Hey, we humans are still better at some things than machines! While this field is easy for us to make out, machine-generated regex has difficulty with these types of fields.

Let’s look at the current method that regex is using. The field is parsed with “(?P<printed_document>(?:[^ ]* ){3}[^ ]+)”, which means the parser is counting spaces to try to determine the field, and that just doesn’t work here. We need to craft a regular expression that will parse everything until we get to “owned by”, no matter what it is.

The best way to do this is to create a character class that includes all the possible characters that might be in the field to be extracted. Because a document can contain any possible character, I am going to use the class “[\d\D]”, which I used previously. To say “followed by the literal string owned by”, I need to be careful and take into account the spaces.

Spaces can be a little tricky in that, sometimes, what your eyes perceive as one space is actually several. To be on the safe side, instead of specifying one space with “\s”, you might want to use “\s+”. You could also put an actual space there, which is what the auto-generated regex has; however, I prefer the “\s+” notation, as it makes clear that I want to match one or more whitespace characters.

(?P<printed_document>[\d\D]+)\s+owned\s+by\s+

Before continuing, I am checking to make sure the “printed_document” field is parsed properly.  

[screenshot]

It all looks good! If it didn’t, I would fiddle with the parsing until it did. The last field, “owner”, is parsing correctly as well.

Drat! I need to extract one more field, and I forgot to define it in guided mode. That’s okay, however, because I can go ahead and add it right now. All I need to do is to create a named capture group for it.

We looked at the syntax for a named capture group earlier:

(?P<key>regex that will extract the value)

I want to collect how many pages were printed, so let’s look at our logs again. I’m working on this part of the log:

owned by mstewart on \\L-mstewart03 was printed on Adobe PDF through port Documents\*.pdf. Size in bytes: 0. Pages printed: 13.

I need to skip forward from the “owned by” value all the way over to “Pages printed”. The fields in between might have different numbers of spaces, characters, etc., which means I can’t count spaces like I did before. I can’t count the number of characters, either.

Let’s use our old friend “[\d\D]”:    

(?:[\d\D]+)Pages printed:\s+(?P<pages_printed>\d+)

The tool seems happy with this, as there are no errors when I click on Apply, and the fields in the Log Preview appear to be correct.

When I apply and save this rule, I wait for a few more logs to come in to see if the parsing rule is working like I want. In Log Search, here is a parsed log line:

[screenshot]

You might be wondering what the “\t” is all about. The “\t” is a tab character, and I’m not quite satisfied with my parsing rule: I didn’t spot when I saved it that whitespace is being captured at the beginning of the field. If you find, after you save a rule, that it isn’t working like you want, you can either delete it and start over or just modify the existing rule.

I am going to modify my rule and fix it. Here is my final regex with the extra spaces accounted for:

^(?:[^ ]* ){4}(?P<datetime>[^ ]+)(?:[^	]* ){3}(?P<source_user>[^	]+)\D{18}(?P<source_host>[^	]+) (?P<action>[^ ]+)\s+(?P<printed_document>[\d\D]+)\s+owned\s+by\s+(?P<owner>[^ ]+)(?:[\d\D]+)Pages printed:\s+(?P<pages_printed>\d+)

And that’s it! With this guide, you have now learned how to use the Custom Parsing tool in InsightIDR to break up ugly logs into usable fields.



Introducing the Manual Regex Editor in IDR’s Parsing Tool: Part 1


New to writing regular expressions? No problem. In this two-part blog series, we’ll cover the basics of regular expressions and how to write regular expression statements (regex) to extract fields from your logs while using the custom parsing tool. Like learning any new language, getting started can be the hardest part, so we want to make it as easy as possible for you to get the most out of this new capability quickly and seamlessly.

The ability to analyze and visualize log data — regardless of whether it’s critical for security analytics or not — has been available in InsightIDR for some time. If you prefer to create custom fields from your logs in a non-technical way, you can simply head over to the custom parsing tool, navigate through the parsing tool wizard to find the “extract fields” step, and drag your cursor over the log data you’d like to extract to begin defining field names.

The following guide will give you the basic skills and knowledge you need to write parsing rules with regular expressions.

What Are Regular Expressions?

In technical applications, you occasionally need a way to search through text strings for certain patterns. For example, let’s say you have these log lines, which are text strings:

May 10 12:43:12 SECRETSERVERHOST CEF:0|Thycotic Software|Secret Server|10.9.000002|500|System Log|7|msg=The server could not be contacted. rt=May 10 2021 12:43:12
May 10 12:43:41 SECRETSERVERHOST CEF:0|Thycotic Software|Secret Server|10.9.000002|500|System Log|7|msg=The RPC Server is unavailable. rt=May 10 2021 12:43:41

You need to find the message part of the log lines, which is everything between “msg=” and “rt=”.  With these two log lines, I might hit the easy button and just copy the text manually, but clearly, this approach won’t work if I have hundreds or thousands of lines from which I need to pull the field out.

This is where regular expression, often shortened to regex, comes in. Regex gives you a way to search through text to match patterns, like “msg=”, so you can easily pull out the text you need.

How Does It Work?

I have a secret to share with you about regular expression: It’s really not that hard. If you want to learn it in great depth and understand every feature, that’s a story for another day. However, if you want to learn enough to parse out some fields and get on with your life, these simple tips will help you do just that.

Before we get started, you need to understand that regex has some rules that must be followed.  The best mindset to get the hang of regex, at least for a little while, is to follow the rules without worrying about why.

Here are some of the basic regular expression rules:

  • The entire regular expression is wrapped with forward slashes (“/”).
  • Pattern matches start with backslashes (“\”).
  • It is case-sensitive.
  • It requires you to match every character of the text you are searching.
  • It requires you to learn its special language for matching characters.

The special language of regular expression is how the text you are searching is matched. You need to start the pattern match with a backslash (“\”). After that, you should use a special character to denote what you want to match. For example, a letter or “word character” is matched with “\w” and a number or “digit character” is matched with “\d”.

If we want to match all the characters in a string like:

cat

We can use “\w”, as “\w” matches any “word character” or letter, so:

\w\w\w

This matches the three characters “c”, “a”, and “t”. In other words, the first “\w” matches the “c” character; “\w\w” matches “ca”; and “\w\w\w” matches “cat”.

As you can see, “\w” matches any single letter from “a” to “z” and also matches letters from “A” to “Z”. Remember: Regex is case sensitive.

“\w” also matches any number. However, “\w” does NOT match spaces or other special characters, like “-”, “:”, etc. To match other characters, you need to use their special regex symbols or other methods, which we’ll explore here.

Getting Started With Regex

Before we keep going, now is a good time to take a few minutes to find a regex “cheat sheet” you like.

Rapid7 has one you can use: https://docs.rapid7.com/insightops/regular-expression-search/, or you may have a completely different one you prefer. Whatever the case is for you, these guides are helpful in keeping track of all your matching options.

While we’re at it, let’s also find a regex testing tool we can use to practice our regex. https://regex101.com/ is very popular, as it is a tool and cheat sheet in one, although you may find another tool you want to use instead.

InsightIDR supports the version of regex called RE2, so if your parsing tool supports the Golang/RE2 flavor, you might want to select it to practice the specific flavor InsightIDR uses.
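If you would rather test locally, Go’s regexp package implements RE2, so a few lines of Go exercise exactly the flavor InsightIDR uses. Here is a minimal sketch, reusing the “msg” example from later in this post:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // A sample log line, borrowed from the examples in this post.
    line := "msg=The server could not be contacted. rt=May 10 2021 12:43:12"

    // Named capture group in RE2 syntax: read word characters, whitespace,
    // and periods between the literal anchors "msg=" and "rt=".
    re := regexp.MustCompile(`msg=(?P<msg>[\w\s.]+)rt=`)

    if match := re.FindStringSubmatch(line); match != nil {
        // SubexpNames lines up with the submatch indexes, so each named
        // group can be printed alongside its captured value.
        for i, name := range re.SubexpNames() {
            if name != "" {
                fmt.Printf("%s = %q\n", name, match[i]) // msg = "The server could not be contacted. "
            }
        }
    }
}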

To follow along with me, open your preferred tool for testing regex. Enter in some text to match and some regex, and see what happens!

[screenshot]

Let’s look at another way to match the string “cat”. You can use literals, which means you just enter the character you want to match:

[screenshot]

This means you literally want to match the string “cat”. It matches “cat” and nothing else.  

Let’s look at another example. Say I need to match the string:

san-dc01

As we saw earlier, you can use “\w” to match the word characters. To match a number, you can use “\w” or “\d”.  “\d” will match any number or “digit”. However, how can you match the “-”?

The dash (“-”) is not a word character, so “\w” does not match it. In this case, we can tell regex we want to match the “-” literally:

\w\w\w-\w\w\d\d

There are other options, as well. The dot or period character (“.”) in regex means to match any single character. This works to parse out the string “san-dc01”, as well:

\w\w\w.\w\w\d\d

While this works, it is tedious typing all these “\w”s. This is where wildcards, sometimes called regex quantifiers, come in handy.

The two most common are:

* match 0 or more of the preceding character

+ match 1 or more of the preceding character

“\w*” means “match 0 or more word characters”, and “\w+” means “match 1 or more word characters”.

Let’s use these new wildcards to match some text. Say we have these two strings:

cat

san-dc01

I want one regex pattern that will match both strings. Let’s match “cat” first. The regex we used previously:

\w\w\w

matches the string, so you can see that using this wildcard would work, too:

\w+

Now, let’s look at matching “san-dc01”. I can use this:

\w+-\w+

This means “match as many word characters as there are, followed by a dash, and then followed by as many word characters as there are”. However, while this matches “san-dc01”, it does not match “cat”. The string “cat” has no “-” followed by characters.

The regex we added, “-\w+”, only matches a string if the “-” character is part of the string. In addition, “\w+” means “match one or more word characters”. In other words, “\w+” means “match at least one word character up to as many as there are”. As such, we need to use “\w*” here instead to specify that the “dc01” part of the string might not always exist. We also need to use “-*” to specify that the “-” might not always exist in the string we need to match, either.

Therefore, this should work to parse both strings:

\w+-*\w*

By now, you may have noticed something else important about regex: There are usually many different patterns that will match the same text.

Sometimes, I find that people get snooty about their regex, and these people might mock you if they think you could have crafted a shorter pattern or a more efficient one. A pox on their house!  Don’t worry about that right now. It’s more important that your regex pattern works than it is that it be short or impressively intricate.

Let’s look at another way to match our strings: You can use a character class for matching.

The character class is defined by using the square brackets “[“ and “]”. It simply means you want regex to match anything included in your defined class.

This is easier than it sounds! Since our strings “cat” and “san-dc01” contain characters that match either “\w” or the literal “-”, our character class is “[\w-]”.

Now, we can use the “+” to specify that our string has one or more characters from the character class:

[\w-]+

Additional Regular Expressions for Log Parsing

Besides “\w” and “\d”, I have a few more regular expressions I want you to pay close attention to. The first one is “\s”, which is how you match whitespaces.

“\s” will match any whitespace character, and “\s+” will match one or more whitespace characters.

Next, remember that the dot (“.”) will match any character. The dot becomes especially powerful when you combine it with the star (“*”). Remember: The star means to match 0 or more of the preceding character. Therefore, the “dot star” (“.*”) will match any characters as many times as they appear, including matching nothing. In other words, “.*” matches anything.

Finally, let’s look at special uses of the circumflex character, which is often just called the hat (“^”).

The hat has two completely different uses, which should not be confused. First, when used by itself, the hat in regex designates where the beginning of the line starts. For example, “^\w+” means that the line must start with a word.

The second use of the hat is when it appears in a character class. If you use the hat character when defining a character class, it means “everything except” as in “match everything except these characters”. In other words, “[\d]+” would match any digit character, while “[^\d]+” means to do the opposite or match everything except for any digit character!

Log Parsing Examples

Let’s go back to where we started, trying to parse out the msg field from our logs:

May 10 12:43:12 SECRETSERVERHOST CEF:0|Thycotic Software|Secret Server|10.9.000002|500|System Log|7|msg=The server could not be contacted. rt=May 10 2021 12:43:12
May 10 12:43:41 SECRETSERVERHOST CEF:0|Thycotic Software|Secret Server|10.9.000002|500|System Log|7|msg=The RPC Server is unavailable. rt=May 10 2021 12:43:41

Have you copied and pasted these log lines into your regex tester? If not, go ahead and do so now.

We need to parse a literal string “msg=”. These literals in log lines are often the “key” part of a key-value pair and are sometimes called anchors instead of literals, since they are the same in every log line. To parse them, you would usually specify the literal string to match on.

Next, we need to read the value that follows. You have a few different approaches you can use here. A common way to parse the value is to read everything that follows until the next literal or anchor. Remember: There are many ways to do this, but your regex might look like this:

msg=.*rt=

By the way, if you are familiar with regex, you know that the greedy “*” creates inefficient parsing rules, but let’s not worry too much about that right now. The skinny on this, however, is that you should never use the dot star (“.*”) for parsing rules. It is useful for searches and trying to figure out log structure, though.

Another way to read the value is to use a character class:

msg=[\w\s\.]+rt=

Let’s break the character class down to determine exactly what is specified. “\w” means match any word character. “\s” means to match any space. We also need to match a literal period, since that appears in the msg value, but the period or dot has a special meaning in regex. When characters have special meaning in regex, like the slashes, brackets, dot, etc., they need to be “escaped”, which you do by putting a backslash (“\”) in front of them. Therefore, to match the period, we need to use “\.” in the character class.

Remember: Defining a character class means you want to match any character defined in the class, and the “+” at the end of the class means to “match one or more of these characters”. In that case, “[\w\s\.]+” means “match any word character, any space, or a period as many times as it occurs”. The matching will stop when the next character in the sequence is not a word character, a space, or a period OR when the next part of the regex is matched. The next part of the regex is the literal string “rt=”, so the regex will extract the “[\w\s\.]+” characters until it gets to “rt=”.

Finally, there is just one more piece of regex syntax that is helpful to understand when using regex with InsightIDR, and that is the capture group. A capture group is how you define key names in regex. Capture groups can actually do much more than this, but let’s narrow our focus to just what we need to know for using them with InsightIDR. To specify a named capture group for our purposes, use this syntax:

(?P<keyname>putyourregexhere)

The regex you put into the capture group is what is used to read the value that is going to match the “keyname”. Let’s see how this works with our logs.

Say we have some logs in key-value pair (KVP) format, and we want to both define and parse out the “msg” key. We know this regex works to match our logs: “msg=[\w\s\.]+rt=”. Now, we need to take this one step further and define the “msg” key and its values. We can do that with a named capture group:

msg=(?P<msg>[\w\s\.]+)rt=

[screenshot]

Let’s break this down. We want to read in the literal string “msg=” and then place everything after it into a capture group, stopping at the literal string “rt=”. The capture group defines the key, which you can also think of as a “field name”, as “msg”: “(?P<msg>”.

If we wanted to parse the field name as something else, we could specify that in between the “<>”. For example, if we wanted the field name to be “message” instead of “msg”, we would use: “(?P<message>”.

The regex that follows is what we want to read for the value part of our key-value pair. In other words, it is what we want to extract for the “msg”. We already know from our previous work that the character class “[\w\s\.]” matches our logs, so that’s what is used for this regex.

In this blog, we explored the regex syntax we need for InsightIDR and used a generic tool to test our regular expressions. In the next blog, we’ll use what we covered here to write our own parsing rules in InsightIDR using the Custom Parsing Tool in Regex Editor mode.

