Dot Net For All

Parsing a Log File Using Regular Expressions

In this article I will discuss regular expressions in C#. I will use a regular expression to parse a log text file, as a practical example that gives a better understanding of how regular expressions work. I will also list the most commonly used patterns.

 

What is a Regular Expression?

Regular expressions are used to identify character patterns. They can be used to validate text input such as passwords and phone numbers, or to parse textual data into a more structured format.

In C# we use the Regex class, which is present in the System.Text.RegularExpressions namespace.

One of the most common regular expression constructs is the ? quantifier. It matches the preceding item 0 or 1 times. An item is a single character, a character class in square brackets, or a group in parentheses.

In the code below I am looking for 0 or 1 occurrences of the ‘e’ character before the final ‘r’ in the string.

            Console.WriteLine(Regex.Match("Developer", "Develope?r").Success); //true
            Console.WriteLine(Regex.Match("Developr", "Develope?r").Success);//true
            Console.WriteLine(Regex.Match("Developeer", "Develope?r").Success);//false

Regex.Match searches within a larger string. It returns a Match object which has Success, Index, Length and Value properties.
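To make these properties concrete, here is a minimal sketch that prints them for one match. The class name MatchPropertiesDemo, the helper FindFirst and the input string are my own choices for illustration, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchPropertiesDemo
{
    // Hypothetical helper for this sketch: returns the first match of pattern in input.
    public static Match FindFirst(string input, string pattern)
    {
        return Regex.Match(input, pattern);
    }

    public static void Main()
    {
        Match m = FindFirst("Senior Developer", "Develope?r");

        Console.WriteLine(m.Success); // True
        Console.WriteLine(m.Index);   // 7  (zero-based position where the match starts)
        Console.WriteLine(m.Length);  // 9  (number of matched characters)
        Console.WriteLine(m.Value);   // Developer
    }
}
```

When Success is false, Value is an empty string, so it is usually worth checking Success before reading the other properties.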

We can also use the Regex class to split a string. To split on a fixed separator such as a space, the string.Split function is enough. But what if you have to split the text on a pattern, for example on any digit? Regex.Split handles that case, as the code below shows.

            Regex separator = new Regex(@"\d"); // any single digit acts as a separator
            string name = "Vikram1Chaudhary2C#3Developer";
            string[] words = separator.Split(name);
            Console.WriteLine(string.Join(",", words)); // Vikram,Chaudhary,C#,Developer

In the above code, Split returns a string array containing “Vikram”, “Chaudhary”, “C#” and “Developer”, and the code prints these values joined by commas.
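Splitting on a pattern also helps when the separator is whitespace of varying length. The following is only a sketch: the class SplitDemo, the helper SplitOnWhitespace and the input text are assumptions made for this example. It compares string.Split on a single space with Regex.Split on the pattern \s+.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Hypothetical helper: splits on runs of one or more whitespace characters.
    public static string[] SplitOnWhitespace(string text)
    {
        return Regex.Split(text, @"\s+");
    }

    public static void Main()
    {
        string text = "Vikram  Chaudhary C#"; // note the two spaces after "Vikram"

        // string.Split on a single space produces an empty entry for the doubled space.
        string[] bySpace = text.Split(' ');
        Console.WriteLine(string.Join(",", bySpace)); // Vikram,,Chaudhary,C#

        // The pattern \s+ treats the whole run of whitespace as one separator.
        string[] byPattern = SplitOnWhitespace(text);
        Console.WriteLine(string.Join(",", byPattern)); // Vikram,Chaudhary,C#
    }
}
```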

Parsing a Text File

Regular expressions are a very large topic, and it is difficult to give an example for every special character in one article. That is why I will cover most of the concepts in a single example. Here I will parse the log file which we write for our applications for debugging purposes.

One line of the log file could look like the one shown below.

“ERROR 12/6/2016 01:25:15 ---MyModule--- NullReferenceException thrown by MyMethod in MyCLass”

The first part is the error level, which can be one of ERROR, INFO, WARNING etc. The next part is the timestamp, followed by the module name between --- markers and the detailed exception message.

The code to parse one line of the log file is shown below.

            string sampleLine = "ERROR 12/6/2016 01:25:15 ---MyModule--- NullReferenceException thrown by MyMethod in MyCLass";

            Regex pattern = new Regex(@"(?<level>\S+) " +                  // ERROR, INFO, WARNING ...
                @"(?<timestamp>\d{2}/\d{1,2}/\d{4} \d{2}:\d{2}:\d{2}) " +  // date and time
                @"---(?<category>[^-]+)--- " +                             // module name between --- markers
                @"(?<message>.*)");                                        // rest of the line

            Match match = pattern.Match(sampleLine);
            if (match.Success)
            {
                Console.WriteLine(match.Groups["level"].Value);
                Console.WriteLine(match.Groups["timestamp"].Value);
                Console.WriteLine(match.Groups["category"].Value);
                Console.WriteLine(match.Groups["message"].Value);
            }

Here are the parts of the regular expression used above.

  1. (?<name>...)   Captures the enclosed match as a named group.
  2. \S+            Matches a sequence of one or more non-whitespace characters.
  3. \d{2}          Matches exactly two digits.
  4. \d{1,2}        Matches one or two digits.
  5. [^-]+          Matches one or more characters that are not the ‘-’ character.
  6. .*             Matches zero or more occurrences of any character except newline.

The above regular expression gives us four named groups: level, timestamp, category and message. Inside the if (match.Success) block we read the values of these groups, which happens only when the string was parsed successfully.
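A real log file has many lines, and some of them (stack trace lines, for example) will not follow this format. The sketch below applies the same pattern to several lines and skips those that do not match. The class LogParserDemo, the Parse helper, the extra sample lines and the added ^ and $ anchors are my own assumptions for illustration. In a real program the lines could come from File.ReadLines.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LogParserDemo
{
    // Same pattern as above, anchored with ^ and $ so that the whole line must match.
    private static readonly Regex LinePattern = new Regex(
        @"^(?<level>\S+) " +
        @"(?<timestamp>\d{2}/\d{1,2}/\d{4} \d{2}:\d{2}:\d{2}) " +
        @"---(?<category>[^-]+)--- " +
        @"(?<message>.*)$");

    // Hypothetical helper: returns one "level|timestamp|category|message" string
    // for every line that matches the pattern.
    public static List<string> Parse(IEnumerable<string> lines)
    {
        var parsed = new List<string>();
        foreach (string line in lines)
        {
            Match match = LinePattern.Match(line);
            if (!match.Success)
            {
                continue; // not in the expected format, for example a stack trace line
            }

            parsed.Add(string.Join("|",
                match.Groups["level"].Value,
                match.Groups["timestamp"].Value,
                match.Groups["category"].Value,
                match.Groups["message"].Value));
        }
        return parsed;
    }

    public static void Main()
    {
        // Made-up sample lines for this sketch.
        var lines = new[]
        {
            "ERROR 12/6/2016 01:25:15 ---MyModule--- NullReferenceException thrown by MyMethod in MyCLass",
            "   at MyCLass.MyMethod()",
            "INFO 12/6/2016 01:25:16 ---Startup--- Application started"
        };

        foreach (string entry in Parse(lines))
        {
            Console.WriteLine(entry);
        }
    }
}
```

The Regex object is created once and reused for every line, which avoids building the same pattern again on each iteration.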

A Quick Reference for Common Patterns

Please find below a quick reference for the regular expression patterns.

.          Any character except newline.
\.         A period (and so on for \*, \(, \\, etc.).
^          The start of the string.
$          The end of the string.
\d,\w,\s   A digit, word character, or whitespace. For ASCII text \w is [A-Za-z0-9_]; by default .NET also matches Unicode letters and digits.
\D,\W,\S   Anything except a digit, word character, or whitespace.
[abc]      Character a, b, or c.
[a-z]      a through z.
[^abc]     Any character except a, b, or c.
aa|bb      Either aa or bb.
?          Zero or one of the preceding element.
*          Zero or more of the preceding element.
+          One or more of the preceding element.
{n}        Exactly n of the preceding element.
{n,}       n or more of the preceding element.
{m,n}      Between m and n of the preceding element.
(expr)     Capture expr for use with \1, etc.
(?:expr)   Non-capturing group.
(?=expr)   Followed by expr.
(?!expr)   Not followed by expr.
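A few of the less obvious entries in this list are easier to understand with a small example. The sketch below tries alternation, a backreference, a non-capturing group and the two lookaheads. The class QuickReferenceDemo, the helper FirstMatch and all input strings are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuickReferenceDemo
{
    // Hypothetical helper: returns the text of the first match, or "" when nothing matches.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        // aa|bb : either alternative is accepted.
        Console.WriteLine(Regex.IsMatch("WARNING", "^(ERROR|WARNING)$")); // True

        // (expr) with \1 : \1 must repeat whatever the first group captured.
        Console.WriteLine(FirstMatch("this is is a test", @"\b(\w+) \1\b")); // is is

        // (?:expr) : groups without capturing, so only group 0 (the whole match) exists.
        Console.WriteLine(Regex.Match("ababab", "(?:ab)+").Groups.Count); // 1

        // (?=expr) : digits only when they are followed by "em"; "em" is not part of the match.
        Console.WriteLine(FirstMatch("100px 200em", @"\d+(?=em)")); // 200

        // (?!expr) : "foo." only when it is not followed by "txt".
        Console.WriteLine(FirstMatch("foo.txt foo.log", @"foo\.(?!txt)\w+")); // foo.log
    }
}
```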

Conclusion

In this article I have tried to cover the basics of regular expressions. We have also seen how to use a regular expression to split a string and to parse a text file.
