My Command Line Text Processing repository includes a chapter on GNU grep, which has been edited and expanded to create this book. GNU grep is one of the text creation and manipulation commands provided by GNU, and it comes by default on GNU/Linux.
Frequently used options
Simple string search
If your input file uses \r\n (carriage return and newline characters) as the line terminator, convert the input file to Unix style before processing.
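As a minimal sketch of why the line terminator matters, the commands below search the same made-up sample before and after stripping carriage returns with tr. The sample text and variable names are assumptions for illustration only.

```shell
# Hypothetical sample with DOS-style \r\n line endings.
dos_input='hi there\r\nbye\r\n'

# The \r sits just before the newline, so the end-of-line anchor does not match.
with_cr=$(printf "$dos_input" | grep -c 'there$' || true)

# Remove carriage returns first, then search again.
without_cr=$(printf "$dos_input" | tr -d '\r' | grep -c 'there$' || true)

echo "matching lines before conversion: $with_cr, after: $without_cr"
```

Tools such as dos2unix can do the same conversion in place; tr is used here only because it is widely available.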
Fixed string search
Case insensitive search
Invert matching lines
Line number and count
The output given by -c is the total number of lines that match the given patterns, not the total number of matches. Use the -o option and pipe the output to wc -l to get total matches (example shown later).
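The difference between counting lines and counting matches can be sketched as follows. The sample text is made up for illustration.

```shell
# Hypothetical sample: the first line contains 'car' twice.
sample='car bat car\npot\nscar\n'

# -c counts matching lines.
line_count=$(printf "$sample" | grep -c 'car')

# -o prints each match on its own line, so wc -l counts matches.
match_count=$(printf "$sample" | grep -o 'car' | wc -l)

echo "lines: $line_count, matches: $match_count"
```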
Limiting output lines
Multiple search strings
Filename instead of matching lines
Filename prefix for matching lines
In the first case, in is highlighted even after piping, while in the second case, in is not highlighted. In practice, always (as in --color=always) is rarely used, as it adds extra information to the matching lines and can cause undesirable results when processing such lines.
Match whole word or line
Another useful option is -x which will only display a line if the entire line conforms to the given pattern.
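A small sketch comparing whole-word matching (-w) with whole-line matching (-x), using made-up input:

```shell
# Hypothetical sample lines.
words='par\nspar\npar 3\napparent\n'

# -w: 'par' must appear as a whole word somewhere in the line.
whole_word=$(printf "$words" | grep -w 'par')

# -x: the entire line must be exactly 'par'.
whole_line=$(printf "$words" | grep -x 'par')

printf 'with -w:\n%s\nwith -x:\n%s\n' "$whole_word" "$whole_line"
```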
Comparing lines between files
See also stackoverflow: Fastest way to find lines from one text file in another larger text file — go through all the answers.
Extract only matching portion
The professor and.

d) Display the first three lines containing Count.

..
city named by Count Dracula, is a fairly well-known place.
"I am writing at the request of Mr. Jonathan Harker, who himself is not a strong junior partner of the important firm of Hawkins & Harker; and so, like you..

f) Display lines containing Zooelogical Gardens along with the line number prefix.

$ grep ##### add your solution here
BRE/ERE Regular Expressions
A cool use case for alternation is to combine line anchors to display the entire input file but highlight only the required search patterns. Standalone line anchors will match every input line, even empty lines, since they are positional markers.
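The anchor-plus-alternation trick can be sketched as below, with made-up input that includes an empty line. Highlighting is only visible on a terminal, so the sketch checks line counts instead.

```shell
# Hypothetical input with an empty second line.
text='one cat\n\ntwo dogs\n'

# '^' matches every line, so all lines are shown; only 'cat' would be highlighted.
shown=$(printf "$text" | grep --color=auto -E '^|cat' | wc -l)

# Without the anchor, only lines containing 'cat' are shown.
matched=$(printf "$text" | grep -cE 'cat')

echo "lines shown with '^|cat': $shown, with 'cat' alone: $matched"
```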
Matching characters like tabs
The dot meta character
If you are using a Unix-like distribution, you will probably have the /usr/share/dict/words dictionary file. To allow matching in any order, you will also need to bring in alternation.
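As a sketch of matching two terms in any order, the pattern below lists both orderings with alternation. The input lines are made up for illustration.

```shell
# Hypothetical input lines.
lines='cat and dog\ndog then cat\nonly cat\n'

# A single ordering misses lines where 'dog' comes first.
fixed_order=$(printf "$lines" | grep -cE 'cat.*dog')

# Alternation covers both orderings.
any_order=$(printf "$lines" | grep -cE 'cat.*dog|dog.*cat')

echo "fixed order: $fixed_order, any order: $any_order"
```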
Longest match wins
If .* had consumed everything after Error, there would be no more characters left to match the rest of the pattern. So, among the varying number of characters that .* can match, the longest portion that satisfies the overall regular expression is selected.
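A short sketch of the longest-match behavior, using a made-up log line in which valid appears twice:

```shell
# Hypothetical log line.
log='Error: disk full; retry valid; cache valid; done'

# .* gives back just enough characters for 'valid' to match,
# so the match extends up to the last 'valid' in the line.
greedy=$(printf '%s\n' "$log" | grep -o 'Error.*valid')

echo "$greedy"
```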
Metacharacters outside character classes such as ^, $, () etc. either have no special meaning or have a completely different meaning within character classes. The next metacharacter is ^, which must be specified as the first character of the character class, where it negates the set of characters.
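The two roles of ^ inside a character class can be sketched as below. The sample strings are assumptions for illustration.

```shell
# ^ as the first character negates the class: match runs of non-comma characters.
fields=$(printf 'foo,bar,baz\n' | grep -oE '[^,]+')

# ^ placed anywhere else in the class is a literal character.
literal=$(printf 'a^b+c\n' | grep -o '[b^]')

printf 'fields:\n%s\nliteral matches:\n%s\n' "$fields" "$literal"
```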
A good understanding of regular expressions is not only important for using grep effectively, but also comes in handy when using regular expressions in other tools like sed and awk and in programming languages like Python and Ruby. The core concepts are likely to be the same across these tools, and a handy reference sheet would go a long way in reducing misuse.

a) Extract all () pairs with/without text in them, provided they don't contain () characters inside.

b) For the given input, match all lines starting with den or ending with ly.

.. lines='lovely\n1 dentist\n2 lonely\nden\nodletel\ndent'
..

c) Extract all whole words containing 42, but not at the edge of words.

d) Match all lines that contain car, but not as a whole word.

$ printf 'car\nscar\ncare\npot\nscare\n' | grep ##### add your solution here
scar
..
than to frighten.

e) For the dracula.txt file, count the total number of lines that contain removed or dormant or received or answered or rejected or retracted as whole words.

f) Extract words starting with s and containing e and t in any order.
The sample input shown below may help to understand the differences, if any.

$ printf 'known\nmood\nknow\npony\ninns\n'
known
..

l) Display all lines starting with hand and ending with no further characters or s or y or le.
The -C option can be used instead of specifying both -A and -B if the required number of lines is the same. There is no error or warning if the count exceeds the lines available for any of these three options.
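A minimal sketch of the three context options, using the numbers 1 to 9 as stand-in input:

```shell
# Same number of lines before and after, written out with -A and -B.
both=$(seq 1 9 | grep -A1 -B1 '5')

# Equivalent shorthand with -C.
ctx=$(seq 1 9 | grep -C1 '5')

# A count larger than the available lines simply shows everything.
large=$(seq 1 9 | grep -C50 '5' | wc -l)

printf 'with -C1:\n%s\nlines shown with -C50: %s\n' "$ctx" "$large"
```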
Context matching got its own chapter mainly because of corner cases and customization options (on Ubuntu, man grep doesn't list them at all).

Exercises

a) For this question, create the exercises/context_matching directory and then save this file from the learn_gnugrep_ripgrep repo as palindrome.py.
Customize search path
Using find command
Options not covered
Perl Compatible Regular Expressions
BRE/ERE vs PCRE subtle differences
Subexpression calls can be used instead of a backreference to reuse the pattern itself rather than the matched text (similar to function calls in programming).
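The difference from a backreference can be sketched as follows, assuming a grep build with PCRE support (the -P option). The date strings are made up for illustration.

```shell
# Hypothetical input: pairs of dates separated by a comma.
dates='2008-03-24,2012-08-12 2017-06-27,2018-03-25 1999-12-23,bad'

# (?1) re-applies the pattern of the first group, so the second date may differ.
pairs=$(printf '%s\n' "$dates" | grep -oP '(\d{4}-\d{2}-\d{2}),(?1)')

# \1 requires the exact same text again, so nothing matches here.
same_text=$(printf '%s\n' "$dates" | grep -oP '(\d{4}-\d{2}-\d{2}),\1' || true)

printf 'subexpression call:\n%s\nbackreference: [%s]\n' "$pairs" "$same_text"
```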
Variable length lookbehind
Some of the variable-length positive lookbehind cases can be simulated by using \K as a suffix to the pattern needed as the lookbehind assertion. Variable-length negative lookbehind can be simulated by using negative lookahead (which has no variable-length limit) within a group and applying a quantifier to match characters one by one.
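As a sketch of the \K approach, the command below extracts letters that follow = and a variable number of digits. It assumes a grep build with PCRE support (-P); the input is made up.

```shell
# Hypothetical key=value style input.
kv='price=42abc cost=7xy tag=z'

# Everything matched before \K is required but dropped from the reported match.
suffixes=$(printf '%s\n' "$kv" | grep -oP '=\d+\K[a-z]+')

echo "$suffixes"
```

Unlike a true lookbehind, the text before \K is consumed, so it cannot overlap with a previous match.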
It also shows how grouping can be negated in certain cases.

match 'dog' only if it is not preceded by 'cat' anywhere in the line
note the use of the anchor to force matching all characters up to 'dog'
match 'dog' only if it is not preceded by 'parrot' anywhere in the line
extract the matching portion to better understand negated grouping

x: readable pattern with whitespace and comments

This will override the modifiers applied to the entire pattern, if any.

(?modifiers:pattern) will apply modifiers to this section only
(?-modifiers:pattern) will negate modifiers for this section only
(?modifiers-modifiers:pattern) will apply and negate specific modifiers only for this section
(?modifiers) when pattern is not given, modifiers (including negation) will be applied from this point onwards
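A rough sketch of the ideas above, assuming a grep build with PCRE support (-P). The first two commands match 'dog' only when a given word does not occur before it in the line; the last one applies the i modifier to a single character. The input strings are assumptions.

```shell
line='fox,cat,dog,parrot'

# 'cat' occurs before 'dog', so the anchored pattern cannot reach 'dog'.
no_cat=$(printf '%s\n' "$line" | grep -cP '^((?!cat).)*dog' || true)

# 'parrot' occurs only after 'dog', so the match succeeds.
no_parrot=$(printf '%s\n' "$line" | grep -oP '^((?!parrot).)*dog')

# Case-insensitive matching for 'c' only; 'at' must be lowercase.
mixed=$(printf 'Cat cAt CAT\n' | grep -oP '(?i:c)at')

echo "no_cat=$no_cat no_parrot=$no_parrot mixed=$mixed"
```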
\Q and \E
Next, match a set of brackets that can optionally contain any number of non-nested sets of brackets (called level-two for reference). Within that, a valid string consists of either non-bracket characters or a non-nested bracket sequence. That is, to recursively match any number of nested sets of brackets, use a subexpression call within the capture group itself.
Since the entire pattern needs to be called here, you can use the default zeroth capture group.
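Recursive matching with the zeroth group can be sketched as below, assuming a grep build with PCRE support (-P). The expression string is made up for illustration.

```shell
# Hypothetical input with one, two and three levels of nesting.
expr='a + (b) + ((c)) + (((d)))'

# (?0) calls the whole pattern again, so any nesting depth is handled.
nested=$(printf '%s\n' "$expr" | grep -oP '\((?:[^()]++|(?0))++\)')

echo "$nested"
```

The possessive quantifiers (++) are a design choice to avoid needless backtracking; the pattern as written does not match an empty () pair.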
In addition to command-line tools such as GNU grep, pcregrep and ripgrep, PCRE is also used in programming languages. However, I feel like I've covered most of the features that can come up for command-line use with grep.

a) Filter all lines that meet all these rules:
.. should have at least two alphabets
.. should not end with a whitespace character

The problem statement is to find a repeated sequence of word characters where the second occurrence is matched just before a newline character.

.. correct the command to get expected output as shown
..

e) Extract all whole words except those starting with p or e or n.
Gotchas and Tricks
Patterns starting with hyphen
Word boundary differences
Faster execution for ASCII input
Speed benefits with PCRE
The vim editor has a -q option that makes it easy to edit matching lines from the rg output if it has both a line number and a filename prefix. Use the --vimgrep option instead of -Hn to allow vim to place the cursor at the beginning of the match instead of the beginning of the line. In practice, always is rarely used (for example, when piping results to less -R), as it adds extra information to the matching lines and can cause unwanted results when processing such lines.
The separator -- is not added if two or more groups of matching lines have overlapping lines or are next to each other in the input file.
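The separator behavior can be sketched with the numbers 1 to 9 as stand-in input. This sketch uses GNU grep rather than rg, which is an assumption made for illustration.

```shell
# Groups far apart: a '--' line separates them.
far_apart=$(seq 1 9 | grep -A1 '[28]')

# Groups next to each other (2,3 and 4,5): no separator is printed.
adjacent=$(seq 1 9 | grep -A1 '[24]')

printf 'far apart:\n%s\nadjacent:\n%s\n' "$far_apart" "$adjacent"
```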
Depending on the implementation of the regular expression engine, either the longest match will be chosen from among all valid matches generated with a varying number of characters for .*, or the engine will backtrack character by character from the end of the string until the pattern is satisfied or fails. Character class metacharacters can be matched literally by specific placement or by using \ to escape them. Similar to variables in programming languages, the string captured by () can be referenced later using backreferences.
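As a small sketch of a backreference, the command below finds lines with the same character repeated consecutively. It uses GNU grep with -E, which is an assumption made for illustration; the input words are made up.

```shell
# (.) captures one character and \1 must match that same character again.
doubled=$(printf 'effort\nflee\nfacade\nhello world\n' | grep -E '(.)\1')

echo "$doubled"
```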
This is a workaround for some cases where lookarounds are needed.

Extract the 3rd occurrence of 'cat' followed by optional lowercase letters.
To use Rust regex normally and switch to PCRE2 if necessary, use the --engine=auto option.
This section will cover the feature most attractive to users: the default recursive behavior and the options to customize it. For those new to recursive searching with rg, the --files option, which lists the files being searched, can be useful. The presence of a .git directory (either in the current directory or in parent directories) will mark .gitignore to be used for ignoring files.
The --no-require-git flag was recently added to avoid the need for an empty .git directory.
Many factors such as file size, file encoding, line size, sparse or dense matches, hardware features, etc. will affect performance. rg has options like -j, --dfa-size-limit and --mmap to adjust performance in some cases.

See the Frequently used options: exercises section for details on the input file used here.

.. assumes 'exercises/freq_options' as CWD
..
$ rg ##### add your solution here
2559

b) Commands like sed and perl require special attention if you need to literally find and replace text. rg offers an easier alternative, as can be seen from these exercises.

..
b&6 c

c) Create the exercises/ripgrep directory and then save this file from the learn_gnugrep_ripgrep repo as sample.md.

Enjoy learning Ruby.

d) Sum all integers (floating-point numbers should be ignored) if the file also contains the string is.

.. assumes 'exercises/ripgrep' as CWD
.. which already has one file named 'sample.md'
.. create two more files with these commands
.. all three files should be treated as input
$ rg ##### add your solution here | datamash sum 1
61

e) The default behavior changes depending on whether the output is destined for the terminal or not.
Search for good way or bye in all the files in the given directory and save the output in the out.txt file.

.. assumes 'exercises/ripgrep' as CWD