GNU GREP and RIPGREP

Nguyễn Gia Hào

Academic year: 2023

My Command Line Text Processing repository contains a chapter on GNU grep that has been edited and expanded to create this book. GNU grep is part of the text creation and manipulation commands provided by GNU and is available by default on GNU/Linux.

Frequently used options

Simple string search

If your input file uses \r\n (carriage return and newline characters) as the line terminator, convert the input file to Unix style before processing.
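As a minimal sketch of why the conversion matters, the commands below use a made-up two-line input with \r\n endings and `tr -d '\r'` as one possible way to convert it (other tools such as dos2unix would also work). The variable names are only for illustration.

```shell
# Hypothetical input with Windows-style line endings
# The $ anchor does not match before the trailing \r, so nothing is found
crlf_count=$(printf 'hello\r\nworld\r\n' | grep -c 'hello$' || true)

# After stripping carriage returns, the same pattern matches
unix_count=$(printf 'hello\r\nworld\r\n' | tr -d '\r' | grep -c 'hello$')

echo "with CRLF: $crlf_count, after conversion: $unix_count"
```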

Fixed string search

Case insensitive search

Invert matching lines

Line number and count

The output given by -c is the total number of lines that match the given patterns, not the total number of matches. Use the -o option and pipe the output to wc -l to get total matches (example shown later).
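The difference between counting lines and counting matches can be sketched with a small made-up input, where one line contains the search string twice:

```shell
# 'banana' contains 'an' twice, 'apple' does not contain it
line_count=$(printf 'banana\napple\n' | grep -c 'an')

# -o prints every match on its own line, so wc -l counts matches
match_count=$(printf 'banana\napple\n' | grep -o 'an' | wc -l)

echo "matching lines: $line_count, total matches: $match_count"
```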

Limiting output lines

Multiple search strings

Filename instead of matching lines

Filename prefix for matching lines

Colored output

In the first case, in is highlighted even after the pipe, while in the second case, in is not highlighted. In practice, always is rarely used, since extra information is added to the matching lines and this can cause undesirable results when such lines are processed further.
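The extra information is the set of terminal escape sequences used for coloring. A rough sketch, assuming the two cases refer to `--color=always` and `--color=auto` and using a made-up input line, checks whether the escape character survives a pipe:

```shell
# The escape character that starts every color sequence
esc=$(printf '\033')

# --color=always keeps the escape sequences even when output goes to a pipe
always_lines=$(printf 'sea in\n' | grep --color=always 'in' | grep -c "$esc" || true)

# --color=auto drops them when the output is not a terminal
auto_lines=$(printf 'sea in\n' | grep --color=auto 'in' | grep -c "$esc" || true)

echo "lines with escape sequences: always=$always_lines auto=$auto_lines"
```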

Match whole word or line

Another useful option is -x which will only display a line if the entire line conforms to the given pattern.
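A short sketch contrasting whole word matching (`-w`) with whole line matching (`-x`), using made-up sample lines:

```shell
sample='par\nspar\npar time\n'

# -w matches 'par' as a whole word: 'par' and 'par time', but not 'spar'
word_matches=$(printf "$sample" | grep -cw 'par')

# -x requires the entire line to be 'par'
line_matches=$(printf "$sample" | grep -cx 'par')

echo "whole word: $word_matches, whole line: $line_matches"
```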

Comparing lines between files

See also stackoverflow: Fastest way to find lines from one text file in another larger text file (go through all the answers).
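One common way to compare lines between two files combines `-F`, `-x` and `-f`. The sketch below is only an illustration; the file names and contents are hypothetical and are created in a temporary directory.

```shell
# Hypothetical sample files
tmpdir=$(mktemp -d)
printf 'Blue\nBrown\nPurple\nRed\n' > "$tmpdir/colors_1.txt"
printf 'Black\nBlue\nGreen\nRed\n' > "$tmpdir/colors_2.txt"

# -F fixed strings, -x whole line match, -f read search strings from a file
common=$(grep -Fxf "$tmpdir/colors_1.txt" "$tmpdir/colors_2.txt")

# adding -v gives lines of the second file that are absent from the first
only_in_2=$(grep -vFxf "$tmpdir/colors_1.txt" "$tmpdir/colors_2.txt")

printf 'common:\n%s\nonly in colors_2.txt:\n%s\n' "$common" "$only_in_2"
rm -r "$tmpdir"
```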

Extract only matching portion

Summary

Exercises

The professor and.
d) Display the first three lines containing Count.
.. city named by Count Dracula, is a fairly well-known place.
"I am writing at the request of Mr. Jonathan Harker, who himself is not a strong junior partner of the important firm of Hawkins & Harker; and so, like you..
f) Display lines containing Zooelogical Gardens along with the line number prefix.
$ grep ##### add your solution here

BRE/ERE Regular Expressions

Line Anchors

Word Anchors

Alternation

A cool use case for alternation is to combine line anchors to display the entire input file but highlight only the required search patterns. Standalone line anchors will match every input line, even empty lines, since they are positional markers.
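A small sketch of this trick with a made-up four-line input (one line is empty). Counting is used here so the result is easy to check; replacing `-c` with `--color=always` would show the highlighting.

```shell
sample='sea\nin\n\nmain\n'

# a standalone line anchor matches every line, including the empty one
all_lines=$(printf "$sample" | grep -c '^')

# searching for 'in' alone selects only two lines
in_lines=$(printf "$sample" | grep -c 'in')

# alternation with ^ selects every line, while only 'in' would be highlighted
combined=$(printf "$sample" | grep -cE '^|in')

echo "all=$all_lines in=$in_lines combined=$combined"
```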

Grouping

Escaping metacharacters

Matching characters like tabs

The dot meta character

Quantifiers

If you are using a Unix-like distribution, you will probably have the /usr/share/dict/words dictionary file. To allow matching in any order, you will also need to bring in alternation.
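Matching two characters in any order can be sketched as follows. A short made-up word list is used instead of the dictionary file so that the result does not depend on the system.

```shell
# 'e.*t' covers e before t, 't.*e' covers t before e
any_order=$(printf 'set\ntea\nsee\ntoe\n' | grep -E 'e.*t|t.*e')

# 'see' is excluded because it has no 't'
printf '%s\n' "$any_order"
```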

Longest match wins

If .* had consumed everything after Error, there would be no more characters left to match the rest of the pattern. So, among the varying number of characters that .* could match, the longest portion that satisfies the overall regular expression is selected.
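The behavior can be observed with `-o`, which prints only the matched portion. The input line below is made up for illustration.

```shell
line='Error: not a valid input'

# .* gives back just enough characters for 'valid' to match
partial=$(echo "$line" | grep -o 'Error.*valid')

# with nothing after .*, everything up to the end of the line is consumed
full=$(echo "$line" | grep -o 'Error.*')

printf '%s\n%s\n' "$partial" "$full"
```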

Character classes

Metacharacters outside character classes such as ^, $, () etc. either have no special meaning or have a completely different meaning inside character classes. The next metacharacter is ^, which must be specified as the first character of the character class.
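A brief sketch of both roles of ^ inside a character class, using made-up inputs: it is literal when it is not the first character, and it inverts the class when it is.

```shell
# ^ is not first here, so it is matched literally along with +
literal=$(echo 'f*(a^b) - 3*(a+b)' | grep -o 'a[+^]b')

# ^ as the first character inverts the class: anything except digits
non_digits=$(echo 'ab12' | grep -oE '[^0-9]+')

printf '%s\n%s\n' "$literal" "$non_digits"
```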

Backreferences

Known Bugs

A very good knowledge of regular expressions is not only important for using grep effectively, but also useful when using regular expressions in other tools like sed and awk and programming languages like Python and Ruby. However, the core concepts are likely to be the same, and a handy reference sheet would go a long way in reducing misuse.
a) Extract all () pairs with/without text in them, provided they don't contain () characters inside.
b) For the given input, match all lines starting with den or ending with ly.
.. lines='lovely\n1 dentist\n2 lonely\nden\nodletel\ndent'
..
c) Extract all whole words containing 42, but not at the edge of words.
Match all lines that contain car, but not as a whole word.
printf 'car\nscar\ncare\npot\nscare\n' | grep ##### add your solution here
scar
..
e) For the dracula.txt file, count the total number of lines that contain removed or dormant or received or answered or rejected or retracted as whole words.
f) Extract words starting with s and containing e and t in any order.

The sample input shown below may help to understand the differences, if any.
printf 'known\nmood\nknow\npony\ninns\n'
known
..
l) Display all lines starting with hand and ending with no further characters or s or y or le.

Context matching

This option can be used instead of specifying both -A and -B if the required number of lines is the same. No error or warning if the count exceeds the lines available for any of these three options.
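A minimal sketch of context options on a made-up five-line input, including a count that exceeds the available lines:

```shell
# -C1 is the same as -A1 -B1: one line before and one line after the match
around=$(printf '1\n2\n3\n4\n5\n' | grep -C1 '3')

# only one line follows '4', and asking for ten is not an error
tail_end=$(printf '1\n2\n3\n4\n5\n' | grep -A10 '4')

printf '%s\n---\n%s\n' "$around" "$tail_end"
```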

Contiguous matches

Customizing separators

Context matching got its own chapter mainly because of corner cases and customization options (on Ubuntu, man grep doesn't list them at all).
Exercises
a) For this question, create the exercises/context_matching directory and then save this file from the learn_gnugrep_ripgrep repo as palindrome.py.

Recursive search

Customize search path

Extended globs

Using find command

Piping filenames

Miscellaneous options

Scripting options

Multiline matching

Byte offset

Options not covered

Perl Compatible Regular Expressions

BRE/ERE vs PCRE subtle differences

String anchors

Escape sequences

Non-greedy quantifiers

Possessive quantifiers

Grouping variants

Lookarounds

Subexpression calls can be used instead of a backreference to reuse the pattern itself (similar to function calls in programming).

Variable length lookbehind

Some of the variable-length positive lookbehind cases can be simulated by using \K as a suffix to the pattern needed as the lookbehind assertion. Variable-length negative lookbehind can be simulated by using negative lookahead (which has no variable-length limit) within a group and applying a quantifier to match characters one by one.
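Both simulations can be sketched with `grep -P`, assuming a GNU grep build with PCRE support. The input strings are made up for illustration.

```shell
# \K discards the text matched so far, so [:=]\d+ acts like a
# variable-length positive lookbehind
after_digits=$(echo '=314not :,2irk ,:3cool =42,error' | grep -oP '[:=]\d+\K[a-z]+')

# negative lookahead applied one character at a time from the start of line:
# 'dog' matches only if 'cat' does not occur anywhere before it
no_cat_before=$(printf 'fox,cat,dog,parrot\ndog,cat\nmy dog\n' | grep -cP '^((?!cat).)*dog')

printf '%s\nlines without cat before dog: %s\n' "$after_digits" "$no_cat_before"
```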

Modifiers

It also shows how grouping can be negated in certain cases.
matches 'dog' only if it is not preceded by 'cat' anywhere in the line
note the use of anchor to force matching all characters up to 'dog'
matches 'dog' only if it is preceded by 'parrot' anywhere in the line
extract matching portion to better understand negated grouping
x: readable pattern with whitespace and comments
This will override the modifiers applied to the entire pattern, if any.
(?modifiers:pattern) will apply modifiers for this portion only
(?-modifiers:pattern) will negate modifiers for this portion only
(?modifiers-modifiers:pattern) will apply and negate particular modifiers for this portion only
(?modifiers) when pattern is not given, modifiers (including negation) will be applied from this point onwards
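As a rough illustration of scoped modifiers, the sketch below uses the i (case insensitive) modifier with `grep -P`, assuming a GNU grep build with PCRE support. The sample lines are made up.

```shell
sample='Cat\ncAT\ncat\n'

# (?i) applies from that point onwards, so all three lines match
whole_pattern=$(printf "$sample" | grep -cP '(?i)cat')

# (?i:c) ignores case only for 'c', so 'cAT' does not match
first_char_only=$(printf "$sample" | grep -cP '(?i:c)at')

echo "whole pattern: $whole_pattern, first character only: $first_char_only"
```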

\Q and \E

\G anchor

Skipping matches

Recursive matching

Then match a set of brackets that can optionally contain any number of non-nested sets of brackets (called level-two for reference). Within that, a valid string consists of either non-bracket characters or a non-nested bracket sequence. To recursively match any number of nested sets of brackets, use a subexpression call within the capturing group itself.

Since the entire pattern needs to be called here, you can use the default zeroth capture group.
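A minimal sketch of a recursive pattern using (?0) to call the entire pattern, assuming a GNU grep build with PCRE support. The input is made up and the exact pattern used in the book may differ.

```shell
# inside a pair of brackets: either a run of non-bracket characters
# or a nested bracket pair matched by calling the whole pattern again
nested=$(echo 'a(b(c)d)e (f)' | grep -oP '\((?:[^()]++|(?0))*\)')

printf '%s\n' "$nested"
```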

Unicode

PCRE is used not only in command-line tools such as GNU grep, pcregrep, and ripgrep, but also in programming languages. However, I feel like I've covered most of the features that can come up for command line use with grep.
a) Filter all lines that meet all these rules:
.. should have at least two alphabets
.. should not end with a whitespace character
The problem statement is to find a sequence of repeated word characters where the second occurrence matches just before a newline character.
.. correct the command to get the expected output as shown
e) Extract all whole words except those starting with p or e or n.

Gotchas and Tricks

Shell quoting

Patterns starting with hyphen

Word boundary differences

Faster execution for ASCII input

Speed benefits with PCRE

Parallel execution

Installation

Default behavior

Options overview

The vim editor has a -q option that makes it easy to edit matching lines from the rg output if it has both a line number and a filename prefix. Use the --vimgrep option instead of -Hn to allow vim to place the cursor at the beginning of the match instead of the beginning of the line. In practice, always is rarely used (for example: piping results to less -R) as it adds extra information to matching lines and can cause unwanted results when processing such lines.

The separator -- is not added if two or more groups of matching lines have overlapping lines or are next to each other in the input file.
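The separator rule can be sketched with two made-up inputs. GNU grep is used here as an assumption for illustration, since it follows the same rule for context groups.

```shell
# the two groups (lines 1-2 and lines 3-4) are next to each other: no separator
adjacent=$(printf 'x\na\nx\nb\n' | grep -A1 'x')

# line 3 lies between the two groups, so -- is printed between them
apart=$(printf 'x\na\nb\nx\nc\n' | grep -A1 'x')

printf '%s\n\n%s\n' "$adjacent" "$apart"
```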

Rust Regex

Depending on the implementation of the regular expression engine, either the longest match will be chosen from among all valid matches generated with a varying number of characters for .*, or the engine will backtrack character by character from the end of the string until the pattern is satisfied or fails. Character class metacharacters can be matched literally by placing them at a specific position or by using \ to escape them. Similar to variables in programming languages, the string captured by () can be referred to later using backreferences.

This is a workaround for some cases where lookarounds are needed.
extract the 3rd occurrence of 'cat' followed by optional lowercase letters

PCRE2

To use Rust regex normally and switch to PCRE2 if necessary, use the --engine=auto option.

Recursive options

This section will cover the feature most attractive to users: the default recursive behavior and the options to customize it. For beginners to recursive searching using rg, the --files option can be useful to get a list of the files being searched. The presence of a .git directory (either in the current directory or in parent directories) causes .gitignore to be used for ignoring files.

The --no-require-git flag was recently added to avoid the need for an empty .git directory.

Speed comparison

Many factors such as file size, file encoding, line size, sparse or dense matches, hardware features, etc. will affect performance. rg has options like -j, --dfa-size-limit and --mmap to tune performance in some cases.

See the Frequently used options: exercises section for details on the input file used here.
.. assumes 'exercises/freq_options' as CWD
.. rg ##### add your solution here
2559
b) Commands like sed and perl require special attention if you need to search and replace text literally. rg offers an easier alternative, as can be seen from these exercises.
.. b&6 c
c) Create the exercises/ripgrep directory and then save this file from the learn_gnugrep_ripgrep repo as sample.md.
Enjoy learning Ruby.
d) Sum all integers (floating-point numbers should be ignored) if the file also contains the string is.
.. assumes 'exercises/ripgrep' as CWD
.. which already has one file named 'sample.md'
.. create two more files with these commands
.. all three files should be treated as input
.. rg ##### add your solution here | datamash sum 1
61
e) The default behavior changes depending on whether the output is destined for the terminal or not.

Search for good way or bye in all the files in the given directory and save the output in the out.txt file.
.. assumes 'exercises/ripgrep' as CWD

Further Reading
