Regular Expressions and Their Applications


 2018/11/30

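A regular expression is a search pattern (pattern template) for operating on strings. It can also be seen as a symbolic notation for recognizing patterns in text, and it is used for searching, editing, and similar text operations. The following is a regular expression:

^[0-9]

It defines a pattern that matches a digit at the start of a line. (With the caret inside the brackets, `[^0-9]` matches any non-digit character instead.) Regular expressions are widely used in the shell (sed/gawk), Python, Java, and many other languages.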
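The position of the caret changes the meaning of a pattern, which is easy to get wrong. The following is a minimal Java sketch of the difference; the class name `CaretDemo`, the helper `found`, and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

public class CaretDemo {
    // True if the regex matches somewhere in the input (Matcher.find semantics).
    public static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Caret outside the brackets: anchor to the start of the input.
        System.out.println(found("^[0-9]", "7 days"));  // true, starts with a digit
        System.out.println(found("^[0-9]", "day 7"));   // false, digit is not at the start

        // Caret inside the brackets: negated character class.
        System.out.println(found("[^0-9]", "2018"));    // false, only digits present
        System.out.println(found("[^0-9]", "2018/11")); // true, '/' is a non-digit
    }
}
```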

Regular Expression Rules

Character Matchers

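The table below lists commonly used character matchers:

Table 1 Character matchers

Regular expression	Description
`.`	Matches any single character (by default, excluding line breaks)
`^`	Start of a line
`$`	End of a line
`[]`	Matches any one of the characters inside the brackets, e.g. `[ch]` matches `c` or `h`
`()`	Groups part of the pattern and captures the matched substring; `ma(tri)?x` matches `max` or `matrix`
`|`	Alternation; `Oct (1st|2nd)` matches `Oct 1st` or `Oct 2nd`
`\`	Escape character; makes a special character literal, e.g. `a\.b` matches `a.b` but not `ajb`
`^regex`	Matches regex at the start of a line
`regex$`	Matches regex at the end of a line
`[abc]`	Matches a, b, or c
`[abc][vz]`	Matches a, b, or c followed by v or z
`[^abc]`	Matches any character except a, b, or c
`[a-d]`	Matches one character in the range a to d
`[0-8]`	Matches one digit in the range 0 to 8
`XZ`	Matches X immediately followed by Z
`X|Z`	Matches X or Z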
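A few of the rows in Table 1 can be checked directly in Java. This is only a sketch: the class `CharMatcherDemo` and the helper `full` are hypothetical names, and `String.matches` is used because it requires the whole string to match the pattern.

```java
public class CharMatcherDemo {
    // True only if the entire input matches the regex.
    public static boolean full(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(full("ma(tri)?x", "max"));         // true, optional group absent
        System.out.println(full("ma(tri)?x", "matrix"));      // true, optional group present
        System.out.println(full("Oct (1st|2nd)", "Oct 2nd")); // true, alternation
        System.out.println(full("a\\.b", "a.b"));             // true, escaped dot is literal
        System.out.println(full("a\\.b", "ajb"));             // false
        System.out.println(full("[abc][vz]", "bz"));          // true
        System.out.println(full("[^abc]", "d"));              // true, negated class
    }
}
```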

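Metacharacters

To keep expressions short, regular expressions provide several metacharacters:

Table 2 Metacharacters

Regular expression	Description
`\d`	Any digit; shorthand for `[0-9]`
`\D`	Any non-digit; shorthand for `[^0-9]`
`\s`	Whitespace character; shorthand for `[ \t\n\x0b\r\f]`
`\S`	Non-whitespace character; shorthand for `[^\s]`
`\w`	Word character; shorthand for `[a-zA-Z_0-9]`
`\W`	Non-word character; shorthand for `[^\w]`
`\S+`	One or more non-whitespace characters
`\b`	Word boundary, where a word character is `[a-zA-Z0-9_]`
`\r`	Carriage return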
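To make the metacharacters concrete, here is a small Java sketch that collects every match of a pattern. The class `MetaCharDemo`, the helper `findAll`, and the sample strings are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharDemo {
    // Returns every non-overlapping match of the regex, scanning left to right.
    public static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Order 66 shipped on 2018-11-30"; // hypothetical sample
        System.out.println(findAll("\\d+", text)); // [66, 2018, 11, 30]
        System.out.println(findAll("\\S+", text)); // [Order, 66, shipped, on, 2018-11-30]
        System.out.println(findAll("\\w+", "a_b c-d")); // [a_b, c, d]

        // \b only matches where a word character meets a non-word character,
        // so the "cat" inside "concat" is skipped.
        System.out.println(findAll("cat", "cat concat cat.").size());       // 3
        System.out.println(findAll("\\bcat\\b", "cat concat cat.").size()); // 2
    }
}
```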

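Quantifiers

Regular expressions also provide quantifiers, which specify how many times an element may occur. The main ones are:

Table 3 Quantifiers

Regular expression	Description	Example
`*`	Occurs zero or more times; same as `{0,}`	`x*` matches zero or more occurrences of x; `.*` matches any string
`+`	Occurs one or more times; same as `{1,}`	`x+` matches one or more occurrences of x
`?`	Occurs at most once; same as `{0,1}`	`x?` matches zero or one occurrence of x
`{n}`	Occurs exactly n times	`\d{3}` matches a run of exactly 3 digits
`{n1,n2}`	Occurs between n1 and n2 times	`\d{1,4}` matches a run of 1 to 4 digits
`*?`	A `?` placed after a quantifier makes it a reluctant (lazy) quantifier: it matches as few characters as possible and stops at the first position where the rest of the pattern matches	In `stackoverflow`, `s.*?o` matches `stacko`, while `s.*o` matches `stackoverflo`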
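The difference between greedy and lazy quantifiers is easiest to see by running both on the same input. Below is a minimal Java sketch; the class `QuantifierDemo` and the helper `firstMatch` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Returns the first match of the regex in the input, or null if there is none.
    public static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: .* runs to the end, then backtracks to the last 'o'.
        System.out.println(firstMatch("s.*o", "stackoverflow"));  // stackoverflo
        // Lazy: .*? stops at the first 'o' it can reach.
        System.out.println(firstMatch("s.*?o", "stackoverflow")); // stacko

        System.out.println(firstMatch("\\d{3}", "tel 12345"));    // 123
        System.out.println(firstMatch("\\d{1,4}", "tel 12345"));  // 1234
    }
}
```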

Specifying Modes for a Regular Expression

A mode modifier can be placed at the start of a regular expression:

(?i) makes the regular expression case-insensitive
(?s) single-line mode: the dot matches all characters, including line breaks
(?m) multi-line mode: the caret and dollar match at the start and end of each line in the subject string

To enable several modes at once, combine them: (?ism)
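Each modifier can be tried out with and without the flag. The sketch below does that with Java's regex engine, which supports these inline flags; the class `ModeModifierDemo`, the helper `found`, and the sample inputs are made up for illustration.

```java
import java.util.regex.Pattern;

public class ModeModifierDemo {
    // True if the regex matches anywhere in the input.
    public static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String lines = "a\nb\nc"; // hypothetical three-line input

        // (?i): ignore case
        System.out.println(found("hello", "HELLO"));     // false
        System.out.println(found("(?i)hello", "HELLO")); // true

        // (?s): the dot also matches line breaks
        System.out.println(found("a.b", "a\nb"));        // false
        System.out.println(found("(?s)a.b", "a\nb"));    // true

        // (?m): ^ and $ match at every line, not only at the ends of the input
        System.out.println(found("^b$", lines));         // false
        System.out.println(found("(?m)^b$", lines));     // true
    }
}
```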

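Applications of Regular Expressions

Using Regular Expressions in Shell Scripts

Shell commands such as find and grep often take regular expressions to search for text.

For example, to find the IP addresses in the output of ifconfig: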

ifconfig | grep -E "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"

To filter IP addresses from tcpdump output:

tcpdump | grep -E "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"

To find all the words in a file:

cat xxx.log | grep -oE "[a-zA-Z]+"
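The same IP-address pattern used with grep above can also be applied from Java. This is a rough sketch, not a validator: the class `IpGrep`, the method `extract`, and the sample line are hypothetical, and the pattern only checks the dotted shape, so a string such as `999.999.999.999` would match as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IpGrep {
    // Same expression as in the grep commands, with backslashes doubled for Java.
    private static final Pattern IPV4_LIKE =
            Pattern.compile("[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}");

    // Returns every substring that looks like a dotted IPv4 address.
    public static List<String> extract(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = IPV4_LIKE.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical line in the style of ifconfig output.
        String line = "inet 192.168.1.10 netmask 255.255.255.0 broadcast 192.168.1.255";
        System.out.println(extract(line)); // [192.168.1.10, 255.255.255.0, 192.168.1.255]
    }
}
```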

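Using Regular Expressions in Java

Java's String class supports regular expressions directly, which makes text manipulation much more convenient:

Table 4 Regular expression methods of String

Method	Description
`str.matches("regex")`	Tests whether the whole string `str` matches `regex`
`str.split("regex")`	Splits `str` around matches of `regex`
`str.replaceFirst("regex","replacement")`	Replaces the first substring that matches `regex` with `replacement`
`str.replaceAll("regex","replacement")`	Replaces every substring that matches `regex` with `replacement`

Example: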


package de.vogella.regex.test;

public class RegexTestStrings {
    public static final String EXAMPLE_TEST = "This is my small example "
            + "string which I'm going to " + "use for pattern matching.";

    public static void main(String[] args) {
        System.out.println(EXAMPLE_TEST.matches("\\w.*"));
        String[] splitString = EXAMPLE_TEST.split("\\s+");
        System.out.println(splitString.length); // should be 14
        for (String string : splitString) {
            System.out.println(string);
        }
        // replace all whitespace with tabs
        System.out.println(EXAMPLE_TEST.replaceAll("\\s+", "\t"));
    }
}

Note: in a Java string literal the backslash \ is an escape character, so a single backslash in a regular expression has to be written as \\.
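Two points from the String methods are easy to miss: `matches` tests the whole string, not a substring, and `replaceFirst` stops after one replacement. A minimal sketch, with the class `StringRegexDemo`, its helper methods, and the sample date chosen purely for illustration:

```java
public class StringRegexDemo {
    // Replaces only the first '/' with '-'.
    public static String firstSlashToDash(String s) {
        return s.replaceFirst("/", "-");
    }

    // Replaces every '/' with '-'.
    public static String allSlashesToDashes(String s) {
        return s.replaceAll("/", "-");
    }

    public static void main(String[] args) {
        String date = "2018/11/30";

        // matches() must cover the entire string, so digits alone are not enough.
        System.out.println(date.matches("\\d+"));           // false
        System.out.println(date.matches("\\d+/\\d+/\\d+")); // true

        System.out.println(firstSlashToDash(date));         // 2018-11/30
        System.out.println(allSlashesToDashes(date));       // 2018-11-30
        System.out.println(date.split("/").length);         // 3
    }
}
```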

Pattern/Matcher

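For more advanced use, Java provides two classes, Pattern (java.util.regex.Pattern) and Matcher (java.util.regex.Matcher):

First, compile the regular expression into a Pattern;
then, use a Matcher created from that Pattern to operate on the target string.

Example: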


import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpression {
    private String regx;
    private Pattern pattern;

    public static void main(String[] args) {
        // regular expression test
        String input = "here 2016, now we encounter very confusing things. On the one hand, we human beings feel "
                + "very confident, but on the other hand, we are so fucking lost in our self-built world! we do waste our energy "
                + "and time  on useless things. We are totally lost.";

        String regDigit = "\\d";          // a single digit
        String regChars = "hand|useless"; // either of the two literal words
        String regWild = "[^af]";         // any character except 'a' or 'f'
        String regWord = "\\w";           // a single word character
        String regTimes = "[a-z]{5}";     // five consecutive lowercase letters

        RegularExpression reg = new RegularExpression(regDigit);
        reg.getMatcherResult(input);

        reg.setRegx(regChars);
        reg.getMatcherResult(input);

        reg.setRegx(regWild);
        reg.getMatcherResult(input);

        reg.setRegx(regWord);
        reg.getMatcherResult(input);

        reg.setRegx(regTimes);
        reg.getMatcherResult(input);
    }

    // Compiles the given regular expression once so it can be reused.
    public RegularExpression(String reg) {
        this.regx = reg;
        this.pattern = Pattern.compile(regx);
    }

    // Prints every match of the current pattern found in the input.
    public void getMatcherResult(String in) {
        System.out.println("current regular expression is " + regx);
        Matcher matcher = pattern.matcher(in);

        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }

    // Replaces the current regular expression and recompiles the pattern.
    public void setRegx(String regx) {
        this.regx = regx;
        this.pattern = Pattern.compile(regx);
    }
}
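Table 1 notes that parentheses produce substrings from a match. With Matcher those substrings are read through `group(n)`. The following is a small sketch under assumed names: the class `GroupDemo`, the method `parts`, and the date format are hypothetical and not part of the example above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Three capturing groups: year, month, day.
    private static final Pattern DATE = Pattern.compile("(\\d{4})/(\\d{2})/(\\d{2})");

    // Returns {year, month, day} for the first date found, or an empty array.
    public static String[] parts(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        // group(0) is the whole match; groups 1..3 are the parenthesized parts.
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] p = parts("posted 2018/11/30");
        System.out.println(String.join("-", p)); // 2018-11-30
    }
}
```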