Sets
We can define a set of characters with square brackets. For example, to match JavaScript, javaScript, Javascript and javascript, we can write a regular expression as below.
> /[Jj]ava[Ss]cript/.test("JavaScript")
true
> /[Jj]ava[Ss]cript/.test("Javascript")
true
Inside the square brackets, some characters that are special elsewhere in a regular expression can be used directly. For example, the dot (.) normally matches any character except line terminators, so to match the dot itself we need to escape it. But inside square brackets, we can use it directly.
> /[.]/.test(".")
true
> /[.]/.test(";")
false
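The same holds for several other special characters. As a small sketch, the snippet below puts ., *, + and ? in one set and checks that each matches literally. The helper name hasLiteralMeta is made up for this illustration.

```javascript
// Inside square brackets, ., *, + and ? lose their special meaning
// and match themselves. `hasLiteralMeta` is a hypothetical helper name.
const literalMeta = /[.*+?]/;

function hasLiteralMeta(text) {
  return literalMeta.test(text);
}

console.log(hasLiteralMeta("a.b"));  // true: contains a literal dot
console.log(hasLiteralMeta("a+b"));  // true: contains a literal plus
console.log(hasLiteralMeta("abc"));  // false: none of . * + ? present
```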
Ranges
Inside the square brackets, we can use the character - to express a range. Below are a few examples.
> /[1-5]/.test("3")
true
> /[a-c]/.test("b")
true
> /[a-c1-3]/.test("b")
true
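Several ranges can sit side by side in one set. As a sketch, the check below combines three ranges to recognise a single hexadecimal digit. The helper name isHexDigit is an assumption for this example, not something from the text above.

```javascript
// One set with three ranges: digits, lowercase a-f and uppercase A-F.
// `isHexDigit` is a hypothetical helper; it expects a one-character string.
const hexDigit = /[0-9a-fA-F]/;

function isHexDigit(ch) {
  return ch.length === 1 && hexDigit.test(ch);
}

console.log(isHexDigit("7"));  // true
console.log(isHexDigit("c"));  // true
console.log(isHexDigit("F"));  // true
console.log(isHexDigit("g"));  // false: g is outside a-f
```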
If we want to match - itself, we need to escape it. But if - is at the start or end of the square brackets, it cannot express a range there, so there is no need to escape it.
> /[a\-d]/.test("c")
false
> /[a\-d]/.test("-")
true
> /[\-123]/.test("-")
true
> /[-123]/.test("-")
true
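To see that the three spellings behave the same, the sketch below runs one character through an escaped hyphen, a leading hyphen and a trailing hyphen. The helper sameResult is hypothetical and exists only to compare the patterns.

```javascript
// Three ways to write a set containing a, d and a literal hyphen.
const escaped = /[a\-d]/;   // hyphen escaped in the middle
const atStart = /[-ad]/;    // hyphen first, no escape needed
const atEnd = /[ad-]/;      // hyphen last, no escape needed

// `sameResult` is a hypothetical helper: it returns the shared result
// when all three patterns agree, and throws if they ever disagree.
function sameResult(ch) {
  const results = [escaped, atStart, atEnd].map((re) => re.test(ch));
  if (!results.every((r) => r === results[0])) {
    throw new Error("patterns disagree on " + ch);
  }
  return results[0];
}

console.log(sameResult("-"));  // true
console.log(sameResult("a"));  // true
console.log(sameResult("c"));  // false: no a-d range is formed
```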
Excluding ranges
We can use the character ^ at the start of the square brackets to exclude the characters that follow. So for example, /[\d]/ means any digit, and /[^\d]/ means any character except a digit.
> /[^\d]/.test("5")
false
> /[^\d]/.test("a")
true
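A common use of an excluding set is to strip everything we do not want. The sketch below removes every non-digit from a string. It relies on String.prototype.replace and the g flag, which are not covered in this section, and the helper name digitsOnly is made up for the example.

```javascript
// [^\d] matches any single character that is not a digit.
// The g flag applies the replacement to every match, not just the first.
// `digitsOnly` is a hypothetical helper name.
function digitsOnly(text) {
  return text.replace(/[^\d]/g, "");
}

console.log(digitsOnly("(555) 123-4567"));  // 5551234567
console.log(digitsOnly("abc") === "");      // true: nothing is left
```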
If we want to match the character ^ itself, we can escape it or just not put it at the start.
> /[^\d]/.test("1")
false
> /[\^\d]/.test("1")
true
> /[\^\d]/.test("^")
true
> /[\d^]/.test("^")
true
Surrogate pairs
Strings in JavaScript are encoded in UTF-16. A single 16-bit code unit cannot represent every character, so characters outside the basic range are expressed by combining two code units, called a surrogate pair. We need to be careful when processing these characters.
For example, the regular expression below reports a match, but the matched result is strange.
> "๐".match(/[๐]/)
[ '\ud83d', index: 0, input: '๐', groups: undefined ]
This character is actually expressed by two code units, and the matched result is only the first half of it.
> "๐".charCodeAt(0)
55357
> '\ud83d'.charCodeAt(0)
55357
So to process this kind of character properly, we need to use the u flag.
> "๐".match(/[๐]/)
[ '\ud83d', index: 0, input: '๐', groups: undefined ]
> "๐".match(/[๐]/u)
[ '๐', index: 0, input: '๐', groups: undefined ]
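The sketch below repeats the comparison with an explicit code point, so the two code units are visible. U+1F600 is an assumed sample character chosen for this illustration. It is written as an escape, and the patterns are built with the RegExp constructor so that the same set can be compiled with and without the u flag.

```javascript
// U+1F600 is an assumed sample character outside the 16-bit range.
const face = "\u{1F600}";

console.log(face.length);                       // 2: two UTF-16 code units
console.log(face.charCodeAt(0).toString(16));   // d83d: first half of the pair
console.log(face.charCodeAt(1).toString(16));   // de00: second half of the pair
console.log(face.codePointAt(0).toString(16));  // 1f600: the full code point

// Without u, the set holds the two halves as separate members,
// so only the first code unit is matched.
const withoutU = face.match(new RegExp("[" + face + "]"));
console.log(withoutU[0].length);  // 1

// With u, the set holds one code point and the whole character matches.
const withU = face.match(new RegExp("[" + face + "]", "u"));
console.log(withU[0] === face);   // true
```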