JavaScript Regular Expression - Sets and Ranges

Sets

We can define a set of characters with square brackets. A set matches any single character listed inside it. For example, to match JavaScript, javaScript, Javascript and javascript, we can write a regular expression like the one below.

> /[Jj]ava[Ss]cript/.test("JavaScript")
true
> /[Jj]ava[Ss]cript/.test("Javascript")
true

Inside square brackets, most special regular expression characters lose their special meaning and can be used directly. For example, . normally matches any character, so to match a literal dot we have to escape it. Inside a set, we can write it without escaping.

> /[.]/.test(".")
true
> /[.]/.test(";")
false
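
The same rule applies to other special characters such as +, * and ?. Below is a minimal sketch; hasPunctuation is a hypothetical helper name used only for illustration.

```javascript
// Hypothetical helper: true when the text contains ".", "+", "*" or "?".
// Inside the brackets these four characters are literal, so no backslash is needed.
function hasPunctuation(text) {
  return /[.+*?]/.test(text);
}

console.log(hasPunctuation("a+b"));   // true
console.log(hasPunctuation("what?")); // true
console.log(hasPunctuation("abc"));   // false
```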

Ranges

Inside square brackets, we can use the - character to express a range. Below are a few examples.

> /[1-5]/.test("3")
true
> /[a-c]/.test("b")
true
> /[a-c1-3]/.test("b")
true
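
Several ranges can sit in one set, which is handy for something like hex digits. The sketch below assumes we want to check a six-digit hex color. isHexColor is a hypothetical helper, and the anchors ^ and $ and the quantifier {6} are extra syntax not covered in this post.

```javascript
// Hypothetical helper: checks for a six-digit hex color such as "#1a2B3c".
// [0-9a-fA-F] combines three ranges into one set.
// ^ and $ anchor the whole string, and {6} repeats the set exactly six times.
function isHexColor(text) {
  return /^#[0-9a-fA-F]{6}$/.test(text);
}

console.log(isHexColor("#1a2B3c")); // true
console.log(isHexColor("#12345g")); // false, "g" is outside a-f
console.log(isHexColor("#123"));    // false, too short
```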

If we want to match - itself, we need to escape it. But if - is at the start or end of the set, it cannot express a range, so there is no need to escape it.

> /[a\-d]/.test("c")
false
> /[a\-d]/.test("-")
true
> /[\-123]/.test("-")
true
> /[-123]/.test("-")
true
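
To make the difference easier to see, here is a small side-by-side sketch. The variable names range and literal are only for illustration.

```javascript
// Same three characters, different meaning depending on where "-" sits.
const range = /[a-d]/;   // a range: a, b, c or d
const literal = /[ad-]/; // "-" at the end: a, d or a literal "-"

console.log(range.test("c"));   // true
console.log(range.test("-"));   // false
console.log(literal.test("c")); // false
console.log(literal.test("-")); // true
```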

Excluding ranges

We can put the ^ character at the start of the square brackets to negate the set, so that it matches any character not listed. For example, /[\d]/ matches any digit, and /[^\d]/ matches any character except a digit.

> /[^\d]/.test("5")
false
> /[^\d]/.test("a")
true
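
A common use of a negated set is removing everything we do not want. The sketch below assumes we want to keep only the digits of a phone number. digitsOnly is a hypothetical helper, and the g flag (replace every match) is not covered in this post.

```javascript
// Hypothetical helper: removes every character that is not a digit.
// [^\d] matches one non-digit character, and the g flag applies it to all matches.
function digitsOnly(text) {
  return text.replace(/[^\d]/g, "");
}

console.log(digitsOnly("(555) 123-4567")); // "5551234567"
console.log(digitsOnly("abc"));            // ""
```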

If we want to match the ^ character itself, we can escape it or simply not put it at the start.

> /[^\d]/.test("1")
false
> /[\^\d]/.test("1")
true
> /[\^\d]/.test("^")
true
> /[\d^]/.test("^")
true

Surrogate pairs

Strings in JavaScript are encoded in UTF-16. A single 16-bit code unit cannot represent every character, so some characters, such as emoji, are stored as surrogate pairs: two code units combined to express one character. We need to be careful when processing these characters.

For example, the regular expression below produces a match, but the matched result is strange.

> "😄".match(/[😄]/)
[ '\ud83d', index: 0, input: '😄', groups: undefined ]

This character is actually made up of two code units, and the matched result is only the first half of it.

> "😄".charCodeAt(0)
55357
> '\ud83d'.charCodeAt(0)
55357

So to process this kind of character properly, we need to use the u flag.

> "😄".match(/[😄]/)
[ '\ud83d', index: 0, input: '😄', groups: undefined ]
> "😄".match(/[😄]/u)
[ '😄', index: 0, input: '😄', groups: undefined ]
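
The effect of the u flag is easier to see when the set has to match the whole string. This is a minimal sketch; the names smile, withoutFlag and withFlag are only for illustration, and the anchors ^ and $ are not covered in this post.

```javascript
const smile = "😄"; // U+1F604, stored as the two code units \ud83d and \ude04

// Without the u flag the set holds two separate code units,
// so it can only match half of the character at a time.
const withoutFlag = /^[😄]$/;
// With the u flag the set holds one whole character.
const withFlag = /^[😄]$/u;

console.log(smile.length);                      // 2
console.log(smile.codePointAt(0).toString(16)); // "1f604"
console.log(withoutFlag.test(smile));           // false
console.log(withFlag.test(smile));              // true
```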