Regex to match and limit character classes

Robby Pond

I'm not sure if this is possible using Regex but I'd like to be able to limit the number of underscores allowed based on a different character. This is to limit crazy wildcard queries to a search engine written in Java.

The starting characters would be alphanumeric. But I basically want a match if there are more underscores than preceding characters. So

BA_ would be fine but BA___ would match the regex and would get kicked out of the query parser.

Is that possible using Regex?

Casimir et Hippolyte

Yes, you can do it. This pattern succeeds only if there are fewer underscores than letters (you can adapt it to the characters you want):

^(?:[A-Z](?=[A-Z]*(\\1?+_)))*+[A-Z]+\\1?$

(As Pshemo notes, the anchors are not needed if you use the matches() method. I wrote them to show that this pattern must be bounded by one means or another, with lookarounds for example.)
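As a minimal sketch of how the first pattern might be called from Java, using underscores as in the question (the class and method names are illustrative assumptions, not part of the question or the answer):

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative only.
class WildcardCheck {
    // First pattern from the answer: it succeeds only when the trailing
    // underscores are fewer than the leading letters.
    static final Pattern FEWER_UNDERSCORES =
        Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+_)))*+[A-Z]+\\1?$");

    // matches() already requires the whole input to match, so the anchors
    // in the pattern are redundant here but harmless.
    static boolean isAcceptable(String term) {
        return FEWER_UNDERSCORES.matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("BA_"));   // expected: true
        System.out.println(isAcceptable("BA___")); // expected: false
    }
}
```

The character class [A-Z] only covers uppercase letters; for the alphanumeric terms in the question it would presumably need to be widened, for example to [A-Za-z0-9], in every place it appears.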

negated version:

^(?:[A-Z](?=[A-Z]*(\\1?+_)))*\\1?_*$
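A minimal sketch of the negated pattern used as a rejection test, which is the direction the question asks for (the class and method names are illustrative assumptions):

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative only.
class TooManyUnderscores {
    // Negated pattern from the answer: it succeeds when the underscores
    // are NOT fewer than the letters.
    static final Pattern NEGATED =
        Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+_)))*\\1?_*$");

    static boolean shouldReject(String term) {
        return NEGATED.matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldReject("BA___")); // expected: true
        System.out.println(shouldReject("BA_"));   // expected: false
    }
}
```

Since this is the negation of "fewer underscores than letters", a string with equal counts such as BA__ should also be rejected by this check, which may or may not be what the query parser wants.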

The idea is to repeat a capture group that contains a backreference to itself plus one underscore, so the capture group grows by one underscore at each repetition. ^(?:[A-Z](?=[A-Z]*(\\1?+_)))*+ matches every letter that has a corresponding underscore. You then only need to add [A-Z]+ to make sure that there are more letters, and to finish the pattern with \\1?, which contains all the underscores (it is optional in case there is no underscore at all).

Note that if you replace [A-Z]+ with [A-Z]{n} in the first pattern, you fix the difference between the number of letters and the number of underscores at exactly n.
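The {n} variant can be sketched in Java by building the pattern string from a parameter (the class, the method names and the parameter n are illustrative assumptions):

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative only.
class ExactDifference {
    // Same as the first pattern, with [A-Z]{n} in place of [A-Z]+,
    // so that exactly n letters are left without a paired underscore.
    static Pattern withDifference(int n) {
        return Pattern.compile(
            "^(?:[A-Z](?=[A-Z]*(\\1?+_)))*+[A-Z]{" + n + "}\\1?$");
    }

    static boolean hasDifference(String s, int n) {
        return withDifference(n).matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasDifference("ABC_", 2));  // 3 letters, 1 underscore
        System.out.println(hasDifference("ABCD_", 2)); // 4 letters, 1 underscore
    }
}
```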


To give a better idea, I will try to describe step by step how it works with the string ABC-- (since it's impossible to put underscores in bold, I use hyphens instead):

 In the non-capturing group, the first letter is found 
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 let's enter the lookahead (keep in mind that all in the lookahead is only
 a check and not a part of the match result.)
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 The first capturing group is encountered for the first time and its content is not
 defined. This is why an optional quantifier is used: it keeps the lookahead from
 failing. Consequence: \1?+ matches nothing here.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 The first hyphen is matched. Once the capture group is closed, the first capture
 group is defined and contains one hyphen.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 The lookahead succeeds, let's repeat the non-capturing group.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 The second letter is found
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 We enter the lookahead
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 But now things are different. The capture group was defined before and
 contains a hyphen, so \1?+ matches the first hyphen.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 The literal hyphen matches the second hyphen in the string, and capture
 group 1 now contains the two hyphens. The lookahead succeeds.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 We repeat the non-capturing group one more time.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 In the lookahead, there are no more letters. That is not a problem, since
 the * quantifier is used.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 \1?+ now matches two hyphens.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 But there is no hyphen left in the string for the literal hyphen, and the regex
 engine cannot backtrack since \1?+ has a possessive quantifier.
 The lookahead fails, and so does the third repetition of the non-capturing group!
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 [A-Z]+ ensures that there is at least one more letter.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 We match the end of the string with the backreference to capture group 1 that
 contains the two hyphens. Note that the fact that this backreference is optional
 allows the string to not have hyphens at all. 
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 This is the end of the string. The pattern succeeds.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
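One way to check the walkthrough above is to look at what capture group 1 holds after a successful match. The sketch below uses the hyphen version of the pattern; the class and method names are illustrative assumptions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative only.
class CaptureGrowth {
    // Hyphen version of the first pattern, as used in the walkthrough.
    static final Pattern P =
        Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+-)))*+[A-Z]+\\1?$");

    // Returns the content of capture group 1 after a successful match,
    // or null if the string does not match or the group was never set.
    static String pairedHyphens(String s) {
        Matcher m = P.matcher(s);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // For ABC-- the group should have grown to two hyphens.
        System.out.println(pairedHyphens("ABC--"));
        // With no hyphen at all the group is never defined.
        System.out.println(pairedHyphens("ABC"));
    }
}
```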


Note: The possessive quantifier on the non-capturing group is needed to avoid false positives. (Without it you can observe a strange behavior, which can sometimes be useful.)

Example: ABC--- and the pattern ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$ (without the possessive quantifier)

 The non-capturing group is repeated three times and `ABC` are matched:
ABC---     ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
 Note that at this step the first capturing group contains ---
 But after the non-capturing group there is no letter left for [A-Z]+ to match,
 and the regex engine must backtrack.
ABC---     ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$

Question: How many hyphens are in the capture group now?
Answer:   Still three!

Even if the repeated non-capturing group gives a letter back, the capture group still contains three hyphens (its content the last time the regex engine wrote to it). This is counter-intuitive, but logical.

 Then the letter C is found:
ABC---     ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
 And the three hyphens
ABC---     ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
 The pattern succeeds
ABC---     ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
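The false positive described above can be reproduced with a short Java comparison of the two variants on ABC---. This is a sketch based on the behavior described in this answer; the class and method names are illustrative assumptions:

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative only.
class PossessiveMatters {
    // Without the possessive quantifier on the non-capturing group.
    static final Pattern GREEDY =
        Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+-)))*[A-Z]+\\1?$");

    // With the possessive quantifier (*+), as in the first pattern.
    static final Pattern POSSESSIVE =
        Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+-)))*+[A-Z]+\\1?$");

    static boolean greedy(String s) {
        return GREEDY.matcher(s).matches();
    }

    static boolean possessive(String s) {
        return POSSESSIVE.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Three letters and three hyphens: only the greedy variant
        // should accept it, because backtracking keeps the stale capture.
        System.out.println(greedy("ABC---"));
        System.out.println(possessive("ABC---"));
    }
}
```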

Robby Pond asked me in the comments how to find strings that have more underscores than letters (where a letter is anything that is not an underscore). The best way is obviously to count the underscores and compare that number with the string length. As for a pure regex solution, it is not possible to build such a pattern in Java, since the pattern needs the recursion feature. For example, you can do it with PHP:

$pattern = <<<'EOD'
~
 (?(DEFINE)
     (?<neutral> (?: _ \g<neutral>?+ [A-Z] | [A-Z] \g<neutral>?+ _ )+ )
 )

 \A (?: \g<neutral> | _ )+ \z
~x
EOD;

var_dump(preg_match($pattern, '____ABC_DEF___'));
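For Java, the counting approach mentioned above could look like the following minimal sketch (the class and method names are illustrative assumptions):

```java
// Hypothetical helper; the names are illustrative only.
class UnderscoreCount {
    // True when the string contains more underscores than other characters.
    static boolean moreUnderscoresThanOthers(String s) {
        long underscores = s.chars().filter(c -> c == '_').count();
        return underscores > s.length() - underscores;
    }

    public static void main(String[] args) {
        // 8 underscores against 6 letters
        System.out.println(moreUnderscoresThanOthers("____ABC_DEF___")); // true
        // 1 underscore against 2 letters
        System.out.println(moreUnderscoresThanOthers("BA_"));            // false
    }
}
```

Unlike the regex patterns, this does not care where the underscores appear in the string.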
