Commit b1dd281

committed Aug 10, 2020
update docs: about named groups, lookahead, lookbehind, more edits
1 parent 8308cf2 commit b1dd281

1 file changed: +45 -51 lines changed

‎docs/regular_expressions.rst

@@ -265,8 +265,8 @@ order).
265265
Quantification
266266
--------------
267267

268-
Quantifier
269-
~~~~~~~~~~
268+
Quantifiers
269+
~~~~~~~~~~~
270270

271271
Any item of a regular expression may be followed by quantifier.
272272
Quantifier specifies number of repetition of the item.
@@ -341,7 +341,7 @@ RegEx Matches
341341
The choice
342342
----------
343343

344-
Expressions in the choice are separated by ``|``.
344+
Expressions in the choice are separated by vertical bar ``|``.
345345

346346
So ``fee|fie|foe`` will match any of ``fee``, ``fie``,
347347
or ``foe`` in the target string (as would ``f(e|i|o)e``).
@@ -373,11 +373,10 @@ RegEx Matches
373373

374374
.. _subexpression:
375375

376-
Subexpressions
377-
--------------
376+
Groups
377+
------
378378

379-
The brackets ``( ... )`` may also be used to define regular expression
380-
subexpressions.
379+
The brackets ``( ... )`` are used to define regular expression groups (ie subexpressions).
381380

382381
.. note::
383382
`TRegExpr <tregexpr.html>`__
@@ -410,8 +409,8 @@ Whole regular expression has number ``0``.
410409
Backreferences
411410
--------------
412411

413-
Metacharacters ``\1`` through ``\9`` are interpreted as backreferences.
414-
``\n`` matches previously matched subexpression ``n``.
412+
Metacharacters ``\1`` through ``\9`` are interpreted as backreferences to groups.
413+
They match the previously found group with the specified index.
415414

416415
=========== ============================
417416
RegEx Matches
@@ -420,15 +419,28 @@ RegEx Matches
420419
``(.+)\1+`` also ``abab`` and ``123123``
421420
=========== ============================
422421

423-
 ``(['"]?)(\d+)\1`` matchs ``"13"`` (in double quotes), or ``'4'`` (in
424-
single quotes) or ``77`` (without quotes) etc
422+
RegEx ``(['"]?)(\d+)\1`` matches ``"13"`` (in double quotes), or ``'4'`` (in
423+
single quotes) or ``77`` (without quotes) etc.
424+
425+
Named Groups and Backreferences
426+
-------------------------------
427+
428+
To make some group (ie subexpression) named, use this syntax: ``(?P<name>)``. Name of group must be valid identifier: first char is letter or "_", other chars are alphanumeric or "_". All named groups are also usual groups and share the same numbers 1 to 9.
429+
430+
Backreferences to named groups are ``(?P=name)``, the numbers ``\1`` to ``\9`` can also be used.
431+
432+
========================== ============================
433+
RegEx Matches
434+
========================== ============================
435+
``(?P<qq>['"])\w+(?P=qq)`` ``"word"`` and ``'word'``
436+
========================== ============================
425437

426438
Modifiers
427439
---------
428440

429441
Modifiers are for changing behaviour of regular expressions.
430442

431-
You can set modifiers globally in your system or change inside the the
443+
You can set modifiers globally in your system or change inside the
432444
regular expression using the `(?imsxr-imsxr) <#inlinemodifiers>`_.
433445

434446
.. note::
@@ -547,45 +559,26 @@ RegEx Matches
547559

548560
The modifier is set `On` by default.
549561

550-
Extensions
562+
Assertions
551563
----------
552564

553-
.. _lookahead:
565+
.. _assertions:
554566

555-
(?=<lookahead>)
556-
~~~~~~~~~~~~~~~
567+
Currently engine supports only these kinds of assertions:
557568

558-
``Look ahead`` assertion. It checks input for the regular expression
559-
``<look-ahead>``, but do not capture it.
569+
Positive lookahead assertion: ``foo(?=bar)`` matches "foo" only before "bar", and "bar" is excluded from the match.
560570

561-
.. note::
562-
`TRegExpr <tregexpr.html>`__
571+
Positive lookbehind assertion: ``(?<=foo)bar`` matches "bar" only after "foo", and "foo" is excluded from the match.
563572

564-
Look-ahead is not implemented in TRegExpr.
573+
Assertions are allowed only at the very beginning and ending of expression. They can contain subexpressions of any complexity (quantifiers are allowed, even groups are allowed). Lookahead and lookbehind can be present both.
565574

566-
In many cases you can replace ``look ahead`` with
567-
`Sub-expression <#subexpression>`_ and just ignore what will be
568-
captured in this subexpression.
575+
Non-capturing Groups
576+
--------------------
569577

570-
For example ``(blah)(?=foobar)(blah)`` is the same as ``(blah)(foobar)(blah)``.
571-
But in the latter version you have to exclude the middle sub-expression
572-
manually - use ``Match[1] + Match[3]`` and ignore ``Match[2]``.
578+
Syntax is like this: ``(?:subexpression)``.
573579

574-
This is just not so convenient as in the former version where you can use
575-
whole ``Match[0]`` because captured by ``look ahead`` part would not be
576-
included in the regular expression match.
577-
578-
.. _inlinemodifiers:
579-
580-
581-
(?:<non-capturing group>)
582-
~~~~~~~~~~~~~~~~~~~~~~~~~
583-
584-
``?:`` is used when you want to group an expression, but you do not want to
585-
save it as a matched/captured portion of the string.
586-
587-
So this is just a way to organize your regex into subexpressions without
588-
overhead of capturing result:
580+
Such groups do not have the "index" and are invisible for backreferences.
581+
Non-capturing groups are used when you want to group a subexpression, but you do not want to save it as a matched/captured portion of the string. So this is just a way to organize your regex into subexpressions without overhead of capturing result:
589582

590583
================================ =======================================
591584
RegEx Matches
@@ -596,11 +589,14 @@ RegEx Matches
596589
only ``sorokin.engineer``
597590
================================ =======================================
598591

599-
(?imsgxr-imsgxr)
600-
~~~~~~~~~~~~~~~~
592+
Inline Modifiers
593+
----------------
601594

602-
You may use it inside regular expression for modifying modifiers by the fly.
595+
.. _inlinemodifiers:
596+
597+
Syntax is like this: ``(?i)``, ``(?-i)``, ``(?msgxr-imsgxr)``.
603598

599+
You may use it inside regular expression for modifying modifiers on-the-fly.
604600
This can be especially handy because it has local scope in a regular
605601
expression. It affects only that part of regular expression that follows
606602
``(?imsgxr-imsgxr)`` operator.
@@ -620,13 +616,12 @@ RegEx Matches
620616
``((?i)Saint-)?Petersburg``   ``saint-Petersburg``, but not ``saint-petersburg``
621617
============================= ==================================================
622618

623-
(?#text)
624-
~~~~~~~~
619+
Comments
620+
--------
625621

626-
A comment, the text is ignored.
622+
Syntax is like this: ``(?#text)``. Text inside brackets is ignored.
627623

628-
Note that the comment is closed by
629-
the nearest ``)``, so there is no way to put a literal ``)`` in
624+
Note that the comment is closed by the nearest ``)``, so there is no way to put a literal ``)`` in
630625
the comment.
631626

632627
Afterword
@@ -635,4 +630,3 @@ Afterword
635630
In this `ancient blog post from previous
636631
century <https://sorokin.engineer/posts/en/text_processing_from_birds_eye_view.html>`__
637632
I illustrate some usages of regular expressions.
638-
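
The constructs documented in this commit (named groups, named and numbered backreferences, lookahead, lookbehind, non-capturing groups) can be tried out with a minimal sketch like the one below. It uses Python's ``re`` module as a stand-in, which is an assumption: TRegExpr is a separate engine, and its behaviour may differ (for example, the docs above restrict assertions to the very beginning and ending of an expression, which Python does not). The URL-like pattern in the last example is hypothetical and not taken from the docs.

```python
import re

# Named group with a named backreference: the closing quote must be
# the same character that the group "qq" captured.
quoted = re.compile(r"""(?P<qq>['"])\w+(?P=qq)""")

# Numbered backreference with an optional quote, as in the docs.
numbered = re.compile(r"""(['"]?)(\d+)\1""")

# Positive lookahead: "foo" only before "bar"; "bar" is not part of the match.
lookahead = re.compile(r"foo(?=bar)")

# Positive lookbehind: "bar" only after "foo"; "foo" is not part of the match.
lookbehind = re.compile(r"(?<=foo)bar")

# Non-capturing group: (?:...) groups the alternatives without taking an index,
# so the only numbered group is the host part. (Hypothetical pattern.)
non_capturing = re.compile(r"(?:https?|ftp)://(\w+)")


def matched_text(pattern, text):
    """Return the text matched by the first occurrence, or None if no match."""
    m = pattern.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    m = quoted.fullmatch('"word"')
    # A named group is also a usual numbered group.
    print(m.group("qq") == m.group(1))
    print(matched_text(lookahead, "foobar"))
    print(matched_text(lookbehind, "foobar"))
    print(non_capturing.groups)
```

When checking a pattern against TRegExpr itself, the same inputs can be reused, but the results should be verified in that engine rather than inferred from this Python sketch.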
