Regular expressions in PHP ~ DPLW

A regular expression is a string of characters that forms a pattern, usually standing for a larger group of characters, so that we can compare the pattern with another set of characters and see whether they match.

Regular expressions are available in almost every programming language, but although their syntax is fairly uniform, each language uses its own dialect.

If this is the first time you have come across regular expressions (regex for short), it may encourage you to know that you have probably already used them without realizing it, at least in their most basic form. For example, when we run dir *.* in a DOS window to list all the files in a directory, we are applying the same idea: the pattern * matches any string of characters.

A simplified example:

am        // this is our pattern

When compared with:

am        // matches
panorama  // matches
ambition  // matches
camp      // matches
hand      // does not match

It is simply a matter of comparing a pattern (in this example, the sequence of letters "am") with a string (the subject) and seeing whether that sequence occurs within it. If it does, we say that we have found a match.
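The comparison above can be tried out in PHP. This is only a minimal sketch: it assumes the PCRE function preg_match(), which the article has not introduced yet, and uses slashes as pattern delimiters. The variable names are illustrative.

```php
<?php
// Literal pattern "am", written with / as the PCRE delimiter.
$pattern  = '/am/';
$subjects = ['am', 'panorama', 'ambition', 'camp', 'hand'];

$results = [];
foreach ($subjects as $subject) {
    // preg_match() returns 1 when the pattern is found, 0 otherwise.
    $results[$subject] = preg_match($pattern, $subject) === 1;
    echo $subject, ' // ', $results[$subject] ? 'matches' : 'does not match', "\n";
}
```

Only "hand" is reported as not matching, because it is the only subject that does not contain the sequence "am".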

Another example:

the                        // this is the pattern
aleve the slight wing fan  // compared with this string, it matches

So far the examples have been simple, since the patterns used were literals; that is, a match is found only when the exact sequence occurs.

If we know in advance the exact string we are looking for, there is no need to struggle with a complicated pattern: we can use the exact string as the pattern, and that and nothing else will be the match. Thus, if we are looking for the data of the user pepe in a list of names, we can use pepe as the pattern. But if, besides pepe, we also want to find occurrences of pepa and pepito, literals are not enough.

The power of regular expressions lies precisely in the flexibility of patterns, which can be compared with any word or text string that has a known structure.

In fact, it is usually unnecessary to use regular expression functions if we only need literal patterns. Other functions (the string functions) work more efficiently and quickly with literals.
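As a rough illustration of that last point, literal searches can be done with PHP's ordinary string and array functions. The list of names below is hypothetical and exists only for the example.

```php
<?php
// Hypothetical list of user names, used only for illustration.
$names = ['ana', 'pepe', 'luis'];

// Exact comparison: no regular expression needed.
$hasPepe = in_array('pepe', $names, true);

// Substring search: strpos() returns false when the literal is absent,
// so the result must be compared with !== false (position 0 is a valid hit).
$containsAm = strpos('ambition', 'am') !== false;
$handHasAm  = strpos('hand', 'am') !== false;

var_dump($hasPepe, $containsAm, $handHasAm);
```

The strict comparison with false matters here: in "ambition" the literal is found at position 0, which a loose comparison would treat as "not found".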

Characters and Metacharacters

Our pattern can be made up of ordinary characters (a group of letters, numbers or symbols) or of metacharacters, which stand for other characters or give the search a context.

Metacharacters are so named because they do not represent themselves, but are interpreted in a special way.

Here is the list of the most used metacharacters:

. * ? + [] () {} ^ $ | \

We will look at their use, grouped according to their purpose.

Positioning metacharacters, or anchors

The ^ and $ signs indicate where our pattern must be located within the string for there to be a match.

When we use the ^ sign, we mean that the pattern must appear at the beginning of the string to match. When we use the $ sign, we mean that the pattern must appear at the end of the string or, more precisely, before a final newline. Thus:

^am           // our pattern
am            // matches
bed           // does not match
ambidextrous  // matches
pam           // does not match
wow           // does not match

m$            // our pattern
am            // matches
salam         // matches
amber         // does not match
pam           // matches

^am$          // our pattern
am            // matches
salam         // does not match
amber         // does not match
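The anchor examples can be checked with a few lines of PHP. This is a sketch that assumes preg_match() and a small helper, anchorMatches(), which is hypothetical and defined here only for illustration.

```php
<?php
// Helper for illustration only: true when $pattern matches $subject.
function anchorMatches(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

$startOk   = anchorMatches('/^am/', 'ambidextrous'); // "am" at the start
$startFail = anchorMatches('/^am/', 'pam');          // "am" is not at the start
$endOk     = anchorMatches('/m$/', 'salam');         // "m" at the end
$endFail   = anchorMatches('/m$/', 'amber');         // ends in "r"
$wholeOk   = anchorMatches('/^am$/', 'am');          // the whole string is "am"
$wholeFail = anchorMatches('/^am$/', 'salam');       // extra characters before "am"

// By default $ also matches just before a final newline.
$beforeNewline = anchorMatches('/m$/', "am\n");

var_dump($startOk, $startFail, $endOk, $endFail, $wholeOk, $wholeFail, $beforeNewline);
```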

Escaping characters

It may happen that we need to include a metacharacter in our pattern as a literal sign, that is, for itself and not for what it represents. To indicate this we use an escape character: the backslash \.
Thus, a pattern defined as 12\$ does not match a string ending in 12, but does match one containing 12$.

As a rule, the backslash \ turns special characters into normal ones and normal characters into special ones.
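A short sketch of escaping in PHP follows. It assumes the PCRE functions preg_match() and preg_quote(); the sample strings are invented for the example. Single-quoted PHP strings are used so that PHP itself does not try to interpret the backslash or the dollar sign.

```php
<?php
// Pattern for the literal text "12$": the backslash removes the
// special meaning of $ as an end-of-string anchor.
$escaped = '/12\$/';

$endsIn12  = preg_match($escaped, 'total: 12');  // no literal $ in the subject
$hasDollar = preg_match($escaped, 'price: 12$'); // literal "12$" is present

// preg_quote() escapes the metacharacters of a literal for us.
$quoted = preg_quote('12$.', '/');

var_dump($endsIn12, $hasDollar, $quoted);
```

preg_quote() is handy when the literal comes from user input and may contain any of the metacharacters listed earlier.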

The dot . as a metacharacter

If a metacharacter is a character that can represent others, then the dot is the metacharacter par excellence. A dot in the pattern represents any character except a newline.

And, as we have seen, if what we want to find in the string is precisely a dot, we escape it: \.

pattern: .l

aleve the slight wing fan

Notice in the above example how the pattern matches any character (including whitespace) followed by an l.
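The behaviour of the dot can be sketched with preg_match_all(), which collects every occurrence of a pattern. The subject string below is hypothetical, chosen so that one "l" follows a space and another follows a letter.

```php
<?php
// Hypothetical subject string, chosen so that one "l" follows a space.
$subject = 'a lot of help';

// Every occurrence of "any character followed by l".
preg_match_all('/.l/', $subject, $found);
print_r($found[0]);

// An escaped dot only matches a literal dot.
$literalDot = preg_match('/\./', 'file.txt');
$noDot      = preg_match('/\./', 'filetxt');
```

The first match is a space followed by "l" and the second is "el", which shows that the dot accepts whitespace as readily as letters.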

Quantifier metacharacters

The metacharacters we have seen so far tell us whether our pattern matches the string being compared. But what if the pattern may appear one or more times in the string, or may not appear at all? For this we use a special type of metacharacter: the quantifiers.

These metacharacters apply to the character or group of characters that precedes them, and indicate how many times it must be present in the string for there to be a match.

They are therefore called quantifiers or multipliers. The most used are * ? +

*           // Matches if the preceding character (or group of characters)
            // is present 0 or more times
            // ab* matches "a", "ab", "abbb" and so on.
            // Example:
cant*a      // matches canta, cana, cantttta

?           // Matches if the preceding character (or group of characters)
            // is present 0 or 1 times
            // ab? matches "a" and "ab", but not the whole of "abb"
            // Example:
cant?a      // matches canta and cana
d?el        // matches del and el
(ala)?cena  // matches cena and alacena

+           // Matches if the preceding character (or group) is
            // present 1 or more times
            // ab+ matches "ab", "abbb" and so on. It does not match "a"
            // Example:
cant+a      // matches canta, canttttta; does not match cana
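The three quantifiers can be compared side by side with a small PHP sketch. It reuses the "ab" patterns from the comments above and adds the anchors ^ and $ (an assumption of this example) so that each subject is compared as a whole string rather than searched for a partial match.

```php
<?php
// Anchors (^ and $) are added so that each subject is compared as a whole.
$patterns = ['/^ab*$/', '/^ab?$/', '/^ab+$/'];
$subjects = ['a', 'ab', 'abbb'];

$outcome = [];
foreach ($patterns as $pattern) {
    foreach ($subjects as $subject) {
        $outcome[$pattern][$subject] = preg_match($pattern, $subject) === 1;
    }
}
print_r($outcome);
```

Without the anchors, ab? would also report a match for "abbb", because the pattern is found inside the string even though it does not cover all of it.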

Range metacharacters

Square brackets [] in a pattern let us specify a set or range of valid characters to compare against. It is enough for any one of them to be present for the condition to be met:

[abc]      // Matches if the string contains any one of
           // these three characters: a, b, c
[a-c]      // Matches if there is a letter in the range ("a", "b" or "c")
c[ao]sa    // matches casa and cosa
[^abc]     // Matches any character that is NOT one of
           // these three: a, b, c
           // Note that the ^ sign here means exclusion
c[^ao]sa   // matches cesa, cusa, cisa (etc.); does not match
           // casa or cosa
[0-9]      // Matches any digit between 0 and 9
[^0-9]     // Matches any character that is not a digit
[A-Z]      // Matches any alphabetic character in upper case.
           // Does not include numbers.
[a-z]      // As above, in lower case
[a-zA-Z]   // Any alphabetic character, upper or lower case

One thing to remember is that the rules of regular expression syntax do not apply in the same way inside brackets. For example, the metacharacter ^ does not act as an anchor here, but negates the set of characters. Nor is it necessary to escape every metacharacter with a backslash. Only the following need to be escaped: ] \ ^ -

The remaining metacharacters can be included as they are, since inside brackets they are treated as normal characters.

pattern: [aeiou]

aleve the slight wing fan

pattern: [^aeiou]

aleve the slight wing fan

pattern: [a-d]

aleve the slight wing fan
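A brief PHP sketch of bracket expressions follows. The sample words and the phone-like string are assumptions made for the example; preg_match_all() and preg_replace() are standard PCRE functions.

```php
<?php
// Sample words, for illustration only.
$subject = 'casa cosa cesa';

// "a" or "o" in second position.
preg_match_all('/c[ao]sa/', $subject, $listed);

// Anything except "a" or "o" in second position.
preg_match_all('/c[^ao]sa/', $subject, $excluded);

// Inside brackets ^ negates and a hyphen between two characters marks a
// range, so this removes everything that is not a digit.
$digits = preg_replace('/[^0-9]/', '', 'tel. 555-0100');

print_r($listed[0]);
print_r($excluded[0]);
echo $digits, "\n";
```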

As these patterns are used over and over again, there are shortcuts:

// Shortcut   Equivalent        Meaning

\d            [0-9]             // digits 0 to 9
\D            [^0-9]            // the opposite of \d
\w            [0-9A-Za-z_]      // any digit, letter or underscore
\W            [^0-9A-Za-z_]     // the opposite of \w: a character that is
                                // not a letter, digit or underscore
\s            [ \t\n\r\f\v]     // whitespace: space, tab, newline, carriage
                                // return, form feed or vertical tab
\S            [^ \t\n\r\f\v]    // the opposite of \s: any character
                                // that is not whitespace

// POSIX character classes (used inside brackets)

[[:alpha:]]                     // any alphabetic character, a-z and A-Z
[[:digit:]]                     // any digit 0-9
[[:alnum:]]                     // any alphanumeric character: 0-9, a-z, A-Z
[[:space:]]                     // whitespace
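The shortcuts can be exercised with a few PCRE calls. This is a sketch under assumptions: the input line is invented, and preg_split() is used here even though the article does not mention it.

```php
<?php
// Hypothetical input: two words and a number separated by a space and a tab.
$line = "order 66\twidget_a";

// \d+ : runs of one or more digits.
preg_match_all('/\d+/', $line, $numbers);

// \s+ : split on any run of whitespace (space, tab, newline...).
$fields = preg_split('/\s+/', $line);

// \w also accepts the underscore, not just letters and digits.
$isWord = preg_match('/^\w+$/', 'widget_a');

// A POSIX class is written inside a bracket expression.
$isAlpha = preg_match('/^[[:alpha:]]+$/', 'widget');

print_r($numbers[0]);
print_r($fields);
```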

Alternation and grouping metacharacters

(xyz)       // matches the exact sequence xyz

x|y         // matches if x or y is present

(Don|Doña)  // matches if "Don" or "Doña" is present

Parentheses serve not only to group sequences of characters, but also to capture subpatterns, which can then be returned to the script (back-references). We will say more about this when we cover the POSIX and PCRE functions in the following pages.
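As a small preview of capturing, the sketch below combines alternation and grouping. The sentence and the names in it are sample data invented for the example, and it assumes the script file is saved in a single consistent encoding so that "ñ" is the same in the pattern and in the subject.

```php
<?php
// Hypothetical sentence; the names are only sample data.
$subject = 'Doña Ana and Don Luis';

// Each alternative is tried in turn; the first pair of parentheses
// captures whichever title matched, the second captures the name.
preg_match_all('/(Don|Doña) (\w+)/', $subject, $m);

print_r($m[1]); // the titles
print_r($m[2]); // the names
```

Index 0 of the result holds the complete matches, and each further index corresponds to one pair of parentheses, counted by their opening bracket from left to right.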

A typical example would be a regular expression whose pattern captures valid URLs and turns them into links on the fly:

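Code:

 <?php
 // ereg_replace() was removed in PHP 7.0, so the PCRE equivalent is used here
 $text = "one of the best sites is http://www.cristalab.com";
 // \0 is the whole match, \1 the first group in parentheses
 $text = preg_replace('#http://(.*\.(com|net|org))#', '<a href="\0">\1</a>', $text);
 print $text;
 ?>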

The above example would produce a usable link in which the URL is taken from back-reference \0 and the visible part from back-reference \1: one of the best sites is <a href="http://www.cristalab.com">www.cristalab.com</a>

Note that in the above example we used two sets of (nested) parentheses, so there are two captures. Back-reference \0 corresponds to the entire match; no parentheses are needed to capture it.

Back-reference \1 matches, in this case, "www.cristalab.com" and is captured by the parentheses (.*\.(com|net|org))

Back-reference \2 matches "com" and corresponds to the nested parentheses (com|net|org)

Note that capturing occurrences and keeping them available as back-references consumes system resources. If you use parentheses in your regular expressions but know in advance that you will not reuse the occurrences, you can dispense with the capture by placing ?: right after the opening parenthesis:

Code:

 <?php
 // (?: ... ) is PCRE syntax, so preg_replace() is needed here
 $text = preg_replace('#http://(.*\.(?:com|net|org))#', '<a href="\0">\1</a>', $text);
 ?>

By writing (?:com|net|org), the subpattern in parentheses is still grouped, but the match is no longer captured.
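The difference between capturing and non-capturing groups can be seen by inspecting what preg_match() stores in its third argument. This is a sketch that assumes the PCRE functions and uses # as the delimiter so that the slashes in the URL need no escaping.

```php
<?php
$text = 'one of the best sites is http://www.cristalab.com';

// Two capturing groups: index 2 holds the top-level domain.
preg_match('#http://(.*\.(com|net|org))#', $text, $capturing);

// With (?: ... ) the inner group still groups, but is not captured.
preg_match('#http://(.*\.(?:com|net|org))#', $text, $grouped);

// The replacement can still use \0 (whole match) and \1 (first group).
$link = preg_replace('#http://(.*\.(?:com|net|org))#', '<a href="\0">\1</a>', $text);

print_r($capturing);
print_r($grouped);
echo $link, "\n";
```

The second result array has one element fewer, which is the saving the text describes: the alternation is still applied, but nothing is stored for it.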

DPLW

Nov 14, 2011