Regular expression

Regular expressions, also known as regex or regexp, provide a concise and flexible means of matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor.

Really clever "wild card" expressions for matching and parsing strings
Very powerful and quite cryptic
Fun once you understand them
Regular expressions are a language unto themselves
A language of "marker characters": programming with characters
It is kind of an "old school" language: compact

Regular expression Quick guide

^ Matches the beginning of a line

$ Matches the end of the line

. Matches any character

\s Matches whitespace

\S Matches any non-whitespace character

* Repeats a character zero or more times

*? Repeats a character zero or more times (non-greedy)

+ Repeats a character one or more times

+? Repeats a character one or more times (non-greedy)

[aeiou] Matches a single character in the listed set

[^XYZ] Matches a single character not in the listed set

[a-z0-9] The set of characters can include a range

( Indicates where string extraction is to start

) Indicates where string extraction is to end
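As a rough illustration of how several entries in the quick guide combine, the sketch below uses ^, \S, + and parentheses on one line of text. The sample line and the address in it are made up for this example.

```python
import re

# Hypothetical sample line (an assumption made for illustration only).
line = "From alice@example.com Sat Jan  5 09:14:16 2008"

# Without parentheses, findall() returns the whole matched text.
whole = re.findall(r"^From \S+@\S+", line)

# With parentheses, findall() returns only the part inside them.
address = re.findall(r"^From (\S+@\S+)", line)

print(whole)    # ['From alice@example.com']
print(address)  # ['alice@example.com']
```

The parentheses do not change what must match; they only mark which part of the match is extracted.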

The regular expression module

Before you can use regular expressions in your program, you must import the library using "import re"
You can use re.search() to see if a string matches a regular expression, similar to using the find() method for strings
You can use re.findall() to extract portions of a string that match your regular expression, similar to a combination of find() and slicing: var[5:10]
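The comparison above can be sketched side by side. The sample string is hypothetical, and the pattern used for the domain is just one possible choice.

```python
import re

text = "From: alice@example.com"  # hypothetical sample string

# re.search() reports whether the pattern occurs anywhere in the string,
# much like checking text.find('From:') >= 0.
found_with_find = text.find("From:") >= 0
found_with_search = re.search("From:", text) is not None

# find() plus slicing needs the position worked out by hand ...
at = text.find("@")
domain_by_slicing = text[at + 1:]

# ... while re.findall() returns the matching piece directly, in a list.
domain_by_findall = re.findall(r"@(\S+)", text)

print(found_with_find, found_with_search)    # True True
print(domain_by_slicing, domain_by_findall)  # example.com ['example.com']
```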

Using re.search() like find()

hand = open('file.txt')
for line in hand:
    line = line.rstrip()
    if line.find('From:') >= 0:
        print(line)

is the same as

import re

hand = open('file.txt')
for line in hand:
    line = line.rstrip()
    if re.search('From:', line):
        print(line)

Using re.search() like startswith()

hand = open('file.txt')
for line in hand:
    line = line.rstrip()
    if line.startswith('From:'):
        print(line)

is the same as

import re

hand = open('file.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line):
        print(line)

We fine-tune what is matched by adding special characters to the string

Wild card characters

The dot character matches any character
If you add the asterisk character, the preceding character is matched "any number of times"

^X.*:

The line must start with a capital X, followed by any number of characters, followed by a colon (:)
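A small sketch of the wild card pattern in use, filtering a list of lines. The sample lines are invented for illustration.

```python
import re

# Hypothetical sample lines, loosely modelled on mail headers.
lines = [
    "X-Sieve: CMU Sieve 2.3",
    "X-DSPAM-Result: Innocent",
    "From: alice@example.com",
    "X marks the spot",
]

# '^X.*:' = starts with X, then any characters (possibly none), then a colon.
matches = [line for line in lines if re.search(r"^X.*:", line)]

print(matches)  # ['X-Sieve: CMU Sieve 2.3', 'X-DSPAM-Result: Innocent']
```

The last line starts with X but contains no colon, so it is not matched.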

Matching and Extracting data

re.search() returns a match object (treated as True) or None (treated as False), depending on whether the string matches the regular expression
If we actually want the matching strings to be extracted, we use re.findall()

[0-9]+ One or more digits

import re

x = "My 2 favorite numbers are 19 and 42"
y = re.findall('[0-9]+', x)
print(y)

['2', '19', '42']

The repeat characters (* and +) push outward in both directions (greedy) to match the largest possible string

^F.+:

The first character in the match is an F, followed by one or more characters, and the last character in the match is a colon (:)

Non Greedy Matching
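Adding ? after a repeat character (*? or +?, as listed in the quick guide) makes it stop at the shortest possible match instead of the longest. A minimal comparison, using a made-up sample string:

```python
import re

x = "From: Using the : character"  # hypothetical sample string

# Greedy: .+ expands as far as it can, so the match ends at the LAST colon.
greedy = re.findall(r"^F.+:", x)

# Non-greedy: .+? expands as little as it can, so the match ends at the FIRST colon.
non_greedy = re.findall(r"^F.+?:", x)

print(greedy)      # ['From: Using the :']
print(non_greedy)  # ['From:']
```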

Network Technology

TCP (Transmission Control Protocol) is built on top of IP (Internet Protocol)
Assumes IP might lose some data, so it stores and retransmits data if it seems to be lost
Handles "flow control" using a transmit window
Provides a nice reliable pipe

Ports are similar to telephone number extensions
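To make the idea of a reliable pipe and a port concrete, here is a minimal sketch that opens a TCP connection between two sockets on the same machine. The helper name serve_once, the loopback address and the upper-casing behaviour are all choices made for this example, not part of any real service.

```python
import socket
import threading

def serve_once(server_sock):
    """Accept one connection and send back what was received, upper-cased."""
    conn, _addr = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())

# Binding to port 0 asks the operating system to pick any free port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

worker = threading.Thread(target=serve_once, args=(server,))
worker.start()

# The client reaches the server by host AND port,
# like a phone number plus an extension.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
client.sendall(b"hello over tcp")
reply = client.recv(1024)
client.close()

worker.join()
server.close()

print(reply.decode())  # HELLO OVER TCP
```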

HyperText Transfer Protocol

Since TCP and Python give us a reliable socket, what do we want to do with the socket? What problem do we want to solve?
Application Protocols

Mail
World Wide Web

The HyperText Transfer Protocol is the set of rules that allows browsers to retrieve web documents from servers over the Internet
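As an illustration of what those rules look like on the wire, the sketch below builds the text of a simple GET request and splits a response into headers and body. No network connection is made: the host, path and response are hand-written stand-ins for what a real browser and server would exchange.

```python
# Hypothetical host and path, used only for illustration.
host = "data.example.org"
path = "/page.html"

# A request is plain text: a request line, header lines, then a blank line.
request_text = (
    f"GET {path} HTTP/1.0\r\n"
    f"Host: {host}\r\n"
    "\r\n"
)
request_bytes = request_text.encode()  # sockets send bytes, not str

# Hand-written response imitating what a server might send back.
sample_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"Hello, web"
)

# Headers and body are separated by the first blank line.
head, _, body = sample_response.partition(b"\r\n\r\n")
status_line = head.decode().split("\r\n")[0]

print(status_line)    # HTTP/1.1 200 OK
print(body.decode())  # Hello, web
```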

Using Developer Console in browser

ASCII: American Standard Code for Information Interchange

Unicode means universal code

UTF-8 is backward compatible with ASCII: the first 128 characters are encoded with the same single bytes

In the above program, encode() converts a string into bytes

and decode() converts bytes back into a string
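A short sketch of the round trip between strings and bytes. The sample strings are arbitrary; encode() and decode() use UTF-8 by default in Python 3.

```python
text = "caf\u00e9"      # 'café': four characters, one of them outside ASCII
data = text.encode()    # str -> bytes, UTF-8 by default

print(len(text), len(data))  # 4 5  (the accented letter takes two bytes)
print(data.decode() == text)  # True (bytes -> str restores the original)

# For pure ASCII text, the UTF-8 bytes and the ASCII bytes are identical.
plain = "GET /index.html"
same_bytes = plain.encode("utf-8") == plain.encode("ascii")
print(same_bytes)  # True
```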

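Retrieving a web page

import urllib.request, urllib.parse, urllib.error

# Placeholder URL; urlopen() needs the scheme (http://) to be included
fhand = urllib.request.urlopen('http://www.somesite.com')

counts = dict()
for line in fhand:
    # Each line arrives as bytes, so decode it before splitting into words
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)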

What is web scraping?

When a program or script pretends to be a browser and retrieves web pages, looks at those web pages, extracts information, and then looks at more web pages
Search engines scrape web pages - we call this "spidering the web" or "web crawling"
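The "extract information, then look at more pages" step usually starts with collecting the links on a page. The sketch below does this with the standard library's html.parser instead of BeautifulSoup, and runs on a hand-written page rather than a downloaded one. The class name LinkCollector and the sample HTML are assumptions for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hand-written stand-in for a page that a scraper would have retrieved.
page = """
<html><body>
<h1>Sample page</h1>
<p><a href="http://www.example.com/page1.html">First</a></p>
<p><a href="http://www.example.com/page2.html">Second</a></p>
</body></html>
"""

collector = LinkCollector()
collector.feed(page)

print(collector.links)
# ['http://www.example.com/page1.html', 'http://www.example.com/page2.html']
```

A crawler would then retrieve each collected link and repeat the process.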

Why scrape?

Pull data, particularly social data - who links to who?
Get your own data back out of some system that has no "export capability"
Monitor a site for new information
Spider the web to make a database for a search engine

TCP/IP gives us pipes/sockets between applications
We designed application protocols to make use of these pipes
HyperText Transfer Protocol (HTTP) is a simple yet powerful protocol
Python has good support for sockets, HTTP, and HTML parsing

If you want to use our samples "as is", download our Python 3 version of BeautifulSoup4 from

http://www.py4e.com/code3/bs4.zip

Data on the web

With the HTTP request/response well understood and well supported, there was a natural move toward exchanging data between programs using these protocols
We needed to come up with an agreed way to represent data going between applications and across networks
There are two commonly used formats: XML and JSON
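As a small sketch of the JSON side of that exchange, the standard json module can turn a Python structure into text for sending and back again on arrival. The record below is invented for illustration.

```python
import json

# Hypothetical record that one program wants to send to another.
person = {"name": "Alice", "languages": ["Python", "C"], "age": 30}

wire_text = json.dumps(person)    # Python dict -> JSON text
received = json.loads(wire_text)  # JSON text -> Python dict

print(type(wire_text).__name__)   # str
print(received["languages"][0])   # Python
print(received == person)         # True
```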

XML (eXtensible Markup Language)

XML became popular when HTML became popular

Its primary purpose is to help information systems share structured data
It started as a simplified subset of the Standard Generalized Markup Language (SGML), and is designed to be relatively human-legible
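A minimal sketch of reading structured data from XML, using the standard library's xml.etree.ElementTree. The document below, including its tag and attribute names, is made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document describing one person.
document = """<person>
  <name>Alice</name>
  <phone type="intl">+1 555 0100</phone>
  <email hide="yes" />
</person>"""

tree = ET.fromstring(document)

name = tree.find("name").text                # text between the tags
phone_type = tree.find("phone").get("type")  # value of an attribute
hide = tree.find("email").get("hide")

print(name)        # Alice
print(phone_type)  # intl
print(hide)        # yes
```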