Transcript
The Web Graph and PageRank
Networked Life Prof. Michael Kearns
Lecture Roadmap • • • •
Viewing the Web as a directed graph or network Structure of the Web Measures of vertex importance in a network Google’s PageRank algorithm
Directed Networks • The web graph is a directed network: – – – –
page A can point/link to page B, without a reciprocal link represent directed edges with arrows each web page/site thus has “in-links” and “out-links” in-degree = # of in-links; out-degree = # of out-links
B D A
C
E
F
The Web as Network • What is the structure of this network? – connected components – degree distributions – etc.
• What does it say about the people building and using it? – page and link generation – visitation statistics
• What are the algorithmic consequences? – web search – community identification
Graph Structure in the Web [Broder et al. paper]
• Report on the results of two massive “web crawls” • Executed by AltaVista in May and October 1999 • Details of the crawls: – – – –
automated script following hyperlinks (URLs) from pages found large set of starting points collected over time crawl implemented as breadth-first search have to deal with webspam, infinite paths, timeouts, duplicates, etc.
• May ’99 crawl: – 200 million pages, 1.5 billion links
• Oct ’99 crawl: – 271 million pages, 2.1 billion links
• Unaudited, self-reported Sep ’03 stats: – 3 major search engines claim > 3 billion pages indexed
Five Easy Pieces • Authors did two kinds of breadth-first search: – ignoring link direction weak connectivity – only following forward links strong connectivity
• They then identify five different regions of the web: – strongly connected component (SCC):
• can reach any page in SCC from any other in directed fashion
– component IN:
• can reach any page in SCC in directed fashion, but not reverse
– component OUT:
• can be reached from any page in SCC, but not reverse
– component TENDRILS:
• weakly connected to all of the above, but cannot reach SCC or be reached from SCC in directed fashion (e.g. pointed to by IN)
– SCC+IN+OUT+TENDRILS form weakly connected component (WCC) – everything else is called DISC (disconnected from the above) – here is a visualization of this structure
Size of the Five • • • • • • •
SCC: ~56M pages, ~28% IN: ~43M pages, ~ 21% OUT: ~43M pages, ~21% TENDRILS: ~44M pages, ~22% DISC: ~17M pages, ~8% WCC > 91% of the web --- the giant component One interpretation of the pieces: – SCC: the heart of the web – IN: newer sites not yet discovered and linked to – OUT: “insular” pages like corporate web sites
Diameter Measurements • Directed worst-case diameter of the SCC: – at least 28
• Directed worst-case diameter of IN SCC OUT: – at least 503
• Over 75% of the time, there is no directed path between a random start and finish page in the WCC – when there is a directed path, average length is 16
• Average undirected distance in the WCC is 7 • Moral: – web is a “small world” when we ignore direction – otherwise the picture is more complex
Degree Distributions • They are, of course, heavy-tailed • Power law distribution of component size – consistent with the Erdos-Renyi model
• Undirected connectivity of web not reliant on “connectors” – what happens as we remove high-degree vertices?
Important Vertices • So far: have emphasized macroscopic aspects of network structure – – – –
degree distribution: all degrees diameter: average over all vertex pairs connectivity: giant component with most vertices clustering: average over all vertices, compare to overall edge density
– – – –
high degree link between communities betweeness centrality Google’s PageRank algorithm
• Also interesting to identify “important” individuals within the network • Some purely structural definitions of importance:
Background on Web Search • Most important sources of information: words in query and on web pages • For instance, on query “mountain biking”: – realize “bike” and “bicycle” are synonymous under context “mountain” – but “bike” and “motorcylce” are not – find documents with these words and their correlates (“Trek”, “trails”)
• Subject of the field of information retreival • But many other “features” or “signals” may be useful for identifying good or useful sites: – font sizes – frequency of exclamation marks
• PageRank idea: use the link structure of the web
Directed Networks • Q: What constitutes an “important” vertex in a directed network? – could just use in-degree; view directed links as referrals
• PageRank answer: an important page is pointed to by lots of other important pages • This, of course, is circular…
B D A
C
E
F
The PageRank Algorithm • • • • •
Suppose p and q are pages where q p Let R(q) be rank of q; let out(q) be out-degree of q Idea: q “distributes” its rank over its out-links Each out-link of q receives R(q)/out(q) So then p’s rank should be:
R( p)
R(q) /out(q)
qPOINTS( p )
B
D A
C
E
F
The PageRank Algorithm • What guarantee do we have that all these equations are consistent? • Idea: turn the equations into an update rule (algorithm) • At each time step, pick some vertex p to update, and perform:
R( p)
R(q) /out(q)
qPOINTS( p )
• Right-hand side R(q) are current/frozen values; R(p) is new rank of p • Initialize all R(p) to 1 • Claim: Under broad conditions, this algorithm will converge: – at some point, updates no longer change any values – have found solution to all the R(p) equations
0.31
0.15
B
C
A
0.38
E
D
0.38
F
0.64
0.54
0.15
G
0.21
0.40
B
C
A
0.44
E
D
0.44
F
0.70
0.67
Summary • PageRank defines importance in a circular or self-referential fashion • Circularity is broken with a simple algorithm that provably converges • PageRank defines R(p) globally; more subtle than local properties of p