Preview only show first 10 pages with watermark. For full document please download

The Web Graph And Pagerank

   EMBED


Share

Transcript

The Web Graph and PageRank Networked Life Prof. Michael Kearns Lecture Roadmap • • • • Viewing the Web as a directed graph or network Structure of the Web Measures of vertex importance in a network Google’s PageRank algorithm Directed Networks • The web graph is a directed network: – – – – page A can point/link to page B, without a reciprocal link represent directed edges with arrows each web page/site thus has “in-links” and “out-links” in-degree = # of in-links; out-degree = # of out-links B D A C E F The Web as Network • What is the structure of this network? – connected components – degree distributions – etc. • What does it say about the people building and using it? – page and link generation – visitation statistics • What are the algorithmic consequences? – web search – community identification Graph Structure in the Web [Broder et al. paper] • Report on the results of two massive “web crawls” • Executed by AltaVista in May and October 1999 • Details of the crawls: – – – – automated script following hyperlinks (URLs) from pages found large set of starting points collected over time crawl implemented as breadth-first search have to deal with webspam, infinite paths, timeouts, duplicates, etc. • May ’99 crawl: – 200 million pages, 1.5 billion links • Oct ’99 crawl: – 271 million pages, 2.1 billion links • Unaudited, self-reported Sep ’03 stats: – 3 major search engines claim > 3 billion pages indexed Five Easy Pieces • Authors did two kinds of breadth-first search: – ignoring link direction  weak connectivity – only following forward links  strong connectivity • They then identify five different regions of the web: – strongly connected component (SCC): • can reach any page in SCC from any other in directed fashion – component IN: • can reach any page in SCC in directed fashion, but not reverse – component OUT: • can be reached from any page in SCC, but not reverse – component TENDRILS: • weakly connected to all of the above, but cannot reach SCC or be reached from SCC in directed fashion (e.g. pointed to by IN) – SCC+IN+OUT+TENDRILS form weakly connected component (WCC) – everything else is called DISC (disconnected from the above) – here is a visualization of this structure Size of the Five • • • • • • • SCC: ~56M pages, ~28% IN: ~43M pages, ~ 21% OUT: ~43M pages, ~21% TENDRILS: ~44M pages, ~22% DISC: ~17M pages, ~8% WCC > 91% of the web --- the giant component One interpretation of the pieces: – SCC: the heart of the web – IN: newer sites not yet discovered and linked to – OUT: “insular” pages like corporate web sites Diameter Measurements • Directed worst-case diameter of the SCC: – at least 28 • Directed worst-case diameter of IN  SCC  OUT: – at least 503 • Over 75% of the time, there is no directed path between a random start and finish page in the WCC – when there is a directed path, average length is 16 • Average undirected distance in the WCC is 7 • Moral: – web is a “small world” when we ignore direction – otherwise the picture is more complex Degree Distributions • They are, of course, heavy-tailed • Power law distribution of component size – consistent with the Erdos-Renyi model • Undirected connectivity of web not reliant on “connectors” – what happens as we remove high-degree vertices? Important Vertices • So far: have emphasized macroscopic aspects of network structure – – – – degree distribution: all degrees diameter: average over all vertex pairs connectivity: giant component with most vertices clustering: average over all vertices, compare to overall edge density – – – – high degree link between communities betweeness centrality Google’s PageRank algorithm • Also interesting to identify “important” individuals within the network • Some purely structural definitions of importance: Background on Web Search • Most important sources of information: words in query and on web pages • For instance, on query “mountain biking”: – realize “bike” and “bicycle” are synonymous under context “mountain” – but “bike” and “motorcylce” are not – find documents with these words and their correlates (“Trek”, “trails”) • Subject of the field of information retreival • But many other “features” or “signals” may be useful for identifying good or useful sites: – font sizes – frequency of exclamation marks • PageRank idea: use the link structure of the web Directed Networks • Q: What constitutes an “important” vertex in a directed network? – could just use in-degree; view directed links as referrals • PageRank answer: an important page is pointed to by lots of other important pages • This, of course, is circular… B D A C E F The PageRank Algorithm • • • • • Suppose p and q are pages where q  p Let R(q) be rank of q; let out(q) be out-degree of q Idea: q “distributes” its rank over its out-links Each out-link of q receives R(q)/out(q) So then p’s rank should be: R( p)   R(q) /out(q) qPOINTS( p ) B  D A C E F The PageRank Algorithm • What guarantee do we have that all these equations are consistent? • Idea: turn the equations into an update rule (algorithm) • At each time step, pick some vertex p to update, and perform: R( p)  R(q) /out(q) qPOINTS( p ) • Right-hand side R(q) are current/frozen values; R(p) is new rank of p • Initialize all R(p) to 1 • Claim: Under broad conditions, this algorithm will converge: – at some point, updates no longer change any values – have found solution to all the R(p) equations  0.31 0.15 B C A 0.38 E D 0.38 F 0.64 0.54 0.15 G 0.21 0.40 B C A 0.44 E D 0.44 F 0.70 0.67 Summary • PageRank defines importance in a circular or self-referential fashion • Circularity is broken with a simple algorithm that provably converges • PageRank defines R(p) globally; more subtle than local properties of p