Archive for the ‘PageRank’ Category

Congressional PageRank… [How To Avoid Bribery Charges]

Saturday, October 17th, 2015

Congressional PageRank – Analyzing US Congress With Neo4j and Apache Spark by William Lyon.

From the post:

As we saw previously, legis-graph is an open source software project that imports US Congressional data from Govtrack into the Neo4j graph database. This post shows how we can apply graph analytics to US Congressional data to find influential legislators in Congress. Using the Mazerunner open source graph analytics project we are able to use Apache Spark GraphX alongside Neo4j to run the PageRank algorithm on a collaboration graph of US Congress.

While Neo4j is a powerful graph database that allows for efficient OLTP queries and graph traversals using the Cypher query language, it is not optimized for global graph algorithms, such as PageRank. Apache Spark is a distributed in-memory large-scale data processing engine with a graph processing framework called GraphX. GraphX with Apache Spark is very efficient at performing global graph operations, like the PageRank algorithm. By using Spark alongside Neo4j we can enhance our analysis of US Congress using legis-graph.

Excellent walk-through to get you started on analyzing influence in congress, with modern data analysis tools. Getting a good grip on all these tools with be valuable.

Political scientists, among others, have studied the question of influence in Congress for decades so if you don’t want to repeat the results of others, being by consulting the American Political Science Review for prior work in this area.

An article that reports counter-intuitive results is: The Influence of Campaign Contributions on the Legislative Process by Lynda W. Powell.

From the introduction:

Do campaign donors gain disproportionate influence in the legislative process? Perhaps surprisingly, political scientists have struggled to answer this question. Much of the research has not identified an effect of contributions on policy; some political scientists have concluded that money does not matter; and this bottom line has been picked up by reporters and public intellectuals.1 It is essential to answer this question correctly because the result is of great normative importance in a democracy.

It is important to understand why so many studies find no causal link between contributions and policy outcomes. (emphasis added)

Linda cites much of the existing work on the influence of donations on process so her work makes a great starting point for further research.

As far as the lack of a “casual link between contributions and policy outcomes,” I think the answer is far simpler than Linda suspects.

The existence of a quid-pro-quo, the exchange of value for a vote on a particular bill, is the essence of the crime of public bribery. For the details (in the United States), see: 18 U.S. Code § 201 – Bribery of public officials and witnesses

What isn’t public bribery is to donate funds to an office holder on a regular basis, unrelated to any particular vote or act on the part of that official. Think of it as bribery on an installment plan.

When U.S. officials, such as former Secretary of State Hillary Clinton complain of corruption in other governments, they are criticizing quid-pro-quo bribery and not installment plan bribery as it is practiced in the United States.

Regular contributions gains ready access to legislators and, not surprisingly, more votes will go in your favor than random chance would allow.

Regular contributions are more expensive than direct bribes but avoiding the “causal link” is essential for all involved.

On Lemmings and PageRank

Tuesday, March 17th, 2015

Solving Open Source Discovery by Andrew Nesbitt.

From the post:

Today I’m launching, a project that I’ve been working on for the past couple of months.

The intention is to help developers find new open source libraries, modules and frameworks and keep track of ones they depend upon.

The world of open source software depends on a lot of open source libraries. We are standing on the shoulders of giants, which helps us to reach further than we could otherwise.

The problem with platforms like Rubygems and NPM is there are so many libraries, with hundreds of new ones added every day. Trying to find the right library can be overwhelming.

How do you find libraries that help you solve problems? How do you then know which of those libraries are worth using?

Andrew substitutes dependencies for links in a page rank algorithm and then:

Within I’ve aggregated over 700,000 projects, written in 130 languages from across 22 package managers, including dependencies, releases, license information and source code repository infomation. This results in a rich index of almost every open source library available for use today.

Follow me on Twitter at @teabass and @librariesio for updates. Discussion on Hacker News:

Is going to be useful? Yes!

Is a fun way to explore projects? Yes!

Is a great alternative to current source search options? Yes!

Is the solution to open source discovery? Less clear.

I say that because PageRank, whether using hyperlinks or dependencies, results in a lemming view of the world in question.

Wikipedia reports this is an image of a lemming:


I, on the other hand, bear a passing resemblance to this image:


I offer those images as evidence that I am not a lemming! 😉

The opinions and usages of others can be of interest, but I follow work and people of interest to me, not because they are of interest to others. Otherwise I would be following Lady Gaga on Twitter, for example. To save you the trouble of downloading her forty-five million (45M) followers, I hereby attest that I am not one of them.

Make no mistake, Andrew’s work should be used, followed, supported, improved, but as another view of an important data set, not a solution.

I first saw this in a tweet by Arfon Smith.

Are EigenVectors Dangerous?

Tuesday, August 13th, 2013

neo4j: Extracting a subgraph as an adjacency matrix and calculating eigenvector centrality with JBLAS by Mark Needham.

Mark continues his exploration of Eigenvector centrality by adding Eigenvector centrality values back to the graph from which it was developed.

Putting the Eigenvector centrality measure results back into Neo4j make they easier to query.

What troubles me is that Eigenvector centrality values are based only upon the recorded information we have for the graph.

There is no allowance for missing relationships or any validation of the Eigenvector centrality values found.

Recalling Paul Revere was a “terrorist” in his day, the NSA uses algorithms to declare nodes “important,” lack of access to courts for detainees, and Eigenvector centrality values start to look dangerous.

How would you validate Eigenvector centrality values? Not mathematically but against known values or facts outside of your graph.

How Important is Your Node in the Social Graph?

Tuesday, August 13th, 2013

Java/JBLAS: Calculating eigenvector centrality of an adjacency matrix by Mark Needham.

OK, Mark’s title is more accurate but mine is more likely to get you to look beyond the headline. 😉

From the post:

I recently came across a very interesting post by Kieran Healy where he runs through a bunch of graph algorithms to see whether he can detect the most influential people behind the American Revolution based on their membership of various organisations.

The first algorithm he looked at was betweenness centrality which I’ve looked at previously and is used to determine the load and importance of a node in a graph.

This algorithm would assign a high score to nodes which have a lot of nodes connected to them even if those nodes aren’t necessarily influential nodes in the graph.

If we want to take the influence of the other nodes into account then we can use an algorithm called eigenvector centrality.

You may remember Kieran Healy’s post from Using Metadata to Find Paul Revere [In a Perfect World], where I pointed out that Kieran was using clean data. No omissions, no variant spellings, no confusion of any sort.

I suspect any sort of analysis would succeed with the proviso that it only gets clean data. Unlikely in an unclean data world.

But that to one side, Mark does a great job of assembling references on eigenvectors and code for processing. Follow all the resources in Mark’s post and you will have a much deeper understanding of this area.

Be sure to take note of the comparison between PageRank and Eigenvector centrality. Results are computational artifacts of choices that are visible when examining the end results.

PS: The Wikipedia link for Centrality cites Opsahl, Tore; Agneessens, Filip; Skvoretz, John (2010). “Node centrality in weighted networks: Generalizing degree and shortest paths“. Social Networks 32 (3): 245. doi:10.1016/j.socnet.2010.03.006 as a good summary. The link for the title leads to a preprint which is freely available.

Analyzing the Enron Data…

Saturday, December 29th, 2012

Analyzing the Enron Data: Frequency Distribution, Page Rank and Document Clustering by Sujit Pal.

From the post:

I’ve been using the Enron Dataset for a couple of projects now, and I figured that it would be interesting to see if I could glean some information out of the data. One can of course simply read the Wikipedia article, but that would be too easy and not as much fun :-).

My focus on this analysis is on the “what” and the “who”, ie, what are the important ideas in this corpus and who are the principal players. For that I did the following:

  • Extracted the words from Lucene’s inverted index into (term, docID, freq) triples. Using this, I construct a frequency distribution of words in the corpus. Looking at the most frequent words gives us an idea of what is being discussed.
  • Extract the email (from, {to, cc, bcc}) pairs from MongoDB. Using this, I piggyback on Scalding’s PageRank implementation to produce a list of emails by page rank. This gives us an idea of the “important” players.
  • Using the triples extracted from Lucene, construct tuples of (docID, termvector), then cluster the documents using KMeans. This gives us an idea of the spread of ideas in the corpus. Originally, the idea was to use Mahout for the clustering, but I ended up using Weka instead.

I also wanted to get more familiar with Scalding beyond the basic stuff I did before, so I used that where I would have used Hadoop previously. The rest of the code is in Scala as usual.

Good practice for discovery of the players and main ideas when the “fiscal cliff” document set “leaks,” as you know it will.

Relationships between players and their self-serving recountings versus the data set will make an interesting topic map.

Faster Ranking As A Goal?

Wednesday, June 13th, 2012

When I read in Quantum Computers Could Help Search Engines Keep Up With the Internet’s Growth:

Most people don’t think twice about how Internet search engines work. You type in a word or phrase, hit enter, and poof — a list of web pages pops up, organized by relevance.

Behind the scenes, a lot of math goes into figuring out exactly what qualifies as most relevant web page for your search. Google, for example, uses a page ranking algorithm that is rumored to be the largest numerical calculation carried out anywhere in the world. With the web constantly expanding, researchers at USC have proposed — and demonstrated the feasibility — of using quantum computers to speed up that process.

“This work is about trying to speed up the way we search on the web,” said Daniel Lidar, corresponding author of a paper on the research that appeared in the journal Physical Review Letters on June 4.

As the Internet continues to grow, the time and resources needed to run the calculation — which is done daily — grow with it, Lidar said.

I thought of my post earlier today about inexact computing and how our semantics are inexact. (On the value of being inexact)

Is it the case that quantum computing is going to help us be more exact more quickly?

I am not sure what the advantage of being wrong more quickly could be? Do you?

The full reference:

Silvano Garnerone, Paolo Zanardi, Daniel Lidar. Adiabatic Quantum Algorithm for Search Engine Ranking. Physical Review Letters, 2012; 108 (23) DOI: 10.1103/PhysRevLett.108.230506

Chance discover of an interesting journal feature:


We propose an adiabatic quantum algorithm for generating a quantum pure state encoding of the PageRank vector, the most widely used tool in ranking the relative importance of internet pages. We present extensive numerical simulations which provide evidence that this algorithm can prepare the quantum PageRank state in a time which, on average, scales polylogarithmically in the number of web pages. We argue that the main topological feature of the underlying web graph allowing for such a scaling is the out-degree distribution. The top-ranked log⁡(n) entries of the quantum PageRank state can then be estimated with a polynomial quantum speed-up. Moreover, the quantum PageRank state can be used in “q-sampling” protocols for testing properties of distributions, which require exponentially fewer measurements than all classical schemes designed for the same task. This can be used to decide whether to run a classical update of the PageRank.

Physics Synopsis:

Although quantum computing has only been demonstrated for small calculations so far, researchers are interested in finding problems where its potentially massive parallelism would pay off if scaled-up versions can be made. In Physical Review Letters, Silvano Garnerone of the Institute for Quantum Computing at the University of Waterloo, Canada, and colleagues simulate the speedup achieved by using a quantum approach to rank websites.

The PageRank method, implemented by Google, assigns each website a score based on how many other sites link to it and what their scores are. Starting with an enormous matrix that represents which sites link to which others, the algorithm evaluates the probability that a steady stream of surfers starting at random sites and following random links will be found at each site. This information helps determine which search results should be listed highest. The PageRank calculation currently requires a time that is roughly proportional to the number of sites. This slowdown with size is not as bad as for many complex problems, but it can still take many days to rank the entire worldwide web.

Garnerone and colleagues propose an approach to page ranking that uses an “adiabatic quantum algorithm,” in which a simple matrix with a known solution is gradually transformed into the real problem, producing the desired solution. They simulated many relatively small networks that had similar link topology to the worldwide web, and found that reconstructing and reading out the most relevant part of the PageRank required a time that grows more slowly than the best classical algorithms available. – Don Monroe

That looks like a really cool feature to me.

Abstract for the initiated. Synopsis for the may be interested.

Are there IR/KD/etc. journals following that model?

Seems like a good way to create “trading zones” where we will become aware of work in other areas.

Web Data Commons

Thursday, March 22nd, 2012

Web Data Commons

From the webpage:

More and more websites have started to embed structured data describing products, people, organizations, places, events into their HTML pages. The Web Data Commons project extracts this data from several billion web pages and provides the extracted data for download. Web Data Commons thus enables you to use the data without needing to crawl the Web yourself.

More and more websites embed structured data describing for instance products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using encoding standards such as Microformats, Microdatas and RDFa. The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-data web corpus that is currently available to the public, and provide the extracted data for download in the form of RDF-quads and (soon) also in the form of CSV-tables for common entity types (e.g. product, organization, location, …).

Web Data Commons thus enables you to use structured data originating from hundreds of million web pages within your applications without needing to crawl the Web yourself.

Pages in the Common Crawl corpora are included based on their PageRank score, thereby making the crawls snapshots of the current popular part of the Web.

This reminds me of the virtual observatory practice in astronomy. Astronomical data is too large to easily transfer and many who need to use the data lack the software or processing power. The solution? Holders of the data make it available via interfaces that deliver a sub-part of the data, processed according to the requester’s needs.

The Web Data Commons is much the same thing as it frees most of us from both crawling the web and/or extracting structured data from it. Or at least giving us the basis for more pointed crawling of the web.

A very welcome development!


Sunday, February 19th, 2012

MoleculaRnetworks: An integrated graph theoretic and data mining tool to explore solvent organization in molecular simulation by Barbara Logan Mooney, L. René Corrales and Aurora E. Clark.


This work discusses scripts for processing molecular simulations data written using the software package R: A Language and Environment for Statistical Computing. These scripts, named moleculaRnetworks, are intended for the geometric and solvent network analysis of aqueous solutes and can be extended to other H-bonded solvents. New algorithms, several of which are based on graph theory, that interrogate the solvent environment about a solute are presented and described. This includes a novel method for identifying the geometric shape adopted by the solvent in the immediate vicinity of the solute and an exploratory approach for describing H-bonding, both based on the PageRank algorithm of Google search fame. The moleculaRnetworks codes include a preprocessor, which distills simulation trajectories into physicochemical data arrays, and an interactive analysis script that enables statistical, trend, and correlation analysis, and other data mining. The goal of these scripts is to increase access to the wealth of structural and dynamical information that can be obtained from molecular simulations. © 2012 Wiley Periodicals, Inc.

Data mining, graph theory, PageRank, something for everyone in this article!

Not to mention innovative use of PageRank with non-WWW data.

MoculaRnetworks code.

Sublinear Time Algorithm for PageRank Computations and Related Applications

Tuesday, February 14th, 2012

Sublinear Time Algorithm for PageRank Computations and Related Applications by Christian Borgs, Michael Brautbar, Jennifer Chayes, Shang-Hua Teng

In a network, identifying all vertices whose PageRank is more than a given threshold value $\Delta$ is a basic problem that has arisen in Web and social network analyses. In this paper, we develop a nearly optimal, sublinear time, randomized algorithm for a close variant of this problem. When given a network \graph, a threshold value $\Delta$, and a positive constant $c>1$, with probability $1-o(1)$, our algorithm will return a subset $S\subseteq V$ with the property that $S$ contains all vertices of PageRank at least $\Delta$ and no vertex with PageRank less than $\Delta/c$. The running time of our algorithm is always $\tilde{O}(\frac{n}{\Delta})$. In addition, our algorithm can be efficiently implemented in various network access models including the Jump and Crawl query model recently studied by \cite{brautbar_kearns10}, making it suitable for dealing with large social and information networks.

As part of our analysis, we show that any algorithm for solving this problem must have expected time complexity of ${\Omega}(\frac{n}{\Delta})$. Thus, our algorithm is optimal up to a logarithmic factor. Our algorithm (for identifying vertices with significant PageRank) applies a multi-scale sampling scheme that uses a fast personalized PageRank estimator as its main subroutine. We develop a new local randomized algorithm for approximating personalized PageRank, which is more robust than the earlier ones developed by Jeh and Widom \cite{JehW03} and by Andersen, Chung, and Lang \cite{AndersenCL06}. Our multi-scale sampling scheme can also be adapted to handle a large class of matrix sampling problems that may have potential applications to online advertising on large social networks (See the appendix).

Pay close attention to the author’s definition of “significant” vertices:

A basic problem in network analysis is to identify the set of its vertices that are “significant.” For example, the significant nodes in the web graph defined by a query could provide the authoritative contents in web search; they could be the critical proteins in a protein interaction network; and they could be the set of people (in a social network) most effective to seed the influence for online advertising. As the networks become larger, we need more efficient algorithms to identify these “significant” nodes.

As far as online advertising, I await the discovery by vendors that “pull” models of advertising pre-qualify potential purchasers. “Push” models spam everyone within reach, with correspondingly low success rates.

For your convenience, the cites that don’t work well as source in the abstract:

brautbar_kearns10 – Local Algorithms for Finding Interesting Individuals in Large Networks by Mickey Brautbar , Michael Kearns.

Jeh and Widom – Scaling personalized web search (ACM), Scaling personalized web search (Stanford, free).

Andersen, Chung, and Lang – Local graph partitioning using PageRank vectors.

Documents as geometric objects: how to rank documents for full-text search

Wednesday, January 25th, 2012

Documents as geometric objects: how to rank documents for full-text search Michael Nielsen on July 7, 2011.

From the post:

When we type a query into a search engine – say “Einstein on relativity” – how does the search engine decide which documents to return? When the document is on the web, part of the answer to that question is provided by the PageRank algorithm, which analyses the link structure of the web to determine the importance of different webpages. But what should we do when the documents aren’t on the web, and there is no link structure? How should we determine which documents most closely match the intent of the query?

In this post I explain the basic ideas of how to rank different documents according to their relevance. The ideas used are very beautiful. They are based on the fearsome-sounding vector space model for documents. Although it sounds fearsome, the vector space model is actually very simple. The key idea is to transform search from a linguistic problem into a geometric problem. Instead of thinking of documents and queries as strings of letters, we adopt a point of view in which both documents and queries are represented as vectors in a vector space. In this point of view, the problem of determining how relevant a document is to a query is just a question of determining how parallel the query vector and the document vector are. The more parallel the vectors, the more relevant the document is.

This geometric way of treating documents turns out to be very powerful. It’s used by most modern web search engines, including (most likely) web search engines such as Google and Bing, as well as search libraries such as Lucene. The ideas can also be used well beyond search, for problems such as document classification, and for finding clusters of related documents. What makes this approach powerful is that it enables us to bring the tools of geometry to bear on the superficially very non-geometric problem of understanding text.

Very much looking forward to future posts in this series. There is no denying the power of “vector space model” but that leaves unasked what is lost in the transition from linguistic to geometric space?

Wiki PageRank with Hadoop

Saturday, October 8th, 2011

Wiki PageRank with Hadoop

From the post:

In this tutorial we are going to create a PageRanking for Wikipedia with the use of Hadoop. This was a good hands-on excercise to get started with Hadoop. The page ranking is not a new thing, but a suitable usecase and way cooler than a word counter! The Wikipedia (en) has 3.7M articles at the moment and is still growing. Each article has many links to other articles. With those incomming and outgoing links we can determine which page is more important than others, which basically is what PageRanking does.

Excellent tutorial! Non-trivial data set and gets your hands wet with Hadoop, one of the rising stars in data processing. What’s not to like?

Question: What other processing looks interesting for the Wiki pages?

The running time on some jobs would be short enough to plan a job at the start of class, from live suggestions, then run the job during the presentation/lecture, present the results/post-mortem of mistakes after the break.

Now that would make an interesting class. Suggestions?

Ergodic Control and Polyhedral approaches to PageRank Optimization

Monday, September 26th, 2011

Ergodic Control and Polyhedral approaches to PageRank Optimization by Olivier Fercoq, Marianne Akian, Mustapha Bouhtou, Stéphane Gaubert (Submitted on 10 Nov 2010 (v1), last revised 19 Sep 2011 (this version, v2))


We study a general class of PageRank optimization problems which consist in finding an optimal outlink strategy for a web site subject to design constraints. We consider both a continuous problem, in which one can choose the intensity of a link, and a discrete one, in which in each page, there are obligatory links, facultative links and forbidden links. We show that the continuous problem, as well as its discrete variant when there are no constraints coupling different pages, can both be modeled by constrained Markov decision processes with ergodic reward, in which the webmaster determines the transition probabilities of websurfers. Although the number of actions turns out to be exponential, we show that an associated polytope of transition measures has a concise representation, from which we deduce that the continuous problem is solvable in polynomial time, and that the same is true for the discrete problem when there are no coupling constraints. We also provide efficient algorithms, adapted to very large networks. Then, we investigate the qualitative features of optimal outlink strategies, and identify in particular assumptions under which there exists a “master” page to which all controlled pages should point. We report numerical results on fragments of the real web graph.

I mention this research to raise several questions:

  1. Does PageRank have a role to play in presentation for topic map systems?
  2. Should PageRank results in topic map systems be used assign subject identifications?
  3. If your answer to #2 is yes, what sort of subjects and how would you design the user choices leading to them?
  4. Are you monitoring user navigations of your topic maps?
  5. Has user navigation of your topic maps affected their revision or design of following maps?
  6. Are the navigations in #5 the same as choices based on search results? (In theory or practice.)
  7. Is there an optimal strategy for linking nodes in a topic map?