PageRank algorithm implementation. There are two implementations of PageRank.
The first implementation uses the standalone GraphFrame interface and runs PageRank
for a fixed number of iterations. This can be run by setting maxIter.
var PR = Array.fill(n)( 1.0 )
val oldPR = Array.fill(n)( 1.0 )
for( iter <-0 until maxIter ) {
swap(oldPR, PR)
for( i <-0 until n ) {
PR[i] = alpha + (1 - alpha) * inNbrs[i].map(j => oldPR[j] / outDeg[j]).sum
}
}
The second implementation uses the org.apache.spark.graphx.Pregel interface and runs PageRank
until convergence. This can be run by setting tol.
var PR = Array.fill(n)( 1.0 )
val oldPR = Array.fill(n)( 0.0 )
while( max(abs(PR - oldPr)) > tol ) {
swap(oldPR, PR)
for( i <-0 until n if abs(PR[i] - oldPR[i]) > tol ) {
PR[i] = alpha + (1 - \alpha) * inNbrs[i].map(j => oldPR[j] / outDeg[j]).sum
}
}
alpha is the random reset probability (typically 0.15), inNbrs[i] is the set of
neighbors which link to i and outDeg[j] is the out degree of vertex j.
Note that this is not the "normalized" PageRank and as a consequence pages that have no
inlinks will have a PageRank of alpha. In particular, the pageranks may have some values
greater than 1.
The resulting vertices DataFrame contains one additional column:
pagerank (DoubleType): the pagerank of this vertex
The resulting edges DataFrame contains one additional column:
weight (DoubleType): the normalized weight of this edge after running PageRank
PageRank algorithm implementation. There are two implementations of PageRank.
The first implementation uses the standalone GraphFrame interface and runs PageRank for a fixed number of iterations. This can be run by setting
maxIter
.The second implementation uses the
org.apache.spark.graphx.Pregel
interface and runs PageRank until convergence. This can be run by settingtol
.alpha
is the random reset probability (typically 0.15),inNbrs[i]
is the set of neighbors which link toi
andoutDeg[j]
is the out degree of vertexj
.Note that this is not the "normalized" PageRank and as a consequence pages that have no inlinks will have a PageRank of alpha. In particular, the pageranks may have some values greater than 1.
The resulting vertices DataFrame contains one additional column:
DoubleType
): the pagerank of this vertexThe resulting edges DataFrame contains one additional column:
DoubleType
): the normalized weight of this edge after running PageRank