graphframes package¶

Subpackages¶

Contents¶

class graphframes.GraphFrame(v, e)[source]¶

Represents a graph with vertices and edges stored as DataFrames.

Parameters

v – DataFrame holding vertex information. Must contain a column named “id” that stores unique vertex IDs.
e – DataFrame holding edge information. Must contain two columns “src” and “dst” storing source vertex IDs and destination vertex IDs of edges, respectively.

>>> localVertices = [(1,"A"), (2,"B"), (3, "C")]
>>> localEdges = [(1,2,"love"), (2,1,"hate"), (2,3,"follow")]
>>> v = spark.createDataFrame(localVertices, ["id", "name"])
>>> e = spark.createDataFrame(localEdges, ["src", "dst", "action"])
>>> g = GraphFrame(v, e)

aggregateMessages(aggCol, sendToSrc=None, sendToDst=None)[source]¶

Aggregates messages from the neighbours.

When specifying the messages and aggregation function, the user may reference columns using the static methods in graphframes.lib.AggregateMessages.

See Scala documentation for more details.

Parameters

aggCol – the requested aggregation output either as pyspark.sql.Column or SQL expression string
sendToSrc – message sent to the source vertex of each triplet either as pyspark.sql.Column or SQL expression string (default: None)
sendToDst – message sent to the destination vertex of each triplet either as pyspark.sql.Column or SQL expression string (default: None)

Returns

DataFrame with columns for the vertex ID and the resulting aggregated message

bfs(fromExpr, toExpr, edgeFilter=None, maxPathLength=10)[source]¶

Breadth-first search (BFS).

See Scala documentation for more details.

Returns: DataFrame with one Row for each shortest path between matching vertices.

cache()[source]¶: Persist the dataframe representation of vertices and edges of the graph with the default storage level.

connectedComponents(algorithm='graphframes', checkpointInterval=2, broadcastThreshold=1000000)[source]¶

Computes the connected components of the graph.

See Scala documentation for more details.

Parameters

algorithm – connected components algorithm to use (default: “graphframes”) Supported algorithms are “graphframes” and “graphx”.
checkpointInterval – checkpoint interval in terms of number of iterations (default: 2)
broadcastThreshold – broadcast threshold in propagating component assignments (default: 1000000)

Returns

DataFrame with new vertices column “component”

property degrees¶

The degree of each vertex in the graph, returned as a DataFrame with two columns:

“id”: the ID of the vertex
‘degree’ (integer) the degree of the vertex

Note that vertices with 0 edges are not returned in the result.

Returns: DataFrame with new vertices column “degree”

dropIsolatedVertices()[source]¶

Drops isolated vertices, vertices are not contained in any edges.

Returns: GraphFrame with filtered vertices.

property edges¶: DataFrame holding edge information, with unique columns “src” and “dst” storing source vertex IDs and destination vertex IDs of edges, respectively.

filterEdges(condition)[source]¶

Filters the edges based on expression, keep all vertices.

Parameters: condition – String or Column describing the condition expression for filtering.
Returns: GraphFrame with filtered edges.

filterVertices(condition)[source]¶

Filters the vertices based on expression, remove edges containing any dropped vertices.

Parameters: condition – String or Column describing the condition expression for filtering.
Returns: GraphFrame with filtered vertices and edges.

find(pattern)[source]¶

Motif finding.

See Scala documentation for more details.

Parameters: pattern – String describing the motif to search for.
Returns: DataFrame with one Row for each instance of the motif found

property inDegrees¶

The in-degree of each vertex in the graph, returned as a DataFame with two columns:

“id”: the ID of the vertex
“inDegree” (int) storing the in-degree of the vertex

Note that vertices with 0 in-edges are not returned in the result.

Returns: DataFrame with new vertices column “inDegree”

labelPropagation(maxIter)[source]¶

Runs static label propagation for detecting communities in networks.

See Scala documentation for more details.

Parameters: maxIter – the number of iterations to be performed
Returns: DataFrame with new vertices column “label”

property outDegrees¶

The out-degree of each vertex in the graph, returned as a DataFrame with two columns:

“id”: the ID of the vertex
“outDegree” (integer) storing the out-degree of the vertex

Note that vertices with 0 out-edges are not returned in the result.

Returns: DataFrame with new vertices column “outDegree”

pageRank(resetProbability=0.15, sourceId=None, maxIter=None, tol=None)[source]¶

Runs the PageRank algorithm on the graph. Note: Exactly one of fixed_num_iter or tolerance must be set.

See Scala documentation for more details.

Parameters

resetProbability – Probability of resetting to a random vertex.
sourceId – (optional) the source vertex for a personalized PageRank.
maxIter – If set, the algorithm is run for a fixed number of iterations. This may not be set if the tol parameter is set.
tol – If set, the algorithm is run until the given tolerance. This may not be set if the numIter parameter is set.

Returns

GraphFrame with new vertices column “pagerank” and new edges column “weight”

parallelPersonalizedPageRank(resetProbability=0.15, sourceIds=None, maxIter=None)[source]¶

Run the personalized PageRank algorithm on the graph, from the provided list of sources in parallel for a fixed number of iterations.

See Scala documentation for more details.

Parameters

resetProbability – Probability of resetting to a random vertex
sourceIds – the source vertices for a personalized PageRank
maxIter – the fixed number of iterations this algorithm runs

Returns

GraphFrame with new vertices column “pageranks” and new edges column “weight”

persist(storageLevel=StorageLevel(False, True, False, False, 1))[source]¶: Persist the dataframe representation of vertices and edges of the graph with the given storage level.

property pregel¶

Get the graphframes.lib.Pregel object for running pregel.

See graphframes.lib.Pregel for more details.

shortestPaths(landmarks)[source]¶

Runs the shortest path algorithm from a set of landmark vertices in the graph.

See Scala documentation for more details.

Parameters: landmarks – a set of one or more landmarks
Returns: DataFrame with new vertices column “distances”

stronglyConnectedComponents(maxIter)[source]¶

Runs the strongly connected components algorithm on this graph.

See Scala documentation for more details.

Parameters: maxIter – the number of iterations to run
Returns: DataFrame with new vertex column “component”

svdPlusPlus(rank=10, maxIter=2, minValue=0.0, maxValue=5.0, gamma1=0.007, gamma2=0.007, gamma6=0.005, gamma7=0.015)[source]¶

Runs the SVD++ algorithm.

See Scala documentation for more details.

Returns: Tuple of DataFrame with new vertex columns storing learned model, and loss value

triangleCount()[source]¶

Counts the number of triangles passing through each vertex in this graph.

See Scala documentation for more details.

Returns: DataFrame with new vertex column “count”

property triplets¶

The triplets (source vertex)-[edge]->(destination vertex) for all edges in the graph.

Returned as a DataFrame with three columns:

“src”: source vertex with schema matching ‘vertices’
“edge”: edge with schema matching ‘edges’
‘dst’: destination vertex with schema matching ‘vertices’

Returns: DataFrame with columns ‘src’, ‘edge’, and ‘dst’

unpersist(blocking=False)[source]¶: Mark the dataframe representation of vertices and edges of the graph as non-persistent, and remove all blocks for it from memory and disk.

property vertices¶: DataFrame holding vertex information, with unique column “id” for vertex IDs.

graphframes package¶

Subpackages¶

Contents¶

Table of Contents

Previous topic

Next topic

This Page