This is my comprehensive viva report, version 3, prepared while doing research work under Prof. Dip Banerjee and Prof. Kishore Kothapalli.
Exploring optimizations for dynamic pagerank algorithm based on CUDA : V3
Exploring optimizations for dynamic pagerank algorithm based on GPU
Subhajit Sahu
Advisor: Kishore Kothapalli
Center for Security, Theory, and Algorithmic Research (CSTAR)
International Institute of Information Technology, Hyderabad (IIITH)
Gachibowli, Hyderabad, India - 500 032
subhajit.sahu@research.iiit.ac.in
1. Introduction
A graph is a generic data structure, and a superset of lists and trees. Binary search on a
sorted list can be interpreted as a balanced binary tree search. Database tables can be
thought of as indexed lists, and table joins represent relations between columns; these can
be modeled as graphs instead. Assignment of registers to variables (by a compiler), and
assignment of available channels to radio transmitters, are also graph problems. Finding the
shortest path between two points, and sorting web pages in order of importance, are graph
problems as well. Neural networks are graphs too. Interactions between messenger
molecules in the body, and interactions between people on social media, are also modeled
as graphs.
The web has a bowtie structure on many levels. There is usually one giant strongly
connected component, with several pages pointing into this component, several pages
pointed to by the component, and a number of disconnected pages. This structure is seen as
a fractal on many different levels. [1]
Static graphs are those which do not change with time. Static graph algorithms are
techniques used to solve problems on such graphs (developed since the 1940s). To solve
larger and larger problems, a number of optimizations (both algorithmic and
hardware/software techniques) have been developed to take advantage of vector processors
(like the Cray machines), multicores, and GPUs. A lot of research had to be done in order to
find ways to enhance concurrency, especially due to the lack of single-core performance
improvements. The techniques include a number of concurrency models, locking
techniques, transactions, etc.
Graphs whose relations vary with time are called temporal graphs. As you might guess,
many problems use temporal graphs. A temporal graph can be thought of as a series of
static graphs at different points in time. In order to solve problems on temporal graphs,
people would normally take the graph at a certain point in time, and run the necessary static
graph algorithm on it. This works, but as the size of the temporal graph grows, the repeated
computation becomes increasingly slow. It is possible to take advantage of previous results
in order to compute the result for the next time point. Such algorithms are called dynamic
graph algorithms. This is an ongoing area of research, which includes new algorithms and
hardware/software optimization techniques for distributed systems, multicores (shared
memory), GPUs, and even FPGAs. Optimization of algorithms can focus on space
complexity (memory usage), time complexity (query time), preprocessing time, and even
accuracy of the result.
While dynamic algorithms focus on optimizing the algorithm's computation time, dynamic
graph data structures focus on improving graph update time and memory usage.
Dense graphs are usually represented by an adjacency matrix (a bit matrix). Sparse graphs
can be represented with variations of adjacency lists (like CSR), or with edge lists. Sparse
graphs can also be thought of as sparse matrices, and the edges of a vertex as a bitset. In
fact, a number of graph algorithms can be modeled as linear algebra operations (see the
nvGraph and cuGraph frameworks). A number of dynamic graph data structures have also
been developed to improve update speed (like PMA), or to enable concurrent updates and
computation (like Aspen's compressed functional trees). [2]
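As a concrete illustration of the CSR (Compressed Sparse Row) format mentioned above, here is a minimal sketch; the struct and function names are ours, not from any of the frameworks cited:

```cpp
#include <cstdint>
#include <vector>

// Minimal CSR representation of a directed graph. The slice
// edges[offsets[v] .. offsets[v+1]) holds vertex v's out-neighbours.
struct CSR {
  std::vector<uint32_t> offsets;  // size |V|+1
  std::vector<uint32_t> edges;    // size |E|, flat neighbour ids
};

// Build CSR from an adjacency-list view of the graph.
CSR buildCSR(const std::vector<std::vector<uint32_t>>& adj) {
  CSR g;
  g.offsets.push_back(0);
  for (const auto& nbrs : adj) {
    g.edges.insert(g.edges.end(), nbrs.begin(), nbrs.end());
    g.offsets.push_back((uint32_t) g.edges.size());
  }
  return g;
}

// Out-degree of vertex v is a simple offset difference.
uint32_t degree(const CSR& g, uint32_t v) {
  return g.offsets[v + 1] - g.offsets[v];
}
```

Because neighbours sit contiguously in memory, iterating edges of a vertex is cache-friendly, which is one reason static graph frameworks favour CSR.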
Streaming / dynamic / time-evolving graph data structures maintain only the latest graph
information. Historical graphs, on the other hand, keep track of all previous states of the
graph. Changes to a graph can be thought of as edge insertions and deletions, which are
usually done in batches. Except for functional techniques, updating a graph usually involves
modifying a shared structure using some kind of fine-grained synchronization. It is also
possible to store additional information along with vertices/edges, though this is usually not
the focus of research (it is for graph databases). In the last decade or so, a number of graph
streaming frameworks have been developed, each with a certain focus area, and targeting a
certain platform (distributed system / multiprocessor / GPU / FPGA / ASIC). Such
frameworks focus on designing an improved dynamic graph data structure, and define a
fundamental model of computation. For GPUs, the following frameworks exist: cuSTINGER,
aimGraph, faimGraph, Hornet, EvoGraph, and GPMA. [2]
2. PageRank algorithm
The PageRank algorithm is a technique used to sort web pages (or the vertices of a graph)
by importance. It is popularly known as the algorithm published by the founders of Google.
Other link analysis algorithms include HITS, TrustRank, and HummingBird. Such algorithms
are also used for word sense disambiguation in lexical semantics, ranking streets by traffic,
measuring the impact of communities on the web, providing recommendations, analyzing
neural/protein networks, determining species essential for the health of the environment, or
even quantifying the scientific impact of researchers. [3]
In order to understand the PageRank algorithm, consider the random (web) surfer model.
Each web page is modeled as a vertex, and each hyperlink as an edge. A surfer (such as
you) initially visits a web page at random, then follows one of the links on the page, leading
to another web page. After following some links, the surfer eventually decides to visit
another web page at random. The probability of the random surfer being on a certain page
is what the PageRank algorithm returns. This probability (or importance) of a web page
depends upon the importance of the web pages pointing to it (a Markov chain). This
definition of PageRank is recursive, and takes the form of an eigenvalue problem. Solving
for PageRank thus requires multiple iterations of computation, which is known as the
power-iteration method; each iteration is essentially a (sparse) matrix-vector multiplication.
A damping factor (of 0.85) is used to counter the effect of spider traps (like self-loops),
which could otherwise suck up all the importance. Dead ends (web pages with no out-links),
which would otherwise leak out importance, are countered by effectively linking them to all
vertices of the graph (making the Markov matrix column stochastic). [4]
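To make the power-iteration method concrete, here is a minimal pull-based sketch in C++. It follows the description above (damping, teleport, dead-end handling) but is an illustrative simplification, not the code used in the experiments:

```cpp
#include <cmath>
#include <vector>

// Pull-based PageRank sketch: each vertex pulls contributions from its
// in-neighbours. inAdj[v] lists in-neighbours of v; outDeg[u] is u's out-degree.
// Rank held by dead-ends (out-degree 0) is spread uniformly over all vertices,
// which keeps the rank vector summing to 1.
std::vector<double> pagerank(const std::vector<std::vector<int>>& inAdj,
                             const std::vector<int>& outDeg,
                             double damping = 0.85, double tolerance = 1e-10,
                             int maxIterations = 100) {
  int N = (int) inAdj.size();
  std::vector<double> r(N, 1.0 / N), rNew(N);
  for (int iter = 0; iter < maxIterations; ++iter) {
    double dangling = 0;  // total rank sitting on dead-ends
    for (int u = 0; u < N; ++u)
      if (outDeg[u] == 0) dangling += r[u];
    // teleport share + dead-end share, identical for every vertex
    double base = (1 - damping) / N + damping * dangling / N;
    for (int v = 0; v < N; ++v) {
      double c = 0;
      for (int u : inAdj[v]) c += r[u] / outDeg[u];
      rNew[v] = base + damping * c;
    }
    double err = 0;  // L1-norm of the change between iterations
    for (int v = 0; v < N; ++v) err += std::fabs(rNew[v] - r[v]);
    r.swap(rNew);
    if (err < tolerance) break;
  }
  return r;
}
```

On a 3-vertex cycle this converges immediately to the uniform vector (1/3, 1/3, 1/3), as symmetry demands.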
Note that, as originally conceived, the PageRank model does not factor a web browser's
back button into a surfer's hyperlinking possibilities. Teleportation tendencies can also differ
between users: surfers in one class, if teleporting, may be much more likely to jump to
pages about sports, while surfers in another class may be much more likely to jump to
pages pertaining to news and current events. Such differing teleportation tendencies can be
captured in different personalization vectors. However, this makes the once
query-independent, user-independent PageRank user-dependent and more
calculation-laden. Nevertheless, this little personalization vector has had more significant
side effects: Google has used it to control spamming done by the so-called link farms. [1]
PageRank algorithms almost always take the following parameters: damping, tolerance, and
maximum iterations. Here, tolerance defines the acceptable error between the rank vectors
of the previous and current iterations. Though this is usually the L1-norm, the L2 and
L∞-norms are also used sometimes. Both damping and tolerance control the rate of
convergence of the algorithm, and the choice of tolerance function also affects it. However,
adjusting the damping factor can give completely different PageRank values. Since the
ordering of vertices is important, and not the exact values, it can usually be a good idea to
choose a larger tolerance value.
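The three convergence checks differ only in how the per-vertex rank changes are aggregated. A small sketch (function names are ours):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Error between successive rank vectors a (old) and b (new), under the three
// norms commonly used as the PageRank convergence check.
double errorL1(const std::vector<double>& a, const std::vector<double>& b) {
  double e = 0;  // sum of absolute differences
  for (size_t i = 0; i < a.size(); ++i) e += std::fabs(a[i] - b[i]);
  return e;
}
double errorL2(const std::vector<double>& a, const std::vector<double>& b) {
  double e = 0;  // euclidean distance
  for (size_t i = 0; i < a.size(); ++i) e += (a[i] - b[i]) * (a[i] - b[i]);
  return std::sqrt(e);
}
double errorLinf(const std::vector<double>& a, const std::vector<double>& b) {
  double e = 0;  // largest single-vertex change
  for (size_t i = 0; i < a.size(); ++i) e = std::max(e, std::fabs(a[i] - b[i]));
  return e;
}
```

Since L∞ ≤ L2 ≤ L1 for the same difference vector, a given tolerance value is reached earliest with the L∞-norm, which is one way the choice of tolerance function affects the number of iterations.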
3. Optimizing PageRank
Techniques to optimize the PageRank algorithm usually fall in two categories. One is to try
reducing the work per iteration, and the other is to try reducing the number of iterations.
These goals are often at odds with one another. A number of techniques can be used to
compress adjacency lists. The gap technique stores only the difference between neighbour
ids in edge lists. The reference encoding technique uses sets of edges as a reference to
define an edge list (but it is not easy to find the reference vertices). Research has also been
done on compressing the rank vector (which is dense) using smaller custom data types, but
this was found to be not so useful. [1]
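The gap technique can be illustrated as follows; real implementations pair the gaps with variable-length byte codes to realize the space savings, which we omit here:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Gap-encode a neighbour list: sort it, keep the first id as-is, then store
// only the difference to the previous id. Neighbour ids tend to be close
// together, so the gaps are small and compress well under variable-length codes.
std::vector<uint32_t> gapEncode(std::vector<uint32_t> nbrs) {
  std::sort(nbrs.begin(), nbrs.end());
  for (size_t i = nbrs.size(); i-- > 1; ) nbrs[i] -= nbrs[i - 1];
  return nbrs;
}

// Inverse transform: prefix-sum the gaps to recover the sorted ids.
std::vector<uint32_t> gapDecode(std::vector<uint32_t> gaps) {
  for (size_t i = 1; i < gaps.size(); ++i) gaps[i] += gaps[i - 1];
  return gaps;
}
```

For example, the neighbour list {5, 8, 20, 21} becomes the gaps {5, 3, 12, 1}.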
The adaptive PageRank technique "locks" vertices which have converged, and saves
iteration time by skipping their computation. [1] Identical nodes, which have the same
in-links, can be removed to avoid duplicate computations and thus reduce iteration time.
Road networks often have chains which can be short-circuited before PageRank
computation to improve performance; the final ranks of chain nodes can be easily calculated
afterwards. This reduces both the iteration time and the number of iterations. If a graph has
no dangling nodes, the PageRank of each strongly connected component can be computed
in topological order. This helps reduce the iteration time and the number of iterations, and
also enables concurrency in the PageRank computation. The combination of all of the
above methods is the STICD algorithm. [5] A somewhat similar aggregation algorithm is
BlockRank, which computes the PageRank of hosts and the local PageRank of pages within
hosts independently, and aggregates them with weights into the final rank vector; it
produces a speed-up by a factor of 2 on some datasets. The global PageRank solution can
also be found in a computationally efficient manner by computing the subPageRank of each
connected component, then pasting the subPageRanks together to form the global
PageRank, using the method of Avrachenkov et al. These methods exploit the inherent
reducibility in the graph. Bianchini et al. suggest using the Jacobi method to compute the
PageRank vector. [1]
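As an illustration of the "lock converged vertices" idea, here is a simplified sketch (not the actual STIC-D implementation); a vertex is frozen once its rank change falls below a per-vertex threshold, and its contribution gathering is skipped in later iterations:

```cpp
#include <cmath>
#include <vector>

// State for adaptive PageRank: current ranks plus a frozen flag per vertex.
struct AdaptiveState {
  std::vector<double> rank;
  std::vector<bool> frozen;
};

// Apply one iteration's freshly computed ranks, freezing vertices whose change
// is below perVertexTolerance. Frozen vertices keep their old rank and would be
// excluded from the next iteration's work. Returns the count of still-active
// vertices, so the caller can stop when it reaches zero.
int updateAndFreeze(AdaptiveState& s, const std::vector<double>& newRank,
                    double perVertexTolerance) {
  int active = 0;
  for (size_t v = 0; v < s.rank.size(); ++v) {
    if (s.frozen[v]) continue;  // locked: skip entirely
    if (std::fabs(newRank[v] - s.rank[v]) < perVertexTolerance)
      s.frozen[v] = true;       // converged: lock from now on
    else
      ++active;
    s.rank[v] = newRank[v];
  }
  return active;
}
```

Since many vertices converge within the first few iterations, the active set (and thus the per-iteration work) shrinks quickly.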
PageRank is a live algorithm, which means that an ongoing computation can be paused
during a graph update, and simply resumed afterwards (instead of restarting it). The first
updating paper, by Chien et al. (2002), identifies a small portion of the web graph "near" the
link changes, models the rest of the web as a single node in a new, much smaller graph,
computes a PageRank for this small graph, and transfers these results to the much bigger,
original graph. [1]
4. Graph streaming frameworks / databases
STINGER uses an extended form of CSR, with edge lists represented as linked lists of
contiguous blocks. Each edge has 2 timestamps, and fine-grained locking is used per edge.
cuSTINGER extends STINGER for CUDA GPUs and uses contiguous edge lists (CSR)
instead. faimGraph is a GPU framework with fully dynamic vertex and edge updates; it has
an in-GPU memory manager, and uses a paged linked list for edges, similar to STINGER.
Hornet also implements its own memory manager, and uses B+ trees to maintain blocks
efficiently and keep track of empty space. LLAMA uses a variant of CSR with large
multi-versioned arrays; it stores all snapshots of a graph, and persists old snapshots to disk.
GraphIn uses CSR along with edge lists, and updates the CSR once the edge lists grow
large enough. GraphOne is similar, and uses page-aligned memory for high-degree vertices.
GraphTau is based on Apache Spark and uses read-only partitioned collections of data sets,
with a sliding-window model for graph snapshots. Aspen uses the C-tree (a tree of trees),
based on purely functional compressed search trees, to store graph structures; elements are
stored in chunks and compressed using difference encoding. It allows any number of
readers and a single writer, and the framework guarantees strict serializability. Tegra stores
the full history of the graph and relies on recomputing graph algorithms on affected
subgraphs; it also uses a cost model to guess when full recomputation might be better, and
an adaptive radix tree as the core data structure for efficient updates and range scans. [2]
Unlike graph streaming frameworks, graph databases focus on rich attached data, complex
queries, transactional support with ACID properties, data replication, and sharding. A few
graph databases have started to support global analytics as well. However, most graph
databases do not offer dedicated support for incremental changes. Little research exists on
accelerating streaming graph processing using low-cost atomics, hardware transactions,
FPGAs, or high-performance networking hardware. On average, the highest rate of
ingestion is achieved by shared-memory single-node designs. [2]
5. NVIDIA Tesla V100 GPU Architecture
NVIDIA Tesla was a line of products targeted at stream processing / general-purpose
graphics processing units (GPGPUs). In May 2020, NVIDIA retired the Tesla brand because
of potential confusion with the brand of cars. Its new GPUs are branded NVIDIA Data
Center GPUs, as in the Ampere A100 GPU. [6]
The NVIDIA Tesla GV100 (Volta) is a 21.1 billion transistor TSMC 12 nm FinFET chip with a
die size of 815 mm². Here is a short summary of its features:
● 84 SMs, each with 64 independent FP and INT cores.
● Shared memory size configurable up to 96 KB per SM.
● 4 512-bit memory controllers (4096-bit total).
● Up to 6 bidirectional NVLink links, 25 GB/s per direction (for IBM POWER9 CPUs).
● 4 dies per HBM stack, with 4 stacks: 16 GB of HBM2 (Samsung) at 900 GB/s.
● Native/sideband SECDED (1-error correct, 2-error detect) ECC (for HBM, registers, L1, L2).
Each SM has 4 processing blocks (each handling 1 warp of 32 threads). The L1 data cache
is combined with shared memory, at 128 KB per SM (explicit caching is no longer as
necessary). Volta also supports write-caching (not just load-caching, as in previous
architectures). NVLink supports coherency, allowing data read from GPU memory to be
stored in the CPU cache. The Address Translation Service (ATS) allows the GPU to access
CPU page tables directly (e.g. a malloc'd pointer). The new copy engine doesn't need
pinned memory. Volta's per-thread program counter and call stack allow interleaved
execution of warp threads, enabling fine-grained synchronization between threads within a
warp (using __syncwarp()). Cooperative groups enable synchronization between warps,
grid-wide, across multiple GPUs, cross-warp, and sub-warp. [7]
6. Experiments
Adjusting data types for rank vector
Custom fp16 bfloat16 float double
1. Performance of vector element sum using float vs bfloat16 as the storage type.
2. Comparison of PageRank using float vs bfloat16 as the storage type (pull, CSR).
3. Performance of PageRank using 32-bit floats vs 64-bit floats (pull, CSR).
Adjusting CSR format for graph
Regular 32-bit Hybrid 32-bit Hybrid 64-bit
              Hybrid 32-bit            Hybrid 64-bit
single bit    32-bit index
4-bit block   28-bit index (30 eff.)   60-bit index (62 eff.)
8-bit block   24-bit index (27 eff.)   56-bit index (59 eff.)
16-bit block  16-bit index (20 eff.)   48-bit index (52 eff.)
32-bit block                           32-bit index (32 eff.)
1. Comparing space usage of regular vs hybrid CSR (various sizes).
Adjusting Pagerank parameters
Damping Factor adjust dynamic-adjust
Tolerance L1 norm L2 norm L∞ norm
1. Comparing the effect of using different values of damping factor, with PageRank (pull, CSR).
2. Experimenting with improving PageRank by adjusting the damping factor (α) between iterations.
3. Comparing the effect of using different functions for convergence check, with PageRank (...).
4. Comparing the effect of using different values of tolerance, with PageRank (pull, CSR).
Adjusting Sequential approach
Push Pull Class CSR
1. Performance of contribution-push based vs contribution-pull based PageRank.
2. Performance of C++ DiGraph class based vs CSR based PageRank (pull).
Adjusting OpenMP approach
Map Reduce Uniform Hybrid
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Performance of sequential execution based vs OpenMP based vector element sum.
3. Performance of uniform-OpenMP based vs hybrid-OpenMP based PageRank (pull, CSR).
Comparing sequential approach
             OpenMP   nvGraph
Sequential   vs       vs
OpenMP                vs
1. Performance of sequential execution based vs OpenMP based PageRank (pull, CSR).
2. Performance of sequential execution based vs nvGraph based PageRank (pull, CSR).
3. Performance of OpenMP based vs nvGraph based PageRank (pull, CSR).
Adjusting Monolithic (Sequential) optimizations (from STICD)
Split components Skip in-identicals Skip chains Skip converged
1. Performance benefit of PageRank with vertices split by components (pull, CSR).
2. Performance benefit of skipping in-identical vertices for PageRank (pull, CSR).
3. Performance benefit of skipping chain vertices for PageRank (pull, CSR).
4. Performance benefit of skipping converged vertices for PageRank (pull, CSR).
Adjusting Levelwise (STICD) approach
Min. component size Min. compute size Skip teleport calculation
1. Comparing various min. component sizes for topologically-ordered components (levelwise...).
2. Comparing various min. compute sizes for topologically-ordered components (levelwise...).
3. Checking performance benefit of levelwise PageRank when teleport calculation is skipped.
Note: min. component size merges small components even before generating the block-graph /
topological ordering, while min. compute size does it just before the PageRank computation.
Comparing Levelwise (STICD) approach
Monolithic nvGraph
Levelwise (STICD) vs
1. Performance of monolithic vs topologically-ordered components (levelwise) PageRank.
Adjusting ranks for dynamic graphs
update new:       zero fill, 1/N fill
update old, new:  scale, 1/N fill
1. Comparing strategies to update ranks for dynamic PageRank (pull, CSR).
Adjusting Levelwise (STICD) dynamic approach
Skip unaffected components For fixed graphs For temporal graphs
1. Checking for correctness of levelwise PageRank when unchanged components are skipped.
2. Perf. benefit of levelwise PageRank when unchanged components are skipped (fixed).
3. Perf. benefit of levelwise PageRank when unchanged components are skipped (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
Comparing dynamic approach with static
nvGraph dynamic Monolithic dynamic Levelwise dynamic
nvGraph static vs: temporal
Monolithic static vs: fixed, temporal vs: fixed, temporal
Levelwise static vs: fixed vs: fixed, temporal
1. Performance of nvGraph based static vs dynamic PageRank (temporal).
2. Performance of static vs dynamic PageRank (temporal).
3. Performance of static vs dynamic levelwise PageRank (fixed).
4. Performance of levelwise based static vs dynamic PageRank (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
Adjusting Monolithic CUDA approach
Map:          launch
Reduce:       memcpy launch, in-place launch, vs
Thread /V:    launch, sort/p. vertices, sort edges
Block /V:     launch, sort/p. vertices, sort edges
Switched /V:  thread launch, block launch, switch-point
1. Comparing various launch configs for CUDA based vector multiply.
2. Comparing various launch configs for CUDA based vector element sum (memcpy).
3. Comparing various launch configs for CUDA based vector element sum (in-place).
4. Performance of memcpy vs in-place based CUDA based vector element sum.
5. Comparing various launch configs for CUDA thread-per-vertex based PageRank (pull, CSR).
6. Sorting vertices and/or edges by in-degree for CUDA thread-per-vertex based PageRank.
7. Comparing various launch configs for CUDA block-per-vertex based PageRank (pull, CSR).
8. Sorting vertices and/or edges by in-degree for CUDA block-per-vertex based PageRank.
9. Launch configs for CUDA switched-per-vertex based PageRank focusing on thread approach.
10. Launch configs for CUDA switched-per-vertex based PageRank focusing on block approach.
11. Sorting vertices and/or edges by in-degree for CUDA switched-per-vertex based PageRank.
12. Comparing various switch points for CUDA switched-per-vertex based PageRank (pull, ...).
Note: sort/p. vertices ⇒ sorting vertices by ascending or descending order of in-degree, or simply
partitioning (by in-degree). sort edges ⇒ sorting edges by ascending or descending order of id.
Adjusting Monolithic CUDA optimizations (from STICD)
Split components Skip in-identicals Skip chains Skip converged
1. Performance benefit of CUDA based PageRank with vertices split by components.
2. Performance benefit of skipping in-identical vertices for CUDA based PageRank (pull, CSR).
3. Performance benefit of skipping chain vertices for CUDA based PageRank (pull, CSR).
4. Performance benefit of skipping converged vertices for CUDA based PageRank (pull, CSR).
Adjusting Levelwise (STICD) CUDA approach
Min. component size Min. compute size Skip teleport calculation
1. Min. component sizes for topologically-ordered components (levelwise, CUDA) PageRank.
2. Min. compute sizes for topologically-ordered components (levelwise CUDA) PageRank.
Note: min. component size merges small components even before generating the block-graph /
topological ordering, while min. compute size does it just before the PageRank computation.
Comparing Levelwise (STICD) CUDA approach
                 nvGraph   Monolithic CUDA
Monolithic       vs        vs
Monolithic CUDA  vs
Levelwise CUDA   vs        vs
1. Performance of sequential execution based vs CUDA based PageRank (pull, CSR).
2. Performance of nvGraph vs CUDA based PageRank (pull, CSR).
3. Performance of Monolithic CUDA vs Levelwise CUDA PageRank (pull, CSR, ...).
Comparing dynamic CUDA approach with static
nvGraph dynamic Monolithic dynamic Levelwise dynamic
nvGraph static vs: fixed, temporal vs: fixed, temporal vs: fixed, temporal
Monolithic static vs: fixed, temporal vs: fixed, temporal vs: fixed, temporal
Levelwise static vs: fixed, temporal vs: fixed, temporal vs: fixed, temporal
1. Performance of static vs dynamic CUDA based PageRank (fixed).
2. Performance of static vs dynamic CUDA based PageRank (temporal).
3. Performance of CUDA based static vs dynamic levelwise PageRank (fixed).
4. Performance of static vs dynamic CUDA based levelwise PageRank (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
Comparing dynamic optimized CUDA approach with static
nvGraph dynamic Monolithic dynamic Levelwise dynamic
nvGraph static vs: fixed vs: fixed vs: fixed
Monolithic static vs: fixed vs: fixed vs: fixed
Levelwise static vs: fixed vs: fixed vs: fixed
1. Performance of CUDA based optimized dynamic monolithic vs levelwise PageRank (fixed).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
7. Packages
1. CLI for the SNAP dataset collection, which contains more than 50 large networks.
This is for quickly fetching the SNAP datasets that you need, right from the CLI. Currently
there is only one command, clone, where you can provide filters for specifying exactly which
datasets you need, and where to download them. If a dataset already exists, it is skipped. A
summary is shown at the end. You can install this with npm install -g snap-data.sh.
2. CLI for nvGraph, which is a GPU-based graph analytics library written by NVIDIA using
CUDA.
This is for running nvGraph functions right from the CLI, with graphs in MatrixMarket format
(.mtx) directly. It just needs an x86_64 Linux machine with NVIDIA GPU drivers installed.
The execution time, along with the results, can be saved in a JSON/YAML file. The
executable code is written in C++. You can install this with npm install -g nvgraph.sh.
8. Further action
List dynamic graph algorithms
List dynamic graph data structures
List graph processing frameworks
List graph applications
Package graph processing frameworks
9. Bibliography
[1] A. Langville and C. Meyer, “Deeper Inside PageRank,” Internet Math., vol. 1, no. 3, pp.
335–380, Jan. 2004, doi: 10.1080/15427951.2004.10129091.
[2] M. Besta, M. Fischer, V. Kalavri, M. Kapralov, and T. Hoefler, “Practice of Streaming and
Dynamic Graphs: Concepts, Models, Systems, and Parallelism,” CoRR, vol.
abs/1912.12740, 2019.
[3] Contributors to Wikimedia projects, “PageRank,” Wikipedia, Jul. 2021.
https://en.wikipedia.org/wiki/PageRank (accessed Mar. 01, 2021).
[4] J. Leskovec, “PageRank Algorithm, Mining massive Datasets (CS246), Stanford
University,” YouTube, 2019.
[5] P. Garg and K. Kothapalli, “STIC-D: Algorithmic techniques for efficient parallel
pagerank computation on real-world graphs,” in Proceedings of the 17th International
Conference on Distributed Computing and Networking - ICDCN ’16, New York, New
York, USA, Jan. 2016, pp. 1–10, doi: 10.1145/2833312.2833322.
[6] Contributors to Wikimedia projects, “Nvidia Tesla,” Wikipedia, Apr. 2021.
https://en.wikipedia.org/wiki/Nvidia_Tesla (accessed Jun. 01, 2021).
[7] NVIDIA Corporation, “NVIDIA Tesla V100 GPU Architecture Whitepaper,” NVIDIA
Corporation, 2017. Accessed: Jul. 13, 2021. [Online]. Available:
https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.