Miami Data Science Salon (Nov 2018) talk on geometric methods for dimensionality reduction, data visualization, and stock market analysis (India's NSE).
High-Dimensional Data Visualization, Geometry, and Stock Market Crashes
1. Visualizing High-Dimensional Data with Manifold Learning in R
BY COLLEEN M. FARRELLY, DATA SCIENTIST AT GRAHAM HOLDINGS (KAPLAN HIGHER AND PROFESSIONAL EDUCATION)
2. My Path to Data Science
Former MD/PhD student who started doing research/attending workshops in geometry, topology, and machine learning
Switched degree programs into biostatistics with a topology-based slant
Have worked in biotechnology, military, education, and the social sciences
Currently on the business side of running a university, with a lot of financial modeling and risk modeling
3. Mining for Data Relationships
Exploratory analysis
Important step in data science projects
Trend/covariance visualization
Clustering
Powerful combination for understanding many types of problems
Types of data problems
Time series analyses
Predictive analyses
Network analyses
[Figure: Intelligence and Achievement Dendrogram — hclust(*, "complete") on dist(mydata[, 2:4]); a unique subgroup is identified in the tree.]
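The dendrogram caption suggests a standard hierarchical-clustering workflow. A minimal R sketch of that workflow follows; `mydata` and its intelligence/achievement columns (2:4) are synthetic stand-ins, not the dataset shown on the slide.

```r
# Complete-linkage hierarchical clustering on columns 2-4, as in the figure.
set.seed(1)
mydata <- data.frame(id = 1:17,
                     iq = rnorm(17, 100, 15),
                     math = rnorm(17, 50, 10),
                     reading = rnorm(17, 50, 10))
hc <- hclust(dist(mydata[, 2:4]), method = "complete")  # Euclidean distances
plot(hc, main = "Intelligence and Achievement Dendrogram")
groups <- cutree(hc, k = 2)  # cut the tree to pull out candidate subgroups
```

Cutting the tree at different heights (or `k` values) is how a unique subgroup like the one highlighted on the slide would be isolated.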
4. Time Series and Financial Data
Key tasks in time series/financial data analyses:
Forecasting future time points
Identifying drivers of the dynamic process (ex. why are sales rising?)
Identifying tipping points (crashes, spikes…)
Identifying covarying behavior (sectors that behave similarly, stocks that influence each other, daily rising/falling patterns…)
[Figure: Dow Jones Industrial Average time series]
5. Morse-Smale Clustering
Multivariate technique from topology similar to mode clustering
Find peaks and valleys in data by filtering on a defined function:
A watershed on mountains
Dribbling a soccer ball across a field of hills
Separate data based on shared peaks and valleys
Many nice developments on convergence and theoretical properties
R package has nice dimensionality reduction plots to highlight cluster differences with respect to the filter function and predictor sets
6. Dimensionality Reduction and Visualization
Helpful in visualizing multivariate trends and group differences, particularly for multivariate time series data
Assume data lies in a lower-dimensional subspace and map the full dataset to that subspace (right)
Types of methods:
Linear (principal component analysis, or PCA)
Nonlinear (manifold learning)
Local (preserving neighborhood metrics like distance between points)
Global (preserving global characteristics like connectedness and limits)
Manifold learning methods are related to a branch of mathematics called differential geometry
7. Manifold Learning Methods
Three main methods considered in this analysis:
Multidimensional scaling (MDS)
Global method based on distance preservation and matrix decomposition
Distances can be Euclidean, geodesic, Manhattan…
Nice theoretical result relating it to PCA when the best subspace is linear
Locally linear embedding (LLE)
Local method based on a nearest-neighbor graph, weighting, and matrix decomposition
Related to ISOMAP and other methods
t-distributed stochastic neighbor embedding (t-SNE)
Local and global method based on mapping of probability distributions and random walks
Preserves both local and global characteristics of the original data space
Very strong performance on a variety of problems lately
[Figure: Breast Cancer Dataset Comparison]
8. Example Stock Market Dataset
Emerging markets
Important for investors
Future drivers of global trade
Global trends
Daily fluctuations
Tipping points (crashes and opportunities)
This example:
Recent Kaggle dataset of daily National Stock Exchange of India prices from July 2003–February 2018:
https://www.kaggle.com/abhishekyana/nse-listed-1384-companies-data/data
Cleaned (nulls removed, <1%) and daily fluctuation ranges added (7 total time series columns)
3616 days included
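The cleaning step described above can be sketched in a few lines of R. The column names and the small synthetic frame below are assumptions standing in for the actual Kaggle schema:

```r
# Hypothetical cleaning sketch: drop null rows, add a fluctuation-range column.
nse <- data.frame(open  = c(100, 101, NA, 103),
                  high  = c(102, 104, 103, 105),
                  low   = c(99, 100, 101, 102),
                  close = c(101, 103, 102, 104))
nse <- nse[complete.cases(nse), ]            # remove rows with nulls (<1% in the real data)
nse$fluctuation.range <- nse$high - nse$low  # add daily fluctuation range
nrow(nse)                                    # number of retained trading days
```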
11. MDS Plot
R function cmdscale() with 2 components and a Euclidean distance metric
Relationships very linear and well-separated globally
Matches PCA well
Separates into:
1. Daily price
2. Daily fluctuation
[Figure: MDS Results — Dimension 1 vs. Dimension 2]
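A minimal sketch of the cmdscale() step on the slide, with a small synthetic series standing in for the cleaned NSE data frame (column names assumed):

```r
# Classical MDS with 2 components and Euclidean distances, as on the slide.
set.seed(1)
x   <- cumsum(rnorm(200, 1))                 # synthetic upward-trending price
nse <- data.frame(open = x, high = x + abs(rnorm(200)),
                  low = x - abs(rnorm(200)), close = x + rnorm(200))
mds <- cmdscale(dist(nse), k = 2)            # rows = days, cols = components
plot(mds[, 1], mds[, 2], xlab = "Dimension 1", ylab = "Dimension 2",
     main = "MDS Results")
```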
12. LLE Plot
R function lle() with 2 components and 10 nearest neighbors (lle package)
Separation and fit not great
Suggests global behavior more important than local for this time series
[Figure: LLE Results — Dimension 1 vs. Dimension 2]
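The lle() call on the slide can be sketched as follows, assuming the `lle` package is installed; the synthetic matrix stands in for the 7-column NSE series:

```r
# LLE with 2 embedding dimensions and 10 nearest neighbors.
library(lle)
set.seed(1)
x   <- cumsum(rnorm(200, 1))
nse <- cbind(open = x, high = x + abs(rnorm(200)),
             low = x - abs(rnorm(200)), close = x + rnorm(200))
fit <- lle(nse, m = 2, k = 10)               # m = components, k = neighbors
plot(fit$Y[, 1], fit$Y[, 2], xlab = "Dimension 1", ylab = "Dimension 2",
     main = "LLE Results")
```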
13. t-SNE Plot
R package dimRed with function getDimRedData(), perplexity (smoothing) at 80, 2 components, and the tSNE method
Parses out tipping points within the growth period and exact moments of transitional events (see green group)
[Figure: tSNE Results — Dimension 1 vs. Dimension 2]
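One way the dimRed workflow on the slide might look in code; the exact parameter names (`perplexity`, `ndim`) are assumptions based on dimRed's "tSNE" method, which wraps the Rtsne package, and the data are synthetic:

```r
# t-SNE via dimRed: perplexity 80, 2 components.
library(dimRed)
set.seed(1)
x   <- cumsum(rnorm(500, 1))
nse <- cbind(open = x, high = x + abs(rnorm(500)),
             low = x - abs(rnorm(500)), close = x + rnorm(500))
emb    <- embed(nse, "tSNE", perplexity = 80, ndim = 2)
coords <- getDimRedData(emb)@data            # extract embedded coordinates
plot(coords[, 1], coords[, 2], xlab = "Dimension 1", ylab = "Dimension 2",
     main = "tSNE Results")
```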
14. Deep Dive into MDS Components
MDS components separate into prices (component 1) and fluctuation ranges (component 2), summarized in the correlation table
Fluctuation ranges increase as the market gains points (left)

Original Time Series   MDS Component 1   MDS Component 2
open                   1.00               0.003
high                   1.00              -0.007
low                    1.00               0.009
fluctuation.range      0.684             -0.706
close                  1.00              -0.003
day.range              0.514             -0.747
adj_close              1.00              -0.002
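A table like this can be produced by correlating each original series with the two MDS components. A self-contained sketch on synthetic stand-in data (column names assumed):

```r
# Correlate original columns with MDS components 1 and 2.
set.seed(1)
x   <- cumsum(rnorm(200, 1))
nse <- data.frame(open = x, high = x + abs(rnorm(200)),
                  low = x - abs(rnorm(200)), close = x + rnorm(200))
nse$fluctuation.range <- nse$high - nse$low
mds <- cmdscale(dist(nse), k = 2)            # same MDS as the earlier slide
round(cor(nse, mds), 3)                      # rows: series; cols: components
```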
15. Transition Periods Deep Dive
Transition periods overlap with long-term trends
Shorter time-to-transition periods in recent years
16. Results Overview
NSE shows exponential growth during a period of change
New regulations
Oil price drops
Falling inflation
Tipping points of growth
Includes the current period, starting late 2017/early 2018
In late 2017, this analysis anticipated the NSE's tumble of February 2018
A crash was predicted by several economists for sometime in 2018:
https://www.getmoneyrich.com/indian-stock-market-correction-likely-in-2017-2018/
https://www.livemint.com/Money/pXdnLHA2r1FJfwJhFEDqjO/Stock-market-crash-Experts-divided-on-whether-theres-more.html
Fluctuations and volatility
Increasing in the past few years
Prices can vary a lot during the day while opening and closing at similar values
17. Conclusions
Clustering and dimensionality reduction for multivariate data exploration
Helpful for understanding multivariate time series data
Helpful for understanding other types of data prior to analysis
Performs very well, showing behavior deviations before major events
Can provide an understanding of covariance structure (relationships between stocks, volatility within a market…)
18. References
Farrelly, C. M. (2017). Dimensionality reduction ensembles. arXiv preprint arXiv:1710.04484.
Gerber, S., Rübel, O., Bremer, P. T., Pascucci, V., & Whitaker, R. T. (2013). Morse–Smale regression. Journal of Computational and Graphical Statistics, 22(1), 193-214.
Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1), 1-27.
Maaten, L. v. d., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579-2605.
Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323-2326.
Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3), 37-52.
ResearchGate profile with folder for talk (data, R code, PPT):
https://www.researchgate.net/profile/Colleen_Farrelly2