Miami Data Science Salon (Nov 2018) talk on geometric methods for dimensionality reduction, data visualization, and stock market analysis (India's NSE).
High-Dimensional Data Visualization, Geometry, and Stock Market Crashes
1. Visualizing High-Dimensional Data with Manifold Learning in R
BY COLLEEN M. FARRELLY, DATA SCIENTIST AT GRAHAM HOLDINGS (KAPLAN HIGHER AND PROFESSIONAL EDUCATION)
2. My Path to Data Science
Former MD/PhD student who started doing research/attending workshops in geometry, topology, and machine learning
Switched degree programs into biostatistics with a topology-based slant
Have worked in biotechnology, military, education, and the social sciences
Currently on the business side of running a university, with a lot of financial modeling and risk modeling
3. Mining for Data Relationships
Exploratory analysis
Important step in data science projects
Trend/covariance visualization
Clustering
Powerful combination for understanding many types of problems
Types of data problems
Time series analyses
Predictive analyses
Network analyses
[Figure: Intelligence and Achievement Dendrogram — hclust(*, "complete") on dist(mydata[, 2:4]); a unique subgroup is identified in the tree.]
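The dendrogram caption suggests a standard hierarchical-clustering workflow. A minimal R sketch of that workflow follows; `mydata` and its intelligence/achievement columns (2:4) are synthetic stand-ins, not the dataset shown on the slide.

```r
# Complete-linkage hierarchical clustering on columns 2-4, as in the figure.
set.seed(1)
mydata <- data.frame(id = 1:17,
                     iq = rnorm(17, 100, 15),
                     math = rnorm(17, 50, 10),
                     reading = rnorm(17, 50, 10))
hc <- hclust(dist(mydata[, 2:4]), method = "complete")  # Euclidean distances
plot(hc, main = "Intelligence and Achievement Dendrogram")
groups <- cutree(hc, k = 2)  # cut the tree to pull out candidate subgroups
```

Cutting the tree at different heights (or `k` values) is how a unique subgroup like the one highlighted on the slide would be isolated.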
4. Time Series and Financial Data
Key tasks in time series/financial data analyses:
Forecasting future time points
Identifying drivers of the dynamic process (ex. why are sales rising?)
Identifying tipping points (crashes, spikes…)
Identifying covarying behavior (sectors that behave similarly, stocks that influence each other, daily rising/falling patterns…)
[Figure: Dow Jones Industrial Average time series]
5. Morse-Smale Clustering
Multivariate technique from topology similar to mode clustering
Find peaks and valleys in data by filtering on a defined function:
A watershed on mountains
Dribbling a soccer ball across a field of hills
Separate data based on shared peaks and valleys
Many nice developments on convergence and theoretical properties
R package has nice dimensionality reduction plots to highlight cluster differences with respect to the filter function and predictor sets
6. Dimensionality Reduction and Visualization
Helpful in visualizing multivariate trends and group differences, particularly for multivariate time series data
Assume data lies in a lower-dimensional subspace and map the full dataset to that subspace (right)
Types of methods:
Linear (principal component analysis, or PCA)
Nonlinear (manifold learning)
Local (preserving neighborhood metrics like distance between points)
Global (preserving global characteristics like connectedness and limits)
Manifold learning methods are related to a branch of mathematics called differential geometry
7. Manifold Learning Methods
Three main methods considered in this analysis:
Multidimensional scaling (MDS)
Global method based on distance preservation and matrix decomposition
Distances can be Euclidean, geodesic, Manhattan…
Nice theoretical result relating it to PCA when the best subspace is linear
Locally linear embedding (LLE)
Local method based on a nearest-neighbor graph, weighting, and matrix decomposition
Related to ISOMAP and other methods
t-distributed stochastic neighbor embedding (t-SNE)
Local and global method based on mapping of probability distributions and random walks
Preserves both local and global characteristics of the original data space
Very strong performance on a variety of problems lately
[Figure: Breast Cancer Dataset Comparison]
8. Example Stock Market Dataset
Emerging markets
Important for investors
Future drivers of global trade
Global trends
Daily fluctuations
Tipping points (crashes and opportunities)
This example:
Recent Kaggle dataset of daily National Stock Exchange of India prices from July 2003–February 2018:
https://www.kaggle.com/abhishekyana/nse-listed-1384-companies-data/data
Cleaned (nulls removed, <1%) and daily fluctuation ranges added (7 total time series columns)
3616 days included
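The cleaning step described above can be sketched in a few lines of R. The column names and the small synthetic frame below are assumptions standing in for the actual Kaggle schema:

```r
# Hypothetical cleaning sketch: drop null rows, add a fluctuation-range column.
nse <- data.frame(open  = c(100, 101, NA, 103),
                  high  = c(102, 104, 103, 105),
                  low   = c(99, 100, 101, 102),
                  close = c(101, 103, 102, 104))
nse <- nse[complete.cases(nse), ]            # remove rows with nulls (<1% in the real data)
nse$fluctuation.range <- nse$high - nse$low  # add daily fluctuation range
nrow(nse)                                    # number of retained trading days
```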
11. MDS Plot
R function cmdscale() with 2 components and a Euclidean distance metric
Relationships very linear and well-separated globally
Matches PCA well
Separates into:
1. Daily price
2. Daily fluctuation
[Figure: MDS Results — Dimension 1 vs. Dimension 2]
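A minimal sketch of the cmdscale() step on the slide, with a small synthetic series standing in for the cleaned NSE data frame (column names assumed):

```r
# Classical MDS with 2 components and Euclidean distances, as on the slide.
set.seed(1)
x   <- cumsum(rnorm(200, 1))                 # synthetic upward-trending price
nse <- data.frame(open = x, high = x + abs(rnorm(200)),
                  low = x - abs(rnorm(200)), close = x + rnorm(200))
mds <- cmdscale(dist(nse), k = 2)            # rows = days, cols = components
plot(mds[, 1], mds[, 2], xlab = "Dimension 1", ylab = "Dimension 2",
     main = "MDS Results")
```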
12. LLE Plot
R function lle() with 2 components and 10 nearest neighbors (lle package)
Separation and fit not great
Suggests global behavior more important than local for this time series
[Figure: LLE Results — Dimension 1 vs. Dimension 2]
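The lle() call on the slide can be sketched as follows, assuming the `lle` package is installed; the synthetic matrix stands in for the 7-column NSE series:

```r
# LLE with 2 embedding dimensions and 10 nearest neighbors.
library(lle)
set.seed(1)
x   <- cumsum(rnorm(200, 1))
nse <- cbind(open = x, high = x + abs(rnorm(200)),
             low = x - abs(rnorm(200)), close = x + rnorm(200))
fit <- lle(nse, m = 2, k = 10)               # m = components, k = neighbors
plot(fit$Y[, 1], fit$Y[, 2], xlab = "Dimension 1", ylab = "Dimension 2",
     main = "LLE Results")
```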
13. t-SNE Plot
R package dimRed with function getDimRedData(), perplexity (smoothing) at 80, 2 components, and the tSNE method
Parses out tipping points within the growth period and exact moments of transitional events (see green group)
[Figure: tSNE Results — Dimension 1 vs. Dimension 2]
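One way the dimRed workflow on the slide might look in code; the exact parameter names (`perplexity`, `ndim`) are assumptions based on dimRed's "tSNE" method, which wraps the Rtsne package, and the data are synthetic:

```r
# t-SNE via dimRed: perplexity 80, 2 components.
library(dimRed)
set.seed(1)
x   <- cumsum(rnorm(500, 1))
nse <- cbind(open = x, high = x + abs(rnorm(500)),
             low = x - abs(rnorm(500)), close = x + rnorm(500))
emb    <- embed(nse, "tSNE", perplexity = 80, ndim = 2)
coords <- getDimRedData(emb)@data            # extract embedded coordinates
plot(coords[, 1], coords[, 2], xlab = "Dimension 1", ylab = "Dimension 2",
     main = "tSNE Results")
```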
14. Deep Dive into MDS Components
MDS components separate into prices (component 1) and fluctuation ranges (component 2), summarized in the correlation table
Fluctuation ranges increase as the market gains points (left)

Original Time Series   MDS Component 1   MDS Component 2
open                   1.00               0.003
high                   1.00              -0.007
low                    1.00               0.009
fluctuation.range      0.684             -0.706
close                  1.00              -0.003
day.range              0.514             -0.747
adj_close              1.00              -0.002
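A table like this can be produced by correlating each original series with the two MDS components. A self-contained sketch on synthetic stand-in data (column names assumed):

```r
# Correlate original columns with MDS components 1 and 2.
set.seed(1)
x   <- cumsum(rnorm(200, 1))
nse <- data.frame(open = x, high = x + abs(rnorm(200)),
                  low = x - abs(rnorm(200)), close = x + rnorm(200))
nse$fluctuation.range <- nse$high - nse$low
mds <- cmdscale(dist(nse), k = 2)            # same MDS as the earlier slide
round(cor(nse, mds), 3)                      # rows: series; cols: components
```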
15. Transition Periods Deep Dive
Transition periods overlap with long-term trends
Shorter time-to-transition periods in recent years
16. Results Overview
NSE shows exponential growth during a period of change
New regulations
Oil price drops
Falling inflation
Tipping points of growth
Includes the current period, starting late 2017/early 2018
In late 2017, this analysis anticipated the NSE's tumble of February 2018
A crash was predicted by several economists for sometime in 2018:
https://www.getmoneyrich.com/indian-stock-market-correction-likely-in-2017-2018/
https://www.livemint.com/Money/pXdnLHA2r1FJfwJhFEDqjO/Stock-market-crash-Experts-divided-on-whether-theres-more.html
Fluctuations and volatility
Increasing in the past few years
Prices can vary a lot during the day while opening and closing at similar values
17. Conclusions
Clustering and dimensionality reduction for multivariate data exploration
Helpful for understanding multivariate time series data
Helpful for understanding other types of data prior to analysis
Performs very well, showing behavior deviations before major events
Can provide an understanding of covariance structure (relationships between stocks, volatility within a market…)
18. References
Farrelly, C. M. (2017). Dimensionality reduction ensembles. arXiv preprint arXiv:1710.04484.
Gerber, S., Rübel, O., Bremer, P. T., Pascucci, V., & Whitaker, R. T. (2013). Morse–Smale regression. Journal of Computational and Graphical Statistics, 22(1), 193-214.
Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1), 1-27.
Maaten, L. v. d., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579-2605.
Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323-2326.
Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3), 37-52.
ResearchGate profile with folder for talk (data, R code, PPT):
https://www.researchgate.net/profile/Colleen_Farrelly2