MSC Classification: 58D15, 68T99. Keywords: Manifold Learning, Embedding Spaces, Discretized Gradient Flow

Discretized Gradient Flow for Manifold Learning

Dara Gold, Jenner & Block LLP (contractor), and Steven Rosenberg, Department of Mathematics and Statistics
Boston University
Boston, MA 02215, USA

Abstract.

Gradient descent, or negative gradient flow, is a standard technique in optimization to find minima of functions. Many implementations of gradient descent rely on discretized versions, i.e., moving in the gradient direction for a set step size, recomputing the gradient, and continuing. In this paper, we present an approach to manifold learning where gradient descent takes place in the infinite dimensional space ${\mathcal{E}}={\rm Emb}(M,\mathbb{R}^{N})$ of smooth embeddings $\phi$ of a manifold $M$ into $\mathbb{R}^{N}$ . Implementing a discretized version of gradient descent for $P:{\mathcal{E}}\longrightarrow{\mathbb{R}}$ , a penalty function that scores an embedding $\phi\in{\mathcal{E}}$ , requires estimating how far we can move in a fixed direction – the direction of one gradient step – before leaving the space of smooth embeddings. Our main result is to give an explicit lower bound for this step length in terms of the Riemannian geometry of $\phi(M)$ . In particular, we consider the case when the gradient of $P$ is pointwise normal to the embedded manifold $\phi(M)$ . We prove this case arises when $P$ is invariant under diffeomorphisms of $M$ , a natural condition in manifold learning.

1. Introduction

A common approach in data analysis and machine learning is manifold learning, i.e., determining how to approximate a finite set of points $\{y_{i}\}_{i=1}^{L}$ in Euclidean space ${\mathbb{R}}^{N}$ by a $k$ -dimensional embedded, compact manifold $M$ for some $k\ll N$ [6, 12, 13, 23, 29, 36]. (The definition of embedding is at the end of the Introduction.) While classic approaches to non-linear manifold learning include Isometric Mapping (IsoMap), Locally Linear Embedding (LLE), and Laplacian and Hessian eigenmaps, there is a growing body of work that uses gradient descent of functionals to find manifold representations of high dimensional data. This mathematical setup involves the space of smooth embeddings ${\mathcal{E}}={\rm Emb}(M,\mathbb{R}^{N})$ considered as an open subset of the infinite dimensional vector space of all maps from $M$ to ${\mathbb{R}}^{N}$ with the Banach space topology coming from a high Sobolev norm or the $C^{\infty}$ Fréchet space topology. We also have a $C^{1}$ penalty function $P:{\mathcal{E}}\rightarrow\mathbb{R}$ which typically contains a data fitting term and a regularization term, as explained below. (In keeping with the literature, we assume that $k$ and the diffeomorphism type of $M$ are given.) In theory, finding a global minimum of $P$ via the negative gradient flow of $P$ on ${\mathcal{E}}$ gives an optimal embedding, or one that “best fits” the training set $\{y_{i}\}$ . The main result of this paper (Theorem 5.7) gives precise bounds on the practical implementation issue of determining how far one can flow in a fixed gradient direction and still have an embedded manifold.

To avoid overfitting – or choosing a $\phi$ such that $\phi(M)$ fits the $\{y_{i}\}$ very closely but performs poorly on new data points – a penalty function $P$ can penalize $\phi(M)$ both for being too far from $\{y_{i}\}$ and for “twisting too much” to fit the data. Thus a typical penalty function $P=P_{1}+P_{2}:{\mathcal{E}}\longrightarrow{\mathbb{R}}$ contains two terms: (i) a data fitting term $P_{1}(\phi)=\sum_{i=1}^{r}d^{2}(\phi(M),y_{i}),$ where $d(\phi(M),y_{i})$ is the Euclidean distance from $y_{i}$ to the closest point in $\phi(M)$ ; (ii) a regularization term $P_{2}$ designed to prevent overfitting, e.g. $P_{2}(\phi)=\|\phi\|_{s},$ the $s$ -Sobolev norm of $\phi$ . (For overviews of this standard approach, see [5], [39].) Gradient descent, i.e., moving in the direction of $-\nabla P$ in ${\mathcal{E}}$ , can find a local or global minimum of $P$ , or an optimal manifold to fit $\{y_{i}\}$ .
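The two-term penalty can be made concrete on a sampled curve. The following is a minimal sketch, not the authors' implementation: it assumes $\phi(M)$ is a closed curve represented by finitely many sample points, approximates $d(\phi(M),y_{i})$ by the distance to the nearest sample, and uses the polygon perimeter as a stand-in geometric regularizer. The names `data_term`, `length_term`, `penalty`, and the parameter `weight` are hypothetical.

```python
import numpy as np

def data_term(samples, data):
    """Approximate P_1: sum over y_i of the squared distance to the nearest sample of phi(M)."""
    # Pairwise squared distances, shape (len(data), len(samples)).
    diffs = data[:, None, :] - samples[None, :, :]
    sq_dist = np.sum(diffs**2, axis=-1)
    return float(np.sum(sq_dist.min(axis=1)))

def length_term(samples):
    """Stand-in regularizer for a closed curve: perimeter of the polygon through the samples."""
    edges = np.roll(samples, -1, axis=0) - samples
    return float(np.sum(np.linalg.norm(edges, axis=1)))

def penalty(samples, data, weight=1.0):
    """P = P_1 + weight * P_2 on the sampled curve (weight is an assumed tuning parameter)."""
    return data_term(samples, data) + weight * length_term(samples)

# Unit circle sampled at 200 points, with two data points off the curve.
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
data = np.array([[2.0, 0.0], [0.0, -3.0]])
total = penalty(circle, data)
```

The nearest-sample approximation overestimates the true distance to the curve by an amount that shrinks as the sampling is refined.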

While there are theoretical challenges with this setup, we focus on an implementation problem in this paper. In theory, to find a negative gradient flow line on ${\mathcal{E}}$ , we need to know the gradient of $P$ at each point of ${\mathcal{E}}.$ This is generally intractable for computer calculations. Instead, the gradient flow is often discretized: we move in the negative gradient direction from an initial point $\phi_{0}$ for a fixed step size to a new point $\phi_{1}$ , stop and recompute the gradient at $\phi_{1}$ , then iterate until the gradient is smaller than a specified amount. Since a gradient vector in the tangent space of ${\mathcal{E}}$ corresponds to a vector field along $\phi(M)$ , we need to estimate a lower bound $t^{*}=t^{*}(\phi)$ for how far we can move from a fixed embedding $\phi$ in the negative gradient direction $-\nabla P_{\phi}$ and still remain in the space of embeddings. In summary, we avoid the usual problem that forward geometric flows tend to develop singularities by first discretizing the flow, and then estimating how big a gradient step avoids singularities.
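The discretized scheme described above can be sketched as a short loop. This is an illustration under stated assumptions, not the paper's algorithm: an embedding is represented by an array of sample points, `grad` returns the gradient vector at each sample, and `max_step` is a hypothetical callback standing in for the lower bound $t^{*}(\phi)$. The safety factor `0.5` and the toy penalty used in the demonstration are also assumptions; the toy penalty is chosen only so the loop visibly converges.

```python
import numpy as np

def discretized_gradient_flow(phi0, grad, max_step, step=0.1, tol=1e-6, max_iter=10_000):
    """Iterate phi <- phi - s * grad(phi), keeping each displacement below max_step(phi)."""
    phi = np.asarray(phi0, dtype=float)
    for _ in range(max_iter):
        g = grad(phi)
        # Largest pointwise gradient length over the samples.
        k = float(np.max(np.linalg.norm(g, axis=-1)))
        if k < tol:
            break
        # The largest pointwise displacement is s * k; keep it strictly below the bound.
        s = min(step, 0.5 * max_step(phi) / k)
        phi = phi - s * g
    return phi

# Toy demonstration: P(phi) = sum of |phi(m)|^2 over the samples, with gradient 2 * phi.
phi0 = np.array([[3.0, 4.0], [1.0, 0.0]])
result = discretized_gradient_flow(phi0, grad=lambda phi: 2.0 * phi, max_step=lambda phi: 1.0)
```

The step is capped rather than fixed, so the iteration slows down where the bound is small and uses the nominal step elsewhere.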

This practical issue is the main focus of this paper. In the main result, Theorem 5.7, we provide such a lower bound $t^{*}$ , which in effect measures how well discretized flow can approximate the smooth flow. Here $t^{*}$ depends on the local and global extrinsic geometry of $\phi(M)$ .

We emphasize that our approach to manifold learning directly tackles the infinite dimensional nature of this optimization problem via gradient flow and without making any simplifying choices that reduce the problem to finite dimensions. Typical choices in the literature are parametric methods, which fix a finite dimensional parameter space of embeddings, and RKHS methods, which reduce the optimization to a finite dimensional problem via the Representer Theorem, but only after making a choice of kernel function. In contrast, our approach only assumes that $M$ is compact, possibly with boundary, and so must contend with infinite dimensional analytic issues. Since we are given a finite set of training data, the compactness assumption is reasonable.

We briefly discuss the issues with directly working with the smooth gradient flow on ${\mathcal{E}}$ . It may be difficult to prove that $P$ is differentiable for typical data terms which measure the minimum distance from a data point to the embedded manifold [4]. Even if $P$ is differentiable, in this infinite dimensional case it is not clear that a gradient flow line $\gamma(t)$ stays in ${\mathcal{E}}$ or converges as $t\longrightarrow\infty$ to a critical point of $P$ , as ${\mathcal{E}}$ is an open dense set in the space of smooth maps from $M$ to ${\mathbb{R}}^{N}.$ Even if we can prove convergence, since neither $P$ nor ${\mathcal{E}}$ is in general convex, a critical point need not be a global minimum, and a second derivative test for local minima may be difficult to develop and implement. Perhaps most fundamentally, even the short time existence for the gradient flow may be difficult to establish, particularly if we use the most natural $C^{\infty}$ topology on ${\mathcal{E}}.$ These problems are well known in differential geometry, e.g., in the study of minimal submanifolds. In contrast, discretized gradient flow is both a tool for theoretical results on gradient flow [1, Ch. 11] and for computer implementations based on discretized, usually linearized, versions of gradient flow [14], although there may again be convergence issues [11].

The paper is organized as follows. In §2, we give a short overview of manifold learning with references to the literature. §3 gives an outline of the proof of Theorem 5.7. In §4, we argue that the entire penalty term should be invariant under the diffeomorphism group of $M$ , just like the data fitting term (i). In particular, regularization terms built from geometric quantities like the volume or total mean curvature of $M$ have this invariance, while more familiar regularization terms like a Sobolev norm of the embedding do not. We prove in Theorem 4.2 that for a diffeomorphism invariant penalty function $P$ , the gradient vector field $\nabla P_{\phi}$ is guaranteed to be pointwise normal to $\phi(M).$ §5 gives the proof of Theorem 5.7. §6 is a discussion of potential extensions of this work. Appendix A contains a proof of a quantitative implicit function theorem used in §5.

We recall the technical definition of an embedding of a manifold $M$ into a manifold $W$ ( $W={\mathbb{R}}^{N}$ for us).

Definition 1.1.

A smooth map $f:M\longrightarrow W$ between smooth manifolds is an immersion if the differential $df_{x}:T_{x}M\longrightarrow T_{f(x)}W$ is injective for all $x\in M.$ An immersion is an embedding if $f$ is a homeomorphism from $M$ to $f(M)$ in the induced topology, i.e., a set $V\subset f(M)$ is open iff $V=U\cap f(M)$ for an open set $U\subset W.$

Since $M$ is compact in this paper, the unwieldy topological condition for an embedding simplifies.

Proposition 1.1.

[26, Prop. 4.22] If $M$ is compact, a smooth injective immersion $f:M\longrightarrow W$ is an embedding.
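Injectivity is a separate condition from being an immersion, even on a compact manifold. A standard illustration (added here, not taken from the paper) is the figure-eight curve $t\mapsto(\sin 2t,\sin t)$ on the circle: its velocity never vanishes, yet $t=0$ and $t=\pi$ map to the same point. The sketch below checks both facts numerically; the function names are hypothetical.

```python
import numpy as np

def figure_eight(t):
    """The figure-eight map t -> (sin 2t, sin t) from the circle to the plane."""
    return np.stack([np.sin(2.0 * t), np.sin(t)], axis=-1)

def figure_eight_velocity(t):
    """Derivative of the figure-eight map with respect to t."""
    return np.stack([2.0 * np.cos(2.0 * t), np.cos(t)], axis=-1)

t = np.linspace(0.0, 2.0 * np.pi, 10_000, endpoint=False)

# Immersion: the speed stays bounded away from zero on the sampled parameters.
min_speed = float(np.min(np.linalg.norm(figure_eight_velocity(t), axis=-1)))

# Not injective: t = 0 and t = pi have the same image (up to rounding).
self_intersection_gap = float(np.linalg.norm(figure_eight(0.0) - figure_eight(np.pi)))
```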

2. Related work

Manifold learning is an approach to dimensionality reduction, the attempt to replace high dimensional data in ${\mathbb{R}}^{N},N\gg 0$ , by a low dimensional subset. Standard techniques in manifold learning, such as Locally Linear Embedding (LLE), IsoMap [40], Laplacian Eigenmaps [5], and Hessian Eigenmaps [13], involve algorithms that reduce to (often nontrivial) minimization problems in finite dimensions. In theory, these minimization problems can be solved by Lagrange multipliers, so gradient descent is not a built-in feature of these approaches. (We note that our discretization method is somewhat the reverse of the successful manifold approximation approach of Laplacian eigenmaps, where a discrete set of data in ${\mathbb{R}}^{N}$ that apparently lies close to a submanifold is parametrized by a subset of ${\mathbb{R}}^{k}$ through eigenvectors of a graph Laplacian; this parametrization is our $\phi^{-1}.$ )

In contrast, our approach is inherently infinite dimensional and relies on gradient flow, as explained in the Introduction. The use of gradient flow for functionals on infinite dimensional manifolds of maps has a large literature in machine learning, where this comes under the general heading of nonparametric methods. (In the parametric approach, one restricts attention to a finite dimensional submanifold depending on a finite dimensional family of parameters.) Osher and Sethian introduced the Level Set Method [35], which has been applied to machine learning by using gradient descent on energy functionals (which act on the space of level set functions) to find optimal data-classification boundaries. Viewing the decision boundary this way avoids typical problems that arise with cusps and discontinuities in a flow whose speed is curvature dependent. This work has been extended in many directions, including computer vision and image analysis, fluid mechanics, and classification problems [33, 38, 41, 44]. In supervised learning, [3] finds optimal statistical labeling functions by using gradient descent of penalty functionals that include both a data term $P_{1}$ as above and a geometric regularization term $P_{2}$ . (It should be noted that this paper has to resort to parametric methods to implement the discretized gradient flow algorithm.) There are intriguing connections between regularization methods and classical physical equations in Lin et al. [27].

Although applied here to manifold learning, the appearance of gradient flow in infinite dimensions of course has its roots in differential geometry. In minimal submanifold theory, the penalty function is the purely geometric volume of the embedded manifold, and the gradient flow is the mean curvature flow. A sampling of results is in [21, 24, 25, 37, 43]. Similar to our approach, Mayer [30] uses a discretized approximation to the gradient flow, which more closely mimics implementation processes. It is worth noting that historically, the modern study of gradient flow in differential geometry was initiated by Morse [32] in the 1930s on the infinite dimensional space of paths on a Riemannian manifold, which was then adapted by Milnor [31] to develop Morse theory on finite dimensional manifolds. In turn, Morse theory has undergone widespread development through Floer theory and its many variants in the past 25 years (see e.g., [2]).

As described in the Introduction, our penalty functions evaluate manifold embeddings with diffeomorphism invariant data and geometric regularization terms. The use of such terms is a developing area at the interface of differential geometry and machine learning. [41] uses the surface area of a decision boundary as the regularization term $P_{2}(\phi)$ , while [3] uses the area of the manifold itself for $P_{2}(\phi)$ , and [7] uses a discrete version of the total mean curvature of a surface with applications to tomography. This last article contains many references to generalizations of optimization methods to a fixed finite dimensional Riemannian manifold, while our interest is in the infinite dimensional space of embeddings. Finally, the strongest connection to date between manifold learning and differential geometry is in the work of Fefferman et al. [16, 17, 18, 19, 20] on the “manifold hypothesis.”

Although using gradient descent for manifold learning has widespread applications to machine learning, discretizing the flow, which is needed for most implementations, has many unstudied challenges. For example, [3], which is most closely related to our paper, uses a fixed step size in its gradient flow implementation. Our paper is the first to address the maximum step size that ensures a manifold remains an embedding when finding a low dimensional representation of training data.

3. Proof Outline for the Discretized Gradient Flow Estimate

Because of the computational detail in §5, we give an overview of the proof structure and the locations of key results.

3.1. General Overview

In Theorem 4.2 in §4, we give a natural condition on the penalty function $P:{\mathcal{E}}\longrightarrow{\mathbb{R}}$ under which $\nabla P$ is pointwise normal to an embedding $\phi(M)$ . Throughout the paper, we assume that $P$ satisfies this condition.

Given a pointwise normal vector field $u$ along $\phi(M)$ with the length of each vector in $u$ at most one, §5, which has our main results, gives a lower bound for $t^{*}$ such that

\phi_{t}(m)=\phi(m)+tu_{m}

remains an embedding for all $|t|<t^{*}$ .¹ In particular, this applies to $u=k_{\phi}^{-1}\cdot\nabla P_{\phi}$ , where $k_{\phi}=\max_{x\in M}\|\nabla P_{\phi(x)}\|$ . Since $M$ is compact, it suffices to prove that $\phi_{t}$ is an injective immersion.

¹ The Euler class of the normal bundle $e\in H^{N-{\rm dim}(M)}(M)$ is the obstruction to the global existence of a unit normal vector field. Since $e$ may be nonzero, we must refer to vector fields whose elements have length at most one. If $N>2{\rm dim}(M)$ , the obstruction vanishes.
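The normalization $u=k_{\phi}^{-1}\cdot\nabla P_{\phi}$ and the linear deformation $\phi_{t}=\phi+tu$ can be sketched on sampled data as follows. This is a hedged illustration: the gradient is assumed to be given as an array of vectors at the sample points, and the names `normalized_field` and `deform` are hypothetical. A gradient step $\phi-s\nabla P_{\phi}$ corresponds to $t=-sk_{\phi}$ , so the condition $|t|<t^{*}$ reads $sk_{\phi}<t^{*}$ .

```python
import numpy as np

def normalized_field(grad_values):
    """Return (u, k) with k the maximal pointwise length and u = grad / k, so |u_m| <= 1."""
    k = float(np.max(np.linalg.norm(grad_values, axis=1)))
    if k == 0.0:
        # At a critical point there is no direction to move in.
        return np.zeros_like(grad_values), 0.0
    return grad_values / k, k

def deform(phi_values, u_values, t):
    """phi_t(m) = phi(m) + t * u_m, evaluated on the sample points."""
    return phi_values + t * u_values

# Example with two sample points in the plane.
g = np.array([[3.0, 4.0], [0.0, 1.0]])
u, k = normalized_field(g)
moved = deform(np.zeros((2, 2)), u, 2.0)
```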

3.2. Note on Computation of Key Values

Proposition 5.1 gives a condition under which $\phi_{t}$ is an immersion, and Theorem 5.2 defines the bound $t^{*}$ in which $\phi_{t}$ is injective. Finally, together with our assumption that $M$ is compact, our main result Theorem 5.7 concludes the mapping is an embedding. In the proof of Theorem 5.2, $t^{*}$ is initially a function of the quantities $\epsilon,\delta_{H},\delta,K$ . $\epsilon$ is defined in §5.1(1), and $K$ is explicitly defined in §5.1(7) as the maximal principal eigenvalue of $\phi(M).$ In Lemma 5.5, $\epsilon$ is computed as a function of $\delta$ and $K$ , so $t^{*}=t^{*}(K,\delta,\delta_{H}).$ The dependence of $t^{*}$ on $\delta_{H}$ is eliminated after (5.24), so finally $t^{*}=t^{*}(K,\delta).$

The computation of $\delta$ is significantly more involved. The characterizing property of $\delta$ is in §5.1(8). $\delta$ is defined in (5.2) as the minimum of a quantity $\delta(q_{0},v_{0})$ , where $(q_{0},v_{0})$ is in the normal bundle of $\phi(M)$ . In turn, $\delta(q_{0},v_{0})$ is computed in the proof of Proposition 5.6 in three steps, each of which builds on the prior: $\delta^{0}(q_{0},v_{0})$ is defined in (5.17), $\delta^{1}(q_{0},v_{0})$ is defined by (5.18), and $\delta^{2}(q_{0},v_{0})$ is defined in (5.22). Finally $\delta(q_{0},v_{0})$ is defined in (5.23) in terms of $\delta^{0}(q_{0},v_{0}),\delta^{2}(q_{0},v_{0})$ . These steps are recapped in Remark 5.2.

4. A Condition for Normal Gradient Vector Fields

As outlined in the introduction, manifold learning involves searching for an embedding $\phi:M\longrightarrow{\mathbb{R}}^{N}$ with $y_{i}\in{\rm Im}(\phi)$ for training data $\{y_{i}\}.$ Of course, $y_{i}\in{\rm Im}(\phi)$ iff $y_{i}\in{\rm Im}(\phi\circ g)$ , where $g\in{\rm Diff}(M)$ is a diffeomorphism of $M$ . Thus the penalty term $P_{1}:{\mathcal{E}}\longrightarrow{\mathbb{R}}$ which measures goodness of fit should not distinguish between $\phi$ and $\phi\circ g$ , i.e., this penalty term must be invariant under the action of ${\rm Diff}(M)$ : $P_{1}(\phi)=P_{1}(\phi\circ g)$ . The data penalty term $P_{1}(\phi)=\sum_{i=1}^{r}d^{2}(\phi(M),y_{i})$ in the introduction is clearly diffeomorphism-invariant. (Since the quotient space ${\mathcal{E}}/{\rm Diff}(M)$ may have a non-Hausdorff topology, we consider diffeomorphism-invariant penalty functions on ${\mathcal{E}}$ , rather than penalty functions on the quotient space.) These types of invariant functionals are familiar in gauge theory, where functionals are invariant under gauge group actions, and in Gromov-Witten theory, where maps are defined only up to holomorphic automorphisms.

Similarly, we can replace the non-diffeomorphism invariant regularization term $\|\phi\|_{s}$ , which is computed in a choice of local coordinates, by e.g. $P_{2}^{\prime}(\phi)={\rm vol}(\phi(M))$ , which measures a combination of the first derivatives of $\phi=(\phi^{1},\ldots,\phi^{N})$ , or by $P_{2}^{\prime}(\phi)=\int_{M}\left[\sum_{j=1}^{N}(({\rm Id}+\Delta)^{s}\phi^{j})\cdot\phi^{j}\right]^{1/2}{\rm dvol}_{M}$ , which is equivalent to the $s$ -Sobolev norm by the basic elliptic estimate. As a simple example, for ${\mathcal{E}}={\rm Emb}(S^{2},{\mathbb{R}}^{3})$ , $P_{1}^{\prime}(\phi)=d^{2}(\phi(S^{2}),\vec{0})$ , $P_{2}^{\prime}={\rm vol}(\phi(S^{2}))$ , and for the standard unit sphere as the initial embedding $\phi_{0}(S^{2})$ , gradient flow for $P^{\prime}=P^{\prime}_{1}+P^{\prime}_{2}$ shrinks the unit sphere to the origin in infinite time.

In this section, we prove that such penalty functions have gradients that are pointwise normal vector fields to $M$ , and apply this result to ${\mathcal{E}}$ . We first review a known result about the gradient function on a finite dimensional manifold with a group action. Recall that for a $C^{1}$ function $P:Z\longrightarrow{\mathbb{R}}$ on an oriented Riemannian manifold $(Z,h)$ , the gradient vector field $\nabla P$ is characterized by

dP_{m}(v)=\langle\nabla P,v\rangle_{h(m)},

for all $m\in Z,v\in T_{m}Z.$ Here $dP_{m}:T_{m}Z\longrightarrow{\mathbb{R}}$ , the differential of $P$ at $m$ , is independent of the Riemannian metric.

Lemma 4.1.

Let $G$ be a connected Lie group acting via isometries on a Riemannian manifold $Z$ . A function $P:Z\longrightarrow{\mathbb{R}}$ is $G$ -invariant ( $P(g\cdot m)=P(m)$ for all $m\in Z,g\in G$ ) iff $\nabla P_{m}$ is perpendicular to the orbit $\mathcal{O}_{m}=\{g\cdot m:g\in G\}$ for all $m\in Z.$

Strictly speaking, we mean $\nabla P(m)\perp_{h(m)}T_{m}\mathcal{O}_{m}.$

Proof.

If $P$ is $G$ -invariant, then $\mathcal{O}_{m}$ is contained in a level set of $P$ . The gradient is always perpendicular to a level set: for $X\in T_{m}\mathcal{O}_{m}$ , take a curve $\gamma(t)\in\mathcal{O}_{m}$ with $\gamma(0)=m$ and $\dot{\gamma}(0)=X$ , and compute

0=(d/dt)|_{t=0}P(\gamma(t))=dP_{m}(X)=\langle\nabla P_{m},X\rangle.

Conversely, assume that $\nabla P_{m}\perp T\mathcal{O}_{m}$ for all $m$ . Take a smooth path $\eta(t),t\in[0,1],$ from $e\in G$ to a fixed $g\in G$ , and for a fixed $m\in Z$ define $\gamma(t)=\eta(t)\cdot m.$ Then

0=\langle\nabla P_{\gamma(t)},\dot{\gamma}(t)\rangle=dP_{\gamma(t)}(\dot{\gamma}(t)),

so $P$ is constant along $\gamma(t).$ In particular, $P(m)=P(\gamma(0))=P(\gamma(1))=P(g\cdot m)$ . ∎
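As a finite dimensional sanity check of Lemma 4.1 (an illustration added here, not part of the original argument), take $Z={\mathbb{R}}^{2}$ with the Euclidean metric, $G=SO(2)$ acting by rotations, and $P(x)=f(|x|^{2})$ for an assumed $C^{1}$ function $f$ . The orbits are circles centered at the origin, and the gradient is radial:

```latex
\begin{aligned}
P(x) &= f(|x|^{2}), \qquad \nabla P_{x} = 2f'(|x|^{2})\,x,\\
T_{x}\mathcal{O}_{x} &= \mathrm{span}\{Jx\}, \qquad J=\begin{pmatrix}0&-1\\ 1&0\end{pmatrix},\\
\langle \nabla P_{x}, Jx\rangle &= 2f'(|x|^{2})\,(x\cdot Jx) = 0 .
\end{aligned}
```

Conversely, a function on the plane whose gradient is everywhere radial is constant on each circle about the origin, hence rotation invariant, which matches the second half of the proof.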

We want to apply this result with $Z,G$ given by ${\mathcal{E}},{\rm Diff}(M)$ , respectively. (Since ${\rm Diff}(M)$ need not be connected, we have to restrict to ${\rm Diff}_{0}(M)$ , the connected component of the identity diffeomorphism.) The smooth structure on map** spaces is well known (see e.g., [15]). Rather than go through the technicalities of the Lie group structure on ${\rm Diff}(M)$ [34], we give a direct proof.

The tangent space $T_{\phi}{\mathcal{E}}$ at an embedding $\phi$ is given by the infinitesimal variation of a family of embeddings $\phi_{t}$ with $\phi_{0}=\phi$ , which for fixed $m\in M$ is given by $(d/dt)|_{t=0}\phi_{t}(m)\in T_{\phi(m)}{\mathbb{R}}^{N}\simeq{\mathbb{R}}^{N}.$ Thus elements $X$ of $T_{\phi}{\mathcal{E}}$ are “ ${\mathbb{R}}^{N}$ -valued vector fields along $\phi(M)$ ,” i.e., smooth functions $X:M\longrightarrow{\mathbb{R}}^{N}.$

For $\phi\in{\mathcal{E}}$ , $M$ has a Riemannian metric $g_{\phi}$ given by the $\phi$ -pullback of the standard metric/dot product on ${\mathbb{R}}^{N}$ restricted to $\phi(M).$ Specifically, for $v,w\in T_{m}M$ , $\langle v,w\rangle_{m}=d\phi(v)\cdot d\phi(w).$ Denote the associated volume form on $M$ by ${\rm dvol}_{\phi}.$ We take the $L^{2}$ inner product on $T_{\phi}{\mathcal{E}}$ associated to the standard metric/dot product on ${\mathbb{R}}^{N}$ and $g_{\phi}$ :

\langle X,Y\rangle_{\phi}=\int_{M}X_{m}\cdot Y_{m}\ {\rm dvol}_{\phi}(m).

Thus the gradient of $P:{\mathcal{E}}\longrightarrow{\mathbb{R}}$ is characterized by

dP_{\phi}(X)=\langle\nabla P_{\phi},X\rangle_{\phi}=\int_{M}\nabla P_{m}\cdot X_{m}\ {\rm dvol}_{\phi}(m).
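For a closed curve sampled at finitely many points, the $L^{2}$ inner product above can be approximated by a weighted sum. The sketch below is an assumption-laden discretization rather than anything from the paper: ${\rm dvol}_{\phi}$ at a vertex is replaced by half the total length of the two adjacent polygon edges, and the names `vertex_weights` and `l2_inner` are hypothetical.

```python
import numpy as np

def vertex_weights(samples):
    """Discrete dvol_phi on a closed polygon: half the length of the two edges at each vertex."""
    edge_len = np.linalg.norm(np.roll(samples, -1, axis=0) - samples, axis=1)
    # edge_len[i] joins vertex i to i+1; np.roll(edge_len, 1)[i] joins vertex i-1 to i.
    return 0.5 * (edge_len + np.roll(edge_len, 1))

def l2_inner(X, Y, samples):
    """Approximate <X, Y>_phi = int_M X_m . Y_m dvol_phi(m) by a weighted sum over vertices."""
    return float(np.sum(np.sum(X * Y, axis=1) * vertex_weights(samples)))

# Unit circle: unit normal and unit tangent fields along the curve.
n = 400
theta = 2.0 * np.pi * np.arange(n) / n
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
normal = circle.copy()
tangent = np.stack([-np.sin(theta), np.cos(theta)], axis=1)

length_estimate = l2_inner(normal, normal, circle)   # close to the length 2*pi
cross_term = l2_inner(normal, tangent, circle)       # close to 0
```

With this discrete metric, a gradient with respect to $\langle\cdot,\cdot\rangle_{\phi}$ is the ordinary Euclidean gradient at each vertex divided by that vertex's weight, so the two have the same pointwise direction.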

${\rm Diff}(M)$ acts on $\phi\in{\mathcal{E}}$ by $g\cdot\phi=\phi\circ g^{-1}.$ It is standard that ${\rm Diff}(M)$ acts via isometries on ${\mathcal{E}}$ with the $L^{2}$ metric.

In our setting, we can strengthen Lemma 4.1 to the pointwise normal condition $\nabla P_{\phi(m)}\cdot Q_{m}=0$ for all $Q_{m}\in T_{\phi(m)}\phi(M),m\in M$ , for a ${\rm Diff}(M)$ -invariant $P$ .

Theorem 4.2.

For a $C^{1}$ function $P:{\mathcal{E}}\longrightarrow\mathbb{R}$ , the gradient $\nabla P$ is pointwise normal to $T_{\phi(m)}\phi(M)$ for all $m\in M$ and for all $\phi\in{\mathcal{E}}$ if and only if $P$ is invariant under diffeomorphisms in ${\rm Diff}_{0}(M)$ , the path connected component of the identity in ${\rm Diff}(M).$

We note that this pointwise perpendicularity is measured in the usual dot product on ${\mathbb{R}}^{N}$ , even though we have implicitly been using $\phi$ -pullback metrics on $M$ . In particular, the theoretical use of the pullback metric does not affect the practical implementation of discretized gradient flow.

Proof.

Assume $P$ is ${\rm Diff}_{0}(M)$ -invariant. As in the Lemma, we conclude that $\nabla P_{\phi}\perp_{L^{2}}T_{\phi}\mathcal{O}_{\phi}.$

Take a family of diffeomorphisms $g_{t}$ of $M$ with $g_{0}={\rm Id}$ and with tangent vector $X=(d/dt)|_{t=0}g_{t}\in T_{\rm Id}{\rm Diff}(M).$ Then $\phi\circ g_{t}\in\mathcal{O}_{\phi}$ , and the vector field $(d/dt)|_{t=0}\phi\circ g_{t}=d\phi(X)$ tangent to $\phi(M)$ is in $T_{\phi}\mathcal{O}_{\phi}$ . Conversely, any tangent vector field $V$ to $\phi(M)$ integrates to a family of diffeomorphisms in ${\rm Diff}_{0}(M)$ , so we conclude that $V\in T_{\phi}\mathcal{O}_{\phi}$ and that (up to a choice of topology on ${\rm Diff}(M)$ ) $T_{\phi}\mathcal{O}_{\phi}$ is the space of tangent vector fields to $\phi(M).$

Fix $m_{0}\in M$ and a vector $Q_{m_{0}}\in T_{\phi(m_{0})}\phi(M)$ . Choose a sequence $\epsilon_{k}\longrightarrow 0$ and smooth nonnegative functions $f_{k}:\phi(M)\longrightarrow\mathbb{R}$ such that $\int_{M}f_{k}\ {\rm dvol}_{\phi}=1$ , ${\rm supp}(f_{k})\subset B_{\epsilon_{k}}(\phi(m_{0}))\cap\phi(M)$ , with $B_{\epsilon_{k}}(\phi(m_{0}))$ the Euclidean ball of radius $\epsilon_{k}$ centered at $\phi(m_{0})$ . Extend $Q_{m_{0}}$ to a tangent vector field $Q=Q_{m}$ on $\phi(M)$ , and define the vector fields $Y_{k}$ on $\phi(M)$ by:

Y_{k}(\phi(m))=f_{k}(\phi(m))\cdot Q_{m}.

Since each $Y_{k}$ is tangent to $\phi(M)$ , it lies in $T_{\phi}\mathcal{O}_{\phi}$ , so $\langle\nabla P_{\phi},Y_{k}\rangle_{\phi}=0$ for every $k$ . Letting $\epsilon_{k}\longrightarrow 0$ , we have

\begin{aligned}
0&=\lim_{\epsilon_{k}\longrightarrow 0}\langle\nabla P_{\phi},Y_{k}\rangle_{\phi}=\lim_{\epsilon_{k}\longrightarrow 0}\langle\nabla P_{\phi},f_{k}\cdot Q\rangle_{\phi}=\lim_{\epsilon_{k}\longrightarrow 0}\int_{M}\nabla P_{\phi}(\phi(m))\cdot f_{k}(\phi(m))Q_{m}\ {\rm dvol}_{\phi}\\
&=\nabla P_{\phi}(\phi(m_{0}))\cdot Q_{m_{0}}.
\end{aligned}

Therefore $\nabla P_{\phi}(\phi(m_{0}))\perp Q_{m_{0}}$ , and so $\nabla P_{\phi}(\phi(m_{0}))\perp T_{\phi(m_{0})}\phi(M)$ .

For the converse, assume that $\nabla P_{\phi}(\phi(m))\perp T_{\phi(m)}\phi(M)$ for all $m\in M$ . Then $\nabla P\perp_{L^{2}}Q$ for all tangent vector fields $Q$ to $\phi(M)$ , and so $\nabla P$ is perpendicular to the orbit of ${\rm Diff}_{0}(M).$ As in Lemma 4.1, we conclude that $P$ is ${\rm Diff}_{0}(M)$ -invariant. ∎
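Theorem 4.2 can be illustrated numerically on the simplest invariant functional, the length of a closed curve. The sketch below is an added example under stated assumptions: the curve is the unit circle sampled at equally spaced points, the length is replaced by the polygon perimeter, and the gradient is taken with respect to the vertex positions. On this symmetric configuration the discrete gradient is exactly radial; for unevenly sampled or general curves the polygon perimeter is only approximately reparametrization invariant, so a small tangential component should be expected. The name `perimeter_gradient` is hypothetical.

```python
import numpy as np

def perimeter_gradient(samples):
    """Euclidean gradient of the closed-polygon perimeter with respect to each vertex."""
    fwd = np.roll(samples, -1, axis=0) - samples   # x_{i+1} - x_i
    bwd = samples - np.roll(samples, 1, axis=0)    # x_i - x_{i-1}
    fwd_unit = fwd / np.linalg.norm(fwd, axis=1, keepdims=True)
    bwd_unit = bwd / np.linalg.norm(bwd, axis=1, keepdims=True)
    # d/dx_i of |x_i - x_{i-1}| + |x_{i+1} - x_i|.
    return bwd_unit - fwd_unit

n = 64
theta = 2.0 * np.pi * np.arange(n) / n
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
tangent = np.stack([-np.sin(theta), np.cos(theta)], axis=1)

grad = perimeter_gradient(circle)
tangential_part = np.sum(grad * tangent, axis=1)   # component along the curve
normal_part = np.sum(grad * circle, axis=1)        # component along the outward normal
```

The normal component is positive, so the negative gradient points inward and a descent step shrinks the circle, as expected for a length functional.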

5. Estimates for Flows in Normal Gradient Directions

Under the assumption that our penalty function is diffeomorphism invariant, to implement discretized gradient flow, by Theorem 4.2 we have to know how far $\phi(M)$ can move in a fixed normal gradient direction while remaining in the space of embeddings. The next set of results gives an explicit estimate for the lower bound $t^{*}$ of this flow, with the main result in Theorem 5.7.

Throughout the paper, we assume that $M$ is compact. By Prop. 1.1, $\phi:M\longrightarrow{\mathbb{R}}^{N}$ is an embedding iff it is an injective immersion. Recall that $\phi$ is an immersion if its differential $d\phi$ is pointwise injective, which is the infinitesimal condition for the map $\phi$ to be a local injection. Thus, there are two types of obstructions to a linearly deformed embedding $\phi_{t}$ of $\phi$ remaining an embedding: (1) a local obstruction, where distinct nearby points in $\phi(M)$ deform to the same point in $\phi_{t}(M)$ ; (2) a global obstruction, where points far from each other in the induced Riemannian metric on $\phi(M)$ deform to the same point in $\phi_{t}(M)$ because they are close in ${\mathbb{R}}^{N}.$ The local obstruction is controlled by the injectivity of the differential. Specifically, in Theorem 5.7, we conclude that $t^{*}$ is ultimately a function of $K$ and $\delta$ , where $K$ is a bound on the principal curvatures of $\phi(M)$ and thus controls the local obstruction. The global obstruction, which cannot be treated by infinitesimal means, is controlled in Theorem 5.7 by $\delta$ , which is constructed by bounds in the Implicit Function Theorem.

5.1. Notation and Definitions

(1)

$\epsilon=\epsilon_{\phi}$ is chosen so that each $s$ in the $\epsilon$ -neighborhood $B_{\epsilon}(\phi(M))$ of $\phi(M)$ has a unique closest point $q=q(s)$ in $\phi(M)$ . The existence of this neighborhood is guaranteed by the $\epsilon$ -Neighborhood Theorem [26, Thm. 6.24]. $B_{\epsilon}(\phi(M))$ is diffeomorphic to a neighborhood of the zero section of the normal bundle $\nu=\nu_{\phi}$ of $\phi(M)$ : we have $s-q\in\nu_{q}=\nu_{\phi,q}$ , the fiber of $\nu_{\phi}$ at $q$ , and the map $s\mapsto s-q$ is the diffeomorphism. A lower bound for $\epsilon$ is given in Lemma 5.5 in terms of $\delta$ in (8) below; it will become explicit in Remark 5.2.

(2)

We use two sets of coordinates on $\mathbb{R}^{N}$ . Standard (global) coordinates are denoted $(x^{1},\ldots,x^{N})$ . We also represent points $s\in B_{\epsilon}(\phi(M))$ as

s=(q^{1},\ldots,q^{k},v^{1},\ldots,v^{N-k})=(q,v),

where the $q^{i}$ are local manifold coordinates and $v^{j}$ are local coordinates for the normal space. These are called normal coordinates. Thus $q\in\phi(M)$ has $q=(q^{1},\ldots,q^{k},0,\ldots,0)$ . Here $k={\rm dim}(M).$ Note that normal coordinates are not well defined outside $B_{\epsilon}(\phi(M)).$

(3)

A vector in $\nu_{\phi}$ can be expressed either as $tv_{q}$ , where $v_{q}$ is a unit length vector at $q$ , or as $v^{i}w_{i,q}$ , where $\{w_{i,q}\}$ is an orthonormal basis of $\nu_{\phi,q}$ . There are $N-k$ $\{w_{i,q}\}$ vectors, each with $N$ Euclidean coordinates.

(4)

The endpoint map $E:\nu_{\phi}\rightarrow\mathbb{R}^{N}$ is $E(q,v)=q+v$ . It is given explicitly by:

E(q^{1},\ldots,q^{k},v^{1},\ldots,v^{N-k})=(x^{1}(q)+v^{i}w_{i,q}^{1},\ldots,x^{N}(q)+v^{i}w_{i,q}^{N}),

where the domain is in normal coordinates and the range is in standard coordinates.

Definition 5.1.

[31, §6] $e=q_{e}+v_{e}\in B_{\epsilon}(\phi(M))$ is a focal point if the Jacobian of the $E$ map is not full rank at $(q_{e},v_{e})$ .

(5)

The inclusion map $\phi(M)\rightarrow\mathbb{R}^{N}$ is $q=(q^{1},\cdots,q^{k})\mapsto(x^{1}(q),\cdots,x^{N}(q))=x(q)$ in manifold to Euclidean coordinates, so the first fundamental form is the matrix $(g_{ij})=\left(\frac{\partial x}{\partial q^{i}}\cdot\frac{\partial x}{\partial q^{j}}\right)$ , where $\cdot$ is the Euclidean dot product. The second fundamental form at the normal vector $v\in\nu_{\phi}$ is the matrix ${\rm II}_{v}=\left(v\cdot\frac{\partial^{2}x}{\partial q^{i}\partial q^{j}}\right)$ .

(6)

At a fixed $q\in\phi(M)$ , we may choose manifold coordinates so that the first fundamental form is the identity matrix. The principal curvatures of $v$ at $q$ are by definition the eigenvalues $p_{1},\ldots,p_{k}$ of ${\rm II}_{v}$ . Here $p_{i}=p_{i}(q,v).$

(7)

Let $K$ be the maximal principal eigenvalue of $\phi(M)$ . Thus we take the maximum of the $p_{i}(v)$ over all unit vectors in $\nu_{\phi}.$

(8)

$\delta$ is chosen such that normal lines of length $\epsilon$ based at different, close points of $\phi(M)$ do not intersect: for $d_{\mathbb{R}^{N}}(\phi(m_{1}),\phi(m_{2}))<\delta$ , $\phi(m_{1})+t_{1}v_{1}\neq\phi(m_{2})+t_{2}v_{2}$ for unit normal vectors $v_{i}\in\nu_{\phi(m_{i})}$ , $i=1,2$ , and $|t_{1}|,|t_{2}|<\epsilon,$ with $\epsilon$ defined in (1) above. $\delta$ is precisely defined in (5.2), and estimated in Remark 5.2. ( $\delta$ is the reach of $\phi(M)$ , as in e.g. [20].)
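The endpoint map, focal points, and $K$ can be made concrete for a circle of radius $r$ in ${\mathbb{R}}^{2}$ . This is an added illustration with hypothetical function names, using the angle $\theta$ as the manifold coordinate and the outward unit normal as $w_{q}$ , so that $E(\theta,v)=(r+v)(\cos\theta,\sin\theta)$ . With an arclength parametrization, the second fundamental form in the inward unit normal direction is $1/r$ , so $K=1/r$ , and the Jacobian of $E$ has determinant $-(r+v)$ , which vanishes exactly at the center of the circle, at normal distance $1/K$ .

```python
import numpy as np

def endpoint_map(theta, v, r=1.0):
    """E(q, v) = q + v * w_q for the circle of radius r, with w_q the outward unit normal."""
    return (r + v) * np.array([np.cos(theta), np.sin(theta)])

def endpoint_jacobian(theta, v, r=1.0):
    """Jacobian of E in the coordinates (theta, v); columns are dE/dtheta and dE/dv."""
    return np.array([
        [-(r + v) * np.sin(theta), np.cos(theta)],
        [(r + v) * np.cos(theta), np.sin(theta)],
    ])

def is_focal(theta, v, r=1.0, tol=1e-12):
    """True when the Jacobian of E fails to have full rank (determinant numerically zero)."""
    return bool(abs(np.linalg.det(endpoint_jacobian(theta, v, r))) < tol)

r = 2.0
K = 1.0 / r                                   # maximal principal curvature of this circle
focal_at_center = is_focal(0.7, -r, r)        # normal distance 1/K toward the center
focal_halfway = is_focal(0.7, -0.5 * r, r)    # normal distance 1/(2K) toward the center
```

The rank condition does not depend on using $\theta$ rather than arclength, since the two coordinates differ by the constant factor $r$ .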

Remark 5.1.

In the calculations below, estimates for $\epsilon,\delta,K$ are computed explicitly in terms of $\phi$ , local coordinates on $M$ , and local coordinates on $\nu_{\phi}$ . Specifically, a lower bound for $\epsilon$ in terms of $K$ and $\delta$ is given in Lemma 5.5. $K$ of course depends on $\phi$ , but is in fact independent of coordinates on $M$ , as it is the maximum eigenvalue of any normal component of the trace of the second fundamental form. The estimate of $\delta$ uses $\phi$ , local coordinates on $M$ , and local coordinates on $\nu_{\phi}$ in e.g., the proof of Proposition 5.6. It is reasonable to assume knowledge of coordinates on $M$ , as a manifold is specified by a cover of charts. In fact, local coordinates on $M$ and $\phi$ determine local coordinates on $\nu_{\phi}.$ ² In particular, the local coordinates on $\nu_{\phi}$ in (2) are not extra data, since the embedding $\phi$ determines which $q$ are in which $U_{I}$ . Thus, in the end our estimates depend only on local coordinates on $M$ and on $\phi.$ See Remark 5.2 for more details.

² Take the standard basis $\{e_{i}\}$ of ${\mathbb{R}}^{N}$ . For $I=(i_{1},\cdots,i_{N-k})$ with $1\leq i_{1}<\cdots<i_{N-k}\leq N,$ lexicographically ordered, set $e_{I}=(e_{i_{1}},\ldots,e_{i_{N-k}})$ . Let $U_{I}$ be the open set of $q\in\phi(M)$ such that $I$ is the smallest multi-index such that the projection of $e_{I}$ into $\nu_{\phi,q}$ is a basis of $\nu_{\phi,q}$ . Then $\nu_{\phi}$ is trivial over $U_{I}$ , and we can form a new, fixed cover of $M$ by taking $\{V_{i}\cap U_{I}\}.$

5.2. Calculating the Flow Length to Remain an Embedding

In this section, we compute $t^{*}$ such that for $|t|<t^{*}$ and $u$ a normal vector field along $\phi(M)$ with $|u_{\phi(m)}|\leq 1$ , the deformed map $\phi_{t}$ , with image $\phi_{t}(M)=\{\phi(m)+tu_{\phi(m)}:m\in M\}$ , is an embedding. As above, it suffices to prove that each $\phi_{t}$ is an injective immersion.

We start by determining which normal deformations $\phi_{t}(M)$ of $\phi(M)$ are still immersions.

Proposition 5.1.

Let $u$ be a normal vector field of length at most one along $\phi(M)\subset\mathbb{R}^{N}$ , and let $\epsilon$ be defined in §5.1(1). Then $\phi_{t}(M)=\{{\phi(m)+tu_{\phi(m)}:m\in M}\}$ is immersed in $\mathbb{R}^{N}$ for $|t|<\epsilon$ .

Proof.

Because $\phi:M\longrightarrow{\mathbb{R}}^{N}$ is an embedding, it suffices to show that the map $F_{t}:\phi(M)\rightarrow\phi_{t}(M)$ , $F_{t}(q)=q+tu_{q}$ , is an immersion. In normal coordinates, we have

F_{t}(q^{1},\ldots,q^{k})=(q^{1},\ldots,q^{k},tu^{1}_{q},\ldots,tu^{N-k}_{q}).

The differential $DF_{t}$ , written as an $N\times k$ matrix, is of the form

DF_{t}=\left(\begin{array}{c}{\rm Id}_{k\times k}\\ \hline\star\end{array}\right),

where $\star$ is some $(N-k)\times k$ matrix. This has rank $k$ , so $F_{t}$ is an immersion. We note $\epsilon$ is implicitly used as normal coordinates are only defined in $B_{\epsilon}(\phi(M)).$ ∎
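As a sanity check of Proposition 5.1 in the simplest case, the following sketch deforms a circle of radius $r$ in $\mathbb{R}^{2}$ (so $k=1$ , $N=2$ , and $K^{-1}=r$ ) along a normal field $tu$ with $|u|\leq 1$ , and evaluates the speed $|d\phi_{t}/d\theta|$ on a grid. The circle, the particular field $u(\theta)=\cos 3\theta$ , and the function names are illustrative assumptions and are not part of the construction in this paper.

```python
import math

def deformed_speed(r, t, theta,
                   u=lambda th: math.cos(3 * th),
                   du=lambda th: -3 * math.sin(3 * th)):
    """Speed |d(phi_t)/d(theta)| for phi_t(theta) = (r + t*u(theta)) * (cos theta, sin theta).

    Differentiating gives t*u'(theta)*e_r + (r + t*u(theta))*e_theta with e_r, e_theta
    orthonormal, so the speed is the Euclidean norm of these two coefficients.
    """
    radial = t * du(theta)
    tangential = r + t * u(theta)
    return math.hypot(radial, tangential)

def min_speed(r, t, samples=2000):
    """Smallest sampled speed; a positive value means no sampled point fails to be immersed."""
    return min(deformed_speed(r, t, 2 * math.pi * i / samples) for i in range(samples))
```

For $|t|<r$ the tangential coefficient is at least $r-|t|>0$ , so `min_speed(r, t)` stays positive. At $t=r$ the sampled speed is numerically zero at $\theta=\pi$ , where $u=-1$ and $u^{\prime}=0$ , so in this toy case the immersion property is lost exactly at $t=K^{-1}$ .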

Thus $\phi_{t}$ is an embedding if it is injective. Theorem 5.2 proves injectivity for $|t|\leq t^{*}$ , where $t^{*}$ is defined in the Theorem statement. The proof of Theorem 5.2 follows after the proofs of Lemmas 5.3-5.5 and Proposition 5.6.

Theorem 5.2.

Let $u$ be a normal vector field of length at most one along $\phi(M)\subset\mathbb{R}^{N}$ . Let $t^{*}=\min\{K^{-1},\delta/3\}$ . Then $\phi_{t}:M\rightarrow\mathbb{R}^{N}$ given by $m\mapsto\phi(m)+tu_{\phi(m)}$ is injective for $|t|\leq t^{*}$ .

Here $\delta$ is given by §5.1(8), and will be estimated explicitly after the proof of Proposition 5.6.

Proof. As in the previous proof, it suffices to show that $F_{t}:\phi(M)\longrightarrow\phi_{t}(M)$ is injective. We extend $F_{t}$ to a map between open subsets of ${\mathbb{R}}^{N}$ by setting

H_{t}:B_{\epsilon-t}(\phi(M))\longrightarrow B_{\epsilon}(\phi(M)),\ \ H_{t}(b)=b+tu_{q(b)},

where $q(b)$ is the closest point in $\phi(M)$ to $b$ . Note that $H_{t}|_{\phi(M)}=F_{t}$ and that $H_{t}$ is defined only for $|t|<\epsilon.$

We now proceed with a series of Lemmas.

Lemma 5.3.

For each $q\in\phi(M)$ , there exists a ball $B_{\delta^{q}_{H_{t}}}(q)$ of radius $\delta_{H_{t}}^{q}$ around $q$ on which $H_{t}$ is a diffeomorphism.

Proof.

In normal coordinates, we have

H_{t}(b)=H_{t}(q^{1},\ldots,q^{k},v^{1},\ldots,v^{N-k})=(q^{1},\ldots,q^{k},v^{1}+tu^{1}_{q(b)},\ldots,v^{N-k}+tu^{N-k}_{q(b)}).

For $q=(q,0)\in\phi(M)$ , the differential of the $H_{t}$ map has the matrix

DH_{t}(q)=\left(\begin{array}{c|c}{\rm Id}_{k\times k}&\frac{\partial q^{i}}{\partial v^{j}}\\ \hline\frac{\partial(v^{i}+tu^{i}_{q})}{\partial q^{j}}&\frac{\partial(v^{i}+tu^{i}_{q})}{\partial v^{j}}\end{array}\right)=\left(\begin{array}{c|c}{\rm Id}_{k\times k}&0\\ \hline\frac{\partial(v^{i}+tu^{i}_{q})}{\partial q^{j}}&{\rm Id}_{(N-k)\times(N-k)}\end{array}\right).

This matrix is invertible, so the Lemma follows from the inverse function theorem. ∎

Let $\delta_{H_{t}}=\min_{q}\{\delta_{H_{t}}^{q}\}$ . Set

\delta_{H}=\min\{\delta_{H_{t}}:|t|\leq.999\epsilon\}.

(5.1)

From the proof of the inverse function theorem, we can choose $\delta^{q}_{H_{t}}>0$ to be continuous in $t$ . We need $|t|<\epsilon$ , and then the further restriction $|t|\leq.999\epsilon$ ensures that $t$ lies in a compact subset of ${\mathbb{R}}$ . Thus $\delta_{H_{t}}$ and $\delta_{H}$ are positive. Note that $\delta_{H}=\delta_{H}(u)$ depends on the choice of the normal vector field $u$ .

Lemma 5.4.

$H_{t}|_{\phi(M)}$ is injective for $|t|<t^{*}\stackrel{{\scriptstyle\rm def}}{{=}}\min\left\{\epsilon,\delta_{H}/3\right\}$ .

Proof.

Assume instead that there exist distinct $x,y\in\phi(M)$ such that $x+tu_{x}=y+tu_{y}$ for some $|t|<t^{*}$ . By Lemma 5.3, $d_{\mathbb{R}^{N}}(x,y)>\delta_{H_{t}}$ . Then

\begin{aligned}\delta_{H_{t}}&<d_{\mathbb{R}^{N}}(x,y)=\|x-y\|=\|x-(x+tu_{x})+(x+tu_{x})-y\|\\ &\leq\|x-(x+tu_{x})\|+\|(y+tu_{y})-y\|=\|tu_{x}\|+\|tu_{y}\|\leq 2|t|<2t^{*}\\ &\leq 2\delta_{H_{t}}/3,\end{aligned}

since $t^{*}\leq\delta_{H}/3\leq\delta_{H_{t}}/3.$ This is a contradiction. ∎

We now compute $\epsilon$ in §5.1(1) in terms of $K$ in §5.1(7) and $\delta$ in §5.1(8). As mentioned above, $K$ is computed locally on $\phi(M)$ , while $\delta$ is computed globally using the Euclidean distance.

Lemma 5.5.

Set $\epsilon=\min\left\{K^{-1},\delta/3\right\}$ , where $K$ is given in §5.1(7) and $\delta$ is given in §5.1(8). Then every point in $B_{\epsilon}(\phi(M))$ has a unique closest point in $\phi(M).$

Proof.

By [31, Lem. 6.3], the focal points (Def. 5.1) of $\phi(M)$ along the normal line $l=q+tv$ are precisely the points $q+p_{i}^{-1}v$ , where the $p_{i}$ are the nonzero principal curvatures. The proof of the $\epsilon$ -Neighborhood Theorem in [26, Thm. 6.24] uses the invertibility of the endpoint map, so we must have $\epsilon<K^{-1}.$

Suppose there exists $b\in B_{\epsilon}(\phi(M))$ with two distinct closest points $x,y\in\phi(M)$ . Then $b=x+tv_{x}=y+t^{\prime}v_{y}$ for unit normal vectors $v_{x}$ at $x$ , $v_{y}$ at $y$ , and $|t|,|t^{\prime}|<\epsilon.$ By definition of $\delta$ , we have $d_{\mathbb{R}^{N}}(x,y)>\delta$ . As in the previous proof, we have

\begin{aligned}\delta<d_{\mathbb{R}^{N}}(x,y)&=\|x-y\|=\|x-(x+tv_{x})+(y+t^{\prime}v_{y})-y\|\\ &\leq|t|\|v_{x}\|+|t^{\prime}|\|v_{y}\|<2\epsilon\leq 2\delta/3,\end{aligned}

a contradiction. ∎

We can now define $\delta$ in (5.2) below, after which we explicitly estimate it in the proof of Proposition 5.6. The steps of the estimate are recapped in Remark 5.2. We first restrict the endpoint map $E:\nu_{\phi}\longrightarrow\mathbb{R}^{N}$ to the compact set $W=\{v\in\nu_{\phi}:|v|\leq.999K^{-1}\}.$ For fixed $q_{0}\in\phi(M)$ and $(q_{0},v_{0})\in\nu_{\phi,q_{0}}\cap W=W_{q_{0}}$ , the proof of Lemma 5.3 shows that $DE(q_{0},v_{0})$ is invertible. Therefore, there is a ball of radius $\delta(q_{0},v_{0})>0$ around $(q_{0},v_{0})$ on which $E$ is a diffeomorphism. Set $\delta_{q_{0}}=\delta(q_{0},0)$ and

A_{q_{0}}=\{q\in\phi(M):d_{\mathbb{R}^{N}}(q,q_{0})<\delta_{q_{0}}/2\}.

We claim that $E$ is a diffeomorphism on the set $B_{q_{0}}\subset\nu_{\phi}$ given by

B_{q_{0}}=\{(q,v):|v|<\delta_{q_{0}}/2,q\in A_{q_{0}}\}.

Indeed, for $(q_{1},v_{1})\in B_{q_{0}}$ , we have

|(q_{1},v_{1})-(q_{0},0)|\leq|(q_{1},v_{1})-(q_{1},0)|+|(q_{1},0)-(q_{0},0)|<\delta_{q_{0}}/2+\delta_{q_{0}}/2=\delta_{q_{0}}.

Thus for $(q_{1},v_{1}),(q_{2},v_{2})\in B_{q_{0}}$ and $(q_{1},v_{1})\neq(q_{2},v_{2})$ , we conclude $(q_{1},0),(q_{2},0)\in A_{q_{0}}$ and $E(q_{1},v_{1})\neq E(q_{2},v_{2})$ . Since $E$ is invertible on $B_{q_{0}},$ it is a diffeomorphism onto its image.

We set

\delta=\frac{1}{2}\min\{\delta(q_{0},v_{0}):(q_{0},v_{0})\in\nu_{\phi},|v_{0}|\leq.999K^{-1}\}.

(5.2)

Since $M$ is compact and $|v_{0}|$ lies in a compact interval, $\delta$ is positive. In other words, for $q_{1},q_{2}\in\phi(M)$ with $d_{\mathbb{R}^{N}}(q_{1},q_{2})<\delta$ , we have $q_{1}+v_{1}\neq q_{2}+v_{2}$ for $|v_{1}|,|v_{2}|<\delta$ and $(q_{1},v_{1})\in\nu_{\phi,q_{1}},(q_{2},v_{2})\in\nu_{\phi,q_{2}}.$

For a fixed $(q_{0},v_{0})$ , it remains to compute $\delta(q_{0},v_{0})$ explicitly, after which $\delta$ in (5.2) is explicit. The computation of $\delta(q_{0},v_{0})$ uses a quantitative version [28] of the Implicit Function Theorem given in the next Proposition. The proof is in the Appendix.

To set the notation, let the matrix norm $\|A\|$ be the sup norm of the absolute values of the entries. For $G\in C^{1}(\mathbb{R}^{m+n},\mathbb{R}^{m})$ , let $(s_{0},y_{0})\in\mathbb{R}^{m}\times{\mathbb{R}}^{n}$ satisfy $G(s_{0},y_{0})=0$ . For fixed $\delta>0$ , set $V_{\delta}=V_{\delta(s_{0},y_{0})}=\{(s,y)\in\mathbb{R}^{m+n}:|s-s_{0}|\leq\delta,|y-y_{0}|\leq\delta\}$ . We focus on the case $G(s,y)=E(s)-y$ for $m=n$ , the usual method to derive the Inverse Function Theorem from the Implicit Function Theorem.

Proposition 5.6.

Assume that the $m\times m$ matrix $\partial_{s}G(s_{0},y_{0})$ of partial derivatives of $G$ in the $s$ directions is invertible. Choose $\delta^{0}>0$ such that

\sup_{(s,y)\in V_{\delta^{0}}}\|{\rm Id}-[\partial_{s}G(s_{0},y_{0})]^{-1}\partial_{s}G(s,y)\|\leq 1/2.

(5.3)

Set
(I) $B_{\delta^{0}}=\sup_{(s,y)\in V_{\delta^{0}}}\|\partial_{y}G(s,y)\|$ ,
(II) $P=\|\partial_{s}G(s_{0},y_{0})^{-1}\|$ ,
(III) $\delta^{1}=(2PB_{\delta^{0}})^{-1}\delta^{0}$ .
Then for the case $n=m$ and $G(s,y)=E(s)-y$ , on the set $\{(s,y):\|s-s_{0}\|<\delta^{0},\|y-y_{0}\|<\delta^{1},G(s,y)=0\}$ , $E$ has a $C^{1}$ inverse: $E(s)=y$ iff $s=E^{-1}(y).$ Equivalently, $E$ is a $C^{1}$ diffeomorphism on

E^{-1}(B_{\delta^{1}}(y_{0}))\cap B_{\delta^{0}}(s_{0}).

(5.4)
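To make the constants in Proposition 5.6 concrete, the following sketch evaluates them for a toy one-dimensional map $E(s)=s+s^{2}/4$ at $s_{0}=0$ , where $\delta^{0}=1$ satisfies (5.3) with equality at the endpoints, $P=B_{\delta^{0}}=1$ , and $\delta^{1}=1/2$ . The map $E$ , the grid-based check of (5.3), and the function names are assumptions made for illustration; a sampled supremum is not a rigorous bound.

```python
def E(s):
    """Toy one-dimensional stand-in for the endpoint map (illustrative assumption)."""
    return s + 0.25 * s * s

def dE(s):
    """Derivative of the toy map E."""
    return 1.0 + 0.5 * s

def qift_radii(s0, delta0, samples=1001):
    """Check (5.3) on |s - s0| <= delta0 by sampling, then return (P, B, delta1).

    For G(s, y) = E(s) - y the y-derivative is -1, so B = 1.
    """
    a_inv = 1.0 / dE(s0)
    worst = max(abs(1.0 - a_inv * dE(s0 - delta0 + 2 * delta0 * i / (samples - 1)))
                for i in range(samples))
    if worst > 0.5 + 1e-12:
        raise ValueError("delta0 too large: condition (5.3) fails on the sample grid")
    P = abs(a_inv)
    B = 1.0
    return P, B, delta0 / (2.0 * P * B)

def invert(y, s0, iterations=200):
    """Solve E(s) = y by iterating the contraction s -> s - dE(s0)^{-1} (E(s) - y)."""
    s = s0
    for _ in range(iterations):
        s = s - (E(s) - y) / dE(s0)
    return s
```

For $|y|<\delta^{1}=1/2$ , `invert(y, 0.0)` converges to a solution with $|s|\leq\delta^{0}=1$ , in line with the conclusion of the Proposition. Calling `qift_radii(0.0, 1.5)` raises an error, since (5.3) fails at $s=-1.5$ .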

To apply the Proposition, we set $n=m=N$ and $G((q,v),y)=E(q,v)-y$ , where $E$ is the endpoint map. We follow the Proposition’s labeling in a series of steps:

Criterion I: Independent of the value of $\delta^{0}=\delta^{0}((q_{0},v_{0}),y_{0})$ , we have

\begin{aligned}B_{\delta^{0}}&=\sup_{((q,v),y)\in V_{\delta^{0}}}\|\partial_{y}G((q,v),y)\|=\sup_{((q,v),y)\in V_{\delta^{0}}}\|\partial_{y}(E(q,v)-y)\|\\ &=\sup_{((q,v),y)\in V_{\delta^{0}}}\|-{\rm Id}\|=1.\end{aligned}

Criterion II: By §5.1(4),(7),

\partial_{(q,v)}G((q_{0},v_{0}),y_{0})=DE(q_{0},v_{0})

is invertible for $|v|<K^{-1}$ . In the notation of §5.1(4),

DE(q_{0},v_{0})=\left(\begin{array}{cccccc}\left.\left(\frac{\partial x^{1}}{\partial q^{1}}+v^{i}\frac{\partial w_{i}^{1}}{\partial q^{1}}\right)\right|_{(q_{0},v_{0})}&\cdots&\left.\left(\frac{\partial x^{1}}{\partial q^{k}}+v^{i}\frac{\partial w_{i}^{1}}{\partial q^{k}}\right)\right|_{(q_{0},v_{0})}&w^{1}_{1,q_{0}}&\cdots&w_{N-k,q_{0}}^{1}\\ \vdots&&\vdots&\vdots&&\vdots\\ \left.\left(\frac{\partial x^{N}}{\partial q^{1}}+v^{i}\frac{\partial w_{i}^{N}}{\partial q^{1}}\right)\right|_{(q_{0},v_{0})}&\cdots&\left.\left(\frac{\partial x^{N}}{\partial q^{k}}+v^{i}\frac{\partial w_{i}^{N}}{\partial q^{k}}\right)\right|_{(q_{0},v_{0})}&w^{N}_{1,q_{0}}&\cdots&w_{N-k,q_{0}}^{N}\end{array}\right)

(5.8)

By Cramer’s rule,

P=\|DE(q_{0},v_{0})^{-1}\|=|\det(DE(q_{0},v_{0}))|^{-1}\,\|DE(q_{0},v_{0})^{*}\|,

(5.9)

where $DE(q_{0},v_{0})^{*}_{(i,j)}$ is the usual minor of $DE(q_{0},v_{0})$ obtained by deleting the $i^{\rm th}$ row and $j^{\rm th}$ column. Since $\phi$ and the $w_{i}$ are given, we obtain an estimate for $P$ .

Criterion III: We now compute $\delta^{1}=\delta^{1}(q_{0},v_{0}),\delta^{0}=\delta^{0}(q_{0},v_{0})$ such that (5.3) holds for $((q,v),y)$ . Since (5.3) is independent of $y$ in our case, we need $\delta^{0}(q_{0},v_{0})$ such that

|(q,v)-(q_{0},v_{0})|<\delta^{0}(q_{0},v_{0})\Rightarrow\|{\rm Id}-[DE(q_{0},v_{0})]^{-1}DE(q,v)\|\leq 1/2.

(5.10)

We consider a first order Taylor expansion of $DE(q,v)$ around $s_{0}=(q_{0},v_{0})$ . (Note: The summed index $j$ below refers to coordinates in $\mathbb{R}^{N}$ , not an exponent). For $s=(q,v)$ , we have

\begin{aligned}DE(s)&=DE(s_{0})+\left(\begin{array}{ccc}R^{(1,1)}_{j}(q,v)(s-s_{0})^{j}&\cdots&R^{(1,N)}_{j}(q,v)(s-s_{0})^{j}\\ \vdots&&\vdots\\ R^{(N,1)}_{j}(q,v)(s-s_{0})^{j}&\cdots&R^{(N,N)}_{j}(q,v)(s-s_{0})^{j}\end{array}\right)\\ &\stackrel{{\scriptstyle\rm def}}{{=}}DE(s_{0})+(R^{(p,r)}_{j}(q,v)(s-s_{0})^{j}).\end{aligned}

As in Criterion II, set $f^{r}_{p}=\frac{\partial x^{r}(q)}{\partial q^{p}}+v^{i}\frac{\partial w_{i}^{r}(q)}{\partial q^{p}}$ for all $1\leq p\leq N$ , $1\leq r\leq k$ , and $f^{r}_{p}=w^{r}_{p,q}$ for $1\leq p\leq N$ , $k+1\leq r\leq N$ . A uniform bound on the error term is given by Taylor’s theorem with integral remainder:

\begin{aligned}\left\|R^{(p,r)}_{j}(q,v)(s-s_{0})^{j}\right\|&\leq\left\|\int_{0}^{1}(1-t)\partial_{j}f^{r}_{p}((1-t)(q_{0},v_{0})+t(q,v))dt\right\|\cdot\left\|(s-s_{0})^{j}\right\|\\ &\leq\max\left\{\left\|\partial_{j}f^{r}_{p}(q,v)\right\|:1\leq j\leq N,\|v\|\leq.999K^{-1},q\in\phi(M)\right\}\|s-s_{0}\|\\ &\stackrel{{\scriptstyle\rm def}}{{=}}L^{(p,r)}_{j}\|s-s_{0}\|.\end{aligned}

Here $\partial_{j}$ differentiates in the $s$ variable. Set

{L}=\max_{j,p,r}\{{L}^{(p,r)}_{j}\}.

(5.15)

Plugging the Taylor expansion of $DE(s)$ above into the right hand side of (5.10) and canceling the identity matrix, the matrix norm in (5.10) becomes

\begin{aligned}\left\|[DE(q_{0},v_{0})]^{-1}(R^{(p,r)}_{j}(q,v)(s-s_{0})^{j})\right\|&=\max_{j,p,r}\left\|([DE(q_{0},v_{0})]^{-1})^{p}_{\ell}(R^{(\ell,r)}_{j}(q,v)(s-s_{0})^{j})\right\|\\ &\leq N\|[DE(q_{0},v_{0})]^{-1}\|\cdot{L}\cdot\delta^{0}(q_{0},v_{0}),\end{aligned}

(5.16)

where the $N$ comes from the sum over $\ell=1,\ldots,N$ . Setting

\delta^{0}(q_{0},v_{0})=\left[2N\|DE(q_{0},v_{0})^{-1}\|\cdot{L}\right]^{-1},

(5.17)

we conclude that the estimate (5.10) is satisfied.

In summary, we now have

\delta^{1}(q_{0},v_{0})=(2PB_{\delta^{0}(q_{0},v_{0})})^{-1}\delta^{0}(q_{0},v_{0})=(2P)^{-1}\delta^{0}(q_{0},v_{0}),

(5.18)

by Criterion I. Thus $\delta^{1}(q_{0},v_{0})$ is estimated by Criteria II and III.

By Proposition 5.6, $E$ is a diffeomorphism on $E^{-1}(B_{\delta^{1}(q_{0},v_{0})}(y_{0}))\cap B_{\delta^{0}(q_{0},v_{0})}(q_{0},v_{0})$ . To be explicit, we want to find a radius $\delta(q_{0},v_{0})$ such that

B_{\delta(q_{0},v_{0})}(q_{0},v_{0})\subset E^{-1}(B_{\delta^{1}(q_{0},v_{0})}(y_{0}))\cap B_{\delta^{0}(q_{0},v_{0})}(q_{0},v_{0}).

(5.19)

We first find $\delta^{2}(q_{0},v_{0})$ such that

|(q,v)-(q_{0},v_{0})|<\delta^{2}(q_{0},v_{0})\Rightarrow|E(q,v)-E(q_{0},v_{0})|=|E(q,v)-y_{0}|<\delta^{1}(q_{0},v_{0}).

In other words, we want

|(q,v)-(q_{0},v_{0})|<\delta^{2}(q_{0},v_{0})\Rightarrow E(q,v)\in B_{\delta^{1}(q_{0},v_{0})}(y_{0}).

(5.20)

As above, we compute $\delta^{2}(q_{0},v_{0})$ by a Taylor series expansion of $E$ around $(q_{0},v_{0})$ :

E(q,v)=E(q_{0},v_{0})+\left(\sum\limits_{j}R^{1}_{j}(q,v)((q,v)-(q_{0},v_{0}))^{j},\ldots,\sum\limits_{j}R^{N}_{j}(q,v)((q,v)-(q_{0},v_{0}))^{j}\right),

with

\begin{aligned}\|R^{p}_{j}(q,v)\|&\leq\max\left\{\left\|\partial_{j}(\phi^{p}+v^{i}w_{i}^{p})(q,v)\right\|:1\leq j\leq N,\|v\|\leq.999K^{-1},q\in\phi(M)\right\}\\ &\stackrel{{\scriptstyle\rm def}}{{=}}{S}^{p}.\end{aligned}

(5.21)

For $s_{0}=(q_{0},v_{0}),s=(q,v)$ , we have

\begin{aligned}\|E(s)-E(s_{0})\|^{2}&=\sum\limits_{p=1}^{N}\left(\sum\limits_{j}R^{p}_{j}(s)(s-s_{0})^{j}\right)^{2}\leq\sum\limits_{p=1}^{N}\left(\sum\limits_{j}\|R^{p}_{j}(s)\|^{2}\right)\|s-s_{0}\|^{2}\\ &\leq N\left(\sum\limits_{p=1}^{N}\|{S}^{p}\|^{2}\right)\|s-s_{0}\|^{2}\leq\sum\limits_{p=1}^{N}\sum\limits_{j}\|{S}^{p}\delta^{2}(q_{0},v_{0})\|^{2}.\end{aligned}

Therefore, for

\delta^{2}(q_{0},v_{0})=\delta^{1}(q_{0},v_{0})\left(N\sum\limits_{p=1}^{N}|{S}^{p}|^{2}\right)^{-1/2},

(5.22)

estimate (5.20) holds. Finally, setting

\delta(q_{0},v_{0})=\min\{\delta^{2}(q_{0},v_{0}),\delta^{0}(q_{0},v_{0})\}

(5.23)

accomplishes (5.19).

By Lemmas 5.4, 5.5, and using (5.2) to define $\delta$ , we know that Theorem 5.2 holds, i.e., $\phi_{t}$ is injective, for

t^{*}<\min\{K^{-1},\delta_{H}/3,\delta/3\}.

(5.24)

If we prove that $\delta_{H}>\delta$ , then we get injectivity of $\phi_{t}$ for $t^{*}<\min\{K^{-1},\delta/3\}$ , which is Theorem 5.2.

By the definition of $\delta$ in §5.1(8), we have $x,y\in\phi(M)$ and $d_{\mathbb{R}^{N}}(x,y)<\delta$ implies $x+t_{1}v_{x}\neq y+t_{2}v_{y}$ for $|t_{i}|<\epsilon$ and for any unit normal vectors $v_{x},v_{y}$ at $x,y,$ resp. By Lemma 5.3, for $d_{\mathbb{R}^{N}}(x,y)<\delta_{H_{t}}=\delta_{H_{t}}(u)$ for a fixed normal vector field $u$ of length at most one, we have $x+tu_{x}\neq y+tu_{y}$ . (By the remarks above Lemma 5.3, we also have $|t|<\epsilon$ here.) Since $\delta$ does not depend on a choice of vector field $u$ , we have $\delta\leq\delta_{H_{t}}(u).$ This implies $\delta\leq\delta_{H}.$ Thus we can conclude that $\phi_{t}$ is injective for $t^{*}<\min\{K^{-1},\delta/3\}$ , and the proof of Theorem 5.2 is complete.

Remark 5.2.

We review the explicit lower bound for $\delta$ . For $L$ defined by (5.15), $\delta^{0}(q_{0},v_{0})$ is defined by (5.17). For $P$ defined by (5.9), $\delta^{1}(q_{0},v_{0})$ is defined by (5.18). For $S^{p}$ defined in (5.21), $\delta^{2}(q_{0},v_{0})$ is defined in (5.22). Then (5.23) defines $\delta(q_{0},v_{0}).$ Finally, (5.2) defines $\delta.$

In particular, upper bounds on $L,$ $P$ , and $S^{p}$ will give a lower bound on $\delta,$ since these constants enter (5.17), (5.18), and (5.22) through their inverses. These constants depend on $q$ -derivatives (i.e., $M$ coordinate derivatives) of the ${\mathbb{R}}^{N}$ coordinates of $\phi$ and of vectors in $\nu_{\phi}$ (see e.g., (5.8)). Since the normal bundle is determined by $M$ and $\phi$ , our estimates are explicit in the sense of Remark 5.1.
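The chain of radii recapped in this Remark can be written as a short computation. The sketch below assumes that the constants $P$ , $L$ , and $S^{p}$ have already been bounded at each sample point; the function names are hypothetical, and taking the minimum in (5.2) over a finite sample of $(q_{0},v_{0})$ only approximates the minimum over the compact set $W$ .

```python
import math

def delta_radii(N, P, L, S):
    """Radii from (5.17), (5.18), (5.22), (5.23) at a single point (q0, v0).

    N : ambient dimension
    P : sup-norm of DE(q0, v0)^{-1}, as in (5.9)
    L : derivative bound from (5.15)
    S : sequence (S^1, ..., S^N) of bounds from (5.21)
    """
    if len(S) != N:
        raise ValueError("S must have one entry per ambient coordinate")
    delta0 = 1.0 / (2.0 * N * P * L)                         # (5.17)
    delta1 = delta0 / (2.0 * P)                              # (5.18), using B = 1
    delta2 = delta1 / math.sqrt(N * sum(s * s for s in S))   # (5.22)
    return delta0, delta1, delta2, min(delta2, delta0)       # (5.23)

def delta_global(per_point_radii):
    """(5.2) restricted to a finite sample: half the minimum of delta(q0, v0)."""
    return 0.5 * min(per_point_radii)
```

For example, `delta_radii(2, 1.0, 1.0, (1.0, 1.0))` returns `(0.25, 0.125, 0.0625, 0.0625)`. Increasing any of $P$ , $L$ , or $S^{p}$ shrinks every radius.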

5.3. The Main Theorem

Since $M$ is compact and since $\phi_{t}$ is an injective immersion for $|t|\leq t^{*}$ by Theorem 5.2, by Prop. 1.1 we obtain the main result that $\phi_{t}$ is an embedding for $t$ less than an explicit $t^{*}$ .

Theorem 5.7.

Let $u$ be a normal vector field of length at most one along $\phi(M)\subset\mathbb{R}^{N}$ . Let $t^{*}=\min\{K^{-1},\delta/3\}$ , with $K$ defined in §5.1(7) and $\delta$ estimated in Remark 5.2. Then $\phi_{t}:M\rightarrow\mathbb{R}^{N}$ given by $m\mapsto\phi(m)+tu_{\phi(m)}$ is an embedding for $|t|\leq t^{*}$ .
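In a discretized flow, Theorem 5.7 suggests capping each step length by $t^{*}$ . A minimal sketch follows, assuming that $K$ and $\delta$ have already been estimated for the current embedding and that the normal field has been rescaled to length at most one; the function names are hypothetical.

```python
def t_star(K, delta):
    """Step bound t* = min{K^{-1}, delta/3} from Theorem 5.7."""
    if K <= 0 or delta <= 0:
        raise ValueError("K and delta must be positive")
    return min(1.0 / K, delta / 3.0)

def clamp_step(requested, K, delta):
    """Shrink a requested step length so that it does not exceed t*."""
    return min(requested, t_star(K, delta))
```

With $K=4$ and $\delta=1.5$ , the curvature term is the binding one and `t_star(4.0, 1.5)` equals `0.25`. With $K=1$ the reach term binds and the bound is `0.5`.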

6. Discussion

In this paper, we have proposed treating manifold learning by gradient flow techniques that are standard in much of machine learning. By doing gradient flow in the infinite dimensional space of embeddings of a fixed manifold $M$ into ${\mathbb{R}}^{N}$ , we avoid parametric and RKHS methods. These methods typically restrict the class of manifolds considered to a finite dimensional space, which speeds up computation time at the cost of perhaps oversimplifying the problem. In our approach, we give both a theoretical reason to move only in normal directions to the embedded manifold and theoretical lower bounds on the existence for each step of a good discretized version of gradient flow on the space of embeddings. However, this paper does not discuss computational issues, which must be addressed in future work. In particular, one has to recompute the estimates for the maximal time $t^{*}$ of travel after each step. This reflects the theoretical issue that the gradient flow may leave the space of embeddings in finite time. It may be possible to add a penalty term to the objective function that forces the gradient flow to stay in the space of embeddings. This new term would involve the bounds we computed on both local quantities like $K$ and global quantities like $\delta$ in §5.1.

There are several practical and theoretical issues raised by this approach. On the practical side, if $M$ flows discretely in $k$ steps to a Riemannian manifold $M_{k}$ with a thin neck, as typically happens in mean curvature flow, then in Thm. 5.7 $K$ will be very large and $t^{*}=t^{*}_{k}$ will be small at $M_{k}$ . Thus the discretized gradient flow will essentially stop. It may be reasonable to pick the first $k$ such that $K$ at $M_{k}$ exceeds a specified threshold. We then backtrack to $M_{k-1}$ (or even further back to some $M_{k-r}$ for some $r>1$ ) and move to $M^{\prime}_{k}$ using the gradient at $M_{k-1}$ and new step size $\bar{t}_{k}<t^{*}_{k}$ , e.g., $\bar{t}_{k}=(1/2)t^{*}_{k}.$ Since the gradient vector field at $M^{\prime}_{k}$ is different from the gradient vector field at $M_{k}$ , the discretized flow may move $M^{\prime}_{k}$ to $M^{\prime}_{k+1}$ with $K$ at $M^{\prime}_{k+1}$ still below the threshold. Thus we may be able to extend the flow for an increased number of steps.
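The backtracking idea above can be sketched as a loop. This is one possible variant, not a tested algorithm: the loop retries from the previous state with a halved step whenever the curvature bound of the candidate exceeds the threshold. The callables `step`, `K_of`, and `delta_of` are hypothetical placeholders for the gradient step and for the estimates of $K$ and $\delta$ .

```python
def flow_with_backtracking(state, step, K_of, delta_of, K_max,
                           max_steps=50, max_halvings=5):
    """Discretized flow that backtracks when the curvature bound exceeds K_max.

    state    : current representation of the embedded manifold
    step     : step(state, t) -> new state after moving distance t along the unit normal field
    K_of     : K_of(state) -> curvature bound K at that state
    delta_of : delta_of(state) -> the quantity delta at that state
    """
    history = [state]
    for _ in range(max_steps):
        t = min(1.0 / K_of(state), delta_of(state) / 3.0)   # t* at the current state
        candidate = step(state, t)
        halvings = 0
        while K_of(candidate) > K_max and halvings < max_halvings:
            t *= 0.5                      # backtrack: retry from the same state
            candidate = step(state, t)
            halvings += 1
        if K_of(candidate) > K_max:
            break                         # could not stay below the threshold
        state = candidate
        history.append(state)
    return history
```

As a toy usage, a circle of radius $r$ shrinking along its inward unit normal can be modeled by `step = lambda r, t: r - t` and `K_of = lambda r: 1.0 / r` , with `delta_of = lambda r: r` used as a stand-in for $\delta$ (an assumption, not the value given by (5.2)). Every state in the returned history then has $K\leq K_{\max}$ .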

There are two theoretical issues that need further examination. The first is the choice of $M$ : how is this manifold specified? Based on Riemannian geometry estimates dating to the 1980s, it is reasonable to assume that we want to consider manifolds of a fixed dimension with a priori a lower bound on volume, an upper bound on diameter, and two-sided bounds on sectional curvature. Cheeger’s finiteness theorem [8] asserts that there are only a finite number of diffeomorphism classes among all such manifolds. (It would be interesting to determine if the class ${\mathcal{G}}(d,V,\tau)$ in [20] has a similar finiteness theorem. We note that the approach of Fefferman et al. has the strong advantage of not specifying the diffeomorphism type of $M$ .) However, while this in theory provides us with a finite list of choices, the proof of the finiteness theorem is nonconstructive. In practice, in many cases we might as well assume that $M$ is the closed unit ball $B^{k}$ in ${\mathbb{R}}^{k}$ . For example, in the famous Swiss roll examples, the data set appears to lie on the image of a severely deformed $B^{2}$ . In contrast, if the training data appears to lie on a deformed torus, $B^{2}$ is a worse choice for $M$ than the standard torus.

Perhaps even more importantly, it is unclear how to specify the dimension of $M$ in advance. This has been discussed in the literature: see e.g. [42] and its references for work done before the last decade, and [22] for more recent work. In these works, issues such as the potentially fractal/Hausdorff dimension of the data set have been discussed. From a more geometric mindset, we could speculatively start with a $k$ -manifold, and hope that in the long run, $M$ would collapse in the sense of Cheeger-Gromov [9] to a lower dimensional manifold of “best” dimension. This would address the issue that the initial choice for $M$ has to be modified as more data is considered. Even more speculatively, since all Riemannian manifolds are via cut locus arguments homeomorphic to a closed ball with gluings on the boundary, we could start with the $k$ -ball $B^{k}$ , add a regularization term, like the volume of $\partial B^{k}=S^{k-1},$ that penalizes the existence of a boundary, and hope that long time flow provides both dimension collapse and boundary gluing. We have no evidence that this will work, but a low dimensional computation is potentially feasible.

Acknowledgements

Our thanks to Carlangelo Liverani for allowing us to use his Quantitative Implicit Function Theorem. We are also grateful to Qinxun Bai, Andres Larrain-Hubach, Drew Lohn, and the referee for their helpful suggestions.

Appendix A The Quantitative Implicit Function Theorem

This quantitative version of the Implicit Function theorem and its proof are from [28] (see also [10, Appendix A]).

For notation, recall that $\|A\|$ is the sup norm of the absolute values of the entries of a matrix $A$ . For fixed $(x_{0},\lambda_{0})\in{\mathbb{R}}^{m}\times{\mathbb{R}}^{n}$ and fixed $\delta>0$ , set $V_{\delta}=V_{\delta(x_{0},\lambda_{0})}=\{(x,\lambda)\in\mathbb{R}^{m+n}:|x-x_{0}|\leq\delta,|\lambda-\lambda_{0}|\leq\delta\}$ .

For $F\in C^{1}(\mathbb{R}^{m+n},\mathbb{R}^{m})$ , let $(x_{0},\lambda_{0})\in\mathbb{R}^{m}\times\mathbb{R}^{n}$ satisfy $F(x_{0},\lambda_{0})=0$ .

Theorem A.1 (Quantitative Implicit Function Theorem).

Assume that the $m\times m$ matrix $\partial_{x}F(x_{0},\lambda_{0})$ is invertible and choose $\delta>0$ such that

\sup_{(x,\lambda)\in V_{\delta}}||{\rm Id}-[\partial_{x}F(x_{0},\lambda_{0})]^{-1}\partial_{x}F(x,\lambda)||\leq 1/2.

Let $B_{\delta}=\sup_{(x,\lambda)\in V_{\delta}}||\partial_{\lambda}F(x,\lambda)||$ and $M=||\partial_{x}F(x_{0},\lambda_{0})^{-1}||$ . Set $\delta^{1}=(2MB_{\delta})^{-1}\delta$ , and set $\Gamma_{\delta^{1}}=\{\lambda\in\mathbb{R}^{n}:|\lambda-\lambda_{0}|<\delta^{1}\}$ , $V_{\delta,\delta^{1}}=\{(x,\lambda)\in\mathbb{R}^{m+n}:|x-x_{0}|\leq\delta,|\lambda-\lambda_{0}|\leq\delta^{1}\}$ .

Then there exists $g\in C^{1}(\Gamma_{\delta^{1}},\mathbb{R}^{m})$ such that all solutions of the equation $F(x,\lambda)=0$ in the set $V_{\delta,\delta^{1}}$ are given by $(g(\lambda),\lambda)$ . In addition, $\partial_{\lambda}g(\lambda)=-(\partial_{x}F(g(\lambda),\lambda))^{-1}\partial_{\lambda}F(g(\lambda),\lambda)$ .

Proof.

Take $\lambda\in\Gamma_{\delta^{1}}=\{|\lambda-\lambda_{0}|<\delta^{1}\}$ . Consider $U_{\delta}=\{x\in\mathbb{R}^{m}:|x-x_{0}|\leq\delta\}$ and $\Omega_{\lambda}:U_{\delta}\rightarrow\mathbb{R}^{m}$ defined by

\Omega_{\lambda}(x)=x-\partial_{x}F(x_{0},\lambda_{0})^{-1}F(x,\lambda).

For $x\in U_{\delta},F(x,\lambda)=0$ is equivalent to $x=\Omega_{\lambda}(x)$ . We have

|\Omega_{\lambda}(x_{0})-\Omega_{\lambda_{0}}(x_{0})|\leq M|F(x_{0},\lambda)-F(x_{0},\lambda_{0})|\leq MB_{\delta}\delta^{1}.

In addition, $|\partial_{x}\Omega_{\lambda}|=|{\rm Id}-\partial_{x}F(x_{0},\lambda_{0})^{-1}\partial_{x}F(x,\lambda)|\leq 1/2$ , so $|\Omega_{\lambda}(x)-\Omega_{\lambda}(x_{0})|\leq\frac{1}{2}|x-x_{0}|.$ Thus

\begin{aligned}\|\Omega_{\lambda}(x)-x_{0}\|&\leq\|\Omega_{\lambda}(x)-\Omega_{\lambda}(x_{0})\|+\|\Omega_{\lambda}(x_{0})-x_{0}\|\\ &\leq\frac{1}{2}\|x-x_{0}\|+MB_{\delta}\delta^{1}\leq\delta.\end{aligned}

Thus $\Omega_{\lambda}$ is a contraction on $U_{\delta}$ , and $\Omega_{\lambda}(x)=x$ has a unique solution $x=g(\lambda)$ by the Contraction Fixed Point Theorem. We have therefore obtained a function $g:\Gamma_{\delta^{1}}\rightarrow U_{\delta}$ such that $F(g(\lambda),\lambda)=0$ . All solutions in $V_{\delta,\delta^{1}}$ are of this form: if $F(x_{1},\lambda_{1})=0$ , then

|x_{1}-g(\lambda_{1})|=|\Omega_{\lambda_{1}}(x_{1})-\Omega_{\lambda_{1}}(g(\lambda_{1}))|\leq\frac{1}{2}|x_{1}-g(\lambda_{1})|,

so $x_{1}=g(\lambda_{1}).$

For the final statement in the Theorem, let $\lambda,\lambda^{\prime}\in\Gamma_{\delta^{1}}$ . As above, we have

|g(\lambda)-g(\lambda^{\prime})|\leq\frac{1}{2}|g(\lambda)-g(\lambda^{\prime})|+MB_{\delta}|\lambda-\lambda^{\prime}|.

This yields the Lipschitz continuity of $g$ . To obtain differentiability, by Taylor’s theorem for $F\in C^{1}$ and the Lipschitz continuity of $g$ , we obtain, for $h\in\mathbb{R}^{n}$ ,

\begin{aligned}0&=\lim_{\|h\|\longrightarrow 0}\|h\|^{-1}\|F(g(\lambda+h),\lambda+h)-F(g(\lambda),\lambda)\|\\ &=\lim_{\|h\|\longrightarrow 0}\|h\|^{-1}\|F(g(\lambda+h),\lambda+h)-F(g(\lambda),\lambda+h)+F(g(\lambda),\lambda+h)-F(g(\lambda),\lambda)\|\\ &=\lim_{\|h\|\longrightarrow 0}\|h\|^{-1}\|\partial_{x}F(g(\lambda),\lambda+h)(g(\lambda+h)-g(\lambda))+\partial_{\lambda}F(g(\lambda),\lambda)(\lambda+h-\lambda)\|\\ &=\lim_{\|h\|\longrightarrow 0}\|h\|^{-1}\left\|\partial_{x}F(g(\lambda),\lambda)\left[g(\lambda+h)-g(\lambda)+(\partial_{x}F(g(\lambda),\lambda))^{-1}\partial_{\lambda}F(g(\lambda),\lambda)h\right]\right\|.\end{aligned}

Since $\partial_{x}F(g(\lambda),\lambda)$ is invertible, we get $\partial_{\lambda}g(\lambda)=-(\partial_{x}F(g(\lambda),\lambda))^{-1}\partial_{\lambda}F(g(\lambda),\lambda)$ . ∎
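The contraction argument in the proof can be checked numerically on a scalar example. The sketch below uses the toy function $F(x,\lambda)=x-\frac{1}{2}\lambda\cos x$ at $(x_{0},\lambda_{0})=(0,0)$ , which is an assumption chosen for illustration. Here $\delta=1$ works, since $|1-\partial_{x}F|=\frac{1}{2}|\lambda\sin x|\leq\frac{1}{2}\sin 1<1/2$ on $V_{\delta}$ , and $M=1$ , $B_{\delta}=1/2$ , so $\delta^{1}=1$ .

```python
import math

def F(x, lam):
    """Toy scalar example with F(0, 0) = 0 (illustrative assumption)."""
    return x - 0.5 * lam * math.cos(x)

def dF_dx(x, lam):
    """Partial derivative of F in x."""
    return 1.0 + 0.5 * lam * math.sin(x)

def dF_dlam(x, lam):
    """Partial derivative of F in lambda."""
    return -0.5 * math.cos(x)

def g(lam, x0=0.0, iterations=200):
    """Fixed point of Omega_lam(x) = x - dF_dx(x0, 0)^{-1} F(x, lam), started at x0."""
    a_inv = 1.0 / dF_dx(x0, 0.0)
    x = x0
    for _ in range(iterations):
        x = x - a_inv * F(x, lam)
    return x

def dg(lam):
    """Derivative predicted by the theorem: -(d_x F)^{-1} d_lam F at (g(lam), lam)."""
    x = g(lam)
    return -dF_dlam(x, lam) / dF_dx(x, lam)
```

For $|\lambda|<1$ , `g(lam)` returns a solution of $F(x,\lambda)=0$ with $|x|\leq 1$ , and a central finite difference of `g` agrees with `dg(lam)` up to discretization error.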

References

[1] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré, Gradient Flows in Metric Spaces and in the Space of Probability Measures, Birkhäuser, Basel, 2008.
[2] Michèle Audin and Mihai Damian, Morse theory and Floer homology, Universitext, Springer, London; EDP Sciences, Les Ulis, 2014.
[3] Qinxun Bai, Steven Rosenberg, Zheng Wu, and Stan Sclaroff, A differential geometric approach to classification, Proceedings of The 33rd International Conference on Machine Learning 48 (2016).
[4] Qinxun Bai, Steven Rosenberg, and Wei Xu, A geometric understanding of natural gradient, https://arxiv.longhoe.net/abs/2202.06232.
[5] Mikhail Belkin and Partha Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation 15 (2003), 1373–1396.
[6] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani, Manifold regularization: A geometric framework for learning from labeled and unlabeled examples, Journal of Machine Learning Research 7 (2006), 2399–2434.
[7] Ronny Bergmann et al., Discrete total variation of the normal vector field as shape prior with applications in geometric inverse problems, Inverse Problems 36 (2020).
[8] Jeff Cheeger, Finiteness theorems for Riemannian manifolds, Amer. J. Math. 92 (1970), 61–74.
[9] Jeff Cheeger and Mikhael Gromov, Collapsing Riemannian manifolds while keeping their curvature bounded. I, J. Differential Geom. 23 (1986), no. 3, 309–346.
[10] Luigi Chierchia, Kolmogorov-Arnold-Moser (KAM) theory, Mathematics of Complexity and Dynamical Systems. Vols. 1–3, Springer, New York (2012), 810–836.
[11] Yaim Cooper, Discrete gradient descent differs qualitatively from gradient flow, arXiv:1808.04839 (2018).
[12] Antonio Criminisi, Jamie Shotton, and Ender Konukoglu, Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning, Foundations and Trends in Computer Graphics and Vision 7 (2012), 81–227.
[13] David Donoho and Carrie Grimes, Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data, Proceedings of the National Academy of Sciences 100 (2003), no. 10, 5591–5596.
[14] Mark Droske and Martin Rumpf, A variational approach to nonrigid morphological image registration, SIAM Journal on Applied Mathematics 2 (2004), 668–687.
[15] James Eells, Jr., A setting for global analysis, Bull. Amer. Math. Soc. 72 (1966), 751–807.
[16] Charles Fefferman, Sergei Ivanov, Yaroslav Kurylev, Matti Lassas, Jinpeng Lu, and Hariharan Narayanan, Reconstruction and interpolation of manifolds. II: Inverse problems for Riemannian manifolds with partial distance data, https://arxiv.longhoe.net/abs/2111.14528.
[17] Charles Fefferman, Sergei Ivanov, Yaroslav Kurylev, Matti Lassas, and Hariharan Narayanan, Reconstruction and interpolation of manifolds. I: The geometric Whitney problem, Found. Comput. Math. 20 (2020), no. 5, 1035–1133.
[18] Charles Fefferman, Sergei Ivanov, Matti Lassas, and Hariharan Narayanan, Fitting a manifold of large reach to noisy data, https://arxiv.longhoe.net/abs/1910.05084.
[19] Charles Fefferman, Sergei Ivanov, Matti Lassas, and Hariharan Narayanan, Reconstruction of a Riemannian manifold from noisy intrinsic distances, SIAM J. Math. Data Sci. 2 (2020), no. 3, 770–808.
[20] Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan, Testing the manifold hypothesis, J. Amer. Math. Soc. 29 (2016), no. 4, 983–1049.
[21] Claus Gerhardt, Evolutionary surfaces of prescribed mean curvature, Journal of Differential Equations 36 (1980), 139–172.
[22] Daniele Granata and Vincenzo Carnevale, Accurate estimation of the intrinsic dimension using graph distances: Unraveling the geometric complexity of datasets, Sci. Rep. 6 (2016), https://www.nature.com/articles/srep31377.
[23] Guodong Guo, Yun Fu, Charles R. Dyer, and Thomas S. Huang, Image-based human age estimation by manifold learning and locally adjusted robust regression, IEEE Transactions on Image Processing 17 (2008), 1178–1188.
[24] Richard S. Hamilton, Harnack estimate for the mean curvature flow, Journal of Differential Geometry 41 (1995), 215–226.
[25] Gerhard Huisken and Carlo Sinestrari, Mean curvature flow singularities for mean convex surfaces, Calculus of Variations and Partial Differential Equations 8 (1999), 1–14.
[26] John M. Lee, Introduction to Smooth Manifolds, Graduate Texts in Mathematics, vol. 218, Springer, New York, 2013.
[27] Tong Lin, Hanlin Xue, Ling Wang, Bo Huang, and Hongbin Zha, Supervised learning via Euler’s elastica models, Journal of Machine Learning Research 16 (2015), 3637–3686.
[28] Carlangelo Liverani, Implicit function theorem (a quantitative version), https://www.mat.uniroma2.it/~liverani/Calcolo1-2016/implicit.pdf.
[29] Yunqian Ma and Yun Fu (eds.), Manifold Learning and Applications, CRC Press, Boca Raton, 2011.
[30] Uwe F. Mayer, Gradient flows on nonpositively curved metric spaces and harmonic maps, Communications in Analysis and Geometry 6 (1998), no. 2, 199–253.
[31] John Milnor, Morse Theory, Princeton University Press, Princeton, NJ, 1969.
[32] Marston Morse, The foundations of the calculus of variations in m-space. Part I, Trans. Amer. Math. Soc. 31 (1929), 379–404.
[33] David Mumford and Jayant Shah, Optimal approximations by piecewise smooth functions and associated variational problems, Communications on Pure and Applied Mathematics 42 (1989), no. 5, 577–685.
[34] Hideki Omori, Infinite-dimensional Lie groups, Translations of Mathematical Monographs, vol. 158, American Mathematical Society, Providence, RI, 1997.
[35] Stanley Osher and James A. Sethian, Fronts propagating with curvature-dependent speed: Algorithms based on Hamilton-Jacobi formulations, Journal of Computational Physics 79 (1988), 12–49.
[36] Sam Roweis and Lawrence Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000), no. 5500, 2323–2326.
[37] Melanie Rupflin and Peter M. Topping, Flowing maps to minimal surfaces, American Journal of Mathematics 138 (2016), no. 4, 1095–1115.
[38] James A. Sethian, Level Set Methods and Fast Marching Methods: Evolving Interfaces in Computational Geometry, Fluid Mechanics, Computer Vision, and Materials Science, vol. 3, Cambridge University Press, 1999.
[39] Alexander Smola, Sebastian Mika, Bernhard Schölkopf, and Robert Williamson, Regularized principal manifolds, Journal of Machine Learning Research 1 (2001), 179–209.
[40] Joshua Tenenbaum, Vin de Silva, and John Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000), 2319–2323.
[41] Kush Varshney and Alan Willsky, Classification using geometric level sets, Journal of Machine Learning Research 11 (2010), 491–516.
[42] Xiaohui Wang and J. S. Marron, A scale-based approach to finding effective dimensionality in manifold learning, Electron. J. Stat. 2 (2008), 127–148.
[43] Ling Xiao, Gradient estimates and lower bound for the blow-up time of star shaped mean curvature flow, https://arxiv.longhoe.net/pdf/1311.3721.pdf.
[44] Ye Yuan and Chuanjiang He, Variational level set methods for image segmentation based on both L2 and Sobolev gradients, Nonlinear Analysis: Real World Applications 13 (2012), 959–966.

$$\begin{aligned}
\left\|R^{(p,r)}_{j}(q,v)(s-s_{0})^{j}\right\|
&\leq \left\|\int_{0}^{1}(1-t)\,\partial_{j}f^{r}_{p}\big((1-t)(q_{0},v_{0})+t(q,v)\big)\,dt\right\|\cdot\left\|(s-s_{0})^{j}\right\| \\
&\leq \max\left\{\left\|\partial_{j}f^{r}_{p}(q,v)\right\| : 1\leq j\leq N,\ \|v\|\leq .999K^{-1},\ q\in\phi(M)\right\}\|s-s_{0}\| \\
&\stackrel{\rm def}{=} L^{(p,r)}_{j}\|s-s_{0}\|.
\end{aligned}$$

$$\begin{aligned}
\|E(s)-E(s_{0})\|^{2}
&= \sum_{p=1}^{N}\left(\sum_{j}R^{p}_{j}(s)(s-s_{0})^{j}\right)^{2}
\leq \sum_{p=1}^{N}\left(\sum_{j}\|R^{p}_{j}(s)\|^{2}\right)\|s-s_{0}\|^{2} \\
&\leq N\left(\sum_{p=1}^{N}\|S^{p}\|^{2}\right)\|s-s_{0}\|^{2}
\leq \sum_{p=1}^{N}\sum_{j}\|S^{p}\delta^{2}(q_{0},v_{0})\|^{2}.
\end{aligned}$$

$$\begin{aligned}
0 &= \lim_{\|h\|\to 0}\|h\|^{-1}\,\|F(g(\lambda+h),\lambda+h)-F(g(\lambda),\lambda)\| \\
&= \lim_{\|h\|\to 0}\|h\|^{-1}\,\|F(g(\lambda+h),\lambda+h)-F(g(\lambda),\lambda+h)+F(g(\lambda),\lambda+h)-F(g(\lambda),\lambda)\| \\
&= \lim_{\|h\|\to 0}\|h\|^{-1}\,\|\partial_{x}F(g(\lambda),\lambda+h)(g(\lambda+h)-g(\lambda))+\partial_{\lambda}F(g(\lambda),\lambda)(\lambda+h-\lambda)\| \\
&= \partial_{x}F(g(\lambda),\lambda)\lim_{h\to 0}\|h\|^{-1}\,\|g(\lambda+h)-g(\lambda)+(\partial_{x}F(g(\lambda),\lambda))^{-1}\|\partial_{\lambda}F(g(\lambda),\lambda)\|.
\end{aligned}$$