Showing 1–3 of 3 results for author: Sheen, H

  1. arXiv:2403.08699  [pdf, ps, other]

    cs.LG cs.AI math.OC stat.ML

    Implicit Regularization of Gradient Flow on One-Layer Softmax Attention

    Authors: Heejune Sheen, Siyu Chen, Tianhao Wang, Harrison H. Zhou

    Abstract: We study gradient flow on the exponential loss for a classification problem with a one-layer softmax attention model, where the key and query weight matrices are trained separately. Under a separability assumption on the data, we show that when gradient flow achieves the minimal loss value, it further implicitly minimizes the nuclear norm of the product of the key and query weight matrices. Such i…

    Submitted 13 March, 2024; originally announced March 2024.

    Comments: 34 pages

  2. arXiv:2402.19442  [pdf, other]

    cs.LG cs.AI math.OC math.ST stat.ML

    Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality

    Authors: Siyu Chen, Heejune Sheen, Tianhao Wang, Zhuoran Yang

    Abstract: We study the dynamics of gradient flow for training a multi-head softmax attention model for in-context learning of multi-task linear regression. We establish the global convergence of gradient flow under suitable choices of initialization. In addition, we prove that an interesting "task allocation" phenomenon emerges during the gradient flow dynamics, where each attention head focuses on solving…

    Submitted 10 June, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

    Comments: 141 pages, 7 figures

  3. arXiv:2011.12151  [pdf, other]

    stat.ML cs.LG

    Tensor Kernel Recovery for Spatio-Temporal Hawkes Processes

    Authors: Heejune Sheen, Xiaonan Zhu, Yao Xie

    Abstract: We estimate the general influence functions for spatio-temporal Hawkes processes using a tensor recovery approach by formulating the location dependent influence function that captures the influence of historical events as a tensor kernel. We assume a low-rank structure for the tensor kernel and cast the estimation problem as a convex optimization problem using the Fourier transformed nuclear norm…

    Submitted 28 November, 2022; v1 submitted 24 November, 2020; originally announced November 2020.

    Comments: 24 pages