Showing 1–1 of 1 results for author: Presser, S

Searching in archive cs.
  1. arXiv:2101.00027  [pdf, other]

    cs.CL

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Authors: Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy

    Abstract: Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets -- both existing and new…

    Submitted 31 December, 2020; originally announced January 2021.