Reduce: A Framework for Reducing the Overheads of Fault-Aware Retraining

Hanif, Muhammad Abdullah; Shafique, Muhammad

arXiv:2305.12595 (cs)

[Submitted on 21 May 2023]

Abstract: Fault-aware retraining has emerged as a prominent technique for mitigating permanent faults in Deep Neural Network (DNN) hardware accelerators. However, retraining incurs large overheads, especially when used to fine-tune large DNNs designed for complex problems. Moreover, because each fabricated chip can have a distinct fault pattern, fault-aware retraining must be performed for each chip individually, considering its unique fault map, which further aggravates the problem. To reduce the overall retraining cost, in this work, we introduce the concept of resilience-driven retraining amount selection. To realize this concept, we propose a novel framework, Reduce, that first computes the resilience of the given DNN to faults at different fault rates and with different amounts of retraining. Then, based on this resilience, it computes the amount of retraining required for each chip considering its unique fault map. We demonstrate the effectiveness of our methodology for a systolic array-based DNN accelerator experiencing permanent faults in the computational array.
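The two-step idea in the abstract (profile resilience once, then pick a retraining amount per chip) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how resilience is stored or how the amount is chosen, so the table layout, the use of epochs as the unit of retraining, the conservative rounding up to the next profiled fault rate, and every number below are assumptions made purely for illustration.

```python
from bisect import bisect_left

# Hypothetical resilience profile: profiled fault rate -> {retraining epochs: accuracy}.
# All values are made-up placeholders, not results from the paper.
RESILIENCE = {
    0.00: {0: 0.92},
    0.01: {0: 0.85, 1: 0.90, 2: 0.92},
    0.05: {0: 0.60, 1: 0.84, 2: 0.89, 4: 0.91},
    0.10: {0: 0.30, 1: 0.70, 2: 0.82, 4: 0.88, 8: 0.91},
}


def fault_rate(fault_map):
    """Fraction of faulty processing elements in a 2D boolean fault map."""
    cells = [pe for row in fault_map for pe in row]
    return sum(cells) / len(cells)


def select_retraining_amount(fault_map, target_accuracy, profile=RESILIENCE):
    """Return the smallest profiled epoch count expected to reach the target.

    The chip's fault rate is rounded up to the next profiled rate (an assumed,
    conservative choice). Returns None if the rate lies beyond the profiled
    range or if no profiled amount reaches the target.
    """
    rate = fault_rate(fault_map)
    rates = sorted(profile)
    idx = bisect_left(rates, rate)  # first profiled rate >= chip's rate
    if idx == len(rates):
        return None
    curve = profile[rates[idx]]
    for epochs in sorted(curve):
        if curve[epochs] >= target_accuracy:
            return epochs
    return None


# Example chip: a 10x10 array with three faulty processing elements.
chip = [[False] * 10 for _ in range(10)]
chip[0][0] = chip[3][7] = chip[9][2] = True
epochs_needed = select_retraining_amount(chip, target_accuracy=0.88)
```

Under these assumptions, a chip with few faults is assigned little or no retraining, while a heavily faulty chip is assigned more, which is the cost saving the abstract describes.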

Comments:	2 pages, 3 figures. arXiv admin note: substantial text overlap with arXiv:2304.12949
Subjects:	Hardware Architecture (cs.AR)
Cite as:	arXiv:2305.12595 [cs.AR]
	(or arXiv:2305.12595v1 [cs.AR] for this version)
	https://doi.org/10.48550/arXiv.2305.12595

Submission history

From: Muhammad Abdullah Hanif
[v1] Sun, 21 May 2023 23:09:21 UTC (352 KB)