Recent progress in artificial intelligence has produced rapid gains in model capability, but at a substantial resource cost. Large language models illustrate this trend: training GPT-3 consumed approximately 1,287 MWh of electricity, comparable to the annual consumption of more than 100 average U.S. homes, and produced nearly 550 metric tons of CO2. These costs raise environmental and economic concerns, particularly for smaller organizations that must bear the price of GPUs and cooling infrastructure.
Neuromorphic computing offers an energy-efficient alternative through event-driven processing inspired by biological systems. Purely neuromorphic designs, however, often encounter limits in adaptability, memory capacity, and compatibility with complex workloads. As model size and algorithmic complexity continue to grow, these constraints reduce the practical applicability of neuromorphic systems deployed in isolation.
This work proposes a hybrid hardware architecture that targets both efficiency and flexibility. The design is implemented on a Zynq UltraScale+ FPGA using Vivado. It combines neuromorphic processing with a hierarchical memory system that prioritizes locality and selective access. The memory system comprises primary memory, a high-speed cache, and a replication buffer; together these components reduce off-chip memory traffic and shorten data-retrieval latency. A fusion-based execution model assigns tasks to specialized branches: spiking neural network units, dense matrix engines, convolutional accelerators, and transformer attention mechanisms. Each workload is thereby matched to the resources that execute it most effectively, as illustrated in the sketch below.
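To make the dispatch and memory-locality ideas concrete, the following C++ sketch models a host-side view of the fusion-based execution model: tasks are classified by workload kind and routed to a branch, and operand loads walk the replication buffer, cache, and primary memory in order. The type names, the promotion policy, and the lookup logic are illustrative assumptions for exposition only, not the paper's actual RTL or HLS implementation.

    #include <cstdint>
    #include <iostream>
    #include <unordered_map>

    // Coarse workload categories handled by the specialized branches.
    enum class WorkloadKind { Spiking, DenseMatrix, Convolution, Attention };

    // Execution branches of the fusion model (names are illustrative).
    enum class Branch { SnnUnit, MatrixEngine, ConvAccelerator, AttentionUnit };

    // Route each workload to the branch expected to execute it most effectively.
    Branch dispatch(WorkloadKind kind) {
        switch (kind) {
            case WorkloadKind::Spiking:     return Branch::SnnUnit;
            case WorkloadKind::DenseMatrix: return Branch::MatrixEngine;
            case WorkloadKind::Convolution: return Branch::ConvAccelerator;
            case WorkloadKind::Attention:   return Branch::AttentionUnit;
        }
        return Branch::MatrixEngine;  // fallback for unclassified work
    }

    // Hierarchical lookup: replication buffer first, then cache, then primary memory.
    struct MemoryHierarchy {
        std::unordered_map<uint32_t, float> replication_buffer;  // smallest, hottest data
        std::unordered_map<uint32_t, float> cache;               // high-speed cache
        std::unordered_map<uint32_t, float> primary;             // primary (backing) memory

        float load(uint32_t addr) {
            if (auto it = replication_buffer.find(addr); it != replication_buffer.end())
                return it->second;                                // hit in replication buffer
            if (auto it = cache.find(addr); it != cache.end()) {
                replication_buffer[addr] = it->second;            // promote reused data
                return it->second;
            }
            float value = primary.count(addr) ? primary[addr] : 0.0f;
            cache[addr] = value;                                  // fill cache on a miss
            return value;
        }
    };

    int main() {
        MemoryHierarchy mem;
        mem.primary[42] = 1.5f;

        Branch b = dispatch(WorkloadKind::Attention);
        std::cout << "attention task -> branch " << static_cast<int>(b)
                  << ", operand = " << mem.load(42) << "\n";
        return 0;
    }

In this simplified view, keeping recently reused operands in the replication buffer is what cuts off-chip traffic; the hardware realization of that policy is described in the body of the paper.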
The architecture was evaluated against an Nvidia A100 GPU, an Nvidia L4 GPU, a Google TPU v2-8, and a CPU on two tasks. For DeepSeek V3 inference, the FPGA achieved a latency of 756,000 ns (756 µs) at 2.9 W; for a convolutional neural network training task, it achieved 3,400,000 ns (3.4 ms) at 3.5 W. In both tasks, the FPGA's measured latency and power consumption compared favorably with those of the other devices. These results indicate that the proposed approach can reduce energy and cost while preserving the versatility required by modern workloads. The findings support hybrid, memory-aware designs as a practical direction for sustainable high-performance machine learning.
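As a rough illustration of the energy implications, combining the reported latency and power figures gives per-task energy estimates, under the assumption (not stated in the source) that the reported power is the average draw over the task:

\[
E = P \, t:\qquad E_{\text{inference}} \approx 2.9\ \text{W} \times 756\ \mu\text{s} \approx 2.2\ \text{mJ},\qquad E_{\text{training}} \approx 3.5\ \text{W} \times 3.4\ \text{ms} \approx 11.9\ \text{mJ}.
\]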