Superconductor single-flux-quantum (SFQ) logic family has been recognized as a highly promising solution for the post-Moore's era, thanks to its ultra-fast and low-power switching characteristics. Therefore, researchers have made a tremendous amount of effort in various aspects to promote the technology and automate its circuit design process (e.g., low-cost fabrication, design tool development). However, there has been no progress in designing a convincing SFQ-based architectural unit due to the architects' lack of understanding of the technology's potentials and limitations at the architecture level.In this paper, we present how to architect an SFQ-based architectural unit by providing design principles with an extreme-performance neural processing unit (NPU). To achieve the goal, we first implement an architecture-level simulator to model an SFQ-based NPU accurately. We validate this model using our die-level prototypes, design tools, and logic cell library. This simulator accurately measures the NPU's performance, power consumption, area, and cooling overheads. Next, driven by the modeling, we identify key architectural challenges for designing a performance-effective SFQ-based NPU (e.g., expensive on-chip data movements and buffering). Lastly, we present SuperNPU, our example SFQ-based NPU architecture, which effectively resolves the challenges. Our evaluation shows that the proposed design outperforms a conventional state-of-the-art NPU by 23 times. With free cooling provided as done in quantum computing, the performance per chip power increases up to 490 times. Our methodology can also be applied to other architecture designs with SFQ-friendly characteristics.