MCU hardware benchmark compared to TFLite #17157

@Faboulou

Description

Hello,

I’m currently evaluating ExecuTorch (tag v1.0.0, export and runner) against TFLite on real hardware (not an FVP).

The goal is purely technology validation for industrial use cases, before selecting a runtime for deployment on microcontroller-class devices.

Target: Cortex-M55 + Ethos-U55

Model: MobileNetV2 (int8, Ethos-U compatible)

For ExecuTorch: from torchvision.models import mobilenet_v2

For TFLite: tf.keras.applications.MobileNetV2

The same model architecture and equivalent quantization were used for both runtimes (both with INT8 input/output).

ExecuTorch

PTE size: 3.5 MB (3 512 288 bytes)

Memory:

  • Planned (with Input / Output): 256 KiB
  • Runtime / temporary: 1.6 MiB
  • Method: 30 KiB

Total:
256 + 1600 + 30 = 1886 KiB ≈ 1.84 MiB
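For anyone double-checking the unit conversion, the same total in plain Python (figures copied from the list above; 1 MiB = 1024 KiB):

```python
# Pre-run ExecuTorch memory total, in KiB, then converted to MiB.
planned_kib = 256    # planned buffers, incl. input/output
runtime_kib = 1600   # runtime/temporary (1.6 MiB, rounded to 1600 KiB)
method_kib = 30      # method metadata

total_kib = planned_kib + runtime_kib + method_kib
print(total_kib)                    # 1886
print(round(total_kib / 1024, 2))   # 1.84
```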

After run:

  • Runtime (used): 1.44 MiB (1474 KiB)
  • Method (used): 0.24 KiB

Total real:
1474 + 0.24 + 256 = 1730 KiB ≈ 1.69 MiB

Inference time:

  • One image: 113 ms
  • 100 images (average per image): 113.748 ms

Timing measured only around:
g_method->execute();
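For clarity on how the averaged figure is obtained, a minimal sketch of the measurement loop (host-side Python, with a stub `execute()` standing in for the real call; on target the same pattern wraps `g_method->execute()` with a hardware cycle counter, and the numbers here are illustrative only):

```python
import time

def execute():
    # Stub standing in for the model's execute() call.
    time.sleep(0.001)

N = 10  # number of inferences to average over
start = time.perf_counter()
for _ in range(N):
    execute()
elapsed_ms = (time.perf_counter() - start) * 1000 / N
print(f"average latency: {elapsed_ms:.1f} ms per image")
```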


TensorFlow Lite

Vela TFLite size: 3.7 MB (3 684 976 bytes)

Arena memory before running (with inputs/outputs): 1.7 MiB

Arena used (after run): 1.51 MiB

Inference time:

  • One image: 91.221 ms
  • 100 images (average per image): 91.204 ms

Timing measured only around:
g_interpreter->Invoke();

As I understand it, when comparing memory usage, the TFLite arena roughly corresponds to ExecuTorch's Planned + Runtime memory.
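Under that assumption, the pre-run footprints line up as follows (plain Python, figures copied from above):

```python
KIB_PER_MIB = 1024

# ExecuTorch, before run: planned (incl. I/O) + runtime/temporary
executorch_mib = 256 / KIB_PER_MIB + 1.6
# TFLite arena, before run (incl. inputs/outputs)
tflite_mib = 1.7

print(round(executorch_mib, 2))  # 1.85
print(round(tflite_mib, 2))      # 1.7
```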

Do these values look correct to you?
Based on these measurements, TFLite currently shows lower latency and a smaller memory footprint than ExecuTorch, which we did not expect and would like to understand.
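For reference, the relative latency gap from the averaged figures:

```python
executorch_ms = 113.748  # ExecuTorch, average per image
tflite_ms = 91.204       # TFLite, average per image

ratio = executorch_ms / tflite_ms
print(f"{ratio:.3f}")                 # ExecuTorch / TFLite latency ratio
print(f"{(ratio - 1) * 100:.0f}% slower")
```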

cc @psiddh @AdrianLundell @digantdesai

Labels: module: microcontrollers