SaMD and GDPR data minimisation: the architectural decision that protects the entire stack

Software as a Medical Device is a regulated product category that happens to live inside a regulated data context. The EU Medical Device Regulation (MDR) governs the product. The GDPR governs the data. Most HealthTech scale-ups I work with design for the first regulation first and treat the second as a layer to add later. That sequencing produces a specific architectural failure that is expensive to correct once the product is in production, and it almost always traces back to one principle: Article 5(1)(c), data minimisation.

This article is about why minimisation in a clinical context is structurally harder than in other SaaS domains, the default development pattern that silently violates it, and the architectural decision that has to be made early enough to matter.

The principle, stated plainly

Article 5(1)(c) of the GDPR requires personal data to be “adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed.” In consumer SaaS this is a tractable constraint. You ask what data the feature needs, you collect only that, you do not store what you do not need. The principle has teeth but the engineering is not complicated.

SaMD breaks this neat model in two ways, and both of them matter.

First, the purpose is not fixed at the moment of collection. A clinical decision support tool, a diagnostic algorithm, a patient-monitoring pipeline: these are designed to evolve. The model improves as it sees more data. The indications may expand as clinical evidence accumulates. The regulatory classification may change. Data that is excessive for today’s purpose may be strictly necessary for tomorrow’s. The GDPR does not accept “we might need it later” as a specified purpose, because necessity is judged against the purposes declared at collection. But your clinical development team will absolutely need it later, and the question is how you hold both positions at once.

Second, the signal in clinical data is often in what looks like noise. A cardiologist will tell you that the useful information for arrhythmia detection is not just in the obvious waveform — it is in the subtle variations, the timing relationships, the patient’s demographics and comorbidities and medications and history. Stripping any one of those to satisfy minimisation can destroy the clinical utility. Data that would be excessive in any other context is genuinely necessary here. But the GDPR does not adjudicate medical questions. It adjudicates data-processing questions.

These two tensions — temporal and informational — produce the architectural failure mode I want to describe.

The default development pattern

Here is what happens in most SaMD teams without deliberate architectural choices. The engineering team builds a clinical-data intake pipeline that captures the full clinical record — rich, longitudinal, detailed. The ML team trains on this corpus because it is the highest-signal data available. The product works, and it works well, because the model has everything it needs. The data lake accumulates years of patient records. The team adds pseudonymisation, applies access controls, maybe implements some anonymisation for research datasets.

At some point — usually during the first serious procurement review with an enterprise buyer, or the first DPIA, or the first regulatory inspection — the question arrives: what is the legal basis for processing this volume of personal health data, and how does the architecture satisfy data minimisation?

The answer that emerges is usually some combination of consent, contract, and legitimate interest arguments. The answer is almost never “the architecture enforces minimisation.” It is almost always “the controls wrap the architecture.” And the problem is that GDPR Article 5 is not a controls question. It is a design question. Article 25 (data protection by design and by default) requires the principle to be implemented through appropriate technical and organisational measures, integrated into the processing itself.

A data lake that holds the full clinical record, with access controls around it, does not satisfy minimisation by design. It satisfies access control by design. The difference is not rhetorical. Under inspection, the data protection authority is asking whether the processing itself is limited — whether data you do not need is, in fact, not being processed. If the data is there and the controls are the only thing preventing its use, you are defending a position that has already been lost architecturally.

The architectural decision that has to be made early

The reconciliation is not contractual — you cannot consent your way out of the tension between clinical utility and minimisation. It is architectural, and it lives in three specific design choices.

Separation of operational and research data paths. The inference pipeline — the data needed to run the clinical tool in production — and the training pipeline — the data needed to develop and improve the model — are two different processing activities with two different purposes, two different minimisation standards, and two different legal bases. Designing them as one pipeline with different access policies is what produces the problem. Designing them as separate pipelines, with data routed on ingestion based on purpose, is what solves it. The production pipeline receives only what the clinical tool needs at inference time. The research pipeline receives only what the development team needs, under a distinct legal basis with distinct safeguards, often after pseudonymisation or synthetic generation.
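A minimal sketch of what routing on ingestion could look like, assuming each processing activity has an explicit field allow-list. The field names, the two allow-lists, and the `route_on_ingestion` helper are hypothetical illustrations, not a description of any particular product.

```python
from dataclasses import dataclass
from typing import Any, Mapping

# Hypothetical allow-lists: the fields each processing activity may receive.
# Field names are illustrative only.
INFERENCE_FIELDS = frozenset({"patient_id", "ecg_waveform", "sample_rate_hz"})
RESEARCH_FIELDS = frozenset({"ecg_waveform", "sample_rate_hz", "age_band", "medications"})


@dataclass(frozen=True)
class RoutedRecord:
    inference: dict[str, Any]   # payload for the production pipeline
    research: dict[str, Any]    # payload for the development pipeline
    dropped: frozenset[str]     # field names that were never stored


def route_on_ingestion(record: Mapping[str, Any]) -> RoutedRecord:
    """Split one incoming record into per-purpose payloads.

    A field on neither allow-list is not forwarded to any pipeline.
    """
    inference = {k: v for k, v in record.items() if k in INFERENCE_FIELDS}
    research = {k: v for k, v in record.items() if k in RESEARCH_FIELDS}
    dropped = frozenset(record) - INFERENCE_FIELDS - RESEARCH_FIELDS
    return RoutedRecord(inference, research, dropped)
```

In this sketch the research allow-list deliberately omits the direct identifier, so the development path cannot receive it regardless of who has access. Any pseudonymisation or synthetic generation would be a further step on that path.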

Purpose-bound ingestion and retention. Every data element entering the system should be tagged with its purpose at the moment of ingestion, not inferred later from access patterns. Retention schedules follow the tag, not the storage layer. When the purpose expires, the data is deleted — not moved to a colder tier, not access-restricted further, deleted. This is expensive engineering, but it is the only architecture that survives the “what is being processed and why” question when an auditor asks it.
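The tag-then-delete idea can be sketched as follows, assuming an in-memory store and a retention schedule keyed by purpose. The purpose names, the durations, and the `ingest` and `purge_expired` helpers are placeholders chosen for illustration, not recommended values or a real API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical retention schedule keyed by purpose tag.
# The durations are placeholders, not recommendations.
RETENTION = {
    "inference": timedelta(days=30),
    "model_development": timedelta(days=365),
}


@dataclass(frozen=True)
class TaggedElement:
    key: str
    purpose: str
    ingested_at: datetime

    def expires_at(self) -> datetime:
        # Retention follows the purpose tag, not the storage layer.
        return self.ingested_at + RETENTION[self.purpose]


def ingest(key: str, purpose: str, now: datetime) -> TaggedElement:
    """Tag an element at ingestion; refuse any undeclared purpose."""
    if purpose not in RETENTION:
        raise ValueError(f"no declared purpose or retention for {purpose!r}")
    return TaggedElement(key, purpose, now)


def purge_expired(store: dict[str, TaggedElement], now: datetime) -> list[str]:
    """Delete expired elements and return their keys for an audit log."""
    expired = sorted(k for k, e in store.items() if e.expires_at() <= now)
    for k in expired:
        del store[k]  # deleted, not moved to a colder tier
    return expired
```

The point of the sketch is the refusal in `ingest`: data with no declared purpose never enters the store, so there is nothing to justify retroactively.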

Aggressive use of privacy-enhancing technologies at the right layer. Differential privacy on aggregate analytics, federated learning where feasible for multi-site training, secure enclaves for computation on sensitive fields, synthetic data generation for development and testing environments. None of these is a silver bullet. Each has clinical and statistical trade-offs that the ML and clinical teams have to own, not outsource to compliance. But deployed intentionally, they let you run the product at the clinical utility the use case demands while shrinking the surface of actual personal-data processing.
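As one concrete instance of the first technique, here is a minimal sketch of the Laplace mechanism applied to a count, using only the Python standard library. The function names are hypothetical, and a plain floating-point implementation like this is for illustration only. A production system would normally rely on a vetted differential-privacy library.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(flags: list[bool], epsilon: float, rng: random.Random) -> float:
    """Release a noisy count under epsilon-differential privacy.

    A counting query has sensitivity 1: adding or removing one patient
    changes the true count by at most 1, so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sum(flags) + laplace_noise(1.0 / epsilon, rng)
```

A smaller `epsilon` gives stronger privacy and a noisier count. Whether the resulting error is acceptable for a given analytic is exactly the kind of trade-off the ML and clinical teams have to own.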

The political question inside the team

None of this is primarily a legal problem. It is a politics-of-engineering problem. The ML team wants maximum data for maximum model performance. The clinical team wants maximum data for clinical defensibility. The product team wants maximum data for future features. The compliance team wants minimum data for regulatory survival. The tension is real and it does not disappear because someone writes a DPIA.

The architectural decision has to be made by leadership, at a specific moment in the product lifecycle, and it has to be defended against the pressure to collect first and minimise later. My observation from multiple SaMD engagements is that the decision is easiest to make before the first data lake exists, harder after the first model is trained on the uncontrolled corpus, and extremely expensive after the product is in market. The companies that make it early ship a more defensible product at comparable clinical performance — because the pressure on the architecture forces better data-engineering discipline than the uncontrolled case ever produces.

What this looks like in practice

In the SaMD engagements I run, the work sequence is usually:

1. Map every data element the product touches, tagged by the processing purpose and the minimum necessary scope for that purpose. Most teams discover they cannot do this without doing real data-lineage work, which is itself the first sign that the architecture does not yet support minimisation.
2. Draw the line between operational processing and development processing. Decide where the interface sits, what crosses it, under what safeguards, and what gets deleted versus retained.
3. Specify retention for each purpose, with deletion jobs wired to the retention metadata, tested, and auditable.
4. Build the DPIA around the architecture, not around controls. A DPIA that says “we collect X, process it for Y, and delete it after Z” is a document an inspector can validate. A DPIA that says “we collect everything and apply access policies” is not.
5. Align the clinical and ML teams behind the architecture. This is the step that takes real executive time. It is also the step that determines whether the architecture holds when product priorities shift.
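The mapping and retention steps in this sequence lend themselves to a mechanical check. The sketch below assumes the data map is kept as structured records. The `DataElement` fields and the `audit_data_map` helper are hypothetical, and a real inventory would carry more detail (legal basis, recipients, safeguards).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataElement:
    name: str
    purpose: Optional[str]         # why the element is processed
    path: Optional[str]            # "operational" or "development"
    retention_days: Optional[int]  # None means no schedule declared


def audit_data_map(elements: list[DataElement]) -> dict[str, list[str]]:
    """Return, per element name, the gaps that block a purpose-based DPIA."""
    gaps: dict[str, list[str]] = {}
    for e in elements:
        missing = []
        if not e.purpose:
            missing.append("purpose")
        if e.path not in ("operational", "development"):
            missing.append("path")
        if e.retention_days is None or e.retention_days <= 0:
            missing.append("retention")
        if missing:
            gaps[e.name] = missing
    return gaps
```

An empty result means every element can be described as "collected for Y, deleted after Z". A non-empty result is the list of lineage work still outstanding.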

For a SaMD programme, the minimisation question is not whether you can defend your data practices to an inspector today. It is whether the architecture of your product still makes sense when you are scaling into new markets, expanding indications, and integrating with enterprise health systems that will run their own DPIAs against you. Getting that decision right early protects the entire stack — the product, the regulatory file, the commercial roadmap, and the investor narrative.

It is the kind of architectural call that is easy to defer and expensive to defer. And it almost always has to be made before the first fully trained model goes into production.
