Successful Completion of the Medications Data Interoperability Project (MDIP)

Entering our ninth month of the NGI Sargasso MDIP project, we are very happy to announce the successful completion of all tasks.
Measured against its stated objectives, we consider the project a success, but of course we also want to hear feedback from stakeholders.
Headline information
- Regions in scope: EU, US, UK, Canada.
- Datasets received: 31 of 33 in scope. Two entities did not provide the requested data.
- Contributors: medications regulators across the EU, the US and Canada, plus the NHS England Terminology Server (for UK data) and the European Medicines Agency.
- Total individual medications records gathered: 950,197
- Total valid for mapping: 930,871
- Total invalid for mapping: 20,115. These records could not be processed because of a missing ATC code, a missing active substance, or malformed content in the file.
- Total mapped to Pharmacogenomic (PGx) flags: 223,152, or 24% of the records valid for mapping
- Searchable Database: The searchable database went live at www.mdip.eu on Monday 1st September.
- Data written to Blockchain: Unique identifying data was successfully written to a blockchain ledger in August. As running the blockchain is expensive (and, within the project, it serves only as an example), a video was made of the ledger being populated and the blockchain service was then discontinued.
- Data provided to Open Source: All code and data artefacts created by the project will be added to a GitHub repository. This will be communicated on the Technology Page of the website this week.
Challenges
Data modelling
In our initial approach we wanted to use a generic data model incorporating these concepts:
- VTM - Virtual Therapeutic Moiety
- VMP - Virtual Medicinal Product
- AMP - Actual Medicinal Product
- VMPP - Virtual Medicinal Product Pack
- AMPP - Actual Medicinal Product Pack
However, this quickly proved impossible. Most regulatory authorities do not publish their data according to these concepts, and the values they do use are in the local language. We could not assess whether those values were equivalent, so instead we simply ingested everything and presented it. This does, however, point to a recommendation: datasets should be harmonised in the future.
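For readers unfamiliar with the five concepts listed above, the following is a minimal sketch of how they are commonly understood to relate to one another (the terms are those used by the NHS dictionary of medicines and devices). It is an illustration only, not the model used in the project. The class fields, the product names and the supplier are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VTM:
    """Virtual Therapeutic Moiety: the substance, with no strength or form."""
    name: str


@dataclass(frozen=True)
class VMP:
    """Virtual Medicinal Product: a generic product with strength and form."""
    name: str
    vtm: VTM


@dataclass(frozen=True)
class AMP:
    """Actual Medicinal Product: a VMP as marketed by a specific supplier."""
    name: str
    supplier: str
    vmp: VMP


@dataclass(frozen=True)
class VMPP:
    """Virtual Medicinal Product Pack: a generic product in a given pack size."""
    vmp: VMP
    pack_size: int


@dataclass(frozen=True)
class AMPP:
    """Actual Medicinal Product Pack: a supplier's product in a given pack size."""
    amp: AMP
    vmpp: VMPP


def substance_of(pack: AMPP) -> str:
    """Walk from a physical pack back up to its therapeutic moiety."""
    return pack.amp.vmp.vtm.name


# Hypothetical example values, for illustration only.
paracetamol = VTM("Paracetamol")
tablets = VMP("Paracetamol 500 mg tablets", paracetamol)
branded = AMP("ExampleBrand Paracetamol 500 mg tablets", "Example Pharma", tablets)
generic_pack = VMPP(tablets, 16)
branded_pack = AMPP(branded, generic_pack)
```

A regulator file that only lists branded packs in the local language gives no reliable way to fill in the upper levels of this hierarchy, which is the difficulty described above.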
Malformed data, or data which is not minimally interoperable
Over 20,000 records could not be processed because of poor data quality. In two countries the public portal shows the ATC code and active substance when a record is clicked on, but the downloadable file contains neither. Requests were made in writing to both entities, but nothing was received. Some files we could rectify because the errors were repetitive. For example, some ATC codes were presented with spaces and full stops, which is not the norm. These characters were removed and the files could then be ingested. In other cases the malformation could not be easily rectified, and we would have needed somebody with local pharmaceutical knowledge to carry out quality control. The project budget did not extend to this, so we had to restrict ourselves to very high confidence matching, i.e. a direct mapping to ATC code or active substance.
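The repetitive ATC clean-up described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names are hypothetical, and the shape check assumes full five-level ATC codes (one letter, two digits, two letters, two digits, as in A10BA02 for metformin). Files that carry shorter, higher-level codes would need a looser pattern.

```python
import re

# Assumed shape of a full (level 5) ATC code, e.g. "A10BA02".
ATC_LEVEL5 = re.compile(r"[A-Z]\d{2}[A-Z]{2}\d{2}")


def clean_atc(raw: str) -> str:
    """Remove whitespace and full stops from a raw ATC code string."""
    return re.sub(r"[\s.]", "", raw)


def is_level5_atc(code: str) -> bool:
    """Return True if the cleaned code has the level 5 ATC shape."""
    return ATC_LEVEL5.fullmatch(code) is not None


# Hypothetical raw values of the kind described above.
raw_codes = ["A10 BA.02", "A10.BA.02", " A10BA02 ", "A10B"]
cleaned = [clean_atc(code) for code in raw_codes]
usable = [code for code in cleaned if is_level5_atc(code)]
```

Anything that still fails the shape check after cleaning is left for manual review rather than guessed at, in keeping with the high confidence matching described above.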
Refusal to provide (full) data
Three entities, all based in Europe, refused to provide the necessary data. One stated that the data was evolving and therefore could not be provided. Another ignored specific requests for files and ultimately offered a file that had not been requested. The third provided only truncated data. We used letters, emails and phone calls to request a database extract, but nothing was provided.
Inconvenient Data
Inconvenient data is our term for data which is fragmented and needs to be reconstituted to create a minimally interoperable file. A typical example is a set of several fragmented files that appear to have been extracted raw from a database. This occurred in three countries. It was somewhat tedious to deal with but was successfully overcome. However, it raises the question of how an average citizen or researcher would access this data.
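As an illustration of what reconstituting fragmented files involves, the sketch below joins two hypothetical fragments, a products table and a substances table linked by an identifier, into one flat record per product. The file layout, column names and sample rows are assumptions made for the example and do not describe any particular regulator's files.

```python
import csv
import io
from typing import Dict, List

# Hypothetical fragments, as they might look when exported raw from a database.
PRODUCTS_CSV = """product_id,name,substance_id
1,Metformin 500 mg tablets,10
2,Unknown product,99
"""

SUBSTANCES_CSV = """substance_id,active_substance,atc_code
10,metformin,A10BA02
"""


def read_rows(text: str) -> List[Dict[str, str]]:
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def reconstitute(products: List[Dict[str, str]],
                 substances: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Attach substance details to each product via the shared substance_id."""
    by_id = {row["substance_id"]: row for row in substances}
    merged = []
    for product in products:
        substance = by_id.get(product["substance_id"], {})
        merged.append({
            "product_name": product["name"],
            "active_substance": substance.get("active_substance", ""),
            "atc_code": substance.get("atc_code", ""),
        })
    return merged


records = reconstitute(read_rows(PRODUCTS_CSV), read_rows(SUBSTANCES_CSV))
```

Products whose identifier has no match keep empty fields rather than being dropped, so they can be counted among the records that are not valid for mapping.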
Our Recommendations
- We suspect that the countries with smaller indexes may not be tracking the complete list of regulated medications in their markets. We base this on our experience with hospital formularies from both Ireland and the United Kingdom, where the files had approximately 20,000 listed items, of which circa 13,000 were pharmaceutical products. It does not seem likely that lists of fewer than 10,000 items, especially those that hold multiple iterations of the same drug by pack size and dose, represent everything that is, or can be, on the local market, particularly when neighbouring countries may have up to 150,000 medications listed on the regulator site.
- Related to the above, we also suspect that many national governments are not getting the optimal stocks of generic and brand name medications, because doing so would require knowing the universe of medications available, not just nationally but across the EU. This likely has an impact on national medications costs.
- The ISO IDMP (Identification of Medicinal Products) is a set of International Organization for Standardization (ISO) standards providing a global framework for uniquely identifying and describing medicinal products. In 2015 the European Medicines Agency declared that they would be gradually adopting this data model[1]. The data model inside the IDMP is quite comprehensive and goes far beyond the data that most regulators actually publish. However, regulators should do the following to achieve permanent cross-border interoperability within Europe:
  - Release an IDMP data model compliant version, where the same data already provided is given again under IDMP data headings.
  - Release a version of the above in the English language.
  - Ensure that the files are provided in multiple formats, including .json, .csv, .txt and .xlsx, and more as they become popular.
  - Ensure that the files are as rich as possible regarding content and are not truncated or divided.
This is not an expensive endeavour and would allow any user (including regulators themselves) to combine datasets easily from counterparty sources. This is essentially what our MDIP project has done, but without using the IDMP data model, as adopting it would have required a very significant amount of work.
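To show why shared headings make combining datasets cheap, here is a minimal sketch that maps two regulators' local column names onto one common set of headings and then exports the combined result as JSON and CSV. The regulator names, local headings and sample rows are hypothetical, and the shared heading names are placeholders, not actual IDMP field names.

```python
import csv
import io
import json
from typing import Dict, List

# Placeholder shared headings (NOT real IDMP field names).
SHARED_FIELDS = ["source", "product_name", "active_substance", "atc_code"]

# Hypothetical mapping from each regulator's local headings to the shared ones.
HEADER_MAPS = {
    "regulator_a": {"Nom": "product_name",
                    "Substance active": "active_substance",
                    "Code ATC": "atc_code"},
    "regulator_b": {"Bezeichnung": "product_name",
                    "Wirkstoff": "active_substance",
                    "ATC-Code": "atc_code"},
}


def harmonise(source: str, rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Rename local headings to the shared headings and tag each row's source."""
    mapping = HEADER_MAPS[source]
    harmonised = []
    for row in rows:
        record = {"source": source}
        for local_name, shared_name in mapping.items():
            record[shared_name] = row.get(local_name, "")
        harmonised.append(record)
    return harmonised


def to_json(records: List[Dict[str, str]]) -> str:
    """Serialise records as JSON, keeping non-ASCII characters readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_csv(records: List[Dict[str, str]]) -> str:
    """Serialise records as CSV using the shared headings as the header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=SHARED_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


combined = (
    harmonise("regulator_a", [{"Nom": "Metformine 500 mg",
                               "Substance active": "metformine",
                               "Code ATC": "A10BA02"}])
    + harmonise("regulator_b", [{"Bezeichnung": "Metformin 500 mg",
                                 "Wirkstoff": "Metformin",
                                 "ATC-Code": "A10BA02"}])
)
```

If every regulator published under the same headings, the per-regulator mapping table would disappear and only the export step would remain.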
- Regulators should begin to issue unique identifiers for each drug using clinically accepted terminologies. This already occurs for most ATC codes and active substances, but should also cover the specific physical drug package. The terminologies should include at least ICD and SNOMED, and the clinical identifiers should be mapped between each terminology and to the Unified Medical Language System (UMLS). This has already been done in the United Kingdom, where NHS England applied a subset of SNOMED to provide a unique string of numerals per physical drug package. Doing the above combines clinical identifiers with regulatory identifiers.
- Enriched, international medications indexes of the type we recommend above would be an extremely useful contribution to the European Health Data Space (EHDS). They could become a definitive dataset for EU regulated medicines, and they form the nucleus of an off-the-shelf knowledge graph for medications, which can then be exploited in any direction yet remain interoperable at its core.
Reference
[1] European Medicines Agency, 2015, Introduction to ISO Identification of Medicinal Products, SPOR programme, Reference Number: EMA/732656/2015.