The Business Cases
The demonstrator applications aim to provide solutions beyond the state of the art for the following business needs of an evolving data-driven entrepreneurship:
- Joint data usage within corporate environments: addresses the need for secure joint data usage between different units in a corporate environment.
- Joint data usage between different enterprises in the same domain: addresses the need for secure data exchange between enterprises that find it useful to analyse each other’s data, e.g. MPC/PSI usage between telecom and banking companies aiming to intersect their datasets.
- Joint data usage between different enterprises in different domains: enterprises would like to establish standard processes for sharing their data with specialised external entities, e.g. data analysis consultants, who perform the analysis.
- Data valuation: Safe-DEED aims to provide tools that facilitate the assessment of data value, thus incentivising data owners to use the cryptographic protocols to create value for their companies and their clients.
The demonstrator applications
Private set intersection application
This part of the demonstrator exemplifies the power of private set intersection (PSI) protocols in the context of marketing. In particular, it enables two companies, our “own” company and another company “C” (e.g. a bank operating in the same territory as our company), to improve their marketing strategies for the customers they have in common.
A further part of the demonstrator performs a de-anonymisability analysis of the dataset. Even when a dataset contains no personally identifying information (PII), such as name or address, individuals can still be identified through their quasi-identifiers (QIs). QIs are attributes whose combination can serve as a unique identifier for an individual.
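The effect of quasi-identifiers can be illustrated with a minimal sketch. It is not the demonstrator's analysis; the column names (`zip`, `birth_year`, `gender`) and the records are hypothetical and chosen only to show how a combination of harmless-looking attributes can single out individuals. The function reports the share of records that are unique on a given set of QIs.

```python
from collections import Counter


def uniqueness_ratio(records, quasi_identifiers):
    """Share of records whose quasi-identifier combination occurs exactly once."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    class_sizes = Counter(key(r) for r in records)
    unique = sum(1 for r in records if class_sizes[key(r)] == 1)
    return unique / len(records)


# Hypothetical records without any direct identifier (no name, no address).
records = [
    {"zip": "1010", "birth_year": 1980, "gender": "F"},
    {"zip": "1010", "birth_year": 1980, "gender": "F"},
    {"zip": "1020", "birth_year": 1975, "gender": "M"},
    {"zip": "1030", "birth_year": 1990, "gender": "F"},
]

print(uniqueness_ratio(records, ["gender"]))                       # 0.25
print(uniqueness_ratio(records, ["zip", "birth_year", "gender"]))  # 0.5
```

In this toy data, gender alone singles out one record in four, while the full combination singles out half of the records.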
Data valuation application
This part of the demonstrator describes the initial implementation of the Data Valuation Component (DVC).
The supported algorithms are:
- selected regression, classification and clustering algorithms at ADAS level (see below for an explanation of the algorithms);
- a rule-based algorithm for generating the economic value of the input data set (at S2VM level).
Scenario 1 – MPC
The chosen architecture consists of the following components:
- The PSI library (respectively its wrapper)
- The demonstrator UI
- Company’s demographic data
In this version of the demonstrator, the demographic data is preloaded. One company then starts the PSI library as a server and the other company starts it as a client. Once the connection between the two companies is established, the PSI library runs and outputs the intersection of the two sets to the party that initiated the interaction. The demonstrator UI then displays the resulting set to the user.
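The flow described above can be sketched with a toy Diffie-Hellman-style PSI run in a single process. This is an assumption made for illustration: the text does not state which protocol the PSI library implements, and none of the names below (`psi_intersection`, `hash_to_group`, the modulus `P`) come from it. The modulus is far too small for real use; the sketch only shows why the initiating party learns the intersection without either side sending its identifiers in the clear.

```python
import hashlib
import math
import secrets

# Toy modulus (the Mersenne prime 2**127 - 1): illustrative only, not secure.
P = 2**127 - 1


def hash_to_group(item: str) -> int:
    """Map an identifier to an integer in [1, P - 1]."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def new_secret_key() -> int:
    """Pick an exponent coprime to P - 1, so blinding is a bijection."""
    while True:
        key = secrets.randbelow(P - 3) + 2
        if math.gcd(key, P - 1) == 1:
            return key


def blind(value: int, key: int) -> int:
    """Blind a group element by raising it to a secret exponent mod P."""
    return pow(value, key, P)


def psi_intersection(client_items, server_items):
    """Simulate both parties; return the intersection learned by the client."""
    client_items = list(client_items)
    client_key, server_key = new_secret_key(), new_secret_key()

    # Client (initiator) -> server: its items, blinded with the client key.
    client_blinded = [blind(hash_to_group(x), client_key) for x in client_items]

    # Server -> client: the client's values blinded again (same order),
    # plus the server's own items blinded once and shuffled.
    client_double = [blind(v, server_key) for v in client_blinded]
    server_blinded = [blind(hash_to_group(y), server_key) for y in server_items]
    secrets.SystemRandom().shuffle(server_blinded)

    # Client: blind the server's values with its own key. Exponentiation
    # commutes, so equal items end up with equal double-blinded values.
    server_double = {blind(v, client_key) for v in server_blinded}
    return {x for x, v in zip(client_items, client_double) if v in server_double}


common = psi_intersection(["alice", "bob", "carol"], ["bob", "dave", "carol"])
print(sorted(common))  # ['bob', 'carol']
```

In a deployment the two roles would run on separate machines and exchange the blinded lists over the established connection; here both are simulated in one function for brevity.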
Scenario 2 – De-anonymisability
Scenario 2.1 – De-anonymisation risk analysis
The goal of this trial was to apply a battery of de-anonymisation tests to the company’s data in order to raise privacy red flags.
Scenario 2.2 – K-anonymisation with Privacy Tool
The goal of this trial was to reduce the de-anonymisation risks should the company decide to release, exchange or sell its dataset to one or more third parties.
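As a rough illustration of what k-anonymisation does, the sketch below generalises two quasi-identifiers until every combination is shared by at least k records. It does not reproduce the Privacy Tool or its interface; the column names, the records and the generalisation rules (masking the last digit of the postcode, replacing the birth year with its decade) are hypothetical.

```python
from collections import Counter

QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")


def smallest_class(records, quasi_identifiers=QUASI_IDENTIFIERS):
    """Return k: the size of the smallest group sharing the same QI values."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values())


def generalise(record):
    """Coarsen the QIs: mask the last postcode digit, bucket the year by decade."""
    out = dict(record)
    out["zip"] = record["zip"][:3] + "*"
    decade = record["birth_year"] // 10 * 10
    out["birth_year"] = f"{decade}-{decade + 9}"
    return out


# Hypothetical records; every one is unique on its QIs before generalisation.
records = [
    {"zip": "1010", "birth_year": 1981, "gender": "F"},
    {"zip": "1012", "birth_year": 1984, "gender": "F"},
    {"zip": "1020", "birth_year": 1975, "gender": "M"},
    {"zip": "1027", "birth_year": 1979, "gender": "M"},
]

generalised = [generalise(r) for r in records]
print(smallest_class(records))      # 1
print(smallest_class(generalised))  # 2
```

The trade-off is visible even in this toy case: the generalised data is 2-anonymous, but exact postcodes and birth years are no longer available for analysis.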
Scenario 3 – Data valuation
For this trial we used a sample dataset.
Two types of output are generated: the “console output” and the “profiles data set”.
The profile files include the following sections:
- Dataset info: size, shape, duplicate percentage
- Variables types: how many columns of each type
- Warnings: flags for columns in which a single value, zeros, etc. make up a large proportion of the entries.
- Variables: a summary of the descriptive statistics for each column, including a histogram of the distribution, the number of unique values, the number of missing values, and the mean, standard deviation, maximum and minimum. Toggling the details of a column shows detailed quantile statistics, descriptive statistics, the histogram of the distribution, the common values and the extreme values (the five largest and the five smallest).
- Correlations: a set of correlation matrices, one per correlation coefficient, computed between all pairs of columns. Columns with a high correlation coefficient are suggested for exclusion from model design.
- Missing values: a histogram of how many values are present in each of the columns.
- Sample: a sample from the head and tail of the dataset.
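A few of the sections listed above can be approximated with a short standard-library sketch. This is not the component that produces the profile files; the function names, the 0.9 warning threshold and the sample rows are assumptions made for illustration. It covers the dataset info, the missing values, a dominant-value warning and a Pearson correlation between two numeric columns.

```python
from collections import Counter
from statistics import mean


def dataset_info(rows):
    """Row count, column count and percentage of fully duplicated rows."""
    distinct = {tuple(sorted(row.items())) for row in rows}
    duplicate_pct = 100 * (len(rows) - len(distinct)) / len(rows) if rows else 0.0
    return {
        "rows": len(rows),
        "columns": len(rows[0]) if rows else 0,
        "duplicate_pct": duplicate_pct,
    }


def missing_counts(rows):
    """Number of missing (None) values per column."""
    return {col: sum(row[col] is None for row in rows) for col in rows[0]}


def dominant_value_warnings(rows, threshold=0.9):
    """Columns in which one value makes up at least `threshold` of the rows."""
    flagged = []
    for col in rows[0]:
        top_count = Counter(row[col] for row in rows).most_common(1)[0][1]
        if top_count / len(rows) >= threshold:
            flagged.append(col)
    return flagged


def pearson(rows, col_a, col_b):
    """Pearson correlation over the rows where both columns are present."""
    pairs = [(r[col_a], r[col_b]) for r in rows
             if r[col_a] is not None and r[col_b] is not None]
    mean_a = mean(a for a, _ in pairs)
    mean_b = mean(b for _, b in pairs)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical sample rows; None marks a missing value.
rows = [
    {"age": 30, "income": 3000, "country": "AT"},
    {"age": 40, "income": 4000, "country": "AT"},
    {"age": 50, "income": 5000, "country": "AT"},
    {"age": 40, "income": 4000, "country": "AT"},
    {"age": None, "income": 2500, "country": "AT"},
]

print(dataset_info(rows))
print(missing_counts(rows))
print(dominant_value_warnings(rows))
print(round(pearson(rows, "age", "income"), 6))
```

On these rows the sketch reports one duplicated row out of five, one missing `age`, a warning for the constant `country` column, and a correlation of 1 between `age` and `income`, which is the kind of pair the correlations section would suggest reducing to a single column.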