Module tiresias.core.regression

Source code
import numpy as np
import diffprivlib.models as dp
from tiresias.core.mechanisms import approximate_bounds

class LinearRegression(dp.LinearRegression):

    def fit(self, X, y, sample_weight=None):
        # TODO: concat X and y for norm, specify ranges
        if not self.data_norm:
            # Split the privacy budget: half pays for estimating the data
            # norm below, half is left for the regression itself.
            self.epsilon /= 2.0
            # Differentially private estimate of the maximum row norm.
            row_norms = np.linalg.norm(X, axis=1)
            _, max_norm = approximate_bounds(row_norms, self.epsilon)
            self.data_norm = max_norm
            # Copy before clipping so the caller's array is never mutated,
            # in keeping with the copy_X contract.
            X = np.array(X, dtype=float)
            # Scale down any row whose norm exceeds the private estimate.
            for i in range(X.shape[0]):
                norm = np.linalg.norm(X[i])
                if norm > self.data_norm:
                    X[i] = X[i] * (self.data_norm - 1e-5) / norm
        return super().fit(X, y, sample_weight=sample_weight)
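When data_norm is not supplied, the override above spends half of epsilon on a private estimate of the maximum row norm. A minimal sketch of approximate_bounds at that call site; the synthetic data and epsilon value are illustrative assumptions:

import numpy as np
from tiresias.core.mechanisms import approximate_bounds

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
row_norms = np.linalg.norm(X, axis=1)

# Differentially private (lower, upper) bounds on the row norms; fit()
# above keeps only the upper bound as data_norm.
_, max_norm = approximate_bounds(row_norms, 0.5)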

Classes

class LinearRegression (epsilon=1.0, data_norm=None, range_X=None, range_y=None, fit_intercept=True, copy_X=True, **unused_args)

Ordinary least squares Linear Regression with differential privacy.

LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation. Differential privacy is guaranteed with respect to the training sample.

Differential privacy is achieved by adding noise to the second moment matrix using the Wishart mechanism. This method is demonstrated in [She15], but our implementation takes inspiration from the use of the Wishart distribution in [IS16] to achieve a strict differential privacy guarantee.
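For orientation, a minimal usage sketch; the synthetic data and parameter values here are illustrative assumptions, not recommendations:

import numpy as np
from tiresias.core.regression import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=500)

# data_norm is chosen from domain knowledge, not computed from the data.
# range_X / range_y are omitted here, so they are inferred from the data
# and a PrivacyLeakWarning is emitted (see the parameter notes below).
model = LinearRegression(epsilon=1.0, data_norm=4.0)
model.fit(X, y)
predictions = model.predict(X)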

Parameters

epsilon : float, optional, default 1.0
Privacy parameter ε.
data_norm : float, default: None

The max l2 norm of any row of the concatenated dataset A = [X; y]. This defines the spread of data that will be protected by differential privacy.

If not specified, the max norm is taken from the data when .fit() is first called, but will result in a PrivacyLeakWarning, as it reveals information about the data. To preserve differential privacy fully, data_norm should be selected independently of the data, i.e. with domain knowledge (see the sketch after this parameter list).

range_X : array_like

Range of each feature of the training sample X. Its non-private equivalent is np.ptp(X, axis=0).

If not specified, the range is taken from the data when .fit() is first called, but will result in a PrivacyLeakWarning, as it reveals information about the data. To preserve differential privacy fully, range_X should be selected independently of the data, i.e. with domain knowledge.

range_y : array_like
Same as range_X, but for the training label set y.
fit_intercept : bool, optional, default True
Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).
copy_X : bool, optional, default True
If True, X will be copied; else, it may be overwritten.
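As referenced above, a sketch of the non-private equivalents of data_norm, range_X, and range_y. Computing these on the private data is exactly what the PrivacyLeakWarning flags, so in practice the values should come from domain knowledge; the synthetic data is an illustrative assumption:

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))
y = rng.uniform(-3.0, 3.0, size=500)

A = np.hstack([X, y.reshape(-1, 1)])         # concatenated dataset A = [X; y]
data_norm = np.linalg.norm(A, axis=1).max()  # max l2 norm of any row of A
range_X = np.ptp(X, axis=0)                  # per-feature spread of X
range_y = np.ptp(y, axis=0)                  # spread of the targets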

Attributes

coef_ : array of shape (n_features, ) or (n_targets, n_features)
Estimated coefficients for the linear regression problem. If multiple targets are passed during fit (y is 2D), this is a 2D array of shape (n_targets, n_features); if only one target is passed, this is a 1D array of length n_features (see the sketch after this list).
rank_ : int
Rank of matrix X.
singular_ : array of shape (min(n_samples, n_features),)
Singular values of X.
intercept_ : float or array of shape of (n_targets,)
Independent term in the linear model. Set to 0.0 if fit_intercept = False.
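A brief sketch of reading these attributes after a fit; the data and parameter values are illustrative assumptions:

import numpy as np
from tiresias.core.regression import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + 0.5

model = LinearRegression(epsilon=1.0, data_norm=4.0).fit(X, y)
model.coef_       # one target, so a 1D array of length n_features (3,)
model.intercept_  # float; would be 0.0 with fit_intercept=False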

References

[She15] Sheffet, Or. "Private approximations of the 2nd-moment matrix using existing techniques in linear regression." arXiv preprint arXiv:1507.00056 (2015).

[IS16] Imtiaz, Hafiz, and Anand D. Sarwate. "Symmetric matrix perturbation for differentially-private principal component analysis." In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2339-2343. IEEE, 2016.


Ancestors

  • diffprivlib.models.linear_regression.LinearRegression
  • sklearn.linear_model.base.LinearRegression
  • sklearn.linear_model.base.LinearModel
  • abc.NewBase
  • sklearn.base.BaseEstimator
  • sklearn.base.RegressorMixin

Methods

def fit(self, X, y, sample_weight=None)

Fit linear model.

Parameters

X : array-like or sparse matrix, shape (n_samples, n_features)
Training data.
y : array_like, shape (n_samples, n_targets)
Target values. Will be cast to X's dtype if necessary.
sample_weight : ignored
Ignored by diffprivlib. Present for consistency with sklearn API.

Returns

self : returns an instance of self.
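A sketch of the two paths through this override (data and values are illustrative assumptions): with data_norm supplied, the full epsilon funds the regression; without it, half of epsilon is first spent estimating the norm privately.

import numpy as np
from tiresias.core.regression import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=500)

# Path 1: data_norm supplied, so all of epsilon goes to the regression.
LinearRegression(epsilon=1.0, data_norm=4.0).fit(X, y)

# Path 2: data_norm is None, so fit() halves epsilon, estimates the max
# row norm with approximate_bounds, and clips rows that exceed it.
LinearRegression(epsilon=1.0).fit(X, y)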
