fit function

l0learn.fit(X: Union[np.ndarray, csc_matrix], y: np.ndarray, unicode loss: str = u'SquaredError', unicode penalty: str = u'L0', unicode algorithm: str = u'CD', max_support_size: int = 100, num_lambda: Optional[int] = 100, num_gamma: Optional[int] = 1, double gamma_max: float = 10., double gamma_min: float = .0001, partial_sort: bool = True, max_iter: int = 200, double rtol: float = 1e-6, double atol: float = 1e-9, active_set: bool = True, active_set_num: int = 3, max_swaps: int = 100, double scale_down_factor: float = 0.8, screen_size: int = 1000, lambda_grid: Optional[List[Sequence[float]]] = None, exclude_first_k: int = 0, intercept: bool = True, lows: Union[np.ndarray, float] = -float(u'inf'), highs: Union[np.ndarray, float] = +float(u'inf')) l0learn.models.FitModel

Computes the regularization path for the specified loss function and penalty function.

Parameters
  • X (np.ndarray or csc_matrix of shape (N, P)) – Data Matrix where rows of X are observations and columns of X are features

  • y (np.ndarray of shape (P)) – The response vector where y[i] corresponds to X[i, :] For classification, a binary vector (-1, 1) is requried .

  • loss (str) –

    The loss function. Currently supports the choices:

    ”SquaredError” (for regression), “Logistic” (for logistic regression), and “SquaredHinge” (for smooth SVM).

  • penalty (str) –

    The type of regularization. This can take either one of the following choices:

    ”L0”, “L0L2”, and “L0L1”

  • algorithm (str) – The type of algorithm used to minimize the objective function. Currently “CD” and “CDPSI” are are supported. “CD” is a variant of cyclic coordinate descent and runs very fast. “CDPSI” performs local combinatorial search on top of CD and typically achieves higher quality solutions (at the expense of increased running time).

  • max_support_size (int) – Must be greater than 0. The maximum support size at which to terminate the regularization path. We recommend setting this to a small fraction of min(n,p) (e.g. 0.05 * min(n,p)) as L0 regularization typically selects a small portion of non-zeros.

  • num_lambda (int, optional) – The number of lambda values to select in the regularization path. This value must be None if lambda_grid is supplied.When supplied, must be greater than 0. Note: lambda is the regularization parameter corresponding to the L0 norm.

  • num_gamma (int, optional) – The number of gamma values to select in the regularization path. This value must be None if lambda_grid is supplied. When supplied, must be greater than 0. Note: gamma is the regularization parameter corresponding to L1 or L2, depending on the chosen penalty).

  • gamma_max (float) –

    The maximum value of gamma when using the L0L2 penalty. This value must be greater than 0.

    Note: For the L0L1 penalty this is automatically selected.

  • gamma_min (float) – The minimum value of Gamma when using the L0L2 penalty. This value must be greater than 0 but less than gamma_max. Note: For the L0L1 penalty, the minimum value of gamma in the grid is set to gammaMin * gammaMax.

  • partial_sort (bool) – If TRUE partial sorting will be used for sorting the coordinates to do greedy cycling (see our paper for for details). Otherwise, full sorting is used. #TODO: Add link for paper

  • max_iter (int) – The maximum number of iterations (full cycles) for CD per grid point. The algorithm may not use the full number of iteration per grid point if convergence is found (defined by rtol and atol parameter) Must be greater than 0

  • rtol (float) – The relative tolerance which decides when to terminate optimization as based on the relative change in the objective between iterations. Must be greater than 0 and less than 1.

  • atol (float) – The absolute tolerance which decides when to terminate optimization as based on the absolute L2 norm of the residuals Must be greater than 0

  • active_set (bool) – If TRUE, performs active set updates. (see our paper for for details). #TODO: Add link for paper

  • active_set_num (int) –

    The number of consecutive times a support should appear before declaring support stabilization. (see our paper for for details). #TODO: Add link for paper

    Must be greater than 0.

  • max_swaps (int) – The maximum number of swaps used by CDPSI for each grid point. Must be greater than 0. Ignored by CD algorithims.

  • scale_down_factor (float) –

    Roughly amount each lambda value is scaled by between grid points. Larger values lead to closer lambdas and typically to smaller gaps between the support sizes.

    For details, see our paper - Section 5 on Adaptive Selection of Tuning Parameters). #TODO: Add link for paper

    Must be greater than 0 and less than 1 (strictly for both.)

  • screen_size (int) –

    The number of coordinates to cycle over when performing initial correlation screening. #TODO: Add link for paper

    Must be greater than 0 and less than number of columns of X.

  • lambda_grid (list of list of floats) –

    A grid of lambda values to use in computing the regularization path. This is by default an empty list and is ignored. When specified, lambda_grid should be a list of list of floats, where the ith element

    (corresponding to the ith gamma) should be a decreasing sequence of lambda values. The length of this sequence is directly the number of lambdas to be tried for that gamma.

    In the the “L0” penalty case, lambda_grid should be a list of 1. In the “L0LX” penalty cases, lambda_grid can be a list of any length. The length of lambda_grid will be the number of gamma values tried.

    See the example notebook for more details.

    Note: When lambda_grid is supplied, num_gamma and num_lambda must be None.

  • exclude_first_k (int) –

    The first exclude_first_k features in X will be excluded from variable selection. In other words, the first exclude_first_k variables will not be included in the L0-norm penalty however they will be included in the L1 or L2 norm penalties, if they are specified.

    Must be a positive integer less than the columns of X.

  • intercept (bool) – If False, no intercept term is included or fit in the regularization path Intercept terms are not regularized by L0 or L1/L2.

  • lows (np array or float) –

    Lower bounds for coefficients. Either a scalar for all coefficients to have the same bound or a vector of size p (number of columns of X) where lows[i] is the lower bound for coefficient i.

    Lower bounds can not be above 0 (i.e. we can not specify that all coefficients must be larger than a > 0). Lower bounds can be set to 0 iff the corresponding upper bound for that coefficient is also not 0.

  • highs (np array or float) –

    Upper bounds for coefficients. Either a scalar for all coefficients to have the same bound or a vector of size p (number of columns of X) where highs[i] is the upper bound for coefficient i.

    Upper bounds can not be below 0 (i.e. we can not specify that all coefficients must be smaller than a < 0). Upper bounds can be set to 0 iff the corresponding lower bound for that coefficient is also not 0.

Returns

fit_model – FitModel instance containing all relevant information from the solution path.

Return type

l0learn.models.FitModel

Examples

>>>fit_model = l0learn.fit(X, y, penalty=”L0”, max_support_size=20)

cvfit function

l0learn.cvfit(X: Union[np.ndarray, csc_matrix], y: np.ndarray, unicode loss: str = u'SquaredError', unicode penalty: str = u'L0', unicode algorithm: str = u'CD', num_folds: int = 10, seed: int = 1, max_support_size: int = 100, num_lambda: Optional[int] = 100, num_gamma: Optional[int] = 1, double gamma_max: float = 10., double gamma_min: float = .0001, partial_sort: bool = True, max_iter: int = 200, double rtol: float = 1e-6, double atol: float = 1e-9, active_set: bool = True, active_set_num: int = 3, max_swaps: int = 100, double scale_down_factor: float = 0.8, screen_size: int = 1000, lambda_grid: Optional[List[Sequence[float]]] = None, exclude_first_k: int = 0, intercept: bool = True, lows: Union[np.ndarray, float] = -float(u'inf'), highs: Union[np.ndarray, float] = +float(u'inf')) l0learn.models.CVFitModel

Computes the regularization path for the specified loss function and penalty function and performs K-fold cross-validation.

Parameters
  • X (np.ndarray or csc_matrix of shape (N, P)) – Data Matrix where rows of X are observations and columns of X are features

  • y (np.ndarray of shape (P)) – The response vector where y[i] corresponds to X[i, :] For classification, a binary vector (-1, 1) is requried .

  • loss (str) –

    The loss function. Currently supports the choices:

    ”SquaredError” (for regression), “Logistic” (for logistic regression), and “SquaredHinge” (for smooth SVM).

  • penalty (str) –

    The type of regularization. This can take either one of the following choices:

    ”L0”, “L0L2”, and “L0L1”

  • algorithm (str) – The type of algorithm used to minimize the objective function. Currently “CD” and “CDPSI” are are supported. “CD” is a variant of cyclic coordinate descent and runs very fast. “CDPSI” performs local combinatorial search on top of CD and typically achieves higher quality solutions (at the expense of increased running time).

  • num_folds (int) – Must be greater than 1 and less than N (number of . The number of folds for cross-validation.

  • max_support_size (int) – Must be greater than 0. The maximum support size at which to terminate the regularization path. We recommend setting this to a small fraction of min(n,p) (e.g. 0.05 * min(n,p)) as L0 regularization typically selects a small portion of non-zeros.

  • num_lambda (int, optional) – The number of lambda values to select in the regularization path. This value must be None if lambda_grid is supplied.When supplied, must be greater than 0. Note: lambda is the regularization parameter corresponding to the L0 norm.

  • num_gamma (int, optional) – The number of gamma values to select in the regularization path. This value must be None if lambda_grid is supplied. When supplied, must be greater than 0. Note: gamma is the regularization parameter corresponding to L1 or L2, depending on the chosen penalty).

  • gamma_max (float) –

    The maximum value of gamma when using the L0L2 penalty. This value must be greater than 0.

    Note: For the L0L1 penalty this is automatically selected.

  • gamma_min (float) – The minimum value of Gamma when using the L0L2 penalty. This value must be greater than 0 but less than gamma_max. Note: For the L0L1 penalty, the minimum value of gamma in the grid is set to gammaMin * gammaMax.

  • partial_sort (bool) – If TRUE partial sorting will be used for sorting the coordinates to do greedy cycling (see our paper for for details). Otherwise, full sorting is used. #TODO: Add link for paper

  • max_iter (int) – The maximum number of iterations (full cycles) for CD per grid point. The algorithm may not use the full number of iteration per grid point if convergence is found (defined by rtol and atol parameter) Must be greater than 0

  • rtol (float) – The relative tolerance which decides when to terminate optimization as based on the relative change in the objective between iterations. Must be greater than 0 and less than 1.

  • atol (float) – The absolute tolerance which decides when to terminate optimization as based on the absolute L2 norm of the residuals Must be greater than 0

  • active_set (bool) – If TRUE, performs active set updates. (see our paper for for details). #TODO: Add link for paper

  • active_set_num (int) –

    The number of consecutive times a support should appear before declaring support stabilization. (see our paper for for details). #TODO: Add link for paper

    Must be greater than 0.

  • max_swaps (int) – The maximum number of swaps used by CDPSI for each grid point. Must be greater than 0. Ignored by CD algorithims.

  • scale_down_factor (float) –

    Roughly amount each lambda value is scaled by between grid points. Larger values lead to closer lambdas and typically to smaller gaps between the support sizes.

    For details, see our paper - Section 5 on Adaptive Selection of Tuning Parameters). #TODO: Add link for paper

    Must be greater than 0 and less than 1 (strictly for both.)

  • screen_size (int) –

    The number of coordinates to cycle over when performing initial correlation screening. #TODO: Add link for paper

    Must be greater than 0 and less than number of columns of X.

  • lambda_grid (list of list of floats) –

    A grid of lambda values to use in computing the regularization path. This is by default an empty list and is ignored. When specified, lambda_grid should be a list of list of floats, where the ith element

    (corresponding to the ith gamma) should be a decreasing sequence of lambda values. The length of this sequence is directly the number of lambdas to be tried for that gamma.

    In the the “L0” penalty case, lambda_grid should be a list of 1. In the “L0LX” penalty cases, lambda_grid can be a list of any length. The length of lambda_grid will be the number of gamma values tried.

    See the example notebook for more details.

    Note: When lambda_grid is supplied, num_gamma and num_lambda must be None.

  • exclude_first_k (int) –

    The first exclude_first_k features in X will be excluded from variable selection. In other words, the first exclude_first_k variables will not be included in the L0-norm penalty however they will be included in the L1 or L2 norm penalties, if they are specified.

    Must be a positive integer less than the columns of X.

  • intercept (bool) – If False, no intercept term is included or fit in the regularization path Intercept terms are not regularized by L0 or L1/L2.

  • lows (np array or float) –

    Lower bounds for coefficients. Either a scalar for all coefficients to have the same bound or a vector of size p (number of columns of X) where lows[i] is the lower bound for coefficient i.

    Lower bounds can not be above 0 (i.e. we can not specify that all coefficients must be larger than a > 0). Lower bounds can be set to 0 iff the corresponding upper bound for that coefficient is also not 0.

  • highs (np array or float) –

    Upper bounds for coefficients. Either a scalar for all coefficients to have the same bound or a vector of size p (number of columns of X) where highs[i] is the upper bound for coefficient i.

    Upper bounds can not be below 0 (i.e. we can not specify that all coefficients must be smaller than a < 0). Upper bounds can be set to 0 iff the corresponding lower bound for that coefficient is also not 0.

Returns

fit_model – FitModel instance containing all relevant information from the solution path.

Return type

l0learn.models.FitModel

Examples

>>>fit_model = l0learn.fit(X, y, penalty=”L0”, max_support_size=20)

FitModels

class l0learn.models.FitModel(settings: Dict[str, Any], lambda_0: List[List[float]], gamma: List[float], support_size: List[List[int]], coeffs: List[scipy.sparse.csc.csc_matrix], intercepts: List[List[float]], converged: List[List[bool]])

FitModel returned by calling l0learn.fit(…)

CVFitModels

class l0learn.models.CVFitModel(settings: Dict[str, Any], lambda_0: List[List[float]], gamma: List[float], support_size: List[List[int]], coeffs: List[scipy.sparse.csc.csc_matrix], intercepts: List[List[float]], converged: List[List[bool]], cv_means: List[numpy.ndarray], cv_sds: List[numpy.ndarray])

Generating Functions

l0learn.models.gen_synthetic(n: int, p: int, k: int, seed: Optional[int] = None, rho: float = 0, b0: float = 0, snr: float = 1) Dict[str, Union[float, numpy.ndarray]]
Generates a synthetic dataset as follows:
  1. Sample every element in data matrix X from N(0,1).

  2. Generate a vector B with the first k entries set to 1 and the rest are zeros.

  3. Sample every element in the noise vector e from N(0,A) where A is selected so y, X, B have snr as specified.

  4. Set y = XB + b0 + e.

Parameters
  • n (int) – Number of samples

  • p (int) – Number of features

  • k (int) – Number of non-zeros in true vector of coefficients

  • seed (int, optional) – The seed used for randomly generating the data If None, numbers will be random

  • rho (float, default 0.) – The threshold for setting X values to 0. if |X[i, j]| < rho => X[i, j] <- 0

  • b0 (float, default 0) – The intercept value to translate y by.

  • snr (float, default 1) –

    Desired Signal-to-Noise ratio. This sets the magnitude of the error term ‘e’.

    Note: SNR is defined as SNR = Var(XB)/Var(e)

Returns

Data

A dict containing:

the data matrix X, the response vector y, the coefficients B, the error vector e, the intercept term b0.

Return type

Dict

Examples

>>>data = gen_synthetic(n=500,p=1000,k=10,seed=1)

l0learn.models.gen_synthetic_high_corr(n: int, p: int, k: int, seed: Optional[int] = None, rho: float = 0, b0: float = 0, snr: float = 1, mu: float = 0, base_cor: float = 0.8) Dict[str, Union[float, numpy.ndarray]]
Generates a synthetic dataset as follows:
  1. Generate a correlation matrix, SIG, where item [i, j] = base_core^|i-j|.

  2. Draw from a Multivariate Normal Distribution using (mu and SIG) to generate X.

  3. Generate a vector B with every ~p/k entry set to 1 and the rest are zeros.

  4. Sample every element in the noise vector e from N(0,A) where A is selected so y, X, B have snr as specified.

  5. Set y = XB + b0 + e.

Parameters
  • n (int) – Number of samples

  • p (int) – Number of features

  • k (int) – Number of non-zeros in true vector of coefficients

  • seed (int) – The seed used for randomly generating the data

  • rho (float) – The threshold for setting values to 0. if |X[i, j]| < rho => X[i, j] <- 0

  • b0 (float) – intercept value to scale y by.

  • snr (float) – desired Signal-to-Noise ratio. This sets the magnitude of the error term ‘e’. SNR is defined as SNR = Var(XB)/Var(e)

  • mu (float) – The mean for drawing from the Multivariate Normal Distribution. A scalar of vector of length p.

  • base_cor (float) – The base correlation, A in [i, j] = A^|i-j|.

Returns

data

A dict containing:

the data matrix X, the response vector y, the coefficients B, the error vector e, the intercept term b0.

Return type

dict

Examples

>>>gen_synthetic_high_corr(n=500,p=1000,k=10,seed=1)

l0learn.models.gen_synthetic_logistic(n: int, p: int, k: int, seed: Optional[int] = None, rho: float = 0, b0: float = 0, s: float = 1, mu: Optional[float] = None, base_cor: Optional[float] = None) Dict[str, Union[float, numpy.ndarray]]
Generates a synthetic dataset as follows:
  1. Generate a data matrix, X, drawn from either N(0, 1) (see gen_synthetic) or a multivariate_normal(mu, sigma)

    (See gen_synthetic_high_corr) gen_synthetic_logistic delegates these caluclations to the respective functions.

  2. Generate a vector B with k entries set to 1 and the rest are zeros.

  3. Every coordinate yi of the outcome vector y exists in {0, 1}^n is sampled independently from a Bernoulli
    distribution with success probability:

    P(yi = 1|xi) = 1/(1 + exp(-s<xi, B>))

    Source https://arxiv.org/pdf/2001.06471.pdf Section 5.1 Data Generation

Parameters
  • n (int) – Number of samples

  • p (int) – Number of features

  • k (int) – Number of non-zeros in true vector of coefficients

  • seed (int) – The seed used for randomly generating the data

  • rho (float) – The threshold for setting values to 0. if |X[i, j]| > rho => X[i, j] <- 0

  • b0 (float) – The intercept value to scale the log odds of y by. As b0 -> +inf, y will contain more 1s As b0 -> -inf, y will contain more 0s

  • s (float) – Signal-to-noise parameter. As s -> +Inf, the data generated becomes linearly separable.

  • mu (float, optional) – The mean for drawing from the Multivariate Normal Distribution. A scalar of vector of length p. If mu and base_cor are not specified, will be drawn from N(0, 1) using gen_synthetic. If mu and base_cor are specified will be drawn from multivariate_normal(mu, sigma) see gen_synthetic_high_corr

  • base_cor (float) – The base correlation, A in [i, j] = A^|i-j|. If mu and base_cor are not specified, will be drawn from N(0, 1) If mu and base_cor are specified will be drawn from multivariate_normal(mu, sigma) see gen_synthetic_high_corr

Returns

data

A dict containing:

the data matrix X the response vector y, the coefficients B, the intercept term b0.

Return type

dict

Scoring Functions

These functions are called by l0learn.models.FitModel.score() and l0learn.models.CVFitModel.score().

l0learn.models.regularization_loss(coeffs: scipy.sparse.csc.csc_matrix, l0: Union[float, Sequence[float]] = 0, l1: Union[float, Sequence[float]] = 0, l2: Union[float, Sequence[float]] = 0) Union[float, numpy.ndarray]

Calculates the regularization loss for a path of (or individual) solution(s).

Parameters
  • coeffs

  • l0

  • l1

  • l2

Returns

loss

Let the shape of coeffs be (P, K) and l0, l1, and l2 by either a float or a sequence of length K
Then loss will be of shape: (K, ) with the following layout:

loss[i] = l0[i]*||coeffs[:, i||_0 + l1[i]*||coeffs[:, i||_1 + l2[i]*||coeffs[:, i||_2

Return type

float or np.ndarray of shape (K,) where K is the number of solutions in coeffs

Notes

l0learn.models.squared_error(y_true: numpy.ndarray, y_pred: numpy.ndarray, coeffs: Optional[scipy.sparse.csc.csc_matrix] = None, l0: Union[float, Sequence[float]] = 0, l1: Union[float, Sequence[float]] = 0, l2: Union[float, Sequence[float]] = 0) numpy.ndarray

Calculates Squared Error loss of solution with optional regularization

Parameters
  • y_true (np.ndarray of shape (m, )) –

  • y_pred (np.ndarray of shape (m, ) or (m, k)) –

  • coeffs (np.ndarray of shape (p, k), optional) –

  • l0 (float or sequence of floats of shape (l)) –

  • l1 (float or sequence of floats of shape (l)) –

  • l2 (float or sequence of floats of shape (l)) –

Returns

squared_error – Shape (,) if y_pred is 1D or = (k,) if y_pred is 2D

Return type

np.ndarray

l0learn.models.logistic_loss(y_true: numpy.ndarray, y_pred: numpy.ndarray, coeffs: Optional[scipy.sparse.csc.csc_matrix] = None, l0: float = 0, l1: float = 0, l2: float = 0, eps: float = 1e-15) numpy.ndarray

Calculates Logistic Loss of solution with optional regularization

Parameters
  • y_true (np.ndarray of shape (m, )) –

  • y_pred (np.ndarray of shape (m, ) or (m, k)) –

  • coeffs (np.ndarray of shape (p, k), optional) –

  • l0 (float or sequence of floats of shape (l)) –

  • l1 (float or sequence of floats of shape (l)) –

  • l2 (float or sequence of floats of shape (l)) –

  • eps (float, default=1e-15) – Logistic loss is undefined for p=0 or p=1, so probabilities are clipped to max(eps, min(1 - eps, p)).

Returns

logistic_loss – Shape (,) if y_pred is 1D or = (k,) if y_pred is 2D

Return type

np.ndarray

l0learn.models.squared_hinge_loss(y_true: numpy.ndarray, y_pred: numpy.ndarray, coeffs: Optional[scipy.sparse.csc.csc_matrix] = None, l0: float = 0, l1: float = 0, l2: float = 0) numpy.ndarray

Calculates Logistic Loss of solution with optional regularization

Parameters
  • y_true (np.ndarray of shape (m, )) –

  • y_pred (np.ndarray of shape (m, ) or (m, k)) –

  • coeffs (np.ndarray of shape (p, k), optional) –

  • l0 (float or sequence of floats of shape (l)) –

  • l1 (float or sequence of floats of shape (l)) –

  • l2 (float or sequence of floats of shape (l)) –

Returns

squared_hinge_loss – Shape (,) if y_pred is 1D or = (k,) if y_pred is 2D

Return type

np.ndarray