cacp.comparison

cacp.comparison.process_comparison(datasets: typing.List[cacp.dataset.ClassificationDatasetBase], classifiers: typing.List[typing.Tuple[str, typing.Callable]], result_dir: pathlib.Path, metrics: typing.Sequence[typing.Tuple[str, typing.Callable]] = (('AUC', <function auc>), ('Accuracy', <function accuracy>), ('Precision', <function precision>), ('Recall', <function recall>), ('F1', <function f1>)), n_folds: typing.Literal[5, 10] = 10, custom_fold_modifiers: typing.Optional[typing.List[cacp.dataset.ClassificationFoldDataModifierBase]] = None, dob_scv: bool = True, categorical_to_numerical=True, normalized: bool = False, progress=<function <lambda>>)[source]

Runs comparison for provided datasets and classifiers.

Parameters
  • datasets – dataset collection

  • classifiers – classifiers collection

  • result_dir – results directory

  • metrics – metrics collection

  • n_folds – number of folds {5,10}

  • custom_fold_modifiers – custom fold modifiers that can change fold data before usage

  • dob_scv – whether folds should be built with distribution optimally balanced stratified cross-validation (DOB-SCV)

  • categorical_to_numerical – whether categorical values in the dataset should be converted to numerical ones

  • normalized – whether the data should be normalized to the range [0, 1]

  • progress – function that can be used to monitor progress
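
The default metrics in the signature are (name, callable) pairs, and the metric functions in cacp.util take (y_true, y_pred, labels). The following is a minimal sketch of a custom entry in that shape, assuming custom metrics are called the same way as the defaults. The names balanced_accuracy and custom_metrics are illustrative and not part of cacp.

```python
from typing import Callable, List, Sequence, Tuple


def balanced_accuracy(y_true: Sequence, y_pred: Sequence, labels: Sequence) -> float:
    """Mean per-class recall over the classes that occur in y_true."""
    recalls = []
    for label in labels:
        # Positions where this class is the true label.
        positions = [i for i, value in enumerate(y_true) if value == label]
        if not positions:
            continue  # class absent from this fold, skip it
        hits = sum(1 for i in positions if y_pred[i] == label)
        recalls.append(hits / len(positions))
    return sum(recalls) / len(recalls) if recalls else 0.0


# Hypothetical metrics collection in the (name, callable) shape of the signature.
custom_metrics: List[Tuple[str, Callable]] = [
    ("Balanced accuracy", balanced_accuracy),
]
```

Such a collection would presumably be passed through the metrics parameter in place of the default tuple.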

cacp.comparison.process_comparison_single(classifier_factory, classifier_name, dataset: cacp.dataset.ClassificationDatasetBase, fold: cacp.dataset.ClassificationFoldData, metrics: Sequence[Tuple[str, Callable]]) dict[source]

Runs comparison on a single classifier and dataset.

Parameters
  • classifier_factory – classifier factory

  • classifier_name – classifier name

  • dataset – single dataset

  • fold – fold data

  • metrics – metrics collection

Returns

dictionary of calculated metrics and metadata

cacp.comparison.process_incremental_comparison(datasets: typing.List[typing.Union[cacp.dataset.ClassificationDatasetBase, river.datasets.base.Dataset]], classifiers: typing.List[typing.Tuple[str, typing.Callable]], result_dir: pathlib.Path, metrics: typing.Sequence[typing.Tuple[str, typing.Callable]] = (('AUC', <class 'river.metrics.roc_auc.ROCAUC'>), ('Accuracy', <class 'river.metrics.accuracy.Accuracy'>), ('Precision', <class 'river.metrics.precision.Precision'>), ('Recall', <class 'river.metrics.recall.Recall'>), ('F1', <class 'river.metrics.fbeta.F1'>)), progress=<function <lambda>>)[source]

Runs comparison for provided datasets and incremental classifiers.

Parameters
  • datasets – dataset collection

  • classifiers – classifiers collection

  • result_dir – results directory

  • metrics – metrics collection

  • progress – function that can be used to monitor progress
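
This reference does not describe how the incremental comparison iterates over a dataset. A common protocol for incremental classifiers is test-then-train (prequential) evaluation, and the sketch below illustrates that protocol only, assuming a classifier that exposes river-style predict_one and learn_one methods. MajorityClass and prequential_accuracy are hypothetical helpers, not cacp functions.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Optional, Tuple


class MajorityClass:
    """Toy incremental classifier: predicts the most frequent label seen so far."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def predict_one(self, x: dict) -> Optional[Hashable]:
        if not self.counts:
            return None  # nothing learned yet
        return self.counts.most_common(1)[0][0]

    def learn_one(self, x: dict, y: Hashable) -> None:
        self.counts[y] += 1


def prequential_accuracy(model, stream: Iterable[Tuple[dict, Hashable]]) -> List[float]:
    """Predict each sample first, then learn from it; return the running accuracy."""
    correct = 0
    scored = 0
    history: List[float] = []
    for x, y in stream:
        prediction = model.predict_one(x)
        if prediction is not None:
            scored += 1
            correct += int(prediction == y)
            history.append(correct / scored)
        model.learn_one(x, y)
    return history
```

The running accuracy list is the kind of per-sample series that an incremental plot could draw, although the exact data cacp records is not stated here.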

cacp.comparison.process_incremental_comparison_single(classifier_factory, classifier_name, dataset: typing.Union[cacp.dataset.ClassificationDatasetBase, river.datasets.base.Dataset], number_of_classes: int, incremental_comparison_dir: pathlib.Path, metrics: typing.Sequence[typing.Tuple[str, typing.Callable]] = (('AUC', <class 'river.metrics.roc_auc.ROCAUC'>), ('Accuracy', <class 'river.metrics.accuracy.Accuracy'>), ('Precision', <class 'river.metrics.precision.Precision'>), ('Recall', <class 'river.metrics.recall.Recall'>), ('F1', <class 'river.metrics.fbeta.F1'>))) dict[source]

Runs incremental comparison on a single classifier and dataset.

Parameters
  • classifier_factory – classifier factory

  • classifier_name – classifier name

  • dataset – single dataset

  • number_of_classes – number of classes

  • incremental_comparison_dir – incremental single results directory

  • metrics – metrics collection

Returns

dictionary of calculated metrics and metadata

cacp.dataset

class cacp.dataset.ClassificationDataset(name: Literal['abalone', 'appendicitis', 'australian', 'automobile', 'balance', 'banana', 'bands', 'breast', 'bupa', 'car', 'chess', 'cleveland', 'coil2000', 'contraceptive', 'crx', 'dermatology', 'ecoli', 'flare', 'german', 'glass', 'haberman', 'hayes-roth', 'heart', 'hepatitis', 'housevotes', 'ionosphere', 'iris', 'kr-vs-k', 'led7digit', 'letter', 'lymphography', 'magic', 'mammographic', 'marketing', 'monk-2', 'movement_libras', 'mushroom', 'newthyroid', 'nursery', 'optdigits', 'page-blocks', 'penbased', 'phoneme', 'pima', 'post-operative', 'ring', 'saheart', 'satimage', 'segment', 'shuttle', 'sonar', 'spambase', 'spectfheart', 'splice', 'tae', 'texture', 'thyroid', 'tic-tac-toe', 'titanic', 'twonorm', 'vehicle', 'vowel', 'wdbc', 'wine', 'winequality-red', 'winequality-white', 'wisconsin', 'yeast', 'zoo'], files_cache_path=PosixPath('/home/docs/cacp_files'), seed=1)[source]

Bases: cacp.dataset.ClassificationDatasetBase

Class that represents a single KEEL dataset.

property classes: int
property features: int
folds(n_folds: Literal[5, 10] = 10, dob_scv: bool = True, categorical_to_numerical=True) Iterator[cacp.dataset.ClassificationFoldData][source]
property instances: int
property name: str
property origin: str
property output_name: str
class cacp.dataset.ClassificationDatasetBase(seed=1)[source]

Bases: cacp.dataset.ClassificationDatasetMinimalBase

Base class that represents a single classification dataset.

abstract property classes: int
abstract property features: int
abstract property instances: int
abstract property name: str
class cacp.dataset.ClassificationDatasetDownloadProgressBar(*_, **__)[source]

Bases: tqdm.std.tqdm

update_to(b=1, bsize=1, t_size=None)[source]
class cacp.dataset.ClassificationDatasetMinimalBase(seed=1)[source]

Bases: abc.ABC

Minimal base class that represents a single classification dataset.

abstract folds(n_folds: Literal[5, 10] = 10, dob_scv: bool = True, categorical_to_numerical=True) Iterable[cacp.dataset.ClassificationFoldData][source]
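
The dob_scv flag of folds selects distribution optimally balanced stratified cross-validation. As that method is usually described, a random sample of a class is picked and then placed, together with its nearest unassigned neighbours of the same class, into different folds. The sketch below follows that description with plain Python lists. It is not cacp's implementation, and dob_scv_fold_indices is a hypothetical name.

```python
import math
import random
from typing import Dict, List, Sequence


def dob_scv_fold_indices(
    x: Sequence[Sequence[float]],
    y: Sequence[int],
    n_folds: int = 5,
    seed: int = 1,
) -> List[List[int]]:
    """Assign sample indices to folds following the DOB-SCV idea."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(n_folds)]

    # Group sample indices by class label.
    by_class: Dict[int, List[int]] = {}
    for index, label in enumerate(y):
        by_class.setdefault(label, []).append(index)

    for label in sorted(by_class):
        unassigned = list(by_class[label])
        while unassigned:
            # Pick a random sample, then its nearest same-class neighbours.
            start = unassigned.pop(rng.randrange(len(unassigned)))
            unassigned.sort(key=lambda i: math.dist(x[i], x[start]))
            group = [start] + unassigned[: n_folds - 1]
            unassigned = unassigned[n_folds - 1:]
            # Spread the group so each member lands in a different fold.
            for fold_id, sample in enumerate(group):
                folds[fold_id].append(sample)
    return folds
```

Each returned list holds the test indices of one fold; the training indices of that fold are all remaining samples.
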
class cacp.dataset.ClassificationFoldData(index: int, labels: numpy.ndarray, x_train: numpy.ndarray, y_train: numpy.ndarray, x_test: numpy.ndarray, y_test: numpy.ndarray)[source]

Bases: object

Class that represents a single dataset fold.

index: int
labels: numpy.ndarray
x_test: numpy.ndarray
x_train: numpy.ndarray
y_test: numpy.ndarray
y_train: numpy.ndarray
class cacp.dataset.ClassificationFoldDataModifierBase[source]

Bases: abc.ABC

abstract modify(fold: cacp.dataset.ClassificationFoldData) cacp.dataset.ClassificationFoldData[source]
class cacp.dataset.ClassificationFoldDataNormalizer[source]

Bases: cacp.dataset.ClassificationFoldDataModifierBase

modify(fold: cacp.dataset.ClassificationFoldData) cacp.dataset.ClassificationFoldData[source]
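
ClassificationFoldDataModifierBase defines a single modify(fold) hook that receives a fold and returns a fold. The sketch below shows what a min-max modifier might do. To run without dependencies it uses a stand-in Fold dataclass with plain lists rather than ClassificationFoldData with NumPy arrays. Fitting the ranges on the training split and reusing them for the test split is an assumption of this sketch, not a statement about ClassificationFoldDataNormalizer.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class Fold:
    """Stand-in for ClassificationFoldData (lists instead of numpy arrays)."""
    index: int
    labels: List[int]
    x_train: List[List[float]]
    y_train: List[int]
    x_test: List[List[float]]
    y_test: List[int]


class MinMaxModifier:
    """Hypothetical modifier following the modify(fold) -> fold contract."""

    def modify(self, fold: Fold) -> Fold:
        # Per-feature minimum and maximum, taken from the training split only.
        columns = list(zip(*fold.x_train))
        lows = [min(column) for column in columns]
        highs = [max(column) for column in columns]

        def scale(rows: List[List[float]]) -> List[List[float]]:
            return [
                [
                    # Constant columns map to 0.0 to avoid division by zero.
                    (value - low) / (high - low) if high > low else 0.0
                    for value, low, high in zip(row, lows, highs)
                ]
                for row in rows
            ]

        # Test values outside the training range fall outside [0, 1].
        return replace(fold, x_train=scale(fold.x_train), x_test=scale(fold.x_test))
```

A real modifier would subclass ClassificationFoldDataModifierBase and would presumably be passed through the custom_fold_modifiers parameter.
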
class cacp.dataset.LocalClassificationDataset(name: str, dataset_directory: pathlib.Path)[source]

Bases: cacp.dataset.ClassificationDataset

Class that represents a single local dataset with a structure similar to a KEEL dataset.

class cacp.dataset.LocalCsvClassificationDataset(name: str, dataset_path: pathlib.Path)[source]

Bases: cacp.dataset.ClassificationDatasetBase

Class that represents a single local dataset stored as a CSV file with a header.

property classes: int
property features: int
folds(n_folds: Literal[5, 10] = 10, dob_scv: bool = True, categorical_to_numerical=True) Iterable[cacp.dataset.ClassificationFoldData][source]
property instances: int
property name: str
cacp.dataset.all_datasets() List[cacp.dataset.ClassificationDataset][source]

Gets all available datasets.

Returns

all classification datasets

cacp.info

cacp.info.classifier_info(classifiers: Iterable[Tuple[str, Callable]], result_dir: pathlib.Path)[source]

Produces result files listing all classifiers used in the experiment along with their attributes.

Parameters
  • classifiers – classifiers collection

  • result_dir – results directory

cacp.info.dataset_info(datasets: Iterable[Union[cacp.dataset.ClassificationDatasetBase, river.datasets.base.Dataset]], result_dir: pathlib.Path)[source]

Produces result files listing all datasets used in the experiment along with their attributes.

Parameters
  • datasets – dataset collection

  • result_dir – results directory

cacp.plot

class cacp.plot.Line(x: numpy.ndarray, y: numpy.ndarray, label: str = '')[source]

Bases: object

label: str = ''
x: numpy.ndarray
y: numpy.ndarray
cacp.plot.process_comparison_results_incremental_plot(file_name: str, y_label: str, lines: List[cacp.plot.Line], plot_dir: pathlib.Path)[source]
cacp.plot.process_comparison_results_incremental_plots(result_dir: pathlib.Path, metrics: typing.Sequence[typing.Tuple[str, typing.Callable]] = (('AUC', <class 'river.metrics.roc_auc.ROCAUC'>), ('Accuracy', <class 'river.metrics.accuracy.Accuracy'>), ('Precision', <class 'river.metrics.precision.Precision'>), ('Recall', <class 'river.metrics.recall.Recall'>), ('F1', <class 'river.metrics.fbeta.F1'>)))[source]

Generates plots from incremental comparison results.

Parameters
  • result_dir – results directory

  • metrics – metrics collection

cacp.plot.process_comparison_results_plots(result_dir: pathlib.Path, metrics: typing.Sequence[typing.Tuple[str, typing.Callable]] = (('AUC', <function auc>), ('Accuracy', <function accuracy>), ('Precision', <function precision>), ('Recall', <function recall>), ('F1', <function f1>)))[source]

Generates plots from comparison results.

Parameters
  • result_dir – results directory

  • metrics – metrics collection

cacp.plot.process_comparison_results_single_incremental_plot(classifier_name: str, dataset_name: str, metric: str, df: pandas.core.frame.DataFrame, incremental_plot_dir: pathlib.Path)[source]

Generates plots from single incremental comparison results.

Parameters
  • classifier_name – classifier name

  • dataset_name – dataset name

  • metric – metric name

  • df – result dataframe

  • incremental_plot_dir – output plot directory

cacp.result

cacp.result.process_comparison_results(result_dir: pathlib.Path, metrics: typing.Sequence[typing.Tuple[str, typing.Callable]] = (('AUC', <function auc>), ('Accuracy', <function accuracy>), ('Precision', <function precision>), ('Recall', <function recall>), ('F1', <function f1>)))[source]

Processes comparison results, computes mean values for all metrics.

Parameters
  • result_dir – results directory

  • metrics – metrics collection
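
The description above says that mean values are computed for all metrics. The sketch below shows one way such an aggregation can look: per-fold rows are grouped by classifier and each metric is averaged. The row layout (a "classifier" key plus one key per metric name) is hypothetical, since this reference does not specify the format of cacp's result files.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Mapping


def mean_metrics(rows: Iterable[Mapping], metric_names: List[str]) -> Dict[str, Dict[str, float]]:
    """Average each metric over all rows that belong to the same classifier."""
    grouped: Dict[str, List[Mapping]] = defaultdict(list)
    for row in rows:
        grouped[row["classifier"]].append(row)
    return {
        classifier: {name: mean(r[name] for r in group) for name in metric_names}
        for classifier, group in grouped.items()
    }
```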

cacp.run

cacp.run.run_experiment(datasets: typing.List[cacp.dataset.ClassificationDatasetBase], classifiers: typing.List[typing.Tuple[str, typing.Callable]], results_directory: typing.Union[str, os.PathLike] = './result', metrics: typing.Sequence[typing.Tuple[str, typing.Callable]] = (('AUC', <function auc>), ('Accuracy', <function accuracy>), ('Precision', <function precision>), ('Recall', <function recall>), ('F1', <function f1>)), n_folds: typing.Literal[5, 10] = 10, custom_fold_modifiers: typing.Optional[typing.List[cacp.dataset.ClassificationFoldDataModifierBase]] = None, dob_scv: bool = True, categorical_to_numerical=True, normalized: bool = False, seed: int = 1, progress=<function <lambda>>)[source]

[Main CACP Function] Runs an automatic performance comparison of supervised classification algorithms by evaluating metrics on multiple datasets.

Parameters
  • datasets – dataset collection

  • classifiers – classifiers collection

  • results_directory – results directory

  • metrics – metrics collection

  • n_folds – number of folds {5,10}

  • custom_fold_modifiers – custom fold modifiers that can change fold data before usage

  • dob_scv – whether folds should be built with distribution optimally balanced stratified cross-validation (DOB-SCV)

  • categorical_to_numerical – whether categorical values in the dataset should be converted to numerical ones

  • normalized – whether the data should be normalized to the range [0, 1]

  • seed – random seed value

  • progress – function that can be used to monitor progress
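
The classifiers parameter is a list of (name, callable) pairs. The sketch below shows that shape and a call that uses only keyword arguments from the signature above. Two points are assumptions: that the callable is a factory receiving the number of inputs and the number of classes, and that a plain object with fit and predict methods is accepted. cacp may expect scikit-learn compatible estimators, and the toy MajorityClassifier only keeps the sketch free of dependencies. The cacp imports are deferred into run_sketch so the definitions load without the package installed.

```python
from typing import Callable, List, Sequence, Tuple


class MajorityClassifier:
    """Toy classifier with fit/predict methods, used only to show the factory shape."""

    def fit(self, x: Sequence, y: Sequence) -> "MajorityClassifier":
        values = list(y)
        # Remember the most frequent training label.
        self.majority_ = max(set(values), key=values.count)
        return self

    def predict(self, x: Sequence) -> list:
        return [self.majority_ for _ in x]


# (name, factory) pairs; the factory arguments (n_inputs, n_classes) are an assumption.
classifiers: List[Tuple[str, Callable]] = [
    ("Majority", lambda n_inputs, n_classes: MajorityClassifier()),
]


def run_sketch(results_directory: str = "./result") -> None:
    """Call run_experiment with documented keyword arguments (requires cacp)."""
    from cacp.dataset import ClassificationDataset
    from cacp.run import run_experiment

    run_experiment(
        [ClassificationDataset("iris"), ClassificationDataset("wine")],
        classifiers,
        results_directory=results_directory,
        n_folds=5,
        dob_scv=True,
        normalized=True,
        seed=1,
    )
```

Calling run_sketch() would start the experiment on the two named KEEL datasets and write its output under results_directory.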

cacp.run.run_incremental_experiment(datasets: typing.List[typing.Union[cacp.dataset.ClassificationDatasetBase, river.datasets.base.Dataset]], classifiers: typing.List[typing.Tuple[str, typing.Callable]], results_directory: typing.Union[str, os.PathLike] = './result', metrics: typing.Sequence[typing.Tuple[str, typing.Callable]] = (('AUC', <class 'river.metrics.roc_auc.ROCAUC'>), ('Accuracy', <class 'river.metrics.accuracy.Accuracy'>), ('Precision', <class 'river.metrics.precision.Precision'>), ('Recall', <class 'river.metrics.recall.Recall'>), ('F1', <class 'river.metrics.fbeta.F1'>)), seed: int = 1, progress=<function <lambda>>)[source]

[Main CACP Function] Runs an automatic performance comparison of incremental supervised classification algorithms by evaluating metrics on multiple datasets.

Parameters
  • datasets – dataset collection

  • classifiers – classifiers collection

  • results_directory – results directory

  • metrics – metrics collection

  • seed – random seed value

  • progress – function that can be used to monitor progress

cacp.time

cacp.time.process_times(result_dir: pathlib.Path)[source]

Processes the timing data from the comparison results.

Parameters

result_dir – results directory

cacp.util

cacp.util.accuracy(y_true, y_pred, labels)[source]
cacp.util.auc(y_true, y_pred, labels)[source]
cacp.util.auc_score(y_true: numpy.ndarray, y_pred: numpy.ndarray, average=None, multi_class=None, labels: Optional[numpy.ndarray] = None) float[source]

Calculates multiclass AUC score.

Parameters
  • y_true – real labels

  • y_pred – predicted labels

  • average – sklearn roc_auc_score param

  • multi_class – sklearn roc_auc_score param

  • labels – sklearn roc_auc_score param

Returns

AUC value

cacp.util.f1(y_true, y_pred, labels)[source]
cacp.util.matthews_corrcoef(y_true, y_pred, labels)[source]
cacp.util.precision(y_true, y_pred, labels)[source]
cacp.util.recall(y_true, y_pred, labels)[source]
cacp.util.seed_everything(seed=1)[source]

Sets the seed for Python's random module and NumPy's random generator.

Parameters

seed – random seed
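
A minimal sketch of what seeding both generators involves, assuming nothing beyond the description above. seed_all is a hypothetical name, and NumPy is treated as optional here so the sketch also runs where it is not installed.

```python
import random


def seed_all(seed: int = 1) -> None:
    """Seed Python's random module and, if available, NumPy's global generator."""
    random.seed(seed)
    try:
        import numpy as np
    except ImportError:
        return  # NumPy is not installed, only the stdlib generator is seeded
    np.random.seed(seed)
```

Seeding with the same value before two runs makes their random draws identical, which is what makes an experiment repeatable.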

cacp.util.to_latex(df: pandas.core.frame.DataFrame, **kwargs) str[source]

Converts a pandas DataFrame to a LaTeX table string.

Parameters
  • df – pandas DataFrame with the data to be converted

  • kwargs – additional arguments passed to pandas df.to_latex

Returns

LaTeX string

cacp.wilcoxon

cacp.wilcoxon.bold_large_p_value(data: float, format_string='%.4f') str[source]

Makes a large p-value bold in a LaTeX table.

Parameters
  • data – value

  • format_string – format string applied to the value

Returns

bolded values string
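
The threshold above which a p-value counts as large is not stated in this reference. The sketch below illustrates the formatting idea with a hypothetical threshold parameter set to 0.05, and bold_if_large is an illustrative name rather than the cacp function.

```python
def bold_if_large(p_value: float, threshold: float = 0.05, format_string: str = "%.4f") -> str:
    """Format a p-value and wrap it in \\textbf{} when it exceeds the threshold."""
    text = format_string % p_value
    # threshold is an assumed cut-off, not a documented cacp value.
    return r"\textbf{%s}" % text if p_value > threshold else text
```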

cacp.wilcoxon.process_wilcoxon(classifiers: typing.List[typing.Tuple[str, typing.Callable]], result_dir: pathlib.Path, metrics: typing.Sequence[typing.Tuple[str, typing.Callable]] = (('AUC', <function auc>), ('Accuracy', <function accuracy>), ('Precision', <function precision>), ('Recall', <function recall>), ('F1', <function f1>)))[source]

Calculates the Wilcoxon signed-rank test for comparison results.

Parameters
  • classifiers – classifiers collection

  • result_dir – results directory

  • metrics – metrics collection
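
For orientation, the Wilcoxon signed-rank test compares two paired score series, such as two classifiers evaluated on the same datasets. The sketch below computes the rank sums in plain Python and a two-sided p-value from the normal approximation, without tie or continuity correction. It is a simplified illustration and not the routine cacp calls, and the approximation is rough for small samples.

```python
import math
from typing import List, Sequence, Tuple


def wilcoxon_signed_rank(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float, float]:
    """Return (W+, W-, two-sided p-value from the normal approximation)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0

    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks: List[float] = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1

    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(rank for rank, d in zip(ranks, diffs) if d < 0)

    # Normal approximation without tie or continuity correction.
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, w_minus, p_value
```

A large p-value means the paired scores give no evidence of a difference between the two classifiers.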

cacp.wilcoxon.process_wilcoxon_for_metric(current_algorithm: str, metric: str, result_dir: pathlib.Path) pandas.core.frame.DataFrame[source]

Calculates the Wilcoxon signed-rank test for a single metric of the comparison results.

Parameters
  • current_algorithm – current algorithm

  • metric – comparison metric {auc, accuracy, precision, recall, f1}

  • result_dir – results directory

Returns

DataFrame with Wilcoxon values for the metric

cacp.winner

cacp.winner.process_comparison_result_winners(result_dir: pathlib.Path, metrics: typing.Sequence[typing.Tuple[str, typing.Callable]] = (('AUC', <function auc>), ('Accuracy', <function accuracy>), ('Precision', <function precision>), ('Recall', <function recall>), ('F1', <function f1>)))[source]

Processes comparison results, finds winners.

Parameters
  • result_dir – results directory

  • metrics – metrics collection
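
How a winner is determined is not spelled out in this reference. The sketch below assumes the simplest reading: for one metric, the classifier with the highest score on a dataset wins that dataset, and tied classifiers each receive a win. count_wins and the nested dictionary layout are hypothetical.

```python
from collections import Counter
from typing import Dict, Mapping


def count_wins(scores: Mapping[str, Mapping[str, float]]) -> Dict[str, int]:
    """Count, per classifier, the datasets on which it has the top score."""
    wins: Counter = Counter()
    for dataset, by_classifier in scores.items():
        best = max(by_classifier.values())
        # Every classifier that reaches the best score is credited (ties share the win).
        for classifier, value in by_classifier.items():
            if value == best:
                wins[classifier] += 1
    return dict(wins)
```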

cacp.winner.process_comparison_result_winners_for_metric(metric: str, result_dir: pathlib.Path) pandas.core.frame.DataFrame[source]

Processes comparison results, finds winners for metric.

Parameters
  • metric – comparison metric {auc, accuracy, precision, recall, f1}

  • result_dir – results directory

Returns

DataFrame with winners for the metric