Current popular phishing prevention techniques mainly utilize reactive
blocklists, which leave a “window of opportunity” for attackers during which
victims are unprotected. One possible approach to shorten this window aims to
detect phishing attacks earlier, during website preparation, by monitoring
Certificate Transparency (CT) logs. Previous attempts to work with CT log data
for phishing classification exist; however, they lack evaluations on actual CT
log data. In this paper, we present a pipeline that facilitates such
evaluations by addressing a number of problems that arise when working with CT log data.
The pipeline includes dataset creation, training, and past or live
classification of CT logs. Its modular structure makes it possible to easily
exchange classifiers or verification sources to support ground truth labeling
efforts and classifier comparisons. We test the pipeline on a number of new and
existing classifiers and find that there is general potential to improve classifiers for
this scenario in the future. We publish the source code of the pipeline and the
used datasets along with this paper, thus making future research in
this direction more accessible.

