Mutual TLS Analysis

[Artifact] Code for analyzing information types in mTLS certificates

We provide the code used for the privacy analysis presented in Section 6 of the paper. The code counts how often each information type (e.g., domain name, IP address, MAC address, personal name) appears in the CN and SAN fields of certificates, as reported in Table 8.
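
To illustrate what classifying a CN or SAN value into an information type can look like, here is a minimal sketch for the structured types only (Domain, Email, IP, MAC, SIP, Localhost). The function name `classify_structured` and the regular expressions are assumptions made for illustration; they are not the rules used in the paper, which may be stricter or cover more cases.

```python
import ipaddress
import re

# Hypothetical patterns for illustration; the paper's actual rules may differ.
MAC_RE = re.compile(r"^(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
DOMAIN_RE = re.compile(
    r"^(?:\*\.)?(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,63}$"
)


def classify_structured(value: str) -> str:
    """Return a coarse label for a CN/SAN string, or 'Unknown'."""
    text = value.strip()
    if not text:
        return "Empty"
    if text.lower() in {"localhost", "localhost.localdomain"}:
        return "Localhost"
    if text.lower().startswith(("sip:", "sips:")):
        return "SIP"
    try:
        # Accepts both IPv4 and IPv6 literals; raises ValueError otherwise.
        ipaddress.ip_address(text)
        return "IP"
    except ValueError:
        pass
    if MAC_RE.match(text):
        return "MAC"
    if EMAIL_RE.match(text):
        return "Email"
    if DOMAIN_RE.match(text):
        return "Domain"
    # Free-text values (names, products, ...) need the NER and lookup steps.
    return "Unknown"
```

The checks run from most to least specific, so a SIP URI is not mistaken for an email address and an IP literal is not mistaken for a domain name.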

Classifying types of information in CN and SAN

Prerequisites

  • CCADB: We used the Common CA Database (CCADB) to determine whether a Common Name (CN) or Subject Alternative Name (SAN) value is the name of a certificate authority.
    • All Certificate Information (root and intermediate) in CCADB - CSV
  • spaCy pretrained pipeline: We employed spaCy’s pre-trained model for Named Entity Recognition (NER), which categorizes text into labels such as ‘PERSON’, ‘ORG’, ‘PRODUCT’, and ‘DATE’.
  • Company name datasets: We utilized publicly available company name lists from Kaggle to further classify entries that were not recognized by the spaCy NER model.
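
One plausible way to combine these three resources for free-text values is a cascade: check the CA names first, then the NER label, then fall back to the company lists. The sketch below is an assumption about how such a cascade could look, not the code used in the paper. The names `classify_name`, `normalize`, and `ner_label`, the `cutoff` value, and the use of `difflib` for approximate matching are all hypothetical; `ner_label` stands in for a call into the spaCy pipeline.

```python
import difflib
from typing import Callable, Iterable, Optional


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so lookups are case-insensitive."""
    return " ".join(name.lower().split())


def classify_name(
    value: str,
    ca_names: Iterable[str],
    ner_label: Callable[[str], Optional[str]],
    company_names: Iterable[str],
    cutoff: float = 0.9,  # hypothetical similarity threshold
) -> str:
    """Label a free-text CN/SAN value: CA list, then NER, then company list."""
    key = normalize(value)
    # Rebuilt on every call for brevity; precompute this set for large lists.
    if key in {normalize(n) for n in ca_names}:
        return "CA"
    # ner_label is a stand-in; with spaCy it could wrap nlp(text).ents
    # and return the label_ of the first entity, or None if there is none.
    label = ner_label(value)
    if label in {"PERSON", "ORG", "PRODUCT"}:
        return label
    # Fallback: approximate match against the company name list.
    companies = [normalize(n) for n in company_names]
    if difflib.get_close_matches(key, companies, n=1, cutoff=cutoff):
        return "Company"
    return "Unknown"
```

Passing the NER step in as a callable keeps the sketch independent of any particular model and makes it easy to try with a stub.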

Code

You can download the code from our GitHub repository - Link

*Please note that our datasets were collected from within a campus network and contain sensitive user information. Due to these privacy concerns, we are unable to provide the exact code used in the paper; some parameters and functions have been modified accordingly.

How to run?

  1. Download the prerequisites and the code (analysis_code.py).

  2. Set the paths for the prerequisite files:

    SPACY_NER_PATH = "/path/to/en_core_web_trf-3.7.3/"
    CCADB_PATH = "/path/to/AllCertificateRecordsReport.csv"
    COMPANY_DATA1_PATH = "/path/to/companies-2023-q4-sm.csv"
    COMPANY_DATA2_PATH = "/path/to/companies_sorted.csv"
    
  3. Set the paths for intermediate files (used to save some intermediate results and reduce future running time):

    COMPANY_SIM_SEED_PATH = "/path/to/company_sim_seed.csv" #Can be any name, but in CSV format
    NER_SEED_PATH = "/path/to/ner_seed.csv" #Can be any name, but in CSV format
    
  4. Load the dataset as a pandas DataFrame:

    import pandas as pd

    def load_certificate_data():
        path = "/path/to/certificate_dataset"  # raw data is not provided due to privacy concerns
        # Assumes a CSV file; use the pandas reader that matches your own data format.
        df = pd.read_csv(path)
        return df
    
  5. Run the script.

    The script prints the count of each information type, for example:

    - [CN] Public CA
    Domain, 3153
    Email, 2
    IP, 1
    MAC, 0
    SIP, 0
    Localhost, 1
    PERSON, 133
    Campus_person, 0  # Personal names included in certificates issued by the Campus CA
    Gov_person, 0   # Personal names associated with government-related domains' certificates
    Campus_ID, 0   # User accounts (IDs) used within the campus
    CA, 1       # Certificate authority names
    ORG, 40     # Organization names classified by spaCy's model
    Company, 16
    PRODUCT, 5603   
    Unknown, 13397
    Empty, 1
    Total: 22461
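
The intermediate files from step 3 and the report from step 5 can be pictured with the following minimal sketch. It assumes the intermediate CSV files hold two columns, `value` and `label`, and that the report lists labels in a fixed order; the function names (`load_seed`, `save_seed`, `count_types`, `format_report`) and the column names are hypothetical and may not match analysis_code.py.

```python
import csv
import os
from collections import Counter
from typing import Callable, Dict, Iterable


def load_seed(path: str) -> Dict[str, str]:
    """Read previously saved value -> label pairs; empty if no file yet."""
    if not os.path.exists(path):
        return {}
    with open(path, newline="", encoding="utf-8") as f:
        return {row["value"]: row["label"] for row in csv.DictReader(f)}


def save_seed(path: str, seed: Dict[str, str]) -> None:
    """Write the value -> label pairs as a two-column CSV."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["value", "label"])
        writer.writerows(seed.items())


def count_types(
    values: Iterable[str],
    seed: Dict[str, str],
    classify: Callable[[str], str],
) -> Counter:
    """Label each value, reusing cached labels, and count the labels."""
    counts: Counter = Counter()
    for value in values:
        if value not in seed:
            # Only values not seen before reach the (possibly slow) classifier.
            seed[value] = classify(value)
        counts[seed[value]] += 1
    return counts


def format_report(title: str, counts: Counter, order: Iterable[str]) -> str:
    """Render counts as 'Label, N' lines followed by the overall total."""
    lines = [f"- {title}"]
    lines += [f"{label}, {counts.get(label, 0)}" for label in order]
    # The total covers every counted label, including any not listed in order.
    lines.append(f"Total: {sum(counts.values())}")
    return "\n".join(lines)
```

Caching labels by value means a slow step such as NER runs once per distinct string, which is where most of the saved running time would come from on later runs.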
    

Dataset

Unfortunately, we are unable to provide the raw certificate data used in the analysis in Section 6 due to its sensitive nature.

*For the same reason, we are also unable to provide the artifacts used in the analyses in Sections 1-5.