Airline Survey Analysis

Leveraging Latent Space Clustering on High Dimensional and Sparse Survey Data

Pre & Post Covid Analysis

Survey responses were gathered both before and after the Covid-19 lockdown in 2020, which makes it possible to explore how customer segments shifted over this period. As with the previous analysis, applying unsupervised learning to these subsets is challenging because the data are high-dimensional and the sample sizes are relatively small. To address this, I use the same autoencoder latent space clustering framework as before, which keeps the analysis consistent across both time periods.

Training & Elbow

The training histories and inertia plots for the Pre and Post Covid data splits are shown below, generated using the same methods as the combined dataset analysis.

Pre-Covid

Post-Covid
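
For readers who want to reproduce an inertia (elbow) plot, the sketch below shows one way to compute the curve. It assumes the clustering step is k-means from scikit-learn, which the term "inertia" suggests but which is not confirmed here. The `latent` array is synthetic stand-in data rather than the actual encoder output, and `inertia_curve` is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans

def inertia_curve(latent, k_values, seed=0):
    """Fit k-means for each k and return the inertia (within-cluster sum of squares)."""
    inertias = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        model.fit(latent)
        inertias.append(float(model.inertia_))
    return inertias

# Synthetic 2-D "latent space" with four well-separated groups
# (an assumption for illustration, not the survey data).
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 5], [0, 5], [5, 0]], dtype=float)
latent = np.vstack([c + 0.3 * rng.standard_normal((50, 2)) for c in centers])

k_values = list(range(1, 9))
inertias = inertia_curve(latent, k_values)

for k, value in zip(k_values, inertias):
    print(f"k={k}: inertia={value:.1f}")
```

Plotting `inertias` against `k_values` gives the elbow curve. The point where the curve flattens is the usual candidate for the number of clusters.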

Latent Space Cluster Visualizations

The scatter plots below compare the latent space clustering results for the Pre and Post Covid splits of the data. Each point represents one individual's full set of survey responses. There were no repeated respondents, so the people in the Pre-Covid plot are all different from the people in the Post-Covid plot. Even so, the autoencoder picked up similar patterns among respondents and produced similarly shaped clusters in both outputs.

One difference worth noting is that the Post-Covid clusters are denser, with more distinct groupings, while the Pre-Covid clusters are visually more spread out. The higher Post-Covid silhouette score (0.86 versus 0.76) supports this observation.

Pre-Covid
Pre-Covid cluster visualization
Silhouette Score: 0.76
Post-Covid
Silhouette Score: 0.86
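
As a rough illustration of how silhouette scores like these can be computed, the sketch below clusters two synthetic point clouds and scores each one. It assumes k-means with four clusters and uses scikit-learn's `silhouette_score`. The data, the `make_latent` and `cluster_and_score` helpers, and the spread values are hypothetical and are not the survey's latent vectors.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
centers = np.array([[0, 0], [6, 6], [0, 6], [6, 0]], dtype=float)

def make_latent(spread):
    """Synthetic 2-D points around four centers; larger spread means looser clusters."""
    return np.vstack([c + spread * rng.standard_normal((60, 2)) for c in centers])

def cluster_and_score(latent, k=4, seed=0):
    """Cluster with k-means and return the labels and the mean silhouette score."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(latent)
    return labels, silhouette_score(latent, labels)

_, loose_score = cluster_and_score(make_latent(1.0))   # more spread out
_, tight_score = cluster_and_score(make_latent(0.4))   # denser groupings

print(f"loose clusters: {loose_score:.2f}")
print(f"tight clusters: {tight_score:.2f}")
```

The silhouette score ranges from -1 to 1, and tighter, better-separated clusters score closer to 1, which is the pattern described for the Post-Covid split.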

Airline Frequency Heatmap

Below are the frequency heatmaps for each of the commercial airlines broken out by cluster. Lower numbers indicate that respondents in that cluster fly with a given airline more frequently.

Note: Spirit and Hawaiian are not included in the Pre-Covid heatmap because no respondents in that split of the dataset answered either of those survey questions.
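
A minimal sketch of how the data behind such a heatmap might be prepared with pandas is shown below. The column names (`freq_delta`, `freq_united`, `freq_spirit`) and the values are hypothetical. The all-missing `freq_spirit` column mimics an airline that nobody in the split answered about, which is dropped before plotting.

```python
import numpy as np
import pandas as pd

# Hypothetical responses: lower values mean the respondent flies the airline more often.
responses = pd.DataFrame({
    "cluster": [0, 0, 1, 1],
    "freq_delta": [1, 2, 4, 5],
    "freq_united": [2, np.nan, 3, 5],
    "freq_spirit": [np.nan, np.nan, np.nan, np.nan],  # no answers in this split
})

# Average frequency per cluster; missing answers are skipped by default.
cluster_means = responses.groupby("cluster").mean()

# Drop airlines with no answers at all so they do not appear as empty columns.
heatmap_data = cluster_means.dropna(axis=1, how="all")

print(heatmap_data)
```

The resulting table can be passed to any heatmap function, for example `seaborn.heatmap(heatmap_data, annot=True)`.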

Differentiating Variables Before/After

To understand which variables differed between the clusters, I used the heatmaps shown below. I first calculated the average value of each survey question within each cluster. Then, for each variable, I took the difference between the highest and lowest cluster averages and filtered the variable out if that difference fell below a specified threshold (the thresholds are listed in the plot titles). In other words, only variables with large differences between clusters were kept, to show which survey responses differentiated the four groups.
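
The filtering step described above can be sketched in pandas as follows. This is an illustration under assumed names: `differentiating_variables`, the column names, the data, and the threshold of 1.0 are all hypothetical, not the values used for the plots.

```python
import pandas as pd

def differentiating_variables(df, cluster_col, threshold):
    """Return per-cluster means for variables whose max-min spread meets the threshold."""
    means = df.groupby(cluster_col).mean()   # one row per cluster, one column per variable
    spread = means.max() - means.min()       # highest minus lowest cluster average
    keep = spread[spread >= threshold].index
    return means[keep]

# Hypothetical survey answers on a 1-5 scale for four clusters.
responses = pd.DataFrame({
    "cluster":         [0, 0, 1, 1, 2, 2, 3, 3],
    "satisfaction":    [5, 5, 1, 1, 3, 3, 4, 4],
    "wifi_importance": [3, 3, 3, 4, 3, 3, 3, 3],
})

result = differentiating_variables(responses, "cluster", threshold=1.0)
print(result)
```

Here `satisfaction` is kept because its cluster averages range from 1 to 5, while `wifi_importance` is dropped because its averages differ by only 0.5.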

Nominal Variables

Nominal variables were encoded as dummy variables on a binary (0/1) scale, so they would not have been included in the previous heatmaps. Following the same setup, the heatmaps below show the same comparisons using only these variables.
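
As a small illustration of dummy encoding, the sketch below uses `pandas.get_dummies` on a hypothetical nominal question (`travel_purpose`) and then averages the dummies within each cluster. The column name and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical nominal survey question with cluster assignments.
responses = pd.DataFrame({
    "cluster": [0, 0, 1, 1],
    "travel_purpose": ["business", "business", "leisure", "business"],
})

# One 0/1 column per category.
dummies = pd.get_dummies(responses["travel_purpose"], prefix="travel_purpose", dtype=int)

# The mean of a 0/1 column within a cluster is the share of that cluster
# that chose the category.
shares = dummies.groupby(responses["cluster"]).mean()

print(shares)
```

Because each cluster average is a proportion between 0 and 1, differences between clusters are on a different scale from the ordinal questions, which is one reason to plot these variables separately.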

Pre-Covid Cluster Profiles

Cluster 0: Loyal and Frequent Flyers

Key Characteristics

Actionable Insights

Cluster 1: Occasional, Price-Conscious Travelers

Key Characteristics

Actionable Insights

Cluster 2: Critical, Experience-Focused Flyers

Key Characteristics

Actionable Insights

Cluster 3: Balanced, Nuanced Travelers

Key Characteristics

Actionable Insights

Post-Covid Cluster Profiles

Cluster 0: Loyal, Frequent Flyers

Key Characteristics

Actionable Insights

Cluster 1: Occasional, Price-Conscious Travelers

Key Characteristics

Cluster 2: Critical, Experience-Focused Flyers

Key Characteristics

Cluster 3: Balanced, Nuanced Travelers

Key Characteristics

Actionable Insights