An Exploratory Data Science Project by Ben Wetherfield
In trying to determine their best-fit colleges, high school juniors face a plethora of available information, forum discussions and hearsay. To break through the noise, parents and guardians often turn to education consultants and counselors to help their children find the best school that will have them and at which they will succeed. Students must navigate concrete gateways to entry, like testing requirements and availability of financial aid, while juggling the intangibles of "feel" and "prestige". Especially for international students with a limited frame of reference for universities, it is hard even to conjure up a shortlist of schools, let alone visit campuses in search of "gut feelings" and "instincts". The global coronavirus pandemic has only raised the barriers to campus access for international students and, indeed, even for students on American soil.
Counselors and consultants offer a valuable source of insight through their knowledge and anecdotal experience. There are still limits, however, to the lived experience of different campuses that these professionals can have, and hence to the certainty with which they can recommend particular campuses for particular students. Inevitably, there are campuses they know extremely well, and sometimes their recommendations are biased towards them. Students, meanwhile, can have a very strong feeling for one campus that they happened to visit, but may miss out on a better-fit school that they were not able to reach on a college tour.
For professionals and clients alike, this project aims to provide a better means of navigating clusters of "similar" schools. In particular, based on the surroundings of different colleges and their class size, we aim to characterize the "feel" of different schools and group them accordingly. By grouping similar-feeling universities and presenting them alongside college rank information (a proxy for prestige), we provide a useful tool for cutting through the noise and developing a better college shortlist for students around the world.
We use three sources of data. Our initial port of call is a popular college ranking website (niche.com), from which we extract a list of around 80 top universities. Alongside the list of schools, we collect information on university size, grouped into three buckets: small (<5000), medium (5000-15000) and large (15000+). We gather this with some simple web scraping and HTML parsing of downloaded pages (as we discovered that the website had web-scraper blocking functionality). All the work in this project is done for educational purposes, and the scraping is at a small enough scale that the same information could have been gathered by hand quite simply.
Finally, we use the Foursquare Places API to characterize the compositions of the college towns of the university list compiled in the first and second parts. The Foursquare Places API returns venues surrounding a point on the globe (characterized by latitude and longitude) together with categories for each venue (such as "Coffee Shop", "Park" etc.). We use these categories to create groupings and measure similarities between the areas surrounding different schools in our list. Longitudes and latitudes are gathered for each university using geocoding tools - namely the geocoder API for the Nominatim open source geocoding platform (which uses OpenStreetMap data).
Since Nominatim returns more than one longitude/latitude pair for each university, we collect the full set, handle outliers and take a median to create an approximation for the undergraduate living environment. Universities can sometimes sprawl, with various labs and research outcrops set apart from the central campus. A median is more suitable than a mean for mitigating the effects of these more spread-out outposts, since undergraduates more commonly live closer to the central mass of buildings. Outliers that fall outside of the United States entirely are easy to eliminate as a preprocessing step.
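The coordinate clean-up described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the bounding box, the function name campus_center and the sample coordinates are all assumptions made for the example.

```python
from statistics import median

# Rough bounding box for the contiguous United States (an assumption for this
# sketch; Alaska and Hawaii would need separate handling).
US_LAT = (24.0, 50.0)
US_LON = (-125.0, -66.0)


def campus_center(candidates):
    """Return the (lat, lon) median of the candidates that fall inside the box."""
    in_us = [
        (lat, lon)
        for lat, lon in candidates
        if US_LAT[0] <= lat <= US_LAT[1] and US_LON[0] <= lon <= US_LON[1]
    ]
    if not in_us:
        raise ValueError("no candidate coordinates inside the United States")
    # Medians are taken separately for latitude and longitude.
    return (
        median(lat for lat, _ in in_us),
        median(lon for _, lon in in_us),
    )


# Hypothetical geocoder output: three points near a main campus, one remote
# research site, and one spurious match outside the United States.
candidates = [
    (40.8075, -73.9626),
    (40.8080, -73.9620),
    (40.8070, -73.9630),
    (41.0040, -73.9080),
    (51.5000, -0.1200),
]
center = campus_center(candidates)
```

In this toy input the remote research site would pull a mean latitude noticeably north, while the median stays within the cluster of main-campus points.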
Our overall approach will be to run a k-means clustering algorithm. This algorithm assigns each datapoint to one of k clusters (k being an integer chosen before running the algorithm), based on how close the datapoints are to each other, such that two points in the same cluster tend to have similar values for most attributes. As such, cluster membership is a good measure of similarity between two entries in a dataset.
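As a rough illustration of the idea, the sketch below runs scikit-learn's KMeans on a tiny made-up dataset. The two columns, the values and the choice of two clusters are assumptions for the example, not the project's data or settings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: each row is a university, each column a venue-category count.
X = np.array([
    [10.0, 1.0], [11.0, 0.0], [9.0, 1.0],   # many coffee shops, few parks
    [1.0, 8.0], [0.0, 9.0], [2.0, 10.0],    # few coffee shops, many parks
])

# The number of clusters is fixed before fitting; random_state makes the
# run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
```

Rows that end up with the same label are the ones the algorithm considers similar; the label numbers themselves carry no meaning.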
For us, a datapoint will record the categories of venues that are found around a given university. There are some trivial differences between university vicinities that we want to avoid measuring. For example, if one university is reported by Foursquare to have many "cafés" around it, and another to neighbor many "coffee shops" we would want these to be measured as close rather than distant. The fix needed is to make appropriate groupings of venue categories.
The final list of grouped category definitions used is:
'Arts & Crafts Store', 'Clothing Store', 'Gift Shop', 'Park', 'Salon / Barbershop', 'Smoke Shop', 'Student Center', 'Tea Room', 'MUSEUM', 'BOOKSTORE', 'COMMON_RESTAURANT', 'NOVEL_RESTAURANT', 'RESTAURANT', 'GYM', 'THEATER', 'STADIUM', 'VENUE', 'LIQUOR', 'PUB', 'FANCY_BAR', 'TRAIL', 'HISTORIC', 'SQUARE', 'BEAUTY', 'ART', 'FANCY_FOOD', 'JUICE', 'EASY_FOOD', 'GROCERY', 'SWEETS', 'CONVENIENCE', 'BREAKFAST', 'COFFEE', 'SELF_CARE'
with venue categories grouped into these categories or dropped if they occur in insignificant numbers. 'BEAUTY', for example, is supposed to capture the presence of a beautiful setting, encompassing various category types, such as 'Lake', 'Beach', 'Scenic Lookout' and 'Garden'. 'COMMON_RESTAURANT', 'NOVEL_RESTAURANT' and 'RESTAURANT' are grouped based on how many of each restaurant category are observed in the Foursquare data across all the universities, with thresholds for bucketing based on the data. In this sense, 'Tibetan Restaurant' venues are less common and fall within a threshold that places them in the 'NOVEL_RESTAURANT' meta-category, while 'Italian Restaurant' venues are recorded in their hundreds and get grouped as 'COMMON_RESTAURANT'. 'EASY_FOOD' encompasses types like 'Fast Food Place', 'Burger Joint', 'Pizza Joint', 'Sandwich Place' etc. Full details of the groupings can be viewed in the 'Foursquare_Usage.ipynb' notebook accompanying this report.
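The grouping logic can be sketched roughly as below. The mapping is only a small excerpt, the 'Café' and 'Coffee Shop' entries and the two threshold values are assumptions for illustration, and the function names are hypothetical; the accompanying notebook holds the groupings actually used.

```python
from collections import Counter

# Excerpt of a raw-category -> grouped-category mapping (illustrative only).
CATEGORY_GROUPS = {
    "Lake": "BEAUTY", "Beach": "BEAUTY",
    "Scenic Lookout": "BEAUTY", "Garden": "BEAUTY",
    "Fast Food Place": "EASY_FOOD", "Burger Joint": "EASY_FOOD",
    "Pizza Joint": "EASY_FOOD", "Sandwich Place": "EASY_FOOD",
    "Café": "COFFEE", "Coffee Shop": "COFFEE",  # assumed members of COFFEE
}


def bucket_restaurant(total_count, novel_max=5, common_min=100):
    """Bucket a restaurant type by how often it occurs across all schools.

    The thresholds are hypothetical; the report derives them from the data.
    """
    if total_count <= novel_max:
        return "NOVEL_RESTAURANT"
    if total_count >= common_min:
        return "COMMON_RESTAURANT"
    return "RESTAURANT"


def group_category(raw, totals):
    """Map one raw venue category to its grouped category."""
    if raw in CATEGORY_GROUPS:
        return CATEGORY_GROUPS[raw]
    if raw.endswith("Restaurant"):
        return bucket_restaurant(totals[raw])
    return raw  # categories such as 'Park' are kept under their own name


# Made-up totals across all universities.
totals = Counter({
    "Italian Restaurant": 250,
    "Thai Restaurant": 40,
    "Tibetan Restaurant": 2,
})
grouped = [group_category(c, totals) for c in
           ["Lake", "Italian Restaurant", "Tibetan Restaurant", "Park"]]
```

Dropping categories that occur in insignificant numbers is not shown here; it could be a filter on the same totals.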
We are more interested in measuring distance in terms of deviation from the mean than in raw numbers, since some category types occur in much larger numbers than others. For instance, universities tend to have a large number of coffee shops surrounding them, so a difference of one or two coffee shops should not have the same influence as a difference of one or two lakes in the vicinity of a college. As such, we use a StandardScaler from the sklearn.preprocessing module in Python, which recomputes each attribute in terms of the number of positive or negative standard deviations from the mean.
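A small sketch of the effect, using made-up counts for two attributes (the column meanings and values are assumptions for the example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: COFFEE, BEAUTY. Coffee shops are plentiful, lakes are rare.
counts = np.array([
    [20.0, 0.0],
    [22.0, 1.0],
    [24.0, 2.0],
])

# Each column is shifted to mean 0 and rescaled to unit standard deviation.
scaled = StandardScaler().fit_transform(counts)
```

In this toy input a gap of two coffee shops and a gap of one lake both sit the same number of standard deviations from their column means, so after scaling the two columns are identical.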
Even though we have grouped venues, cutting the number of venue categories from 150 to 34, we still have a problem of relative weighting based on the prevalence of, say, food and beverage venues over outdoors-y venues. We can further group the categories into the following sets:
shops=['Arts & Crafts Store', 'Clothing Store', 'Gift Shop', 'Salon / Barbershop', 'Smoke Shop', 'GROCERY', 'CONVENIENCE', 'LIQUOR', 'BOOKSTORE', 'SWEETS', 'COFFEE' ]
food_and_beverage = ['Tea Room', 'RESTAURANT', 'COMMON_RESTAURANT', 'NOVEL_RESTAURANT', 'PUB', 'FANCY_BAR', 'FANCY_FOOD', 'JUICE', 'EASY_FOOD', 'BREAKFAST' ]
recreation = ['GYM', 'THEATER', 'STADIUM', 'VENUE', 'ART', 'MUSEUM', 'SELF_CARE', 'Student Center']
surroundings = ['TRAIL','HISTORIC','SQUARE','BEAUTY', 'Park']
with attributes scaled so that the shops, food_and_beverage, recreation and surroundings attributes each have the same aggregate influence on distance or similarity measurement.
We achieve this scaling by taking the spread of each attribute in a group, summing those spreads over the group, and then scaling every attribute in that group in inverse proportion to the sum. Hence, since 'surroundings' venues occur in smaller numbers, we end up scaling up the influence of these items on the similarity of different colleges relative to the other attribute groups.
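One possible reading of this step is sketched below. It assumes that "spread" means the per-attribute standard deviation and that groups are given as lists of column indices; both are interpretations, and balance_groups is a hypothetical name rather than a function from the project.

```python
import numpy as np


def balance_groups(X, groups):
    """Divide each column by the summed standard deviation of its group."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for cols in groups.values():
        total_spread = X[:, cols].std(axis=0).sum()
        if total_spread > 0:  # leave constant groups untouched
            out[:, cols] = X[:, cols] / total_spread
    return out


# Made-up attribute matrix: two food columns and one surroundings column.
X = np.array([
    [10.0, 30.0, 0.0],
    [20.0, 10.0, 1.0],
    [30.0, 20.0, 2.0],
])
groups = {"food_and_beverage": [0, 1], "surroundings": [2]}
balanced = balance_groups(X, groups)
```

After this rescaling the summed spread of every group equals one, so a group with few or low-count attributes carries the same aggregate weight as a larger one.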
Finally we can add size variables to our universities, based on 'Small', 'Medium' and 'Large' bucketing from the rankings website data source. These data are one-hot encoded, meaning that a small university has values 1, 0 and 0 respectively for its 'Small', 'Medium' and 'Large' attributes. These attributes were scaled so as to have a strong, but not completely dominating influence on clustering so that universities of different sizes still can wind up in the same cluster, but there will often be a general trend of colleges grouped according to size. This balance seems right based on the groupings that emerge and the established wisdom that a larger university can still have a "small feel".
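A minimal sketch of a weighted one-hot encoding follows. The weight of 2.0 is a placeholder, since the report does not state the value used, and encode_size is a hypothetical helper.

```python
import numpy as np

SIZES = ["Small", "Medium", "Large"]
SIZE_WEIGHT = 2.0  # hypothetical; tune until size is strong but not dominant


def encode_size(size, weight=SIZE_WEIGHT):
    """One-hot encode a size bucket, scaled by a weight."""
    if size not in SIZES:
        raise ValueError(f"unknown size bucket: {size!r}")
    return [weight if size == s else 0.0 for s in SIZES]


# One row per university, columns in the order Small, Medium, Large.
size_features = np.array([encode_size(s) for s in ["Small", "Large", "Medium"]])
```

These columns would be appended to the scaled venue attributes before clustering; a larger weight pushes schools of different sizes further apart.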
The results of the different clusterings are given below, with the ten most common venues given as a means of concretely grasping what similarities we have measured between different colleges and their vicinities.
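One way such per-cluster summaries could be computed with pandas is sketched here. The function name top_venues, the school names and the counts are all made up for the example.

```python
import pandas as pd


def top_venues(counts, labels, n=10):
    """Map each cluster label to its n most frequent venue categories."""
    totals = counts.groupby(labels).sum()
    return {
        cluster: row.sort_values(ascending=False).head(n).index.tolist()
        for cluster, row in totals.iterrows()
    }


# Rows are schools, columns are grouped venue categories (raw counts).
counts = pd.DataFrame(
    {"COFFEE": [9, 8, 1], "Park": [1, 2, 7], "PUB": [4, 5, 0]},
    index=["School A", "School B", "School C"],
)
labels = pd.Series([0, 0, 1], index=counts.index)  # cluster of each school
top = top_venues(counts, labels, n=2)
```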
Finally, we can render the clusters on a map of the United States.
The results stand up to various common-sense checks. Barnard and Columbia, along with New York University and Cooper Union, are grouped together; the schools in each pair sit almost on top of each other, so this respects the similarity of their feels in the urban New York City environment. Meanwhile, various more urban schools are grouped together, with Boston University alongside Columbia and Barnard, in spite of the fact that no measure of city centeredness was explicitly given. The small liberal arts colleges of the North East, together with some other "classic" college towns, are grouped together in cluster 1. In cluster 5 we have a few campuses that could easily be described as sheltered but urban. USC stands alone, but this is perhaps to be expected, since its Foursquare data is uniquely rich in museums and fast food!
It is interesting to note that many of the clusters feature schools across a wide range of rankings, so, in a practical sense, a student could use a cluster of schools that seem to appeal to them as a starting off point for constructing a sensible list of schools to apply to.
In future work, it would be worthwhile to use a more careful and less automated method for identifying the latitude and longitude of where undergraduates live at the listed institutions. Inaccuracies in this phase of data wrangling can result in enormous disparities in the results, as some universities sprawl across multiple cities, with widely varying amenities in each. Feature selection could be improved by surveying college graduates or current students on which "venues" were or are the most significant in their college experience, in order to ensure that similarity in college experience is measured in the right dimensions with the right amount of emphasis.
Ultimately, both venues-in-the-vicinity and student-body-size are only proxies for feel. The hope in this clustering exercise was that it could bring up unusual connections between seemingly dissimilar schools, and provide a starting-off point in college research for students and professionals with less familiarity with certain schools. By examining the top 10 most commonly encountered venues in the vicinity of schools clustered together, we have a lens through which to see the similarities in experience we might have as a student-about-town during our college years. In some cases, the differences between certain colleges are smaller than we might think when we really get down to the gritty details of what experiences a student can have while studying there. In the age of Covid-19, with less access to college tours for would-be students, tools like the clustering formulated in this report may prove helpful for students trying to grasp what their college experience might be like once in-person teaching resumes. At the same time, there are many other criteria that students should take into account, including the quality of programming in majors of interest, faculty-to-student ratio, diversity and extracurricular offerings. Still, this clustering, and the methods that created it, are a great way to get a sense of what is out there, which campuses are similar and, using the map visualization, where they sit geographically.