Finding the Right College Environment

An Exploratory Data Science Project by Ben Wetherfield

Introduction

In trying to determine their best-fit colleges, high school juniors face a plethora of available information, forum discussions and hearsay. To break through the noise, parents and guardians often turn to education consultants and counselors to help their children find the best school that will admit them and at which they will succeed. Students must navigate concrete gateways to entry, such as testing requirements and the availability of financial aid, while juggling the intangibles of "feel" and "prestige". Especially for international students with a limited frame of reference for universities, it is hard even to conjure up a shortlist of schools, let alone visit campuses in search of "gut feelings" and "instincts". The global coronavirus pandemic has only raised the barriers to campus access for international students and, indeed, even for students on American soil.

Counselors and consultants offer a valuable source of insight through their knowledge and anecdotal experience. There are still limits, however, to the lived experience of different campuses that these professionals can have, and hence to the certainty with which they can recommend particular campuses to particular students. Inevitably, there are campuses they know extremely well, and their recommendations are sometimes biased towards them. Students, meanwhile, can have a very strong feeling for one campus that they happened to visit, but may miss out on a better-fit school that they were not able to reach on a college tour.

For professionals and clients alike, this project aims to provide a better means of navigating clusters of "similar" schools. In particular, based on the surroundings of different colleges and the size of their student bodies, we aim to characterize the "feel" of different schools and group them accordingly. By grouping similar-feeling universities and presenting them alongside college rank information (a proxy for prestige), we provide a useful tool for cutting through the noise and developing a better college shortlist for students around the world.

Data

We use three sources of data. Our initial port of call is a popular college ranking website (niche.com), from which we extract a list of around 80 top universities. As well as the list of schools itself, we collect information on university size, grouped in three buckets: small (<5000), medium (5000-15000) and large (15000+). We gather this information with some simple web-scraping techniques and HTML parsing on downloaded pages (as we discovered that the website blocks web scrapers). All the work done in this project is done for educational purposes! Moreover, the scraping is done at a small enough scale that the same information could quite simply have been gathered by hand.

Next, we use the Foursquare Places API to characterize the college towns of the universities on this list. The Foursquare Places API returns venues surrounding a point on the globe (characterized by latitude and longitude) together with a category for each venue (such as "Coffee Shop", "Park" etc.). We use these categories to create groupings and measure similarities between the areas surrounding different schools in our list. Finally, latitudes and longitudes are gathered for each university using geocoding tools, namely the geocoder API for the Nominatim open source geocoding platform (which uses OpenStreetMap data).

Methodology

Geolocation

Since Nominatim returns more than one latitude/longitude pair for each university, we collect the full set, handle outliers and take an average (specifically, the median) as an approximation of the undergraduate living environment. Universities can sometimes sprawl, with various labs and research outposts set apart from the central campus. A median is better suited than a mean to mitigating the effect of these spread-out outposts, since undergraduates more commonly live close to the central mass of buildings. Outliers that fall outside the United States entirely are easy to eliminate as a preprocessing step.
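
As a minimal sketch of this step (not the project's actual code), the following filters candidate points to a rough bounding box and takes the coordinate-wise median. The bounding box, the function name central_location and the sample coordinates are all assumptions made up for illustration.

```python
from statistics import median

# Rough bounding box for the contiguous United States. This is an assumption
# for illustration; Alaska and Hawaii would need their own boxes.
US_LAT = (24.0, 50.0)
US_LON = (-125.0, -66.0)


def central_location(points):
    """Return the median (lat, lon) of the points that fall inside the US box."""
    in_us = [
        (lat, lon)
        for lat, lon in points
        if US_LAT[0] <= lat <= US_LAT[1] and US_LON[0] <= lon <= US_LON[1]
    ]
    if not in_us:
        raise ValueError("no candidate locations inside the bounding box")
    return (
        median(lat for lat, _ in in_us),
        median(lon for _, lon in in_us),
    )


# Hypothetical geocoder results for one university.
candidates = [
    (42.3770, -71.1167),  # central campus building
    (42.3744, -71.1182),  # central campus building
    (42.3780, -71.1150),  # central campus building
    (42.5000, -71.5000),  # remote research outpost
    (51.5074, -0.1278),   # unrelated match outside the United States
]
location = central_location(candidates)
```

With these made-up points, the match abroad is discarded outright, and the remote outpost pulls the median far less than it would pull a mean.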

Clustering

Our overall approach is to run a k-means clustering algorithm. This algorithm assigns each datapoint to one of k clusters (k being an integer chosen before running the algorithm) based on how close the datapoints are to each other, so that two points in the same cluster tend to have similar values for most attributes. As such, cluster membership serves as a useful indication of similarity between two entries in a dataset.
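
To make the idea concrete, here is a small pure-Python sketch of the standard k-means iteration (Lloyd's algorithm). It is an illustration only and not the implementation used in the project; in particular, seeding the centroids with the first k points is a simplification assumed here, where library implementations typically use random or k-means++ seeding.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    # Simplified seeding (an assumption of this sketch): the first k points.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels


# Two visibly separate groups of made-up two-attribute datapoints.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels = kmeans(points, k=2)
```

On this toy input the first three points end up sharing one label and the last three the other.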

For us, a datapoint will record the categories of venues that are found around a given university. There are some trivial differences between university vicinities that we want to avoid measuring. For example, if one university is reported by Foursquare to have many "cafés" around it, and another to neighbor many "coffee shops" we would want these to be measured as close rather than distant. The fix needed is to make appropriate groupings of venue categories.
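
A grouping like this can be sketched as a simple lookup table applied before counting. The raw category names and the mapping below are a hypothetical excerpt chosen for illustration, and raw categories missing from the mapping are simply dropped in this sketch.

```python
from collections import Counter

# Hypothetical excerpt of a raw-category -> grouped-category mapping.
GROUPS = {
    "Café": "COFFEE",
    "Coffee Shop": "COFFEE",
    "Park": "Park",
}


def grouped_counts(raw_categories, groups=GROUPS):
    """Count venues per grouped category, dropping unmapped raw categories."""
    return Counter(groups[c] for c in raw_categories if c in groups)


# Two made-up vicinities that differ only in how their coffee venues are labeled.
college_a = grouped_counts(["Café", "Café", "Park", "Laundromat"])
college_b = grouped_counts(["Coffee Shop", "Coffee Shop", "Park"])
```

After grouping, the two made-up vicinities have identical counts, so they would be measured as close rather than distant.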

The final list of grouped category definitions used is:

'Arts & Crafts Store', 'Clothing Store', 'Gift Shop', 'Park', 'Salon / Barbershop', 'Smoke Shop', 'Student Center', 'Tea Room', 'MUSEUM', 'BOOKSTORE', 'COMMON_RESTAURANT', 'NOVEL_RESTAURANT', 'RESTAURANT', 'GYM', 'THEATER', 'STADIUM', 'VENUE', 'LIQUOR', 'PUB', 'FANCY_BAR', 'TRAIL', 'HISTORIC', 'SQUARE', 'BEAUTY', 'ART', 'FANCY_FOOD', 'JUICE', 'EASY_FOOD', 'GROCERY', 'SWEETS', 'CONVENIENCE', 'BREAKFAST', 'COFFEE', 'SELF_CARE'

with venue categories grouped into these categories or dropped if they occur in insignificant numbers. 'BEAUTY', for example, is supposed to capture the presence of a beautiful setting, encompassing various category types such as 'Lake', 'Beach', 'Scenic Lookout' and 'Garden'. 'COMMON_RESTAURANT', 'NOVEL_RESTAURANT' and 'RESTAURANT' are grouped based on how many of each restaurant category are observed in the Foursquare data across all the universities, with the thresholds for bucketing chosen from the data. In this sense, 'Tibetan Restaurant' venues are less common and fall within a threshold that places them in the 'NOVEL_RESTAURANT' meta-category, while 'Italian Restaurant' venues are recorded in their hundreds and get grouped as 'COMMON_RESTAURANT'! 'EASY_FOOD' encompasses types like 'Fast Food Place', 'Burger Joint', 'Pizza Joint', 'Sandwich Place' etc. Full details of the groupings can be viewed in the 'Foursquare_Usage.ipynb' notebook accompanying this report.
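
The threshold-based bucketing of restaurant categories might look like the sketch below. The threshold values, the example totals and the assumption that 'RESTAURANT' is the middle bucket are all hypothetical; the actual thresholds are in the accompanying notebook.

```python
def restaurant_bucket(total_count, novel_max=20, common_min=100):
    """Bucket a restaurant category by how often it appears across all schools.

    novel_max and common_min are made-up thresholds for illustration.
    """
    if total_count <= novel_max:
        return "NOVEL_RESTAURANT"
    if total_count >= common_min:
        return "COMMON_RESTAURANT"
    return "RESTAURANT"


# Hypothetical totals observed across all universities.
totals = {
    "Tibetan Restaurant": 3,
    "Thai Restaurant": 60,
    "Italian Restaurant": 240,
}
buckets = {name: restaurant_bucket(count) for name, count in totals.items()}
```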

We are more interested in measuring distance in terms of deviation from a mean than in raw numbers, since some category types occur in much larger numbers than others. For instance, universities tend to have a large number of coffee shops surrounding them, so a difference of one or two coffee shops should not have the same influence as a difference of one or two lakes in the vicinity of a college! As such, we use StandardScaler from the sklearn.preprocessing module in Python, which re-expresses each attribute as a number of standard deviations above or below its mean.
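
The effect of this standardization can be sketched without any dependencies. The function below computes the same kind of z-score (using the population standard deviation, as StandardScaler does), and the venue counts are made up for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Express each value as a number of standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # a constant attribute carries no signal
    return [(v - mu) / sigma for v in values]


# Hypothetical counts at three colleges.
coffee = [10, 20, 30]  # coffee shops vary by tens
lakes = [0, 1, 2]      # lakes vary by ones

z_coffee = standardize(coffee)
z_lakes = standardize(lakes)
```

Here a gap of ten coffee shops and a gap of one lake map to the same standardized distance, which is the behavior the paragraph above asks for.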

Even though we have grouped venues, cutting the number of venue categories from 150 to 34, we still have a problem of relative weighting based on the prevalence of, say, food and beverage venues over outdoors-y venues. We can further group the categories into the following sets:

shops = ['Arts & Crafts Store', 'Clothing Store', 'Gift Shop', 'Salon / Barbershop', 'Smoke Shop', 'GROCERY', 'CONVENIENCE', 'LIQUOR', 'BOOKSTORE', 'SWEETS', 'COFFEE']

food_and_beverage = ['Tea Room', 'RESTAURANT', 'COMMON_RESTAURANT', 'NOVEL_RESTAURANT', 'PUB', 'FANCY_BAR', 'FANCY_FOOD', 'JUICE', 'EASY_FOOD', 'BREAKFAST']

recreation = ['GYM', 'THEATER', 'STADIUM', 'VENUE', 'ART', 'MUSEUM', 'SELF_CARE', 'Student Center']

surroundings = ['TRAIL', 'HISTORIC', 'SQUARE', 'BEAUTY', 'Park']

with attributes scaled so that the shops, food_and_beverage, recreation and surroundings attributes each have the same aggregate influence on distance or similarity measurement.

We achieve this scaling by taking the spread of each attribute in a set, summing these spreads over the set, and then scaling every attribute in the set in inverse proportion to that sum. Hence, since 'surroundings' venues occur in smaller numbers, we end up scaling up the influence of these items on the similarity of different colleges relative to the other attribute sets.
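
One possible reading of this scaling is sketched below. It assumes that "spread" means the population standard deviation of each attribute, which the report does not specify, and it uses made-up counts and a reduced pair of attribute sets.

```python
from statistics import pstdev


def group_weights(columns, groups):
    """Weight per group: 1 / (sum of the spreads of the group's attributes).

    Assumes at least one attribute in each group actually varies.
    """
    return {
        group: 1.0 / sum(pstdev(columns[name]) for name in names)
        for group, names in groups.items()
    }


def rescale(columns, groups):
    """Multiply every attribute by the weight of the group it belongs to."""
    weights = group_weights(columns, groups)
    return {
        name: [value * weights[group] for value in columns[name]]
        for group, names in groups.items()
        for name in names
    }


# Hypothetical counts for three colleges.
columns = {
    "COFFEE": [10, 20, 30],
    "GROCERY": [2, 4, 6],
    "TRAIL": [0, 1, 2],
}
groups = {"shops": ["COFFEE", "GROCERY"], "surroundings": ["TRAIL"]}
scaled = rescale(columns, groups)
```

After rescaling, the spreads within each set sum to one, so the sparse 'surroundings' set carries the same aggregate weight as the much busier 'shops' set.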

Finally we can add size variables to our universities, based on 'Small', 'Medium' and 'Large' bucketing from the rankings website data source. These data are one-hot encoded, meaning that a small university has values 1, 0 and 0 respectively for its 'Small', 'Medium' and 'Large' attributes. These attributes were scaled so as to have a strong, but not completely dominating influence on clustering so that universities of different sizes still can wind up in the same cluster, but there will often be a general trend of colleges grouped according to size. This balance seems right based on the groupings that emerge and the established wisdom that a larger university can still have a "small feel".
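
A weighted one-hot encoding of this kind can be sketched as follows. The weight parameter stands in for the scaling described above; its value here is hypothetical, since the report does not state the factor that was used.

```python
SIZES = ("Small", "Medium", "Large")


def encode_size(size, weight=1.0):
    """One-hot encode a size bucket, scaled by a weight that tunes its influence."""
    if size not in SIZES:
        raise ValueError(f"unknown size bucket: {size!r}")
    return [weight if size == bucket else 0.0 for bucket in SIZES]


small = encode_size("Small")
large = encode_size("Large", weight=2.0)  # hypothetical weight
```

A larger weight pushes schools of different sizes further apart in the clustering, while a smaller one lets venue similarity dominate.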

Results

The resulting clusters are given below. For each school we list its ten most common venue categories, as a concrete way of grasping the similarities we have measured between different colleges and their vicinities.

Cluster 1

University | Rank | Size | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue
Amherst College | 22 | Small | COMMON_RESTAURANT | RESTAURANT | SWEETS | EASY_FOOD | PUB | MUSEUM | SELF_CARE | Student Center | THEATER | BOOKSTORE
Bates College | 47 | Small | SWEETS | COMMON_RESTAURANT | VENUE | SELF_CARE | BOOKSTORE | THEATER | GYM | RESTAURANT | NOVEL_RESTAURANT | Tea Room
Boston College | 48 | Medium | CONVENIENCE | SWEETS | COFFEE | BEAUTY | STADIUM | EASY_FOOD | COMMON_RESTAURANT | MUSEUM | RESTAURANT | NOVEL_RESTAURANT
Bowdoin College | 21 | Small | COMMON_RESTAURANT | PUB | RESTAURANT | EASY_FOOD | COFFEE | MUSEUM | THEATER | CONVENIENCE | Park | GROCERY
Bryn Mawr College | 71 | Small | COMMON_RESTAURANT | BREAKFAST | EASY_FOOD | SELF_CARE | MUSEUM | GYM | RESTAURANT | NOVEL_RESTAURANT | BOOKSTORE | Student Center
Case Western Reserve University | 79 | Medium | EASY_FOOD | RESTAURANT | BREAKFAST | COFFEE | COMMON_RESTAURANT | PUB | GROCERY | GYM | FANCY_BAR | SELF_CARE
Colgate University | 57 | Small | TRAIL | COFFEE | GYM | BEAUTY | SELF_CARE | STADIUM | RESTAURANT | NOVEL_RESTAURANT | COMMON_RESTAURANT | BOOKSTORE
Colorado College | 67 | Small | LIQUOR | MUSEUM | CONVENIENCE | COMMON_RESTAURANT | COFFEE | ART | GYM | TRAIL | SELF_CARE | Clothing Store
Duke University | 6 | Medium | COMMON_RESTAURANT | GYM | THEATER | COFFEE | MUSEUM | BEAUTY | ART | EASY_FOOD | BREAKFAST | Smoke Shop
Emory University | 30 | Medium | COFFEE | GYM | EASY_FOOD | MUSEUM | BREAKFAST | Park | RESTAURANT | FANCY_FOOD | NOVEL_RESTAURANT | COMMON_RESTAURANT
Grinnell College | 69 | Small | GYM | ART | THEATER | SELF_CARE | RESTAURANT | NOVEL_RESTAURANT | COMMON_RESTAURANT | BOOKSTORE | MUSEUM | VENUE
Johns Hopkins University | 27 | Medium | COMMON_RESTAURANT | EASY_FOOD | SELF_CARE | SWEETS | LIQUOR | CONVENIENCE | BREAKFAST | GROCERY | Gift Shop | Park
Kenyon College | 72 | Small | EASY_FOOD | CONVENIENCE | COFFEE | GROCERY | COMMON_RESTAURANT | TRAIL | SELF_CARE | MUSEUM | RESTAURANT | NOVEL_RESTAURANT
Middlebury College | 33 | Small | VENUE | MUSEUM | COFFEE | GYM | ART | Smoke Shop | Student Center | Tea Room | Salon / Barbershop | STADIUM
Northwestern University | 11 | Medium | COFFEE | GYM | EASY_FOOD | ART | COMMON_RESTAURANT | MUSEUM | BEAUTY | RESTAURANT | JUICE | SWEETS
Rensselaer Polytechnic Institute | 73 | Medium | PUB | GYM | EASY_FOOD | LIQUOR | RESTAURANT | Smoke Shop | Student Center | THEATER | COFFEE | Clothing Store
Rice University | 10 | Medium | ART | COFFEE | STADIUM | THEATER | PUB | VENUE | Park | Salon / Barbershop | Smoke Shop | Student Center
Smith College | 62 | Small | RESTAURANT | COMMON_RESTAURANT | NOVEL_RESTAURANT | SELF_CARE | FANCY_BAR | Tea Room | MUSEUM | BOOKSTORE | VENUE | COFFEE
Swarthmore College | 40 | Small | EASY_FOOD | TRAIL | SWEETS | FANCY_FOOD | COMMON_RESTAURANT | RESTAURANT | PUB | BREAKFAST | GYM | GROCERY
University of California - Santa Barbara | 75 | Large | EASY_FOOD | COMMON_RESTAURANT | BEAUTY | GYM | COFFEE | SWEETS | NOVEL_RESTAURANT | JUICE | THEATER | MUSEUM
University of Florida | 59 | Large | BREAKFAST | COMMON_RESTAURANT | BEAUTY | MUSEUM | VENUE | GROCERY | EASY_FOOD | Salon / Barbershop | ART | PUB
University of Georgia | 66 | Large | GYM | ART | VENUE | COFFEE | FANCY_BAR | MUSEUM | STADIUM | EASY_FOOD | Student Center | Tea Room
University of Illinois at Urbana-Champaign | 64 | Large | COFFEE | STADIUM | ART | MUSEUM | GYM | EASY_FOOD | RESTAURANT | NOVEL_RESTAURANT | COMMON_RESTAURANT | BOOKSTORE
University of Miami | 74 | Medium | COMMON_RESTAURANT | EASY_FOOD | BREAKFAST | GYM | COFFEE | SELF_CARE | CONVENIENCE | JUICE | PUB | MUSEUM
University of Notre Dame | 17 | Medium | EASY_FOOD | COMMON_RESTAURANT | ART | BREAKFAST | COFFEE | MUSEUM | Tea Room | RESTAURANT | NOVEL_RESTAURANT | BOOKSTORE
University of Richmond | 61 | Medium | COFFEE | VENUE | ART | MUSEUM | PUB | GYM | CONVENIENCE | Smoke Shop | Student Center | Tea Room
University of Virginia | 29 | Large | COMMON_RESTAURANT | EASY_FOOD | BREAKFAST | CONVENIENCE | COFFEE | RESTAURANT | GYM | MUSEUM | PUB | ART
University of Wisconsin - Madison | 63 | Large | BEAUTY | COFFEE | GYM | BOOKSTORE | THEATER | RESTAURANT | NOVEL_RESTAURANT | COMMON_RESTAURANT | SELF_CARE | STADIUM
Vassar College | 65 | Small | ART | EASY_FOOD | PUB | TRAIL | MUSEUM | FANCY_FOOD | GYM | Student Center | Smoke Shop | THEATER
Wake Forest University | 42 | Medium | COFFEE | EASY_FOOD | Student Center | MUSEUM | BREAKFAST | CONVENIENCE | COMMON_RESTAURANT | RESTAURANT | NOVEL_RESTAURANT | BOOKSTORE
Washington University in St. Louis | 14 | Medium | VENUE | COMMON_RESTAURANT | BREAKFAST | COFFEE | EASY_FOOD | THEATER | BOOKSTORE | GYM | RESTAURANT | NOVEL_RESTAURANT
Wellesley College | 39 | Small | COMMON_RESTAURANT | Clothing Store | FANCY_FOOD | PUB | BOOKSTORE | TRAIL | MUSEUM | GYM | ART | EASY_FOOD
Wesleyan University | 54 | Small | COFFEE | EASY_FOOD | RESTAURANT | COMMON_RESTAURANT | GYM | BREAKFAST | CONVENIENCE | NOVEL_RESTAURANT | THEATER | MUSEUM

Cluster 2

University | Rank | Size | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue
Brown University | 7 | Medium | RESTAURANT | COMMON_RESTAURANT | EASY_FOOD | COFFEE | NOVEL_RESTAURANT | SWEETS | BREAKFAST | GYM | ART | MUSEUM
Cooper Union | 68 | Small | RESTAURANT | COMMON_RESTAURANT | SWEETS | COFFEE | EASY_FOOD | NOVEL_RESTAURANT | SELF_CARE | GYM | GROCERY | PUB
Lehigh University | 55 | Medium | COMMON_RESTAURANT | EASY_FOOD | COFFEE | PUB | CONVENIENCE | RESTAURANT | SWEETS | NOVEL_RESTAURANT | SQUARE | Salon / Barbershop
New York University | 46 | Large | COMMON_RESTAURANT | COFFEE | SWEETS | NOVEL_RESTAURANT | RESTAURANT | GYM | EASY_FOOD | SELF_CARE | VENUE | THEATER
Pomona College | 13 | Small | COMMON_RESTAURANT | EASY_FOOD | RESTAURANT | COFFEE | SWEETS | BREAKFAST | NOVEL_RESTAURANT | FANCY_FOOD | THEATER | SQUARE
University of California - Berkeley | 41 | Large | RESTAURANT | COMMON_RESTAURANT | COFFEE | EASY_FOOD | NOVEL_RESTAURANT | SWEETS | PUB | GYM | BREAKFAST | THEATER
University of Michigan - Ann Arbor | 23 | Large | EASY_FOOD | RESTAURANT | COFFEE | THEATER | PUB | NOVEL_RESTAURANT | BREAKFAST | COMMON_RESTAURANT | ART | MUSEUM
Virginia Tech | 58 | Large | COMMON_RESTAURANT | EASY_FOOD | PUB | COFFEE | BREAKFAST | RESTAURANT | CONVENIENCE | BOOKSTORE | SWEETS | NOVEL_RESTAURANT
Yale University | 3 | Medium | COMMON_RESTAURANT | EASY_FOOD | RESTAURANT | SWEETS | COFFEE | BREAKFAST | THEATER | MUSEUM | ART | NOVEL_RESTAURANT

Cluster 3

University | Rank | Size | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue
Brigham Young University | 78 | Large | SWEETS | EASY_FOOD | SQUARE | COMMON_RESTAURANT | MUSEUM | Tea Room | COFFEE | Salon / Barbershop | NOVEL_RESTAURANT | RESTAURANT
California Institute of Technology | 16 | Small | COFFEE | SWEETS | EASY_FOOD | COMMON_RESTAURANT | SQUARE | BREAKFAST | RESTAURANT | GYM | LIQUOR | FANCY_BAR
Carnegie Mellon University | 28 | Medium | RESTAURANT | COFFEE | EASY_FOOD | COMMON_RESTAURANT | MUSEUM | SQUARE | ART | BEAUTY | VENUE | BOOKSTORE
Claremont McKenna College | 49 | Small | COFFEE | COMMON_RESTAURANT | RESTAURANT | CONVENIENCE | EASY_FOOD | SQUARE | SELF_CARE | SWEETS | Park | MUSEUM
College of William and Mary | 56 | Medium | COMMON_RESTAURANT | EASY_FOOD | COFFEE | PUB | SQUARE | BEAUTY | MUSEUM | THEATER | CONVENIENCE | ART
Cornell University | 20 | Large | COFFEE | EASY_FOOD | COMMON_RESTAURANT | GYM | BEAUTY | MUSEUM | THEATER | BOOKSTORE | VENUE | PUB
Dartmouth College | 15 | Medium | RESTAURANT | EASY_FOOD | COFFEE | COMMON_RESTAURANT | CONVENIENCE | Clothing Store | BOOKSTORE | VENUE | SQUARE | SWEETS
Georgia Institute of Technology | 38 | Medium | COMMON_RESTAURANT | COFFEE | EASY_FOOD | SQUARE | THEATER | BREAKFAST | RESTAURANT | VENUE | ART | GYM
Harvey Mudd College | 50 | Small | COFFEE | COMMON_RESTAURANT | SQUARE | CONVENIENCE | SWEETS | EASY_FOOD | BEAUTY | SELF_CARE | Tea Room | NOVEL_RESTAURANT
Massachusetts Institute of Technology | 1 | Medium | EASY_FOOD | GYM | RESTAURANT | COMMON_RESTAURANT | Park | PUB | COFFEE | SQUARE | NOVEL_RESTAURANT | Tea Room
Northeastern University | 44 | Medium | EASY_FOOD | COMMON_RESTAURANT | VENUE | COFFEE | ART | SWEETS | RESTAURANT | MUSEUM | NOVEL_RESTAURANT | GROCERY
Princeton University | 5 | Medium | SQUARE | COFFEE | COMMON_RESTAURANT | BEAUTY | RESTAURANT | VENUE | PUB | BOOKSTORE | MUSEUM | THEATER
Southern Methodist University | 77 | Medium | EASY_FOOD | COMMON_RESTAURANT | SWEETS | RESTAURANT | SQUARE | COFFEE | SELF_CARE | Salon / Barbershop | CONVENIENCE | GYM
Stanford University | 2 | Medium | COFFEE | ART | EASY_FOOD | BEAUTY | SQUARE | MUSEUM | STADIUM | HISTORIC | GYM | CONVENIENCE
Tufts University | 25 | Medium | COMMON_RESTAURANT | COFFEE | EASY_FOOD | RESTAURANT | SWEETS | SQUARE | LIQUOR | VENUE | THEATER | Salon / Barbershop
University of Chicago | 18 | Medium | COFFEE | EASY_FOOD | RESTAURANT | MUSEUM | BOOKSTORE | BREAKFAST | CONVENIENCE | SWEETS | THEATER | COMMON_RESTAURANT
University of North Carolina at Chapel Hill | 43 | Large | EASY_FOOD | COFFEE | COMMON_RESTAURANT | GYM | THEATER | VENUE | HISTORIC | SQUARE | BOOKSTORE | BEAUTY
University of Rochester | 76 | Medium | COFFEE | EASY_FOOD | SQUARE | GYM | CONVENIENCE | Park | Student Center | THEATER | COMMON_RESTAURANT | BOOKSTORE
University of Texas - Austin | 51 | Large | COFFEE | GYM | EASY_FOOD | ART | MUSEUM | VENUE | COMMON_RESTAURANT | THEATER | STADIUM | SQUARE

Cluster 4

University | Rank | Size | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue
University of Southern California | 19 | Large | MUSEUM | EASY_FOOD | COMMON_RESTAURANT | COFFEE | THEATER | RESTAURANT | NOVEL_RESTAURANT | STADIUM | GROCERY | SWEETS

Cluster 5

University | Rank | Size | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue
Harvard University | 4 | Medium | COMMON_RESTAURANT | EASY_FOOD | RESTAURANT | SQUARE | PUB | BOOKSTORE | VENUE | COFFEE | Park | SWEETS
University of California - Los Angeles | 26 | Large | COFFEE | EASY_FOOD | SQUARE | COMMON_RESTAURANT | ART | BEAUTY | VENUE | MUSEUM | RESTAURANT | GYM
University of Pennsylvania | 9 | Medium | EASY_FOOD | COMMON_RESTAURANT | COFFEE | SQUARE | RESTAURANT | BREAKFAST | SWEETS | SELF_CARE | BOOKSTORE | ART

Cluster 6

University | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue
Barnard College | COMMON_RESTAURANT | EASY_FOOD | RESTAURANT | COFFEE | CONVENIENCE | Park | NOVEL_RESTAURANT | GROCERY | PUB | BOOKSTORE
Boston University | COMMON_RESTAURANT | COFFEE | EASY_FOOD | RESTAURANT | BREAKFAST | CONVENIENCE | SWEETS | GYM | PUB | THEATER
Columbia University | COMMON_RESTAURANT | EASY_FOOD | NOVEL_RESTAURANT | RESTAURANT | COFFEE | CONVENIENCE | Park | GROCERY | GYM | PUB
Georgetown University | EASY_FOOD | COFFEE | Park | CONVENIENCE | BEAUTY | PUB | TRAIL | GYM | GROCERY | BREAKFAST

Map rendering

Finally, we can render the clusters on a map of the United States.

Discussion

The results stand up to various common-sense checks. Barnard and Columbia are grouped together, as are New York University and Cooper Union; the schools in each pair sit almost on top of each other, so these groupings respect the similarity of their feel in the urban New York City environment. More broadly, various urban schools are grouped together, with Boston University alongside Columbia and Barnard, in spite of the fact that no measure of city-centeredness was explicitly given. The small liberal arts colleges of the Northeast, together with some other "classic" college towns, are grouped in cluster 1. In cluster 5 we have a few campuses that could easily be described as sheltered but urban. USC stands alone, but this is perhaps to be expected, since its Foursquare data is uniquely rich in museums and fast food!

It is interesting to note that many of the clusters feature schools across a wide range of rankings. In a practical sense, then, a student could use a cluster of schools that appeals to them as a starting point for constructing a sensible list of schools to apply to.

In future work it would be worthwhile to use a more careful and less automated method for identifying the latitude and longitude at which undergraduates live at the listed institutions. Inaccuracies in this phase of data wrangling can produce enormous disparities in the results, as some universities sprawl across multiple cities, with widely varying amenities in each. Feature selection could be improved by surveying college graduates or current students on which "venues" were or are the most significant in their college experience, in order to ensure that similarity in college experience is being measured in the right dimensions with the right amount of emphasis.

Conclusion

Ultimately, both venues-in-the-vicinity and student-body-size are only proxies for feel. The hope in this clustering exercise was that it could bring up unusual connections between seemingly dissimilar schools, and provide a starting point in college research for students and professionals with less familiarity with certain schools. By examining the ten most commonly encountered venue categories in the vicinity of schools clustered together, we have a lens through which to see the similarities in experience we might have as a student-about-town during our college years. In some cases, the differences between certain colleges are smaller than we might think once we get down to the gritty details of what experiences a student can have while studying there. In the age of Covid-19, with less access to college tours for would-be students, tools like the clustering formulated in this report may prove helpful for students trying to grasp what their college experience might be like once in-person teaching resumes. At the same time, there are many other criteria that students should take into account, including the quality of programs in majors of interest, faculty-to-student ratio, diversity and extracurricular offerings. Still, this clustering, and the methods that created it, offer a useful way to get a sense of what is out there, which campuses are similar and, using the map visualization, where they sit geographically.