Data Science Capstone

Dec 14, 2020

Introduction

New York City, the most populous city in the United States, is a fiercely competitive market for any business looking to get its start. The competition is stiffer still in the restaurant industry, particularly for Italian food and pizza, for which the city is world-renowned.

In this scenario, famous Italian brothers Mario and Luigi are looking to relocate their beloved restaurant, Mario & Luigi’s Italian Ristorante & Pizzeria, to the borough of Brooklyn in New York City, but they are wary of the city’s many established Italian restaurants and pizzerias, which would compete with their business. They contacted IBM to help them use data science to solve this business problem.

Data

To understand the surrounding neighborhoods, we need data on their locations, specifically latitude and longitude coordinates for precise geolocation. NYU hosts a public JSON dataset that contains this information for the 5 boroughs and 306 neighborhoods of New York City (https://geo.nyu.edu/catalog/nyu_2451_34572).
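
As a minimal sketch of this step, the snippet below pulls one borough's neighborhood names and coordinates out of a GeoJSON-style document. The field names (`features`, `properties.borough`, `properties.name`, and `geometry.coordinates` in longitude, latitude order) are assumptions about the dataset's layout, and the two-record sample is made up for illustration rather than copied from the NYU file.

```python
import json

# Illustrative stand-in for the downloaded file; names and coordinates are placeholders.
SAMPLE = """
{"features": [
  {"properties": {"borough": "Brooklyn", "name": "Bay Ridge"},
   "geometry": {"coordinates": [-74.03, 40.63]}},
  {"properties": {"borough": "Manhattan", "name": "Chinatown"},
   "geometry": {"coordinates": [-74.00, 40.72]}}
]}
"""

def neighborhoods_in(data, borough):
    """Return (name, latitude, longitude) tuples for one borough."""
    rows = []
    for feature in data["features"]:
        props = feature["properties"]
        if props["borough"] == borough:
            # Assumed GeoJSON order: [longitude, latitude].
            lon, lat = feature["geometry"]["coordinates"]
            rows.append((props["name"], lat, lon))
    return rows

brooklyn = neighborhoods_in(json.loads(SAMPLE), "Brooklyn")
print(brooklyn)
```

With the real file, `json.load` on the opened download would replace `json.loads(SAMPLE)`.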

For restaurant data, we use the Foursquare API, which supports searching for businesses within specified latitude and longitude constraints. Feeding it the coordinates obtained from the neighborhood data in the previous step yields a dataset of businesses by neighborhood.
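
The request for one neighborhood might be assembled roughly as follows. This sketch assumes Foursquare's legacy v2 `venues/explore` endpoint and its `client_id`, `client_secret`, `v`, `ll`, `radius`, and `limit` parameters; the report does not say which endpoint was used, and the radius, version date, and credentials here are placeholders. The code only builds the URL and does not send it.

```python
from urllib.parse import urlencode

# Assumed endpoint: Foursquare's legacy v2 "explore" route.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, client_id, client_secret,
                      version="20201214", radius=500, limit=100):
    """Build a venue-search URL centred on one neighborhood."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,          # API version date, YYYYMMDD
        "ll": f"{lat},{lng}",  # search centre as "latitude,longitude"
        "radius": radius,      # metres; 500 is a placeholder value
        "limit": limit,        # number of listings requested per call
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"

url = build_explore_url(40.63, -74.03, "YOUR_ID", "YOUR_SECRET")
print(url)
```

Calling this once per neighborhood and collecting the responses would give the businesses-by-neighborhood dataset described above.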

Methodology

Using the NYU neighborhoods dataset, we defined a function that retrieves up to 100 business listings near each neighborhood’s coordinates. We then add each business’s category and location coordinates to the neighborhoods dataset and identify the unique business categories. Next, we build an additional dataset that records how frequently each category appears within each neighborhood. Finally, we use this information to form a dataset of the 10 most common business categories for each neighborhood of Brooklyn.
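
The frequency and ranking steps can be sketched with the standard library alone. The function name `top_categories` and the sample venues are hypothetical and made up for illustration; the actual notebook may well use pandas for the same calculation.

```python
from collections import Counter, defaultdict

def top_categories(venues, n=10):
    """Map each neighborhood to its n most common categories and their relative frequencies."""
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    result = {}
    for neighborhood, counter in counts.items():
        total = sum(counter.values())
        # most_common sorts by count, highest first.
        result[neighborhood] = [(cat, c / total) for cat, c in counter.most_common(n)]
    return result

# Made-up (neighborhood, category) pairs standing in for the API results.
venues = [
    ("Bay Ridge", "Pizza Place"),
    ("Bay Ridge", "Pizza Place"),
    ("Bay Ridge", "Italian Restaurant"),
    ("Bay Ridge", "Bar"),
    ("Mill Island", "Fast Food Restaurant"),
]
ranked = top_categories(venues, n=2)
print(ranked["Bay Ridge"])
```

The relative frequencies, rather than raw counts, keep neighborhoods with many listings comparable to those with few.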

With the most common business categories for each neighborhood in hand, we apply K-means clustering to group neighborhoods that share a similar mix of business categories and to separate those whose mix differs. We chose 4 clusters because that gave a more direct view of the differences in business composition.
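
To show what the clustering does, here is a plain-Python sketch of Lloyd's algorithm, the usual K-means procedure. It is a stand-in for illustration, not the report's code, which presumably relies on a library implementation. The `kmeans` function, the seed, and the tiny frequency vectors are assumptions; the sample uses 2 clusters only because it has 4 points.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Cluster equal-length numeric vectors with Lloyd's algorithm."""
    rng = random.Random(seed)
    # Start from k input points chosen at random without replacement.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Made-up category-frequency vectors: two pizza-heavy and two bar-heavy neighborhoods.
points = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.0, 0.1, 0.9], [0.1, 0.0, 0.9]]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Each neighborhood's label is then attached back to its row so the clusters can be mapped and inspected.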

After clustering the neighborhoods, we use the Folium mapping library to visualize the location and size of the clusters and judge their relevance to our business problem. Finally, we examine the neighborhoods that make up each cluster to understand its makeup and whether it would be a good place for our client to establish their business.

Results

The resulting map of our clustering process is shown above. As the purple and yellow clusters show, the composition of business categories overlaps significantly across Brooklyn, which is expected as the borough is well known for its Italian food and pizza. Shown below is the makeup of Cluster 1 (marked in purple), which holds many different types of restaurants; most of its neighborhoods have Italian or Pizza in their top 5, and in some cases both.

Cluster 3 (marked in yellow) is largely the same as Cluster 1, the main difference being the presence of bars in addition to restaurants. Because Italian and pizza restaurants are already common in both, we can rule out both of these clusters as viable locations for opening an Italian restaurant. Interestingly, Cluster 0 (marked in red) has only Fast Food in its top 5. Upon further review, however, the Mill Island neighborhood appears to hold significant industrial land, so we rule out this cluster as well. Finally, Cluster 3 (marked in teal, shown below) seems to be the perfect solution to our business problem: it holds commercial districts as well as other restaurants, yet Italian and Pizza restaurants are not in the top 10 most common categories for either of its neighborhoods.

Discussion

Our analysis faced some limitations. The initial NYU dataset includes little demographic information. In addition, further research suggests that the Mill Island neighborhood may not be the best location for a restaurant, as it is less residential than other neighborhoods and its population is relatively low.

Conclusion

In this report, we used geolocation and business category data to analyze the composition of businesses in various neighborhoods of Brooklyn and help a fledgling restaurant find the location with the least local competition. We recommend that Mario and Luigi open their restaurant within one of the Cluster 3 neighborhoods (marked in teal).

Python source code with full PDF report on GitHub: https://github.com/zacksheehan/DataScienceCapstone