Develop end-to-end ML solutions at scale on Google Cloud


Joshua Herrera 
Data Scientist

This blog provides an in-depth overview of how Google’s pre-trained Machine Learning (ML) APIs and AutoML Natural Language models can be used to rapidly develop end-to-end ML solutions at scale. We use the Yelp Academic Dataset for our demonstration because it is a messy, real-world dataset similar to what a data scientist would be asked to analyze on an actual engagement, and it lends itself to analyses that are in high demand from clients. In the description of the dataset release, Yelp states that “The Yelp dataset is a subset of our businesses, reviews, and user data,” consisting of over 6,685,900 reviews, 192,609 businesses, 200,000 pictures, and 10 metropolitan areas; 1,223,094 tips by 1,637,138 users; over 1.2 million business attributes like hours, parking, availability, and ambience; and aggregated check-ins over time for each of the 192,609 businesses. Thus, this dataset is an actual subset of the data being used in production today by Yelp, and it lends itself to several different use cases like image classification, natural language processing, network analysis, recommendation engines, and marketing analytics.

Since the Yelp Dataset Challenge is technically geared towards students looking to conduct research or analysis, we will be using the version of the dataset made available on Kaggle. At the time of writing, the most recent version of the dataset is version 9, last updated on 02/05/2019. The dataset is 8 GB and consists of 5 different JSON files corresponding to different data models: businesses, reviews, users, tips, and check-ins. Details on the data models can be found on the Yelp Dataset documentation site, which we reference heavily during the data exploration phase to perform an initial data validation.

For this demonstration we use a standard data science and ML tech stack: Python for development and Git for version control (checked into a GitLab remote repository). Since we will be using several Google products and services, we have decided to keep all development work in the Google Cloud ecosystem. Data is hosted in a dedicated Google Cloud Storage (GCS) bucket and all development work takes place on AI Platform Notebook instances, in line with Google’s recommended best practice of keeping compute and storage separate. Lastly, we also use simple shell scripts to run several data processing pipeline steps at a time.

The goal of this demonstration is to enable market research and location analysis. It can be packaged into a service offered to prospective entrepreneurs who are looking to open a business in an area. With the sentiment and classification information generated by our machine learning pipeline, we can understand the current climate of an area and derive actionable decisions: a gap analysis can show which businesses would stand out from the crowd, or highlight which kinds of businesses people tend to like and which are doing well in the area.

This demo makes thorough use of Google Cloud’s features, including AutoML and the Natural Language API, as well as the dashboarding services provided by Data Studio. The end-to-end pipeline and walkthrough are outlined here, and the code for the machine learning portion and for sending an HTTP request can be found on GitLab, linked here. Pandera has documented all non-original (borrowed or modified) code by outlining and linking the original sources if and when applicable, as per MIT’s writing code handbook, and has retained any comments from the original developers.

Data Exploration and Engineering 

We used Yelp’s business review dataset for this demo. The data was originally delivered as JSON files for businesses, check-ins, reviews, tips, and users. These files were then converted to CSV, flattening the columns that were nested JSON representations of normalized child columns. The flattened CSV files total 8 GB and can be downloaded from this link.

For this demonstration, we don’t need data from all the files, since we are mainly interested in the restaurant reviews. The final dataset is created by combining the business, user, and review data files. While star ratings are a useful way to convey the overall experience, they do not convey the context which led a reviewer to that experience. By categorizing reviews into relevant categories, the sentiment can help us understand why the reviewer rated the restaurant high or low.

Before diving in to explore the data, the Pandera team did some initial data preparation to put the data in a format that more easily lends itself to exploratory data analysis (EDA). The Yelp dataset consists of 5 JSON files, 4 of which follow a traditional, relational data model and 1 of which follows a less-structured, dynamic schema. The business dataset is the only one that comes with nested columns (object fields): attributes and hours. In our opinion, working with flattened CSV files is much easier during the EDA phase than nested JSON, so we converted the JSON files into normalized and flattened CSV files. As mentioned in the Overview, more details on the provided data models can be found on the Yelp dataset main documentation page.
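As a rough illustration of what that flattening step looks like, the sketch below reads one of the line-delimited JSON files with pandas and writes a flattened CSV. The file paths and helper name are illustrative; the actual conversion script in the repository may differ.

import pandas as pd

# A minimal sketch of the JSON-to-CSV flattening step; paths are illustrative.
def flatten_business_json(json_path="business.json", csv_path="business.csv"):
    # read the line-delimited JSON file into a dataframe
    df = pd.read_json(json_path, lines=True)
    # flatten nested object fields (e.g. attributes, hours) into prefixed columns
    flat = pd.json_normalize(df.to_dict(orient="records"), sep="_")
    flat.to_csv(csv_path, index=False)
    return flat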

The provided Yelp dataset is not limited to restaurants; rather, it contains data on every kind of business available on Yelp. In order to subset our data to restaurants, we must first clean the dataset, starting with the encoding issues generated by converting JSON to CSV. After executing the JSON-to-CSV script, many records have extra characters that are indicative of an encoding error.

Within the image, the encoding errors can be seen as the b', b", and b"u' characters prepended to the beginning of the values. We define a function called clean_byte_unicode_chars that uses regular expressions to search the input text for each of the cases of incorrect byte and unicode identifiers we found and removes them. We apply this function to all object-type columns containing strings.

 

import re

def clean_byte_unicode_chars(text):
    if isinstance(text, str):
        # case 1: values wrapped like b'...'
        if re.search(r"^b'", text):
            # return text with first 2 and last char removed
            return text[2:-1]
        # case 2: values wrapped like b"u'...'"
        if re.search(r'^b"u', text):
            # return text w/ first 4 and last 2 chars removed
            return text[4:-2]
        # case 3: values wrapped like b"{...}"
        if re.search(r'^b"\{', text):
            # return text with first 2 and last char removed
            return text[2:-1]
        # case 4: values starting with an escaped quote, e.g. b"\...
        if re.search(r'^b"\\', text):
            # return text w/ first 3 and last 2 chars removed
            return text[3:-2]
    # non-strings and strings with no byte/unicode prefix pass through unchanged
    return text

 

# apply our cleaning function to each object column in the dataframe and assign back
for col in df_business.select_dtypes('object').columns:
    df_business[col] = df_business[col].apply(clean_byte_unicode_chars)

 

After successfully removing the extra b', b", and b"u' characters, we discovered other types of encoding errors present within the dataset, such as escaped hex characters, an example of which is shown below.

 

df_business.name[df_business.name.str.contains('\\', regex=False, na=False)].head(10)

 

Using this first record as an example:

example = df_business.name[df_business.name.str.contains('\\', regex=False, na=False)].iloc[0]

 

import codecs

codecs.decode(example, 'unicode-escape')

 

The example is 'Flyjin Caf\\xc3\\xa9', and attempting to decode it using the code cell above results in 'Flyjin CafÃ©', which does not look correct. Using a hex conversion table, we know that \xc3\xa9 is the UTF-8 byte sequence for é, so it should render as é rather than Ã©. This is a good indication that the data was encoded with one standard and decoded with another, resulting in a nonsense character sequence referred to as mojibake. We create a function to decode the escaped character sequences and then use the ftfy package to fix the unicode.

 

import ftfy

def fix_encoding(text):
    if text is not None:
        # interpret escaped sequences like \xc3\xa9 as characters
        text = codecs.decode(text, 'unicode_escape')
        # repair the resulting mojibake (e.g. 'CafÃ©' -> 'Café')
        text = ftfy.fix_text(text)
    return text

 

Applying this function to our example returns ‘Flyjin Café’, as we would expect it to appear, and this fix is applied to the dataframe. 
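Applied across the dataframe in the same way as the earlier cleaning function, a minimal sketch (assuming the cleaned df_business dataframe from above) looks like this:

# apply the encoding fix to every string column of the cleaned df_business
for col in df_business.select_dtypes('object').columns:
    df_business[col] = df_business[col].apply(
        lambda t: fix_encoding(t) if isinstance(t, str) else t
    )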

We want to extract the subset of restaurants within the dataset. The categories feature describes attributes of the business, one of which is Restaurant. Because each record’s categories value is a multi-label, comma-separated string, there are over 37k unique category combinations containing Restaurant; a particular business will most likely have more than one category associated with it.
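The df_business_food subset used below is produced roughly as follows. This is a hedged sketch assuming a simple substring filter on the categories column; the exact filter string used in the original notebook may differ.

# keep businesses whose categories string mentions Food or Restaurant;
# the exact filter used in the original notebook may differ
mask = df_business['categories'].str.contains('Food|Restaurant', na=False)
df_business_food = df_business[mask].copy()
df_business_food.shape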

 

 

After one-hot encoding the categories column, we discovered that some businesses are not primarily restaurants but have food offerings. We would like to exclude these from our data since they are not pertinent to the business goal.

 

df_food_dummy = df_business_food['categories'].str.get_dummies(sep=', ').add_prefix('categories_')
df_food_dummy

 

Looking at the data, there are over 37k unique values within categories. In order to exclude the non-restaurant businesses, we subset the data to only include records whose exact category combination appears at least 10 times.

 

resbyCat = df_business_food.groupby('categories')
size = resbyCat.size()

size.sort_values(ascending=False, inplace=True)
size

size10 = size[size >= 10]
size10.sort_values(ascending=False, inplace=True)
size10

 

By limiting the data to unique combinations that show up at least 10 times, we reduce the number of category combinations from over 37,000 to 511. While those 511 combinations are only 1.36% of the unique combinations, they cover 39.9% of the records.

 

pop = size.sum()
sample = size10.sum()
print('Number of records in the population: ' + str(pop))
print('Number of records in the subset: ' + str(sample))
print('The percentage of subset: ' + str(sample / pop))

 

 

df_final_business = df_business_food[df_business_food.categories.isin(size10.index.values)]
df_final_business

 

With the data selected, it is time to label the records so that we can train the model. Because AutoML uses transfer learning, we do not need an extremely large amount of data to get good predictive power, and the training dataset will be labeled manually. Rather than labeling all 30 thousand records, we apply a stratified sampling method to the dataset to get a more manageable size to label.

 

import numpy as np

df_stratified = (
    df_merged.groupby('categories', group_keys=False)
    .apply(lambda x: x.sample(int(np.rint(10000 * len(x) / len(df_merged)))))
    .sample(frac=1)
    .reset_index(drop=True)
)

 

This stratified sampling aims to bring the distribution of categories within our sample in line with the distribution in which they appear in the entire dataset. The stratified sampling resulted in over 1,400 records that were split among Pandera’s data science team for review. Each review was coded based on the presence of four labels: service, ambiance, value, and location.

 

Label definitions:

Service: Whether the review mentioned good or bad customer service, including time spent waiting to be attended to and the demeanor of staff.
Ambiance: Whether the review mentioned the interior design or atmosphere of the location, good or bad.
Value: Whether the review mentioned the pricing of the food in relation to the quality: great-tasting cheap food, or overpriced.
Location: Whether the review mentions the restaurant’s physical location relative to other attractions and important locations.

 

The Model
AutoML

With the training data labeled, it was saved to a Cloud Storage bucket and brought into AutoML Text and Document Classification as a dataset. By default, AutoML splits the training dataset into training (80%), testing (10%), and validation (10%) sets; the split can be controlled manually by specifying it during a CSV upload.
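As a rough, hedged sketch of what specifying the split during a CSV upload can look like: the example below adds an explicit TRAIN/VALIDATION/TEST column before writing the import file. The df_labeled dataframe and its column names are hypothetical, and the exact label-column layout expected by AutoML is described in its import documentation.

import numpy as np

# Assign an explicit 80/10/10 split; if this column is omitted, AutoML
# performs the split itself. df_labeled and its columns are placeholders.
rng = np.random.default_rng(42)
splits = rng.choice(['TRAIN', 'VALIDATION', 'TEST'], size=len(df_labeled), p=[0.8, 0.1, 0.1])
df_import = df_labeled.assign(ml_use=splits)[['ml_use', 'text', 'labels']]
df_import.to_csv('gs://YOUR-BUCKET/automl/train_import.csv', header=False, index=False)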

Leveraging Google’s transfer learning, we are able to provide limited training data and still get valid results. Transfer learning is a technique that takes advantage of Google’s pre-trained models, which have been trained on similar, much larger datasets. Because a model built via transfer learning doesn’t have to learn from scratch, it can generally reach higher accuracy with much less data and computation time than models that don’t use transfer learning. During training, AutoML also uses Neural Architecture Search technology. At a high level, Neural Architecture Search is a technique for automatically designing artificial neural networks, searching over candidate network architectures rather than relying on a single hand-designed one.

The AutoML service provides an aggregate set of evaluation metrics indicating how well the model performs overall (on the test dataset), as well as evaluation metrics for each category label indicating how well the model performs for that label. It provides the following evaluation metrics:

  • AuPRC: Area under Precision/Recall Curve, also referred to as “average precision.” Generally between 0.5 and 1.0. Higher values indicate more accurate models.
  • Confidence Threshold Curves: Show how different confidence score thresholds would affect precision, recall, true and false positive rates.
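These aggregate and per-label metrics can also be pulled programmatically with the AutoML client library. The sketch below follows the documented usage pattern; the project, region, and model IDs are placeholders.

from google.cloud import automl

# List evaluation metrics for a trained AutoML model; IDs are placeholders.
client = automl.AutoMlClient()
model_full_id = client.model_path('YOUR-PROJECT', 'us-central1', 'YOUR-MODEL-ID')

for evaluation in client.list_model_evaluations(parent=model_full_id, filter=""):
    metrics = evaluation.classification_evaluation_metrics
    # display_name is the label for per-label evaluations, empty for the aggregate
    print(evaluation.display_name, metrics.au_prc)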

Because the AutoML modeling is automated and taken care of under the hood, the only way to improve a model’s performance metrics while keeping the confidence threshold the same is to improve the training data, be it the quantity of records, the quality of records, or the features used for training. Because this model is a natural language model, we use only the review text to train it, but AutoML Tables supports the use of multiple features to train a classification model.

Over the course of this demonstration, multiple AutoML models were created on slightly different training datasets in order to address the shortcomings of previous models. One of the more troubling labels was Location, which had 25% precision and 7.14% recall in an early model. This label is the rarest of the four, occurring 145 times in the 1,400 records used to train that early model. To improve performance on this label, we explored the data and found an inconsistency in how the records were being labeled. After correcting the mislabeled Location records and adding another small sample, the number of reviews with Location mentioned dropped from 145 to 83; however, the precision for the Location label increased to 100% and the recall to 37.5% at a confidence threshold of 0.5. The latest model has a precision of 91.04% and a recall of 82.43% when evaluated on the randomly sampled 10% test dataset. Graphs of these metrics are shown below.

Natural Language Processing API

The other half of this guide’s output is sentiment analysis through Google’s Natural Language API. The NLP API has many features, ranging from content classification to syntax analysis; for this use case, we want feedback on whether a review was positive or negative. The sentiment analysis feature of the NLP API returns a score and a magnitude for each record. The score ranges from -1 to 1 and describes the emotion present within the text: -1 being negative, 1 being positive, and 0 being neutral or non-emotional. The magnitude describes the intensity of the emotion, starting at 0 with no fixed upper bound, with larger values indicating stronger emotional content.
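As a minimal sketch of scoring a single review with the google-cloud-language client library (the review text here is made up):

from google.cloud import language_v1

# Score one review with the Natural Language API; the text is a made-up example.
client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="The patio was beautiful but the service was painfully slow.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
sentiment = client.analyze_sentiment(request={"document": document}).document_sentiment
print(f"score={sentiment.score:.2f}, magnitude={sentiment.magnitude:.2f}")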

Integration within GCP

The overall process for applying our machine learning to the review data is split into three steps: fetching and preprocessing the data, applying sentiment analysis, and applying classification. Each step is implemented as its own Cloud Function using the Python 3.7 runtime. The three functions are connected to each other with Google Pub/Sub: as one function finishes, it publishes to a topic, and the following function, which is subscribed to that topic, is notified and kicks off its workload. The three functions culminate in a dataset that is appended to a BigQuery table of your designation, where all data is stored for later querying by analysts.
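Under the hood, the hand-off between functions is a plain Pub/Sub publish. A minimal sketch is shown below; the project and topic names are placeholders, and the pubsub_publisher helper used in the function code later presumably wraps similar logic.

from google.cloud import pubsub_v1

# Publish a small "done" message so the next Cloud Function (subscribed to the
# topic) kicks off its workload; project and topic names are placeholders.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("YOUR-PROJECT", "reviews-cleaned")
future = publisher.publish(topic_path, data=b"scrape_and_clean finished")
print(future.result())  # message ID once the publish succeeds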

The entire process is started with an HTTP request whose payload contains the zip code for the area whose restaurant reviews should be gathered. This request hits the endpoint exposed by the first Cloud Function, whose trigger is HTTP. After the request is sent, the function responds with a message detailing the location of the results table in BigQuery, and the job is passed on to the next function, which applies the sentiment analysis, and then to the final function, which classifies the review text into the proper categories.
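From the client side, kicking off the pipeline can be as simple as the following (the URL is a placeholder for your deployed function’s trigger); the Cloud Function code that receives this request is shown next.

import requests

# Trigger the first Cloud Function over HTTP; URL and zip code are placeholders.
url = "https://us-central1-YOUR-PROJECT.cloudfunctions.net/scrape_and_clean"
response = requests.post(url, json={"zip": "85004"})
print(response.text)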

import datetime as dt

# customfunctions and pubsub_publisher are helper modules packaged with the
# Cloud Function; see the repository for their implementations.
import customfunctions
import pubsub_publisher


def scrape_and_clean(request):

    request_json = request.get_json(silent=True)
    request_args = request.args

    if request_args and 'zip' in request_args:
        zip_code = request_args['zip']
    elif request_json and 'zip' in request_json:
        zip_code = request_json['zip']
    else:
        return "No field 'zip' found."

    # pull data
    final_df = customfunctions.scrape_main(zip_code)
    # clean up data
    final_df = customfunctions.clean_main(final_df)

    # add timestamp
    final_df['timestamp'] = dt.datetime.now()

    # Developer: Edit Here
    final_df.to_csv("gs://YOUR-BUCKET/YOUR/DIRECTORIES/final_df.csv")

    # trigger next Cloud Function; 'topic' is configured elsewhere in the full source
    pubsub_publisher.publish_topic(topic)

    return "Assuming 'zip' is valid, data will be appended to BigQuery shortly."

 

This sample of the first Cloud Function shows the handling of the HTTP payload and the application of the scrape_main and clean_main functions, which are part of the repository code. Code for all three Cloud Functions can be found in the GitLab README, which acts as a guide for replicating this demonstration in your own Google Cloud Platform project.

Accessing the Model and Results

As described before, the pipeline appends newly processed data to the designated table, from which further analysis can be done in BigQuery, or the data can be exported for analysis elsewhere. For this demonstration, we have also created a dashboard in Google’s Data Studio to highlight key findings in the data, as an example of the kinds of insights an analyst can derive from the model output. As new data is appended to the table, the dashboard continuously updates with it.
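As a quick, hedged illustration of querying the results downstream, the sketch below computes average sentiment per predicted category; the project, dataset, table, and column names are placeholders for whatever your pipeline writes.

from google.cloud import bigquery

# Average sentiment score per predicted category; names are placeholders.
client = bigquery.Client()
query = """
    SELECT category, AVG(sentiment_score) AS avg_sentiment, COUNT(*) AS reviews
    FROM `YOUR-PROJECT.yelp_demo.review_results`
    GROUP BY category
    ORDER BY avg_sentiment DESC
"""
for row in client.query(query).result():
    print(row.category, round(row.avg_sentiment, 2), row.reviews)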

 

 

From this dashboard, we can see an overall view of the reviews and their associated sentiment over time. The drop down selection boxes near the top allow the user to filter based on zip code, rating, date, or even specific restaurants. The pie chart in the top left gives an overview of the general ratio of sentiment, while the horizontal bar chart shows sentiment by class. For a detailed look at reviews, they are listed at the bottom of the dashboard, along with the associated sentiment.

Conclusion and Future State

This blog covered retrieving restaurant data from Yelp using their API and the process by which we used various Google APIs to enrich the data and provide value. The following diagram outlines the future-state architecture and design for expanding this demonstration into a viable product.

 

 

The solution implemented here is a customer-demographic-centric, data-driven reporting service for site analytics. We will productize an application in Looker that restaurateurs can leverage for market and competitor analysis at a particular site. The product will provide customer review data for a particular zip code or set of zip codes as a one-time transaction; an end user will be charged for each zip code they would like to research.

Want to learn more about Pandera?