Word Embeddings
In 2003, Bengio et al. proposed an ingenious idea: learn representations of words that capture semantic meaning. In 2013, a team of researchers at Google made this practical at scale with the Word2Vec algorithm. Word2Vec and other recent approaches (such as GloVe from Stanford) learn from massive corpora of text, e.g. millions of Google News articles or all of Wikipedia. The representations they learn after chomping through all of that text are numeric vectors of, say, 300 dimensions. Simply put, this means that the representation for a single word is a long list of numbers. The beauty of it is that the mathematical relationship between those vectors manages to capture the semantic relationship between the words. The classic example given is king - man + woman ≈ queen: subtract the vector for "man" from the vector for "king," add the vector for "woman," and the nearest word vector to the result is the one for "queen."
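The vector arithmetic behind that example can be sketched in a few lines of Python. This is only an illustration: the three-dimensional vectors below are made up by hand (real Word2Vec or GloVe vectors have hundreds of dimensions and are learned from text), and the names EMBEDDINGS, cosine, and analogy are hypothetical helpers, not part of any library.

```python
import math

# Toy, hand-made vectors (assumed values, NOT real Word2Vec output).
# The dimensions loosely stand for: royalty, maleness, femaleness.
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, embeddings):
    """Return the word whose vector is closest to a - b + c.

    The three input words are excluded from the candidates, which is
    the usual convention when evaluating analogies.
    """
    target = [
        x - y + z
        for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


result = analogy("king", "man", "woman", EMBEDDINGS)
print(result)
```

With these toy values the nearest remaining word to king - man + woman is "queen." With real pre-trained vectors the same idea applies, although the result is the nearest neighbor rather than an exact match.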
What’s really neat is that once those representations have been learned from a massive dataset, they can then be used in other tasks like classifying content into categories.
Learn from the best, transfer to the rest.
The ML technique known as transfer learning is about knowledge gained through training on one task being reused in solving another task. Using pre-trained word embeddings is an example of that. We can take the word embeddings trained by Google or Stanford and transfer them for use in our own tasks.
One such task is similarity-based content recommendations. If we have numeric representations of our content that capture semantics, then we automatically have a measure of similarity between pieces of content. Even if two pieces of content talk about exactly the same topic using different words, they will still be identified as similar due to the nature of these representations. This is not true of traditional approaches to representing words as numbers in machine learning, because those numbers are based on counts of particular words in documents, so two documents with no words in common look unrelated.
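As a rough sketch of that contrast, the snippet below compares two short pieces of content that share a topic but no words. It assumes one simple way of representing a document, namely averaging its word vectors, and uses made-up two-dimensional vectors; WORD_VECTORS, embed, and shared_words are hypothetical names, and a real system would load pre-trained embeddings instead.

```python
import math

# Hypothetical 2-d word vectors for illustration only.
WORD_VECTORS = {
    "car": [0.9, 0.1],
    "automobile": [0.88, 0.12],
    "repair": [0.7, 0.3],
    "fix": [0.75, 0.3],
    "cake": [0.05, 0.95],
    "recipe": [0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed(text):
    """Represent a text as the average of its known word vectors."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        raise ValueError("no known words in text")
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def shared_words(a, b):
    """Count-based view: how many distinct words do the texts share?"""
    return len(set(a.lower().split()) & set(b.lower().split()))


doc_a = "car repair"
doc_b = "automobile fix"
doc_c = "cake recipe"

sim_ab = cosine(embed(doc_a), embed(doc_b))  # same topic, different words
sim_ac = cosine(embed(doc_a), embed(doc_c))  # different topic

print(shared_words(doc_a, doc_b), round(sim_ab, 3), round(sim_ac, 3))
```

A word-count approach sees nothing in common between doc_a and doc_b, while the embedding-based similarity ranks doc_b far closer to doc_a than doc_c is.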
Learning from very few examples
You may have already heard the phrase “data is the new oil.” Someone took it a step further at the 2017 O’Reilly AI conference by proposing that “labeled data is the new ‘new oil.’ ” For classification tasks, few-shot learning is an approach that stands in contrast to standard deep learning approaches, which require enormous quantities of labeled data.
The key to being able to learn from very few examples is having great representations of your data. For this reason, transfer learning and few-shot learning often go hand in hand. You transfer the knowledge from some previous task and use it to create representations of your data. Labeling just one or two examples of each category then allows the others to be labeled automatically, because each unlabeled item can take the label of the labeled example it most resembles. This is our approach to automated content tagging.
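One simple way to picture this is nearest-neighbor label propagation: give each unlabeled item the tag of the most similar labeled example. The sketch below is an assumption about how such a scheme could look, not a description of the actual tagging system; the vectors, the tag names, and the propagate_labels function are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def propagate_labels(labeled, unlabeled):
    """Tag each unlabeled item with the tag of its nearest labeled example.

    labeled:   list of (vector, tag) pairs, e.g. one or two per tag
    unlabeled: dict mapping item name -> vector
    """
    tags = {}
    for name, vector in unlabeled.items():
        _, best_tag = max(labeled, key=lambda example: cosine(vector, example[0]))
        tags[name] = best_tag
    return tags


# One hand-labeled example per tag (made-up 2-d content vectors).
labeled = [
    ([0.9, 0.1], "automotive"),
    ([0.1, 0.9], "cooking"),
]

# Content that nobody has labeled yet.
unlabeled = {
    "article_1": [0.8, 0.3],
    "article_2": [0.2, 0.7],
    "article_3": [0.6, 0.4],
}

tags = propagate_labels(labeled, unlabeled)
print(tags)
```

How well this works depends almost entirely on the quality of the vectors, which is why good transferred representations matter so much when labels are scarce.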
‘Human in the Loop’ ML
One solution to a lack of labeled training data is to get humans to label your data. This is called human-in-the-loop (HitL) ML, a term that may well have been coined by the founder of a company called CrowdFlower, which specializes in a crowdsourced approach to this technique. They’ll take your unlabeled data and get humans to label all of it for you. Another company, Mighty AI, is focused specifically on training data for autonomous vehicles. Anyone with an iPhone can earn a few cents a go by labeling pedestrians, lamp posts, parked cars, etc. in images.
Humans can be made part of the loop in other, less straightforward ways than labeling entire training sets to feed into ML algorithms. Any application or service that explicitly asks users for feedback in the form of ratings (Netflix movie ratings, for example) can be thought of as employing HitL. The company Stitch Fix, which provides a clothing service where they send customers a regular “fix” of clothing items selected by a stylist, gets a lot of upfront data from users by asking them to rate styles through a series of photos. The more data they can get from their users up front, the less they have to infer through purchasing behavior. This is important to the success of their service because, without HitL, initial “fixes” would stand a poor chance of being purchased. Companies that use HitL understand that the UI they present to the human in their loop is of vital importance.
Recognizing where UX and Engineering play their role
In the current wave of excitement over ML, a lot of advice is being offered to companies on how to incorporate these techniques to improve their business. Depending on who’s offering it, the advice differs greatly. Those in the business of training and recruiting data scientists will tell you that you need lots of data scientists, whereas those in the business of selling “Machine Learning as a Service” (MLaaS) solutions will say you don’t need any data scientists at all. The reality, of course, is somewhere in between.
It is definitely important to have people who know how to frame your business’s problems as data science or machine learning problems and make sure the data needed to solve them is available. Simply getting your engineers to feed masses of data into Amazon’s or Google’s MLaaS is not going to achieve very much. On the other hand, data scientists alone probably can’t do everything. If you’re building a product, just one or two data scientists working with engineers and UX professionals will be far more effective than 10 data scientists. The right mix depends on what you’re trying to accomplish.
At Acquia, we’re using ML to enhance our SaaS offerings and have built a team focused specifically on this area. It includes data scientists, data engineers, front-end engineers, and back-end engineers. The team also works very closely with our UX team. Where we’re using HitL, UX is absolutely vital to ensuring we get the data we need to support our learning algorithms to make them as accurate as possible. Other efforts don’t entail a HitL aspect but require skilled engineers to ensure that services delivering ML predictions are performant and scalable.
We don’t have anyone on the team with a PhD in artificial intelligence or machine learning. Perhaps one day we will. In the meantime, we have smart people who are familiar with the types of solutions that machine learning research has developed (many of which are available in open source libraries) and the types of problems to which they are best applied. This expertise, coupled with strong engineering and UX skills, is what we need to execute our ML strategy. If we didn’t have a well-thought-out strategy on how to play to our strengths, make use of publicly available datasets and open source libraries, and incorporate the other necessary technical functions in our efforts, an AI PhD would struggle to add value.