Machine learning in e-commerce product categorization

Product categorization (also called product classification or product taxonomy) is about organizing how products are placed in an online shop. It may sound simple: put sneakers in the 'footwear' category, shirts in 'clothing', and so on. In reality, it is far more complicated, and product taxonomy is, in fact, a field of study in its own right within Natural Language Processing. In this article, we will discuss the importance of product categorization, the challenges it brings to e-commerce businesses and how machine learning (ML) can help such businesses.

What is product categorization?

Each product in a shop falls into a certain category or department. Categories usually have a nesting (parent category > sub-category 1 > sub-category 2 and so on). Here are just a few examples from the Google Product Taxonomy:

  • Electronics > Audio > Audio Accessories > Speaker Accessories > Speaker Bags, Covers & Cases

  • Home & Garden > Kitchen & Dining > Barware > Corkscrews

  • Sporting Goods > Athletics > Field Hockey & Lacrosse > Field Hockey Sticks
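A nested category of this kind can be handled as a simple path. The following is a minimal Python sketch, assuming the ' > ' separator used in the examples above; the function names `parse_category_path` and `ancestors` are illustrative and not part of any Google tool.

```python
def parse_category_path(path: str) -> list[str]:
    """Split a 'Parent > Child > Grandchild' string into its levels."""
    return [level.strip() for level in path.split(">") if level.strip()]


def ancestors(path: str) -> list[str]:
    """Return every parent category of a path, from the root down."""
    levels = parse_category_path(path)
    return [" > ".join(levels[:depth]) for depth in range(1, len(levels))]


levels = parse_category_path("Home & Garden > Kitchen & Dining > Barware > Corkscrews")
print(levels[-1])   # leaf category: Corkscrews
print(len(levels))  # nesting depth: 4
```

Keeping the full path (rather than only the leaf name) makes it possible to tell apart leaves that share a name under different parents.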

Additionally, many products have attributes: color, size, material, etc. These are not unique to a specific category and can be shared across different ones.

Clear and logical categories and attributes help improve user experience and achieve better results with external search engines. Visitors can quickly find the goods they need by navigating the catalog or using the on-site search engine. Google robots will index the pages correctly, allowing prospective customers to find the store that sells the product they need.

There are two distinct tasks in product categorization. First, the store needs to create, support and expand the catalog structure; ML is of no help here, at least not yet. The other (rather tedious and time-consuming) task is tagging goods with the correct categories and attributes, which can be automated.

How to evaluate the need for ML categorization?

Multiple factors should be considered when deciding whether ML can help an e-commerce business. For example:

  • inventory size

  • size and complexity of the category structure

  • whether this structure is static or dynamic

  • number of products that are added daily

  • number and/or percentage of external sellers on the platform

  • whether the same item in the catalog can belong to different categories (e.g. 'sneakers' may appear in both 'casual footwear' and 'sporting goods'), etc.

Amazon has a staggering inventory of more than 350 million products, most of which are sold by external providers. Google Taxonomy sports 5582 categories at the moment and eBay has almost 20 thousand, including 'Weird stuff' with sub-categories of 'Slightly Unusual', 'Really Weird' and 'Totally Bizarre'. 80% of Amazon sellers also use other marketplaces to sell their goods. eBay updates its taxonomy structure twice a year.

The amount of data created and processed by different parties requires a great deal of manual work. Sellers work with multiple marketplaces, each with its own product taxonomy. These factors increase the risk of human error. It may be useful to deploy an ML algorithm to ease the burden and raise the quality of categorization. For example, if ML can raise the accuracy of product tagging by 1% for Amazon, it will result in an additional 3.5 million correct classifications.
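The arithmetic behind that figure can be checked in a couple of lines. This sketch assumes the 350 million inventory figure quoted above and reads '1%' as one percentage point of the whole catalog; the function name is illustrative.

```python
def extra_correct_items(catalog_size: int, gain_percent: int) -> int:
    """Items that become correctly tagged when accuracy rises by gain_percent points."""
    return catalog_size * gain_percent // 100


print(f"{extra_correct_items(350_000_000, 1):,}")  # 3,500,000
```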

How does ML product categorization work?

ML relies on big data, and the reason we are seeing the rise of ML now and not ten years earlier lies in the ability of modern computers to process large volumes of data very quickly. And ML does just that: it processes lots, and lots, and lots of data. It does not 'think'; it only finds connections between seemingly unrelated facts and remembers them.

An AI consists of two parts: the ML engine itself and a dataset. The ML engine requires an example dataset to learn from; once mature, it uses the actual working data to improve itself.
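To make the 'engine plus dataset' split concrete, here is a toy sketch in Python. It is not the method of any particular marketplace: it assumes a multinomial Naive Bayes classifier over the words of a product title, and the class name `TinyNaiveBayes` and the four training examples are invented for illustration. Production systems use far larger datasets and more capable models.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over lowercase word tokens, with add-one smoothing."""

    def __init__(self) -> None:
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of examples
        self.vocabulary = set()

    def fit(self, examples):
        """examples: iterable of (product_title, category) pairs labeled by hand."""
        for title, category in examples:
            words = title.lower().split()
            self.word_counts[category].update(words)
            self.doc_counts[category] += 1
            self.vocabulary.update(words)
        return self

    def predict_proba(self, title):
        """Return {category: probability} for one product title."""
        words = title.lower().split()
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # category prior
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            log_scores[category] = score
        # Turn log scores into probabilities; subtracting the max avoids underflow.
        top = max(log_scores.values())
        exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {c: v / norm for c, v in exp_scores.items()}


# The dataset: manually labeled examples (invented for this sketch).
training_set = [
    ("leather running sneakers", "Footwear"),
    ("canvas sneakers white", "Footwear"),
    ("cotton t-shirt blue", "Clothing"),
    ("slim fit cotton shirt", "Clothing"),
]

# The engine: learns word statistics from the dataset, then predicts.
model = TinyNaiveBayes().fit(training_set)
probabilities = model.predict_proba("white leather sneakers")
print(max(probabilities, key=probabilities.get))  # Footwear
```

The engine here can be swapped for another algorithm without touching `training_set`, which is the reason the labeled data tends to be the more valuable of the two parts.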

The first dataset is composed of randomly selected data from each category and attribute. 10 to 100 product examples are needed per category/attribute; the more complex the catalog structure and the deeper your categorization goes, the more examples are needed. Historically, the dataset consisted of product names and text descriptions. Modern ML also uses images and various metadata (like different languages and prices) in the sets to make predictions more accurate. All the data in the learning dataset must be manually categorized and attributed.
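Drawing a fixed number of random examples from every category can be sketched as follows. This is a minimal illustration, assuming the catalog is available as (title, category) pairs that have already been labeled by hand; the function name `sample_training_set` and the sample catalog are hypothetical.

```python
import random
from collections import defaultdict


def sample_training_set(products, per_category=10, seed=0):
    """Pick up to `per_category` random products from every category.

    products: iterable of (title, category) pairs that were labeled manually.
    """
    by_category = defaultdict(list)
    for title, category in products:
        by_category[category].append(title)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for category, titles in sorted(by_category.items()):
        chosen = rng.sample(titles, min(per_category, len(titles)))
        sample.extend((title, category) for title in chosen)
    return sample


catalog = [(f"sneaker model {i}", "Footwear") for i in range(50)]
catalog += [(f"shirt model {i}", "Clothing") for i in range(5)]
training_set = sample_training_set(catalog, per_category=10)
print(len(training_set))  # 15: ten Footwear items plus all five Clothing items
```

Sampling per category, rather than from the catalog as a whole, keeps small categories from being crowded out by large ones.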

Once the ML engine has processed this data, which may take a couple of hours for a medium-sized marketplace, it can be tried on a real-life database with the same content structure. When making predictions, the engine marks each one as 'strong' or 'weak'. The strong ones are then manually checked for correctness; a 95% threshold of correct categorization is considered good enough for production. Categories/attributes with lower percentages and 'weak' predictions require additional ML training on new datasets. The process is repeated until almost everything can be categorized automatically (there may be a few exceptions where ML is unable to perform the task; these will have to remain in manual mode).
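The review loop described above can be sketched in Python. The article does not say how 'strong' and 'weak' are told apart, so this example assumes the engine returns a confidence score and uses a hypothetical cut-off of 0.80; the 95% figure is the production threshold mentioned above. All names and sample data are illustrative.

```python
from collections import defaultdict

STRONG_CONFIDENCE = 0.80    # assumed cut-off between 'strong' and 'weak'
PRODUCTION_ACCURACY = 0.95  # the 95% threshold mentioned in the text


def triage(predictions, cutoff=STRONG_CONFIDENCE):
    """Split (item, category, confidence) tuples into strong and weak lists."""
    strong = [p for p in predictions if p[2] >= cutoff]
    weak = [p for p in predictions if p[2] < cutoff]
    return strong, weak


def categories_ready(strong, reviewed_labels, threshold=PRODUCTION_ACCURACY):
    """Compare strong predictions with the manual review, per category.

    reviewed_labels maps item -> category assigned by the human reviewer.
    Returns {category: bool}, True when accuracy meets the threshold.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, category, _confidence in strong:
        total[category] += 1
        if reviewed_labels[item] == category:
            correct[category] += 1
    return {c: correct[c] / total[c] >= threshold for c in total}


predictions = [
    ("sku-1", "Footwear", 0.97),
    ("sku-2", "Footwear", 0.91),
    ("sku-3", "Clothing", 0.88),
    ("sku-4", "Clothing", 0.55),
]
reviewed = {"sku-1": "Footwear", "sku-2": "Footwear", "sku-3": "Footwear"}

strong, weak = triage(predictions)
print(categories_ready(strong, reviewed))  # {'Footwear': True, 'Clothing': False}
print([item for item, _, _ in weak])       # ['sku-4']
```

In this sample, 'Clothing' fails the check and 'sku-4' is a weak prediction, so both would go back for additional training data rather than into production.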

The initial dataset can be created in-house or a specialized external service may be employed. For new businesses, there are ready-made datasets available, both commercial and open-source. The latter are useful for evaluating ML for a particular business; note though that the commercial use of free datasets is usually prohibited by the license.

Will ML replace humans in e-commerce?

No. ML is there to improve categorization accuracy and generate more value through better UX and SEO, not to cut costs.

An ML algorithm is essentially useless without verified high-quality data. One can find dozens of open-source ML programs, but most of the free datasets are limited to non-commercial use. That is because a good dataset requires a lot of work; the work may seem tedious, basic, and simple, but there is a lot of it. An e-commerce business may want to change the algorithm weekly - no problem, this can be done - but it cannot change the datasets, because the datasets become increasingly valuable as they grow over time. The results of the ML work must also be checked manually at regular intervals; that is how the system continues to learn. Left uncontrolled, the machine may go astray pretty fast, as the infamous Microsoft chatbot Tay clearly showed.

As with many other IT gimmicks, ML can make lives better and free up valuable time so humans can do the stuff they do best, such as inventing things, dreaming up new business strategies, making money or creating better categorization structures and product descriptions for stores and marketplaces.

About the author of this article

Dmitrii Reznikov
Copywriter and researcher

Dmitrii has almost 30 years of experience as a journalist, editor and publisher, specializing in IT and technology topics. He used to run multiple paper magazines, then switched to digital media. Dmitrii joined VirtoCommerce as a blog writer and researcher.