How does the product categorization ML work?
ML relies on big data. The reason we are seeing the rise of ML now, and not ten years earlier, is that modern computers can process large amounts of data very quickly. And ML does just that: it processes lots, and lots, and lots of data. It does not 'think'; it only finds connections between seemingly unrelated facts and remembers them.
An AI consists of two parts: the ML engine itself and a dataset. The ML engine requires an example dataset to learn from; once mature, it uses the actual working data to improve itself.
The first dataset is composed of randomly selected data from each category and attribute. Between 10 and 100 product examples are needed per category/attribute; the more complex the catalog structure and the deeper your categorization goes, the more examples are needed. Historically, the dataset consisted of product names and text descriptions. Modern ML also uses images and various metadata (such as language and price) in the sets to make predictions more accurate. All the data in the learning dataset must be manually categorized and attributed.
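To make the sampling step concrete, here is a minimal sketch in Python of how a learning dataset could be drawn from an already labeled catalog. It is only an illustration under stated assumptions: the function name sample_training_set, the "name" and "category" fields, and the toy catalog are all hypothetical and are not taken from any particular product.

```python
import random
from collections import defaultdict


def sample_training_set(products, per_category=50, seed=0):
    """Pick up to `per_category` random labeled products from each category.

    `products` is assumed to be a list of dicts that each carry a manually
    assigned "category" key (a hypothetical structure for illustration).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group the manually labeled products by their category.
    by_category = defaultdict(list)
    for product in products:
        by_category[product["category"]].append(product)

    # Draw a random sample from every category, capped by what is available.
    sample = []
    for category in sorted(by_category):
        items = by_category[category]
        k = min(per_category, len(items))
        sample.extend(rng.sample(items, k))
    return sample


# Hypothetical toy catalog: 100 "hats" and 200 "shoes".
catalog = [
    {"name": f"item {i}", "category": "hats" if i % 3 == 0 else "shoes"}
    for i in range(300)
]
training_set = sample_training_set(catalog, per_category=20)
```

With per_category=20, the sketch returns 20 examples for each of the two categories. In practice, the per-category count would be tuned to the depth and complexity of the catalog.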
Once the ML engine has processed this data, which may take a couple of hours for a medium-sized marketplace, it can be tried on the real-life database with the same content structure. When making predictions, ML marks each of them as either 'strong' or 'weak'. The strong ones are then manually checked for correctness; a 95% threshold of correct categorization is considered good enough for production. Categories/attributes with lower percentages, as well as 'weak' predictions, require additional ML training on new datasets. The process is repeated until almost everything can be categorized automatically (there may be a few exceptions where the ML engine is unable to perform the task; these will have to remain in manual mode).
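The review loop described above can be sketched roughly as follows. This is an illustration, not a reference implementation: the 0.8 confidence cutoff separating 'strong' from 'weak' predictions is an assumption (the text does not say how the two are told apart), and the function names, field names, and sample data are hypothetical. Only the 95% accuracy threshold comes from the text.

```python
from collections import defaultdict

STRONG_CONFIDENCE = 0.8      # assumed cutoff between 'strong' and 'weak'
PRODUCTION_ACCURACY = 0.95   # threshold mentioned in the text


def split_by_strength(predictions, cutoff=STRONG_CONFIDENCE):
    """Split predictions into 'strong' and 'weak' by model confidence."""
    strong = [p for p in predictions if p["confidence"] >= cutoff]
    weak = [p for p in predictions if p["confidence"] < cutoff]
    return strong, weak


def categories_needing_training(strong, reviewed_labels,
                                threshold=PRODUCTION_ACCURACY):
    """Return categories whose manually checked accuracy is below threshold.

    `reviewed_labels` maps a product id to the category confirmed by a
    human reviewer (a hypothetical structure for illustration).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for p in strong:
        total[p["predicted"]] += 1
        if reviewed_labels[p["id"]] == p["predicted"]:
            correct[p["predicted"]] += 1
    return sorted(c for c in total if correct[c] / total[c] < threshold)


# Hypothetical predictions and manual review results.
predictions = [
    {"id": 1, "predicted": "shoes", "confidence": 0.97},
    {"id": 2, "predicted": "shoes", "confidence": 0.91},
    {"id": 3, "predicted": "hats", "confidence": 0.88},
    {"id": 4, "predicted": "hats", "confidence": 0.55},
]
reviewed = {1: "shoes", 2: "shoes", 3: "scarves"}

strong, weak = split_by_strength(predictions)
retrain = categories_needing_training(strong, reviewed)
```

Here product 4 ends up in the weak list, and "hats" is flagged for additional training because its only strong prediction was corrected by the reviewer. Both the weak items and the flagged categories would feed the next training round.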
The initial dataset can be created in-house, or a specialized external service may be employed. For new businesses, there are ready-made datasets available, both commercial and open-source. The latter are useful for evaluating ML for a particular business; note, though, that commercial use of free datasets is often prohibited or restricted by the license.