A Data-Centric MLOps suite for Named Entity Recognition

How CleanML helps

Project Managers

Create multiple projects and track their progress independently


Easily experiment with a new model/algorithm by training it in CleanML and comparing its performance with the other models in your project.

  • Train multiple algorithms and compare them
  • Track progress of data annotation
  • Train custom word-embeddings for domain-specific applications
  • Analyze over-fitting and under-fitting per entity based on training output
  • Include the data from production and analyze the accuracy of the model in production
  • Compare the model in production with a freshly developed and trained model
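
As a rough illustration of the per-entity over-fitting and under-fitting analysis mentioned above, the sketch below labels each entity type from its training and test F1 scores. It is not CleanML's API: the `EntityScores` type, the `diagnose_fit` function, the thresholds and the sample scores are all assumptions made up for this example.

```python
from typing import Dict, NamedTuple


class EntityScores(NamedTuple):
    """Hypothetical per-entity F1 scores taken from a training run."""
    train_f1: float
    test_f1: float


def diagnose_fit(
    scores: Dict[str, EntityScores],
    gap: float = 0.15,    # assumed threshold: train/test gap that signals over-fitting
    floor: float = 0.60,  # assumed threshold: training F1 below this signals under-fitting
) -> Dict[str, str]:
    """Label each entity type as 'under-fitting', 'over-fitting' or 'ok'."""
    verdicts = {}
    for entity, s in scores.items():
        if s.train_f1 < floor:
            verdicts[entity] = "under-fitting"
        elif s.train_f1 - s.test_f1 > gap:
            verdicts[entity] = "over-fitting"
        else:
            verdicts[entity] = "ok"
    return verdicts


# Made-up scores, for illustration only.
scores = {
    "PERSON": EntityScores(train_f1=0.95, test_f1=0.93),
    "DRUG": EntityScores(train_f1=0.97, test_f1=0.70),
    "DOSAGE": EntityScores(train_f1=0.48, test_f1=0.45),
}
verdicts = diagnose_fit(scores)
```

The thresholds are arbitrary starting points. In practice they would be tuned per project and per entity type.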

Data Scientists

Gain insights into your training and test data and the distribution of annotated entities, then decide how to curate more data for better accuracy


Analyze your data annotations, identify missing, incorrect and multiple classifications, and improve the quality of your dataset

  • Update training and test data based on training results, and experiment with selected records for training
  • Edit data content, upload custom data and export annotated data in multiple data formats
  • Improve your dataset using insights from training: drill down into record-level and entity-level accuracy for each trained algorithm, then modify and curate your data accordingly
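
To make the idea of entity distribution and multi-classification checks concrete, here is a minimal sketch using only the Python standard library. It is not how CleanML implements these checks. The record layout (lists of `(text, label)` pairs), the function names and the sample data are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Made-up annotated records: each record is a list of (text, label) pairs.
records = [
    [("Aspirin", "DRUG"), ("100mg", "DOSAGE")],
    [("aspirin", "DRUG"), ("daily", "FREQUENCY")],
    [("Aspirin", "BRAND")],
]


def label_distribution(records):
    """Count how many annotations carry each entity label."""
    return Counter(label for record in records for _, label in record)


def conflicting_terms(records):
    """Return terms (case-insensitive) that were annotated with more than one label."""
    labels_by_term = defaultdict(set)
    for record in records:
        for text, label in record:
            labels_by_term[text.lower()].add(label)
    return {
        term: sorted(labels)
        for term, labels in labels_by_term.items()
        if len(labels) > 1
    }


distribution = label_distribution(records)
conflicts = conflicting_terms(records)
```

A term that appears under several labels is not always an error, since some words are genuinely ambiguous, so a list like `conflicts` is best treated as a review queue rather than a list of mistakes.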

Annotators

Speed up and improve the annotation process with CleanML's helpful features, all from a single window


Features for annotators include

  • Annotation suggestions based on previous annotations
  • Previous classifications of a word across different records
  • Annotation suggestions based on training runs, even when training was done on partially annotated data
  • Ability to configure third-party custom dictionaries (e.g. a medical terminology dictionary) to help with similar words
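
One simple way to picture suggestions based on previous annotations is a lookup of the label most often given to each word so far. The sketch below shows that idea only. The `SuggestionIndex` class is hypothetical and is not CleanML's implementation, which may use a very different approach.

```python
from collections import Counter, defaultdict


class SuggestionIndex:
    """Hypothetical index that suggests the most frequent earlier label per token."""

    def __init__(self):
        # token (lower-cased) -> Counter of labels seen for that token
        self._counts = defaultdict(Counter)

    def add(self, text, label):
        """Record one confirmed annotation."""
        self._counts[text.lower()][label] += 1

    def suggest(self, tokens):
        """Return (token, suggested_label) pairs; the label is None for unseen tokens."""
        suggestions = []
        for token in tokens:
            counts = self._counts.get(token.lower())
            label = counts.most_common(1)[0][0] if counts else None
            suggestions.append((token, label))
        return suggestions


index = SuggestionIndex()
index.add("Aspirin", "DRUG")
index.add("aspirin", "DRUG")
index.add("Aspirin", "BRAND")
index.add("100mg", "DOSAGE")

suggestions = index.suggest(["Take", "aspirin", "100mg"])
```

A custom dictionary could be loaded into the same kind of index by calling `add` once per dictionary entry.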

Developers

Experiment with multiple algorithms from different libraries, regardless of whether they run on GPU or CPU, on-prem or in the cloud

Algorithms are easy to scale and replicate via configuration files, and they run in Docker containers

  • Connect your model's Git repo and train from it, including from a development branch, which makes experimenting easy during the dev cycle
  • Train a custom word embedding for domain-specific data
  • Cache the outputs of training and word embeddings, saving both time and cost
  • Train wherever is convenient: locally, on-prem, on CPU or GPU, or on a remote Docker system
  • Write code independent of the data format. CleanML converts the training data to the supported format you specify before the data is sent for training
  • Compare two training runs for both code and data changes to diagnose a drop in accuracy faster

Features

Data-Centric dashboard

Identify and fix data & data-classification issues, and perform drill-down analytics on the dataset. Gain insights about data classified across multiple categories/classes, missed classifications and anomalies in classifications.

Advanced Workbench

The Workbench provides text annotation, entity renaming across records, in-place content editing, tag suggestions, auto-labeling suggestions, previous classifications and the ability to add a custom dictionary.

In-built data versioning

CleanML versions your data by default, which helps make training reproducible. It can also compare a training run with a later version of the same model, with a model that uses a different algorithm, and even with a model deployed in production.
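
To show why data versioning supports reproducibility, the sketch below derives a deterministic fingerprint from a dataset's content, so that a training run can record exactly which data it saw. This is an assumption-laden illustration, not CleanML's versioning scheme: the `dataset_version` function, the record layout and the 12-character ID length are all made up for this example.

```python
import hashlib
import json


def dataset_version(records):
    """Return a short, deterministic fingerprint for JSON-serializable records.

    Key order inside a record does not matter, but record order does.
    """
    digest = hashlib.sha256()
    for record in records:
        line = json.dumps(record, sort_keys=True, ensure_ascii=False)
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()[:12]


# Made-up records: text plus [start, end, label] character spans.
v1 = dataset_version([{"text": "Take aspirin", "labels": [[5, 12, "DRUG"]]}])
v2 = dataset_version([{"labels": [[5, 12, "DRUG"]], "text": "Take aspirin"}])
v3 = dataset_version([{"text": "Take aspirin", "labels": [[5, 12, "BRAND"]]}])
```

Here `v1` and `v2` are equal because only the key order differs, while `v3` differs because a label changed.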

Train, Test, Compare, Repeat

Train models with different algorithms on the same dataset and compare them. CleanML versions every training run and lets you compare versions of both training and data. Comparing models and data at the record level makes it quicker to see where two runs differ.
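
A record-level comparison between two training runs can be pictured as follows. This is a minimal sketch, not CleanML's comparison logic: the `compare_runs` function, the record IDs and the tag sequences are hypothetical, and a record counts as correct only when its whole predicted tag sequence matches the gold annotation.

```python
from typing import Dict, List

Tags = List[str]


def compare_runs(
    gold: Dict[str, Tags],
    run_a: Dict[str, Tags],
    run_b: Dict[str, Tags],
) -> Dict[str, List[str]]:
    """List record IDs that run_b fixed or regressed relative to run_a."""
    fixed, regressed = [], []
    for record_id, expected in gold.items():
        a_ok = run_a.get(record_id) == expected
        b_ok = run_b.get(record_id) == expected
        if a_ok and not b_ok:
            regressed.append(record_id)
        elif b_ok and not a_ok:
            fixed.append(record_id)
    return {"fixed": sorted(fixed), "regressed": sorted(regressed)}


# Made-up gold annotations and predictions from two runs.
gold = {"r1": ["B-DRUG", "O"], "r2": ["O", "B-DOSAGE"], "r3": ["O", "O"]}
run_a = {"r1": ["B-DRUG", "O"], "r2": ["O", "O"], "r3": ["O", "O"]}
run_b = {"r1": ["O", "O"], "r2": ["O", "B-DOSAGE"], "r3": ["O", "O"]}

diff = compare_runs(gold, run_a, run_b)
```

The regressed records are usually the first place to look when accuracy drops between two runs.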

Auto-labeling suggestions

Get labeling suggestions from trained algorithms to assist annotators and speed up the annotation of new data.

Support for multiple data formats

Import data in CoNLL-2003, IOB (IOB1/2, BILOU, IOBES), JSONL and txt formats, through the UI, the API, the command line or Singer taps. Export annotated data to multiple formats via the command line.
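
For readers unfamiliar with how these tagging schemes relate, the sketch below converts a tag sequence from IOB2 to BILOU. It illustrates the schemes themselves rather than CleanML's own converter. The function name `iob2_to_bilou` is an assumption, and the code expects well-formed IOB2 input.

```python
def iob2_to_bilou(tags):
    """Convert a well-formed IOB2 tag sequence to BILOU.

    B = begin, I = inside, L = last, O = outside, U = unit (single-token entity).
    """
    bilou = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bilou.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            bilou.append(f"B-{label}" if continues else f"U-{label}")
        else:  # prefix == "I"
            bilou.append(f"I-{label}" if continues else f"L-{label}")
    return bilou


converted = iob2_to_bilou(
    ["B-PER", "I-PER", "O", "B-LOC", "B-ORG", "I-ORG", "I-ORG"]
)
```

Single-token entities become `U-` tags and the final token of a multi-token entity becomes an `L-` tag. Everything else is unchanged.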

Product Tour

Community & Support

For questions, best practices and brainstorming, join us on our Discord.