Create custom NER in Spacy
Named Entity Recognition (NER)
NER is also known as entity identification or entity extraction. It is a process of identifying predefined entities present in a text such as person name, organisation, location, etc. It is a statistical model which is trained on a labelled data set and then used for extracting information from a given set of data.
Sometimes we want to extract the information based on our domain or industry. For example : in medical domain, we want to extract disease or symptom or medication etc, in that case we need to create our own custom NER.
Spacy
It is an open source software library for advanced Natural Language Programming (NLP).
The Spacy NER environment uses a word embedding strategy using a sub-word features and Bloom embed and 1D Convolutional Neural Network (CNN).
- Bloom Embedding : It is similar to word embedding and more space optimised representation.It gives each word a unique representation for each distinct context it is in.
- 1D CNN : It is applied over the input text to classify a sentence/ word into a set of predetermined categories
How Spacy works
- It tokenises the text, i.e. broken-up input sentence into words or word embedding
- Words are then broken-up into features and then aggregated to a representative number
- This number is then fed to fully connected neural structure, which makes a classification based on the weight assigned to each features within the text.
How to train Spacy
- Training data : Annotated data contain both text and their labels
- Text : Input text the model should predict a label for.
- Label : The label the model should predict.
- Gradient : Calculate how to change the weights to improve the predictions. (Compare the prediction label with the actual label and adjusts its weights so that the correct action will score higher next time.)
- Finally save the model
Spacy Training Data Format
Spacy needs a particular training/annotated data format :
Code walkthrough
Load the model, or create an empty model
We can create an empty model and train it with our annotated dataset or we can use existing spacy model and re-train with our annotated data.
- We can create an empty model using spacy.black(“en”) or we can load the existing spacy model using spacy.load(“model_name”)
- We can check the list of pipeline component names by using nlp.pipe_names() .
- If we don’t have the entity recogniser in the pipeline, we will need to create the ner pipeline component using nlp.create_pipe(“ner”) and add that in our model pipeline by using nlp.add_pipe method.
Adding Labels or entities
In order to train the model with our annotated data, we need to add the labels (entities) we want to extract from our text.
- We can add the new entity from our annotated data to the entity recogniser using ner.add_label().
- As we are only focusing on entity extraction, we will disable all other pipeline components to train our model for ner only using nlp.disable_pipes().
Training and updating the model
- We will train our model for a number of iterations so that the model can learn from it effectively.
- At each iteration, the training data is shuffled to ensure the model doesn’t make any generalisations based on the order of examples.
- We will update the model for each iteration using nlp.update().
Evaluate the model
Comments
Post a Comment