Create custom NER in spaCy

Named Entity Recognition (NER)

NER, also known as entity identification or entity extraction, is the task of identifying predefined entities in a text, such as person names, organisations and locations. It is usually done with a statistical model that is trained on a labelled dataset and then used to extract information from new text.
Sometimes we want to extract information that is specific to our domain or industry. For example, in the medical domain we may want to extract diseases, symptoms or medications. In that case we need to create our own custom NER model.

spaCy

spaCy is an open-source software library for advanced Natural Language Processing (NLP).
spaCy's NER model uses a word embedding strategy based on sub-word features and Bloom embeddings, followed by a 1D Convolutional Neural Network (CNN).
Bloom embedding: similar to a word embedding, but a more space-efficient representation. Features are hashed into a small, fixed-size table, so the model does not need a separate vector for every word in the vocabulary.
1D CNN: applied over the input text so that each word's representation takes its neighbouring words into account. This context-sensitive representation is what is used to classify a word into one of a set of predetermined categories.
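To make the idea of a Bloom embedding more concrete, here is a minimal pure-Python sketch. It is not spaCy's implementation: the function name bloom_embed, the toy table and the choice of MD5 as the hash function are all assumptions made for illustration. The point is only that a word is mapped to a few rows of a small shared table, and those rows are summed.

```python
import hashlib


def bloom_embed(word, table, num_hashes=4):
    """Sum the table rows selected by several different hashes of `word`.

    Hypothetical helper for illustration only, not a spaCy function.
    """
    width = len(table[0])
    vector = [0.0] * width
    for seed in range(num_hashes):
        # A different seed gives a different hash of the same word.
        digest = hashlib.md5(f"{seed}:{word}".encode("utf-8")).hexdigest()
        row = int(digest, 16) % len(table)
        for i in range(width):
            vector[i] += table[row][i]
    return vector


# Toy table with 8 rows and 3 dimensions. In a real model the values are learned.
NUM_ROWS, WIDTH = 8, 3
table = [[(r * WIDTH + c) / 10.0 for c in range(WIDTH)] for r in range(NUM_ROWS)]

vec_a = bloom_embed("aspirin", table)
vec_b = bloom_embed("aspirin", table)
print(vec_a)
```

Because every word shares the same small table, the memory cost stays fixed no matter how large the vocabulary grows. Two words can collide on one row, but they are unlikely to collide on all of them.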

How spaCy works

It tokenises the text, i.e. breaks the input sentence up into words.
Each word is then broken up into features, which are embedded and aggregated into a single numeric representation (a vector).
This representation is then fed to a fully connected neural network layer, which makes a classification based on the weights assigned to each feature in the text.
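The first two steps above can be sketched in plain Python. This is only an illustration, loosely modelled on the kind of sub-word features spaCy works with (normalised form, prefix, suffix and word shape). The helper names, the whitespace tokeniser and the prefix and suffix lengths are assumptions, not spaCy's actual code.

```python
def word_shape(word):
    """Map each character to X (upper), x (lower), d (digit) or itself."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


def token_features(word):
    """Break one token into a few simple sub-word features."""
    return {
        "norm": word.lower(),
        "prefix": word[:1],
        "suffix": word[-3:],
        "shape": word_shape(word),
    }


# Step 1: tokenise (a naive whitespace split stands in for a real tokeniser).
tokens = "Apple opened in 1976".split()
# Step 2: break each token into features.
features = [token_features(token) for token in tokens]
print(features[0])
```

In a real model each of these features would be embedded and the embeddings combined into one vector per token before classification.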

How to train spaCy

Training data: annotated data containing both texts and their labels.
Text: the input text the model should predict a label for.
Label: the label the model should predict.
Gradient: how to change the weights to improve the predictions. The model compares its predicted label with the actual label and adjusts its weights so that the correct action scores higher next time.
Finally, save the trained model (for example with nlp.to_disk()).
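The "compare the prediction with the actual label and adjust the weights" step can be illustrated with a very small perceptron-style update. This is a deliberately simplified stand-in and not spaCy's neural network or optimiser. The labels, feature strings and function names are hypothetical.

```python
from collections import defaultdict


def predict(weights, features, labels):
    """Score every label by summing the weights of the active features."""
    scores = {label: sum(weights[(f, label)] for f in features) for label in labels}
    return max(labels, key=lambda label: scores[label])


def update(weights, features, gold, labels, lr=1.0):
    """If the guess is wrong, reward the gold label and penalise the guess."""
    guess = predict(weights, features, labels)
    if guess != gold:
        for f in features:
            weights[(f, gold)] += lr
            weights[(f, guess)] -= lr
    return guess


labels = ["O", "DISEASE"]
weights = defaultdict(float)  # all weights start at zero
features = ["norm=diabetes", "suffix=tes"]

first_guess = update(weights, features, "DISEASE", labels)   # wrong, so weights change
second_guess = predict(weights, features, labels)            # now scores DISEASE higher
print(first_guess, second_guess)
```

After a single update the correct label already scores higher for the same features, which is the behaviour the training loop relies on.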

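spaCy Training Data Format

spaCy needs the training (annotated) data in a particular format: a list of (text, annotations) tuples, where each entity is given as a (start, end, label) triple of character offsets into the text.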
# start and end are character offsets into the text (end is exclusive); the tags are placeholder labels
train_data = [
    (
        "Free Text 1",
        {"entities": [(start, end, "TAG 1"), (start, end, "TAG 2"), (start, end, "TAG 3")]},
    ),
    (
        "Free Text 2",
        {"entities": [(start, end, "TAG 1"), (start, end, "TAG 2")]},
    ),
    (
        "Free Text 3",
        {"entities": [(start, end, "TAG 1"), (start, end, "TAG 2"),
                      (start, end, "TAG 3"), (start, end, "TAG 4")]},
    ),
]
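Counting character offsets by hand is error-prone, so it can help to compute them. The sketch below builds one training example in this format from (substring, label) pairs. The helper make_example, the sample sentence and the labels MEDICATION and SYMPTOM are hypothetical and only meant to show what a filled-in entry looks like.

```python
def make_example(text, spans):
    """Build a (text, {"entities": [...]}) tuple from (substring, label) pairs.

    Uses the first occurrence of each substring, which is enough for a sketch.
    """
    entities = []
    for substring, label in spans:
        start = text.find(substring)
        if start == -1:
            raise ValueError(f"{substring!r} not found in {text!r}")
        entities.append((start, start + len(substring), label))
    return (text, {"entities": entities})


train_data = [
    make_example(
        "Aspirin can relieve a mild headache.",
        [("Aspirin", "MEDICATION"), ("headache", "SYMPTOM")],
    ),
]

text, annotations = train_data[0]
for start, end, label in annotations["entities"]:
    # Slicing the text with the offsets should give back the entity string.
    print(label, repr(text[start:end]))
```

A quick check like the loop at the end, slicing the text with each pair of offsets, catches most annotation mistakes before training starts.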

Code walkthrough

Load the model, or create an empty model

We can create an empty model and train it with our annotated dataset, or we can load an existing spaCy model and re-train it with our annotated data.
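import spacy

# `model` is the name or path of an existing spaCy model, or None to start from scratch
if model is not None:
    nlp = spacy.load(model)  # load existing spaCy model
    print("Loaded model '%s'" % model)
else:
    nlp = spacy.blank("en")  # create blank Language class
    print("Created blank 'en' model")

# spaCy v2 API: create the NER component and add it to the pipeline if it is missing
if 'ner' not in nlp.pipe_names:
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner, last=True)
else:
    ner = nlp.get_pipe("ner")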
We can create an empty model using spacy.blank("en"), or we can load an existing spaCy model using spacy.load("model_name").
We can check the list of pipeline component names using nlp.pipe_names (an attribute, not a method).
If the entity recogniser is not in the pipeline, we need to create the NER pipeline component using nlp.create_pipe("ner") and add it to our model pipeline using the nlp.add_pipe method.

Adding Labels or entities

# add labels
for _, annotations in train_data:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])

other_pipe = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

# Only training NER
with nlp.disable_pipes(*other_pipe):
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
To train the model with our annotated data, we need to add the labels (entities) we want to extract from our text.
We can add each new entity label from our annotated data to the entity recogniser using ner.add_label().
As we are only interested in entity extraction, we disable all other pipeline components using nlp.disable_pipes(), so that only the NER component is trained.
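Before handing labels to the entity recogniser, it can be useful to list the distinct labels in the annotated data, since a typo or stray space in a tag would otherwise silently become a separate entity type. The following is a small sketch in plain Python. The helper collect_labels and the sample data are hypothetical.

```python
def collect_labels(train_data):
    """Return the sorted set of entity labels used in the annotations."""
    labels = set()
    for _text, annotations in train_data:
        for _start, _end, label in annotations.get("entities", []):
            labels.add(label)
    return sorted(labels)


sample_data = [
    ("Aspirin can relieve a mild headache.",
     {"entities": [(0, 7, "MEDICATION"), (27, 35, "SYMPTOM")]}),
    ("Fever is a common symptom of flu.",
     {"entities": [(0, 5, "SYMPTOM"), (29, 32, "DISEASE")]}),
]

labels = collect_labels(sample_data)
print(labels)
```

Each of these labels is what a loop like the one above would pass to ner.add_label().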

Training and updating the model

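import random

# Run this loop inside the `with nlp.disable_pipes(...)` block so that only NER is updated.
# `iteration` is the number of passes over the training data.
for itn in range(iteration):
    print("Starting iteration " + str(itn))
    random.shuffle(train_data)
    losses = {}

    for text, annotation in train_data:
        # spaCy v2 API: update the model on one (text, annotations) pair
        nlp.update(
            [text],
            [annotation],
            drop=0.2,  # dropout rate
            sgd=optimizer,
            losses=losses
        )
    # print(losses)
new_model = nlp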
We train the model for a number of iterations so that it can learn from the data effectively.
At each iteration, the training data is shuffled so that the model doesn't make any generalisations based on the order of the examples.
Within each iteration, we update the model on every training example using nlp.update().
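The loop above updates the model one example at a time. A common alternative is to shuffle the data and then update on small batches; spaCy provides spacy.util.minibatch for this. The sketch below shows the same idea in plain Python, with a hypothetical minibatches helper and integers standing in for training examples.

```python
import random


def minibatches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


random.seed(0)                # fixed seed so the shuffle is repeatable
data = list(range(10))        # stand-in for ten training examples
random.shuffle(data)          # new order at every iteration in real training

batches = list(minibatches(data, 4))
for batch in batches:
    # In real training, each batch would be unpacked into texts and annotations
    # and passed to a single nlp.update() call.
    print(batch)
```

Batching usually makes each update less noisy than a single-example update, at the cost of fewer updates per pass over the data.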

Evaluate the model

import spacy
from spacy.gold import GoldParse  # spaCy v2 API
from spacy.scorer import Scorer


def evaluate(model, examples):
    scorer = Scorer()
    for input_, annot in examples:
        # print(input_)
        doc_gold_text = model.make_doc(input_)  # tokenise only, without running the pipeline
        gold = GoldParse(doc_gold_text, entities=annot['entities'])
        pred_value = model(input_)  # run the full pipeline to get the predicted entities
        scorer.score(pred_value, gold)
    return scorer.scores


# test_data uses the same (text, {"entities": [...]}) format as train_data
test_result = evaluate(new_model, test_data)
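The scores returned by the scorer include entity-level precision, recall and F-score. To show what those numbers mean, here is a minimal sketch that computes them for exact span matches in plain Python. The function span_prf and the sample spans are hypothetical, and this is not the Scorer's own code.

```python
def span_prf(gold_spans, pred_spans):
    """Precision, recall and F1 over exact (start, end, label) matches."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [(0, 7, "MEDICATION"), (27, 35, "SYMPTOM")]
# The second predicted span has the wrong start offset, so it does not count.
pred = [(0, 7, "MEDICATION"), (22, 35, "SYMPTOM")]

p, r, f = span_prf(gold, pred)
print(p, r, f)
```

Here one of the two predicted entities matches exactly, so precision, recall and F1 all come out as 0.5. A span with the right label but the wrong boundaries counts as an error under exact matching.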