📅  Last modified: 2023-12-03 15:29:27.282000    🧑  Author: Mango
Arabert (styled AraBERT by its authors) is an open-source, PyTorch-based library that provides pretrained language models for Arabic natural language processing tasks. It is based on BERT, a bidirectional Transformer encoder that is pretrained on unlabeled text and then fine-tuned for downstream tasks.
Arabert comes with pre-trained models for various language processing tasks, such as sentiment analysis.
Arabert also supports fine-tuning of pre-trained models on custom datasets for specific tasks.
To install Arabert, you can use pip:
pip install arabert
Note: Arabert requires PyTorch >= 1.0 and transformers >= 3.0.
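To confirm that an environment satisfies the minimums stated in the note, a small check along the following lines may help. It is only a sketch: the helper names (parse_version, meets_minimum, check_requirement) are made up for illustration, the version parsing is deliberately simple, and it assumes the pip package names are torch and transformers.

```python
from importlib import metadata


def parse_version(version):
    """Turn '1.13.1+cu117' into (1, 13, 1), keeping only leading numeric parts."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric segment such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version string is at least the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad to equal length so that (1,) compares equal to (1, 0).
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirement(package, minimum):
    """Return (installed_version_or_None, ok) for a pip package name."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None, False
    return installed, meets_minimum(installed, minimum)


if __name__ == "__main__":
    # Minimum versions as stated in the note above.
    for package, minimum in [("torch", "1.0"), ("transformers", "3.0")]:
        installed, ok = check_requirement(package, minimum)
        print(package, installed, "OK" if ok else "missing or too old")
```

For anything beyond a quick sanity check, a dedicated version-parsing library would handle pre-release and local version tags more faithfully than this helper.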
Arabert provides a tokenizer for Arabic text that converts text into tokens that can be used by a model. Here's an example:
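# The tokenizer ships with the model on the Hugging Face Hub and loads through transformers
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('aubmindlab/bert-base-arabert')
text = "مرحبا بالعالم"
tokens = tokenizer.tokenize(text)
print(tokens)
# Prints a list of WordPiece tokens; pieces that continue a word start with '##',
# and the exact split depends on the model's vocabulary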
The tokenize method returns a list of tokens. The from_pretrained method loads a pre-trained tokenizer by name.
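In BERT-style WordPiece vocabularies, a token that starts with '##' continues the previous token rather than starting a new word. The sketch below shows how such a token list could be merged back into whole words. The function name merge_wordpieces is hypothetical, and the example token list is written by hand for illustration; it is not verified tokenizer output.

```python
def merge_wordpieces(tokens):
    """Join BERT-style WordPiece tokens back into whole words.

    A token starting with '##' continues the previous token; any other
    token starts a new word. A leading '##' token with nothing before it
    is kept unchanged.
    """
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return words


# Hand-written token list for illustration, not verified tokenizer output.
example_tokens = ["م", "##رح", "##با", "ب", "ال", "##ع", "##ال", "##م"]
print(merge_wordpieces(example_tokens))
```

This only undoes the subword split. It does not restore the original spacing, since pieces such as a detached prefix remain separate words.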
Arabert provides a pre-trained model for sentiment analysis. Here's an example:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
model_name = 'asafaya/bert-base-arabic-sentiment-analysis'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
text = "ايجابي جدا"
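inputs = tokenizer(text, return_tensors='pt')
with torch.no_grad():  # inference only, so gradients are not needed
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=1)[0]
# Assumes label index 0 is negative and 1 is positive; check model.config.id2label
sentiment_label = ['Negative', 'Positive'][int(probs.argmax())]
print(sentiment_label)
# Should print Positive for this input if the label order above holds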
The code above loads a pre-trained model for sentiment analysis and uses it to classify the sentiment of the given text.
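The post-processing step in that example (softmax over the model's raw scores, then picking the highest-probability label) can be illustrated without loading a model. The following is a plain-Python sketch; the logits are hypothetical values, and the label order is an assumption that would need to be checked against the actual model's configuration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_label(logits, labels):
    """Return the label with the highest probability, and that probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical logits for one input; the label order is an assumption.
label, confidence = pick_label([-1.2, 2.3], ["Negative", "Positive"])
print(label, round(confidence, 3))
```

Returning the probability alongside the label makes it possible to treat low-confidence predictions differently, for example by flagging them for review.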
Arabert also supports fine-tuning of pre-trained models on custom datasets for specific tasks. Here's an example:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TextClassificationPipeline
model_name = 'asafaya/bert-base-arabic-sentiment-analysis'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Load custom dataset
train_dataset = ...
test_dataset = ...
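# Put the model in training mode (enables dropout); this call alone does not train it
model.train()
# The training loop (optimizer steps over train_dataset) goes here
...
# Switch to evaluation mode before testing
model.eval()
pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer)
# The pipeline takes a string or a list of strings
results = pipeline(test_dataset)
print(results)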
The code above outlines how to fine-tune a pre-trained model for a specific task on a custom dataset; the dataset loading and the training loop itself are left out. The TextClassificationPipeline is used to classify text using the fine-tuned model.
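The example leaves the datasets unspecified. As one possible way to prepare and score them, the sketch below splits a list of labelled examples into train and test parts and computes accuracy from pipeline-style predictions. Everything here is an assumption for illustration: the (text, label) pair format, the helper names split_dataset and accuracy, the placeholder examples, and the stand-in predictions, which mimic the list of dicts with 'label' and 'score' keys that a transformers text-classification pipeline returns for a list of strings.

```python
import random


def split_dataset(examples, test_fraction=0.2, seed=0):
    """Shuffle (text, label) pairs and split them into train and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predictions, gold_labels):
    """Fraction of predictions whose 'label' matches the gold label.

    `predictions` is a list of dicts with a 'label' key.
    """
    if not gold_labels:
        return 0.0
    correct = sum(p["label"] == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)


# Hypothetical labelled examples; texts and label names are placeholders.
examples = [("نص %d" % i, "Positive" if i % 2 else "Negative") for i in range(10)]
train_dataset, test_dataset = split_dataset(examples)

# Stand-in for pipeline output so the sketch runs without a model.
fake_predictions = [{"label": "Positive", "score": 0.9} for _ in test_dataset]
print(len(train_dataset), len(test_dataset))
print(accuracy(fake_predictions, [label for _, label in test_dataset]))
```

With pairs like these, the pipeline would be called on the texts only, and the gold labels kept aside for scoring.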
Arabert provides pre-trained models for Arabic natural language processing and supports fine-tuning on custom datasets. Because the models load through the transformers interface, you can get started quickly and build Arabic natural language processing applications.