How to Contribute

Contribution Steps

  1. Clone the repository to your local machine:

git clone https://gitlab.com/TIBHannover/orkg/nlp/orkg-nlp-pypi.git
cd orkg-nlp-pypi
  2. Create a virtual environment with python=3.8, activate it, install the required dependencies, and install the pre-commit configuration:

conda create -n orkg-nlp-pypi python=3.8
conda activate orkg-nlp-pypi
pip install -r requirements.txt
pre-commit install
  3. Create a branch, commit your changes and push them:

git switch -c <name-your-branch>
# do your changes
git add .
git commit -m "your commit msg"
git push

Internal Workflow

As the architecture figure suggests, all our services inherit from the ORKGNLPBaseService, which is responsible for many common functionalities. Note that we only show the ones that should be used in an individual service implementation. Each service has one or more PipelineExecutors, each of which executes the full pipeline of the common service workflow listed below:

  1. The service’s encoder encodes the user’s input.

  2. The encoded input is passed to the model runner, which in turn executes the model.

  3. The model’s output is decoded to a user-friendly format using the service’s decoder.

In case a service has multiple PipelineExecutors, the order in which they are executed is handled in the __call__() function of the service.
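As a rough illustration, the following sketch shows conceptually what a single PipelineExecutor does with these three components. It is not the actual orkgnlp implementation, and the real argument passing may differ; the encoder, runner and decoder interfaces shown here follow the example service further below:

def execute_pipeline(encoder, runner, decoder, raw_input, **kwargs):
    # 1. Encode the user's raw input (see ORKGNLPBaseEncoder.encode()).
    encoded_input, kwargs = encoder.encode(raw_input, **kwargs)
    # 2. Execute the model with the encoded input (see the runners package).
    model_output, _ = runner.run(inputs=encoded_input)
    # 3. Decode the model output to a user-friendly format.
    return decoder.decode(model_output, **kwargs)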

(Figure: Internal architecture of orkgnlp.)

Adding a new Service

If you intend to add a new service to the orkgnlp package, simply follow these instructions:

  1. Can your service be integrated into orkgnlp? Check the Integration Requirements section below first!

  2. Create a pull request on our repository.

  3. Clone the source code and install the dependencies manually as described in the Installation section.

  4. If your service depends on models or data files,

    4.1. please create an issue linked to your pull request so that the ORKG team creates a specific Huggingface repository for you on our profile and uploads your required files.

    4.2. add your file paths and repository name to orkgnlp/huggingface_repos.json and use orkgnlp/huggingface_repos_schema.json to understand its structure.

  5. Service: Create a new module for your service under the appropriate package structure with a new class inheriting from orkgnlp.common.service.base.ORKGNLPBaseService.

    5.1. To implement your service, you only need to implement the __init__() and __call__() functions. In the __init__() function you need to initiate an encoder, a runner and a decoder for your service and register a new PipelineExecutor in the parent class using self._register_pipeline(). We explain these concepts in the following steps. In the __call__() function you need to call the parent class’s self._run() function with the right arguments. See the already implemented services to get more insight into the workflow.

  6. Encoder: If your service’s input is the same as its model’s input, you can use the orkgnlp.common.service.base.ORKGNLPBaseEncoder as your service’s encoder. Otherwise, create a module named encoder in your service package and implement a new Encoder class inheriting from ORKGNLPBaseEncoder.

  7. Runner: Use one of our implemented runners located in orkgnlp.common.service.runners.

  8. Decoder: If your model’s output is the same as your service’s output, you can use the orkgnlp.common.service.base.ORKGNLPBaseDecoder as your service’s decoder. Otherwise, create a module named decoder in your service package and implement a new Decoder class inheriting from ORKGNLPBaseDecoder.

  9. Finally, do not forget to test your service and submit your pull request for review!

Example Service

We take the BioAssays Semantification service as a simple example implementation. The service needs a custom encoder and decoder, but not a custom runner, since you can always use the predefined runners from the orkgnlp.common.service.runners package.

BioassaysSemantifier

First we create the BioassaysSemantifier class that inherits from the ORKGNLPBaseService as follows:

class BioassaysSemantifier(ORKGNLPBaseService):

Then we implement the __init__() function by creating the needed encoder, runner and decoder, and registering them with a new PipelineExecutor in the base service:

def __init__(self, *args, **kwargs):
    super().__init__(config['service_name'], *args, **kwargs)

    encoder = BioassaysSemantifierEncoder(io.read_onnx(config['paths']['vectorizer']))
    runner = ORKGNLPONNXRunner(io.read_onnx(config['paths']['model']))
    decoder = BioassaysSemantifierDecoder(io.read_json(config['paths']['mapping']))

    self._register_pipeline('main', encoder, runner, decoder)

Then we implement the __call__() function by calling the self._run() method with the user’s input, which executes the entire pipeline we registered:

def __call__(self, text):
    return self._run(
        raw_input=text
    )
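Assuming the service is exposed under the clustering package (the import path here is an assumption; check the package documentation for the actual module), using it is then as simple as:

from orkgnlp.clustering import BioassaysSemantifier  # import path assumed

semantifier = BioassaysSemantifier()
# Calling the service runs the registered pipeline: encode -> run -> decode.
labels = semantifier('Your bioassay description text goes here.')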

BioassaysSemantifierEncoder

In the encoder class we need to implement the encode(raw_input, **kwargs) function. The class constructor requires a loaded vectorizer model in ONNX format which can be run using our predefined ORKGNLPONNXRunner.

class BioassaysSemantifierEncoder(ORKGNLPBaseEncoder):
    def __init__(self, vectorizer):
        super().__init__()
        self._vectorizer = ORKGNLPONNXRunner(vectorizer)

Then we implement the encode function by converting the user’s input text to a TF-IDF vector using the encoder’s initialized runner and returning a specific axis of its output as a tuple of arguments. Note that the encoder’s return value will be used as input to the service’s runner.

def encode(self, raw_input, **kwargs):
    preprocessed_text = self._text_process(raw_input)
    output, _ = self._vectorizer.run(
        inputs=([preprocessed_text],),
        output_names=['variable']
    )
    return (output[0][0], ), kwargs

BioassaysSemantifierDecoder

In the decoder class we need to implement the decode(model_output, **kwargs) function. The class constructor requires a loaded dict object representing the mapping from cluster label to the semantified properties and resources.

class BioassaysSemantifierDecoder(ORKGNLPBaseDecoder):
    def __init__(self, mapping):
        super().__init__()
        self._mapping = mapping

The cluster label can be obtained from the model_output parameter of the decode function, which the PipelineExecutor obtains by running the model internally. It can then be used to fetch the respective properties and resources and return them to the service user.

def decode(self, model_output, **kwargs):
    cluster_label = model_output[0][0]
    return self._mapping[str(cluster_label)]['labels']
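For illustration, the mapping might look roughly like the following; the keys and label values here are hypothetical, and the real structure is defined by the service’s mapping file:

# Hypothetical mapping structure: cluster label -> semantified labels.
mapping = {
    '0': {'labels': {'P1': ['resource A', 'resource B']}},
    '1': {'labels': {'P2': ['resource C']}},
}

decoder = BioassaysSemantifierDecoder(mapping)
# If the model predicts cluster 1, the decoder returns mapping['1']['labels'].
decoder.decode(model_output=[[1]])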

Integration Requirements

Before starting to implement a new machine learning model that is planned to be integrated into orkgnlp, you need to think about some of its limitations. One of the most important challenges of orkgnlp is that its NLP services are implemented and experimented with using different code-bases and frameworks, such as scikit-learn or pytorch. Exporting a scikit-learn model can usually be done with pickle, which already has drawbacks on production systems due to code-base dependencies as well as security issues. On the other hand, loading a typical pytorch Module requires the Module’s class(es) to be present so that a class object can be instantiated, onto which the model weights are loaded. Moreover, implementing different models with different python or package versions makes it even more difficult for a public package!

The golden rule is: trained models must NOT depend on their training source code! Therefore, we kindly ask you to export your trained model to one of the following currently supported formats, so that it can easily be integrated into orkgnlp and published afterwards!

Currently supported formats:

  - ONNX (ModelProto)
  - TorchScript (ScriptModule)
  - Transformers (PreTrainedModel)
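For instance, a trained pytorch model can be exported to TorchScript without any dependency on its training code. A minimal sketch follows; the model and input shape below are placeholders for your own trained module:

import torch

# Placeholder for your trained model; replace with your own module.
trained_model = torch.nn.Linear(128, 10)
trained_model.eval()

# Trace the model with a representative example input and save the ScriptModule.
example_input = torch.rand(1, 128)
traced_model = torch.jit.trace(trained_model, example_input)
traced_model.save('model.pt')

# Loading the exported model later requires neither the model class nor the
# training code.
loaded_model = torch.jit.load('model.pt')
output = loaded_model(example_input)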

Testing

We implement our tests with the unittest package and use either it or pytest as the test runner. For development purposes, you can run the tests on your local machine with the following command:

poetry run test [-i ignored_dir_1 ignored_dir_2 ...]
# example:
poetry run test -i clustering annotation

or simply by executing tox:

tox

Tox tests the package against all python environments listed in the tox.ini file, provided the corresponding interpreters are installed on your machine; tests for interpreters that are not found will be skipped.

Note

Note that testing in our project is configured using pyproject.toml, tox.ini and pytest.ini to divide responsibilities :) You might need to check all config files in case you need to change anything.

Note

Also note that we ignore some tests by default in our tox configuration for the sake of the GitLab CI/CD pipeline. We recommend running poetry run test on your local machine in order to check all tests locally.