
Cannot download Deep Learning models from SparkNLP model hub #14378

Open
olivierr42 opened this issue Aug 23, 2024 · 3 comments

@olivierr42
Is there an existing issue for this?

  • I have searched the existing issues and did not find a match.

Who can help?

@maziyarpanahi
I saw you answered similar requests in the past. Thank you in advance.

What are you working on?

I am working with an in-house dataset. This is not an official example. I am trying to use this model specifically:
https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/embeddings/xlm_roberta_embeddings/index.html

I get the same issue when trying to load the SentenceDetectorDL model (mentioned on the Hub for this model).

Current Behavior

When I try to instantiate my pipeline:

  # input_col and output_col are column-name strings defined elsewhere in the application.
  from pyspark.ml import Pipeline
  from sparknlp.base import DocumentAssembler
  from sparknlp.annotator import SentenceDetector, XlmRoBertaSentenceEmbeddings

  document_assembler = DocumentAssembler().setInputCol(input_col).setOutputCol("document")

  sentencer = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")

  embeddings = (
      XlmRoBertaSentenceEmbeddings.pretrained("multilingual_e5_base", "xx")
      .setInputCols(["sentence"])
      .setOutputCol(output_col)
  )

  pipeline = Pipeline().setStages([document_assembler, sentencer, embeddings])

I get the following error:

answer = 'xro63'
gateway_client = <py4j.clientserver.JavaClient object at 0x13f3dd710>
target_id = 'z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader'
name = 'downloadModel'

    def get_return_value(answer, gateway_client, target_id=None, name=None):
        """Converts an answer received from the Java gateway into a Python object.
    
        For example, string representation of integers are converted to Python
        integer, string representation of objects are converted to JavaObject
        instances, etc.
    
        :param answer: the string returned by the Java gateway
        :param gateway_client: the gateway client used to communicate with the Java
            Gateway. Only necessary if the answer is a reference (e.g., object,
            list, map)
        :param target_id: the name of the object from which the answer comes from
            (e.g., *object1* in `object1.hello()`). Optional.
        :param name: the name of the member from which the answer comes from
            (e.g., *hello* in `object1.hello()`). Optional.
        """
        if is_error(answer)[0]:
            if len(answer) > 1:
                type = answer[1]
                value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
                if answer[1] == REFERENCE_TYPE:
>                   raise Py4JJavaError(
                        "An error occurred while calling {0}{1}{2}.\n".
                        format(target_id, ".", name), value)
E                   py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
E                   : java.lang.UnsatisfiedLinkError: no jnitensorflow in java.library.path

Expected Behavior

I know support for M1 is experimental, but I would expect it not to crash, especially since I am able to run Word2Vec models without issue.
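
For reference, a minimal sketch of the kind of Word2Vec pipeline that does run without error (a standard Tokenizer stage and the default pretrained Word2VecModel are assumed here, not taken from the report above):

  from pyspark.ml import Pipeline
  from sparknlp.base import DocumentAssembler
  from sparknlp.annotator import Tokenizer, Word2VecModel

  document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

  tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

  # Word2Vec is a classic (non-deep-learning) annotator, so it does not depend on
  # the native TensorFlow library that fails to load above.
  word2vec = (
      Word2VecModel.pretrained()  # default pretrained Word2Vec model (assumed)
      .setInputCols(["token"])
      .setOutputCol("embeddings")
  )

  pipeline = Pipeline().setStages([document_assembler, tokenizer, word2vec])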

Steps To Reproduce

  # input_col and output_col are column-name strings defined elsewhere in the application.
  from pyspark.ml import Pipeline
  from sparknlp.base import DocumentAssembler
  from sparknlp.annotator import SentenceDetector, XlmRoBertaSentenceEmbeddings

  document_assembler = DocumentAssembler().setInputCol(input_col).setOutputCol("document")

  sentencer = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")

  embeddings = (
      XlmRoBertaSentenceEmbeddings.pretrained("multilingual_e5_base", "xx")
      .setInputCols(["sentence"])
      .setOutputCol(output_col)
  )

  pipeline = Pipeline().setStages([document_assembler, sentencer, embeddings])

Spark NLP version and Apache Spark

sparknlp = '5.3.3'
pyspark = '3.5.1'

Type of Spark Application

Python Application

Java Version

java version "1.8.0_411"

Java Home Directory

/Library/Internet Plug-Ins/JavaAppletPlugin.plugin/Contents/Home

Setup and installation

poetry add sparknlp=5.3.3

Operating System and Version

Mac M1, macOS Sonoma 14.5

Link to your project (if available)

No response

Additional Information

I do not have issues with Word2Vec models. I also tried with Spark NLP 5.4.1, to no avail.

@maziyarpanahi
Member

Hi @olivierr42

The support for Apple Silicon is experimental at this point. This is true for all DL-based models/annotators. Word2Vec is implemented purely as a classic machine learning algorithm, so it works independently of the operating system.

@olivierr42
Author

It seems like the issue is with downloading the model. There seems to be a way to load models from local storage, but I cannot seem to make it work (it tries to find an assets subfolder within the model folder, which does not exist if I download from the provided URL).

Do you have any tips to make it work locally?

@maziyarpanahi
Member

What is the error when downloading models? You can always test it quickly in Google Colab to be sure whether it's the model or your environment.

Spark NLP works 100% offline. You can follow these instructions, which show how to download any model, extract it, and use .load() instead of .pretrained(): https://sparknlp.org/docs/en/install#offline

PS: Your Spark application must have access to that local path
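
A minimal sketch of that offline approach, assuming the model archive from the hub has already been downloaded and extracted (the local path below is a placeholder; the column names follow the pipeline from this issue):

  from sparknlp.annotator import XlmRoBertaSentenceEmbeddings

  # Load the extracted model from a local directory instead of downloading it.
  # "/path/to/multilingual_e5_base_xx" is a placeholder for the unzipped archive.
  embeddings = (
      XlmRoBertaSentenceEmbeddings.load("/path/to/multilingual_e5_base_xx")
      .setInputCols(["sentence"])
      .setOutputCol("embeddings")
  )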
