
Feat/parser in ecs #3

Open · wants to merge 45 commits into develop

Conversation

ranjan-stha (Collaborator)

Addresses xxxxxx
Depends on xxxxxx

Changes

  • ECS service to deploy the deepex parser
  • Lambda that uses a bucket for file conversion when the source is not PDF or HTML
  • Sentry integration (a sketch follows below)
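
For reference, a minimal sketch of the Sentry wiring this adds, assuming the SENTRY_URL and ENVIRONMENT env vars used elsewhere in this PR; the sampling rate is illustrative:

```python
import os

import sentry_sdk

SENTRY_URL = os.environ.get("SENTRY_URL")
ENVIRONMENT = os.environ.get("ENVIRONMENT")

if SENTRY_URL:
    sentry_sdk.init(
        dsn=SENTRY_URL,           # the Sentry DSN, kept out of source control
        environment=ENVIRONMENT,  # e.g. "dev" or "prod"
        traces_sample_rate=0.2,   # illustrative sampling rate, not from this PR
    )
```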

Mention related users here if any.

This PR doesn't introduce any:

  • [x] temporary files, auto-generated files or secret keys
  • [x] n+1 queries
  • [x] flake8 issues
  • print
  • [x] typos
  • [x] unwanted comments

This PR contains valid:

  • tests
  • permission checks (tests here too)

Comment on lines 212 to 216
local_temp_directory = pathlib.Path('/tmp', file_name)
local_temp_directory.mkdir(parents=True) if not local_temp_directory.exists() else None
# Note: commented for now
# images.save_images(directory_path=local_temp_directory)

Contributor

Comment these lines
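
(Side note on the first two quoted lines: pathlib can fold the existence check into the call itself, so the conditional expression isn't needed. A one-line equivalent:)

```python
local_temp_directory = pathlib.Path("/tmp", file_name)
local_temp_directory.mkdir(parents=True, exist_ok=True)  # no-op if the directory already exists
```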

Comment on lines 226 to 229

s3_file_path = f"s3://{DEST_BUCKET_NAME}/{str(s3_path_prefix)}/{file_name}"
s3_images_path = f"s3://{DEST_BUCKET_NAME}/{str(s3_path_prefix)}/images"

Contributor

Make a general util for this

Collaborator Author

done
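
A rough sketch of what that shared helper might look like; the name build_s3_link and its signature are hypothetical:

```python
def build_s3_link(bucket_name: str, path_prefix: str, key: str) -> str:
    """Build an s3:// link for an object in the given bucket."""
    return f"s3://{bucket_name}/{path_prefix}/{key}"

# Mirroring the two lines under review:
# s3_file_path = build_s3_link(DEST_BUCKET_NAME, str(s3_path_prefix), file_name)
# s3_images_path = build_s3_link(DEST_BUCKET_NAME, str(s3_path_prefix), "images")
```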

Comment on lines 243 to 245

extracted_text = extracted_text.replace("\x00", "") # remove null chars

Contributor

Let's make a util function for this

Collaborator Author

done
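
A one-liner is enough here; something like this, with the hypothetical name remove_null_chars:

```python
def remove_null_chars(text: str) -> str:
    """Remove NUL characters, which some stores (e.g. Postgres text columns) reject."""
    return text.replace("\x00", "")
```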

Comment on lines 260 to 261
s3_file_path = f"s3://{DEST_BUCKET_NAME}/{str(s3_path_prefix)}/{file_name}"
return s3_file_path, None, total_pages, total_words_count # No images extraction (lib doesn't support?)
Contributor

Let's make a util function

Collaborator Author

done

Comment on lines 312 to 383
url, file_name
)
if s3_file_path:
extraction_status = ExtractionStatus.SUCCESS.value
else:
extraction_status = ExtractionStatus.FAILED.value
except Exception as e:
logging.error(f"Error occurred during text extraction. {str(e)}", exc_info=True)
s3_file_path, s3_images_path, total_pages, total_words_count = None, None, -1, -1
extraction_status = ExtractionStatus.FAILED.value
elif content_type == UrlTypes.DOCX.value or content_type == UrlTypes.MSWORD.value or \
content_type == UrlTypes.XLSX.value or content_type == UrlTypes.XLS.value or \
content_type == UrlTypes.PPTX.value or content_type == UrlTypes.PPT.value:

ext_type = content_type
tmp_filename = f"{uuid.uuid4().hex}.{ext_type}"
flag = False
if upload_file_to_s3(url, key=tmp_filename, bucketname=DOCS_CONVERSION_BUCKET_NAME):
payload = json.dumps({
"file": tmp_filename,
"bucket": DOCS_CONVERSION_BUCKET_NAME,
"ext": ext_type,
"fromS3": 1
})

docs_conversion_lambda_response = lambda_client.invoke(
FunctionName=DOCS_CONVERT_LAMBDA_FN_NAME,
InvocationType="RequestResponse",
Payload=payload
)
docs_conversion_lambda_response_json = json.loads(
docs_conversion_lambda_response["Payload"].read().decode("utf-8")
)

if "statusCode" in docs_conversion_lambda_response_json and \
docs_conversion_lambda_response_json["statusCode"] == 200:
bucket_name = docs_conversion_lambda_response_json["bucket"]
file_path = docs_conversion_lambda_response_json["file"]
filename = file_path.split("/")[-1]

if download_file(file_path, bucket_name, f"/tmp/{filename}"):
s3_file_path, s3_images_path, total_pages, total_words_count = get_extracted_content_links(
f"/tmp/{filename}", file_name
)
if s3_file_path:
extraction_status = ExtractionStatus.SUCCESS.value
else:
extraction_status = ExtractionStatus.FAILED.value
else:
flag = True
else:
logging.error(f"Error occurred during file conversion. {docs_conversion_lambda_response_json['error']}")
flag = True
else:
logging.warn("Could not upload the file to s3.")
flag = True

if flag:
s3_file_path, s3_images_path, total_pages, total_words_count = None, None, -1, -1
extraction_status = ExtractionStatus.FAILED.value
elif content_type == UrlTypes.IMG.value:
logging.warn("Text extraction from Images is not available.")
s3_file_path, s3_images_path, total_pages, total_words_count = None, None, -1, -1
extraction_status = ExtractionStatus.FAILED.value
else:
logging.error(f"Text extraction is not available for this content type - {content_type}")
s3_file_path, s3_images_path, total_pages, total_words_count = None, None, -1, -1
extraction_status = ExtractionStatus.FAILED.value

logging.info(f"The extracted file path is {s3_file_path}")
logging.info(f"The extracted image path is {s3_images_path}")
logging.info(f"The status of the extraction is {str(extraction_status)}")
Contributor

Let's improve this function

Collaborator Author

done
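
One sketch of the kind of cleanup that helps here: the failure tuple is repeated in four branches, so a small helper (hypothetical name mark_failed) removes the duplication:

```python
def mark_failed():
    """Default paths/counts plus status for any branch where extraction fails."""
    # assumes ExtractionStatus from the module under review
    return None, None, -1, -1, ExtractionStatus.FAILED.value

# e.g. in the exception handler:
# s3_file_path, s3_images_path, total_pages, total_words_count, extraction_status = mark_failed()
```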

Comment on lines 24 to 25
SENTRY_URL = os.environ.get("SENTRY_URL")
ENVIRONMENT = os.environ.get("ENVIRONMENT")
Member

Maybe use django-environ for env configuration and load all configuration from a single module like config.py?

https://django-environ.readthedocs.io/en/latest/
example: https://github.com/the-deep/server/blob/develop/deep/settings.py#L22-L85

Collaborator Author (@ranjan-stha, Oct 11, 2022)

I think this isn't important enough to use; it just adds an extra dependency. But I moved all the envs to a config.py file.
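
A sketch of that config.py, using only the env var names already referenced in this PR:

```python
# config.py - single place for environment-driven settings
import os

SENTRY_URL = os.environ.get("SENTRY_URL")
ENVIRONMENT = os.environ.get("ENVIRONMENT")
DEST_BUCKET_NAME = os.environ.get("DEST_BUCKET_NAME")
DOCS_CONVERSION_BUCKET_NAME = os.environ.get("DOCS_CONVERSION_BUCKET_NAME")
DOCS_CONVERT_LAMBDA_FN_NAME = os.environ.get("DOCS_CONVERT_LAMBDA_FN_NAME")
```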

dev.tfvars (outdated)

# sentry url
sentry_url = "https://[email protected]/1223576"
Member

This is sensitive info.

Collaborator Author

Stored in AWS secrets.
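
For context, a minimal sketch of reading such a value from AWS Secrets Manager with boto3; the secret name and key below are hypothetical:

```python
import json

import boto3


def get_secret(secret_id: str) -> dict:
    """Fetch and decode a JSON secret from AWS Secrets Manager."""
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])

# sentry_url = get_secret("deepex-parser-secrets")["SENTRY_URL"]  # hypothetical name/key
```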

Comment on lines +13 to +14
SENTRY_URL = os.environ.get("SENTRY_URL")
ENVIRONMENT = os.environ.get("ENVIRONMENT")
Member

Let's define all the config in a single file.

Collaborator Author

done.

&& apt-get autoremove -y \
&& rm -rf /var/lib/apt/lists/*

COPY deepex_ecs/app.py content_types.py wget.py /code/
Member

Don't we need the other files?

Collaborator Author

done.

Comment on lines 81 to 82
key="temporaryfile.pdf",
bucketname="deep-large-docs-conversion"
Member

Let's not define the default value here or use the value from config.

Collaborator Author

done.

key="temporaryfile.pdf",
bucketname="deep-large-docs-conversion"
):
try:
Member

Collaborator Author

done

Comment on lines 63 to 64
elif url.endswith(".jpg") or url.endswith(".jpeg") or url.endswith(".png") or \
url.endswith(".gif") or url.endswith(".bmp") or content_type in self.content_types_img:
Member

Maybe use any here.

Suggested change
elif url.endswith(".jpg") or url.endswith(".jpeg") or url.endswith(".png") or \
url.endswith(".gif") or url.endswith(".bmp") or content_type in self.content_types_img:
elif (
content_type in self.content_types_img or
any([
url.endswith(f".{extension}") for extension in [
"jpg", "jpeg", "png", "gif", "bmp"
]
])
):

Collaborator Author

done.
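
Worth noting that str.endswith also accepts a tuple of suffixes, so the same check can be written without any; a runnable demonstration:

```python
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".bmp")

url = "https://example.com/photo.jpeg"
print(url.endswith(IMAGE_SUFFIXES))  # True
```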

Comment on lines 74 to 93
if temp_filepath.endswith(".pdf"):
return UrlTypes.PDF.value
elif temp_filepath.endswith(".docx"):
return UrlTypes.DOCX.value
elif temp_filepath.endswith(".doc"):
return UrlTypes.MSWORD.value
elif temp_filepath.endswith(".xlsx"):
return UrlTypes.XLSX.value
elif temp_filepath.endswith(".xls"):
return UrlTypes.XLS.value
elif temp_filepath.endswith(".pptx"):
return UrlTypes.PPTX.value
elif temp_filepath.endswith(".ppt"):
return UrlTypes.PPT.value
elif temp_filepath.endswith(".jpg") or temp_filepath.endswith(".jpeg") or temp_filepath.endswith(".png") or \
temp_filepath.endswith(".gif") or temp_filepath.endswith(".bmp"):
return UrlTypes.IMG.value
else:
logging.warn(f'Could not determine the content-type of the {url}')
return None
Member

Maybe do this using a dict.

Suggested change
if temp_filepath.endswith(".pdf"):
return UrlTypes.PDF.value
elif temp_filepath.endswith(".docx"):
return UrlTypes.DOCX.value
elif temp_filepath.endswith(".doc"):
return UrlTypes.MSWORD.value
elif temp_filepath.endswith(".xlsx"):
return UrlTypes.XLSX.value
elif temp_filepath.endswith(".xls"):
return UrlTypes.XLS.value
elif temp_filepath.endswith(".pptx"):
return UrlTypes.PPTX.value
elif temp_filepath.endswith(".ppt"):
return UrlTypes.PPT.value
elif temp_filepath.endswith(".jpg") or temp_filepath.endswith(".jpeg") or temp_filepath.endswith(".png") or \
temp_filepath.endswith(".gif") or temp_filepath.endswith(".bmp"):
return UrlTypes.IMG.value
else:
logging.warn(f'Could not determine the content-type of the {url}')
return None
EXTENSION_TO_ENUM_MAP = {
"pdf": UrlTypes.PDF,
"docx": UrlTypes.DOCX,
"doc": UrlTypes.MSWORD,
"xlsx": UrlTypes.XLSX,
"xls": UrlTypes.XLS,
"pptx": UrlTypes.PPTX,
"ppt": UrlTypes.PPT,
# Images
"jpg": UrlTypes.IMG,
"jpeg": UrlTypes.IMG,
"png": UrlTypes.IMG,
"gif": UrlTypes.IMG,
"bmp": UrlTypes.IMG,
}
file_extension = temp_filepath.split('.')[-1]
if file_extension not in EXTENSION_TO_ENUM_MAP:
logging.warn(f'Could not determine the content-type of the {url}')
return None
return EXTENSION_TO_ENUM_MAP[file_extension].value

Collaborator Author

done.
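
One small follow-up on the suggestion: lowercasing the extension first, i.e. temp_filepath.split('.')[-1].lower(), would also match uppercase names like REPORT.PDF.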
