
Feat/parser in ecs #3

Open · wants to merge 45 commits into develop

Conversation

ranjan-stha (Collaborator)

Addresses xxxxxx
Depends on xxxxxx

Changes

  • ECS service to deploy the deepex parser
  • Lambda that uses a bucket for file conversion when the source is not PDF or HTML
  • Sentry integration (a sketch follows below)
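
For reference, a minimal sketch of the Sentry wiring this adds, assuming the SENTRY_URL and ENVIRONMENT env vars used elsewhere in this PR; the sampling rate is illustrative:

```python
import os

import sentry_sdk

SENTRY_URL = os.environ.get("SENTRY_URL")
ENVIRONMENT = os.environ.get("ENVIRONMENT")

if SENTRY_URL:
    sentry_sdk.init(
        dsn=SENTRY_URL,           # the Sentry DSN, kept out of source control
        environment=ENVIRONMENT,  # e.g. "dev" or "prod"
        traces_sample_rate=0.2,   # illustrative sampling rate, not from this PR
    )
```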

Mention related users here if any.

This PR doesn't introduce any:

  • [x] temporary files, auto-generated files or secret keys
  • [x] n+1 queries
  • [x] flake8 issues
  • print
  • [x] typos
  • [x] unwanted comments

This PR contains valid:

  • tests
  • permission checks (tests here too)

Comment on lines 212 to 216
local_temp_directory = pathlib.Path('/tmp', file_name)
local_temp_directory.mkdir(parents=True) if not local_temp_directory.exists() else None
# Note: commented for now
# images.save_images(directory_path=local_temp_directory)

Contributor

Comment these lines
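
(Side note on the first two quoted lines: pathlib can fold the existence check into the call itself, so the conditional expression isn't needed. A one-line equivalent:)

```python
local_temp_directory = pathlib.Path("/tmp", file_name)
local_temp_directory.mkdir(parents=True, exist_ok=True)  # no-op if the directory already exists
```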

Comment on lines 226 to 229

s3_file_path = f"s3://{DEST_BUCKET_NAME}/{str(s3_path_prefix)}/{file_name}"
s3_images_path = f"s3://{DEST_BUCKET_NAME}/{str(s3_path_prefix)}/images"

Contributor

Make a general util for this

Collaborator Author

done
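
A rough sketch of what that shared helper might look like; the name build_s3_link and its signature are hypothetical:

```python
def build_s3_link(bucket_name: str, path_prefix: str, key: str) -> str:
    """Build an s3:// link for an object in the given bucket."""
    return f"s3://{bucket_name}/{path_prefix}/{key}"

# Mirroring the two lines under review:
# s3_file_path = build_s3_link(DEST_BUCKET_NAME, str(s3_path_prefix), file_name)
# s3_images_path = build_s3_link(DEST_BUCKET_NAME, str(s3_path_prefix), "images")
```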

Comment on lines 243 to 245

extracted_text = extracted_text.replace("\x00", "") # remove null chars

Contributor

Let's make a util function for this

Collaborator Author

done
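
A one-liner is enough here; something like this, with the hypothetical name remove_null_chars:

```python
def remove_null_chars(text: str) -> str:
    """Remove NUL characters, which some stores (e.g. Postgres text columns) reject."""
    return text.replace("\x00", "")
```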

Comment on lines 260 to 261
s3_file_path = f"s3://{DEST_BUCKET_NAME}/{str(s3_path_prefix)}/{file_name}"
return s3_file_path, None, total_pages, total_words_count # No images extraction (lib doesn't support?)
Contributor

Let's make a util function

Collaborator Author

done

Comment on lines 312 to 383
url, file_name
)
if s3_file_path:
extraction_status = ExtractionStatus.SUCCESS.value
else:
extraction_status = ExtractionStatus.FAILED.value
except Exception as e:
logging.error(f"Error occurred during text extraction. {str(e)}", exc_info=True)
s3_file_path, s3_images_path, total_pages, total_words_count = None, None, -1, -1
extraction_status = ExtractionStatus.FAILED.value
elif content_type == UrlTypes.DOCX.value or content_type == UrlTypes.MSWORD.value or \
content_type == UrlTypes.XLSX.value or content_type == UrlTypes.XLS.value or \
content_type == UrlTypes.PPTX.value or content_type == UrlTypes.PPT.value:

ext_type = content_type
tmp_filename = f"{uuid.uuid4().hex}.{ext_type}"
flag = False
if upload_file_to_s3(url, key=tmp_filename, bucketname=DOCS_CONVERSION_BUCKET_NAME):
payload = json.dumps({
"file": tmp_filename,
"bucket": DOCS_CONVERSION_BUCKET_NAME,
"ext": ext_type,
"fromS3": 1
})

docs_conversion_lambda_response = lambda_client.invoke(
FunctionName=DOCS_CONVERT_LAMBDA_FN_NAME,
InvocationType="RequestResponse",
Payload=payload
)
docs_conversion_lambda_response_json = json.loads(
docs_conversion_lambda_response["Payload"].read().decode("utf-8")
)

if "statusCode" in docs_conversion_lambda_response_json and \
docs_conversion_lambda_response_json["statusCode"] == 200:
bucket_name = docs_conversion_lambda_response_json["bucket"]
file_path = docs_conversion_lambda_response_json["file"]
filename = file_path.split("/")[-1]

if download_file(file_path, bucket_name, f"/tmp/{filename}"):
s3_file_path, s3_images_path, total_pages, total_words_count = get_extracted_content_links(
f"/tmp/{filename}", file_name
)
if s3_file_path:
extraction_status = ExtractionStatus.SUCCESS.value
else:
extraction_status = ExtractionStatus.FAILED.value
else:
flag = True
else:
logging.error(f"Error occurred during file conversion. {docs_conversion_lambda_response_json['error']}")
flag = True
else:
logging.warn("Could not upload the file to s3.")
flag = True

if flag:
s3_file_path, s3_images_path, total_pages, total_words_count = None, None, -1, -1
extraction_status = ExtractionStatus.FAILED.value
elif content_type == UrlTypes.IMG.value:
logging.warn("Text extraction from Images is not available.")
s3_file_path, s3_images_path, total_pages, total_words_count = None, None, -1, -1
extraction_status = ExtractionStatus.FAILED.value
else:
logging.error(f"Text extraction is not available for this content type - {content_type}")
s3_file_path, s3_images_path, total_pages, total_words_count = None, None, -1, -1
extraction_status = ExtractionStatus.FAILED.value

logging.info(f"The extracted file path is {s3_file_path}")
logging.info(f"The extracted image path is {s3_images_path}")
logging.info(f"The status of the extraction is {str(extraction_status)}")
Contributor

Let's improve this function

Collaborator Author

done
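
One sketch of the kind of cleanup that helps here: the failure tuple is repeated in four branches, so a small helper (hypothetical name mark_failed) removes the duplication:

```python
def mark_failed():
    """Default paths/counts plus status for any branch where extraction fails."""
    # assumes ExtractionStatus from the module under review
    return None, None, -1, -1, ExtractionStatus.FAILED.value

# e.g. in the exception handler:
# s3_file_path, s3_images_path, total_pages, total_words_count, extraction_status = mark_failed()
```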

Comment on lines 24 to 25
SENTRY_URL = os.environ.get("SENTRY_URL")
ENVIRONMENT = os.environ.get("ENVIRONMENT")
Member

Maybe use django-environ for env configuration and load all configuration from a single module like config.py?

https://django-environ.readthedocs.io/en/latest/
example: https://github.com/the-deep/server/blob/develop/deep/settings.py#L22-L85

Collaborator Author (@ranjan-stha, Oct 11, 2022)

I think this isn't important enough to use; it just adds an extra dependency. But I moved all the envs to a config.py file.
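
A sketch of that config.py, using only the env var names already referenced in this PR:

```python
# config.py - single place for environment-driven settings
import os

SENTRY_URL = os.environ.get("SENTRY_URL")
ENVIRONMENT = os.environ.get("ENVIRONMENT")
DEST_BUCKET_NAME = os.environ.get("DEST_BUCKET_NAME")
DOCS_CONVERSION_BUCKET_NAME = os.environ.get("DOCS_CONVERSION_BUCKET_NAME")
DOCS_CONVERT_LAMBDA_FN_NAME = os.environ.get("DOCS_CONVERT_LAMBDA_FN_NAME")
```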

dev.tfvars (outdated)

# sentry url
sentry_url = "https://[email protected]/1223576"
Member

This is sensitive info.

Collaborator Author

Stored in AWS secrets.
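
For context, a minimal sketch of reading such a value from AWS Secrets Manager with boto3; the secret name and key below are hypothetical:

```python
import json

import boto3


def get_secret(secret_id: str) -> dict:
    """Fetch and decode a JSON secret from AWS Secrets Manager."""
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])

# sentry_url = get_secret("deepex-parser-secrets")["SENTRY_URL"]  # hypothetical name/key
```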

Comment on lines +13 to +14
SENTRY_URL = os.environ.get("SENTRY_URL")
ENVIRONMENT = os.environ.get("ENVIRONMENT")
Member

Let's define all the config in a single file.

Collaborator Author

done.

&& apt-get autoremove -y \
&& rm -rf /var/lib/apt/lists/*

COPY deepex_ecs/app.py content_types.py wget.py /code/
Member

Don't we need the other files?

Collaborator Author

done.

Comment on lines 81 to 82
key="temporaryfile.pdf",
bucketname="deep-large-docs-conversion"
Member

Let's not define the default value here or use the value from config.

Collaborator Author

done.

key="temporaryfile.pdf",
bucketname="deep-large-docs-conversion"
):
try:
Member

Collaborator Author

done

Comment on lines 63 to 64
elif url.endswith(".jpg") or url.endswith(".jpeg") or url.endswith(".png") or \
url.endswith(".gif") or url.endswith(".bmp") or content_type in self.content_types_img:
Member

Maybe use any here.

Suggested change
elif url.endswith(".jpg") or url.endswith(".jpeg") or url.endswith(".png") or \
url.endswith(".gif") or url.endswith(".bmp") or content_type in self.content_types_img:
elif (
content_type in self.content_types_img or
any([
url.endswith(f".{extension}") for extension in [
"jpg", "jpeg", "png", "gif", "bmp"
]
])
):

Collaborator Author

done.
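
Worth noting that str.endswith also accepts a tuple of suffixes, so the same check can be written without any; a runnable demonstration:

```python
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".bmp")

url = "https://example.com/photo.jpeg"
print(url.endswith(IMAGE_SUFFIXES))  # True
```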

Comment on lines 74 to 93
if temp_filepath.endswith(".pdf"):
return UrlTypes.PDF.value
elif temp_filepath.endswith(".docx"):
return UrlTypes.DOCX.value
elif temp_filepath.endswith(".doc"):
return UrlTypes.MSWORD.value
elif temp_filepath.endswith(".xlsx"):
return UrlTypes.XLSX.value
elif temp_filepath.endswith(".xls"):
return UrlTypes.XLS.value
elif temp_filepath.endswith(".pptx"):
return UrlTypes.PPTX.value
elif temp_filepath.endswith(".ppt"):
return UrlTypes.PPT.value
elif temp_filepath.endswith(".jpg") or temp_filepath.endswith(".jpeg") or temp_filepath.endswith(".png") or \
temp_filepath.endswith(".gif") or temp_filepath.endswith(".bmp"):
return UrlTypes.IMG.value
else:
logging.warn(f'Could not determine the content-type of the {url}')
return None
Member

Maybe do this using a dict.

Suggested change
if temp_filepath.endswith(".pdf"):
return UrlTypes.PDF.value
elif temp_filepath.endswith(".docx"):
return UrlTypes.DOCX.value
elif temp_filepath.endswith(".doc"):
return UrlTypes.MSWORD.value
elif temp_filepath.endswith(".xlsx"):
return UrlTypes.XLSX.value
elif temp_filepath.endswith(".xls"):
return UrlTypes.XLS.value
elif temp_filepath.endswith(".pptx"):
return UrlTypes.PPTX.value
elif temp_filepath.endswith(".ppt"):
return UrlTypes.PPT.value
elif temp_filepath.endswith(".jpg") or temp_filepath.endswith(".jpeg") or temp_filepath.endswith(".png") or \
temp_filepath.endswith(".gif") or temp_filepath.endswith(".bmp"):
return UrlTypes.IMG.value
else:
logging.warn(f'Could not determine the content-type of the {url}')
return None
EXTENSION_TO_ENUM_MAP = {
"pdf": UrlTypes.PDF,
"docx": UrlTypes.DOCX,
"doc": UrlTypes.MSWORD,
"xlsx": UrlTypes.XLSX,
"xls": UrlTypes.XLS,
"pptx": UrlTypes.PPTX,
"ppt": UrlTypes.PPT,
# Images
"jpg": UrlTypes.IMG,
"jpeg": UrlTypes.IMG,
"png": UrlTypes.IMG,
"gif": UrlTypes.IMG,
"bmp": UrlTypes.IMG,
}
file_extension = temp_filepath.split('.')[-1]
if file_extension not in EXTENSION_TO_ENUM_MAP:
logging.warn(f'Could not determine the content-type of the {url}')
return None
return EXTENSION_TO_ENUM_MAP[file_extension].value

Collaborator Author

done.
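
One small follow-up on the suggestion: lowercasing the extension first, i.e. temp_filepath.split('.')[-1].lower(), would also match uppercase names like REPORT.PDF.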
