Fix header normalization for xlsx #612

hypermurea · 2024-09-19T12:07:54Z

Please explain how to summarize this PR for the Changelog:

change normalization of field keys for extractor and xlsx-extractor to change non-alphanumeric characters into character '_'
normalize field keys identically in both extractor and xlsx-extractor to prevent keys from overlapping

Tell code reviewer how and what to test:

identical or near-identical column names should not overlap: example test .xlsx file demonstrates a situation where column names will overlap if normalization is done differently between the two extractors

coderabbitai · 2024-09-19T12:08:02Z

Walkthrough

The changes introduce a new test suite for the xlsx-extractor plugin, focusing on validating header normalization when uploading Excel files. Additionally, the prependNonUniqueHeaderColumns function is refined for better handling of values and key normalization, which is further enhanced by introducing a new helper function, normalizeKey. This function now includes regex to remove non-alphanumeric characters, improving key consistency. The modifications aim to ensure that the normalization process is effective and that the handling of headers is robust.

Changes

Files	Change Summary
`plugins/xlsx-extractor/src/header.normalization.spec.ts`	Added a test suite for validating header normalization in Excel uploads, including setup and teardown processes.
`plugins/xlsx-extractor/src/utils.ts`	Refined the `prependNonUniqueHeaderColumns` function for better value processing and introduced `normalizeKey` for key normalization.
`utils/extractor/src/index.ts`	Enhanced the `normalizeKey` function to include regex for removing non-alphanumeric characters, improving key consistency.

Possibly related PRs

fix: normalize extracted column keys #569: Introduces the normalizeKey function, directly related to the changes in the main PR regarding key normalization.
fix: normalize extracted column keys #576: Modifies the Extractor function to apply normalizeRecordKeys, which utilizes normalizeKey, aligning with key normalization efforts.
fix: null header extraction #581: Addresses header extraction issues relevant to validating header normalization during Excel file uploads.
Fix/support for empty header cell #584: Enhances handling of empty header cells, relating to proper header normalization and validation in the xlsx-extractor plugin.

Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b5feb28 and 4dfbb85.

Files ignored due to path filters (1)

plugins/xlsx-extractor/ref/test-headers.xlsx is excluded by !**/*.xlsx, !**/*.xlsx

Files selected for processing (3)

plugins/xlsx-extractor/src/header.normalization.spec.ts (1 hunks)
plugins/xlsx-extractor/src/utils.ts (2 hunks)
utils/extractor/src/index.ts (1 hunks)

Additional comments not posted (7)

plugins/xlsx-extractor/src/utils.ts (2)

8-9: LGTM!

The changes improve the clarity and conciseness of the code:

Simplifying the assignment of newValue using the logical OR operator (||) achieves the same outcome as the original conditional operator but is more concise.

Introducing the call to the new helper function normalizeKey encapsulates the normalization logic, promoting better code organization and potential reusability.

23-25: LGTM!

The normalizeKey function effectively normalizes the input string:

It removes leading and trailing whitespace using the trim method.

It replaces specific characters ('%', '$') with designated placeholders ('PERCENT', 'DOLLAR') using the replace method with regular expressions.

It removes any non-alphanumeric characters and replaces them with underscores using another replace method with a regular expression.

The normalization process ensures consistency in the keys used in the result object of the prependNonUniqueHeaderColumns function. The function is reusable and can be used in other parts of the codebase if similar normalization is required.

plugins/xlsx-extractor/src/header.normalization.spec.ts (4)

1-11: LGTM!

The imports are relevant and required for the test suite. The code segment looks good.

12-128: Excellent test coverage!

The test suite is well-structured and provides comprehensive coverage for the header normalization functionality of the xlsx-extractor plugin. The test case uploads a sample Excel file, handles success/failure scenarios, and validates the workbook, sheets, fields, and record data to ensure the normalization process is working as expected.

Great job on implementing this test suite!

35-38: Verify the getEnvironmentId() function.

The file upload is performed correctly by providing the necessary parameters. However, please ensure that the getEnvironmentId() function is properly defined and returns the expected environment ID.

57-92: LGTM!

The expected fields array is well-defined and covers various scenarios for validating the header normalization process. The normalized keys in the expected fields accurately represent the expected behavior of the normalization logic.

utils/extractor/src/index.ts (1)

214-214: Excellent addition to the normalizeKey function!

The introduced regex replacement (/[^a-zA-Z0-9]/g) effectively removes any non-alphanumeric character from the key and replaces it with an underscore. This change ensures that all keys are normalized consistently, regardless of the presence of special characters or whitespace.

By broadening the scope of key normalization, this modification improves data integrity and consistency when processing keys across the codebase. It aligns perfectly with the PR objective of normalizing field keys uniformly in both the extractor and the xlsx-extractor.

Great work on enhancing the key normalization process!

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share

Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
-- I pushed a fix in commit <commit_id>, please review it.
-- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
-- @coderabbitai generate unit testing code for this file.
-- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
-- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
-- @coderabbitai read src/utils.ts and generate unit testing code.
-- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
-- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

flatfile-nullify · 2024-09-19T12:09:22Z

plugins/xlsx-extractor/src/header.normalization.spec.ts

+      console.log(event.topic)
+    })
+
+    await api.files.upload(fs.createReadStream(path.join(__dirname,'../ref/test-headers.xlsx')), {


Nullify Code Language: TypeScript 🔵 MEDIUM Severity CWE-22

Javascript pathtraversal rule non literal fs filename

The application dynamically constructs file or path information. If the path
information comes from user-supplied input, it could be abused to read sensitive files,
access other users' data, or aid in exploitation to gain further system access.

User input should never be used in constructing paths or files for interacting
with the filesystem. This includes filenames supplied by user uploads or downloads.
If possible, consider hashing user input or using unique values and
use path.normalize to resolve and validate the path information
prior to processing any file functionality.

Example using path.normalize and not allowing direct user input:

// User input, saved only as a reference // id is a randomly generated UUID to be used as the filename const userData = {userFilename: userSuppliedFilename, id: crypto.randomUUID()}; // Restrict all file processing to this directory only const basePath = '/app/restricted/'; // Create the full path, but only use our random generated id as the filename const joinedPath = path.join(basePath, userData.id); // Normalize path, removing any '..' const fullPath = path.normalize(joinedPath); // Verify the fullPath is contained within our basePath if (!fullPath.startsWith(basePath)) { console.log("Invalid path specified!"); } // Process / work with file // ...

For more information on path traversal issues see OWASP:
https://owasp.org/www-community/attacks/Path_Traversal

⚡ Here's how you might fix this potential vulnerability

The modified code first constructs the full path of the file, then uses 'path.basename()' to get the filename. This ensures that only the filename is used in the filesystem operation, preventing directory traversal. Even though '__dirname' is not controllable by an attacker, it's a good practice to always validate or sanitize filenames in filesystem operations.

autoFixesExperimental

Use path.basename() to get the filename and prevent directory traversal

Suggested change

await api.files.upload(fs.createReadStream(path.join(__dirname,'../ref/test-headers.xlsx')), {

const filePath = path.join(__dirname, '../ref/test-headers.xlsx');

const fileName = path.basename(filePath);

await api.files.upload(fs.createReadStream(filePath), {

poweredByNullify

Reply with /nullify to interact with me like another developer
_{(you will need to refresh the page for updates)}

flatfile-nullify · 2024-09-19T12:09:24Z

Nullify Code Vulnerabilities

1 findings found in this pull request

`🔴 CRITICAL`	`🟡 HIGH`	`🔵 MEDIUM`	`⚪ LOW`
0	0	1	0

You can find a list of all findings here

fix header normalization for xlsx

4dfbb85

hypermurea requested review from carlbrugger and apoddubn September 19, 2024 12:08

hypermurea self-assigned this Sep 19, 2024

flatfile-nullify bot reviewed Sep 19, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix header normalization for xlsx #612

Fix header normalization for xlsx #612

hypermurea commented Sep 19, 2024

coderabbitai bot commented Sep 19, 2024 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

flatfile-nullify bot Sep 19, 2024

flatfile-nullify bot commented Sep 19, 2024

-    await api.files.upload(fs.createReadStream(path.join(__dirname,'../ref/test-headers.xlsx')), {
+    const filePath = path.join(__dirname, '../ref/test-headers.xlsx');
+    const fileName = path.basename(filePath);
+    await api.files.upload(fs.createReadStream(filePath), {

Fix header normalization for xlsx #612

Are you sure you want to change the base?

Fix header normalization for xlsx #612

Conversation

hypermurea commented Sep 19, 2024

Please explain how to summarize this PR for the Changelog:

Tell code reviewer how and what to test:

coderabbitai bot commented Sep 19, 2024 • edited Loading

Walkthrough

Changes

Possibly related PRs

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

flatfile-nullify bot Sep 19, 2024

Choose a reason for hiding this comment

Javascript pathtraversal rule non literal fs filename

flatfile-nullify bot commented Sep 19, 2024

Nullify Code Vulnerabilities

1 findings found in this pull request

coderabbitai bot commented Sep 19, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)