Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix header normalization for xlsx #612

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

hypermurea
Copy link

Please explain how to summarize this PR for the Changelog:

  • change normalization of field keys for extractor and xlsx-extractor to change non-alphanumeric characters into character '_'
  • normalize field keys identically in both extractor and xlsx-extractor to prevent keys from overlapping

Tell code reviewer how and what to test:

  • identical or near-identical column names should not overlap: example test .xlsx file demonstrates a situation where column names will overlap if normalization is done differently between the two extractors

Copy link
Contributor

coderabbitai bot commented Sep 19, 2024

Walkthrough

The changes introduce a new test suite for the xlsx-extractor plugin, focusing on validating header normalization when uploading Excel files. Additionally, the prependNonUniqueHeaderColumns function is refined for better handling of values and key normalization, which is further enhanced by introducing a new helper function, normalizeKey. This function now includes regex to remove non-alphanumeric characters, improving key consistency. The modifications aim to ensure that the normalization process is effective and that the handling of headers is robust.

Changes

Files Change Summary
plugins/xlsx-extractor/src/header.normalization.spec.ts Added a test suite for validating header normalization in Excel uploads, including setup and teardown processes.
plugins/xlsx-extractor/src/utils.ts Refined the prependNonUniqueHeaderColumns function for better value processing and introduced normalizeKey for key normalization.
utils/extractor/src/index.ts Enhanced the normalizeKey function to include regex for removing non-alphanumeric characters, improving key consistency.

Possibly related PRs


Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b5feb28 and 4dfbb85.

Files ignored due to path filters (1)
  • plugins/xlsx-extractor/ref/test-headers.xlsx is excluded by !**/*.xlsx, !**/*.xlsx
Files selected for processing (3)
  • plugins/xlsx-extractor/src/header.normalization.spec.ts (1 hunks)
  • plugins/xlsx-extractor/src/utils.ts (2 hunks)
  • utils/extractor/src/index.ts (1 hunks)
Additional comments not posted (7)
plugins/xlsx-extractor/src/utils.ts (2)

8-9: LGTM!

The changes improve the clarity and conciseness of the code:

  • Simplifying the assignment of newValue using the logical OR operator (||) achieves the same outcome as the original conditional operator but is more concise.
  • Introducing the call to the new helper function normalizeKey encapsulates the normalization logic, promoting better code organization and potential reusability.

23-25: LGTM!

The normalizeKey function effectively normalizes the input string:

  • It removes leading and trailing whitespace using the trim method.
  • It replaces specific characters ('%', '$') with designated placeholders ('PERCENT', 'DOLLAR') using the replace method with regular expressions.
  • It removes any non-alphanumeric characters and replaces them with underscores using another replace method with a regular expression.

The normalization process ensures consistency in the keys used in the result object of the prependNonUniqueHeaderColumns function. The function is reusable and can be used in other parts of the codebase if similar normalization is required.

plugins/xlsx-extractor/src/header.normalization.spec.ts (4)

1-11: LGTM!

The imports are relevant and required for the test suite. The code segment looks good.


12-128: Excellent test coverage!

The test suite is well-structured and provides comprehensive coverage for the header normalization functionality of the xlsx-extractor plugin. The test case uploads a sample Excel file, handles success/failure scenarios, and validates the workbook, sheets, fields, and record data to ensure the normalization process is working as expected.

Great job on implementing this test suite!


35-38: Verify the getEnvironmentId() function.

The file upload is performed correctly by providing the necessary parameters. However, please ensure that the getEnvironmentId() function is properly defined and returns the expected environment ID.


57-92: LGTM!

The expected fields array is well-defined and covers various scenarios for validating the header normalization process. The normalized keys in the expected fields accurately represent the expected behavior of the normalization logic.

utils/extractor/src/index.ts (1)

214-214: Excellent addition to the normalizeKey function!

The introduced regex replacement (/[^a-zA-Z0-9]/g) effectively removes any non-alphanumeric character from the key and replaces it with an underscore. This change ensures that all keys are normalized consistently, regardless of the presence of special characters or whitespace.

By broadening the scope of key normalization, this modification improves data integrity and consistency when processing keys across the codebase. It aligns perfectly with the PR objective of normalizing field keys uniformly in both the extractor and the xlsx-extractor.

Great work on enhancing the key normalization process!


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    -- I pushed a fix in commit <commit_id>, please review it.
    -- Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    -- @coderabbitai generate unit testing code for this file.
    -- @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    -- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    -- @coderabbitai read src/utils.ts and generate unit testing code.
    -- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    -- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

console.log(event.topic)
})

await api.files.upload(fs.createReadStream(path.join(__dirname,'../ref/test-headers.xlsx')), {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nullify Code Language: TypeScript 🔵 MEDIUM Severity CWE-22

Javascript pathtraversal rule non literal fs filename

The application dynamically constructs file or path information. If the path
information comes from user-supplied input, it could be abused to read sensitive files,
access other users' data, or aid in exploitation to gain further system access.

User input should never be used in constructing paths or files for interacting
with the filesystem. This includes filenames supplied by user uploads or downloads.
If possible, consider hashing user input or using unique values and
use path.normalize to resolve and validate the path information
prior to processing any file functionality.

Example using path.normalize and not allowing direct user input:

// User input, saved only as a reference
// id is a randomly generated UUID to be used as the filename
const userData = {userFilename: userSuppliedFilename, id: crypto.randomUUID()};
// Restrict all file processing to this directory only
const basePath = '/app/restricted/';

// Create the full path, but only use our random generated id as the filename
const joinedPath = path.join(basePath, userData.id);
// Normalize path, removing any '..'
const fullPath = path.normalize(joinedPath);
// Verify the fullPath is contained within our basePath
if (!fullPath.startsWith(basePath)) {
    console.log("Invalid path specified!");
}
// Process / work with file
// ...

For more information on path traversal issues see OWASP:
https://owasp.org/www-community/attacks/Path_Traversal

Here's how you might fix this potential vulnerability

The modified code first constructs the full path of the file, then uses 'path.basename()' to get the filename. This ensures that only the filename is used in the filesystem operation, preventing directory traversal. Even though '__dirname' is not controllable by an attacker, it's a good practice to always validate or sanitize filenames in filesystem operations.

autoFixesExperimental

Use path.basename() to get the filename and prevent directory traversal

Suggested change
await api.files.upload(fs.createReadStream(path.join(__dirname,'../ref/test-headers.xlsx')), {
const filePath = path.join(__dirname, '../ref/test-headers.xlsx');
const fileName = path.basename(filePath);
await api.files.upload(fs.createReadStream(filePath), {

poweredByNullify

Reply with /nullify to interact with me like another developer
(you will need to refresh the page for updates)

@flatfile-nullify
Copy link

Nullify Code Vulnerabilities

1 findings found in this pull request

🔴 CRITICAL 🟡 HIGH 🔵 MEDIUM ⚪ LOW
0 0 1 0

You can find a list of all findings here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant