feat: update exporter #4327
Conversation
sywhb commented Aug 26, 2024 (edited)
- update exported files in the following file structure (see the sketch after this list):
  - metadata_{start_page}_to_{end_page}.json
  - /content
    - {slug}.html
  - /highlights
    - {slug}.md
- add export job and GET API
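Roughly, the archive is assembled like this. This is a minimal sketch using the archiver package; the example values, file names, and the `buildExportArchive` helper are placeholders, not the exact code in this PR.

```ts
import archiver from 'archiver'
import { createWriteStream } from 'fs'

// Hypothetical example values; the real job pulls these from the user's library.
const slug = 'example-article'
const metadata = [{ slug, title: 'Example Article' }]
const content = '<html><body>Example</body></html>'
const highlights = '> an example highlight'

const buildExportArchive = async () => {
  const archive = archiver('zip', { zlib: { level: 9 } })
  archive.pipe(createWriteStream('export.zip'))

  // paginated metadata files at the archive root
  archive.append(JSON.stringify(metadata, null, 2), {
    name: 'metadata_0_to_100.json',
  })
  // readable HTML content under /content
  archive.append(content, { name: `content/${slug}.html` })
  // highlights as markdown under /highlights
  archive.append(highlights, { name: `highlights/${slug}.md` })

  await archive.finalize()
}

void buildExportArchive()
```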
if (content) {
  // Add content files to /content
  archive.append(content, {
    name: `content/${slug}.html`,
  })
}
@jacksonh I wonder if I should call the /content API instead to get the readable content, so we can avoid long queries in the db
Yeah that could help. It would hit the cache I guess?
Yeah, it would
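Fetching the readable content over HTTP instead of querying the db directly could look roughly like this. It's a rough sketch: the `/api/content/:id` path, the auth header, and the `fetchContent` helper are assumptions, not the actual route in this repo.

```ts
// Sketch: fetch readable HTML for an item via the content API (which can be
// answered from the cache) rather than a long db query.
const fetchContent = async (
  baseUrl: string,
  itemId: string,
  authToken: string
): Promise<string> => {
  const response = await fetch(`${baseUrl}/api/content/${itemId}`, {
    headers: { Authorization: authToken },
  })
  if (!response.ok) {
    throw new Error(`failed to fetch content: ${response.status}`)
  }
  return response.text()
}
```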
packages/api/src/jobs/export.ts (Outdated)

export const EXPORT_JOB_NAME = 'export'

export const exportJob = async (jobData: ExportJobData) => {
@jacksonh Another question: should we just use async jobs to export the data instead of calling the Cloud Run service?
@sywhb maybe we should try to start with an async job and get a baseline of performance in demo. Both our accounts are quite large there so it would give an idea.
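For reference, enqueuing the export as an async job could look roughly like this. It's a sketch assuming a BullMQ-style queue; the queue name, Redis connection, `JOB_VERSION`, and `enqueueExportJob` helper are placeholders rather than code from this PR.

```ts
import { Queue } from 'bullmq'

const EXPORT_JOB_NAME = 'export'
const JOB_VERSION = 'v1'

// Placeholder queue wiring; the real connection comes from config.
const queue = new Queue('backend-queue', {
  connection: { host: 'localhost', port: 6379 },
})

export const enqueueExportJob = async (userId: string) => {
  return queue.add(
    EXPORT_JOB_NAME,
    { userId },
    {
      // de-duplicate: one pending export per user and job version
      jobId: `${EXPORT_JOB_NAME}_${userId}_${JOB_VERSION}`,
      removeOnComplete: true,
      removeOnFail: true,
    }
  )
}
```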
}))

archive.append(JSON.stringify(metadata, null, 2), {
  name: `metadata_${cursor}_to_${cursor + size}.json`,
@jacksonh Here I actually export the paginated metadata into multiple JSON files to avoid having one very large JSON file
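The pagination loop looks roughly like this. In this sketch, `fetchPage` stands in for the real library-item query and `ItemMetadata` is a trimmed-down shape; both are assumptions for illustration.

```ts
import type { Archiver } from 'archiver'

type ItemMetadata = {
  id: string
  slug: string
  title: string
  savedAt: string
}

// Append one metadata_{from}_to_{to}.json file per page of items,
// so no single JSON file grows unboundedly large.
export const appendMetadataPages = async (
  archive: Archiver,
  fetchPage: (from: number, size: number) => Promise<ItemMetadata[]>,
  size = 100
) => {
  let cursor = 0
  for (;;) {
    const items = await fetchPage(cursor, size)
    if (items.length === 0) break

    archive.append(JSON.stringify(items, null, 2), {
      name: `metadata_${cursor}_to_${cursor + size}.json`,
    })

    cursor += size
  }
}
```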
  jobId: `${EXPORT_JOB_NAME}_${userId}_${JOB_VERSION}`,
  removeOnComplete: true,
  removeOnFail: true,
The export jobs are deduplicated, but we probably also want to limit the number of exports per user per day by checking the number of zip files in cloud storage
Yeah agreed, or even create an entry in postgres, as annoying as that can be
Yeah, makes sense to me
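One way to enforce the daily cap by counting zip files could look like this. It's a rough sketch assuming @google-cloud/storage and a hypothetical `exports/{userId}/{date}/` object prefix; the real layout, bucket, and limit are up to the PR.

```ts
import { Storage } from '@google-cloud/storage'

const storage = new Storage()
const MAX_EXPORTS_PER_DAY = 3 // placeholder limit

// Returns true if the user has not yet hit today's export limit,
// judged by the number of export objects already in cloud storage.
export const canExportToday = async (
  bucketName: string,
  userId: string
): Promise<boolean> => {
  const today = new Date().toISOString().slice(0, 10) // YYYY-MM-DD
  const [files] = await storage
    .bucket(bucketName)
    .getFiles({ prefix: `exports/${userId}/${today}/` })
  return files.length < MAX_EXPORTS_PER_DAY
}
```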
@jacksonh Could you help me to clarify some questions in the PR? Thank you!
Squawk Report 🚒: 4 violations across 1 file(s)
Could you take a look at this when you have time?
I think this looks good; we definitely need to test with some larger libraries in demo. I'm a little worried it could time out and then just get stuck in a loop restarting from the beginning.
Yeah, I agree. We should probably create a batch of sub-tasks so we can recover better and rate limit the number of export tasks.
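Splitting the export into batched sub-jobs could look roughly like this, so a failure only re-runs the failed batch instead of restarting from the beginning. The `export-batch` job name, batch size, and queue wiring are assumptions, not code from this PR.

```ts
import { Queue } from 'bullmq'

const BATCH_SIZE = 500 // placeholder batch size

// Enqueue one sub-job per batch of library items; each job carries its own
// offset, so a retry resumes at that batch rather than at the start.
export const enqueueExportBatches = async (
  queue: Queue,
  userId: string,
  totalItems: number
) => {
  for (let offset = 0; offset < totalItems; offset += BATCH_SIZE) {
    await queue.add(
      'export-batch',
      { userId, offset, size: BATCH_SIZE },
      {
        jobId: `export-batch_${userId}_${offset}`,
        removeOnComplete: true,
        removeOnFail: true,
      }
    )
  }
}
```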