[Data Liberation] Entity Stream Importer #1980
For decision points such as "does an element with this ID exist?", we could support large element sets via Bloom filters. On a "match" we'd optimistically try to insert and then backtrack on failure.
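Here's a rough TypeScript sketch of that idea. The filter sizing, the seeded FNV-1a hash, and the `insert`/`rollback` hooks are all illustrative assumptions, not an existing API:

```ts
// Minimal Bloom filter over string IDs; all names here are illustrative.
class BloomFilter {
  private bits: Uint8Array;
  constructor(private size: number, private hashes: number) {
    this.bits = new Uint8Array(Math.ceil(size / 8));
  }
  // Seeded FNV-1a; good enough for a sketch.
  private bitIndex(value: string, seed: number): number {
    let h = (2166136261 ^ seed) >>> 0;
    for (let i = 0; i < value.length; i++) {
      h = Math.imul(h ^ value.charCodeAt(i), 16777619) >>> 0;
    }
    return h % this.size;
  }
  add(value: string): void {
    for (let s = 0; s < this.hashes; s++) {
      const bit = this.bitIndex(value, s);
      this.bits[bit >> 3] |= 1 << (bit & 7);
    }
  }
  // false means "definitely absent"; true may be a false positive.
  mightContain(value: string): boolean {
    for (let s = 0; s < this.hashes; s++) {
      const bit = this.bitIndex(value, s);
      if (!(this.bits[bit >> 3] & (1 << (bit & 7)))) return false;
    }
    return true;
  }
}

// Hypothetical hooks into the target store.
declare function insert(id: string): Promise<void>;
declare function rollback(id: string): Promise<void>;

// "Does an element with this ID exist?" A negative answer is definitive;
// a positive answer triggers an optimistic insert with backtracking.
async function insertIfNew(id: string, filter: BloomFilter): Promise<void> {
  if (!filter.mightContain(id)) {
    filter.add(id);
    await insert(id);
    return;
  }
  try {
    await insert(id); // optimistic: the filter may have false-positived
  } catch {
    await rollback(id); // real duplicate: backtrack
  }
}
```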
A Review of Existing WXR Importers
I've been diving deep into various WordPress WXR importers and exporters. Here are the key patterns and insights I've discovered that could be valuable for this project:
Import steps
Imported data formats
Streaming and Memory Management
Handling Attachments
Entity Processing
Several patterns emerge around handling WordPress entities.
Progress Tracking and Recovery
Performance Optimizations
Extensibility Patterns
Let's build plumbing to load data into WordPress.
I think any data source can be represented as a stream of structured entities.
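To make that concrete, here's one possible shape for such a stream in TypeScript. The `WPEntity` and `EntityStream` names and fields below are placeholders for illustration, not an existing API:

```ts
// One possible shape for the entity stream. All names are illustrative.
type WPEntity = {
  type: 'post' | 'comment' | 'post_meta' | 'term';
  id: string;
  data: Record<string, unknown>;
};

// Any data source (a WXR file, a live site, a REST API) is exposed as an
// async iterable of entities, so consumers never see the source format.
type EntityStream = AsyncIterable<WPEntity>;

// Example source: lazily turn a WXR document into entities.
async function* wxrEntities(wxr: ReadableStream<Uint8Array>): EntityStream {
  // ...parse incrementally here; yielding one entity at a time keeps
  // memory usage flat regardless of the file size...
  yield { type: 'post', id: '1', data: { title: 'Hello world' } };
}
```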
Importing data
WXR importers all need to answer the same set of questions.
Let's view a WXR file as a flat list of entity objects such as posts, comments, meta, etc. We can then represent a lot of scenarios as list concatenation (see the sketch after this list):
WordPress entities ++ WXR entities
WXR entities ++ WXR entities
Entities before pause ++ entities after pause
WordPress 1 entities ++ WordPress 2 entities
WordPress 1 entities ++ WordPress 2 entities ++ WordPress 1 deletions ++ WordPress 2 deletions
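Concatenation itself is nearly free to express. A rough sketch, reusing the hypothetical `EntityStream` type from above:

```ts
// Concatenate any number of entity streams into one.
// "WordPress entities ++ WXR entities" becomes concat(site, wxr).
async function* concat(...streams: EntityStream[]): EntityStream {
  for (const stream of streams) {
    yield* stream;
  }
}
```

Because the streams are lazy, nothing is pulled from a source until the consumer asks for it, and "entities before pause ++ entities after pause" is just another stream appended once the import resumes.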
From there, we'd need to reduce those lists so that each object is represented by zero or one entry.
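Here's a rough sketch of that reduction with last-write-wins semantics, keyed by entity type and ID. The keying scheme and the `deleted` tombstone flag are assumptions for illustration, not a settled design:

```ts
// Reduce a concatenated stream so each (type, id) pair survives at most
// once. Later entries win; a hypothetical `deleted: true` tombstone drops
// the entity entirely, covering the "deletions" scenario above.
async function reduceEntities(stream: EntityStream): Promise<WPEntity[]> {
  const latest = new Map<string, WPEntity | null>();
  for await (const entity of stream) {
    const key = `${entity.type}:${entity.id}`;
    const deleted = (entity.data as { deleted?: boolean }).deleted === true;
    latest.set(key, deleted ? null : entity);
  }
  return [...latest.values()].filter((e): e is WPEntity => e !== null);
}

// The pieces then compose, e.g.:
//   const final = await reduceEntities(concat(wp1Entities, wp2Entities));
```

Note this sketch buffers the keyed map in memory; a real pipeline would likely spill that state to an index on disk (or into WordPress itself) to stay streaming-friendly.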
This is similar to how the Playground webapp journals MEMFS changes to OPFS. It also resembles map/reduce problems, where parts of the processing can be parallelized while others must happen sequentially.
I bet we can find a unified way of reasoning about all these scenarios and build a single data ingestion pipeline for any data source.
Let's see how far we can get with symbols and reasoning before writing code. I'm sure there are existing white papers and open-source projects working through this exact problem.
Resources
cc @brandonpayton