I'm a student, and I am developing this project in spare time. If you have any idea, or if you find any issue, feel free to make PRs or open issues!
This package is intended EXCLUSIVELY for demonstrative purposes. The author has no responsibility about the use you will do and the consequences of it. Keep in mind that running this code is forbidden, is just a demonstration about how scraping works. If you decide to run it anyway, you will assume all the responsibilities about the consequence it will have.
This package wants to retrieve tweets text, images links and video preview links from tweets, without using APIs.
Newer releases are merged into main. The development of a feature usually is found under a branch named feature/<feature_name>
. Once a feature is complete, it will be merged in the develop
branch.
Improvements in the documentation are first done in documentation_improvement
branch, and then merged either into
develop
or into main
, according to when in the time I decided to improve the documentation (in general, if possible I will do it in develop
).
Once features are ready, the branch develop
will be merged into main
, where you should find a more stable version of the package.
Notice that this is required for this package to work. To install it, it is enough to install chromium and the chromedriver.
As an example, in Arch Linux, to install chromium, it is enough to type in terminal:
sudo pacman -S chromium
The path of chromedriver
is found automatically. If your operating system for whatever reason gives it another name, pass it through the parameter chromedriver
.
Tested on:
- chromium (114.0.5735.45), DARK THEME (N.B.: the theme is important, since the loaded css will differ between light and dark. Future developments will add support for both)
- chromedriver (same version of chromium)
Creation of a virtual environment is highly recommended. In the home folder of a linux system:
python3 -m venv ./venv
Acrivate it:
source venv/bin/activate
Install the requirements of the package:
pip install -r requirements.txt
Install the package. It is highly suggested to install it in the editor mode, to allow changes without need of re-install it every time. Enter the path of the folder which contains this file, and type from the terminal:
pip install -e .
To test the installation, try to type (inside the virtual environment):
postget --help
If the response is correct and show the output of a help command, then it is working.
An example of command is (in the following a detailed explanation is provided):
postget --username '<your_username>' --password '<your_password>' --query '<query_to_be_performed>' --email_address '<mail_of_the_account>' --num_scrolls 10 --wait_scroll_base 3 --wait_scroll_epsilon 1 --mode 1
As a reference for import (for parameters meaning see the related section), after having installed it:
from postget.Posts import Posts
from postget.exceptions.exceptions import *
def main():
twitter_getter = Posts(username='<USERNAME>', password='<PASSWORD>', email_address='<MAIL>', query='', num_scrolls=2, mode=1, wait_scroll_base = 4, wait_scroll_epsilon = 2)
try:
twitter_getter.login()
except ElementNotLoaded as e:
raise e
print("Setting query in the object")
twitter_getter.set_query('test')
print("Start Search, this will input the query and perform the scroll with the selected mode")
try:
twitter_getter.search()
except ElementNotLoaded as e:
raise e
except NoTweetsReturned as e:
print(e)
print("Printing returned results and going home")
twitter_getter.print_results()
twitter_getter.go_home()
print("Clearing Results")
twitter_getter.clear_tweets()
print("quitting browser")
twitter_getter.quit_browser()
if __name__=='__main__':
main()
postget
searches for images and tweet in two different ways:
- Mode
0
, or simple search: it just grep all links of images and videos detected when scrolling - Mode
1
, or complete search: it greps also all information of tweet, such as id of the discussion and author. Notice that if you want to perform the search within two tweets' ids, it is necessary to operate in this mode.
Why keeping both? Because in mode 1
many things can go wrong, it is sufficient that one div search fails, that the entire search crashes.
You can use this package from command line.
How it will be runned: postget
will
- log in
- search for the query according to the operating mode
- perform scrolls
- print the images and video previews links or the tweets informations according to the operating mode
- close the driver
Notice that this means that a second call will imply a new login phase.
For information about the parameters, type in command line
postget --help
Import the class Posts
. Main parameters in the initialization:
username
(str
): Username that will be used to access the Twitter accountpassword
(str
): Password of the Username that will be used access the Twitter accountquery
(str
): Query to be searched on Twitterwait_scroll_base
(int
): base time to wait between one scroll and the subsequent (expressed in number of seconds, default 15)wait_scroll_epsilon
(float
): random time to be added to the base time to wait between one scroll and the subsequent, in order to avoid being detected as a bot (expressed in number of seconds, default 5)num_scrolls
(int
): number of scrolls to be performed, default 10since_id
(int
): id from which tweets will be saved (tweets with an id<
with respect than this value will be discarded). If set to-1
(default value), this parameter will not be considered. Notice that this will be considered only if alsomax_id
will be set, and will work only for search mode equal to1
max_id
(int
): id until tweets will be saved (tweets with an id>
with respect to this value, will be discarded). Notice that this will be considered only if alsomax_id
will be set, and will work only for search mode equal to1
mode
(int
): selects the operating mode, the default is0
.since
(str
): String of the date (excluded) from which the tweets will be returned. Format:YYYY-MM-DD
, UTC time. Temporarily supported only for mode1
. If you set also since_time, or until_time, this will be ignored. Wrong formats will be ignoreduntil
(str
): String of the date (included) until which the tweets will be returned. Format:YYYY-MM-DD
, UTC time. Temporarily supported only for mode1
. If you set also since_time, or until_time, this will be ignored. Wrong formats will be ignoredsince_time
(str
): String of the time from which the tweets will be returned. Format: timestamp in SECONDS, UTC time. Temporarily supported only for mode1
until_time
(str
): String of the time until which the tweets will be returned. Format: timestamp in SECONDS, UTC time. Temporarily supported only for mode1
headless
CURRENTLY NO MORE WORKING, use non headless mode (bool
if imported, just type--headles
if called from command line): If specified, runs the browser in headless mode. Unfortunately something changed from the first version of postget, and this is no more working. A section in the roadmap has been added for this.chromedriver
(str
): custom path to the chromedriver. if not specified, the code will try to find automatically the path ofchromedriver
email_address
(str
): email of the account, required since sometimes could be asked to insert it to verify the accountroot
(bool
if imported, just type--root
if called from command line): If specified, adds the option--no-sandbox
to the chrome options, needed to be runned in root mode. Please notice that running in root mode is not safe for security reasons.
A couple of words on advenced filters:
since_id
andmax_id
: if one of them is not set, or set to the default value, also the other correctly set will be ignored. If correctly set, tweets with the id within[since_id, max_id]
will be saved (extremes included).- Precedences among
since_id
,max_id
,since
,until
,since_time
,until_time
:- The definition of even just one parameter among
since
, oruntil
causes the invalidation ofsince_id
andmax_id
(they simply will not be considered). - The definition of even just one parameter among
since_time
oruntil_time
causes the invalidation ofsince
anduntil
(they simply will not be considered). The same reasoning will be applied tosince_id
andmax_id
when one amongsince_time
oruntil_time
is defined.
- The definition of even just one parameter among
Utility methods:
- Getters and setters, among which four important methods are:
get_actual_images()
to return the list of the images' urls (operating mode 0)get_video_preview()
to return the list of the videos' previews (operating mode 0)get_tweets_data()
to return the dictionary of tweets (operating mode 1)get_discussions_links()
to return the discussion links returned
login()
to perform the login according to the values ofusername
andpassword
. Raises:ElementNotLoaded
exception when the username input is not loaded within timeoutElementNotLoaded
exception when the button to click to go to the password input is not loaded within timeoutElementNotLoeaded
exception when the password input is not loaded within timeoutElementNotLoaded
exception when the button to click to go to the home page is not loaded within timeout
search()
according to the value ofquery
(it takes care of handling the operating mode). Raises:ElementNotLoaded
exception when the searchbox is not loaded in time, probably the page could be stuck in rendering and exceeded the timeoutNoTweetsReturned
when the simplified search returns no tweetsNoTweetsReturned
when the complete search returns no tweets.
go_home()
to go back to the homepage.clear_images()
to clear the list of the images' urls (operating mode 0)clear_video_previews()
to clear the list of the video previews' urls (operating mode 0)clear_tweets()
to clear the tweet dictionary (operating mode 1)quit_browser()
to close the instance of the driverprint_results()
to print the results (it takes care of verifying the operating mode)
Please, change and access the parameters with getters and setters.
- When the format of parameters
since
anduntil
is not correct (i.e., is notYYYY-MM-DD
), those parameters internally raises an exception, which will cause the reset to their default value (none
) before proceding.
- Support for custom digitation speed (standard time plus or minus epsilon)
- Thinking to move to poetry for better dependencies management and reproducibility
- Support for both dark and light themes
- Re-implement headless feature
- Fixed naming inconsistency for the CLI interface. This did not cause any bug, but has been changed for code consistency
- Added support for tweet text retrieval