Return rabota.by crawler #4

Merged (2 commits, Oct 28, 2024)
21 changes: 10 additions & 11 deletions README.md
@@ -1,9 +1,7 @@

# Collection of web crawlers
# Collection of Java-based web crawlers
[![Java CI with Maven](https://github.com/andrei-punko/java-crawlers/actions/workflows/maven.yml/badge.svg)](https://github.com/andrei-punko/java-crawlers/actions/workflows/maven.yml)

Decided to keep all crawlers in one repo so that other repos stay free of calls to external resources

## Prerequisites

- Maven 3
@@ -14,13 +12,14 @@ Decided to keep all crawlers in one repo to make other repos free from making ca
mvn clean install
```

## Crawlers
## Crawler for Orthodox torrent tracker [pravtor.ru](http://pravtor.ru)
Check [SearchUtilTest](pravtor.ru-crawler/src/test/java/by/andd3dfx/pravtor/util/SearchUtilTest.java)
and [FileUtilTest](pravtor.ru-crawler/src/test/java/by/andd3dfx/pravtor/util/FileUtilTest.java) for details.

### The [pravtor.ru](http://pravtor.ru) crawler
See [SearchUtilTest](pravtor.ru-crawler/src/test/java/by/andd3dfx/pravtor/util/SearchUtilTest.java)
and [FileUtilTest](pravtor.ru-crawler/src/test/java/by/andd3dfx/pravtor/util/FileUtilTest.java)
To run a search, execute the [run-search.bat](pravtor.ru-crawler/run-search.bat) script.
Collected data will be placed into the [result.xls](pravtor.ru-crawler/sandbox/result.xls) file in the `sandbox` folder

### How to use
```
./pravtor.ru-crawler/search.bat
```
## Crawler for vacancies aggregator [rabota.by / hh.ru](http://rabota.by)
Check [RabotaByJobSearchUtil](rabota.by-crawler/src/main/java/by/andd3dfx/sitesparsing/rabotaby/RabotaByJobSearchUtil.java) for details.

To run a search, execute the `main()` method of the [MainApp](rabota.by-crawler/src/main/java/by/andd3dfx/sitesparsing/rabotaby/MainApp.java) class
1 change: 1 addition & 0 deletions pom.xml
@@ -20,5 +20,6 @@

<modules>
<module>pravtor.ru-crawler</module>
<module>rabota.by-crawler</module>
</modules>
</project>
pravtor.ru-crawler/src/main/java/by/andd3dfx/pravtor/util/SearchUtil.java
@@ -17,7 +17,7 @@
import static java.lang.Thread.sleep;

/**
* Util to perform singleSearch on http://pravtor.ru torrent tracker
* Util to perform search on <a href="http://pravtor.ru">pravtor.ru</a> torrent tracker
*/
@Slf4j
public class SearchUtil {
78 changes: 78 additions & 0 deletions rabota.by-crawler/pom.xml
@@ -0,0 +1,78 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>java-crawlers</artifactId>
<groupId>by.andd3dfx</groupId>
<version>1.0-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>

<artifactId>rabota.by-crawler</artifactId>

<properties>
<lombok.version>1.18.30</lombok.version>
<slf4j.version>1.7.36</slf4j.version>
</properties>

<dependencies>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.13.2.2</version>
</dependency>
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.14.3</version>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>${lombok.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>${slf4j.version}</version>
</dependency>

<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.13.2</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.hamcrest</groupId>
<artifactId>hamcrest-all</artifactId>
<version>1.3</version>
<scope>test</scope>
</dependency>
</dependencies>

<build>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.12.1</version>
<configuration>
<annotationProcessorPaths>
<path>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>${lombok.version}</version>
</path>
</annotationProcessorPaths>
</configuration>
</plugin>
</plugins>
</build>
</project>
rabota.by-crawler/src/main/java/by/andd3dfx/sitesparsing/rabotaby/MainApp.java
@@ -0,0 +1,22 @@
package by.andd3dfx.sitesparsing.rabotaby;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;

public class MainApp {

public static void main(String[] args) throws IOException {
if (args.length != 1) {
throw new IllegalArgumentException("Path to output file should be populated!");
}
var searchUtil = new RabotaByJobSearchUtil();

LinkedHashMap<String, Integer> statisticsSortedMap = searchUtil.collectStatistics("java");
Path path = Paths.get(args[0]);
byte[] strToBytes = statisticsSortedMap.toString().getBytes();
Files.write(path, strToBytes);
}
}
rabota.by-crawler/src/main/java/by/andd3dfx/sitesparsing/rabotaby/RabotaByJobSearchUtil.java
@@ -0,0 +1,105 @@
package by.andd3dfx.sitesparsing.rabotaby;

import by.andd3dfx.sitesparsing.rabotaby.dto.SingleSearchResult;
import by.andd3dfx.sitesparsing.rabotaby.dto.VacancyData;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.stream.Collectors;

import lombok.extern.slf4j.Slf4j;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

@Slf4j
public class RabotaByJobSearchUtil {

private final String URL_PREFIX = "http://rabota.by";
private final String USER_AGENT = "Mozilla";
private final String searchUrlFormat = URL_PREFIX + "/search/vacancy?area=1002&text=%s&page=%d";

public List<VacancyData> batchSearch(String searchString) {
List<VacancyData> result = new ArrayList<>();

String nextPageUrl = buildSearchUrl(searchString);
while (nextPageUrl != null) {
SingleSearchResult singleSearchResult = singleSearch(nextPageUrl);
result.addAll(singleSearchResult.getDataItems());
nextPageUrl = singleSearchResult.getNextPageUrl();
}

return result;
}

SingleSearchResult singleSearch(String searchUrl) {
try {
Document document = Jsoup
.connect(searchUrl)
.userAgent(USER_AGENT).get();

Elements elements = document
.select("a[data-qa=serp-item__title]");

// Fetch details for each found vacancy; collect through the stream itself
// instead of mutating a shared ArrayList from parallel threads
List<VacancyData> vacancyDataList = elements.parallelStream()
.map(element -> element.select("a").attr("href"))
.map(this::retrieveVacancyDetails)
.collect(Collectors.toList());

final Elements nextPageItem = document.select("a[data-qa=pager-next]");
String nextPageUrl = nextPageItem.isEmpty() ? null : URL_PREFIX + nextPageItem.attr("href");
return new SingleSearchResult(vacancyDataList, nextPageUrl);
} catch (IOException e) {
throw new RuntimeException("Single search failed", e);
}
}

private VacancyData retrieveVacancyDetails(String searchUrl) {
log.info("Retrieve vacancy details for {}", searchUrl);
Document document;
try {
document = Jsoup
.connect(searchUrl)
.userAgent(USER_AGENT).get();
} catch (IOException e) {
throw new RuntimeException("Retrieve details failed", e);
}

return VacancyData.builder()
.url(document.baseUri())
.companyName(document.select("a[class=vacancy-company-name]").text())
.textContent(document.select("div[data-qa=vacancy-description]").text())
.keywords(document.select("span[data-qa=bloko-tag__text]")
.stream()
.map(Element::text)
.flatMap(keyword -> Arrays.stream(keyword.split(", ")))
.flatMap(keyword -> Arrays.stream(keyword.split(" & ")))
.collect(Collectors.toSet())
)
.addressString(document.select("div[class^=vacancy-address-text]").text())
.build();
}

String buildSearchUrl(String searchString) {
return String.format(searchUrlFormat, searchString, 0);
}

public LinkedHashMap<String, Integer> collectStatistics(List<VacancyData> result) {
final Statistics statistics = new Statistics();
result.forEach(vacancyData -> vacancyData.getKeywords().forEach(statistics::putKeyword));
return statistics.buildSortedMap();
}

public LinkedHashMap<String, Integer> collectStatistics(String searchString) {
return collectStatistics(batchSearch(searchString));
}
}
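The keyword handling in `retrieveVacancyDetails` above splits composite tag texts on `", "` and `" & "` before collecting them into a set. A standalone sketch of just that splitting step (the tag strings are made-up sample values, not real site data):

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class KeywordSplitDemo {

    public static void main(String[] args) {
        // Sample tag texts as they might appear on a vacancy page (illustrative values)
        Set<String> keywords = Stream.of("Java, Spring", "CI & CD", "SQL")
                // split comma-separated composites like "Java, Spring"
                .flatMap(text -> Arrays.stream(text.split(", ")))
                // split ampersand pairs like "CI & CD"
                .flatMap(text -> Arrays.stream(text.split(" & ")))
                .collect(Collectors.toSet());
        System.out.println(keywords); // contains Java, Spring, CI, CD, SQL
    }
}
```

Collecting into a `Set` also deduplicates keywords that appear in more than one tag.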
rabota.by-crawler/src/main/java/by/andd3dfx/sitesparsing/rabotaby/Statistics.java
@@ -0,0 +1,37 @@
package by.andd3dfx.sitesparsing.rabotaby;

import static java.util.stream.Collectors.toMap;

import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class Statistics {

private Map<String, Integer> keywordToFreqMap = new HashMap<>();

public void putKeyword(String keyword) {
// increment the counter for this keyword, starting at 1 on first occurrence
keywordToFreqMap.merge(keyword, 1, Integer::sum);
}

public Integer get(String keyword) {
return keywordToFreqMap.get(keyword);
}

public LinkedHashMap<String, Integer> buildSortedMap() {
return keywordToFreqMap
.entrySet()
.stream()
.sorted(Collections.reverseOrder(Map.Entry.comparingByValue()))
.collect(toMap(
Map.Entry::getKey,
Map.Entry::getValue,
(e1, e2) -> e2,
LinkedHashMap::new
));
}
}
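The counting and descending-by-frequency sort performed by `Statistics` can be exercised in isolation; a minimal self-contained sketch of the same idea (the sample keywords are illustrative):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class KeywordFrequencyDemo {

    public static void main(String[] args) {
        Map<String, Integer> freq = new HashMap<>();
        // count each keyword occurrence, as Statistics.putKeyword does
        for (String keyword : new String[]{"java", "spring", "java", "sql", "java"}) {
            freq.merge(keyword, 1, Integer::sum);
        }

        // sort entries by frequency, highest first, keeping that order in a LinkedHashMap
        LinkedHashMap<String, Integer> sorted = freq.entrySet().stream()
                .sorted(Collections.reverseOrder(Map.Entry.comparingByValue()))
                .collect(Collectors.toMap(
                        Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> b, LinkedHashMap::new));
        System.out.println(sorted); // "java" (count 3) comes first
    }
}
```

The `LinkedHashMap::new` supplier matters here: a plain `Collectors.toMap` would return a `HashMap` and discard the sort order.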
rabota.by-crawler/src/main/java/by/andd3dfx/sitesparsing/rabotaby/dto/SingleSearchResult.java
@@ -0,0 +1,14 @@
package by.andd3dfx.sitesparsing.rabotaby.dto;

import lombok.AllArgsConstructor;
import lombok.Getter;

import java.util.List;

@Getter
@AllArgsConstructor
public class SingleSearchResult {

private List<VacancyData> dataItems;
private String nextPageUrl;
}
rabota.by-crawler/src/main/java/by/andd3dfx/sitesparsing/rabotaby/dto/VacancyData.java
@@ -0,0 +1,24 @@
package by.andd3dfx.sitesparsing.rabotaby.dto;

import java.util.Set;
import lombok.Builder;
import lombok.Data;

@Builder
@Data
public class VacancyData {

private String url;
private String companyName;
private String textContent;
private Set<String> keywords;
private String addressString;

@Override
public String toString() {
return "VacancyData{" +
"url='" + url + '\'' +
", keywords=" + keywords +
'}';
}
}
rabota.by-crawler/src/test/java/by/andd3dfx/sitesparsing/rabotaby/RabotaByJobSearchUtilTest.java
@@ -0,0 +1,32 @@
package by.andd3dfx.sitesparsing.rabotaby;

import org.junit.Before;
import org.junit.Test;

import java.util.LinkedHashMap;

import static org.hamcrest.CoreMatchers.is;
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.greaterThanOrEqualTo;

public class RabotaByJobSearchUtilTest {

private RabotaByJobSearchUtil util;

@Before
public void setup() {
util = new RabotaByJobSearchUtil();
}

@Test
public void search() {
var result = util.singleSearch(util.buildSearchUrl("java"));

assertThat("Next url should be present", result.getNextPageUrl(), is(
"http://rabota.by/search/vacancy?area=1002&text=java&page=1&hhtmFrom=vacancy_search_list"));
assertThat("At least 20 items expected", result.getDataItems().size(), greaterThanOrEqualTo(20));

LinkedHashMap<String, Integer> statisticsSortedMap = util.collectStatistics(result.getDataItems());
// TODO: add asserts
}
}
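One way to fill the `// TODO: add asserts` in the test above, without depending on exact live-site counts, is to assert only the ordering property that `buildSortedMap` is expected to guarantee. A hedged sketch with a hypothetical helper and made-up sample data:

```java
import java.util.LinkedHashMap;

public class SortedMapOrderCheck {

    // Hypothetical helper: true when values appear in non-increasing order,
    // which is the invariant the statistics map should satisfy
    static boolean isSortedDescending(LinkedHashMap<String, Integer> map) {
        int previous = Integer.MAX_VALUE;
        for (int value : map.values()) {
            if (value > previous) {
                return false;
            }
            previous = value;
        }
        return true;
    }

    public static void main(String[] args) {
        // Illustrative sample, not real crawl results
        LinkedHashMap<String, Integer> sample = new LinkedHashMap<>();
        sample.put("java", 20);
        sample.put("spring", 12);
        sample.put("sql", 5);
        System.out.println(isSortedDescending(sample)); // true for this sample
    }
}
```

In the live test this could back an assertion such as "the statistics map is non-empty and sorted by descending frequency", which stays stable as site content changes.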