// HOW TO: Use URLSession, DispatchGroup, SwiftSoup, semaphores & recursion to mine an HTML source and implement a downloader

dracosveen · November 17, 2019, 8:13am

Hello there.

Every now and again I do something or rather struggle with something and then think “I should share this”. Maybe I am just egotistical. Oh well.

While everything that I am about to share has been done in macOS, it is still very relevant to iOS being that I have used Swift to write the code and more importantly, took what I learnt following Chris’ tutorials, to create my application.

This is my first application for macOS. So, what does it do? Well the premise is simple. Mine or parse an HTML source, in this case a file index listing, transform it into useable data and then download it. Sounds simple but believe you me it turned out to be challenging.

The first problem that I encountered was that I have a bunch of files I want to download, and I want to do it automatically. But I don’t have a source to tell me what to download so I have to go and figure this out. For us humans we would log onto a website, click a link and the file would download. Try doing that 800 times. No thank you.

[Image: html source]

Luckily for me I know a few things beforehand. I know the server address and I know the root path. So surely if I can enumerate all the files in the directories, I can then loop through the collection of URLs and download each one sequentially?

How do you enumerate HTML? Enter SwiftSoup. SwiftSoup is a Swift port of the Java library jsoup, and does much the same job as Python’s BeautifulSoup. It takes an HTML string and converts it into objects that you can then query. It is available as a CocoaPod, so it is easy to add to a project. I also found the documentation very helpful and complete.

[Image: SwiftSoup]

I query the root path using a URLSession data task. The returned HTML is then parsed into a document. I then loop through the elements, looking first for rows in a table and then, within each row, for an element whose alt attribute is “[VID]”. For each match I get the filename, build the URL using URLComponents and append the information to my array of custom file info objects.
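To make the URLComponents step a bit more concrete, here is a minimal sketch of building a download URL from a filename found in the listing. The function name `fileURL`, the `https` scheme and the example host and path are assumptions for illustration only, not the values used in the application.

```swift
import Foundation

/// Builds an absolute URL for a file found in a directory listing.
/// `host`, `directoryPath` and `fileName` are illustrative parameters.
func fileURL(host: String, directoryPath: String, fileName: String) -> URL? {
    var components = URLComponents()
    components.scheme = "https"   // assumption: the server is reached over HTTPS
    components.host = host
    // Make sure there is exactly one "/" between the directory and the file name.
    let base = directoryPath.hasSuffix("/") ? directoryPath : directoryPath + "/"
    // URLComponents percent-encodes characters such as spaces for us.
    components.path = base + fileName
    return components.url
}

let url = fileURL(host: "example.com", directoryPath: "/videos", fileName: "Lesson 01.mp4")
print(url?.absoluteString ?? "invalid URL")
```

The nice thing about going through URLComponents rather than gluing strings together is that awkward filenames (spaces and so on) get encoded for you.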

Awesome. Problem one done. I can get back a list of files and build up a url to download them.

Problem 2. Some of the files are in a directory other than the root path. Also, what if in future there is more than one layer of directories? Well, this of course calls for recursion. For those that don’t know, recursion is a problem-solving method in which a function calls itself under a given set of circumstances, usually determined by an if statement. Generally this is very easy to implement.
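For anyone new to recursion, here is a small synchronous sketch of the idea using an in-memory directory tree instead of a real server. The `Entry` type and `collectFiles` function are made up for this example.

```swift
/// A listing entry is either a file or a directory containing more entries.
enum Entry {
    case file(String)
    case directory(String, [Entry])
}

/// Walks the tree and returns the full path of every file.
/// The function calls itself whenever it meets a directory.
func collectFiles(in entries: [Entry], under path: String = "") -> [String] {
    var found: [String] = []
    for entry in entries {
        switch entry {
        case .file(let name):
            found.append(path + "/" + name)
        case .directory(let name, let children):
            // The recursive step: same function, one level deeper.
            found += collectFiles(in: children, under: path + "/" + name)
        }
    }
    return found
}

let tree: [Entry] = [
    .file("a.mp4"),
    .directory("sub", [
        .file("b.mp4"),
        .directory("deep", [.file("c.mp4")])
    ])
]
let allFiles = collectFiles(in: tree)
print(allFiles)
```

This works no matter how many layers of directories there are, which is exactly why recursion fits the problem.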

Enter problem 3. How do you implement recursion on a function that executes asynchronously? To put it simply, if you have worked with network code, you know that the work that is done with the response is done in a closure. This code executes asynchronously so as not to affect the user experience by locking up the main thread. However, naively recursing through an asynchronous call isn’t going to work, because the call returns immediately and you don’t know when its closure has finished. Also, we need to run many of these calls at once so as to minimise the amount of time spent walking up and down directories looking for files.

The solution turned out to be twofold. Firstly, we need to make our initial call to the root directory a synchronous one. The way that we do this is by using a semaphore.

[Image: semaphore]

For all intents and purposes we are blocking the main thread until the data task returns. This gets us our root HTML to parse.
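Here is a rough sketch of the semaphore pattern. To keep it runnable without a server, `fetchListing` is a hypothetical stand-in for the data task: it simply delivers a string on a background queue through a completion handler, the same way URLSession delivers its response.

```swift
import Foundation

/// Hypothetical stand-in for an asynchronous network call.
/// The result arrives later, on a background queue.
func fetchListing(path: String, completion: @escaping (String) -> Void) {
    DispatchQueue.global().async {
        completion("<html>listing for \(path)</html>")
    }
}

/// Turns the asynchronous call into a synchronous one by parking
/// the calling thread on a semaphore until the result is in.
func fetchListingSynchronously(path: String) -> String {
    let semaphore = DispatchSemaphore(value: 0)
    var result = ""
    fetchListing(path: path) { html in
        result = html
        semaphore.signal()   // wakes up the waiting thread
    }
    semaphore.wait()         // blocks here until signal() is called
    return result
}

let rootHTML = fetchListingSynchronously(path: "/videos")
print(rootHTML)
```

Keep in mind that wait() really does block the thread that calls it. If the completion handler were delivered on that same thread, signal() would never run and the call would hang, so this only works because the response comes back on a different queue.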

The second part of the solution is to use DispatchGroup to let multiple threads do the work and then wait for everything to complete before continuing. This is where it truly gets magical, so bear with me.

Step 1: Parse the HTML

[Image: parse me some html]

In the parsing we make a choice: am I processing a video or a directory? If it is a video, cool, go ahead and append the necessary information to the array. If it is a directory, enter the group and call the data task. Now the trick to using DispatchGroup is that your enters and leaves have to balance: one leave too many and you crash, one too few and the wait never returns. So we call group.enter() just before we execute the asynchronous data task call. This call goes off on its own thread and does what it needs to do while the enumeration process continues.

Step 2: The data task is created and allowed to run its course. It gets the HTML for the URL that was passed and calls the parse function. Once the parse function returns, we call group.leave() to exit. Importantly, if there is an error we need to call leave as well, or else we will wait and wait and wait.

[Image: leave]

So we are back in the HTML parse function, and step 1 executes again. Technically this can go on for quite long, as the tree could be several directories deep. However, since the calls run on multiple threads, we are processing the whole width of the tree at the same time while processing its depth sequentially. For those of you whose head is hurting, I basically took a process that took about 25 secs to complete and brought it down to 1.5 secs on my fibre connection.

Step 3: Eventually the execution will return from the initial call to parse the html. When this happens it is highly likely that there are still queries and html parsing happening in background tasks. So we call group.wait() which blocks the main thread while all of these complete. We then return all the files.

[Image: wait]
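The three steps can be sketched together roughly as follows. This is not the code from the screenshots: `ListingCrawler`, `fetch` and the in-memory `sampleListings` dictionary are assumptions that stand in for the real data task and server, so the example runs on its own. Entries ending in "/" are treated as directories, and "missing/" has no listing so that the error path is exercised too.

```swift
import Foundation

// Hypothetical stand-in for the server: directory path -> entries.
let sampleListings: [String: [String]] = [
    "/": ["a.mp4", "sub/", "missing/"],
    "/sub/": ["b.mp4", "deep/"],
    "/sub/deep/": ["c.mp4"]
]

final class ListingCrawler {
    private let listings: [String: [String]]
    private let group = DispatchGroup()
    private let lock = NSLock()          // protects `files`, which several threads append to
    private var files: [String] = []

    init(listings: [String: [String]]) {
        self.listings = listings
    }

    // Stand-in for the asynchronous data task: answers on a background
    // queue, with nil playing the role of a network error.
    private func fetch(_ path: String, completion: @escaping ([String]?) -> Void) {
        DispatchQueue.global().async {
            completion(self.listings[path])
        }
    }

    // Step 1: files are recorded, directories are fetched and parsed recursively.
    private func parse(_ entries: [String], at path: String) {
        for entry in entries {
            if entry.hasSuffix("/") {
                let child = path + entry
                group.enter()                       // just before the async call
                fetch(child) { childEntries in
                    // Step 2: leave exactly once, on success and on failure alike.
                    defer { self.group.leave() }
                    guard let childEntries = childEntries else { return }
                    self.parse(childEntries, at: child)
                }
            } else {
                lock.lock()
                files.append(path + entry)
                lock.unlock()
            }
        }
    }

    // Step 3: block until every enter() has been matched by a leave().
    func crawl() -> [String] {
        parse(listings["/"] ?? [], at: "/")
        group.wait()
        lock.lock()
        defer { lock.unlock() }
        return files.sorted()
    }
}

let found = ListingCrawler(listings: sampleListings).crawl()
print(found)
```

Two things are worth pointing out. Because the nested enter() calls happen inside parse before the outer closure reaches its leave(), the group count cannot drop to zero while work is still being scheduled. And since several threads append to the same array, the sketch guards it with a lock. If blocking in wait() is a problem for your UI, group.notify(queue: .main) is the non-blocking alternative.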

So now we have an array of files that we can loop through and download. Pretty much down hill from here right?

Well, no. It’s all good and well firing off a downloadTask and then waiting for the file to download. Typically we would want to know what we are downloading, how fast we are downloading, how much we have downloaded, and maybe be able to resume a download if the internet cuts out for a bit or if we want to pause the download. What about multiple downloads?

So the best way to do this is to create an object for each file that you want to download. This means that you want to create your own custom class that is a container for all the logic for a specific file. You then want to create many of these objects and access them.

So firstly we create a DownloadTask class.

[Image: classy]

This class has several properties, inherits from NSObject and conforms to a couple of URLSession delegate protocols.

Next we add some functions to start, pause and resume the download.

[Image: many many functions]
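Since the screenshots are not available here, this is only a guess at the general shape of such a class, not the original code. The names `FileDownload` and `DownloadProgressDelegate` and every member are assumptions. It wraps one URLSessionDownloadTask and exposes start, pause and resume.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking   // needed for URLSession outside Apple platforms
#endif

/// Hypothetical delegate used to report progress back to the UI layer.
protocol DownloadProgressDelegate: AnyObject {
    func download(_ download: FileDownload, didUpdateFraction fraction: Double)
}

final class FileDownload: NSObject, URLSessionDownloadDelegate {
    let sourceURL: URL
    weak var delegate: DownloadProgressDelegate?
    private(set) var resumeData: Data?
    private(set) var isDownloading = false
    private var task: URLSessionDownloadTask?
    // Created lazily so that `self` can be handed over as the session delegate.
    private lazy var session: URLSession =
        URLSession(configuration: .default, delegate: self, delegateQueue: nil)

    init(sourceURL: URL) {
        self.sourceURL = sourceURL
        super.init()
    }

    func start() {
        task = session.downloadTask(with: sourceURL)
        task?.resume()
        isDownloading = true
    }

    func pause() {
        // Cancelling this way hands back the data needed to pick up later.
        task?.cancel(byProducingResumeData: { [weak self] data in
            self?.resumeData = data
        })
        isDownloading = false
    }

    func resume() {
        // Without resume data we can only start from scratch.
        guard let data = resumeData else { start(); return }
        task = session.downloadTask(withResumeData: data)
        task?.resume()
        isDownloading = true
    }

    // Called repeatedly while bytes arrive.
    func urlSession(_ session: URLSession, downloadTask: URLSessionDownloadTask,
                    didWriteData bytesWritten: Int64, totalBytesWritten: Int64,
                    totalBytesExpectedToWrite: Int64) {
        guard totalBytesExpectedToWrite > 0 else { return }   // size may be unknown
        let fraction = Double(totalBytesWritten) / Double(totalBytesExpectedToWrite)
        delegate?.download(self, didUpdateFraction: fraction)
    }

    // Required by URLSessionDownloadDelegate; the file at `location` is temporary.
    func urlSession(_ session: URLSession, downloadTask: URLSessionDownloadTask,
                    didFinishDownloadingTo location: URL) {
        isDownloading = false
    }
}

let download = FileDownload(sourceURL: URL(string: "https://example.com/videos/lesson.mp4")!)
print(download.isDownloading)
```

Nothing touches the network until start() is called. The delegate callbacks arrive on the session’s own queue, so anything that updates the UI should hop over to the main queue first.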

We add our start up download function. Notice this executes asynchronously. Also notice the start time. This is needed to calculate the speed of the download.

[Image: a fresh start]

We also add the function that calculates progress and speed and returns this information to the main thread via our own delegate.

[Image: calculations…glorious calculations]
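The arithmetic itself is simple enough to sketch as a pure function. `DownloadProgress` and `progress(...)` are names invented for this example. The byte counts are what the didWriteData delegate callback hands you, and the start time is the one recorded when the download began.

```swift
import Foundation

struct DownloadProgress {
    let fraction: Double         // 0.0 to 1.0
    let bytesPerSecond: Double   // average speed since the start time
}

/// Computes progress and average speed from the running byte counts.
func progress(bytesWritten: Int64, bytesExpected: Int64,
              startTime: Date, now: Date = Date()) -> DownloadProgress {
    let elapsed = now.timeIntervalSince(startTime)
    // The expected size can be unknown (reported as a negative number),
    // and elapsed can be zero on the very first callback, so guard both divisions.
    let fraction = bytesExpected > 0 ? Double(bytesWritten) / Double(bytesExpected) : 0
    let speed = elapsed > 0 ? Double(bytesWritten) / elapsed : 0
    return DownloadProgress(fraction: fraction, bytesPerSecond: speed)
}

// 5 MB of a 20 MB file after 10 seconds.
let sample = progress(bytesWritten: 5_000_000, bytesExpected: 20_000_000,
                      startTime: Date(timeIntervalSince1970: 0),
                      now: Date(timeIntervalSince1970: 10))
print(sample.fraction, sample.bytesPerSecond)
```

Keeping the calculation separate from the delegate callback makes it easy to check on its own. The result can then be handed to the UI with DispatchQueue.main.async.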

We implement the function that will save the file once it is downloaded. The really cool thing about downloadTask is that it downloads to a “temp” directory of sorts and once the download is complete you can then copy it to where you want it.

[Image: copy me when I am done]
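A minimal sketch of the save step, assuming you want to move the temporary file into a destination folder and overwrite anything already there. `saveDownloadedFile` is a made-up helper, and the temporary files below only simulate what URLSession would hand to didFinishDownloadingTo.

```swift
import Foundation

/// Moves a finished download from its temporary location to `destination`,
/// creating the destination folder and replacing any existing file.
func saveDownloadedFile(from tempLocation: URL, to destination: URL) throws {
    let fileManager = FileManager.default
    try fileManager.createDirectory(at: destination.deletingLastPathComponent(),
                                    withIntermediateDirectories: true)
    if fileManager.fileExists(atPath: destination.path) {
        try fileManager.removeItem(at: destination)
    }
    try fileManager.moveItem(at: tempLocation, to: destination)
}

// Simulate a finished download with a scratch file.
let scratch = FileManager.default.temporaryDirectory
    .appendingPathComponent(UUID().uuidString)
try FileManager.default.createDirectory(at: scratch, withIntermediateDirectories: true)
let tempFile = scratch.appendingPathComponent("download.tmp")
try Data("video bytes".utf8).write(to: tempFile)

let target = scratch.appendingPathComponent("Saved/lesson.mp4")
try saveDownloadedFile(from: tempFile, to: target)
print(FileManager.default.fileExists(atPath: target.path))
```

The move has to happen inside the didFinishDownloadingTo callback itself, because the temporary file is removed once that method returns.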

Last but not least we add the error trap. In the pause function, when we hit pause we save some resumeData. Presumably this is some data to tell the session where the incomplete file is, where to pick up the download and all the header and request information set during the initial call to start downloading.

But what about when the download is interrupted? Let’s say the network goes down, or the internet disconnects? There are many things that can happen. I tested this by physically unplugging my network cable.

[Image: to resume or not to resume…]

So when there is an error, this function executes and saves the resumeData, if any, so that it can be used later on. Works wonderfully.

So now that we have our class, we enumerate through the array of file information that we created from mining the HTML, and create var downloadTasks = [DownloadTask](). Fill that up and we have a list on which we can call start, e.g. downloadTasks[0].start(). In fact, if you really wanted to, you could create this up front directly from the HTML parse function. I am not doing that, as I am matching these files with data from another scraper to determine what I want to download.

[Image: it’s aliveeee]

Above is my completed interface.

Next steps. There are 2 very glaringly obvious issues. If the power fails then I cannot resume my download. I plan to combat this by writing the resumedata to a SQLite database every 15 seconds or so. I use SQLite to save my configuration so it is a hop, skip and a jump to saving the download state as well.

The second issue is that even though I can run multiple downloads simultaneously, there is only one progress bar etc to show progress. I plan to add my downloadTask into the table so that each row will have its own downloadTask and progress bar.

So I hope that this has been interesting. I hope you made it to the end. While this application is a macOS application, the code I have written and the functions and methods I have used here can be used in iOS as well. Obviously you are not going to write a file downloader for your phone but you may want to show custom progress and allow resume functionality. Most importantly there are some very cool principles like recursion, multi-threading and implementing your own classes that could come in handy.

Happy coding.

mraghbeer · April 28, 2021, 12:59pm

Hello dracosveen,

I enjoyed reading your post and I am learning to use SwiftSoup; I am able to parse simple HTML documents, but am having trouble with complex HTML with JavaScript code and little HTML.

Can you share your process on using SwiftSoup with URLSession or point me in the right direction where I can learn how to properly use SwiftSoup with complex HTML?

Thank you,
Mo

dracosveen · April 29, 2021, 12:58am

Hey Mo,

So glad that someone can find a use for my ramblings…

So I am no guru when it comes to HTML etc but SwiftSoup is used to parse HTML. Javascript will basically create the HTML page for you and apply transforms along the way like when you click a button and something changes colour.

If you’re parsing JavaScript then it sounds like you are parsing an unrendered webpage. Again I am no guru but that is not what SwiftSoup is for.

SwiftSoup is best used (imo) to parse data displayed on a webpage. Like the way that I used it to create a list object of files as displayed on a page. Another example would be to parse a tabulated page so to create an object of the table data. Yes SwiftSoup can parse any html page, but what it creates isn’t going to mean anything if the html is not a representation of data.

Maybe I could give a better answer if I knew what you were trying to parse and offer some pointers that way.

mraghbeer · May 13, 2021, 8:24pm

@dracosveen, thank you so much for your thoughtful reply. I have decided to take another route.

Instead of parsing the html/javascript page, I plan to use dispatch group to make two API calls.

Now I just have to learn how to use dispatch group!