Posted: 2017-10-17

Long distance flight pricing exploration WAW-ICN

Keywords: data mining, data analysis, python, web-crawling

Travelling is fun, the further the better. However, long-distance flights cost a lot of silver. Naturally, a question arises: does checking Skyscanner every few days for a great (cheap) deal make any sense, or are there patterns in flight pricing?


The goal is to determine which factors impact the end-user pricing of flights between Warsaw (WAW, Poland) and Seoul (ICN, South Korea). Assumptions:

  • Round-trip tickets over 2000 PLN (in 2017 money) are considered expensive.
  • Flights over 24 hours are considered unbearable.
  • We have infinite predictive capacity - we can plan vacations 200+ days ahead.
  • We fly economy.


We start by plotting flight prices grouped by airline, because if the prices are fixed, there is no point in searching. The data set contains flights costing over 4000 PLN, but since we live in reality, not la la land, we can drop them.

Figure 1. Price [PLN] vs airline [name]

Fortunately, Figure 1 suggests that price fixing isn't the case. Now let's check whether the booking lead time is reflected in pricing. Maybe there are no flights under 2000 PLN when looking only 10 days ahead? Maybe checking booking sites every 10 minutes makes sense.

Figure 2. Price [PLN] vs advance in flight search [days range]

Figure 2 suggests that good deals may be found at any time, so there is no reason to hit F5 every 10 minutes on Skyscanner. Flight duration is also an important factor: flying for over 24 hours is a no-no. Maybe cheap flights must last long, making our whole data exploration pointless?

Figure 3. Flight duration [hours] vs price [PLN]

Again, the data in Figure 3 suggest that you can get a cheap flight under 24 hours. Okay, since we know the cards aren't stacked against us, let's find the sweet spot for booking. We will try to pinpoint what lead time we need in order to find a cheap flight with a duration under 20 hours.

Figure 4. Price [PLN] vs advance in flight search [days]

And again, the gods of data were kind to us. Figure 4 suggests that, when planning for this particular route, a lead time of approximately 220 days is optimal. Now, with this newly acquired knowledge, we can plan our next holiday.

Code grimoire

Date conversion

Converting ISO string date format into python datetime object for timedelta computation.

import datetime

def date_converter(dateStr):
    # Parse an ISO 8601 date string ("YYYY-MM-DD") by slicing its fields.
    year = int(dateStr[:4])
    month = int(dateStr[5:7])
    day = int(dateStr[8:10])
    return datetime.date(year, month, day)
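For reference, the standard library can do this parsing directly; a minimal sketch using strptime (the function name here is my own, not from the notebook):

```python
import datetime

def date_converter_strptime(dateStr):
    # Equivalent parsing via the standard library; raises ValueError
    # on malformed input instead of silently mis-slicing.
    return datetime.datetime.strptime(dateStr, "%Y-%m-%d").date()

print(date_converter_strptime("2017-10-17"))  # -> 2017-10-17
```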

Booking time advance

In order to check whether the search lead time (looking for a flight 1 month ahead vs 3 months ahead) affects the ticket price, the advance between search date and departure date has to be computed in days.

searchAdvList = []
for row in data.itertuples():
    # Difference between departure date and search date, in whole days.
    searchAdv = date_converter(row.dateFrom) - date_converter(row.dateSearch)
    searchAdvList.append(searchAdv.days)
data['searchAdv'] = pd.Series(searchAdvList, index=data.index)
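The same computation can be vectorized with pandas instead of a row loop; a sketch on toy data, assuming the real frame has the same `dateFrom` and `dateSearch` columns of ISO date strings:

```python
import pandas as pd

# Toy data standing in for the real scraped set.
data = pd.DataFrame({
    "dateFrom":   ["2017-12-01", "2018-03-15"],
    "dateSearch": ["2017-10-17", "2017-10-17"],
})

# Parse both columns and subtract, yielding the lead time in whole days.
data["searchAdv"] = (
    pd.to_datetime(data["dateFrom"]) - pd.to_datetime(data["dateSearch"])
).dt.days

print(data["searchAdv"].tolist())  # -> [45, 149]
```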

Generate categories by range

Making sense of scattered data is hard; that's why grouping data by a key is a good idea.

def gen_categories(minValue, maxValue, step):
    # Build a list of (low, high] range tuples, starting with (0, minValue).
    categories = [(0, minValue)]
    for i in range(minValue, maxValue, step):
        categories.append((i, i + step))
    return categories

Using our generated range we can describe each data point accordingly.

advCatList = gen_categories(10,300,30)

advCat = []
categories = advCatList
for i in df["searchAdv"].values:
    cat = ""
    for j in categories:
        # Label each value with the "low-high" range it falls into.
        if i > j[0] and i <= j[1]:
            cat = str(j[0]) + "-" + str(j[1])
    advCat.append(cat)
df['advCat'] = pd.Series(advCat, index=df.index)
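pandas also ships a binning helper that does this in one call; a sketch on toy data, with bin edges mirroring `gen_categories(10, 300, 30)`:

```python
import pandas as pd

df = pd.DataFrame({"searchAdv": [5, 25, 100, 250]})

# Edges 0, 10, 40, 70, ..., 310 give bins (0, 10], (10, 40], (40, 70], ...
edges = [0] + list(range(10, 300 + 30, 30))
labels = ["%d-%d" % (lo, hi) for lo, hi in zip(edges[:-1], edges[1:])]

# pd.cut assigns each value the label of the half-open bin it falls into.
df["advCat"] = pd.cut(df["searchAdv"], bins=edges, labels=labels)

print(df["advCat"].tolist())  # -> ['0-10', '10-40', '70-100', '220-250']
```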

The whole Jupyter notebook may be found at this github repo.


  • Cheap flights (under 2000 PLN in 2017 money) may always be found; when booking, a lead time of around 220 days seems to be the sweet spot on this particular route.
  • Most flights take somewhere between 10 and 20 hours. Flights by LOT are under 10 hours, thanks to their direct route.
  • KLM and Air France seem to have similar pricing strategies, which is to be expected since they merged in 2004.

Due to its size and initial assumptions, the findings of this exploratory analysis aren't comprehensive, and as such should be taken with a grain of salt. The data set along with the Jupyter notebook may be cloned from this github repo.