Long distance flight pricing exploration WAW-ICN
Travelling is fun, the further the better. However long distance flights cost a lot of silver. Naturally, a question arises. Does checking skyscanner every few days for a great (cheap) deal make any sense, or maybe there are some patterns in the flight pricing.
- Tickets over 2000 PLN (2017 cash) are considered expensive (round-trip).
- Flights over 24 hours are considered unbearable.
- We have an infinite predictive capacity - can plan vacations +200 days ahead.
- We fly economy.
We start with plotting flight prices sorted by the airlines. Cause if the prices are fixed, there is no point in searching. The data set contains flights costing over 4000 PLN however we live in reality not la la land, so we can drop them.
Fortunately Figure 1. suggests that price fixing isn't the case. Now let's check if the advance in flight search has reflection in pricing. Maybe there are no flights under 2000 PLN when looking 10 days ahead? Maybe checking booking sites every 10 minutes makes sense.
Figure 2. suggest that good deals may be found at any time, so there is reason to hit F5 every 10 minutes on skyscanner. Flight duration is also an important factor. Flying for over 24 hours is a no no. Maybe cheap flights must last long, hence our whole data exploration is pointless?
Again, the data in Figure 3. suggest that you can have a cheap flight under 24 hours. Okay, since we know that the cards aren't stacked against us, let's find the sweet spot for booking. We will try to pinpoint what kind of leading do we need in order to find a cheap flight with the flight duration under 20 hours.
import datetime def date_converter(dateStr): year = int(dateStr[:4]) month = int(dateStr[5:7]) day = int(dateStr[8:10]) return datetime.date(year, month, day)
Booking time advance
In order to check if the advance in search flight departure date (looking forwards for a flight 1 month ahead vs 3 months ahead) affects the ticket price search advance time in days has to be computed.
searchAdvList =  for row in data.itertuples(): searchAdv = date_converter(row.dateFrom) - date_converter(row.dateSearch) searchAdvList.append(searchAdv.days) data['searchAdv'] = pd.Series(searchAdvList, index=data.index)
Generate categories by range
Making sense from scattered data is hard, that's why grouping data by a key is a good idea.
def gen_categories(minValue,maxValue,step): categories = [(0,minValue)] for i in range(minValue,maxValue,step): categories.append((i,i+step)) return categories
Categories by lead in search
Using our generated range we can describe each data point accordingly.
advCatList = gen_categories(10,300,30) cat = "" advCat =  categories = advCatList for i in df["searchAdv"].values: for j in categories: if i > j and i <= j: cat = str(j) + "-" + str(j) break advCat.append(cat) df['advCat'] = pd.Series(advCat,index=df.index)
The whole jupyter notebook may be found at this github repo.
- Cheap (under 2000 PLN in 2017 cash) may be always found, when booking a leading of 220 days seems to be the sweet spot on this particular route.
- Most flights take somewhere between 10 and 20 hours. Flight by LOT are under 10 hours, due to their direct flight route.
- KLM and Air France seem to have similar pricing strategies, which is to be expected since in 2004 they merged.
Findings of the exploratory analysis due to it's size and initial assumtions aren't comprehensive, and as such should be taken with a grain of salt. The data set along with jupyter notebook may be cloned from this github repo.