Eland 演示笔记本#
在这个笔记本中,我们将向您展示如何使用 Eland 的熟悉 Pandas 兼容 API 在 Elasticsearch 中探索和分析数据。
要求 🧰#
对于此示例,您将需要
Python 3.8 或更高版本
Elastic 部署
在本示例中,我们将使用 Elastic Cloud(可通过 免费试用 获得)
创建 Elastic Cloud 部署#
如果您没有 Elastic Cloud 部署,请按照以下步骤创建一个。
访问 https://cloud.elastic.co/registration 并注册免费试用
选择 **创建部署** 并按照说明进行操作
安装包 📦#
首先,我们 pip install
并导入在本示例中所需的包。
[8]:
!pip install pandas numpy matplotlib requests eland elasticsearch
# Standard library imports
import os
import tempfile
import csv
import gzip
import json
from getpass import getpass
# Elasticsearch client
from elasticsearch import Elasticsearch
from elasticsearch import helpers
from elasticsearch.helpers import bulk
# Eland imports for Elasticsearch and pandas integration
import eland as ed
from eland.conftest import *
# Third-party imports for data handling
import pandas as pd
import numpy as np
# Visualization library
import matplotlib.pyplot as plt
# HTTP requests library
import requests
from google.colab import data_table
data_table.enable_dataframe_formatter()
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (1.5.3)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (1.23.5)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (3.7.1)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (2.31.0)
Requirement already satisfied: eland in /usr/local/lib/python3.10/dist-packages (8.11.0)
Requirement already satisfied: elasticsearch in /usr/local/lib/python3.10/dist-packages (8.10.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2023.3.post1)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (4.44.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.4.5)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (23.2)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (9.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (3.1.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests) (2023.7.22)
Requirement already satisfied: elastic-transport<9,>=8 in /usr/local/lib/python3.10/dist-packages (from elasticsearch) (8.10.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
初始化客户端 🔌#
接下来,我们使用 getpass
输入凭据。getpass
是 Python 标准库的一部分,用于安全地提示输入凭据。
[9]:
if os.environ.get("DOCS_BUILD"):
# Building the docs relies on a local cluster
# This is for CI purposes, unrelated to notebook
es_kwargs = {"hosts": "http://localhost:9200"}
else:
cloud_id = getpass("Elastic Cloud ID: ")
username = getpass("Elastic username: ")
password = getpass("Elastic password: ")
es_kwargs = {
"cloud_id": cloud_id,
"basic_auth": (username, password)
}
es = Elasticsearch(**es_kwargs)
es_info = es.info()
print(f"Connected to cluster named '{es_info['cluster_name']}' (version: {es_info['version']['number']})")
Elastic Cloud ID: ··········
Elastic username: ··········
Elastic password: ··········
Connected to cluster named 'db96688aa6f340828c41d9bc3f61684e' (version: 8.11.0)
加载和处理文档 📄#
是时候加载一些数据了!我们将使用 Eland 测试数据,这是一个假的航班数据集。首先,让我们使用正确的映射创建一个 Elasticsearch 索引。
使用映射创建索引#
[10]:
FLIGHTS_INDEX_NAME = "flights"
FLIGHTS_MAPPING = {
"mappings": {
"properties": {
"AvgTicketPrice": {"type": "float"},
"Cancelled": {"type": "boolean"},
"Carrier": {"type": "keyword"},
"Dest": {"type": "keyword"},
"DestAirportID": {"type": "keyword"},
"DestCityName": {"type": "keyword"},
"DestCountry": {"type": "keyword"},
"DestLocation": {"type": "geo_point"},
"DestRegion": {"type": "keyword"},
"DestWeather": {"type": "keyword"},
"DistanceKilometers": {"type": "float"},
"DistanceMiles": {"type": "float"},
"FlightDelay": {"type": "boolean"},
"FlightDelayMin": {"type": "integer"},
"FlightDelayType": {"type": "keyword"},
"FlightNum": {"type": "keyword"},
"FlightTimeHour": {"type": "float"},
"FlightTimeMin": {"type": "float"},
"Origin": {"type": "keyword"},
"OriginAirportID": {"type": "keyword"},
"OriginCityName": {"type": "keyword"},
"OriginCountry": {"type": "keyword"},
"OriginLocation": {"type": "geo_point"},
"OriginRegion": {"type": "keyword"},
"OriginWeather": {"type": "keyword"},
"dayOfWeek": {"type": "byte"},
"timestamp": {"type": "date", "format": "strict_date_hour_minute_second"},
}
}
}
# Check if the index already exists
if not es.indices.exists(index=FLIGHTS_INDEX_NAME):
# Create the index with the specified mappings
es.indices.create(index=FLIGHTS_INDEX_NAME, body=FLIGHTS_MAPPING)
print(f"Index '{FLIGHTS_INDEX_NAME}' created.")
else:
print(f"Index '{FLIGHTS_INDEX_NAME}' already exists.")
Index 'flights' created.
下载示例数据并将其索引到 Elasticsearch#
我们将管理索引错误以确保单个错误不会停止整个批处理。错误将被记录,并且过程将继续。在本例中,我们文档的一部分具有不正确的日期时间格式。如果这是您的数据集,您可以更新源数据,或者更改映射。
[11]:
if not os.environ.get("DOCS_BUILD"):
with tempfile.TemporaryDirectory() as tmpdir:
r = requests.get("https://github.com/elastic/eland/raw/main/tests/flights.json.gz")
path = os.path.join(tmpdir, "flights.json.gz")
with open(path, "wb") as f:
f.write(r.content)
with gzip.open(path, 'rt', encoding='utf-8') as gfile:
# Read the JSON objects into a list
# Each line in the file is a separate JSON object
data = [json.loads(line) for line in gfile]
# Create dataframe
pd_flights = pd.DataFrame(data)
# Function to yield formatted actions to index into Elasticsearch with incremental integer IDs
def generate_actions(data):
for idx, entry in enumerate(data, start=0): # Use monotonically increasing IDs starting from 0
yield {
"_index": FLIGHTS_INDEX_NAME,
"_id": idx, # Use the current index as the document ID
"_source": entry
}
# Perform bulk indexing
try:
# Set raise_on_error to False so the bulk operation continues after errors
success, errors = helpers.bulk(es, generate_actions(data), raise_on_error=False, stats_only=True)
print(f"Indexing completed. Successfully indexed {success} documents.")
if errors:
print(f"Some errors occurred. {errors} documents were not indexed.")
else:
print("All documents were indexed successfully without errors.")
except Exception as e:
print(f"An unexpected error occurred while indexing: {e}. Please check the Elasticsearch logs for more details.")
Indexing completed. Successfully indexed 13014 documents.
Some errors occurred. 45 documents were not indexed.
比较 Eland DataFrame 与 pandas DataFrame#
从 flights
索引创建 eland.DataFrame
[12]:
ed_flights = ed.DataFrame(es, 'flights')
[13]:
type(ed_flights)
[13]:
eland.dataframe.DataFrame
与 pandas DataFrame(从相同数据创建)进行比较
[14]:
pd_flights = ed.eland_to_pandas(ed_flights)
[15]:
type(pd_flights)
[15]:
pandas.core.frame.DataFrame
属性和基础数据#
DataFrame.columns#
[16]:
pd_flights.columns
[16]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
'timestamp'],
dtype='object')
[17]:
ed_flights.columns
[17]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
'timestamp'],
dtype='object')
DataFrame.dtypes#
[18]:
pd_flights.dtypes
[18]:
AvgTicketPrice float64
Cancelled bool
Carrier object
Dest object
DestAirportID object
...
OriginLocation object
OriginRegion object
OriginWeather object
dayOfWeek int64
timestamp datetime64[ns]
Length: 27, dtype: object
[19]:
ed_flights.dtypes
[19]:
AvgTicketPrice float64
Cancelled bool
Carrier object
Dest object
DestAirportID object
...
OriginLocation object
OriginRegion object
OriginWeather object
dayOfWeek int64
timestamp datetime64[ns]
Length: 27, dtype: object
DataFrame.select_dtypes#
[20]:
pd_flights.select_dtypes(include=np.number)
[20]:
AvgTicketPrice | DistanceKilometers | ... | FlightTimeMin | dayOfWeek | |
---|---|---|---|---|---|
0 | 841.265642 | 16492.326654 | ... | 1030.770416 | 0 |
1 | 882.982662 | 8823.400140 | ... | 464.389481 | 0 |
2 | 190.636904 | 0.000000 | ... | 0.000000 | 0 |
3 | 181.694216 | 555.737767 | ... | 222.749059 | 0 |
4 | 730.041778 | 13358.244200 | ... | 785.779071 | 0 |
... | ... | ... | ... | ... | ... |
12995 | 210.208911 | 0.000000 | ... | 0.000000 | 6 |
12996 | 830.500355 | 8551.966642 | ... | 1122.451513 | 6 |
12997 | 867.231029 | 9287.129863 | ... | 515.951659 | 6 |
12998 | 594.054018 | 3566.185736 | ... | 589.321980 | 6 |
12999 | 781.065911 | 11142.680199 | ... | 619.037789 | 6 |
12955 行 × 7 列
[21]:
ed_flights.select_dtypes(include=np.number)
[21]:
AvgTicketPrice | DistanceKilometers | ... | FlightTimeMin | dayOfWeek | |
---|---|---|---|---|---|
0 | 841.265642 | 16492.326654 | ... | 1030.770416 | 0 |
1 | 882.982662 | 8823.400140 | ... | 464.389481 | 0 |
2 | 190.636904 | 0.000000 | ... | 0.000000 | 0 |
3 | 181.694216 | 555.737767 | ... | 222.749059 | 0 |
4 | 730.041778 | 13358.244200 | ... | 785.779071 | 0 |
... | ... | ... | ... | ... | ... |
13054 | 1080.446279 | 8058.581753 | ... | 402.929088 | 6 |
13055 | 646.612941 | 7088.598322 | ... | 644.418029 | 6 |
13056 | 997.751876 | 10920.652972 | ... | 937.540811 | 6 |
13057 | 1102.814465 | 18748.859647 | ... | 1697.404971 | 6 |
13058 | 858.144337 | 16809.141923 | ... | 1610.761827 | 6 |
13014 行 × 7 列
DataFrame.empty#
[22]:
pd_flights.empty
[22]:
False
[23]:
ed_flights.empty
[23]:
False
DataFrame.shape#
[24]:
pd_flights.shape
[24]:
(12955, 27)
[25]:
ed_flights.shape
[25]:
(13014, 27)
DataFrame.index#
注意,eland.DataFrame.index
并不反映 pandas.DataFrame.index
。
[26]:
pd_flights.index
[26]:
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
...
'12990', '12991', '12992', '12993', '12994', '12995', '12996', '12997', '12998', '12999'],
dtype='object', length=12955)
[27]:
# NBVAL_IGNORE_OUTPUT
ed_flights.index
[27]:
<eland.index.Index at 0x79e8c1ef9d50>
[28]:
ed_flights.index.es_index_field
[28]:
'_id'
DataFrame.values#
注意,eland.DataFrame.values
不受支持。
[29]:
pd_flights.values
[29]:
array([[841.2656419677076, False, 'Kibana Airlines', ..., 'Sunny', 0,
Timestamp('2018-01-01 00:00:00')],
[882.9826615595518, False, 'Logstash Airways', ..., 'Clear', 0,
Timestamp('2018-01-01 18:27:00')],
[190.6369038508356, False, 'Logstash Airways', ..., 'Rain', 0,
Timestamp('2018-01-01 17:11:14')],
...,
[867.231028821183, False, 'ES-Air', ..., 'Clear', 6,
Timestamp('2018-02-11 20:51:31')],
[594.0540183976791, False, 'Kibana Airlines', ..., 'Clear', 6,
Timestamp('2018-02-11 12:27:52')],
[781.0659105148637, False, 'Kibana Airlines', ..., 'Clear', 6,
Timestamp('2018-02-11 03:54:56')]], dtype=object)
[30]:
try:
ed_flights.values
except AttributeError as e:
print(e)
This method would scan/scroll the entire Elasticsearch index(s) into memory. If this is explicitly required, and there is sufficient memory, call `ed.eland_to_pandas(ed_df).values`
索引、迭代#
DataFrame.head#
[31]:
pd_flights.head()
Warning: Total number of columns (27) exceeds max_columns (20). Falling back to pandas display.
[31]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
0 | 841.265642 | False | ... | 0 | 2018-01-01 00:00:00 |
1 | 882.982662 | False | ... | 0 | 2018-01-01 18:27:00 |
2 | 190.636904 | False | ... | 0 | 2018-01-01 17:11:14 |
3 | 181.694216 | True | ... | 0 | 2018-01-01 10:33:28 |
4 | 730.041778 | False | ... | 0 | 2018-01-01 05:13:00 |
5 行 × 27 列
[32]:
ed_flights.head()
[32]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
0 | 841.265642 | False | ... | 0 | 2018-01-01 00:00:00 |
1 | 882.982662 | False | ... | 0 | 2018-01-01 18:27:00 |
2 | 190.636904 | False | ... | 0 | 2018-01-01 17:11:14 |
3 | 181.694216 | True | ... | 0 | 2018-01-01 10:33:28 |
4 | 730.041778 | False | ... | 0 | 2018-01-01 05:13:00 |
5 行 × 27 列
DataFrame.tail#
[33]:
pd_flights.tail()
Warning: Total number of columns (27) exceeds max_columns (20). Falling back to pandas display.
[33]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
12995 | 210.208911 | False | ... | 6 | 2018-02-11 02:26:04 |
12996 | 830.500355 | False | ... | 6 | 2018-02-11 03:52:31 |
12997 | 867.231029 | False | ... | 6 | 2018-02-11 20:51:31 |
12998 | 594.054018 | False | ... | 6 | 2018-02-11 12:27:52 |
12999 | 781.065911 | False | ... | 6 | 2018-02-11 03:54:56 |
5 行 × 27 列
[34]:
ed_flights.tail()
[34]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
13054 | 1080.446279 | False | ... | 6 | 2018-02-11 20:42:25 |
13055 | 646.612941 | False | ... | 6 | 2018-02-11 01:41:57 |
13056 | 997.751876 | False | ... | 6 | 2018-02-11 04:09:27 |
13057 | 1102.814465 | False | ... | 6 | 2018-02-11 08:28:21 |
13058 | 858.144337 | False | ... | 6 | 2018-02-11 14:54:34 |
5 行 × 27 列
DataFrame.keys#
[35]:
pd_flights.keys()
[35]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
'timestamp'],
dtype='object')
[36]:
ed_flights.keys()
[36]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
'timestamp'],
dtype='object')
DataFrame.get#
[37]:
pd_flights.get('Carrier')
[37]:
0 Kibana Airlines
1 Logstash Airways
2 Logstash Airways
3 Kibana Airlines
4 Kibana Airlines
...
12995 JetBeats
12996 Logstash Airways
12997 ES-Air
12998 Kibana Airlines
12999 Kibana Airlines
Name: Carrier, Length: 12955, dtype: object
[38]:
ed_flights.get('Carrier')
[38]:
0 Kibana Airlines
1 Logstash Airways
2 Logstash Airways
3 Kibana Airlines
4 Kibana Airlines
...
13054 Logstash Airways
13055 Logstash Airways
13056 Logstash Airways
13057 JetBeats
13058 JetBeats
Name: Carrier, Length: 13014, dtype: object
[39]:
pd_flights.get(['Carrier', 'Origin'])
[39]:
Carrier | Origin | |
---|---|---|
0 | Kibana Airlines | Frankfurt am Main Airport |
1 | Logstash Airways | Cape Town International Airport |
2 | Logstash Airways | Venice Marco Polo Airport |
3 | Kibana Airlines | Naples International Airport |
4 | Kibana Airlines | Licenciado Benito Juarez International Airport |
... | ... | ... |
12995 | JetBeats | Ottawa Macdonald-Cartier International Airport |
12996 | Logstash Airways | Denver International Airport |
12997 | ES-Air | Charles de Gaulle International Airport |
12998 | Kibana Airlines | Rajiv Gandhi International Airport |
12999 | Kibana Airlines | Abu Dhabi International Airport |
12955 行 × 2 列
列表输入当前不受 eland.DataFrame.get
支持
[40]:
try:
ed_flights.get(['Carrier', 'Origin'])
except TypeError as e:
print(e)
unhashable type: 'list'
DataFrame.query#
[41]:
pd_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled == True')
Warning: Total number of columns (27) exceeds max_columns (20). Falling back to pandas display.
[41]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
8 | 960.869736 | True | ... | 0 | 2018-01-01 12:09:35 |
26 | 975.812632 | True | ... | 0 | 2018-01-01 15:38:32 |
311 | 946.358410 | True | ... | 0 | 2018-01-01 11:51:12 |
651 | 975.383864 | True | ... | 2 | 2018-01-03 21:13:17 |
950 | 907.836523 | True | ... | 2 | 2018-01-03 05:14:51 |
... | ... | ... | ... | ... | ... |
12798 | 1083.517299 | True | ... | 5 | 2018-02-10 19:20:10 |
12820 | 909.973606 | True | ... | 5 | 2018-02-10 05:11:35 |
12906 | 983.429244 | True | ... | 6 | 2018-02-11 06:19:58 |
12918 | 1136.678150 | True | ... | 6 | 2018-02-11 16:03:10 |
12919 | 1105.211803 | True | ... | 6 | 2018-02-11 05:36:05 |
67 行 × 27 列
eland.DataFrame.query
需要布尔值限定符,例如
ed_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled')
失败
[42]:
try:
ed_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled == True')
except Exception as e:
print(f"query() failed as expected: {repr(e)}.")
query() failed as expected: TypeError("Cannot interpret '{'terms': {'Carrier': ['Kibana Airlines']}}' as a data type").
布尔索引查询#
[43]:
pd_flights[(pd_flights.Carrier=="Kibana Airlines") &
(pd_flights.AvgTicketPrice > 900.0) &
(pd_flights.Cancelled == True)]
Warning: Total number of columns (27) exceeds max_columns (20). Falling back to pandas display.
[43]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
8 | 960.869736 | True | ... | 0 | 2018-01-01 12:09:35 |
26 | 975.812632 | True | ... | 0 | 2018-01-01 15:38:32 |
311 | 946.358410 | True | ... | 0 | 2018-01-01 11:51:12 |
651 | 975.383864 | True | ... | 2 | 2018-01-03 21:13:17 |
950 | 907.836523 | True | ... | 2 | 2018-01-03 05:14:51 |
... | ... | ... | ... | ... | ... |
12798 | 1083.517299 | True | ... | 5 | 2018-02-10 19:20:10 |
12820 | 909.973606 | True | ... | 5 | 2018-02-10 05:11:35 |
12906 | 983.429244 | True | ... | 6 | 2018-02-11 06:19:58 |
12918 | 1136.678150 | True | ... | 6 | 2018-02-11 16:03:10 |
12919 | 1105.211803 | True | ... | 6 | 2018-02-11 05:36:05 |
67 行 × 27 列
[44]:
ed_flights[(ed_flights.Carrier=="Kibana Airlines") &
(ed_flights.AvgTicketPrice > 900.0) &
(ed_flights.Cancelled == True)]
[44]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
8 | 960.869736 | True | ... | 0 | 2018-01-01 12:09:35 |
26 | 975.812632 | True | ... | 0 | 2018-01-01 15:38:32 |
311 | 946.358410 | True | ... | 0 | 2018-01-01 11:51:12 |
651 | 975.383864 | True | ... | 2 | 2018-01-03 21:13:17 |
950 | 907.836523 | True | ... | 2 | 2018-01-03 05:14:51 |
... | ... | ... | ... | ... | ... |
12820 | 909.973606 | True | ... | 5 | 2018-02-10 05:11:35 |
12906 | 983.429244 | True | ... | 6 | 2018-02-11 06:19:58 |
12918 | 1136.678150 | True | ... | 6 | 2018-02-11 16:03:10 |
12919 | 1105.211803 | True | ... | 6 | 2018-02-11 05:36:05 |
13013 | 1055.350213 | True | ... | 6 | 2018-02-11 13:20:16 |
68 行 × 27 列
函数应用、GroupBy 和窗口#
DataFrame.aggs#
[45]:
pd_flights[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
[45]:
DistanceKilometers | AvgTicketPrice | |
---|---|---|
sum | 9.185214e+07 | 8.130904e+06 |
min | 0.000000e+00 | 1.000205e+02 |
std | 4.575501e+03 | 2.663747e+02 |
eland.DataFrame.aggregate
当前仅支持数字列
[46]:
ed_flights[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
[46]:
DistanceKilometers | AvgTicketPrice | |
---|---|---|
sum | 9.229730e+07 | 8.177881e+06 |
min | 0.000000e+00 | 1.000205e+02 |
std | 4.578132e+03 | 2.664956e+02 |
计算/描述性统计#
DataFrame.count#
[47]:
pd_flights.count()
[47]:
AvgTicketPrice 12955
Cancelled 12955
Carrier 12955
Dest 12955
DestAirportID 12955
...
OriginLocation 12955
OriginRegion 12955
OriginWeather 12955
dayOfWeek 12955
timestamp 12955
Length: 27, dtype: int64
[48]:
ed_flights.count()
[48]:
AvgTicketPrice 13014
Cancelled 13014
Carrier 13014
Dest 13014
DestAirportID 13014
...
OriginLocation 13014
OriginRegion 13014
OriginWeather 13014
dayOfWeek 13014
timestamp 13014
Length: 27, dtype: int64
DataFrame.describe#
[49]:
pd_flights.describe()
[49]:
AvgTicketPrice | DistanceKilometers | ... | FlightTimeMin | dayOfWeek | |
---|---|---|---|---|---|
count | 12955.000000 | 12955.000000 | ... | 12955.000000 | 12955.000000 |
mean | 627.626743 | 7090.092034 | ... | 510.931002 | 2.820455 |
std | 266.374679 | 4575.500916 | ... | 334.378895 | 1.932079 |
min | 100.020528 | 0.000000 | ... | 0.000000 | 0.000000 |
25% | 409.522973 | 2458.696332 | ... | 252.658936 | 1.000000 |
50% | 640.169085 | 7610.330866 | ... | 503.142831 | 3.000000 |
75% | 841.437465 | 9733.565957 | ... | 720.183608 | 4.000000 |
max | 1199.729053 | 19881.482315 | ... | 1902.902032 | 6.000000 |
8 行 × 7 列
从 eland.DataFrame.describe
返回的值可能会有所不同,因为 Elasticsearch 聚合的结果不同。
[50]:
# NBVAL_IGNORE_OUTPUT
ed_flights.describe()
[50]:
AvgTicketPrice | Cancelled | ... | FlightTimeMin | dayOfWeek | |
---|---|---|---|---|---|
count | 13014.000000 | 13014.000000 | ... | 13014.000000 | 13014.000000 |
mean | 628.391013 | 0.128323 | ... | 511.106340 | 2.834870 |
std | 266.495605 | 0.334475 | ... | 334.674968 | 1.939568 |
min | 100.020531 | 0.000000 | ... | 0.000000 | 0.000000 |
25% | 411.187230 | 0.000000 | ... | 250.141903 | 1.061057 |
50% | 639.776657 | 0.000000 | ... | 502.902212 | 2.932961 |
75% | 842.653771 | 0.000000 | ... | 721.186244 | 4.445456 |
max | 1199.729004 | 1.000000 | ... | 1902.901978 | 6.000000 |
8 行 × 9 列
DataFrame.info#
[51]:
pd_flights.info()
<class 'pandas.core.frame.DataFrame'>
Index: 12955 entries, 0 to 12999
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AvgTicketPrice 12955 non-null float64
1 Cancelled 12955 non-null bool
2 Carrier 12955 non-null object
3 Dest 12955 non-null object
4 DestAirportID 12955 non-null object
5 DestCityName 12955 non-null object
6 DestCountry 12955 non-null object
7 DestLocation 12955 non-null object
8 DestRegion 12955 non-null object
9 DestWeather 12955 non-null object
10 DistanceKilometers 12955 non-null float64
11 DistanceMiles 12955 non-null float64
12 FlightDelay 12955 non-null bool
13 FlightDelayMin 12955 non-null int64
14 FlightDelayType 12955 non-null object
15 FlightNum 12955 non-null object
16 FlightTimeHour 12955 non-null float64
17 FlightTimeMin 12955 non-null float64
18 Origin 12955 non-null object
19 OriginAirportID 12955 non-null object
20 OriginCityName 12955 non-null object
21 OriginCountry 12955 non-null object
22 OriginLocation 12955 non-null object
23 OriginRegion 12955 non-null object
24 OriginWeather 12955 non-null object
25 dayOfWeek 12955 non-null int64
26 timestamp 12955 non-null datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17)
memory usage: 3.1+ MB
[52]:
ed_flights.info()
<class 'eland.dataframe.DataFrame'>
Index: 13014 entries, 0 to 13058
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AvgTicketPrice 13014 non-null float64
1 Cancelled 13014 non-null bool
2 Carrier 13014 non-null object
3 Dest 13014 non-null object
4 DestAirportID 13014 non-null object
5 DestCityName 13014 non-null object
6 DestCountry 13014 non-null object
7 DestLocation 13014 non-null object
8 DestRegion 13014 non-null object
9 DestWeather 13014 non-null object
10 DistanceKilometers 13014 non-null float64
11 DistanceMiles 13014 non-null float64
12 FlightDelay 13014 non-null bool
13 FlightDelayMin 13014 non-null int64
14 FlightDelayType 13014 non-null object
15 FlightNum 13014 non-null object
16 FlightTimeHour 13014 non-null float64
17 FlightTimeMin 13014 non-null float64
18 Origin 13014 non-null object
19 OriginAirportID 13014 non-null object
20 OriginCityName 13014 non-null object
21 OriginCountry 13014 non-null object
22 OriginLocation 13014 non-null object
23 OriginRegion 13014 non-null object
24 OriginWeather 13014 non-null object
25 dayOfWeek 13014 non-null int64
26 timestamp 13014 non-null datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17)
memory usage: 64.000 bytes
Elasticsearch storage usage: 11.097 MB
DataFrame.max, DataFrame.min, DataFrame.mean, DataFrame.sum#
max#
[53]:
pd_flights.max(numeric_only=True)
[53]:
AvgTicketPrice 1199.729053
Cancelled True
DistanceKilometers 19881.482315
DistanceMiles 12353.780369
FlightDelay True
FlightDelayMin 360
FlightTimeHour 31.715034
FlightTimeMin 1902.902032
dayOfWeek 6
dtype: object
eland.DataFrame.max,min,mean,sum
仅聚合数字列
[54]:
ed_flights.max(numeric_only=True)
[54]:
AvgTicketPrice 1199.729004
Cancelled 1.000000
DistanceKilometers 19881.482422
DistanceMiles 12353.780273
FlightDelay 1.000000
FlightDelayMin 360.000000
FlightTimeHour 31.715034
FlightTimeMin 1902.901978
dayOfWeek 6.000000
dtype: float64
min#
[55]:
pd_flights.min(numeric_only=True)
[55]:
AvgTicketPrice 100.020528
Cancelled False
DistanceKilometers 0.0
DistanceMiles 0.0
FlightDelay False
FlightDelayMin 0
FlightTimeHour 0.0
FlightTimeMin 0.0
dayOfWeek 0
dtype: object
[56]:
ed_flights.min(numeric_only=True)
[56]:
AvgTicketPrice 100.020531
Cancelled 0.000000
DistanceKilometers 0.000000
DistanceMiles 0.000000
FlightDelay 0.000000
FlightDelayMin 0.000000
FlightTimeHour 0.000000
FlightTimeMin 0.000000
dayOfWeek 0.000000
dtype: float64
mean#
[57]:
pd_flights.mean(numeric_only=True)
[57]:
AvgTicketPrice 627.626743
Cancelled 0.128059
DistanceKilometers 7090.092034
DistanceMiles 4405.578940
FlightDelay 0.251023
FlightDelayMin 47.247395
FlightTimeHour 8.515517
FlightTimeMin 510.931002
dayOfWeek 2.820455
dtype: float64
[58]:
ed_flights.mean(numeric_only=True)
[58]:
AvgTicketPrice 628.391013
Cancelled 0.128323
DistanceKilometers 7092.154679
DistanceMiles 4406.860604
FlightDelay 0.251268
FlightDelayMin 47.349009
FlightTimeHour 8.518439
FlightTimeMin 511.106340
dayOfWeek 2.834870
dtype: float64
sum#
[59]:
pd_flights.sum(numeric_only=True)
[59]:
AvgTicketPrice 8.130904e+06
Cancelled 1.659000e+03
DistanceKilometers 9.185214e+07
DistanceMiles 5.707428e+07
FlightDelay 3.252000e+03
FlightDelayMin 6.120900e+05
FlightTimeHour 1.103185e+05
FlightTimeMin 6.619111e+06
dayOfWeek 3.653900e+04
dtype: float64
[60]:
ed_flights.sum(numeric_only=True)
[60]:
AvgTicketPrice 8.177881e+06
Cancelled 1.670000e+03
DistanceKilometers 9.229730e+07
DistanceMiles 5.735088e+07
FlightDelay 3.270000e+03
FlightDelayMin 6.162000e+05
FlightTimeHour 1.108590e+05
FlightTimeMin 6.651538e+06
dayOfWeek 3.689300e+04
dtype: float64
DataFrame.nunique#
[61]:
pd_flights[['Carrier', 'Origin', 'Dest']].nunique()
[61]:
Carrier 4
Origin 156
Dest 156
dtype: int64
[62]:
ed_flights[['Carrier', 'Origin', 'Dest']].nunique()
[62]:
Carrier 4
Origin 156
Dest 156
dtype: int64
DataFrame.drop#
[63]:
pd_flights.drop(columns=['AvgTicketPrice',
'Cancelled',
'DestLocation',
'Dest',
'DestAirportID',
'DestCityName',
'DestCountry'])
[63]:
Carrier | DestRegion | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
0 | Kibana Airlines | SE-BD | ... | 0 | 2018-01-01 00:00:00 |
1 | Logstash Airways | IT-34 | ... | 0 | 2018-01-01 18:27:00 |
2 | Logstash Airways | IT-34 | ... | 0 | 2018-01-01 17:11:14 |
3 | Kibana Airlines | IT-34 | ... | 0 | 2018-01-01 10:33:28 |
4 | Kibana Airlines | SE-BD | ... | 0 | 2018-01-01 05:13:00 |
... | ... | ... | ... | ... | ... |
12995 | JetBeats | CA-ON | ... | 6 | 2018-02-11 02:26:04 |
12996 | Logstash Airways | IT-42 | ... | 6 | 2018-02-11 03:52:31 |
12997 | ES-Air | SE-BD | ... | 6 | 2018-02-11 20:51:31 |
12998 | Kibana Airlines | SE-BD | ... | 6 | 2018-02-11 12:27:52 |
12999 | Kibana Airlines | CA-ON | ... | 6 | 2018-02-11 03:54:56 |
12955 行 × 20 列
[64]:
ed_flights.drop(columns=['AvgTicketPrice',
'Cancelled',
'DestLocation',
'Dest',
'DestAirportID',
'DestCityName',
'DestCountry'])
[64]:
Carrier | DestRegion | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
0 | Kibana Airlines | SE-BD | ... | 0 | 2018-01-01 00:00:00 |
1 | Logstash Airways | IT-34 | ... | 0 | 2018-01-01 18:27:00 |
2 | Logstash Airways | IT-34 | ... | 0 | 2018-01-01 17:11:14 |
3 | Kibana Airlines | IT-34 | ... | 0 | 2018-01-01 10:33:28 |
4 | Kibana Airlines | SE-BD | ... | 0 | 2018-01-01 05:13:00 |
... | ... | ... | ... | ... | ... |
13054 | Logstash Airways | SE-BD | ... | 6 | 2018-02-11 20:42:25 |
13055 | Logstash Airways | CH-ZH | ... | 6 | 2018-02-11 01:41:57 |
13056 | Logstash Airways | RU-AMU | ... | 6 | 2018-02-11 04:09:27 |
13057 | JetBeats | SE-BD | ... | 6 | 2018-02-11 08:28:21 |
13058 | JetBeats | US-DC | ... | 6 | 2018-02-11 14:54:34 |
13014 行 × 20 列
绘图#
[65]:
pd_flights.select_dtypes(include=np.number).hist(figsize=[10,10])
plt.show()
[66]:
ed_flights.select_dtypes(include=np.number).hist(figsize=[10,10])
plt.show()
Elasticsearch 实用程序#
[67]:
ed_flights2 = ed_flights[(ed_flights.OriginAirportID == 'AMS') & (ed_flights.FlightDelayMin > 60)]
ed_flights2 = ed_flights2[['timestamp', 'OriginAirportID', 'DestAirportID', 'FlightDelayMin']]
ed_flights2 = ed_flights2.tail()
[68]:
print(ed_flights2.es_info())
es_index_pattern: flights
Index:
es_index_field: _id
is_source_field: False
Mappings:
capabilities:
es_field_name is_source es_dtype es_date_format pd_dtype is_searchable is_aggregatable is_scripted aggregatable_es_field_name
timestamp timestamp True date strict_date_hour_minute_second datetime64[ns] True True False timestamp
OriginAirportID OriginAirportID True keyword None object True True False OriginAirportID
DestAirportID DestAirportID True keyword None object True True False DestAirportID
FlightDelayMin FlightDelayMin True integer None int64 True True False FlightDelayMin
Operations:
tasks: [('boolean_filter': ('boolean_filter': {'bool': {'must': [{'term': {'OriginAirportID': 'AMS'}}, {'range': {'FlightDelayMin': {'gt': 60}}}]}})), ('tail': ('sort_field': '_doc', 'count': 5))]
size: 5
sort_params: {'_doc': 'desc'}
_source: ['timestamp', 'OriginAirportID', 'DestAirportID', 'FlightDelayMin']
body: {'query': {'bool': {'must': [{'term': {'OriginAirportID': 'AMS'}}, {'range': {'FlightDelayMin': {'gt': 60}}}]}}}
post_processing: [('sort_index')]