Eland 演示笔记本#

Open in Colab

在这个笔记本中,我们将向您展示如何使用 Eland 的熟悉 Pandas 兼容 API 在 Elasticsearch 中探索和分析数据。

要求 🧰#

对于此示例,您将需要

  • Python 3.8 或更高版本

  • Elastic 部署

创建 Elastic Cloud 部署#

如果您没有 Elastic Cloud 部署,请按照以下步骤创建一个。

  1. 访问 https://cloud.elastic.co/registration 并注册免费试用

  2. 选择 **创建部署** 并按照说明进行操作

安装包 📦#

首先,我们 pip install 并导入在本示例中所需的包。

[8]:
!pip install pandas numpy matplotlib requests eland elasticsearch

# Standard library imports
import os
import tempfile
import csv
import gzip
import json
from getpass import getpass

# Elasticsearch client
from elasticsearch import Elasticsearch
from elasticsearch import helpers
from elasticsearch.helpers import bulk

# Eland imports for Elasticsearch and pandas integration
import eland as ed
from eland.conftest import *

# Third-party imports for data handling
import pandas as pd
import numpy as np

# Visualization library
import matplotlib.pyplot as plt

# HTTP requests library
import requests

from google.colab import data_table
data_table.enable_dataframe_formatter()

Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (1.5.3)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (1.23.5)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (3.7.1)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (2.31.0)
Requirement already satisfied: eland in /usr/local/lib/python3.10/dist-packages (8.11.0)
Requirement already satisfied: elasticsearch in /usr/local/lib/python3.10/dist-packages (8.10.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2023.3.post1)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (4.44.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.4.5)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (23.2)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (9.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (3.1.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests) (2023.7.22)
Requirement already satisfied: elastic-transport<9,>=8 in /usr/local/lib/python3.10/dist-packages (from elasticsearch) (8.10.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)

初始化客户端 🔌#

接下来,我们使用 getpass 输入凭据。getpass 是 Python 标准库的一部分,用于安全地提示输入凭据。

[9]:
if os.environ.get("DOCS_BUILD"):
  # Building the docs relies on a local cluster
  # This is for CI purposes, unrelated to notebook
  es_kwargs = {"hosts": "http://localhost:9200"}
else:
  cloud_id = getpass("Elastic Cloud ID: ")
  username = getpass("Elastic username: ")
  password = getpass("Elastic password: ")
  es_kwargs = {
      "cloud_id": cloud_id,
      "basic_auth": (username, password)
  }

es = Elasticsearch(**es_kwargs)
es_info = es.info()
print(f"Connected to cluster named '{es_info['cluster_name']}' (version: {es_info['version']['number']})")

Elastic Cloud ID: ··········
Elastic username: ··········
Elastic password: ··········
Connected to cluster named 'db96688aa6f340828c41d9bc3f61684e' (version: 8.11.0)

加载和处理文档 📄#

是时候加载一些数据了!我们将使用 Eland 测试数据,这是一个假的航班数据集。首先,让我们使用正确的映射创建一个 Elasticsearch 索引。

使用映射创建索引#

[10]:
FLIGHTS_INDEX_NAME = "flights"
FLIGHTS_MAPPING = {
    "mappings": {
        "properties": {
            "AvgTicketPrice": {"type": "float"},
            "Cancelled": {"type": "boolean"},
            "Carrier": {"type": "keyword"},
            "Dest": {"type": "keyword"},
            "DestAirportID": {"type": "keyword"},
            "DestCityName": {"type": "keyword"},
            "DestCountry": {"type": "keyword"},
            "DestLocation": {"type": "geo_point"},
            "DestRegion": {"type": "keyword"},
            "DestWeather": {"type": "keyword"},
            "DistanceKilometers": {"type": "float"},
            "DistanceMiles": {"type": "float"},
            "FlightDelay": {"type": "boolean"},
            "FlightDelayMin": {"type": "integer"},
            "FlightDelayType": {"type": "keyword"},
            "FlightNum": {"type": "keyword"},
            "FlightTimeHour": {"type": "float"},
            "FlightTimeMin": {"type": "float"},
            "Origin": {"type": "keyword"},
            "OriginAirportID": {"type": "keyword"},
            "OriginCityName": {"type": "keyword"},
            "OriginCountry": {"type": "keyword"},
            "OriginLocation": {"type": "geo_point"},
            "OriginRegion": {"type": "keyword"},
            "OriginWeather": {"type": "keyword"},
            "dayOfWeek": {"type": "byte"},
            "timestamp": {"type": "date", "format": "strict_date_hour_minute_second"},
        }
    }
}

# Check if the index already exists
if not es.indices.exists(index=FLIGHTS_INDEX_NAME):
    # Create the index with the specified mappings
    es.indices.create(index=FLIGHTS_INDEX_NAME, body=FLIGHTS_MAPPING)
    print(f"Index '{FLIGHTS_INDEX_NAME}' created.")
else:
    print(f"Index '{FLIGHTS_INDEX_NAME}' already exists.")

Index 'flights' created.

下载示例数据并将其索引到 Elasticsearch#

我们将管理索引错误以确保单个错误不会停止整个批处理。错误将被记录,并且过程将继续。在本例中,我们文档的一部分具有不正确的日期时间格式。如果这是您的数据集,您可以更新源数据,或者更改映射。

[11]:
if not os.environ.get("DOCS_BUILD"):
  with tempfile.TemporaryDirectory() as tmpdir:
    r = requests.get("https://github.com/elastic/eland/raw/main/tests/flights.json.gz")
    path = os.path.join(tmpdir, "flights.json.gz")
    with open(path, "wb") as f:
      f.write(r.content)

      with gzip.open(path, 'rt', encoding='utf-8') as gfile:
            # Read the JSON objects into a list
            # Each line in the file is a separate JSON object
            data = [json.loads(line) for line in gfile]

# Create dataframe
pd_flights = pd.DataFrame(data)

# Function to yield formatted actions to index into Elasticsearch with incremental integer IDs
def generate_actions(data):
    for idx, entry in enumerate(data, start=0):  # Use monotonically increasing IDs starting from 0
        yield {
            "_index": FLIGHTS_INDEX_NAME,
            "_id": idx,  # Use the current index as the document ID
            "_source": entry
        }

# Perform bulk indexing
try:
    # Set raise_on_error to False so the bulk operation continues after errors
    success, errors = helpers.bulk(es, generate_actions(data), raise_on_error=False, stats_only=True)
    print(f"Indexing completed. Successfully indexed {success} documents.")

    if errors:
        print(f"Some errors occurred. {errors} documents were not indexed.")
    else:
        print("All documents were indexed successfully without errors.")
except Exception as e:
    print(f"An unexpected error occurred while indexing: {e}. Please check the Elasticsearch logs for more details.")


Indexing completed. Successfully indexed 13014 documents.
Some errors occurred. 45 documents were not indexed.

比较 Eland DataFrame 与 pandas DataFrame#

flights 索引创建 eland.DataFrame

[12]:
ed_flights = ed.DataFrame(es, 'flights')
[13]:
type(ed_flights)
[13]:
eland.dataframe.DataFrame

与 pandas DataFrame(从相同数据创建)进行比较

[14]:
pd_flights = ed.eland_to_pandas(ed_flights)
[15]:
type(pd_flights)
[15]:
pandas.core.frame.DataFrame

属性和基础数据#

DataFrame.columns#

[16]:
pd_flights.columns
[16]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
       'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
       'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
       'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
       'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
       'timestamp'],
      dtype='object')
[17]:
ed_flights.columns
[17]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
       'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
       'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
       'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
       'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
       'timestamp'],
      dtype='object')

DataFrame.dtypes#

[18]:
pd_flights.dtypes
[18]:
AvgTicketPrice           float64
Cancelled                   bool
Carrier                   object
Dest                      object
DestAirportID             object
                       ...
OriginLocation            object
OriginRegion              object
OriginWeather             object
dayOfWeek                  int64
timestamp         datetime64[ns]
Length: 27, dtype: object
[19]:
ed_flights.dtypes
[19]:
AvgTicketPrice           float64
Cancelled                   bool
Carrier                   object
Dest                      object
DestAirportID             object
                       ...
OriginLocation            object
OriginRegion              object
OriginWeather             object
dayOfWeek                  int64
timestamp         datetime64[ns]
Length: 27, dtype: object

DataFrame.select_dtypes#

[20]:
pd_flights.select_dtypes(include=np.number)
[20]:
AvgTicketPrice DistanceKilometers ... FlightTimeMin dayOfWeek
0 841.265642 16492.326654 ... 1030.770416 0
1 882.982662 8823.400140 ... 464.389481 0
2 190.636904 0.000000 ... 0.000000 0
3 181.694216 555.737767 ... 222.749059 0
4 730.041778 13358.244200 ... 785.779071 0
... ... ... ... ... ...
12995 210.208911 0.000000 ... 0.000000 6
12996 830.500355 8551.966642 ... 1122.451513 6
12997 867.231029 9287.129863 ... 515.951659 6
12998 594.054018 3566.185736 ... 589.321980 6
12999 781.065911 11142.680199 ... 619.037789 6

12955 行 × 7 列

[21]:
ed_flights.select_dtypes(include=np.number)
[21]:
AvgTicketPrice DistanceKilometers ... FlightTimeMin dayOfWeek
0 841.265642 16492.326654 ... 1030.770416 0
1 882.982662 8823.400140 ... 464.389481 0
2 190.636904 0.000000 ... 0.000000 0
3 181.694216 555.737767 ... 222.749059 0
4 730.041778 13358.244200 ... 785.779071 0
... ... ... ... ... ...
13054 1080.446279 8058.581753 ... 402.929088 6
13055 646.612941 7088.598322 ... 644.418029 6
13056 997.751876 10920.652972 ... 937.540811 6
13057 1102.814465 18748.859647 ... 1697.404971 6
13058 858.144337 16809.141923 ... 1610.761827 6

13014 行 × 7 列

DataFrame.empty#

[22]:
pd_flights.empty
[22]:
False
[23]:
ed_flights.empty
[23]:
False

DataFrame.shape#

[24]:
pd_flights.shape
[24]:
(12955, 27)
[25]:
ed_flights.shape
[25]:
(13014, 27)

DataFrame.index#

注意,eland.DataFrame.index 并不反映 pandas.DataFrame.index

[26]:
pd_flights.index
[26]:
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
       ...
       '12990', '12991', '12992', '12993', '12994', '12995', '12996', '12997', '12998', '12999'],
      dtype='object', length=12955)
[27]:
# NBVAL_IGNORE_OUTPUT
ed_flights.index
[27]:
<eland.index.Index at 0x79e8c1ef9d50>
[28]:
ed_flights.index.es_index_field
[28]:
'_id'

DataFrame.values#

注意,eland.DataFrame.values 不受支持。

[29]:
pd_flights.values
[29]:
array([[841.2656419677076, False, 'Kibana Airlines', ..., 'Sunny', 0,
        Timestamp('2018-01-01 00:00:00')],
       [882.9826615595518, False, 'Logstash Airways', ..., 'Clear', 0,
        Timestamp('2018-01-01 18:27:00')],
       [190.6369038508356, False, 'Logstash Airways', ..., 'Rain', 0,
        Timestamp('2018-01-01 17:11:14')],
       ...,
       [867.231028821183, False, 'ES-Air', ..., 'Clear', 6,
        Timestamp('2018-02-11 20:51:31')],
       [594.0540183976791, False, 'Kibana Airlines', ..., 'Clear', 6,
        Timestamp('2018-02-11 12:27:52')],
       [781.0659105148637, False, 'Kibana Airlines', ..., 'Clear', 6,
        Timestamp('2018-02-11 03:54:56')]], dtype=object)
[30]:
try:
    ed_flights.values
except AttributeError as e:
    print(e)
This method would scan/scroll the entire Elasticsearch index(s) into memory. If this is explicitly required, and there is sufficient memory, call `ed.eland_to_pandas(ed_df).values`

索引、迭代#

DataFrame.head#

[31]:
pd_flights.head()
Warning: Total number of columns (27) exceeds max_columns (20). Falling back to pandas display.
[31]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
0 841.265642 False ... 0 2018-01-01 00:00:00
1 882.982662 False ... 0 2018-01-01 18:27:00
2 190.636904 False ... 0 2018-01-01 17:11:14
3 181.694216 True ... 0 2018-01-01 10:33:28
4 730.041778 False ... 0 2018-01-01 05:13:00

5 行 × 27 列

[32]:
ed_flights.head()
[32]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
0 841.265642 False ... 0 2018-01-01 00:00:00
1 882.982662 False ... 0 2018-01-01 18:27:00
2 190.636904 False ... 0 2018-01-01 17:11:14
3 181.694216 True ... 0 2018-01-01 10:33:28
4 730.041778 False ... 0 2018-01-01 05:13:00

5 行 × 27 列

DataFrame.tail#

[33]:
pd_flights.tail()
Warning: Total number of columns (27) exceeds max_columns (20). Falling back to pandas display.
[33]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
12995 210.208911 False ... 6 2018-02-11 02:26:04
12996 830.500355 False ... 6 2018-02-11 03:52:31
12997 867.231029 False ... 6 2018-02-11 20:51:31
12998 594.054018 False ... 6 2018-02-11 12:27:52
12999 781.065911 False ... 6 2018-02-11 03:54:56

5 行 × 27 列

[34]:
ed_flights.tail()
[34]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
13054 1080.446279 False ... 6 2018-02-11 20:42:25
13055 646.612941 False ... 6 2018-02-11 01:41:57
13056 997.751876 False ... 6 2018-02-11 04:09:27
13057 1102.814465 False ... 6 2018-02-11 08:28:21
13058 858.144337 False ... 6 2018-02-11 14:54:34

5 行 × 27 列

DataFrame.keys#

[35]:
pd_flights.keys()
[35]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
       'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
       'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
       'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
       'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
       'timestamp'],
      dtype='object')
[36]:
ed_flights.keys()
[36]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
       'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
       'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
       'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
       'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
       'timestamp'],
      dtype='object')

DataFrame.get#

[37]:
pd_flights.get('Carrier')
[37]:
0         Kibana Airlines
1        Logstash Airways
2        Logstash Airways
3         Kibana Airlines
4         Kibana Airlines
               ...
12995            JetBeats
12996    Logstash Airways
12997              ES-Air
12998     Kibana Airlines
12999     Kibana Airlines
Name: Carrier, Length: 12955, dtype: object
[38]:
ed_flights.get('Carrier')
[38]:
0         Kibana Airlines
1        Logstash Airways
2        Logstash Airways
3         Kibana Airlines
4         Kibana Airlines
               ...
13054    Logstash Airways
13055    Logstash Airways
13056    Logstash Airways
13057            JetBeats
13058            JetBeats
Name: Carrier, Length: 13014, dtype: object
[39]:
pd_flights.get(['Carrier', 'Origin'])
[39]:
Carrier Origin
0 Kibana Airlines Frankfurt am Main Airport
1 Logstash Airways Cape Town International Airport
2 Logstash Airways Venice Marco Polo Airport
3 Kibana Airlines Naples International Airport
4 Kibana Airlines Licenciado Benito Juarez International Airport
... ... ...
12995 JetBeats Ottawa Macdonald-Cartier International Airport
12996 Logstash Airways Denver International Airport
12997 ES-Air Charles de Gaulle International Airport
12998 Kibana Airlines Rajiv Gandhi International Airport
12999 Kibana Airlines Abu Dhabi International Airport

12955 行 × 2 列

列表输入当前不受 eland.DataFrame.get 支持

[40]:
try:
    ed_flights.get(['Carrier', 'Origin'])
except TypeError as e:
    print(e)
unhashable type: 'list'

DataFrame.query#

[41]:
pd_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled == True')
Warning: Total number of columns (27) exceeds max_columns (20). Falling back to pandas display.
[41]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
8 960.869736 True ... 0 2018-01-01 12:09:35
26 975.812632 True ... 0 2018-01-01 15:38:32
311 946.358410 True ... 0 2018-01-01 11:51:12
651 975.383864 True ... 2 2018-01-03 21:13:17
950 907.836523 True ... 2 2018-01-03 05:14:51
... ... ... ... ... ...
12798 1083.517299 True ... 5 2018-02-10 19:20:10
12820 909.973606 True ... 5 2018-02-10 05:11:35
12906 983.429244 True ... 6 2018-02-11 06:19:58
12918 1136.678150 True ... 6 2018-02-11 16:03:10
12919 1105.211803 True ... 6 2018-02-11 05:36:05

67 行 × 27 列

eland.DataFrame.query 需要布尔值限定符,例如

ed_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled') 失败

[42]:
try:
  ed_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled == True')
except Exception as e:
  print(f"query() failed as expected: {repr(e)}.")
query() failed as expected: TypeError("Cannot interpret '{'terms': {'Carrier': ['Kibana Airlines']}}' as a data type").

布尔索引查询#

[43]:
pd_flights[(pd_flights.Carrier=="Kibana Airlines") &
           (pd_flights.AvgTicketPrice > 900.0) &
           (pd_flights.Cancelled == True)]
Warning: Total number of columns (27) exceeds max_columns (20). Falling back to pandas display.
[43]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
8 960.869736 True ... 0 2018-01-01 12:09:35
26 975.812632 True ... 0 2018-01-01 15:38:32
311 946.358410 True ... 0 2018-01-01 11:51:12
651 975.383864 True ... 2 2018-01-03 21:13:17
950 907.836523 True ... 2 2018-01-03 05:14:51
... ... ... ... ... ...
12798 1083.517299 True ... 5 2018-02-10 19:20:10
12820 909.973606 True ... 5 2018-02-10 05:11:35
12906 983.429244 True ... 6 2018-02-11 06:19:58
12918 1136.678150 True ... 6 2018-02-11 16:03:10
12919 1105.211803 True ... 6 2018-02-11 05:36:05

67 行 × 27 列

[44]:
ed_flights[(ed_flights.Carrier=="Kibana Airlines") &
           (ed_flights.AvgTicketPrice > 900.0) &
           (ed_flights.Cancelled == True)]
[44]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
8 960.869736 True ... 0 2018-01-01 12:09:35
26 975.812632 True ... 0 2018-01-01 15:38:32
311 946.358410 True ... 0 2018-01-01 11:51:12
651 975.383864 True ... 2 2018-01-03 21:13:17
950 907.836523 True ... 2 2018-01-03 05:14:51
... ... ... ... ... ...
12820 909.973606 True ... 5 2018-02-10 05:11:35
12906 983.429244 True ... 6 2018-02-11 06:19:58
12918 1136.678150 True ... 6 2018-02-11 16:03:10
12919 1105.211803 True ... 6 2018-02-11 05:36:05
13013 1055.350213 True ... 6 2018-02-11 13:20:16

68 行 × 27 列

函数应用、GroupBy 和窗口#

DataFrame.aggs#

[45]:
pd_flights[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
[45]:
DistanceKilometers AvgTicketPrice
sum 9.185214e+07 8.130904e+06
min 0.000000e+00 1.000205e+02
std 4.575501e+03 2.663747e+02

eland.DataFrame.aggregate 当前仅支持数字列

[46]:
ed_flights[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
[46]:
DistanceKilometers AvgTicketPrice
sum 9.229730e+07 8.177881e+06
min 0.000000e+00 1.000205e+02
std 4.578132e+03 2.664956e+02

计算/描述性统计#

DataFrame.count#

[47]:
pd_flights.count()
[47]:
AvgTicketPrice    12955
Cancelled         12955
Carrier           12955
Dest              12955
DestAirportID     12955
                  ...
OriginLocation    12955
OriginRegion      12955
OriginWeather     12955
dayOfWeek         12955
timestamp         12955
Length: 27, dtype: int64
[48]:
ed_flights.count()
[48]:
AvgTicketPrice    13014
Cancelled         13014
Carrier           13014
Dest              13014
DestAirportID     13014
                  ...
OriginLocation    13014
OriginRegion      13014
OriginWeather     13014
dayOfWeek         13014
timestamp         13014
Length: 27, dtype: int64

DataFrame.describe#

[49]:
pd_flights.describe()
[49]:
AvgTicketPrice DistanceKilometers ... FlightTimeMin dayOfWeek
count 12955.000000 12955.000000 ... 12955.000000 12955.000000
mean 627.626743 7090.092034 ... 510.931002 2.820455
std 266.374679 4575.500916 ... 334.378895 1.932079
min 100.020528 0.000000 ... 0.000000 0.000000
25% 409.522973 2458.696332 ... 252.658936 1.000000
50% 640.169085 7610.330866 ... 503.142831 3.000000
75% 841.437465 9733.565957 ... 720.183608 4.000000
max 1199.729053 19881.482315 ... 1902.902032 6.000000

8 行 × 7 列

eland.DataFrame.describe 返回的值可能会有所不同,因为 Elasticsearch 聚合的结果不同。

[50]:
# NBVAL_IGNORE_OUTPUT
ed_flights.describe()
[50]:
AvgTicketPrice Cancelled ... FlightTimeMin dayOfWeek
count 13014.000000 13014.000000 ... 13014.000000 13014.000000
mean 628.391013 0.128323 ... 511.106340 2.834870
std 266.495605 0.334475 ... 334.674968 1.939568
min 100.020531 0.000000 ... 0.000000 0.000000
25% 411.187230 0.000000 ... 250.141903 1.061057
50% 639.776657 0.000000 ... 502.902212 2.932961
75% 842.653771 0.000000 ... 721.186244 4.445456
max 1199.729004 1.000000 ... 1902.901978 6.000000

8 行 × 9 列

DataFrame.info#

[51]:
pd_flights.info()
<class 'pandas.core.frame.DataFrame'>
Index: 12955 entries, 0 to 12999
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   AvgTicketPrice      12955 non-null  float64
 1   Cancelled           12955 non-null  bool
 2   Carrier             12955 non-null  object
 3   Dest                12955 non-null  object
 4   DestAirportID       12955 non-null  object
 5   DestCityName        12955 non-null  object
 6   DestCountry         12955 non-null  object
 7   DestLocation        12955 non-null  object
 8   DestRegion          12955 non-null  object
 9   DestWeather         12955 non-null  object
 10  DistanceKilometers  12955 non-null  float64
 11  DistanceMiles       12955 non-null  float64
 12  FlightDelay         12955 non-null  bool
 13  FlightDelayMin      12955 non-null  int64
 14  FlightDelayType     12955 non-null  object
 15  FlightNum           12955 non-null  object
 16  FlightTimeHour      12955 non-null  float64
 17  FlightTimeMin       12955 non-null  float64
 18  Origin              12955 non-null  object
 19  OriginAirportID     12955 non-null  object
 20  OriginCityName      12955 non-null  object
 21  OriginCountry       12955 non-null  object
 22  OriginLocation      12955 non-null  object
 23  OriginRegion        12955 non-null  object
 24  OriginWeather       12955 non-null  object
 25  dayOfWeek           12955 non-null  int64
 26  timestamp           12955 non-null  datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17)
memory usage: 3.1+ MB
[52]:
ed_flights.info()
<class 'eland.dataframe.DataFrame'>
Index: 13014 entries, 0 to 13058
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   AvgTicketPrice      13014 non-null  float64
 1   Cancelled           13014 non-null  bool
 2   Carrier             13014 non-null  object
 3   Dest                13014 non-null  object
 4   DestAirportID       13014 non-null  object
 5   DestCityName        13014 non-null  object
 6   DestCountry         13014 non-null  object
 7   DestLocation        13014 non-null  object
 8   DestRegion          13014 non-null  object
 9   DestWeather         13014 non-null  object
 10  DistanceKilometers  13014 non-null  float64
 11  DistanceMiles       13014 non-null  float64
 12  FlightDelay         13014 non-null  bool
 13  FlightDelayMin      13014 non-null  int64
 14  FlightDelayType     13014 non-null  object
 15  FlightNum           13014 non-null  object
 16  FlightTimeHour      13014 non-null  float64
 17  FlightTimeMin       13014 non-null  float64
 18  Origin              13014 non-null  object
 19  OriginAirportID     13014 non-null  object
 20  OriginCityName      13014 non-null  object
 21  OriginCountry       13014 non-null  object
 22  OriginLocation      13014 non-null  object
 23  OriginRegion        13014 non-null  object
 24  OriginWeather       13014 non-null  object
 25  dayOfWeek           13014 non-null  int64
 26  timestamp           13014 non-null  datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17)
memory usage: 64.000 bytes
Elasticsearch storage usage: 11.097 MB

DataFrame.max, DataFrame.min, DataFrame.mean, DataFrame.sum#

max#

[53]:
pd_flights.max(numeric_only=True)
[53]:
AvgTicketPrice         1199.729053
Cancelled                     True
DistanceKilometers    19881.482315
DistanceMiles         12353.780369
FlightDelay                   True
FlightDelayMin                 360
FlightTimeHour           31.715034
FlightTimeMin          1902.902032
dayOfWeek                        6
dtype: object

eland.DataFrame.max,min,mean,sum 仅聚合数字列

[54]:
ed_flights.max(numeric_only=True)
[54]:
AvgTicketPrice         1199.729004
Cancelled                 1.000000
DistanceKilometers    19881.482422
DistanceMiles         12353.780273
FlightDelay               1.000000
FlightDelayMin          360.000000
FlightTimeHour           31.715034
FlightTimeMin          1902.901978
dayOfWeek                 6.000000
dtype: float64

min#

[55]:
pd_flights.min(numeric_only=True)
[55]:
AvgTicketPrice        100.020528
Cancelled                  False
DistanceKilometers           0.0
DistanceMiles                0.0
FlightDelay                False
FlightDelayMin                 0
FlightTimeHour               0.0
FlightTimeMin                0.0
dayOfWeek                      0
dtype: object
[56]:
ed_flights.min(numeric_only=True)
[56]:
AvgTicketPrice        100.020531
Cancelled               0.000000
DistanceKilometers      0.000000
DistanceMiles           0.000000
FlightDelay             0.000000
FlightDelayMin          0.000000
FlightTimeHour          0.000000
FlightTimeMin           0.000000
dayOfWeek               0.000000
dtype: float64

mean#

[57]:
pd_flights.mean(numeric_only=True)
[57]:
AvgTicketPrice         627.626743
Cancelled                0.128059
DistanceKilometers    7090.092034
DistanceMiles         4405.578940
FlightDelay              0.251023
FlightDelayMin          47.247395
FlightTimeHour           8.515517
FlightTimeMin          510.931002
dayOfWeek                2.820455
dtype: float64
[58]:
ed_flights.mean(numeric_only=True)
[58]:
AvgTicketPrice         628.391013
Cancelled                0.128323
DistanceKilometers    7092.154679
DistanceMiles         4406.860604
FlightDelay              0.251268
FlightDelayMin          47.349009
FlightTimeHour           8.518439
FlightTimeMin          511.106340
dayOfWeek                2.834870
dtype: float64

sum#

[59]:
pd_flights.sum(numeric_only=True)
[59]:
AvgTicketPrice        8.130904e+06
Cancelled             1.659000e+03
DistanceKilometers    9.185214e+07
DistanceMiles         5.707428e+07
FlightDelay           3.252000e+03
FlightDelayMin        6.120900e+05
FlightTimeHour        1.103185e+05
FlightTimeMin         6.619111e+06
dayOfWeek             3.653900e+04
dtype: float64
[60]:
ed_flights.sum(numeric_only=True)
[60]:
AvgTicketPrice        8.177881e+06
Cancelled             1.670000e+03
DistanceKilometers    9.229730e+07
DistanceMiles         5.735088e+07
FlightDelay           3.270000e+03
FlightDelayMin        6.162000e+05
FlightTimeHour        1.108590e+05
FlightTimeMin         6.651538e+06
dayOfWeek             3.689300e+04
dtype: float64

DataFrame.nunique#

[61]:
pd_flights[['Carrier', 'Origin', 'Dest']].nunique()
[61]:
Carrier      4
Origin     156
Dest       156
dtype: int64
[62]:
ed_flights[['Carrier', 'Origin', 'Dest']].nunique()
[62]:
Carrier      4
Origin     156
Dest       156
dtype: int64

DataFrame.drop#

[63]:
pd_flights.drop(columns=['AvgTicketPrice',
                         'Cancelled',
                         'DestLocation',
                         'Dest',
                         'DestAirportID',
                         'DestCityName',
                         'DestCountry'])
[63]:
Carrier DestRegion ... dayOfWeek timestamp
0 Kibana Airlines SE-BD ... 0 2018-01-01 00:00:00
1 Logstash Airways IT-34 ... 0 2018-01-01 18:27:00
2 Logstash Airways IT-34 ... 0 2018-01-01 17:11:14
3 Kibana Airlines IT-34 ... 0 2018-01-01 10:33:28
4 Kibana Airlines SE-BD ... 0 2018-01-01 05:13:00
... ... ... ... ... ...
12995 JetBeats CA-ON ... 6 2018-02-11 02:26:04
12996 Logstash Airways IT-42 ... 6 2018-02-11 03:52:31
12997 ES-Air SE-BD ... 6 2018-02-11 20:51:31
12998 Kibana Airlines SE-BD ... 6 2018-02-11 12:27:52
12999 Kibana Airlines CA-ON ... 6 2018-02-11 03:54:56

12955 行 × 20 列

[64]:
ed_flights.drop(columns=['AvgTicketPrice',
                         'Cancelled',
                         'DestLocation',
                         'Dest',
                         'DestAirportID',
                         'DestCityName',
                         'DestCountry'])
[64]:
Carrier DestRegion ... dayOfWeek timestamp
0 Kibana Airlines SE-BD ... 0 2018-01-01 00:00:00
1 Logstash Airways IT-34 ... 0 2018-01-01 18:27:00
2 Logstash Airways IT-34 ... 0 2018-01-01 17:11:14
3 Kibana Airlines IT-34 ... 0 2018-01-01 10:33:28
4 Kibana Airlines SE-BD ... 0 2018-01-01 05:13:00
... ... ... ... ... ...
13054 Logstash Airways SE-BD ... 6 2018-02-11 20:42:25
13055 Logstash Airways CH-ZH ... 6 2018-02-11 01:41:57
13056 Logstash Airways RU-AMU ... 6 2018-02-11 04:09:27
13057 JetBeats SE-BD ... 6 2018-02-11 08:28:21
13058 JetBeats US-DC ... 6 2018-02-11 14:54:34

13014 行 × 20 列

绘图#

[65]:
pd_flights.select_dtypes(include=np.number).hist(figsize=[10,10])
plt.show()
../_images/examples_demo_notebook_100_0.png
[66]:
ed_flights.select_dtypes(include=np.number).hist(figsize=[10,10])
plt.show()
../_images/examples_demo_notebook_101_0.png

Elasticsearch 实用程序#

[67]:
ed_flights2 = ed_flights[(ed_flights.OriginAirportID == 'AMS') & (ed_flights.FlightDelayMin > 60)]
ed_flights2 = ed_flights2[['timestamp', 'OriginAirportID', 'DestAirportID', 'FlightDelayMin']]
ed_flights2 = ed_flights2.tail()
[68]:
print(ed_flights2.es_info())
es_index_pattern: flights
Index:
 es_index_field: _id
 is_source_field: False
Mappings:
 capabilities:
                   es_field_name  is_source es_dtype                  es_date_format        pd_dtype  is_searchable  is_aggregatable  is_scripted aggregatable_es_field_name
timestamp              timestamp       True     date  strict_date_hour_minute_second  datetime64[ns]           True             True        False                  timestamp
OriginAirportID  OriginAirportID       True  keyword                            None          object           True             True        False            OriginAirportID
DestAirportID      DestAirportID       True  keyword                            None          object           True             True        False              DestAirportID
FlightDelayMin    FlightDelayMin       True  integer                            None           int64           True             True        False             FlightDelayMin
Operations:
 tasks: [('boolean_filter': ('boolean_filter': {'bool': {'must': [{'term': {'OriginAirportID': 'AMS'}}, {'range': {'FlightDelayMin': {'gt': 60}}}]}})), ('tail': ('sort_field': '_doc', 'count': 5))]
 size: 5
 sort_params: {'_doc': 'desc'}
 _source: ['timestamp', 'OriginAirportID', 'DestAirportID', 'FlightDelayMin']
 body: {'query': {'bool': {'must': [{'term': {'OriginAirportID': 'AMS'}}, {'range': {'FlightDelayMin': {'gt': 60}}}]}}}
 post_processing: [('sort_index')]