eland.DataFrame

class eland.DataFrame(es_client: Optional[Union[str, List[str], Tuple[str, ...], Elasticsearch]] = None, es_index_pattern: Optional[str] = None, columns: Optional[List[str]] = None, es_index_field: Optional[str] = None, _query_compiler: Optional[QueryCompiler] = None)

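A two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns) that references data stored in Elasticsearch indices. Where possible, the API mirrors the pandas.DataFrame API. The underlying data is stored in Elasticsearch rather than in core memory.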

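Parameters

es_client: Elasticsearch client argument(s) (e.g. 'http://localhost:9200')

elasticsearch-py parameters, or
an elasticsearch-py instance

es_index_pattern: str

Elasticsearch index pattern. This can contain wildcards. (e.g. 'flights')

columns: list of str, optional

List of DataFrame columns. A subset of the Elasticsearch index's fields.

es_index_field: str, optional

The Elasticsearch index field to use as the DataFrame index. Defaults to _id if None is used.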

See Also

pandas.DataFrame

Examples

Constructing a DataFrame from an Elasticsearch configuration argument and an Elasticsearch index

>>> import eland as ed
>>> df = ed.DataFrame('http://localhost:9200', 'flights')
>>> df.head()
   AvgTicketPrice  Cancelled  ... dayOfWeek           timestamp
0      841.265642      False  ...         0 2018-01-01 00:00:00
1      882.982662      False  ...         0 2018-01-01 18:27:00
2      190.636904      False  ...         0 2018-01-01 17:11:14
3      181.694216       True  ...         0 2018-01-01 10:33:28
4      730.041778      False  ...         0 2018-01-01 05:13:00

[5 rows x 28 columns]

Constructing a DataFrame from an Elasticsearch client and an Elasticsearch index

>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch("http://localhost:9200")
>>> df = ed.DataFrame(es_client=es, es_index_pattern='flights', columns=['AvgTicketPrice', 'Cancelled'])
>>> df.head()
   AvgTicketPrice  Cancelled
0      841.265642      False
1      882.982662      False
2      190.636904      False
3      181.694216       True
4      730.041778      False

[5 rows x 2 columns]

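Constructing a DataFrame from an Elasticsearch client and an Elasticsearch index, with 'timestamp' as the DataFrame index field (TODO: currently, if the index field is not _id, it must also be one of the DataFrame's fields)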

>>> df = ed.DataFrame(
...     es_client='http://localhost:9200',
...     es_index_pattern='flights',
...     columns=['AvgTicketPrice', 'timestamp'],
...     es_index_field='timestamp'
... )
>>> df.head()
                     AvgTicketPrice           timestamp
2018-01-01T00:00:00      841.265642 2018-01-01 00:00:00
2018-01-01T00:02:06      772.100846 2018-01-01 00:02:06
2018-01-01T00:06:27      159.990962 2018-01-01 00:06:27
2018-01-01T00:33:31      800.217104 2018-01-01 00:33:31
2018-01-01T00:36:51      803.015200 2018-01-01 00:36:51

[5 rows x 2 columns]

__init__(es_client: Optional[Union[str, List[str], Tuple[str, ...], Elasticsearch]] = None, es_index_pattern: Optional[str] = None, columns: Optional[List[str]] = None, es_index_field: Optional[str] = None, _query_compiler: Optional[QueryCompiler] = None) → None

There are effectively two constructors:

client, index_pattern, columns, index_field
query_compiler (eland.QueryCompiler)

The constructor that takes 'query_compiler' is for internal use only.

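Methods

`__init__`([es_client, es_index_pattern, ...])	There are effectively two constructors.
`agg`(func[, axis, numeric_only])	Aggregate using one or more operations over the specified axis.
`aggregate`(func[, axis, numeric_only])	Aggregate using one or more operations over the specified axis.
`count`()	Count non-NA cells for each column.
`describe`()	Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.
`drop`([labels, axis, index, columns, level, ...])	Return a new object with the labels in the requested axis removed.
`es_info`()	A debug summary of an eland DataFrame's internals.
`es_match`(text, *[, columns, match_phrase, ...])	Filter data with an Elasticsearch `match`, `match_phrase`, or `multi_match` query, depending on the given parameters and columns.
`es_query`(query)	Apply an Elasticsearch DSL query to the current DataFrame.
`filter`([items, like, regex, axis])	Subset the DataFrame's rows or columns according to the specified index labels.
`get`(key[, default])	Get the item from the object for the given key (e.g. a DataFrame column).
`groupby`([by, dropna])	Used to perform groupby operations.
`head`([n])	Return the first n rows.
`hist`([column, by, grid, xlabelsize, xrot, ...])	Make a histogram of the DataFrame.
`idxmax`([axis])	Return the index of the first occurrence of the maximum over the requested axis.
`idxmin`([axis])	Return the index of the first occurrence of the minimum over the requested axis.
`info`([verbose, buf, max_cols, memory_usage, ...])	Print a concise summary of a DataFrame.
`iterrows`()	Iterate over eland.DataFrame rows as (index, pandas.Series) pairs.
`itertuples`([index, name])	Iterate over eland.DataFrame rows as namedtuples.
`keys`()	Return the columns.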
`mad`([numeric_only])	Return the median absolute deviation for each numeric column.
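`max`([numeric_only])	Return the maximum value for each numeric column.
`mean`([numeric_only])	Return the mean value for each numeric column.
`median`([numeric_only])	Return the median value for each numeric column.
`min`([numeric_only])	Return the minimum value for each numeric column.
`mode`([numeric_only, dropna, es_size])	Calculate the mode of a DataFrame.
`nunique`()	Return the cardinality of each field.
`quantile`([q, numeric_only])	Used to calculate quantiles for a given DataFrame.
`query`(expr)	Query the columns of a DataFrame with a boolean expression.
`sample`([n, frac, random_state])	Return n randomly sampled rows, or the specified fraction of rows.
`select_dtypes`([include, exclude])	Return a subset of the DataFrame's columns based on the column dtypes.
`std`([numeric_only])	Return the standard deviation for each numeric column.
`sum`([numeric_only])	Return the sum for each numeric column.
`tail`([n])	Return the last n rows.
`to_csv`([path_or_buf, sep, na_rep, ...])	Write Elasticsearch data to a comma-separated values (csv) file.
`to_html`([buf, columns, col_space, header, ...])	Render Elasticsearch data as an HTML table.
`to_json`([path_or_buf, orient, date_format, ...])	Write Elasticsearch data to a json file.
`to_numpy`()	Not implemented.
`to_pandas`([show_progress])	Utility method to convert an eland.DataFrame to a pandas.DataFrame.
`to_string`([buf, columns, col_space, header, ...])	Render a DataFrame to console-friendly tabular output.
`var`([numeric_only])	Return the variance for each numeric column.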
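
As an illustration of how `es_query` and `agg` from the table above might be combined, the following is a minimal sketch rather than a reference implementation. It assumes the `flights` index and the `AvgTicketPrice` and `Cancelled` fields shown in the examples on this page; the helper names `build_cancelled_filter` and `summarize_cancelled` are hypothetical and not part of eland. The query builder is plain Python, so it can be inspected without a running cluster, while `summarize_cancelled` needs eland installed and a reachable Elasticsearch instance.

```python
from typing import Any, Dict


def build_cancelled_filter(min_price: float) -> Dict[str, Any]:
    """Build an Elasticsearch bool query for cancelled flights at or above a price.

    Hypothetical helper: the field names come from the 'flights' examples.
    """
    return {
        "bool": {
            "filter": [
                {"term": {"Cancelled": True}},
                {"range": {"AvgTicketPrice": {"gte": min_price}}},
            ]
        }
    }


def summarize_cancelled(es_url: str, index: str, min_price: float):
    """Filter and aggregate on the Elasticsearch side, returning a small result.

    Hypothetical helper: requires eland and a reachable cluster.
    """
    import eland as ed  # imported lazily so the query builder works without eland

    df = ed.DataFrame(es_url, index, columns=["AvgTicketPrice", "Cancelled"])
    # es_query applies the DSL query to the DataFrame; the rows stay in Elasticsearch.
    filtered = df.es_query(build_cancelled_filter(min_price))
    # Only the aggregated values are brought back to the client.
    return filtered.agg(["min", "max", "mean"], numeric_only=True)
```

With a local cluster holding the sample data, a call might look like `summarize_cancelled('http://localhost:9200', 'flights', 500.0)`. Keeping the query construction in its own function makes the DSL easy to check before it is sent anywhere.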

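Attributes

`columns`	The column labels of the DataFrame.
`dtypes`	Return the pandas dtypes in the DataFrame.
`empty`	Determine whether the DataFrame is empty.
`es_dtypes`	Return the Elasticsearch dtypes in the index.
`index`	Return the eland index referencing the Elasticsearch field used to index a DataFrame/Series.
`ndim`	Return 2, by definition of a DataFrame.
`shape`	Return a tuple representing the dimensionality of the DataFrame.
`size`	Return an int representing the number of elements in this object.
`values`	Not implemented.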
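
The attributes above can be gathered into a quick overview of a DataFrame. The sketch below is only illustrative: `frame_summary` is a hypothetical helper, not an eland function, and it relies solely on the `shape`, `columns`, `ndim`, `empty` and `size` attributes listed in the table. So that it runs without a cluster, the demonstration uses a `SimpleNamespace` stand-in in place of a real eland.DataFrame.

```python
from types import SimpleNamespace
from typing import Any, Dict


def frame_summary(df: Any) -> Dict[str, Any]:
    """Collect basic metadata from any object exposing DataFrame-like attributes.

    Hypothetical helper: reads only shape, columns, ndim, empty and size.
    """
    rows, cols = df.shape
    return {
        "rows": rows,
        "column_count": cols,
        "columns": list(df.columns),
        "ndim": df.ndim,
        "empty": df.empty,
        "size": df.size,
    }


# Stand-in object (not a real eland.DataFrame) so the sketch runs offline.
stand_in = SimpleNamespace(
    shape=(5, 2),
    columns=["AvgTicketPrice", "Cancelled"],
    ndim=2,
    empty=False,
    size=10,
)
summary = frame_summary(stand_in)
```

Given a live cluster, the same function could be called as `frame_summary(ed.DataFrame('http://localhost:9200', 'flights'))`, assuming `eland` has been imported as `ed`.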