Analyzer and Processes
The Analyzer runs one or more processes against a dataset, optionally caches per-process results on disk, and returns result helpers for downstream analysis and visualization.
Modules: pulse.analysis.analyzer, pulse.analysis.processes, pulse.analysis.results
Analyzer
from pulse.analysis.analyzer import Analyzer
from pulse.analysis.processes import ThemeGeneration, SentimentProcess

texts = ["I love pizza", "I hate rain"]
processes = [ThemeGeneration(min_themes=2), SentimentProcess()]

# The context manager closes the client and cache on exit.
with Analyzer(dataset=texts, processes=processes, cache_dir=".pulse_cache") as az:
    results = az.run()

# Each process result is exposed as an attribute named after the process id.
print(results.theme_generation.to_dataframe())
print(results.sentiment.summary())
Constructor parameters:
- dataset: Sequence[str] | pandas.Series – Input data.
- processes: Sequence[Process] | None – Ordered list of process instances (see below).
- fast: bool | None – Default fast/async flag used by processes when not specified on the process.
- cache_dir: str | None – If set and use_cache=True, stores per-process results in diskcache.
- use_cache: bool – Enable/disable caching (default True).
- client: CoreClient | None – CoreClient instance to use; defaults to CoreClient(auth=auth).
- auth: pulse.auth._BaseOAuth2Auth | None – Auth to pass to an internally constructed CoreClient.
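The precedence between the Analyzer-level fast default and a per-process fast value can be sketched as follows. This is only an illustration of the rule stated in the parameter list, not the library's code: the helper name resolve_fast is hypothetical, and because the behavior when both values are None is not stated here, the sketch simply returns None in that case.

```python
from typing import Optional


def resolve_fast(process_fast: Optional[bool], analyzer_fast: Optional[bool]) -> Optional[bool]:
    """Hypothetical helper: a value set on the process wins over the Analyzer default."""
    if process_fast is not None:
        return process_fast
    # Fall back to the Analyzer-level default (which may itself be None).
    return analyzer_fast


print(resolve_fast(None, True))   # process did not specify, Analyzer default applies
print(resolve_fast(False, True))  # process-level value overrides the default
```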
Methods:
- run() -> AnalysisResult – Executes the processes in order and returns a container that exposes each process result as an attribute named after the process id or alias.
- clear_cache() – Clears the on-disk cache (if enabled).
- close() – Closes the underlying client and cache.
Dependency resolution:
- Some processes depend on others (e.g., ThemeAllocation depends on ThemeGeneration). The Analyzer automatically inserts missing dependencies with sensible defaults.
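One way such dependency insertion could work is sketched below. It is a toy model, not the Analyzer's actual implementation: ToyProcess, DEFAULT_FACTORIES, and resolve_dependencies are hypothetical names, the id "theme_allocation" is assumed for illustration, and pulling a user-supplied dependency ahead of its dependent is a choice made by this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class ToyProcess:
    """Stand-in for a process: only the fields needed for ordering."""
    id: str
    depends_on: Tuple[str, ...] = ()


# Hypothetical registry: process id -> factory that builds it with default settings.
DEFAULT_FACTORIES: Dict[str, Callable[[], ToyProcess]] = {
    "theme_generation": lambda: ToyProcess("theme_generation"),
    "theme_allocation": lambda: ToyProcess("theme_allocation", ("theme_generation",)),
}


def resolve_dependencies(processes: List[ToyProcess]) -> List[ToyProcess]:
    """Return a list in which every dependency precedes the process that needs it."""
    resolved: List[ToyProcess] = []
    done: Set[str] = set()
    visiting: Set[str] = set()

    def add(proc: ToyProcess) -> None:
        if proc.id in done:
            return
        if proc.id in visiting:
            raise ValueError(f"circular dependency involving {proc.id!r}")
        visiting.add(proc.id)
        for dep_id in proc.depends_on:
            # Prefer an instance the caller supplied; otherwise build a default one.
            supplied = next((p for p in processes if p.id == dep_id), None)
            add(supplied if supplied is not None else DEFAULT_FACTORIES[dep_id]())
        visiting.discard(proc.id)
        done.add(proc.id)
        resolved.append(proc)

    for proc in processes:
        add(proc)
    return resolved


ordered = resolve_dependencies([ToyProcess("theme_allocation", ("theme_generation",))])
print([p.id for p in ordered])  # ['theme_generation', 'theme_allocation']
```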
Processes (module: pulse.analysis.processes)
All processes implement a common Process protocol: id (str), depends_on (tuple of ids), and run(ctx).
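The shape of that protocol can be illustrated with a structural type and a toy process. This is a sketch under assumptions: ProcessLike and WordCount are hypothetical names, and the ctx object is modeled as a plain dict with a "dataset" key, which may not match what the library actually passes to run(ctx).

```python
from typing import Any, Dict, List, Protocol, Tuple, runtime_checkable


@runtime_checkable
class ProcessLike(Protocol):
    """Structural type mirroring the three members named above."""
    id: str
    depends_on: Tuple[str, ...]

    def run(self, ctx: Any) -> Any: ...


class WordCount:
    """Hypothetical custom process: counts whitespace-separated words per text."""
    id = "word_count"
    depends_on: Tuple[str, ...] = ()

    def run(self, ctx: Dict[str, Any]) -> List[int]:
        # Assumption: ctx carries the input texts under a "dataset" key.
        return [len(text.split()) for text in ctx["dataset"]]


proc = WordCount()
print(isinstance(proc, ProcessLike))  # True: all three members are present
print(proc.run({"dataset": ["I love pizza", "I hate rain"]}))  # [3, 3]
```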
ThemeGeneration
Clusters texts into latent themes.
Parameters:
- min_themes: int = 2, max_themes: int = 50 – Bounds on the number of themes.
- context: Any = None – Optional guiding context.
- version: str | None = None – Model version pin.
- prune: int | None = None – Drop the N lowest-frequency themes.
- fast: bool | None = None – Overrides the Analyzer's default fast flag.
- await_job_result: bool = True – When False, returns a Job instead of the result.
Result wrapper: ThemeGenerationResult (see below).
SentimentProcess
Classifies sentiment for each text.
Parameters:
- fast: bool | None = None
Result wrapper: SentimentResult.
ThemeAllocation
Allocates each text to themes by computing the similarity between texts and theme labels, then choosing the most similar themes.
Parameters:
- themes: list[str] | None = None – Use explicit themes; otherwise use themes from the most recent ThemeGeneration or a specified source.
- single_label: bool = True – If True, assign only the top theme above the threshold.
- fast: bool | None = None
- threshold: float = 0.5 – Minimum similarity for assignment.
Inside the Analyzer, the process returns a raw dict with the keys "themes", "assignments", and "similarity", which the Analyzer wraps as ThemeAllocationResult.
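The allocation step described above can be sketched on a small similarity matrix. The functions below are illustrative stand-ins, not the library's code: the similarity values and theme labels are made up, and treating a score equal to the threshold as a match (>=) is an assumption.

```python
from typing import List, Optional, Sequence


def assign_single(
    similarity: Sequence[Sequence[float]],
    themes: Sequence[str],
    threshold: float = 0.5,
) -> List[Optional[str]]:
    """Pick the most similar theme per text, or None if it falls below the threshold."""
    assigned: List[Optional[str]] = []
    for row in similarity:
        best = max(range(len(themes)), key=lambda j: row[j])
        assigned.append(themes[best] if row[best] >= threshold else None)
    return assigned


def assign_top_k(
    similarity: Sequence[Sequence[float]],
    themes: Sequence[str],
    k: int = 2,
) -> List[List[str]]:
    """Return the k most similar theme labels per text, best first."""
    return [
        [themes[j] for j in sorted(range(len(themes)), key=lambda j: -row[j])[:k]]
        for row in similarity
    ]


# Made-up data: one row per text, one column per theme.
themes = ["food", "weather"]
similarity = [[0.82, 0.10], [0.05, 0.74], [0.30, 0.35]]

print(assign_single(similarity, themes))    # ['food', 'weather', None]
print(assign_top_k(similarity, themes, 2))  # [['food', 'weather'], ['weather', 'food'], ['weather', 'food']]
```

The third text shows why the threshold matters: its best score (0.35) is below 0.5, so single-label assignment leaves it unassigned while top-k ranking still returns labels.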
ThemeExtraction
Extracts elements matching themes from inputs.
Parameters:
- themes: list[str] | None = None
- version: str | None = None
- fast: bool | None = None
- dictionary: dict[str, list[str]] | None = None
- expand_dictionary: bool | None = None
- use_ner: bool | None = None
- use_llm: bool | None = None
- threshold: float | None = None
Result wrapper: ThemeExtractionResult.
Cluster
Computes a similarity matrix suitable for downstream clustering (performed client-side).
Parameters:
- fast: bool | None = None
Result wrapper: ClusterResult.
Result Helpers (module: pulse.analysis.results)
ThemeGenerationResult
- themes: list[Theme] – Structured themes.
- to_dataframe() -> pandas.DataFrame – Columns: shortLabel, label, description, representative_1, representative_2.
SentimentResult
- sentiments: list[SentimentResultModel] – Sentiment label + confidence per input.
- to_dataframe() -> pandas.DataFrame – Columns: text, sentiment, confidence.
- summary() -> pandas.Series – Counts per sentiment label.
- plot_distribution(**kwargs) – Matplotlib bar chart of the distribution.
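The counts reported by summary() amount to tallying the sentiment column. A minimal standard-library sketch, assuming rows shaped like the documented columns; the label strings and confidence values below are made up, and the real method returns a pandas.Series rather than a Counter.

```python
from collections import Counter

# Made-up rows in the shape of the documented columns: text, sentiment, confidence.
rows = [
    {"text": "I love pizza", "sentiment": "positive", "confidence": 0.97},
    {"text": "I hate rain", "sentiment": "negative", "confidence": 0.91},
    {"text": "Pizza again, great", "sentiment": "positive", "confidence": 0.88},
]

# Tally how many inputs received each sentiment label.
counts = Counter(row["sentiment"] for row in rows)
print(counts.most_common())  # [('positive', 2), ('negative', 1)]
```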
ThemeAllocationResult
- assign_single(threshold: float | None = None) -> pandas.Series – Single best theme per text above threshold (or None).
- assign_multi(k: int | None = None) -> pandas.DataFrame – Top-k theme labels per text, ordered by similarity.
- bar_chart(**kwargs) – Horizontal bar chart of assignment counts.
- heatmap(**kwargs) – Heatmap of the similarity matrix (uses seaborn if available, falls back to matplotlib).
- to_dataframe() -> pandas.DataFrame – Long format: text, theme, score.
ClusterResult
- matrix – NumPy array view of the similarity matrix.
- kmeans(n_clusters, **kwargs) – Runs scikit-learn KMeans on the matrix.
- dbscan(eps=0.4, min_samples=5, **kwargs) – Runs scikit-learn DBSCAN on a distance transform.
- plot_scatter(**kwargs) – 2D PCA scatter plot.
- dendrogram(**kwargs) – Hierarchical clustering dendrogram (SciPy).
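Density-based clustering needs distances rather than similarities. The sketch below shows one possible transform and the eps-neighborhood test that DBSCAN uses to find core points, in plain Python. The exact transform used by dbscan() is not stated here; 1 - similarity is an assumption, and the matrix values are made up.

```python
from typing import List


def to_distance(similarity: List[List[float]]) -> List[List[float]]:
    """Map similarity to distance as 1 - s, clamped so it is never negative."""
    return [[max(0.0, 1.0 - s) for s in row] for row in similarity]


def core_points(distance: List[List[float]], eps: float = 0.4, min_samples: int = 2) -> List[int]:
    """Indices whose eps-neighborhood (the point itself included) has at least min_samples members."""
    return [
        i
        for i, row in enumerate(distance)
        if sum(1 for d in row if d <= eps) >= min_samples
    ]


# Made-up similarity matrix: texts 0 and 1 are close, text 2 is far from both.
similarity = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
distance = to_distance(similarity)
print(core_points(distance, eps=0.4, min_samples=2))  # [0, 1]
```

When calling scikit-learn directly, a matrix like distance can be passed to DBSCAN with metric="precomputed".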
ThemeExtractionResult
- extractions: list[list[list[str]]] – Nested extractions per text per theme.
- to_dataframe() -> pandas.DataFrame – Long format: text, category, extraction.
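The step from the nested extractions structure to the long format can be sketched without pandas. The function name to_long_rows is hypothetical, the sample data is made up, and the sketch assumes the inner lists are aligned with the order of the themes.

```python
from typing import List, Sequence, Tuple


def to_long_rows(
    texts: Sequence[str],
    categories: Sequence[str],
    extractions: Sequence[Sequence[Sequence[str]]],
) -> List[Tuple[str, str, str]]:
    """Flatten extractions[text][theme] -> one (text, category, extraction) row per item."""
    rows: List[Tuple[str, str, str]] = []
    for text, per_theme in zip(texts, extractions):
        for category, items in zip(categories, per_theme):
            for item in items:
                rows.append((text, category, item))
    return rows


texts = ["I love pizza", "I hate rain"]
categories = ["food", "weather"]
# One list per text, containing one list of extracted strings per theme.
extractions = [[["pizza"], []], [[], ["rain"]]]

for row in to_long_rows(texts, categories, extractions):
    print(row)
```

Themes with no extraction for a given text produce no row in this sketch.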
Optional Dependencies for Plots/ML
Some helper methods use third-party libraries:
- matplotlib – plotting utilities across results
- seaborn – heatmap in ThemeAllocationResult.heatmap() (falls back if missing)
- numpy – matrix handling in ClusterResult
- scikit-learn – kmeans, dbscan, and plot_scatter (PCA)
- scipy – dendrogram()
Install these as needed, for example:
pip install matplotlib seaborn numpy scikit-learn scipy
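To find out which of these optional packages are missing before calling a helper, a small check like the one below can be used. It is a convenience sketch rather than part of the SDK; OPTIONAL_MODULES and missing_optional_packages are hypothetical names.

```python
from importlib.util import find_spec
from typing import List

# pip name -> import name (scikit-learn is imported as sklearn).
OPTIONAL_MODULES = {
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "scipy": "scipy",
}


def missing_optional_packages() -> List[str]:
    """Return the pip names of optional packages that cannot be imported."""
    return [
        pip_name
        for pip_name, module in OPTIONAL_MODULES.items()
        if find_spec(module) is None
    ]


missing = missing_optional_packages()
if missing:
    print("pip install " + " ".join(missing))
else:
    print("All optional packages are installed.")
```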