Skip to content
Open
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 8 additions & 1 deletion Readability.js
Original file line number Diff line number Diff line change
Expand Up @@ -1771,7 +1771,7 @@ Readability.prototype = {

// property is a space-separated list of values
var propertyPattern =
/\s*(article|dc|dcterm|og|twitter)\s*:\s*(author|creator|description|published_time|title|site_name)\s*/gi;
/\s*(article|dc|dcterm|og|twitter)\s*:\s*(author|creator|description|image:alt|image|published_time|title|site_name)\s*/gi;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the PR comment you wrote:

Please note that based on the nature of how the regex regarding metadata extraction works, the order of the new og:image and og:image:alt search in the regex is important.

However, I'm a bit confused why og:image:alt is in here at all - we don't seem to use it later on? Is this just to avoid treating og:image:alt as equivalent/overriding og:image if both are present, because the og:image regex would still match the og:image:alt instance, only excluding the "alt" part?

A code comment about why this is here may help future readers. :-)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a bit of explanation.

Looking into some BBC page, you find these lines (without context):

<meta data-rh="true" property="og:image" content="https://ichef.bbci.co.uk/ace/branded_news/1200/cpsprodpb/b98b/live/2554ced0-678d-11f0-8dbd-f3d32ebd3327.jpg"/>
<meta data-rh="true" property="og:image:alt" content="Bandmates Tony Iommi, Ozzy Osbourne and Geezer Butler pose in front of yellow and white Grammy logos while holding a grammy award for Best Metal Performance at the 2014 Grammy awards"/>

If you look at the regex on line 1776 in Readability.js (before my change), you notice that it does not contain a check for neither the start nor the end of the property name.

var propertyPattern =
      /\s*(article|dc|dcterm|og|twitter)\s*:\s*(author|creator|description|image|published_time|title|site_name)\s*/gi;

If you check the lines from above for og:image, you will get two hits. one for every line. Because both lines contain og:image. While the first one would be a "direct" / complete hit, the second hit is not, it only matches a substring. Since there is no check for beginning/end of the property name.

In lines 1784 and following there is a loop over all meta elements and ultimately over all properties that return a hit for the regex. If there is a hit, the content is then placed in the field named after the hit.

The first line works, we get a hit for og:image, place the uri in the content as value for the property og:image.

The second line hits as well. But since we use the part of the regex that hits as a name for the property to place the content of the property into, we replace the value from the step above with (in this case) a note about Ozzy. Correct were to place that note into the property it belongs into: og:image:alt.

Placing the image:alt part right before the image part in the regex, we make sure that the first one hits first and the value is placed in the correct property (og:image:alt) and not into og:image.

True, this is some kind of hack. But if you want to fix this, you will have to completely refactor the whole _getArticleMetadata function. I tried to avoid that since this would be more intrusive and a lot more work to do. For me as a newcomer regarding Readability.js sources, probably not the best place to start with.

About your note re comment. The explanation needs some space. What about just linking this PR instead?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linking to the PR would be fine, thanks!


// name is a single value
var namePattern =
Expand Down Expand Up @@ -1854,6 +1854,12 @@ Readability.prototype = {
// get site name
metadata.siteName = jsonld.siteName || values["og:site_name"];

// get image thumbnail
metadata.image = values["og:image"] || values.image || values["twitter:image"];

// get favicon
// metadata.favicon = this._getArticleFavicon()

// get article published time
metadata.publishedTime =
jsonld.datePublished ||
Expand Down Expand Up @@ -2784,6 +2790,7 @@ Readability.prototype = {
length: textContent.length,
excerpt: metadata.excerpt,
siteName: metadata.siteName || this._articleSiteName,
image: metadata.image,
publishedTime: metadata.publishedTime,
};
},
Expand Down